Efficient Parallel Data Mining for Association Rules. Jong Soo Park, Ming-Syan Chen and Philip S. Yu. IBM Thomas J. Watson Research Center
Yorktown Heights, New York

Abstract

In this paper, we develop an algorithm, called PDM, to conduct parallel data mining for association rules. Consider a transaction as a collection of items; a large itemset is a set of items such that the number of transactions containing it exceeds a pre-specified threshold. PDM is designed so that the global set of large itemsets can be identified efficiently and the amount of inter-node data exchange required is minimized. Specifically, with a given database partition, each processing node collects count information on each itemset from its local database efficiently via a hashing method. The information discovered by each node is next shared with other nodes via some communication schemes. Then, PDM employs a technique, called clue-and-poll, to address the uncertainty due to the partial knowledge collected at each node by judiciously selecting a small fraction of the itemsets for the exchange of count information among nodes, thus reducing the communication cost. The global set of large itemsets can hence be determined based on the aggregate count of itemsets. It is experimentally shown that PDM not only attains very good parallelization efficiencies, but also provides robust performance for various input patterns.

1 Introduction

Recently, the importance of database mining has been growing at an extremely fast pace due to the increasing use of computing for various applications. One of the most important data mining problems is mining association rules. Given a database of sales transactions, it would be interesting to discover all associations among items such that the presence of some items in a transaction will imply the presence of other items in the same transaction. In [3], it is shown that mining association rules can be decomposed into two subproblems.
First, we need to identify all sets of items (itemsets) that are contained in a sufficient number of transactions above the minimum (support) requirement. These itemsets are referred to as large itemsets. Once all large itemsets are obtained, the desired association rules can be generated in a straightforward manner. Subsequent work in the literature [5, 10] followed this approach and focused on large itemset generation. Various other data mining capabilities were also explored in the literature, including classification [2, 7, 8], similar sequences [1], and sequential patterns [4].

(Author footnote: Visiting from the Department of Computer Science, Sungshin Women's University, and partially supported by KOSEF, Korea.)

Notice that most of the prior work on data mining focused on exploring various mining capabilities and on improving their execution in a sequential processing environment. However, few results have been reported thus far on exploiting parallel execution of data mining. It is noted that data mining in general requires progressive knowledge collection and revision based on a huge transaction database. Such a knowledge learning process calls for numerous iterations of data comparison and analysis. Due to this very nature of data mining, achieving efficient parallel data mining is a very challenging issue: with the transaction database being partitioned across all nodes, the amount of inter-node data transmission required for reaching global decisions can be prohibitively large, thus significantly compromising the benefit achievable from parallelization. As a result, although a significant amount of research results have been reported on data mining, the study of parallel data mining is still in its infancy. Consequently, we develop in this paper an algorithm for parallel data mining, called PDM, which is a generalized version of algorithm DHP reported in [10].
Consider a transaction as a collection of items, and a large itemset as a set of items such that the number of transactions containing it exceeds a pre-specified threshold. In essence, PDM is designed so that the identification of the global set of large itemsets can be parallelized efficiently and the amount of inter-node data exchange required is minimized. With the entire transaction database being partitioned across all nodes, each node employs a hashing method, similar to that in DHP, to identify candidate $k$-itemsets (i.e., itemsets consisting of $k$ items) from its local database efficiently. The information discovered by each node is next shared with other nodes via some communication schemes. It is noted that in a shared-nothing parallel environment, each node can only directly collect information from its local database partition. The process of parallelizing the knowledge discovery across all nodes and dealing with the partial knowledge collected could itself be complicated and communication intensive. To address this issue, we devise a technique, called clue-and-poll, to resolve the uncertainty due to the partial knowledge collected at each node by judiciously selecting a small fraction of the itemsets for the information exchange among nodes, thus minimizing the communication cost incurred. The global set of large $k$-itemsets is then determined via proper communication schemes in accordance with the aggregate count of itemsets and the pre-determined minimal transaction support.

To provide more insight into the performance of PDM, we evaluate another parallel algorithm, denoted by PF, that does not utilize the hashing technique, and comparatively analyze the performance of PDM and PF via simulations. It is observed that, due to the use of hashing techniques in knowledge collection, work parallelization and distribution across all nodes can be achieved very efficiently within PDM, which is the key advantage of PDM over PF. It is experimentally shown that PDM not only attains very good parallelization efficiencies, but also performs very robustly for various input patterns.

This paper is organized as follows. Preliminaries are given in Section 2. The proposed algorithm PDM for parallel data mining is described in Section 3. Performance results are presented in Section 4. Section 5 contains the summary.

2 Preliminaries

Note that PDM is a parallel version of algorithm DHP devised in [10], and the latter is a revised version of Apriori reported in [5]. Due to the page limitation, we refer readers to [5] and [10] for the details of Apriori and DHP.

2.1 Mining Association Rules

Let $I = \{i_1, i_2, \ldots, i_m\}$ be a set of literals, called items. Let $D$ be a set of transactions, where each transaction $T$ is a set of items such that $T \subseteq I$. Note that the quantities of items bought in a transaction are not considered, meaning that each item is a binary variable representing whether the item was bought. Each transaction is associated with an identifier, called TID. Let $X$ be a set of items. A transaction $T$ is said to contain $X$ if and only if $X \subseteq T$.
An association rule is an implication of the form $X \Rightarrow Y$, where $X \subset I$, $Y \subset I$ and $X \cap Y = \emptyset$. The rule $X \Rightarrow Y$ holds in the transaction set $D$ with confidence $c$ if $c\%$ of transactions in $D$ that contain $X$ also contain $Y$. The rule $X \Rightarrow Y$ has support $s$ in the transaction set $D$ if $s\%$ of transactions in $D$ contain $X \cup Y$. As mentioned before, the problem of mining association rules is composed of the following two steps:

1. Discover the large itemsets, i.e., all itemsets that have transaction support above a pre-determined minimum support $s$.
2. Use the large itemsets to generate the association rules for the database.

The overall performance of mining association rules is in fact determined by the first step. After the large itemsets are identified, the corresponding association rules can be derived in a straightforward manner. Apriori, DHP, and also PDM proposed in this paper mainly address the issue of discovering large itemsets.

2.2 Algorithm DHP

Like Apriori, DHP generates the candidate set of large $k$-itemsets, $C_k$, from the set of large $(k-1)$-itemsets, $L_{k-1}$. However, DHP is unique in that it employs a hash table, which is built in the previous pass, to test the eligibility of a $k$-itemset. Instead of including all $k$-itemsets from $L_{k-1} * L_{k-1}$ into $C_k$, DHP adds a $k$-itemset into $C_k$ only if that $k$-itemset is hashed into a hash entry whose value is larger than or equal to $s$. DHP also reduces the database size progressively by not only trimming each individual transaction size but also pruning the number of transactions in the database. We note that both DHP and Apriori are iterative algorithms on the large itemset size, in the sense that the large $k$-itemsets are derived from the large $(k-1)$-itemsets. As shown in [10], the hash filtering in DHP can drastically reduce the size of $C_k$ for small $k$ values, especially for $k = 2$. The significant size reduction on $C_2$ by DHP will in turn make the trimming and pruning of transactions very effective in later passes.
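The hash filtering described for DHP can be illustrated with a small sketch. This is not the implementation from [10]; it only shows the idea for $k = 2$ under stated assumptions: items are small integers, the hash function `bucket` and the table size of 7 are hypothetical choices made for illustration, $s$ is an absolute count, and transaction trimming and pruning are left out.

```python
from collections import Counter
from itertools import combinations

def bucket(pair, n_buckets):
    # Hypothetical hash function for a sorted pair of integer item ids.
    a, b = pair
    return (a * 10 + b) % n_buckets

def first_pass(transactions, n_buckets):
    """Count single items and hash every 2-itemset of each transaction."""
    item_counts = Counter()
    h2 = [0] * n_buckets
    for t in transactions:
        items = sorted(set(t))
        item_counts.update(items)
        for pair in combinations(items, 2):
            h2[bucket(pair, n_buckets)] += 1
    return item_counts, h2

def candidate_2_itemsets(item_counts, h2, s, n_buckets, use_hash_filter=True):
    """Build C_2 from the large items; optionally keep only pairs whose
    hash bucket count is at least s (the DHP-style filter)."""
    large_items = sorted(i for i, c in item_counts.items() if c >= s)
    pairs = combinations(large_items, 2)
    if not use_hash_filter:
        return list(pairs)  # Apriori-style: every pair of large items
    return [p for p in pairs if h2[bucket(p, n_buckets)] >= s]

transactions = [[1, 2, 3], [1, 2], [2, 3], [1, 2, 4], [4, 5]]
counts, h2 = first_pass(transactions, n_buckets=7)
apriori_c2 = candidate_2_itemsets(counts, h2, s=2, n_buckets=7, use_hash_filter=False)
dhp_c2 = candidate_2_itemsets(counts, h2, s=2, n_buckets=7)
print(len(apriori_c2), dhp_c2)
```

In this toy database the unfiltered candidate set has six pairs while the filtered one has three. The pair (2, 4) survives only because it shares a bucket with (4, 5), which shows that the filter can admit false candidates but never drops a pair whose true count reaches $s$.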
3 Parallel Data Mining

In this section we describe a parallel data mining algorithm, PDM, which effectively parallelizes algorithm DHP for mining association rules in a multiprocessor environment.

3.1 Algorithm for Parallel Data Mining

Basically, algorithm PDM consists of parallel generation of candidate itemsets and parallel determination of large itemsets. The algorithm uses a hash table to generate candidate itemsets at early passes, and then generates candidate itemsets directly from the previous large itemsets at later passes. Let $n_p$ be the number of nodes and $D_{1i}$ denote the database partition located at node $i$. Note that $\bigcup_{i=0}^{n_p-1} D_{1i} = D_1$, where $D_1$ is the initial transaction database, and that $D_{2i} = D_{1i}$. Each node $i$ starts by counting the support for each item in $D_{1i}$. We use $L_{1i}$ to represent the set of potential 1-itemsets, each of which has an associated count of its occurrences in $D_{1i}$. Each node also builds a hash table to count the support for each 2-itemset. We use $H_{2i}$ to represent the hash table and its associated counts for hash buckets. Note that, in contrast to the sequential case, the counts at each node here only contain information on a single database partition. To obtain the global counts, we need to sum $L_{1i}$ and $H_{2i}$ over all nodes so as to get $L_1$ and $H_2$. With a moderately sized $L_{1i}$, nodes can exchange information with each other directly to determine $L_1$. However, the size of $H_{2i}$ is usually large and the cost of direct exchange could be very high. To remedy this, we devise a clue-and-poll approach to judiciously select a small number of hash buckets, which have the potential to meet the support requirement, for information exchange. Specifically, instead of constructing $H_2$, we only need to figure out its subset which has hash counts larger than or equal to $s$, i.e., the set $\{x \mid H_2[x] \ge s\}$.
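To make the relationship between the local and global counts concrete, the following sketch computes $L_{1i}$ and $H_{2i}$ on each partition and sums them element-wise, which is the naive full exchange that clue-and-poll is meant to avoid. The hash function, the table size of 8, the toy database and the two-node split are all assumptions made for illustration, and the nodes are simulated sequentially in one process.

```python
from collections import Counter
from itertools import combinations

N_BUCKETS = 8  # illustrative size only

def bucket(pair):
    # Hypothetical hash function for a sorted pair of integer item ids.
    a, b = pair
    return (a * 10 + b) % N_BUCKETS

def local_pass(partition):
    """One node's first scan: item counts (L_1i) and hashed 2-itemset counts (H_2i)."""
    item_counts = Counter()
    h2 = [0] * N_BUCKETS
    for t in partition:
        items = sorted(set(t))
        item_counts.update(items)
        for pair in combinations(items, 2):
            h2[bucket(pair)] += 1
    return item_counts, h2

def global_counts(local_results):
    """Sum the per-node results element-wise to obtain the global counts."""
    total_items = Counter()
    total_h2 = [0] * N_BUCKETS
    for item_counts, h2 in local_results:
        total_items.update(item_counts)
        total_h2 = [a + b for a, b in zip(total_h2, h2)]
    return total_items, total_h2

database = [[1, 2, 3], [1, 2], [2, 3], [1, 2, 4], [4, 5], [2, 4]]
partitions = [database[0:3], database[3:6]]  # n_p = 2 simulated nodes
merged_items, merged_h2 = global_counts([local_pass(p) for p in partitions])

s = 2  # minimum support as an absolute count
large_buckets = {x for x, c in enumerate(merged_h2) if c >= s}
print(merged_h2, large_buckets)
```

Because counting is additive over disjoint partitions, the merged result equals what a single sequential scan of the whole database would produce. The cost of this direct approach is that every node ships its entire hash table, which is what motivates exchanging only promising buckets.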
After $\{x \mid H_2[x] \ge s\}$ is derived, we can divide it into $n_p$ equal partitions, and have each node generate a partial set of candidate 2-itemsets in parallel from its own database partition. This partial set of candidate 2-itemsets, $C_{2i}$, from node $i$ is the set of 2-itemsets whose hash values belong to the $i$-th partition of $\{x \mid H_2[x] \ge s\}$. By exchanging $C_{2i}$ among all nodes, we then get $C_2$. With $C_2$, each node can next scan its local database partition, in parallel, to derive $L_{2i}$, thus associating each 2-itemset in $C_2$ with its corresponding support count in the local database $D_{2i}$. Similarly to DHP, when a transaction is scanned to evaluate the support counts for 2-itemsets in $C_2$, we can simultaneously trim items which do not appear in any potential large itemset from the transactions, as well as prune transactions that do not contain potential large itemsets, thereby leading to a smaller database partition $D_{3i}$ to scan in later passes. By exchanging $L_{2i}$ among all nodes, the set of large 2-itemsets, $L_2$, can therefore be determined. Note that $C_3$, which equals $L_2 * L_2$, is in fact a join of $L_2$ with itself. We partition and parallelize the (nested loop) join operation on $L_2$ so that each node generates a subset $C_{3i}$ of the results in parallel. By exchanging $C_{3i}$, we hence get $C_3$. As such, the process repeats for later passes until all large itemsets are identified.

Figure 1: Local processing for $L_{1i}$ and $H_{2i}$ of Step P-1.
  Node N0: $D_{10} \to L_{10}$ and $H_{20}$ = [2, 17, 15, 0, 16, 24, 14, 0]
  Node N1: $D_{11} \to L_{11}$ and $H_{21}$ = [4, 13, 17, 0, 18, 23, 7, 1]
  Node N2: $D_{12} \to L_{12}$ and $H_{22}$ = [3, 12, 18, 16, 17, 22, 0, 3]
  Node N3: $D_{13} \to L_{13}$ and $H_{23}$ = [0, 15, 16, 9, 14, 14, 8, 0]

Figure 2: Generation of $C_k$ and $L_{ki}$'s.
  Node N$i$ ($i$ = 0, 1, 2, 3): $L_{k-1} \to C_{ki}$; then $C_k, D_{ki} \to L_{ki}, D_{(k+1)i}$

More formally, PDM is described in its algorithmic form below.

Algorithm PDM: Parallel algorithm for data mining.
P-1) Each node generates $L_{1i}$ and $H_{2i}$ by scanning $D_{1i}$.
P-2) Determine $L_1$ by exchanging $L_{1i}$ through all-to-all broadcast.
P-3) Call procedure Clue-and-Poll to find $\{x \mid H_2[x] \ge s\}$ from all $H_{2i}$'s.
P-4) $k = 2$.
P-5) While ($|L_{k-1}| \ge k$) do
P-6) If ($k = 2$), each node generates $C_{2i}$ from $L_1$ and $\{x \mid H_2[x] \ge s\}$.
P-7) Otherwise, each node generates $C_{ki}$ directly from $L_{k-1}$.
P-8) Determine $C_k = \bigcup_{i=0}^{n_p-1} C_{ki}$ by exchanging $C_{ki}$ among all nodes.
P-9) Each node generates $L_{ki}$ by scanning and trimming $D_{ki}$.
P-10) Determine $L_k$ by exchanging $L_{ki}$ among all nodes.
P-11) $k = k + 1$.

Figure 1 illustrates the mining procedure for determining $L_{1i}$ and $H_{2i}$ from $D_{1i}$ in a multiprocessor system of 4 nodes. Each square bracket in Figure 1 contains the counts of the 8 buckets in $H_{2i}$. Due to its complexity, the description of P-3, i.e., the clue-and-poll method of using the $H_{2i}$'s to obtain $\{x \mid H_2[x] \ge s\}$, is deferred to Section 3.2. In Steps P-2 and P-10, when these $L_{ki}$'s are exchanged, the transaction supports of the candidate $k$-itemsets in the $L_{ki}$'s are added together to reach the global count. After exchanging all $L_{ki}$'s, we get $L_k = \{x \mid x.count \ge s \text{ and } x \in C_k\}$. In Step P-6, we split $\{x \mid H_2[x] \ge s\}$ into $n_p$ partitions, and each node generates a partial set of candidate 2-itemsets, $C_{2i}$, from each of these partitions, where $C_2 = \bigcup_{i=0}^{n_p-1} C_{2i}$. This method generates a much smaller $C_2$ than the corresponding candidate itemset generation in Apriori. Figure 2 shows conceptually Steps P-7, P-8, and P-9 at each node. In Step P-7, we parallelize the generation of candidate $k$-itemsets by having each node generate a subset $C_{ki}$. Then we get $C_k = \bigcup_{i=0}^{n_p-1} C_{ki}$ in Step P-8. In Step P-9, node $i$ counts the transaction support of each $k$-itemset in $C_k$ by scanning $D_{ki}$. When node $i$ scans $D_{ki}$ to obtain $L_{ki}$, it trims the sizes of transactions in $D_{ki}$ by removing from each transaction those items which are deemed not useful for later large itemset generation, thus resulting in $D_{(k+1)i}$ for $L_{k+1}$.

3.2 Procedure Clue-and-Poll

We describe here a procedure which is able to efficiently determine a small number of elements from many large distributed data sets. Explicitly, we need to identify the bucket addresses with the minimum support, $\{x \mid H_2[x] \ge s\}$, for Step P-6 from the $H_{2i}$'s located at distributed nodes. A naive method would be to exchange $H_{2i}$ among all nodes. However, this method incurs substantial communication costs since the size of $H_{2i}$ is usually very large, say, approximately 512K. To minimize the communication cost, we devise procedure Clue-and-Poll to judiciously exchange only a small set of elements which could possibly reach the minimum support. Recall that $x \in \{y \mid H_2[y] \ge s\}$ means

$$\sum_{i=0}^{n_p-1} x.count_i \ge s,$$

where $x.count_i$ is the count information of the bucket address $x$ at node $i$. Obviously, as a necessary condition for a hash address to have a global count reaching the minimal support requirement, there must exist at least one node $j$ with a count of that bucket address, $x.count_j$, greater than or equal to $\lfloor s/n_p \rfloor$. In light of this, if each node only exchanges bucket addresses with a count greater than or equal to $\lfloor s/n_p \rfloor$, no hash bucket with the minimal support will be left out.

After the exchange, each node can obtain an accumulated count by adding together the count values on each bucket address. We maintain for each hash bucket $x$ its accumulated count, $x.count$, and the number of nodes, denoted by $x.n_b$, that send information on the bucket. The hash buckets of the resulting hash table can be classified into two cases:

1. $x.count \ge s$, where $x$ is a bucket address reaching the minimum support requirement.
2. $x.count < s$.

For the first case, it is concluded that the count in this hash bucket has reached the minimum support, and this hash bucket will be involved in later processing. Note, however, that the accumulated count for hash bucket $x$ does not necessarily include the counts from all $n_p$ nodes, because only those nodes with their counts for hash bucket $x$ reaching the exchange threshold (i.e., $\lfloor s/n_p \rfloor$) provide the count information. In view of this, for the second case, we need to determine whether or not the bucket could reach the support requirement if all the remaining nodes which did not send in their bucket counts are polled.

We now describe the polling criterion. There are $(n_p - x.n_b)$ nodes which did not send information on the bucket address $x$, and each of these nodes can have a count value of at most $(\lfloor s/n_p \rfloor - 1)$. Thus, the sum of the counts from these nodes on that bucket address has an upper bound of $(\lfloor s/n_p \rfloor - 1)(n_p - x.n_b)$. The maximal value of the count after polling this set of nodes is hence $x.count + (\lfloor s/n_p \rfloor - 1)(n_p - x.n_b)$. If this maximal count value is at least the minimum support, we should poll these $(n_p - x.n_b)$ nodes. Each of these $(n_p - x.n_b)$ nodes, upon receiving the request, exchanges its count on the bucket address $x$ with the others.

It is noted that if we use $\lfloor s/(\theta n_p) \rfloor$ for $\theta > 1.0$, instead of $\lfloor s/n_p \rfloor$, as the threshold in the first step of the communication, we are lowering the threshold on the bucket count to exchange information. Although it increases the amount of communication in the first step, the use of a larger $\theta$ will reduce the communication cost in the polling step. The value of $\theta$ was varied to study this trade-off on the communication cost in [9].

Formally, procedure Clue-and-Poll can be described as follows.

Procedure Clue-and-Poll: An efficient method to identify elements with the minimal support required.
/* Input: $\{S_i;\ i = 0, \ldots, n_p - 1\}$, where $S_i$ contains count information and is stored at node $i$. */
/* Output: $S_T = \{x \mid x \in S \text{ and } x.count \ge s\}$, where $S = \bigcup_{i=0}^{n_p-1} S_i$. */
C-1) Node $i$ sends information on elements with a count reaching the threshold requirement, i.e., $R_i = \{x \mid x \in S_i \text{ and } x.count \ge \lfloor s/n_p \rfloor\}$.
C-2) Determine
  $S_T = \{x \mid x.count \ge s \text{ and } x \in \bigcup_{i=0}^{n_p-1} R_i\}$ and
  $R_{poll} = \{x \mid s - (\lfloor s/n_p \rfloor - 1)(n_p - x.n_b) \le x.count < s \text{ and } x \in \bigcup_{i=0}^{n_p-1} R_i\}$,
  where $x.n_b$ is the number of nodes which sent element $x$.
C-3) If $R_{poll} = \emptyset$ then stop. Otherwise, do this step: poll the count information of each element in $R_{poll}$ from the nodes which did not send information on the element in Step C-1, and add this count information to the corresponding element in $R_{poll}$. Then $S_T = S_T \cup \{x \mid x.count \ge s \text{ and } x \in R_{poll}\}$.

The operations of procedure Clue-and-Poll to determine bucket addresses with the minimum support can be best understood through the numerical example given in Figure 1. For Step P-3 of PDM, $S_i$ in procedure Clue-and-Poll corresponds to $H_{2i}$. Suppose that $n_p = 4$, $|D_1| = 10{,}000$, $s = 0.6\%$ (i.e., $s = 60$), and $\theta = 1.0$. In Step C-1, node 0 sends out count information on bucket addresses 1, 2, 4, and 5, and receives information on bucket addresses with counts greater than or equal to $\lfloor s/n_p \rfloor = 15$ from other nodes. As a result of Step C-1, all nodes have $\bigcup_{i=0}^{n_p-1} R_i = \{(1, 32, 2), (2, 66, 4), (3, 16, 1), (4, 51, 3), (5, 69, 3)\}$, where the triplet in each parenthesis represents $(x, x.count, x.n_b)$, i.e., the bucket address, the accumulated count from the exchange, and the number of nodes sending information on the bucket, respectively. Step C-2 determines that $S_T = \{2, 5\}$ and $R_{poll} = \{1, 4\}$. In Step C-3, since $R_{poll} \ne \emptyset$, the count information of each element in $R_{poll}$ is polled, and consequently, the information on buckets 1 and 4 gets updated to become $\{(1, 57, 4), (4, 65, 4)\}$. Finally, we have $S_T = \{x \mid H_2[x] \ge s\} = \{2, 4, 5\}$.

Figure 3: The process of generating synthetic transactions. Items ($N = 1000$) pass through PICK-ITEM to form potentially large itemsets ($|L| = 2000$, $|I| = 4$), which pass through PICK-PLI to form transactions ($|D_1| = 500{,}000$, $|T| = 20$).

Note that some steps of PDM need to exchange information for finding candidate itemsets and large itemsets. The communication scheme used in this paper is based on the all-to-all broadcast scheme [6].
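The numerical example based on Figure 1 can be replayed with a short sketch of Steps C-1 to C-3. This is a single-process simulation and not the authors' implementation: the function name `clue_and_poll` and its return values are hypothetical, the exchange threshold is fixed at $\lfloor s/n_p \rfloor$, and the message passing between nodes is replaced by direct access to each node's local table.

```python
def clue_and_poll(local_counts, s):
    """Simulate Clue-and-Poll over per-node bucket count lists.

    local_counts[i][x] is the count of bucket x at node i; s is the
    minimum support as an absolute count. Returns (S_T, R_poll, acc),
    where acc maps a bucket to [accumulated count, number of senders].
    """
    n_p = len(local_counts)
    threshold = s // n_p  # exchange threshold, floor(s / n_p)

    # C-1: each node sends only buckets whose local count reaches the threshold.
    acc = {}
    sent = [set() for _ in range(n_p)]
    for i, counts in enumerate(local_counts):
        for x, c in enumerate(counts):
            if c >= threshold:
                entry = acc.setdefault(x, [0, 0])
                entry[0] += c
                entry[1] += 1
                sent[i].add(x)

    # C-2: buckets already at the support, and buckets that could still reach it
    # if every silent node contributed its maximum possible (threshold - 1).
    s_t = {x for x, (c, _) in acc.items() if c >= s}
    r_poll = {x for x, (c, n_b) in acc.items()
              if c < s and c + (threshold - 1) * (n_p - n_b) >= s}

    # C-3: poll the nodes that stayed silent on each bucket in R_poll.
    for x in r_poll:
        for i, counts in enumerate(local_counts):
            if x not in sent[i]:
                acc[x][0] += counts[x]
                acc[x][1] += 1
        if acc[x][0] >= s:
            s_t.add(x)
    return s_t, r_poll, acc

# Hash tables H_20 .. H_23 from Figure 1, with s = 60 and n_p = 4.
H2 = [
    [2, 17, 15, 0, 16, 24, 14, 0],
    [4, 13, 17, 0, 18, 23, 7, 1],
    [3, 12, 18, 16, 17, 22, 0, 3],
    [0, 15, 16, 9, 14, 14, 8, 0],
]
s_t, r_poll, acc = clue_and_poll(H2, s=60)
print(sorted(s_t), sorted(r_poll))
```

Bucket 3 is never polled in this run: its accumulated count of 16 plus the bound of 14 for each of the three silent nodes gives at most 58, which cannot reach 60. Bucket 1 is polled because its bound is exactly 60, and it ends at 57, so it is excluded from the result.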
The performance impact of using other communication schemes was evaluated in [9].

4 Simulation Results

We evaluated the performance of PDM based on a synthetically generated transaction database. A sales transaction consists of a collection of sales items. Different methods of generating the items in a transaction were considered. Sensitivity analysis on PDM was conducted on the database size, the transaction size, and the degree of parallelism. A shared-nothing parallel computer is considered as the basic model to evaluate our algorithms. The methods used to generate synthetic transactions are described in Section 4.1. For comparison purposes, in Section 4.2 we also implement another parallel algorithm that does not use the technique of hashing in candidate set generation. This algorithm is referred to as PF. A performance comparison of the PDM and PF algorithms is then conducted via simulation. As will be shown by the simulation results, the efficiency of PDM is significantly better than that of PF.

4.1 Generation of Synthetic Transactions

Synthetic transactions are generated from a set of $N$ items with the purpose of emulating sales transactions in a retail industry. Figure 3 shows a basic generation method for the transaction database, which is in essence similar to those used in [5] and [10]. Table 1 summarizes the meaning of various parameters used in the experiments. Synthetic transactions consist of a series of potentially large itemsets (PLI's) which are chosen according to the weight of each PLI. The method of selecting PLI's to form a transaction is referred to as PICK-PLI. The weight of each PLI corresponds to the probability that this PLI will be included in these synthetic transactions.
Table 1: Meaning of various parameters.
  $D_k$     Set of transactions for large $k$-itemsets.
  $|D_k|$   The number of transactions in $D_k$.
  $H_k$     Hash table containing $|H_k|$ buckets for $C_k$.
  $C_k$     Set of candidate $k$-itemsets.
  $L_k$     Set of large $k$-itemsets.
  $|T|$     Average size of the transactions.
  $|I|$     Average size of maximal potentially large itemsets.
  $|L|$     Number of maximal potentially large itemsets.
  $N$       Number of items.

Table 2: Simulation results for varying the size of $D_1$ when $n_p = 32$. (Columns range over $|D_1|$ from 100,000 to 1,000,000; rows give $D_1$ in KB and the execution times, including $T_{comm}$ and $T_{cpu}$. The numeric entries are not preserved in this transcription.)

We next consider the method of selecting items to form PLI's, denoted by PICK-ITEM. For the first PLI, items are chosen by a random selection method in such a way that each item is chosen from some given distribution of $N$ items, e.g., a uniform distribution. To capture the phenomenon that large itemsets often have common items, some fraction of the items in each subsequent PLI is chosen from the previous PLI, and the remaining items are chosen based on the above-mentioned random selection. The common fraction is referred to as the correlation level and is generated from an exponentially distributed random variable with mean equal to a predetermined value. As in [5], the mean of the correlation level is set to 0.5 for our experiments. The transaction database is divided into $n_p$ partitions, each of which contains approximately the same number of transactions. The $i$-th partition of the database is allocated at node $i$.

Transactions are generated under the assumption that the random selection method in PICK-ITEM chooses an item from a uniform distribution of $N$ items. Also, in PICK-PLI the weight of each PLI is picked from an exponential distribution with unit mean. Furthermore, all the weights are normalized so that their sum equals one. A study on using a Zipf-like distribution to model the data skew, i.e., skew on items contained in the transactions, can be found in [9].
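The two-stage generation described above can be sketched roughly as follows. This is a simplified illustration rather than the generator used in the experiments: every PLI has a fixed size instead of an average size, each transaction is filled with whole PLI's until it holds at least a target number of items, the correlation level is clipped to 1, and the partitioning is round-robin. All function names and default parameters are hypothetical.

```python
import random

def pick_item(n_items, n_pli, pli_size, mean_corr, rng):
    """PICK-ITEM sketch: build PLI's, reusing a random fraction of the previous one."""
    plis, prev = [], []
    for _ in range(n_pli):
        # Correlation level: exponential with the given mean, clipped to [0, 1].
        corr = min(1.0, rng.expovariate(1.0 / mean_corr))
        n_common = min(len(prev), int(round(corr * pli_size)))
        pli = set(rng.sample(prev, n_common))
        while len(pli) < pli_size:  # remaining items drawn uniformly from N items
            pli.add(rng.randrange(n_items))
        prev = sorted(pli)
        plis.append(prev)
    return plis

def pick_pli(plis, n_trans, trans_size, rng):
    """PICK-PLI sketch: draw PLI's by normalized exponential weights."""
    weights = [rng.expovariate(1.0) for _ in plis]  # unit-mean exponential
    total = sum(weights)
    weights = [w / total for w in weights]          # weights now sum to one
    db = []
    for _ in range(n_trans):
        t = set()
        # Assumes the PLI's jointly cover at least trans_size distinct items.
        while len(t) < trans_size:
            t.update(rng.choices(plis, weights=weights, k=1)[0])
        db.append(sorted(t))
    return db

def partition(db, n_p):
    """Split the database into n_p partitions of nearly equal size."""
    return [db[i::n_p] for i in range(n_p)]

rng = random.Random(0)
plis = pick_item(n_items=1000, n_pli=200, pli_size=4, mean_corr=0.5, rng=rng)
db = pick_pli(plis, n_trans=500, trans_size=10, rng=rng)
parts = partition(db, n_p=4)
print(len(db), [len(p) for p in parts])
```

Scaling `n_pli`, `n_trans` and `trans_size` toward the values shown in Figure 3 only changes the arguments; the small values here keep the sketch quick to run.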
4.2 Performance of PDM

We now analyze the sensitivity of the performance of PDM to the database size. Table 2 shows that not only the execution times, but also the efficiencies, are affected by the size of the database. The parallelization efficiency of performing a task in parallel is defined as

$$\eta = \frac{T_1}{n_p \, T_{n_p}},$$

where $T_i$ is the average execution time when $i$ nodes are employed to perform the task, i.e., $T_1$ is the execution time on a single processing node and $T_{n_p}$ is the execution time in a system of $n_p$ nodes. It is noted that when $n_p$ nodes are employed to perform a given task in parallel, the execution time differs from node to node because the database partitions at the sites are not identical. $T_{cpu}$ is determined by adding up, over all steps, the maximum execution times among the $n_p$ nodes. Note that $T_{cpu}$ also contains the time to receive and process the messages in the communication steps. Thus, $T_{n_p}$ equals the sum of $T_{cpu}$ and $T_{comm}$, where $T_{comm}$ is obtained as follows. Given the message latency $A$ (in seconds) and the point-to-point bandwidth $B$ (in bytes/second), the message passing time is modeled by $f(x) = A + bx$, where $b = 1/B$ and $x$ is the message size in bytes. In the multiprocessor system considered, $A$ is assumed to be 30 microseconds and $B$ = 14 MB/second for full duplex communication. As shown in Table 2, when the number of transactions increases, the relative communication cost decreases, thus improving $\eta$. This is due to the fact that the statistical variation among the different nodes on the count value at each hash class decreases as the number of transactions increases.

As mentioned before, we also implement another parallel mining algorithm, called PF, which does not use the technique of hashing. Since the number of candidate 2-itemsets, i.e., $|L_1|(|L_1| - 1)/2$, is very large, procedure PF generates $C_2$ at each node, and then utilizes procedure Clue-and-Poll to determine $L_2$ (see footnote 1).

Procedure PF: A direct parallelization without using the hashing technique.
F-1) Each node generates $L_{1i}$ by scanning $D_{1i}$.
F-2) Determine $L_1$ by exchanging $L_{1i}$ among all nodes.
F-3) Each node generates $C_2$, and then derives $L_{2i}$. $L_{2i}$ is generated by obtaining the transaction support of each 2-itemset in $C_2$ through scanning $D_{1i}$. (Note that, similarly to [5], some auxiliary data structure on $C_2$ is built to facilitate the matching between $D_{1i}$ and the 2-itemsets in $C_2$.)
F-4) Call procedure Clue-and-Poll to derive $L_2$ from all $L_{2i}$'s.
F-5) $k = 3$.
F-6) Do Step P-5 to Step P-11 of algorithm PDM.

Footnote 1: Note that utilizing procedure Clue-and-Poll in PF will not only improve the performance of PF but also result in a fair comparison between PF and PDM, hence allowing us to better capture the advantage of using hashing in PDM.

Performance comparisons were conducted between algorithms PDM and PF, and the results are summarized in Table 3 and Figure 4 for $|D_1| = 250{,}000$. Two different transaction sizes, 10 and 20, are considered. It is noted that procedure PF performs poorly due to the very large number of candidate 2-itemsets, i.e., $|C_2| = |L_1|(|L_1| - 1)/2$. The large size of $C_2$ increases not only the time to generate it from $L_1$, but also the time to scan the transactions and count the support for each 2-itemset in $C_2$. The communication time to determine $L_2$ is also increased. Furthermore, the large $C_2$ leads to little trimming and pruning effect on the transactions in $D_{1i}$, and results in the size of $D_{3i}$ being nearly equal to the size of $D_{1i}$.

Table 3: Performance comparisons between PDM and PF. (Column groups: $|T| = 10$, $|I| = 4$ and $|T| = 20$, $|I| = 4$; rows are indexed by $n_p$ and report the execution times and efficiencies of PDM and PF. The numeric entries are not preserved in this transcription.)

Figure 4: Execution times of PDM and PF when $|T| = 10$, $|I| = 4$, and $|D_1| = 250{,}000$. (Axes: execution time in seconds versus number of nodes; series T-PDM and T-PF.)

It is worth mentioning that under PF, the time to generate $C_2$ and build the auxiliary data structure to facilitate the counting and comparison for itemsets among transactions is about 7.25% of the total execution time when $|T| = 20$, $|I| = 4$, and $n_p = 8$ in Table 3. The corresponding time in PDM is about 0.33% of the total time. This shows that the hashing approach of generating a small $C_2$ in PDM is very effective and parallelizable in a multiprocessor environment. Although still far worse than PDM, the performance of PF improves when a larger number of passes is incurred, as its parallelization efficiency for $|T| = 20$ is better than that for $|T| = 10$. (See Table 3.) This is because the effect of the large $C_2$ diminishes as the number of passes increases.

5 Conclusions

We developed in this paper algorithm PDM to conduct parallel data mining for association rules. With a given database partition, each processing node collected count information on each itemset from its local database efficiently via a hashing method. The information discovered by each node was next shared with other processors via some communication schemes. Then, PDM employed a clue-and-poll procedure to address the uncertainty due to the partial knowledge collected at each node by judiciously selecting a small fraction of the itemsets to exchange count information. The global set of large itemsets was hence determined based on the aggregate count of itemsets. It was experimentally shown that PDM not only attains very good parallelization efficiencies, but also provides robust performance in the presence of data skew.

References

[1] R. Agrawal, C. Faloutsos, and A. Swami. Efficient Similarity Search in Sequence Databases. Proceedings of the 4th Intl. Conf. on Foundations of Data Organization and Algorithms, October.
[2] R. Agrawal, S. Ghosh, T. Imielinski, B. Iyer, and A. Swami. An Interval Classifier for Database Mining Applications. Proceedings of the 18th International Conference on Very Large Data Bases, pages 560-573, August.
[3] R. Agrawal, T. Imielinski, and A. Swami. Mining Association Rules between Sets of Items in Large Databases. Proceedings of ACM SIGMOD, pages 207-216, May.
[4] R. Agrawal and R. Srikant. Mining Sequential Patterns. Proceedings of the 11th International Conference on Data Engineering, pages 3-14, March.
[5] R. Agrawal and R. Srikant. Fast Algorithms for Mining Association Rules in Large Databases. Proceedings of the 20th International Conference on Very Large Data Bases, pages 478-499, September.
[6] M.-S. Chen, P. S. Yu, and K.-L. Wu. Optimal NODUP All-To-All Broadcast Schemes in Distributed Systems. IEEE Transactions on Parallel and Distributed Systems, pages 1275-1285, December.
[7] J. Han, Y. Cai, and N. Cercone. Knowledge Discovery in Databases: An Attribute-Oriented Approach. Proceedings of the 18th International Conference on Very Large Data Bases, pages 547-559, August.
[8] R. T. Ng and J. Han. Efficient and Effective Clustering Methods for Spatial Data Mining. Proceedings of the 18th International Conference on Very Large Data Bases, pages 144-155, September.
[9] J.-S. Park, M.-S. Chen, and P. S. Yu. Efficient Parallel Data Mining for Association Rules. IBM Research Report, RC 20156, August.
[10] J.-S. Park, M.-S. Chen, and P. S. Yu. An Effective Hash Based Algorithm for Mining Association Rules. Proceedings of ACM SIGMOD, pages 175-186, May 1995.
Mining Association Rules with Item Constraints Ramakrishnan Srikant and Quoc Vu and Rakesh Agrawal IBM Almaden Research Center 650 Harry Road, San Jose, CA 95120, U.S.A. fsrikant,qvu,ragrawalg@almaden.ibm.com
More informationData Structures. Notes for Lecture 14 Techniques of Data Mining By Samaher Hussein Ali Association Rules: Basic Concepts and Application
Data Structures Notes for Lecture 14 Techniques of Data Mining By Samaher Hussein Ali 2009-2010 Association Rules: Basic Concepts and Application 1. Association rules: Given a set of transactions, find
More informationCS570 Introduction to Data Mining
CS570 Introduction to Data Mining Frequent Pattern Mining and Association Analysis Cengiz Gunay Partial slide credits: Li Xiong, Jiawei Han and Micheline Kamber George Kollios 1 Mining Frequent Patterns,
More informationImproved Frequent Pattern Mining Algorithm with Indexing
IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 16, Issue 6, Ver. VII (Nov Dec. 2014), PP 73-78 Improved Frequent Pattern Mining Algorithm with Indexing Prof.
More informationParallel Mining of Maximal Frequent Itemsets in PC Clusters
Proceedings of the International MultiConference of Engineers and Computer Scientists 28 Vol I IMECS 28, 19-21 March, 28, Hong Kong Parallel Mining of Maximal Frequent Itemsets in PC Clusters Vong Chan
More informationMining of association rules is a research topic that has received much attention among the various data mining problems. Many interesting wors have be
Is Sampling Useful in Data Mining? A Case in the Maintenance of Discovered Association Rules. S.D. Lee David W. Cheung Ben Kao Department of Computer Science, The University of Hong Kong, Hong Kong. fsdlee,dcheung,aog@cs.hu.h
More informationModel for Load Balancing on Processors in Parallel Mining of Frequent Itemsets
American Journal of Applied Sciences 2 (5): 926-931, 2005 ISSN 1546-9239 Science Publications, 2005 Model for Load Balancing on Processors in Parallel Mining of Frequent Itemsets 1 Ravindra Patel, 2 S.S.
More informationMining Frequent Itemsets for data streams over Weighted Sliding Windows
Mining Frequent Itemsets for data streams over Weighted Sliding Windows Pauray S.M. Tsai Yao-Ming Chen Department of Computer Science and Information Engineering Minghsin University of Science and Technology
More informationA mining method for tracking changes in temporal association rules from an encoded database
A mining method for tracking changes in temporal association rules from an encoded database Chelliah Balasubramanian *, Karuppaswamy Duraiswamy ** K.S.Rangasamy College of Technology, Tiruchengode, Tamil
More informationAn Improved Apriori Algorithm for Association Rules
Research article An Improved Apriori Algorithm for Association Rules Hassan M. Najadat 1, Mohammed Al-Maolegi 2, Bassam Arkok 3 Computer Science, Jordan University of Science and Technology, Irbid, Jordan
More informationUsing Pattern-Join and Purchase-Combination for Mining Web Transaction Patterns in an Electronic Commerce Environment
Using Pattern-Join and Purchase-Combination for Mining Web Transaction Patterns in an Electronic Commerce Environment Ching-Huang Yun and Ming-Syan Chen Department of Electrical Engineering National Taiwan
More informationPSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets
2011 Fourth International Symposium on Parallel Architectures, Algorithms and Programming PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets Tao Xiao Chunfeng Yuan Yihua Huang Department
More informationAssociation Rule Mining. Entscheidungsunterstützungssysteme
Association Rule Mining Entscheidungsunterstützungssysteme Frequent Pattern Analysis Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
More informationAggregation and maintenance for database mining
Intelligent Data Analysis 3 (1999) 475±490 www.elsevier.com/locate/ida Aggregation and maintenance for database mining Shichao Zhang School of Computing, National University of Singapore, Lower Kent Ridge,
More informationAn Ecient Algorithm for Mining Association Rules in Large. Databases. Ashok Savasere Edward Omiecinski Shamkant Navathe. College of Computing
An Ecient Algorithm for Mining Association Rules in Large Databases Ashok Savasere Edward Omiecinski Shamkant Navathe College of Computing Georgia Institute of Technology Atlanta, GA 3332 e-mail: fashok,edwardo,shamg@cc.gatech.edu
More informationAppropriate Item Partition for Improving the Mining Performance
Appropriate Item Partition for Improving the Mining Performance Tzung-Pei Hong 1,2, Jheng-Nan Huang 1, Kawuu W. Lin 3 and Wen-Yang Lin 1 1 Department of Computer Science and Information Engineering National
More informationCHAPTER V ADAPTIVE ASSOCIATION RULE MINING ALGORITHM. Please purchase PDF Split-Merge on to remove this watermark.
119 CHAPTER V ADAPTIVE ASSOCIATION RULE MINING ALGORITHM 120 CHAPTER V ADAPTIVE ASSOCIATION RULE MINING ALGORITHM 5.1. INTRODUCTION Association rule mining, one of the most important and well researched
More informationAn Algorithm for Frequent Pattern Mining Based On Apriori
An Algorithm for Frequent Pattern Mining Based On Goswami D.N.*, Chaturvedi Anshu. ** Raghuvanshi C.S.*** *SOS In Computer Science Jiwaji University Gwalior ** Computer Application Department MITS Gwalior
More informationMining of Web Server Logs using Extended Apriori Algorithm
International Association of Scientific Innovation and Research (IASIR) (An Association Unifying the Sciences, Engineering, and Applied Research) International Journal of Emerging Technologies in Computational
More informationA Fast Distributed Algorithm for Mining Association Rules
A Fast Distributed Algorithm for Mining Association Rules David W. Cheung y Jiawei Han z Vincent T. Ng yy Ada W. Fu zz Yongjian Fu z y Department of Computer Science, The University of Hong Kong, Hong
More informationAssociation Rules. Berlin Chen References:
Association Rules Berlin Chen 2005 References: 1. Data Mining: Concepts, Models, Methods and Algorithms, Chapter 8 2. Data Mining: Concepts and Techniques, Chapter 6 Association Rules: Basic Concepts A
More informationDiscovery of Frequent Itemset and Promising Frequent Itemset Using Incremental Association Rule Mining Over Stream Data Mining
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 5, May 2014, pg.923
More informationLecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R,
Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R, statistics foundations 5 Introduction to D3, visual analytics
More informationA Decremental Algorithm for Maintaining Frequent Itemsets in Dynamic Databases *
A Decremental Algorithm for Maintaining Frequent Itemsets in Dynamic Databases * Shichao Zhang 1, Xindong Wu 2, Jilian Zhang 3, and Chengqi Zhang 1 1 Faculty of Information Technology, University of Technology
More informationChapter 4: Mining Frequent Patterns, Associations and Correlations
Chapter 4: Mining Frequent Patterns, Associations and Correlations 4.1 Basic Concepts 4.2 Frequent Itemset Mining Methods 4.3 Which Patterns Are Interesting? Pattern Evaluation Methods 4.4 Summary Frequent
More informationsignicantly higher than it would be if items were placed at random into baskets. For example, we
2 Association Rules and Frequent Itemsets The market-basket problem assumes we have some large number of items, e.g., \bread," \milk." Customers ll their market baskets with some subset of the items, and
More informationApplying Objective Interestingness Measures. in Data Mining Systems. Robert J. Hilderman and Howard J. Hamilton. Department of Computer Science
Applying Objective Interestingness Measures in Data Mining Systems Robert J. Hilderman and Howard J. Hamilton Department of Computer Science University of Regina Regina, Saskatchewan, Canada SS 0A fhilder,hamiltong@cs.uregina.ca
More informationParallelizing Frequent Itemset Mining with FP-Trees
Parallelizing Frequent Itemset Mining with FP-Trees Peiyi Tang Markus P. Turkia Department of Computer Science Department of Computer Science University of Arkansas at Little Rock University of Arkansas
More informationNull. Example A. Level 0
A Tree Projection Algorithm For Generation of Frequent Itemsets Ramesh C. Agarwal, Charu C. Aggarwal, V.V.V. Prasad IBM T. J. Watson Research Center, Yorktown Heights, NY 10598 E-mail: f agarwal, charu,
More informationNetwork. Department of Statistics. University of California, Berkeley. January, Abstract
Parallelizing CART Using a Workstation Network Phil Spector Leo Breiman Department of Statistics University of California, Berkeley January, 1995 Abstract The CART (Classication and Regression Trees) program,
More informationPerformance Improvements. IBM Almaden Research Center. Abstract. The problem of mining sequential patterns was recently introduced
Mining Sequential Patterns: Generalizations and Performance Improvements Ramakrishnan Srikant? and Rakesh Agrawal fsrikant, ragrawalg@almaden.ibm.com IBM Almaden Research Center 650 Harry Road, San Jose,
More informationTemporal Weighted Association Rule Mining for Classification
Temporal Weighted Association Rule Mining for Classification Purushottam Sharma and Kanak Saxena Abstract There are so many important techniques towards finding the association rules. But, when we consider
More informationAn Efficient Reduced Pattern Count Tree Method for Discovering Most Accurate Set of Frequent itemsets
IJCSNS International Journal of Computer Science and Network Security, VOL.8 No.8, August 2008 121 An Efficient Reduced Pattern Count Tree Method for Discovering Most Accurate Set of Frequent itemsets
More informationAC-Close: Efficiently Mining Approximate Closed Itemsets by Core Pattern Recovery
: Efficiently Mining Approximate Closed Itemsets by Core Pattern Recovery Hong Cheng Philip S. Yu Jiawei Han University of Illinois at Urbana-Champaign IBM T. J. Watson Research Center {hcheng3, hanj}@cs.uiuc.edu,
More informationA New Method for Similarity Indexing of Market Basket Data. Charu C. Aggarwal Joel L. Wolf. Philip S.
A New Method for Similarity Indexing of Market Basket Data Charu C. Aggarwal Joel L. Wolf IBM T. J. Watson Research Center IBM T. J. Watson Research Center Yorktown Heights, NY 10598 Yorktown Heights,
More informationHierarchical Online Mining for Associative Rules
Hierarchical Online Mining for Associative Rules Naresh Jotwani Dhirubhai Ambani Institute of Information & Communication Technology Gandhinagar 382009 INDIA naresh_jotwani@da-iict.org Abstract Mining
More informationMining Temporal Association Rules in Network Traffic Data
Mining Temporal Association Rules in Network Traffic Data Guojun Mao Abstract Mining association rules is one of the most important and popular task in data mining. Current researches focus on discovering
More informationFIT: A Fast Algorithm for Discovering Frequent Itemsets in Large Databases Sanguthevar Rajasekaran
FIT: A Fast Algorithm for Discovering Frequent Itemsets in Large Databases Jun Luo Sanguthevar Rajasekaran Dept. of Computer Science Ohio Northern University Ada, OH 4581 Email: j-luo@onu.edu Dept. of
More informationData Mining Part 3. Associations Rules
Data Mining Part 3. Associations Rules 3.2 Efficient Frequent Itemset Mining Methods Fall 2009 Instructor: Dr. Masoud Yaghini Outline Apriori Algorithm Generating Association Rules from Frequent Itemsets
More informationRoadmap DB Sys. Design & Impl. Association rules - outline. Citations. Association rules - idea. Association rules - idea.
15-721 DB Sys. Design & Impl. Association Rules Christos Faloutsos www.cs.cmu.edu/~christos Roadmap 1) Roots: System R and Ingres... 7) Data Analysis - data mining datacubes and OLAP classifiers association
More informationCHAPTER 5 WEIGHTED SUPPORT ASSOCIATION RULE MINING USING CLOSED ITEMSET LATTICES IN PARALLEL
68 CHAPTER 5 WEIGHTED SUPPORT ASSOCIATION RULE MINING USING CLOSED ITEMSET LATTICES IN PARALLEL 5.1 INTRODUCTION During recent years, one of the vibrant research topics is Association rule discovery. This
More informationSETM*-MaxK: An Efficient SET-Based Approach to Find the Largest Itemset
SETM*-MaxK: An Efficient SET-Based Approach to Find the Largest Itemset Ye-In Chang and Yu-Ming Hsieh Dept. of Computer Science and Engineering National Sun Yat-Sen University Kaohsiung, Taiwan, Republic
More informationSA-IFIM: Incrementally Mining Frequent Itemsets in Update Distorted Databases
SA-IFIM: Incrementally Mining Frequent Itemsets in Update Distorted Databases Jinlong Wang, Congfu Xu, Hongwei Dan, and Yunhe Pan Institute of Artificial Intelligence, Zhejiang University Hangzhou, 310027,
More informationAssociation mining rules
Association mining rules Given a data set, find the items in data that are associated with each other. Association is measured as frequency of occurrence in the same context. Purchasing one product when
More informationEfficient Remining of Generalized Multi-supported Association Rules under Support Update
Efficient Remining of Generalized Multi-supported Association Rules under Support Update WEN-YANG LIN 1 and MING-CHENG TSENG 1 Dept. of Information Management, Institute of Information Engineering I-Shou
More informationA Literature Review of Modern Association Rule Mining Techniques
A Literature Review of Modern Association Rule Mining Techniques Rupa Rajoriya, Prof. Kailash Patidar Computer Science & engineering SSSIST Sehore, India rprajoriya21@gmail.com Abstract:-Data mining is
More informationPincer-Search: An Efficient Algorithm. for Discovering the Maximum Frequent Set
Pincer-Search: An Efficient Algorithm for Discovering the Maximum Frequent Set Dao-I Lin Telcordia Technologies, Inc. Zvi M. Kedem New York University July 15, 1999 Abstract Discovering frequent itemsets
More informationDATA MINING II - 1DL460
Uppsala University Department of Information Technology Kjell Orsborn DATA MINING II - 1DL460 Assignment 2 - Implementation of algorithm for frequent itemset and association rule mining 1 Algorithms for
More informationAn Improved Algorithm for Mining Association Rules Using Multiple Support Values
An Improved Algorithm for Mining Association Rules Using Multiple Support Values Ioannis N. Kouris, Christos H. Makris, Athanasios K. Tsakalidis University of Patras, School of Engineering Department of
More informationChapter 4: Association analysis:
Chapter 4: Association analysis: 4.1 Introduction: Many business enterprises accumulate large quantities of data from their day-to-day operations, huge amounts of customer purchase data are collected daily
More informationEgemen Tanin, Tahsin M. Kurc, Cevdet Aykanat, Bulent Ozguc. Abstract. Direct Volume Rendering (DVR) is a powerful technique for
Comparison of Two Image-Space Subdivision Algorithms for Direct Volume Rendering on Distributed-Memory Multicomputers Egemen Tanin, Tahsin M. Kurc, Cevdet Aykanat, Bulent Ozguc Dept. of Computer Eng. and
More informationMining Quantitative Association Rules on Overlapped Intervals
Mining Quantitative Association Rules on Overlapped Intervals Qiang Tong 1,3, Baoping Yan 2, and Yuanchun Zhou 1,3 1 Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China {tongqiang,
More informationParallel Association Rule Mining by Data De-Clustering to Support Grid Computing
Parallel Association Rule Mining by Data De-Clustering to Support Grid Computing Frank S.C. Tseng and Pey-Yen Chen Dept. of Information Management National Kaohsiung First University of Science and Technology
More informationA Novel Texture Classification Procedure by using Association Rules
ITB J. ICT Vol. 2, No. 2, 2008, 03-4 03 A Novel Texture Classification Procedure by using Association Rules L. Jaba Sheela & V.Shanthi 2 Panimalar Engineering College, Chennai. 2 St.Joseph s Engineering
More informationTutorial on Assignment 3 in Data Mining 2009 Frequent Itemset and Association Rule Mining. Gyozo Gidofalvi Uppsala Database Laboratory
Tutorial on Assignment 3 in Data Mining 2009 Frequent Itemset and Association Rule Mining Gyozo Gidofalvi Uppsala Database Laboratory Announcements Updated material for assignment 3 on the lab course home
More informationPerformance Evaluation of Sequential and Parallel Mining of Association Rules using Apriori Algorithms
Int. J. Advanced Networking and Applications 458 Performance Evaluation of Sequential and Parallel Mining of Association Rules using Apriori Algorithms Puttegowda D Department of Computer Science, Ghousia
More informationMining Rare Periodic-Frequent Patterns Using Multiple Minimum Supports
Mining Rare Periodic-Frequent Patterns Using Multiple Minimum Supports R. Uday Kiran P. Krishna Reddy Center for Data Engineering International Institute of Information Technology-Hyderabad Hyderabad,
More informationA Further Study in the Data Partitioning Approach for Frequent Itemsets Mining
A Further Study in the Data Partitioning Approach for Frequent Itemsets Mining Son N. Nguyen, Maria E. Orlowska School of Information Technology and Electrical Engineering The University of Queensland,
More informationCHAPTER 4 AN INTEGRATED APPROACH OF PERFORMANCE PREDICTION ON NETWORKS OF WORKSTATIONS. Xiaodong Zhang and Yongsheng Song
CHAPTER 4 AN INTEGRATED APPROACH OF PERFORMANCE PREDICTION ON NETWORKS OF WORKSTATIONS Xiaodong Zhang and Yongsheng Song 1. INTRODUCTION Networks of Workstations (NOW) have become important distributed
More informationWeb page recommendation using a stochastic process model
Data Mining VII: Data, Text and Web Mining and their Business Applications 233 Web page recommendation using a stochastic process model B. J. Park 1, W. Choi 1 & S. H. Noh 2 1 Computer Science Department,
More informationGraph Based Approach for Finding Frequent Itemsets to Discover Association Rules
Graph Based Approach for Finding Frequent Itemsets to Discover Association Rules Manju Department of Computer Engg. CDL Govt. Polytechnic Education Society Nathusari Chopta, Sirsa Abstract The discovery
More informationA Graph-Based Approach for Mining Closed Large Itemsets
A Graph-Based Approach for Mining Closed Large Itemsets Lee-Wen Huang Dept. of Computer Science and Engineering National Sun Yat-Sen University huanglw@gmail.com Ye-In Chang Dept. of Computer Science and
More informationOptimization using Ant Colony Algorithm
Optimization using Ant Colony Algorithm Er. Priya Batta 1, Er. Geetika Sharmai 2, Er. Deepshikha 3 1Faculty, Department of Computer Science, Chandigarh University,Gharaun,Mohali,Punjab 2Faculty, Department
More informationAN EFFICIENT GRADUAL PRUNING TECHNIQUE FOR UTILITY MINING. Received April 2011; revised October 2011
International Journal of Innovative Computing, Information and Control ICIC International c 2012 ISSN 1349-4198 Volume 8, Number 7(B), July 2012 pp. 5165 5178 AN EFFICIENT GRADUAL PRUNING TECHNIQUE FOR
More informationMining High Average-Utility Itemsets
Proceedings of the 2009 IEEE International Conference on Systems, Man, and Cybernetics San Antonio, TX, USA - October 2009 Mining High Itemsets Tzung-Pei Hong Dept of Computer Science and Information Engineering
More informationANU MLSS 2010: Data Mining. Part 2: Association rule mining
ANU MLSS 2010: Data Mining Part 2: Association rule mining Lecture outline What is association mining? Market basket analysis and association rule examples Basic concepts and formalism Basic rule measurements
More informationAn Efficient Algorithm for finding high utility itemsets from online sell
An Efficient Algorithm for finding high utility itemsets from online sell Sarode Nutan S, Kothavle Suhas R 1 Department of Computer Engineering, ICOER, Maharashtra, India 2 Department of Computer Engineering,
More informationA Data Mining Framework for Extracting Product Sales Patterns in Retail Store Transactions Using Association Rules: A Case Study
A Data Mining Framework for Extracting Product Sales Patterns in Retail Store Transactions Using Association Rules: A Case Study Mirzaei.Afshin 1, Sheikh.Reza 2 1 Department of Industrial Engineering and
More informationON-LINE GENERATION OF ASSOCIATION RULES USING INVERTED FILE INDEXING AND COMPRESSION
ON-LINE GENERATION OF ASSOCIATION RULES USING INVERTED FILE INDEXING AND COMPRESSION Ioannis N. Kouris Department of Computer Engineering and Informatics, University of Patras 26500 Patras, Greece and
More informationFrequent Item Set using Apriori and Map Reduce algorithm: An Application in Inventory Management
Frequent Item Set using Apriori and Map Reduce algorithm: An Application in Inventory Management Kranti Patil 1, Jayashree Fegade 2, Diksha Chiramade 3, Srujan Patil 4, Pradnya A. Vikhar 5 1,2,3,4,5 KCES
More informationAPPLICATION OF THE FUZZY MIN-MAX NEURAL NETWORK CLASSIFIER TO PROBLEMS WITH CONTINUOUS AND DISCRETE ATTRIBUTES
APPLICATION OF THE FUZZY MIN-MAX NEURAL NETWORK CLASSIFIER TO PROBLEMS WITH CONTINUOUS AND DISCRETE ATTRIBUTES A. Likas, K. Blekas and A. Stafylopatis National Technical University of Athens Department
More informationSQL Based Association Rule Mining using Commercial RDBMS (IBM DB2 UDB EEE)
SQL Based Association Rule Mining using Commercial RDBMS (IBM DB2 UDB EEE) Takeshi Yoshizawa, Iko Pramudiono, Masaru Kitsuregawa Institute of Industrial Science, The University of Tokyo 7-22-1 Roppongi,
More informationDiscovery of Multi-level Association Rules from Primitive Level Frequent Patterns Tree
Discovery of Multi-level Association Rules from Primitive Level Frequent Patterns Tree Virendra Kumar Shrivastava 1, Parveen Kumar 2, K. R. Pardasani 3 1 Department of Computer Science & Engineering, Singhania
More informationEvaluation of Sampling for Data Mining of Association Rules
Evaluation of Sampling for Data Mining of Association Rules Mohammed Javeed Zaki, Srinivasan Parthasarathy, Wei Li, Mitsunori Ogihara Computer Science Department, University of Rochester, Rochester NY
More informationMedical Data Mining Based on Association Rules
Medical Data Mining Based on Association Rules Ruijuan Hu Dep of Foundation, PLA University of Foreign Languages, Luoyang 471003, China E-mail: huruijuan01@126.com Abstract Detailed elaborations are presented
More informationChapter 2. Related Work
Chapter 2 Related Work There are three areas of research highly related to our exploration in this dissertation, namely sequential pattern mining, multiple alignment, and approximate frequent pattern mining.
More informationDiscovering the Association Rules in OLAP Data Cube with Daily Downloads of Folklore Materials *
Discovering the Association Rules in OLAP Data Cube with Daily Downloads of Folklore Materials * Galina Bogdanova, Tsvetanka Georgieva Abstract: Association rules mining is one kind of data mining techniques
More informationRoadmap. PCY Algorithm
1 Roadmap Frequent Patterns A-Priori Algorithm Improvements to A-Priori Park-Chen-Yu Algorithm Multistage Algorithm Approximate Algorithms Compacting Results Data Mining for Knowledge Management 50 PCY
More informationCluster quality 15. Running time 0.7. Distance between estimated and true means Running time [s]
Fast, single-pass K-means algorithms Fredrik Farnstrom Computer Science and Engineering Lund Institute of Technology, Sweden arnstrom@ucsd.edu James Lewis Computer Science and Engineering University of
More informationResearch Article Apriori Association Rule Algorithms using VMware Environment
Research Journal of Applied Sciences, Engineering and Technology 8(2): 16-166, 214 DOI:1.1926/rjaset.8.955 ISSN: 24-7459; e-issn: 24-7467 214 Maxwell Scientific Publication Corp. Submitted: January 2,
More informationA Survey on Moving Towards Frequent Pattern Growth for Infrequent Weighted Itemset Mining
A Survey on Moving Towards Frequent Pattern Growth for Infrequent Weighted Itemset Mining Miss. Rituja M. Zagade Computer Engineering Department,JSPM,NTC RSSOER,Savitribai Phule Pune University Pune,India
More informationMaintenance of the Prelarge Trees for Record Deletion
12th WSEAS Int. Conf. on APPLIED MATHEMATICS, Cairo, Egypt, December 29-31, 2007 105 Maintenance of the Prelarge Trees for Record Deletion Chun-Wei Lin, Tzung-Pei Hong, and Wen-Hsiang Lu Department of
More informationMaterialized Data Mining Views *
Materialized Data Mining Views * Tadeusz Morzy, Marek Wojciechowski, Maciej Zakrzewicz Poznan University of Technology Institute of Computing Science ul. Piotrowo 3a, 60-965 Poznan, Poland tel. +48 61
More informationPerformance and Scalability: Apriori Implementa6on
Performance and Scalability: Apriori Implementa6on Apriori R. Agrawal and R. Srikant. Fast algorithms for mining associa6on rules. VLDB, 487 499, 1994 Reducing Number of Comparisons Candidate coun6ng:
More informationAllowing Cycle-Stealing Direct Memory Access I/O. Concurrent with Hard-Real-Time Programs
To appear in: Int. Conf. on Parallel and Distributed Systems, ICPADS'96, June 3-6, 1996, Tokyo Allowing Cycle-Stealing Direct Memory Access I/O Concurrent with Hard-Real-Time Programs Tai-Yi Huang, Jane
More informationParallel Algorithms for Discovery of Association Rules
Data Mining and Knowledge Discovery, 1, 343 373 (1997) c 1997 Kluwer Academic Publishers. Manufactured in The Netherlands. Parallel Algorithms for Discovery of Association Rules MOHAMMED J. ZAKI SRINIVASAN
More informationTransforming Quantitative Transactional Databases into Binary Tables for Association Rule Mining Using the Apriori Algorithm
Transforming Quantitative Transactional Databases into Binary Tables for Association Rule Mining Using the Apriori Algorithm Expert Systems: Final (Research Paper) Project Daniel Josiah-Akintonde December
More informationMining Spatial Gene Expression Data Using Association Rules
Mining Spatial Gene Expression Data Using Association Rules M.Anandhavalli Reader, Department of Computer Science & Engineering Sikkim Manipal Institute of Technology Majitar-737136, India M.K.Ghose Prof&Head,
More informationAn Approximate Approach for Mining Recently Frequent Itemsets from Data Streams *
An Approximate Approach for Mining Recently Frequent Itemsets from Data Streams * Jia-Ling Koh and Shu-Ning Shin Department of Computer Science and Information Engineering National Taiwan Normal University
More informationMining Quantitative Maximal Hyperclique Patterns: A Summary of Results
Mining Quantitative Maximal Hyperclique Patterns: A Summary of Results Yaochun Huang, Hui Xiong, Weili Wu, and Sam Y. Sung 3 Computer Science Department, University of Texas - Dallas, USA, {yxh03800,wxw0000}@utdallas.edu
More information2. Discovery of Association Rules
2. Discovery of Association Rules Part I Motivation: market basket data Basic notions: association rule, frequency and confidence Problem of association rule mining (Sub)problem of frequent set mining
More informationA Fast Algorithm for Data Mining. Aarathi Raghu Advisor: Dr. Chris Pollett Committee members: Dr. Mark Stamp, Dr. T.Y.Lin
A Fast Algorithm for Data Mining Aarathi Raghu Advisor: Dr. Chris Pollett Committee members: Dr. Mark Stamp, Dr. T.Y.Lin Our Work Interested in finding closed frequent itemsets in large databases Large
More information