Efficient Parallel Data Mining for Association Rules. Jong Soo Park, Ming-Syan Chen and Philip S. Yu. IBM Thomas J. Watson Research Center


Yorktown Heights, New York

Abstract

In this paper, we develop an algorithm, called PDM, to conduct parallel data mining for association rules. Consider a transaction as a collection of items, and a large itemset as a set of items such that the number of transactions containing it exceeds a pre-specified threshold. PDM is designed so that the global set of large itemsets can be identified efficiently and the amount of inter-node data exchange required is minimized. Specifically, with a given database partition, each processing node collects count information on each itemset from its local database efficiently via a hashing method. The information discovered by each node is next shared with other nodes via some communication schemes. Then, PDM employs a technique, called clue-and-poll, to address the uncertainty due to the partial knowledge collected at each node by judiciously selecting a small fraction of the itemsets for the exchange of count information among nodes, thus reducing the communication cost. The global set of large itemsets can hence be determined based on the aggregate count of itemsets. It is experimentally shown that PDM not only attains very good parallelization efficiencies, but also provides robust performance for various input patterns.

1 Introduction

Recently, the importance of database mining has been growing at an extremely fast pace due to the increasing use of computing for various applications. One of the most important data mining problems is mining association rules. Given a database of sales transactions, it would be interesting to discover all associations among items such that the presence of some items in a transaction will imply the presence of other items in the same transaction. In [3], it is shown that mining association rules can be decomposed into two subproblems.
First, we need to identify all sets of items (itemsets) that are contained in a sufficient number of transactions above the minimum support requirement. These itemsets are referred to as large itemsets. Once all large itemsets are obtained, the desired association rules can be generated in a straightforward manner. Subsequent work in the literature [5, 10] followed this approach and focused on the generation of large itemsets. Various other data mining capabilities were also explored in the literature, including classification [2, 7, 8], similar sequences [1], and sequential patterns [4].

Notice that most of the prior work on data mining focused on exploring various mining capabilities and on improving their execution in a sequential processing environment. However, few results have been reported thus far on exploiting parallel execution of data mining. It is noted that data mining in general requires progressive knowledge collection and revision based on a huge transaction database. Such a knowledge learning process calls for numerous iterations of data comparison and analysis. Due to this very nature of data mining, how to achieve efficient parallel data mining is a very challenging issue, since, with the transaction database being partitioned across all nodes, the amount of inter-node data transmission required for reaching global decisions can be prohibitively large, thus significantly compromising the benefit achievable from parallelization. As a result, although a significant amount of research results have been reported on data mining, the study of parallel data mining is still in its infancy. Consequently, we develop in this paper an algorithm for parallel data mining, called PDM, which is a generalized version of algorithm DHP reported in [10].

Footnote: Visiting from the Department of Computer Science, Sungshin Women's University, and partially supported by KOSEF, Korea.
Consider a transaction as a collection of items, and a large itemset as a set of items such that the number of transactions containing it exceeds a pre-specified threshold. In essence, PDM is designed so that the identification of the global set of large itemsets can be parallelized efficiently and the amount of inter-node data exchange required is minimized. With the entire transaction database being partitioned across all nodes, each node will employ a hashing method, similar to that in DHP, to identify candidate k-itemsets (i.e., itemsets consisting of k items) from its local database efficiently. The information discovered by each node is next shared with other nodes via some communication schemes. It is noted that in a shared-nothing parallel environment, each node can only directly collect information from its local database partition. The process to parallelize the knowledge discovery across all nodes and to deal with the partial knowledge collected could itself be complicated and communication intensive. To address this issue, we devise a technique, called clue-and-poll, to resolve the uncertainty due to the partial knowledge collected at each node by judiciously selecting a small fraction of the itemsets for the information exchange among nodes, thus minimizing the communication cost incurred. The global set of large k-itemsets is then determined via proper communication schemes in accordance with the aggregate count of itemsets and the pre-determined minimal transaction support.

To provide more insights into the performance of PDM, we evaluate another parallel algorithm, denoted by PF, that does not utilize the hashing technique, and comparatively analyze the performance of PDM and PF via simulations. It is observed that, due to the use of hashing techniques in knowledge collection, work parallelization and distribution across all nodes can be achieved very efficiently within PDM, which is the main advantage of PDM over PF. It is experimentally shown that PDM not only attains very good parallelization efficiencies, but also performs very robustly for various input patterns.

This paper is organized as follows. Preliminaries are given in Section 2. The proposed algorithm PDM for parallel data mining is described in Section 3. Performance results are presented in Section 4. Section 5 contains the summary.

2 Preliminaries

Note that PDM is a parallel version of algorithm DHP devised in [10], and the latter is a revised version of Apriori reported in [5]. Due to the page limitation, we refer readers to [5] and [10] for the details of Apriori and DHP.

2.1 Mining Association Rules

Let $I = \{i_1, i_2, \ldots, i_m\}$ be a set of literals, called items. Let $D$ be a set of transactions, where each transaction $T$ is a set of items such that $T \subseteq I$. Note that the quantities of items bought in a transaction are not considered, meaning that each item is a binary variable representing whether the item was bought. Each transaction is associated with an identifier, called TID. Let $X$ be a set of items. A transaction $T$ is said to contain $X$ if and only if $X \subseteq T$.
An association rule is an implication of the form $X \Rightarrow Y$, where $X \subset I$, $Y \subset I$ and $X \cap Y = \emptyset$. The rule $X \Rightarrow Y$ holds in the transaction set $D$ with confidence $c$ if $c\%$ of transactions in $D$ that contain $X$ also contain $Y$. The rule $X \Rightarrow Y$ has support $s$ in the transaction set $D$ if $s\%$ of transactions in $D$ contain $X \cup Y$. As mentioned before, the problem of mining association rules is composed of the following two steps:

1. Discover the large itemsets, i.e., all itemsets that have transaction support above a pre-determined minimum support $s$.
2. Use the large itemsets to generate the association rules for the database.

The overall performance of mining association rules is in fact determined by the first step. After the large itemsets are identified, the corresponding association rules can be derived in a straightforward manner. Apriori, DHP, and also PDM proposed in this paper mainly address the issue of discovering large itemsets.

2.2 Algorithm DHP

Like Apriori, DHP generates the candidate set of large k-itemsets, $C_k$, from the set of large (k-1)-itemsets, $L_{k-1}$. However, DHP is unique in that it employs a hash table, which is built in the previous pass, to test the eligibility of a k-itemset. Instead of including all k-itemsets from $L_{k-1} * L_{k-1}$ into $C_k$, DHP adds a k-itemset into $C_k$ only if that k-itemset is hashed into a hash entry whose value is larger than or equal to $s$. DHP also reduces the database size progressively by not only trimming each individual transaction size but also pruning the number of transactions in the database. We note that both DHP and Apriori are iterative algorithms on the large itemset size in the sense that the large k-itemsets are derived from the large (k-1)-itemsets. As shown in [10], the hash filtering in DHP can drastically reduce the size of $C_k$ for small $k$ values, especially for $k = 2$. The significant size reduction on $C_2$ by DHP will in turn make the trimming and pruning of transactions very effective in later passes.
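The hash filtering described in Section 2.2 can be illustrated for $k = 2$ with a minimal Python sketch. It is not the DHP implementation from [10]: the hash function `bucket_of`, the function names, and the toy transactions are assumptions made only for illustration, and the minimum support is given as an absolute count.

```python
from collections import Counter
from itertools import combinations


def bucket_of(pair, n_buckets):
    """Hypothetical hash function for a sorted 2-itemset of integer item ids."""
    a, b = pair
    return (a * 31 + b) % n_buckets


def first_pass(transactions, n_buckets):
    """Count single items and hash every 2-subset of each transaction into a bucket."""
    item_counts = Counter()
    bucket_counts = [0] * n_buckets
    for t in transactions:
        items = sorted(set(t))
        item_counts.update(items)
        for pair in combinations(items, 2):
            bucket_counts[bucket_of(pair, n_buckets)] += 1
    return item_counts, bucket_counts


def candidate_2_itemsets(item_counts, bucket_counts, min_support):
    """Keep a pair of large items only if its bucket count reaches min_support."""
    large_items = sorted(i for i, c in item_counts.items() if c >= min_support)
    n_buckets = len(bucket_counts)
    return [
        pair
        for pair in combinations(large_items, 2)
        if bucket_counts[bucket_of(pair, n_buckets)] >= min_support
    ]


transactions = [[1, 2, 3], [1, 2], [2, 3], [1, 2, 4]]
item_counts, bucket_counts = first_pass(transactions, n_buckets=7)
c2 = candidate_2_itemsets(item_counts, bucket_counts, min_support=2)
print(bucket_counts)  # [1, 0, 2, 1, 0, 3, 1]
print(c2)             # [(1, 2), (2, 3)]
```

Joining the three large items with no filter would give three candidate pairs; here the pair (1, 3) is dropped because its bucket count is 1. A bucket count is an upper bound on the support of every pair hashed into that bucket, so the filter cannot discard a pair that is actually large.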
3 Parallel Data Mining

In this section we describe a parallel data mining algorithm, PDM, which effectively parallelizes algorithm DHP for mining association rules in a multiprocessor environment.

3.1 Algorithm for Parallel Data Mining

Basically, algorithm PDM consists of parallel generation of candidate itemsets and parallel determination of large itemsets. The algorithm uses a hash table to generate candidate itemsets at early passes, and then uses a method to generate candidate itemsets directly from the previous large itemsets at later passes. Let $n_p$ be the number of nodes and $D_{1i}$ denote a partitioned database located at node $i$. Note that $\bigcup_{i=0}^{n_p-1} D_{1i} = D_1$, where $D_1$ is an initial transaction database, and that $D_{2i} = D_{1i}$.

Each node $i$ starts with counting the support for each item in $D_{1i}$. We use $L_{1i}$ to represent the set of potential 1-itemsets, each of which has an associated count for its occurrences in $D_{1i}$. Each node also builds a hash table to count the support for each 2-itemset. We use $H_{2i}$ to represent the hash table and its associated counts for hash buckets. Note that in contrast to the sequential case, the counts at each node here only contain information on a single database partition. To obtain the global counts, we need to sum $L_{1i}$ and $H_{2i}$ over all nodes so as to get $L_1$ and $H_2$. With a moderate size $L_{1i}$, nodes can exchange information with each other directly to determine $L_1$. However, the size of $H_{2i}$ is usually large and the cost of direct exchange could be very high. To remedy this, we devise a clue-and-poll approach to judiciously select a small number of hash buckets, which have the potential to meet the support requirement, for information exchange. Specifically, instead of constructing $H_2$, we only need to figure out its subset which has hash counts larger than or equal to $s$, i.e., the set $\{x \mid H_2[x] \ge s\}$.
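To make the target of this step concrete, the following Python sketch computes $\{x \mid H_2[x] \ge s\}$ the direct way, by summing the full local tables $H_{2i}$ bucket by bucket. The bucket counts are the ones shown in Figure 1, and $s = 60$ is the value used in the numerical example of Section 3.2; the function names are illustrative only. This is the exchange that the clue-and-poll approach is meant to avoid when the tables are large.

```python
# Per-node bucket counts H_2i as listed in Figure 1 (4 nodes, 8 buckets each).
H2_LOCAL = [
    [2, 17, 15, 0, 16, 24, 14, 0],   # node 0
    [4, 13, 17, 0, 18, 23, 7, 1],    # node 1
    [3, 12, 18, 16, 17, 22, 0, 3],   # node 2
    [0, 15, 16, 9, 14, 14, 8, 0],    # node 3
]


def global_bucket_counts(local_tables):
    """Sum the local tables bucket by bucket to obtain the global table H_2."""
    return [sum(column) for column in zip(*local_tables)]


def large_buckets(local_tables, min_support):
    """Return the bucket addresses x with H_2[x] >= min_support."""
    totals = global_bucket_counts(local_tables)
    return [x for x, count in enumerate(totals) if count >= min_support]


h2 = global_bucket_counts(H2_LOCAL)
print(h2)                           # [9, 57, 66, 25, 65, 83, 29, 4]
print(large_buckets(H2_LOCAL, 60))  # [2, 4, 5]
```

With this direct method every node has to ship all of its bucket counts, which is cheap for 8 buckets but not for tables of the size discussed in Section 3.2.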
After $\{x \mid H_2[x] \ge s\}$ is derived, we can divide it into $n_p$ equal partitions, and have each node generate a partial set of candidate 2-itemsets in parallel from its own database partition. This partial set of candidate 2-itemsets, $C_{2i}$, from node $i$ is the set of 2-itemsets whose hash values belong to the $i$-th partition of $\{x \mid H_2[x] \ge s\}$. By exchanging $C_{2i}$ among all nodes, we then get $C_2$. With $C_2$, each node can next scan its local database partition, in parallel, to derive $L_{2i}$, thus associating each 2-itemset in $C_2$ with its corresponding support count in the local database $D_{2i}$. Similarly to DHP, when a transaction is scanned to evaluate the support counts for 2-itemsets in $C_2$, we can simultaneously trim items which do not appear in any potential large itemset from the transactions as well as prune transactions that do not contain potential large itemsets, thereby leading to a smaller database partition $D_{3i}$ to scan in later passes. By exchanging $L_{2i}$ among all nodes, the set of large 2-itemsets, $L_2$, can therefore be determined. Note that $C_3$, which equals $L_2 * L_2$, is in fact a join between $L_2$ and itself. We partition and parallelize the (nested loop) join operation on $L_2$ so that each node generates a subset $C_{3i}$ of the results in parallel. By exchanging $C_{3i}$, we hence get $C_3$. As such, the process repeats for later passes until all large itemsets are identified. More formally, PDM is described in its algorithmic form below.

Algorithm PDM: Parallel algorithm for data mining.
P-1) Each node generates $L_{1i}$ and $H_{2i}$ by scanning $D_{1i}$.
P-2) Determine $L_1$ by exchanging $L_{1i}$ through all-to-all broadcast.
P-3) Call procedure Clue-and-Poll to find $\{x \mid H_2[x] \ge s\}$ from all $H_{2i}$'s.
P-4) $k = 2$.
P-5) While ($|L_{k-1}| \ge k$) do
P-6) If ($k = 2$), each node generates $C_{2i}$ from $L_1$ and $\{x \mid H_2[x] \ge s\}$.
P-7) Otherwise, each node generates $C_{ki}$ directly from $L_{k-1}$.
P-8) Determine $C_k = \bigcup_{i=0}^{n_p-1} C_{ki}$ by exchanging $C_{ki}$ among all nodes.
P-9) Each node generates $L_{ki}$ by scanning and trimming $D_{ki}$.
P-10) Determine $L_k$ by exchanging $L_{ki}$ among all nodes.
P-11) $k = k + 1$.

Figure 1: Local processing for $L_{1i}$ and $H_{2i}$ of Step P-1.
  N0: $D_{10} \to L_{10}$ and $H_{20}$ = [2, 17, 15, 0, 16, 24, 14, 0]
  N1: $D_{11} \to L_{11}$ and $H_{21}$ = [4, 13, 17, 0, 18, 23, 7, 1]
  N2: $D_{12} \to L_{12}$ and $H_{22}$ = [3, 12, 18, 16, 17, 22, 0, 3]
  N3: $D_{13} \to L_{13}$ and $H_{23}$ = [0, 15, 16, 9, 14, 14, 8, 0]

Figure 2: Generation of $C_k$ and $L_{ki}$'s. At each node $N_i$ ($i = 0, \ldots, 3$): $L_{k-1} \to C_{ki}$, then $C_k, D_{ki} \to L_{ki}, D_{(k+1)i}$.

Figure 1 illustrates the mining procedure for determining $L_{1i}$ and $H_{2i}$ from $D_{1i}$ in a multiprocessor system of 4 nodes. Each square bracket in Figure 1 contains counts of the 8 buckets in $H_{2i}$. Due to its complexity, the description of P-3, i.e., the clue-and-poll method of using $H_{2i}$'s to obtain $\{x \mid H_2[x] \ge s\}$, is deferred to Section 3.2. In Steps P-2 and P-10, when these $L_{ki}$'s are exchanged, the transaction supports of these candidate k-itemsets in $L_{ki}$'s are added together as a process to reach the global count. After exchanging all $L_{ki}$'s, we get $L_k = \{x \mid x.count \ge s \text{ and } x \in C_k\}$.

In Step P-6, we split $\{x \mid H_2[x] \ge s\}$ into $n_p$ partitions, and each node generates a partial set of candidate 2-itemsets, $C_{2i}$, from each of these partitions, where $C_2 = \bigcup_{i=0}^{n_p-1} C_{2i}$. This method generates a much smaller $C_2$ than the corresponding candidate itemset generation in Apriori. Figure 2 shows conceptually Steps P-7, P-8, and P-9 at each node. In Step P-7, we parallelize the generation of candidate k-itemsets by having each node generate a subset $C_{ki}$. Then we get $C_k = \bigcup_{i=0}^{n_p-1} C_{ki}$ in Step P-8. In Step P-9, node $i$ counts the transaction support of each k-itemset in $C_k$ by scanning $D_{ki}$. When node $i$ scans $D_{ki}$ to obtain $L_{ki}$, it trims the sizes of transactions in $D_{ki}$ by removing from each transaction those items which are deemed not useful for later large itemset generation, thus resulting in $D_{(k+1)i}$ for $L_{k+1}$.

3.2 Procedure Clue-and-Poll

We describe here a procedure which is able to efficiently determine a small number of elements from many large distributed data sets. Explicitly, we need to identify bucket addresses with the minimum support, $\{x \mid H_2[x] \ge s\}$, for Step P-6 from those $H_{2i}$'s located at distributed nodes. A naive method would be to exchange $H_{2i}$ among all nodes. However, this method incurs substantial communication costs since the size of $H_{2i}$ is usually very large, say, approximately 512K. To minimize the communication cost, we devise procedure Clue-and-Poll to judiciously exchange only a small set of elements which could possibly reach the minimum support. Recall that $x \in \{y \mid H_2[y] \ge s\}$ means

$$\sum_{i=0}^{n_p-1} x.count_i \ge s,$$

where $x.count_i$ is the count information of the bucket address $x$ at node $i$. Obviously, as a necessary condition for a hash address to have a global count reaching the minimal support requirement, there must exist at least one node $j$ with a count of that bucket address, $x.count_j$, greater than or equal to $\lfloor s/n_p \rfloor$. In light of this, if each node only exchanges bucket addresses with a count greater than or equal to $\lfloor s/n_p \rfloor$, no hash bucket with the minimal support will be left out.

After the exchange, each node can obtain an accumulated count by adding together the count values on each bucket address. We maintain for each hash bucket $x$ its accumulated count, $x.count$, and the number of nodes, denoted by $x.n_b$, that send information on the bucket. The hash buckets of the resulting hash table can be classified into two cases:

1. $x.count \ge s$, where $x$ is a bucket address reaching the minimum support requirement.
2. $x.count < s$.

For the first case, it is concluded that the count in this hash bucket has reached the minimum support and this hash bucket will be involved in later processing. Note, however, that the accumulated count for hash bucket $x$ does not necessarily include the counts from all $n_p$ nodes because only those nodes with their counts for hash bucket $x$ reaching the exchange threshold (i.e., $\lfloor s/n_p \rfloor$) provide the count information. In view of this, for the second case, we need to determine whether or not the bucket could reach the support requirement if all the remaining nodes which did not send in their bucket counts are polled.

We now describe the polling criterion. There are $(n_p - x.n_b)$ nodes which did not send information on the bucket address $x$, and each of these nodes can have a count value of at most $(\lfloor s/n_p \rfloor - 1)$. Thus, the sum of the counts from these nodes on that bucket address has an upper bound of $(\lfloor s/n_p \rfloor - 1)(n_p - x.n_b)$. The maximal value of the count after polling this set of nodes is hence $x.count + (\lfloor s/n_p \rfloor - 1)(n_p - x.n_b)$. If this maximal count value reaches the minimum support, we should poll these $(n_p - x.n_b)$ nodes. Each of these $(n_p - x.n_b)$ nodes, upon receiving the request, exchanges its count on the bucket address $x$ with others.

It is noted that if we use $\lfloor s/(\alpha n_p) \rfloor$ for $\alpha > 1.0$, instead of $\lfloor s/n_p \rfloor$, as the threshold in the first step of the communication, we are lowering the threshold on the bucket count to exchange information. Although increasing the amount of communication in the first step, the use of a larger $\alpha$ will reduce the communication cost in the polling step. The value of $\alpha$ was varied to study this trade-off on the communication cost in [9]. Formally, procedure Clue-and-Poll can be described as follows.

Procedure Clue-and-Poll: An efficient method to identify elements with the minimal support required.
/* Input: $\{S_i;\ i = 0, \ldots, n_p - 1\}$, where $S_i$ contains count information and is stored at node $i$. */
/* Output: $S_T = \{x \mid x \in S \text{ and } x.count \ge s\}$, where $S = \bigcup_{i=0}^{n_p-1} S_i$. */
C-1) Node $i$ sends information on elements with a count reaching the threshold requirement, i.e., $R_i = \{x \mid x \in S_i \text{ and } x.count \ge \lfloor s/n_p \rfloor\}$.
C-2) Determine $S_T$ and $R_{poll}$:
$S_T = \{x \mid x.count \ge s \text{ and } x \in \bigcup_{i=0}^{n_p-1} R_i\}$,
$R_{poll} = \{x \mid s - (\lfloor s/n_p \rfloor - 1)(n_p - x.n_b) \le x.count < s \text{ and } x \in \bigcup_{i=0}^{n_p-1} R_i\}$,
where $x.n_b$ is the number of nodes which sent element $x$.
C-3) If $R_{poll} = \emptyset$ then stop. Otherwise, do this step: Poll the count information of each element in $R_{poll}$ from nodes which did not send information on the element in Step C-1, and add this count information to the corresponding element in $R_{poll}$. $S_T = S_T \cup \{x \mid x.count \ge s \text{ and } x \in R_{poll}\}$.

The operations of procedure Clue-and-Poll to determine bucket addresses with the minimum support can be best understood by a numerical example based on Figure 1. For Step P-3 of PDM, $S_i$ in procedure Clue-and-Poll corresponds to $H_{2i}$. Suppose that $n_p = 4$, $|D_1| = 10{,}000$, $s = 0.6\%$ (i.e., $s = 60$), and $\alpha = 1.0$. In Step C-1, node 0 sends out count information on bucket addresses 1, 2, 4, and 5, and receives information on bucket addresses with counts greater than or equal to $\lfloor s/n_p \rfloor = 15$ from other nodes. As a result of Step C-1, all nodes have $\bigcup_{i=0}^{n_p-1} R_i = \{(1, 32, 2), (2, 66, 4), (3, 16, 1), (4, 51, 3), (5, 69, 3)\}$, where the triplet in each parenthesis represents $(x, x.count, x.n_b)$, i.e., the bucket address, the accumulated count from the exchange, and the number of nodes sending information on the bucket, respectively. Step C-2 determines that $S_T = \{2, 5\}$ and $R_{poll} = \{1, 4\}$. In Step C-3, since $R_{poll} \ne \emptyset$, the count information of each element in $R_{poll}$ is polled, and consequently, information on buckets 1 and 4 gets updated to become $\{(1, 57, 4), (4, 65, 4)\}$. Finally, we have $S_T = \{x \mid H_2[x] \ge s\} = \{2, 4, 5\}$.

Figure 3: The process of generating synthetic transactions. (Items, $N = 1000$, feed PICK-ITEM to form potentially large itemsets, $|L| = 2000$, $|I| = 4$, which feed PICK-PLI to form transactions, $|D_1| = 500{,}000$, $|T| = 20$.)

Note that some steps of PDM need to exchange information for finding candidate itemsets and large itemsets. The communication scheme used in this paper is based on the all-to-all broadcast scheme [6].
The performance impact of using other communication schemes was evaluated in [9].

4 Simulation Results

We evaluated the performance of PDM based on a synthetically generated transaction database. A sales transaction consists of a collection of sales items. Different methods of generating the items in a transaction were considered. Sensitivity analysis on PDM was conducted on the database size, the transaction size, and the degree of parallelism. A shared-nothing parallel computer is considered as the basic model to evaluate our algorithms. The methods used to generate synthetic transactions are described in Section 4.1. For comparison purposes, in Section 4.2 we also implement another parallel algorithm without using the technique of hashing in candidate set generation. This algorithm is referred to as PF. Performance comparison of PDM and PF algorithms is then conducted via simulation. As will be shown by simulation results, the efficiency of PDM is significantly better than that of PF.

4.1 Generation of Synthetic Transactions

Synthetic transactions are generated from a set of $N$ items with the purpose of emulating sales transactions in a retail industry. Figure 3 shows a basic generation method for the transaction database, which is in essence similar to those used in [5] and [10]. Table 1 summarizes the meaning of various parameters used in the experiments. Synthetic transactions consist of a series of potentially large itemsets (PLI's) which are chosen according to the weight of each PLI. The method of selecting PLI's to form a transaction is referred to as PICK-PLI. The weight of each PLI corresponds to the probability that this PLI will be included in these synthetic transactions.

Table 1: Meaning of various parameters.

  $D_k$      Set of transactions for large k-itemsets.
  $|D_k|$    The number of transactions in $D_k$.
  $H_k$      Hash table containing $|H_k|$ buckets for $C_k$.
  $C_k$      Set of candidate k-itemsets.
  $L_k$      Set of large k-itemsets.
  $|T|$      Average size of the transactions.
  $|I|$      Average size of maximal potentially large itemsets.
  $|L|$      Number of maximal potentially large itemsets.
  $N$        Number of items.

Table 2: Simulation results for varying the size of $D_1$ when $n_p = 32$. [Rows: $|D_1|$ (up to 1,000,000 transactions), $D_1$ (KB), and the execution times $T_{comm}$, $T_{cpu}$, and $T$; the numeric entries are missing from this transcription.]

We next consider the method of selecting items to form PLI's, denoted by PICK-ITEM. For the first PLI, items are chosen by a random selection method in such a way that each item is chosen from some given distribution of $N$ items, e.g., a uniform distribution. To capture the phenomenon that large itemsets often have common items, some fraction of items in subsequent PLI's is chosen from the previous PLI and the remaining items are chosen based on the above-mentioned random selection. The common fraction is referred to as the correlation level and is generated from an exponentially distributed random variable with mean equal to a predetermined value. As in [5], the mean of the correlation level is set to 0.5 for our experiments. The transaction database is divided into $n_p$ partitions, each of which contains approximately the same number of transactions. The $i$-th partition of the database is allocated at node $i$.

Transactions are generated under the assumption that the random selection method in PICK-ITEM chooses an item from a uniform distribution of $N$ items. Also, in PICK-PLI the weight of each PLI is picked from an exponential distribution with unit mean. Furthermore, all the weights are normalized so that their sum equals one. A study on using a Zipf-like distribution to model the data skew, i.e., skew on items contained in the transactions, can be found in [9].
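The generation method of this section can be sketched in Python as below. The sketch follows the text for the parts it specifies (uniform item selection, an exponentially distributed correlation level with mean 0.5, exponentially distributed PLI weights normalized to sum to one, and $n_p$ partitions of nearly equal size). Several details are not given in the text and are assumptions of this sketch: every PLI has a fixed size, the correlation level is clipped to 1, a transaction is filled with whole PLI's until it reaches a target size, and partitions are formed round-robin. All function names are hypothetical.

```python
import random


def make_plis(n_items, n_plis, pli_size, mean_corr, rng):
    """PICK-ITEM: build n_plis itemsets of pli_size items each (pli_size <= n_items)."""
    plis, prev = [], []
    for _ in range(n_plis):
        # Correlation level: exponentially distributed, clipped to 1 (assumption).
        frac = min(1.0, rng.expovariate(1.0 / mean_corr))
        n_common = min(len(prev), int(frac * pli_size))
        pli = set(rng.sample(prev, n_common))   # items shared with the previous PLI
        while len(pli) < pli_size:
            pli.add(rng.randrange(n_items))     # remaining items: uniform over N items
        prev = sorted(pli)
        plis.append(prev)
    return plis


def make_weights(n_plis, rng):
    """Weights drawn from a unit-mean exponential distribution, normalized to sum to 1."""
    raw = [rng.expovariate(1.0) for _ in range(n_plis)]
    total = sum(raw)
    return [w / total for w in raw]


def make_transaction(plis, weights, target_size, rng):
    """PICK-PLI: add weighted-random PLI's until the transaction reaches target_size."""
    limit = min(target_size, len(set().union(*plis)))  # cannot exceed the available items
    items = set()
    while len(items) < limit:
        items.update(rng.choices(plis, weights=weights, k=1)[0])
    return sorted(items)


def partition(transactions, n_p):
    """Split the database into n_p partitions of nearly equal size (round-robin)."""
    return [transactions[i::n_p] for i in range(n_p)]


rng = random.Random(0)
plis = make_plis(n_items=1000, n_plis=50, pli_size=4, mean_corr=0.5, rng=rng)
weights = make_weights(len(plis), rng)
transactions = [make_transaction(plis, weights, target_size=10, rng=rng) for _ in range(100)]
partitions = partition(transactions, n_p=4)
print([len(p) for p in partitions])  # [25, 25, 25, 25]
```

Because whole PLI's are added, a transaction can end up slightly larger than the target size; how the original generator handles that overshoot is not stated in the text, so this sketch simply keeps the extra items.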
4.2 Performance of PDM

We now analyze the sensitivity of the performance of PDM to the database size. Table 2 shows that not only the execution times, but also the efficiencies are affected by the size of the database. The parallelization efficiency of performing a task in parallel is defined as

$$\eta = \frac{T_1}{n_p \times T},$$

where $T_i$ is the average execution time when $i$ nodes are employed to perform the task, i.e., $T_1$ is the execution time in a single processing node and $T$ is the execution time in a system of $n_p$ nodes. It is noted that when $n_p$ nodes are employed to perform a given task in parallel, the execution time differs from one node to another because the database partitions at the nodes are not identical. $T_{cpu}$ is determined by adding up, along all steps, the maximum execution times among the $n_p$ nodes. Note that $T_{cpu}$ also contains the time to receive and process the messages in the communication steps. Thus, $T$ equals the sum of $T_{cpu}$ and $T_{comm}$, where $T_{comm}$ is obtained as follows. Given the message latency $A$ (in seconds) and point-to-point bandwidth $B$ (in bytes/second), the message passing time is modeled by $f(x) = A + bx$, where $b = 1/B$ and $x$ is the message size in bytes. In the multiprocessor system considered, $A$ is assumed to be 30 microseconds and $B$ = 14 MB/second for full duplex communication.

As shown in Table 2, when the number of transactions increases, the communication cost decreases, thus improving $\eta$. This is due to the fact that the statistical variation among the different nodes on the count value at each hash class decreases as the number of transactions increases.

As mentioned before, we also implement another parallel mining algorithm, called PF, which does not use the technique of hashing. Since the number of candidate 2-itemsets, i.e., $|L_1|(|L_1| - 1)/2$, is very large, procedure PF generates $C_2$ at each node, and then utilizes procedure Clue-and-Poll to determine $L_2$ (see footnote 1).

Procedure PF: A direct parallelization without using the hashing technique.
F-1) Each node generates $L_{1i}$ by scanning $D_{1i}$.
F-2) Determine $L_1$ by exchanging $L_{1i}$ among all nodes.
F-3) Each node generates $C_2$, and then derives $L_{2i}$. $L_{2i}$ is generated by obtaining the transaction support of each 2-itemset in $C_2$ through scanning $D_{1i}$. (Note that, similarly to [5], some auxiliary data structure on $C_2$ is built to facilitate the matching between $D_{1i}$ and the 2-itemsets in $C_2$.)
F-4) Call procedure Clue-and-Poll to derive $L_2$ from all $L_{2i}$'s.
F-5) $k = 3$.
F-6) Do Step P-5 to Step P-11 of algorithm PDM.

Performance comparisons were conducted between algorithms PDM and PF, and the results are summarized in Table 3 and Figure 4 for $|D_1| = 250{,}000$. Two different transaction sizes, 10 and 20, are considered. It is noted that procedure PF performs poorly due to the very large number of candidate 2-itemsets, i.e., $|C_2| = |L_1|(|L_1| - 1)/2$. The large size of $C_2$ increases not only the time to generate it from $L_1$, but also the time to scan the transactions and count the support for each 2-itemset in $C_2$. The communication time to determine $L_2$ is also increased. Furthermore, the large $C_2$ leads to little trimming and pruning effect on transactions in $D_{1i}$, and results in the size of $D_{3i}$ being nearly equal to the size of $D_{1i}$. It is worth mentioning that under PF, the time to generate $C_2$ and build the auxiliary data structure to facilitate the counting and comparison for itemsets among transactions is about 7.25% of the total execution time when $|T| = 20$, $|I| = 4$, and $n_p = 8$ in Table 3. The corresponding time in PDM is about 0.33% of the total time. This very feature shows that the hashing approach of generating a small $C_2$ in PDM is very effective and parallelizable in a multiprocessor environment.

Table 3: Performance comparisons between PDM and PF. [Column groups: $|T| = 10$, $|I| = 4$ and $|T| = 20$, $|I| = 4$; columns: $n_p$, the execution times $T_{PDM}$ and $T_{PF}$, and the parallelization efficiencies of PDM and PF; the numeric entries are missing from this transcription.]

Figure 4: Execution times of PDM and PF when $|T| = 10$, $|I| = 4$, and $|D_1| = 250{,}000$. (Axes: execution time in seconds versus number of nodes; curves: T-PDM and T-PF.)

Although still far worse than PDM, the performance of PF improves when a larger number of passes is incurred, as its parallelization efficiency for $|T| = 20$ is better than that for $|T| = 10$. (See Table 3.) This is because the effect of the large $C_2$ diminishes when the number of passes increases.

Footnote 1: Note that utilizing procedure Clue-and-Poll in PF will not only improve the performance of PF but also result in a fair comparison between PF and PDM, hence allowing us to better capture the advantage from using hashing in PDM.

5 Conclusions

We developed in this paper algorithm PDM to conduct parallel data mining for association rules. With a given database partition, each processing node collected count information on each itemset from its local database efficiently via a hashing method. The information discovered by each node was next shared with other processors via some communication schemes. Then, PDM employed a clue-and-poll procedure to address the uncertainty due to the partial knowledge collected at each node by judiciously selecting a small fraction of the itemsets to exchange count information. The global set of large itemsets was hence determined based on the aggregate count of itemsets. It was experimentally shown that PDM not only attains very good parallelization efficiencies, but also provides robust performance in the presence of data skew.

References

[1] R. Agrawal, C. Faloutsos, and A. Swami. Efficient Similarity Search in Sequence Databases. Proceedings of the 4th Intl. Conf. on Foundations of Data Organization and Algorithms, October.
[2] R. Agrawal, S. Ghosh, T. Imielinski, B. Iyer, and A. Swami. An Interval Classifier for Database Mining Applications. Proceedings of the 18th International Conference on Very Large Data Bases, pages 560-573, August.
[3] R. Agrawal, T. Imielinski, and A. Swami. Mining Association Rules between Sets of Items in Large Databases. Proceedings of ACM SIGMOD, pages 207-216, May.
[4] R. Agrawal and R. Srikant. Mining Sequential Patterns. Proceedings of the 11th International Conference on Data Engineering, pages 3-14, March.
[5] R. Agrawal and R. Srikant. Fast Algorithms for Mining Association Rules in Large Databases. Proceedings of the 20th International Conference on Very Large Data Bases, pages 478-499, September.
[6] M.-S. Chen, P. S. Yu, and K.-L. Wu. Optimal NODUP All-To-All Broadcast Schemes in Distributed Systems. IEEE Transactions on Parallel and Distributed Systems, pages 1275-1285, December.
[7] J. Han, Y. Cai, and N. Cercone. Knowledge Discovery in Databases: An Attribute-Oriented Approach. Proceedings of the 18th International Conference on Very Large Data Bases, pages 547-559, August.
[8] R. T. Ng and J. Han. Efficient and Effective Clustering Methods for Spatial Data Mining. Proceedings of the 18th International Conference on Very Large Data Bases, pages 144-155, September.
[9] J.-S. Park, M.-S. Chen, and P. S. Yu. Efficient Parallel Data Mining for Association Rules. IBM Research Report, RC 20156, August.
[10] J.-S. Park, M.-S. Chen, and P. S. Yu. An Effective Hash Based Algorithm for Mining Association Rules. Proceedings of ACM SIGMOD, pages 175-186, May, 1995.


Mining N-most Interesting Itemsets. Ada Wai-chee Fu Renfrew Wang-wai Kwong Jian Tang. fadafu, Mining N-most Interesting Itemsets Ada Wai-chee Fu Renfrew Wang-wai Kwong Jian Tang Department of Computer Science and Engineering The Chinese University of Hong Kong, Hong Kong fadafu, wwkwongg@cse.cuhk.edu.hk

More information

rule mining can be used to analyze the share price R 1 : When the prices of IBM and SUN go up, at 80% same day.

rule mining can be used to analyze the share price R 1 : When the prices of IBM and SUN go up, at 80% same day. Breaking the Barrier of Transactions: Mining Inter-Transaction Association Rules Anthony K. H. Tung 1 Hongjun Lu 2 Jiawei Han 1 Ling Feng 3 1 Simon Fraser University, British Columbia, Canada. fkhtung,hang@cs.sfu.ca

More information

Mining Association Rules with Item Constraints. Ramakrishnan Srikant and Quoc Vu and Rakesh Agrawal. IBM Almaden Research Center

Mining Association Rules with Item Constraints. Ramakrishnan Srikant and Quoc Vu and Rakesh Agrawal. IBM Almaden Research Center Mining Association Rules with Item Constraints Ramakrishnan Srikant and Quoc Vu and Rakesh Agrawal IBM Almaden Research Center 650 Harry Road, San Jose, CA 95120, U.S.A. fsrikant,qvu,ragrawalg@almaden.ibm.com

More information

Data Structures. Notes for Lecture 14 Techniques of Data Mining By Samaher Hussein Ali Association Rules: Basic Concepts and Application

Data Structures. Notes for Lecture 14 Techniques of Data Mining By Samaher Hussein Ali Association Rules: Basic Concepts and Application Data Structures Notes for Lecture 14 Techniques of Data Mining By Samaher Hussein Ali 2009-2010 Association Rules: Basic Concepts and Application 1. Association rules: Given a set of transactions, find

More information

CS570 Introduction to Data Mining

CS570 Introduction to Data Mining CS570 Introduction to Data Mining Frequent Pattern Mining and Association Analysis Cengiz Gunay Partial slide credits: Li Xiong, Jiawei Han and Micheline Kamber George Kollios 1 Mining Frequent Patterns,

More information

Improved Frequent Pattern Mining Algorithm with Indexing

Improved Frequent Pattern Mining Algorithm with Indexing IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 16, Issue 6, Ver. VII (Nov Dec. 2014), PP 73-78 Improved Frequent Pattern Mining Algorithm with Indexing Prof.

More information

Parallel Mining of Maximal Frequent Itemsets in PC Clusters

Parallel Mining of Maximal Frequent Itemsets in PC Clusters Proceedings of the International MultiConference of Engineers and Computer Scientists 28 Vol I IMECS 28, 19-21 March, 28, Hong Kong Parallel Mining of Maximal Frequent Itemsets in PC Clusters Vong Chan

More information

Mining of association rules is a research topic that has received much attention among the various data mining problems. Many interesting wors have be

Mining of association rules is a research topic that has received much attention among the various data mining problems. Many interesting wors have be Is Sampling Useful in Data Mining? A Case in the Maintenance of Discovered Association Rules. S.D. Lee David W. Cheung Ben Kao Department of Computer Science, The University of Hong Kong, Hong Kong. fsdlee,dcheung,aog@cs.hu.h

More information

Model for Load Balancing on Processors in Parallel Mining of Frequent Itemsets

Model for Load Balancing on Processors in Parallel Mining of Frequent Itemsets American Journal of Applied Sciences 2 (5): 926-931, 2005 ISSN 1546-9239 Science Publications, 2005 Model for Load Balancing on Processors in Parallel Mining of Frequent Itemsets 1 Ravindra Patel, 2 S.S.

More information

Mining Frequent Itemsets for data streams over Weighted Sliding Windows

Mining Frequent Itemsets for data streams over Weighted Sliding Windows Mining Frequent Itemsets for data streams over Weighted Sliding Windows Pauray S.M. Tsai Yao-Ming Chen Department of Computer Science and Information Engineering Minghsin University of Science and Technology

More information

A mining method for tracking changes in temporal association rules from an encoded database

A mining method for tracking changes in temporal association rules from an encoded database A mining method for tracking changes in temporal association rules from an encoded database Chelliah Balasubramanian *, Karuppaswamy Duraiswamy ** K.S.Rangasamy College of Technology, Tiruchengode, Tamil

More information

An Improved Apriori Algorithm for Association Rules

An Improved Apriori Algorithm for Association Rules Research article An Improved Apriori Algorithm for Association Rules Hassan M. Najadat 1, Mohammed Al-Maolegi 2, Bassam Arkok 3 Computer Science, Jordan University of Science and Technology, Irbid, Jordan

More information

Using Pattern-Join and Purchase-Combination for Mining Web Transaction Patterns in an Electronic Commerce Environment

Using Pattern-Join and Purchase-Combination for Mining Web Transaction Patterns in an Electronic Commerce Environment Using Pattern-Join and Purchase-Combination for Mining Web Transaction Patterns in an Electronic Commerce Environment Ching-Huang Yun and Ming-Syan Chen Department of Electrical Engineering National Taiwan

More information

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets 2011 Fourth International Symposium on Parallel Architectures, Algorithms and Programming PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets Tao Xiao Chunfeng Yuan Yihua Huang Department

More information

Association Rule Mining. Entscheidungsunterstützungssysteme

Association Rule Mining. Entscheidungsunterstützungssysteme Association Rule Mining Entscheidungsunterstützungssysteme Frequent Pattern Analysis Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set

More information

Aggregation and maintenance for database mining

Aggregation and maintenance for database mining Intelligent Data Analysis 3 (1999) 475±490 www.elsevier.com/locate/ida Aggregation and maintenance for database mining Shichao Zhang School of Computing, National University of Singapore, Lower Kent Ridge,

More information

An Ecient Algorithm for Mining Association Rules in Large. Databases. Ashok Savasere Edward Omiecinski Shamkant Navathe. College of Computing

An Ecient Algorithm for Mining Association Rules in Large. Databases. Ashok Savasere Edward Omiecinski Shamkant Navathe. College of Computing An Ecient Algorithm for Mining Association Rules in Large Databases Ashok Savasere Edward Omiecinski Shamkant Navathe College of Computing Georgia Institute of Technology Atlanta, GA 3332 e-mail: fashok,edwardo,shamg@cc.gatech.edu

More information

Appropriate Item Partition for Improving the Mining Performance

Appropriate Item Partition for Improving the Mining Performance Appropriate Item Partition for Improving the Mining Performance Tzung-Pei Hong 1,2, Jheng-Nan Huang 1, Kawuu W. Lin 3 and Wen-Yang Lin 1 1 Department of Computer Science and Information Engineering National

More information

CHAPTER V ADAPTIVE ASSOCIATION RULE MINING ALGORITHM. Please purchase PDF Split-Merge on to remove this watermark.

CHAPTER V ADAPTIVE ASSOCIATION RULE MINING ALGORITHM. Please purchase PDF Split-Merge on   to remove this watermark. 119 CHAPTER V ADAPTIVE ASSOCIATION RULE MINING ALGORITHM 120 CHAPTER V ADAPTIVE ASSOCIATION RULE MINING ALGORITHM 5.1. INTRODUCTION Association rule mining, one of the most important and well researched

More information

An Algorithm for Frequent Pattern Mining Based On Apriori

An Algorithm for Frequent Pattern Mining Based On Apriori An Algorithm for Frequent Pattern Mining Based On Goswami D.N.*, Chaturvedi Anshu. ** Raghuvanshi C.S.*** *SOS In Computer Science Jiwaji University Gwalior ** Computer Application Department MITS Gwalior

More information

Mining of Web Server Logs using Extended Apriori Algorithm

Mining of Web Server Logs using Extended Apriori Algorithm International Association of Scientific Innovation and Research (IASIR) (An Association Unifying the Sciences, Engineering, and Applied Research) International Journal of Emerging Technologies in Computational

More information

A Fast Distributed Algorithm for Mining Association Rules

A Fast Distributed Algorithm for Mining Association Rules A Fast Distributed Algorithm for Mining Association Rules David W. Cheung y Jiawei Han z Vincent T. Ng yy Ada W. Fu zz Yongjian Fu z y Department of Computer Science, The University of Hong Kong, Hong

More information

Association Rules. Berlin Chen References:

Association Rules. Berlin Chen References: Association Rules Berlin Chen 2005 References: 1. Data Mining: Concepts, Models, Methods and Algorithms, Chapter 8 2. Data Mining: Concepts and Techniques, Chapter 6 Association Rules: Basic Concepts A

More information

Discovery of Frequent Itemset and Promising Frequent Itemset Using Incremental Association Rule Mining Over Stream Data Mining

Discovery of Frequent Itemset and Promising Frequent Itemset Using Incremental Association Rule Mining Over Stream Data Mining Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 5, May 2014, pg.923

More information

Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R,

Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R, Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R, statistics foundations 5 Introduction to D3, visual analytics

More information

A Decremental Algorithm for Maintaining Frequent Itemsets in Dynamic Databases *

A Decremental Algorithm for Maintaining Frequent Itemsets in Dynamic Databases * A Decremental Algorithm for Maintaining Frequent Itemsets in Dynamic Databases * Shichao Zhang 1, Xindong Wu 2, Jilian Zhang 3, and Chengqi Zhang 1 1 Faculty of Information Technology, University of Technology

More information

Chapter 4: Mining Frequent Patterns, Associations and Correlations

Chapter 4: Mining Frequent Patterns, Associations and Correlations Chapter 4: Mining Frequent Patterns, Associations and Correlations 4.1 Basic Concepts 4.2 Frequent Itemset Mining Methods 4.3 Which Patterns Are Interesting? Pattern Evaluation Methods 4.4 Summary Frequent

More information

signicantly higher than it would be if items were placed at random into baskets. For example, we

signicantly higher than it would be if items were placed at random into baskets. For example, we 2 Association Rules and Frequent Itemsets The market-basket problem assumes we have some large number of items, e.g., \bread," \milk." Customers ll their market baskets with some subset of the items, and

More information

Applying Objective Interestingness Measures. in Data Mining Systems. Robert J. Hilderman and Howard J. Hamilton. Department of Computer Science

Applying Objective Interestingness Measures. in Data Mining Systems. Robert J. Hilderman and Howard J. Hamilton. Department of Computer Science Applying Objective Interestingness Measures in Data Mining Systems Robert J. Hilderman and Howard J. Hamilton Department of Computer Science University of Regina Regina, Saskatchewan, Canada SS 0A fhilder,hamiltong@cs.uregina.ca

More information

Parallelizing Frequent Itemset Mining with FP-Trees

Parallelizing Frequent Itemset Mining with FP-Trees Parallelizing Frequent Itemset Mining with FP-Trees Peiyi Tang Markus P. Turkia Department of Computer Science Department of Computer Science University of Arkansas at Little Rock University of Arkansas

More information

Null. Example A. Level 0

Null. Example A. Level 0 A Tree Projection Algorithm For Generation of Frequent Itemsets Ramesh C. Agarwal, Charu C. Aggarwal, V.V.V. Prasad IBM T. J. Watson Research Center, Yorktown Heights, NY 10598 E-mail: f agarwal, charu,

More information

Network. Department of Statistics. University of California, Berkeley. January, Abstract

Network. Department of Statistics. University of California, Berkeley. January, Abstract Parallelizing CART Using a Workstation Network Phil Spector Leo Breiman Department of Statistics University of California, Berkeley January, 1995 Abstract The CART (Classication and Regression Trees) program,

More information

Performance Improvements. IBM Almaden Research Center. Abstract. The problem of mining sequential patterns was recently introduced

Performance Improvements. IBM Almaden Research Center. Abstract. The problem of mining sequential patterns was recently introduced Mining Sequential Patterns: Generalizations and Performance Improvements Ramakrishnan Srikant? and Rakesh Agrawal fsrikant, ragrawalg@almaden.ibm.com IBM Almaden Research Center 650 Harry Road, San Jose,

More information

Temporal Weighted Association Rule Mining for Classification

Temporal Weighted Association Rule Mining for Classification Temporal Weighted Association Rule Mining for Classification Purushottam Sharma and Kanak Saxena Abstract There are so many important techniques towards finding the association rules. But, when we consider

More information

An Efficient Reduced Pattern Count Tree Method for Discovering Most Accurate Set of Frequent itemsets

An Efficient Reduced Pattern Count Tree Method for Discovering Most Accurate Set of Frequent itemsets IJCSNS International Journal of Computer Science and Network Security, VOL.8 No.8, August 2008 121 An Efficient Reduced Pattern Count Tree Method for Discovering Most Accurate Set of Frequent itemsets

More information

AC-Close: Efficiently Mining Approximate Closed Itemsets by Core Pattern Recovery

AC-Close: Efficiently Mining Approximate Closed Itemsets by Core Pattern Recovery : Efficiently Mining Approximate Closed Itemsets by Core Pattern Recovery Hong Cheng Philip S. Yu Jiawei Han University of Illinois at Urbana-Champaign IBM T. J. Watson Research Center {hcheng3, hanj}@cs.uiuc.edu,

More information

A New Method for Similarity Indexing of Market Basket Data. Charu C. Aggarwal Joel L. Wolf. Philip S.

A New Method for Similarity Indexing of Market Basket Data. Charu C. Aggarwal Joel L. Wolf.  Philip S. A New Method for Similarity Indexing of Market Basket Data Charu C. Aggarwal Joel L. Wolf IBM T. J. Watson Research Center IBM T. J. Watson Research Center Yorktown Heights, NY 10598 Yorktown Heights,

More information

Hierarchical Online Mining for Associative Rules

Hierarchical Online Mining for Associative Rules Hierarchical Online Mining for Associative Rules Naresh Jotwani Dhirubhai Ambani Institute of Information & Communication Technology Gandhinagar 382009 INDIA naresh_jotwani@da-iict.org Abstract Mining

More information

Mining Temporal Association Rules in Network Traffic Data

Mining Temporal Association Rules in Network Traffic Data Mining Temporal Association Rules in Network Traffic Data Guojun Mao Abstract Mining association rules is one of the most important and popular task in data mining. Current researches focus on discovering

More information

FIT: A Fast Algorithm for Discovering Frequent Itemsets in Large Databases Sanguthevar Rajasekaran

FIT: A Fast Algorithm for Discovering Frequent Itemsets in Large Databases Sanguthevar Rajasekaran FIT: A Fast Algorithm for Discovering Frequent Itemsets in Large Databases Jun Luo Sanguthevar Rajasekaran Dept. of Computer Science Ohio Northern University Ada, OH 4581 Email: j-luo@onu.edu Dept. of

More information

Data Mining Part 3. Associations Rules

Data Mining Part 3. Associations Rules Data Mining Part 3. Associations Rules 3.2 Efficient Frequent Itemset Mining Methods Fall 2009 Instructor: Dr. Masoud Yaghini Outline Apriori Algorithm Generating Association Rules from Frequent Itemsets

More information

Roadmap DB Sys. Design & Impl. Association rules - outline. Citations. Association rules - idea. Association rules - idea.

Roadmap DB Sys. Design & Impl. Association rules - outline. Citations. Association rules - idea. Association rules - idea. 15-721 DB Sys. Design & Impl. Association Rules Christos Faloutsos www.cs.cmu.edu/~christos Roadmap 1) Roots: System R and Ingres... 7) Data Analysis - data mining datacubes and OLAP classifiers association

More information

CHAPTER 5 WEIGHTED SUPPORT ASSOCIATION RULE MINING USING CLOSED ITEMSET LATTICES IN PARALLEL

CHAPTER 5 WEIGHTED SUPPORT ASSOCIATION RULE MINING USING CLOSED ITEMSET LATTICES IN PARALLEL 68 CHAPTER 5 WEIGHTED SUPPORT ASSOCIATION RULE MINING USING CLOSED ITEMSET LATTICES IN PARALLEL 5.1 INTRODUCTION During recent years, one of the vibrant research topics is Association rule discovery. This

More information

SETM*-MaxK: An Efficient SET-Based Approach to Find the Largest Itemset

SETM*-MaxK: An Efficient SET-Based Approach to Find the Largest Itemset SETM*-MaxK: An Efficient SET-Based Approach to Find the Largest Itemset Ye-In Chang and Yu-Ming Hsieh Dept. of Computer Science and Engineering National Sun Yat-Sen University Kaohsiung, Taiwan, Republic

More information

SA-IFIM: Incrementally Mining Frequent Itemsets in Update Distorted Databases

SA-IFIM: Incrementally Mining Frequent Itemsets in Update Distorted Databases SA-IFIM: Incrementally Mining Frequent Itemsets in Update Distorted Databases Jinlong Wang, Congfu Xu, Hongwei Dan, and Yunhe Pan Institute of Artificial Intelligence, Zhejiang University Hangzhou, 310027,

More information

Association mining rules

Association mining rules Association mining rules Given a data set, find the items in data that are associated with each other. Association is measured as frequency of occurrence in the same context. Purchasing one product when

More information

Efficient Remining of Generalized Multi-supported Association Rules under Support Update

Efficient Remining of Generalized Multi-supported Association Rules under Support Update Efficient Remining of Generalized Multi-supported Association Rules under Support Update WEN-YANG LIN 1 and MING-CHENG TSENG 1 Dept. of Information Management, Institute of Information Engineering I-Shou

More information

A Literature Review of Modern Association Rule Mining Techniques

A Literature Review of Modern Association Rule Mining Techniques A Literature Review of Modern Association Rule Mining Techniques Rupa Rajoriya, Prof. Kailash Patidar Computer Science & engineering SSSIST Sehore, India rprajoriya21@gmail.com Abstract:-Data mining is

More information

Pincer-Search: An Efficient Algorithm. for Discovering the Maximum Frequent Set

Pincer-Search: An Efficient Algorithm. for Discovering the Maximum Frequent Set Pincer-Search: An Efficient Algorithm for Discovering the Maximum Frequent Set Dao-I Lin Telcordia Technologies, Inc. Zvi M. Kedem New York University July 15, 1999 Abstract Discovering frequent itemsets

More information

DATA MINING II - 1DL460

DATA MINING II - 1DL460 Uppsala University Department of Information Technology Kjell Orsborn DATA MINING II - 1DL460 Assignment 2 - Implementation of algorithm for frequent itemset and association rule mining 1 Algorithms for

More information

An Improved Algorithm for Mining Association Rules Using Multiple Support Values

An Improved Algorithm for Mining Association Rules Using Multiple Support Values An Improved Algorithm for Mining Association Rules Using Multiple Support Values Ioannis N. Kouris, Christos H. Makris, Athanasios K. Tsakalidis University of Patras, School of Engineering Department of

More information

Chapter 4: Association analysis:

Chapter 4: Association analysis: Chapter 4: Association analysis: 4.1 Introduction: Many business enterprises accumulate large quantities of data from their day-to-day operations, huge amounts of customer purchase data are collected daily

More information

Egemen Tanin, Tahsin M. Kurc, Cevdet Aykanat, Bulent Ozguc. Abstract. Direct Volume Rendering (DVR) is a powerful technique for

Egemen Tanin, Tahsin M. Kurc, Cevdet Aykanat, Bulent Ozguc. Abstract. Direct Volume Rendering (DVR) is a powerful technique for Comparison of Two Image-Space Subdivision Algorithms for Direct Volume Rendering on Distributed-Memory Multicomputers Egemen Tanin, Tahsin M. Kurc, Cevdet Aykanat, Bulent Ozguc Dept. of Computer Eng. and

More information

Mining Quantitative Association Rules on Overlapped Intervals

Mining Quantitative Association Rules on Overlapped Intervals Mining Quantitative Association Rules on Overlapped Intervals Qiang Tong 1,3, Baoping Yan 2, and Yuanchun Zhou 1,3 1 Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China {tongqiang,

More information

Parallel Association Rule Mining by Data De-Clustering to Support Grid Computing

Parallel Association Rule Mining by Data De-Clustering to Support Grid Computing Parallel Association Rule Mining by Data De-Clustering to Support Grid Computing Frank S.C. Tseng and Pey-Yen Chen Dept. of Information Management National Kaohsiung First University of Science and Technology

More information

A Novel Texture Classification Procedure by using Association Rules

A Novel Texture Classification Procedure by using Association Rules ITB J. ICT Vol. 2, No. 2, 2008, 03-4 03 A Novel Texture Classification Procedure by using Association Rules L. Jaba Sheela & V.Shanthi 2 Panimalar Engineering College, Chennai. 2 St.Joseph s Engineering

More information

Tutorial on Assignment 3 in Data Mining 2009 Frequent Itemset and Association Rule Mining. Gyozo Gidofalvi Uppsala Database Laboratory

Tutorial on Assignment 3 in Data Mining 2009 Frequent Itemset and Association Rule Mining. Gyozo Gidofalvi Uppsala Database Laboratory Tutorial on Assignment 3 in Data Mining 2009 Frequent Itemset and Association Rule Mining Gyozo Gidofalvi Uppsala Database Laboratory Announcements Updated material for assignment 3 on the lab course home

More information

Performance Evaluation of Sequential and Parallel Mining of Association Rules using Apriori Algorithms

Performance Evaluation of Sequential and Parallel Mining of Association Rules using Apriori Algorithms Int. J. Advanced Networking and Applications 458 Performance Evaluation of Sequential and Parallel Mining of Association Rules using Apriori Algorithms Puttegowda D Department of Computer Science, Ghousia

More information

Mining Rare Periodic-Frequent Patterns Using Multiple Minimum Supports

Mining Rare Periodic-Frequent Patterns Using Multiple Minimum Supports Mining Rare Periodic-Frequent Patterns Using Multiple Minimum Supports R. Uday Kiran P. Krishna Reddy Center for Data Engineering International Institute of Information Technology-Hyderabad Hyderabad,

More information

A Further Study in the Data Partitioning Approach for Frequent Itemsets Mining

A Further Study in the Data Partitioning Approach for Frequent Itemsets Mining A Further Study in the Data Partitioning Approach for Frequent Itemsets Mining Son N. Nguyen, Maria E. Orlowska School of Information Technology and Electrical Engineering The University of Queensland,

More information

CHAPTER 4 AN INTEGRATED APPROACH OF PERFORMANCE PREDICTION ON NETWORKS OF WORKSTATIONS. Xiaodong Zhang and Yongsheng Song

CHAPTER 4 AN INTEGRATED APPROACH OF PERFORMANCE PREDICTION ON NETWORKS OF WORKSTATIONS. Xiaodong Zhang and Yongsheng Song CHAPTER 4 AN INTEGRATED APPROACH OF PERFORMANCE PREDICTION ON NETWORKS OF WORKSTATIONS Xiaodong Zhang and Yongsheng Song 1. INTRODUCTION Networks of Workstations (NOW) have become important distributed

More information

Web page recommendation using a stochastic process model

Web page recommendation using a stochastic process model Data Mining VII: Data, Text and Web Mining and their Business Applications 233 Web page recommendation using a stochastic process model B. J. Park 1, W. Choi 1 & S. H. Noh 2 1 Computer Science Department,

More information

Graph Based Approach for Finding Frequent Itemsets to Discover Association Rules

Graph Based Approach for Finding Frequent Itemsets to Discover Association Rules Graph Based Approach for Finding Frequent Itemsets to Discover Association Rules Manju Department of Computer Engg. CDL Govt. Polytechnic Education Society Nathusari Chopta, Sirsa Abstract The discovery

More information

A Graph-Based Approach for Mining Closed Large Itemsets

A Graph-Based Approach for Mining Closed Large Itemsets A Graph-Based Approach for Mining Closed Large Itemsets Lee-Wen Huang Dept. of Computer Science and Engineering National Sun Yat-Sen University huanglw@gmail.com Ye-In Chang Dept. of Computer Science and

More information

Optimization using Ant Colony Algorithm

Optimization using Ant Colony Algorithm Optimization using Ant Colony Algorithm Er. Priya Batta 1, Er. Geetika Sharmai 2, Er. Deepshikha 3 1Faculty, Department of Computer Science, Chandigarh University,Gharaun,Mohali,Punjab 2Faculty, Department

More information

AN EFFICIENT GRADUAL PRUNING TECHNIQUE FOR UTILITY MINING. Received April 2011; revised October 2011

AN EFFICIENT GRADUAL PRUNING TECHNIQUE FOR UTILITY MINING. Received April 2011; revised October 2011 International Journal of Innovative Computing, Information and Control ICIC International c 2012 ISSN 1349-4198 Volume 8, Number 7(B), July 2012 pp. 5165 5178 AN EFFICIENT GRADUAL PRUNING TECHNIQUE FOR

More information

Mining High Average-Utility Itemsets

Mining High Average-Utility Itemsets Proceedings of the 2009 IEEE International Conference on Systems, Man, and Cybernetics San Antonio, TX, USA - October 2009 Mining High Itemsets Tzung-Pei Hong Dept of Computer Science and Information Engineering

More information

ANU MLSS 2010: Data Mining. Part 2: Association rule mining

ANU MLSS 2010: Data Mining. Part 2: Association rule mining ANU MLSS 2010: Data Mining Part 2: Association rule mining Lecture outline What is association mining? Market basket analysis and association rule examples Basic concepts and formalism Basic rule measurements

More information

An Efficient Algorithm for finding high utility itemsets from online sell

An Efficient Algorithm for finding high utility itemsets from online sell An Efficient Algorithm for finding high utility itemsets from online sell Sarode Nutan S, Kothavle Suhas R 1 Department of Computer Engineering, ICOER, Maharashtra, India 2 Department of Computer Engineering,

More information

A Data Mining Framework for Extracting Product Sales Patterns in Retail Store Transactions Using Association Rules: A Case Study

A Data Mining Framework for Extracting Product Sales Patterns in Retail Store Transactions Using Association Rules: A Case Study A Data Mining Framework for Extracting Product Sales Patterns in Retail Store Transactions Using Association Rules: A Case Study Mirzaei.Afshin 1, Sheikh.Reza 2 1 Department of Industrial Engineering and

More information

ON-LINE GENERATION OF ASSOCIATION RULES USING INVERTED FILE INDEXING AND COMPRESSION

ON-LINE GENERATION OF ASSOCIATION RULES USING INVERTED FILE INDEXING AND COMPRESSION ON-LINE GENERATION OF ASSOCIATION RULES USING INVERTED FILE INDEXING AND COMPRESSION Ioannis N. Kouris Department of Computer Engineering and Informatics, University of Patras 26500 Patras, Greece and

More information

Frequent Item Set using Apriori and Map Reduce algorithm: An Application in Inventory Management

Frequent Item Set using Apriori and Map Reduce algorithm: An Application in Inventory Management Frequent Item Set using Apriori and Map Reduce algorithm: An Application in Inventory Management Kranti Patil 1, Jayashree Fegade 2, Diksha Chiramade 3, Srujan Patil 4, Pradnya A. Vikhar 5 1,2,3,4,5 KCES

More information

APPLICATION OF THE FUZZY MIN-MAX NEURAL NETWORK CLASSIFIER TO PROBLEMS WITH CONTINUOUS AND DISCRETE ATTRIBUTES

APPLICATION OF THE FUZZY MIN-MAX NEURAL NETWORK CLASSIFIER TO PROBLEMS WITH CONTINUOUS AND DISCRETE ATTRIBUTES APPLICATION OF THE FUZZY MIN-MAX NEURAL NETWORK CLASSIFIER TO PROBLEMS WITH CONTINUOUS AND DISCRETE ATTRIBUTES A. Likas, K. Blekas and A. Stafylopatis National Technical University of Athens Department

More information

SQL Based Association Rule Mining using Commercial RDBMS (IBM DB2 UDB EEE)

SQL Based Association Rule Mining using Commercial RDBMS (IBM DB2 UDB EEE) SQL Based Association Rule Mining using Commercial RDBMS (IBM DB2 UDB EEE) Takeshi Yoshizawa, Iko Pramudiono, Masaru Kitsuregawa Institute of Industrial Science, The University of Tokyo 7-22-1 Roppongi,

More information

Discovery of Multi-level Association Rules from Primitive Level Frequent Patterns Tree

Discovery of Multi-level Association Rules from Primitive Level Frequent Patterns Tree Discovery of Multi-level Association Rules from Primitive Level Frequent Patterns Tree Virendra Kumar Shrivastava 1, Parveen Kumar 2, K. R. Pardasani 3 1 Department of Computer Science & Engineering, Singhania

More information

Evaluation of Sampling for Data Mining of Association Rules

Evaluation of Sampling for Data Mining of Association Rules Evaluation of Sampling for Data Mining of Association Rules Mohammed Javeed Zaki, Srinivasan Parthasarathy, Wei Li, Mitsunori Ogihara Computer Science Department, University of Rochester, Rochester NY

More information

Medical Data Mining Based on Association Rules

Medical Data Mining Based on Association Rules Medical Data Mining Based on Association Rules Ruijuan Hu Dep of Foundation, PLA University of Foreign Languages, Luoyang 471003, China E-mail: huruijuan01@126.com Abstract Detailed elaborations are presented

More information
