An Adaptive Algorithm for Mining Association Rules on Shared-memory Parallel Machines

David W. Cheung (Department of Computer Science, The University of Hong Kong, Hong Kong; dcheung@cs.hku.hk)
Kan Hu, Shaowei Xia (Department of Automation, Tsinghua University, Beijing; swxia@mail.tsinghua.edu.cn)

Abstract

Mining association rules from large databases is very costly. We propose to develop parallel algorithms for this task on shared-memory multiprocessors (SMP). All parallel algorithms proposed for other paradigms follow the conventional level-wise approach: they need as many iterations as the length of the maximum large itemset. To make matters worse, they impose a synchronization in every iteration, which would cause serious I/O contention on a shared-memory parallel system. We propose an adaptive, asynchronous parallel mining algorithm, APM, for SMP: all processors generate candidates dynamically and count itemset supports independently, without synchronization. Two optimization techniques are proposed to reduce the database scanning and the number of candidates. The algorithm has been implemented on a Sun Enterprise 4000 shared-memory multiprocessor with 12 nodes. The experiments show that the optimizations are very effective and that APM has a substantial performance lead over the proposed level-wise algorithms.

Keywords: Data Mining, Parallel Databases, Association Rules, Parallel Mining, Parallel Computing, Shared-memory Multiprocessors.

1 Introduction

Mining association rules in a large database is an important problem in data mining [1, 2, 4, 6, 8, 11, 12, 14]. A representative class of users of this technique is the retail industry. A sales record in a retail database typically consists of all the items bought by a customer in a transaction, together with other pieces of information such as date, card identification number, etc. Mining association rules is to discover, from a huge number of transaction records, strong associations among the items, such that the presence of one item in a transaction may imply the presence of another.

It has been shown that the problem of mining association rules can be decomposed into two subproblems [1]. The first is to find all large itemsets, i.e., the itemsets contained in a significant number of transactions with respect to a threshold minimum support. The second is to generate all the association rules from the large itemsets found, with respect to another threshold, the minimum confidence. (Background on association rule mining is available in [2]; readers not familiar with terminology such as large itemset and support threshold may refer to the brief introduction in Section 2.) Since it is straightforward to generate the association rules once the large itemsets are available, the challenge is in computing the large itemsets.
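To make the two subproblems concrete, the following is a minimal Python sketch of both steps on a toy transaction database; the data, the thresholds and the function names are illustrative assumptions, not material from the paper.

```python
from itertools import combinations

# Toy transaction database (hypothetical data for illustration only).
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
]
minsup, minconf = 0.5, 0.8  # minimum support / confidence thresholds

def support(itemset):
    """Fraction of transactions containing every item of `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Subproblem 1: find all large (frequent) itemsets.
items = sorted(set().union(*transactions))
large = [frozenset(c) for k in range(1, len(items) + 1)
         for c in combinations(items, k) if support(frozenset(c)) >= minsup]

# Subproblem 2: generate the rules X => Y with enough confidence.
for itemset in (s for s in large if len(s) > 1):
    for k in range(1, len(itemset)):
        for x in combinations(sorted(itemset), k):
            X = frozenset(x)
            conf = support(itemset) / support(X)
            if conf >= minconf:
                print(set(X), "=>", set(itemset - X), f"conf={conf:.2f}")
```

On this toy data the only rule produced is {butter} => {bread} with confidence 1.0; the expensive step in practice is subproblem 1, which the rest of the paper addresses.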

The cost of computing the large itemsets consists of two main components: (1) the I/O cost of scanning the database; (2) for each transaction read, the cost of identifying all candidate itemsets contained in the transaction. For the second component, the dominant factor is the size of the search domain, i.e., the number of candidate itemsets.

One of the earliest algorithms proposed for association rule mining is the Apriori algorithm [1, 2]. It adopts a level-wise technique which prunes candidate sets by levels. The pruning uses the large itemsets found at one level (size k) to generate the candidates of the next level (size k+1). In general this significantly reduces the number of candidate sets after the first level. The main contribution of the Apriori algorithm is the development of techniques which use the search result at one level to prune the candidates of the next level, and hence reduce the search cost. The smaller the number of candidate sets, the faster the algorithm.

Since computing large itemsets is very costly in terms of CPU and I/O resources, there is a practical need to develop parallel algorithms for this task. Many algorithms for this purpose have been proposed for parallel systems with distributed (shared-nothing) memory [3, 5, 7, 13, 15]. With its popularity and cost effectiveness, the shared-memory multiprocessor (SMP) is another important parallel computing paradigm. Mining association rules requires the storage of a large amount of intermediate values; with its large aggregated memory, an SMP is particularly well suited to this mining task. However, no efficient algorithm has been proposed for this purpose on SMP. Our study has shown that currently available solutions for sequential machines or other parallel models are not suitable; hence, our goal is to develop one that is suitable for SMP.

In a distributed shared-nothing memory parallel system, the processors rely on communication to coordinate their tasks. All proposed algorithms for mining association rules in this model focus on three issues: (1) reduction of communication; (2) pruning of candidate sets; (3) partitioning of the candidate sets across the distributed memory. They all follow the level-wise approach developed in the serial algorithm Apriori [2]. Large itemsets in this approach are computed in iterations by their lengths, i.e., the number of items they contain; thus, the database must be scanned as many times as the length of the maximum large itemset, a hard lower bound on the I/O cost. They also adopt a synchronous protocol for exchanging support counts at the end of every iteration, which may not be the best for a shared-memory system.

In an SMP system, processors communicate through shared variables and communication cost is no longer an issue; mining performance is dominated solely by I/O and computation costs. In the distributed memory model, processors access their own database partitions on their local disks independently. In an SMP, in contrast, the partitions for the different processors are stored in the same shared storage and accessed via the same I/O channel, which could very easily become a performance bottleneck. Further aggravating the problem, the synchronized access to the partitions in every iteration creates serious I/O contention among the processors. Our contribution is an adaptive and asynchronous algorithm for the shared-memory model which requires much less I/O compared with the level-wise approach.

A direct extension of the serial association rule mining algorithm Apriori to the distributed memory model has been developed.
Its implementation on the IBM SP2 system is called CD (Count Distribution), a synchronous level-wise algorithm [3]. A variant of CD was adapted to SMP in [17], which parallelizes the candidate generation; however, it suffers heavily from I/O contention. As far as we know, no asynchronous parallel algorithm has been proposed for mining association rules. In this paper, we propose a parallel algorithm APM (Adaptive Parallel Mining) for mining association rules on an SMP machine. APM has the following merits: (1) it requires much less scanning of the database than a level-wise algorithm such as CD; (2) it has a much smaller set of candidates than CD; (3) it is an asynchronous algorithm which produces less I/O contention.

The candidate generation in APM is based on the dynamic candidate generation method in the Dynamic Itemset Counting (DIC) algorithm [4]. It is a break from the conventional level-wise approach: it divides the database into intervals, and generates candidates and performs counting on the intervals instead of on the entire database. By working on intervals, it can start its counting earlier during the scanning of the database. In an ideal situation in which the itemset distribution over the intervals is homogeneous, the counting can be completed with much less database scanning. Despite its simplicity, its advantages in many cases are overwhelmed by its problems. Its performance is very sensitive to the data distribution characteristics and to the choice of the interval size. A direct implementation of dynamic counting would often lead to a large number of candidates, because the distribution in general is not highly homogeneous. Thus, the amount of database scanning required in many cases would be even worse than that in the level-wise approach. The merits of dynamic candidate generation are its potential to require less database scanning, and its provision for the processors to perform counting asynchronously over their partitions.

In APM, we have incorporated this dynamic technique and tackled its problems with two optimization techniques. (1) Adaptive interval configuration: it generates a configuration of the intervals which produces a database partitioning across the processors with highly homogeneous inter-partition and intra-partition itemset distributions. With such a homogeneous distribution, the dynamic technique performs much closer to the ideal situation. (2) Virtual partition pruning: the dynamic technique is very sensitive to an error-accumulating effect. Unlike Apriori, it does not generate candidate sets based on the large itemsets found in every iteration; instead, it uses locally large itemsets found in one interval to generate candidates for the subsequent intervals. If a large part of these locally large itemsets are not globally large (not large with respect to the entire database), then the candidates generated from them will contain many unnecessary candidates (false hits). This error-accumulating effect can push the computation cost up exponentially to an unacceptable level. The virtual partition pruning technique starts the dynamic generation with a much more accurate and smaller initial candidate set, reduces substantially the accumulated error, and hence produces a much smaller candidate set.

We have implemented APM and CD on a Sun Enterprise 4000 shared-memory multiprocessor machine with 12 processors and carried out extensive performance studies. Our results show that a direct implementation of the dynamic technique does not bring any performance gain; in fact, in some cases, it is even worse than the level-wise approach. However, with the incorporation of the two optimization techniques, APM performs consistently better than CD in all respects: it was 2 to 5 times faster than CD in most cases, reduced the I/O cost by about half, and had a candidate set 6 to 30 times smaller. In short, the results confirm that the optimizations are the main reasons for the superior performance of APM.

The rest of this paper is organized as follows. Dynamic candidate generation and virtual partition pruning are studied in Section 2. Details of the APM algorithm are in Section 3. Section 4 reports the results of an extensive performance study.
Sections 5 and 6 contain the discussion and the conclusion.

2 Dynamic Techniques for Association Rule Mining

Let I = {i_1, i_2, ..., i_m} be a set of items and D be a database of transactions, where each transaction t consists of a set of items such that t ⊆ I. An association rule is an implication of the form X ⇒ Y, where X ⊂ I, Y ⊂ I and X ∩ Y = ∅ [1]. An association rule X ⇒ Y has support s in D if the probability of a transaction in D containing both X and Y is s. The association rule X ⇒ Y holds in D with confidence c if c is the probability that a transaction in D containing X also contains Y.

The task of mining association rules is to find all the association rules whose support is larger than a given minimum support threshold and whose confidence is larger than a given minimum confidence threshold. For an itemset X, we use X.sup to denote its support count in database D, which is the number of transactions in D containing X. An itemset X ⊆ I is large (or frequent) if X.sup ≥ minsup × |D|, where minsup is the given minimum support threshold. For the purpose of presentation, we sometimes use support to stand for the support count of an itemset. It has been shown that the problem of mining association rules can be reduced to finding all large itemsets for a given minimum support threshold.

2.1 Level-wise Approach

All association rule mining algorithms have adopted, in various ways, the candidate generation technique of the Apriori algorithm [2]. It is a level-wise approach which finds large itemsets by iteration. (We use L_k and C_k to denote the set of size-k large itemsets and the set of size-k candidate itemsets, k ≥ 1, respectively.) It scans the database in the first iteration to count and find the set of size-1 large itemsets L_1. In this iteration, the candidate set C_1 consists of all the items in I. After that, the candidate set C_k in each subsequent iteration k is generated by applying a function Apriori-gen to the large itemsets L_{k-1} found in iteration k-1. Apriori-gen essentially joins the large itemsets in L_{k-1} in all possible ways to form size-k candidates and eliminates some of them with the subset constraint that all size-(k-1) subsets of a candidate must have been found large in the previous iteration. Apriori then counts the supports of the candidates in C_k by checking the transactions in D against C_k, which requires a full scan of D. In this level-wise approach, the database must be scanned as many times as the length of the maximum large itemset.

Let us analyze the cost of finding the large itemsets, which consists mainly of I/O and computation cost. If ω is the length of the maximum large itemset, then the I/O cost is determined by the ω scans of the database. The computation cost in each iteration is dominated by the counting of the supports of the candidates in C_k, in which the size of C_k is the major factor. Let the sizes of C_k and L_k be c_k and l_k, respectively. The value of c_k is determined by the Apriori-gen function and the set L_{k-1}. If c is the total number of candidate sets, then

$$c = \sum_{k=1}^{\omega} c_k = \sum_{k=1}^{\omega} f_k(L_{k-1}),$$

where f_k, the number of candidates generated, is a function of L_{k-1}. Furthermore, if the I/O and computation costs in iteration k are α_k and β_k, respectively, then the total cost Γ of computing all the large itemsets is given by

$$\Gamma = \sum_{k=1}^{\omega} (\alpha_k + \beta_k) = \sum_{k=1}^{\omega} \beta_k + \omega\alpha,$$

where α is the I/O cost of a full database scan, which is the same for all the iterations. In general, the number of candidates drops rapidly after the first few iterations, and β_k peaks in the first few iterations. Thus, there are two options for reducing the cost: (1) reduce ω, i.e., the number of rounds of scanning the database; (2) reduce the number of candidates, in particular those in the first few iterations. This observation has also been discussed in [12], in which a dynamic hashing technique is used to reduce the size-2 candidate set.
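The following is a minimal Python sketch of the Apriori-gen join-and-prune step described in this subsection; the function name and the frozenset representation are our own, not the paper's implementation.

```python
from itertools import combinations

def apriori_gen(large_prev):
    """Generate size-k candidates from the size-(k-1) large itemsets.

    large_prev: set of frozensets, all of the same size k-1.
    """
    k = len(next(iter(large_prev))) + 1
    # Join step: union pairs of size-(k-1) large itemsets into size-k sets.
    joined = {a | b for a in large_prev for b in large_prev
              if len(a | b) == k}
    # Prune step: keep a candidate only if every size-(k-1) subset is large.
    return {c for c in joined
            if all(frozenset(s) in large_prev
                   for s in combinations(c, k - 1))}

# Example: L2 = {AB, AC, BC, BD} yields candidate ABC but not ABD or BCD,
# because AD and CD are not large.
L2 = {frozenset("AB"), frozenset("AC"), frozenset("BC"), frozenset("BD")}
print(apriori_gen(L2))  # {frozenset({'A', 'B', 'C'})}
```

Reducing |C_k|, and especially |C_2|, directly reduces the β_k terms of the cost above.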

2.2 Dynamic Candidate Generation

Dynamic candidate generation is an attempt at the first option described above, reducing the number of scans [4]. It divides the database into equal-size intervals, each containing the same number of transactions, and assigns checkpoints at the boundaries between consecutive intervals (Figure 1). It then scans the database starting from the first interval, and generates candidate itemsets and counts them in a pipeline fashion. In the first interval, the initial set of candidates is the set of size-1 itemsets. During the scanning of an interval, all candidates generated and accumulated so far are counted. At the checkpoint, those found to be locally large with respect to the number of transactions in the interval are used to generate candidates for the next interval. These locally large itemsets are divided into groups by their lengths, and the function Apriori-gen is applied to each group to generate new candidates. This candidate generation and counting process is repeated on every interval. It does not stop at the last interval but wraps around the database and continues at the first interval until the termination condition becomes true.

Figure 1: DIC. The transaction database divided into intervals, with a checkpoint of the counting process at each interval boundary.

This aggressive and optimistic approach allows the candidate sets to grow very quickly. DIC will be counting size-k candidates during the scanning of the k-th interval of the first pass instead of in the k-th iteration. It also allows candidates of different sizes to be counted together once they are generated. The global support of a candidate in the entire database is found as soon as the scanning has wrapped around the database and come back to the checkpoint at which the candidate was first generated. Once all generated candidates have been counted, the counting is complete and terminates.

The dynamic technique has the following problems: (1) it is very sensitive to the data distribution characteristics; (2) it requires a proper interval size; (3) it accumulates errors recursively in generating candidates, and the number of unnecessary candidates generated is very sensitive to the number of false hits in the initial set of candidates.

The dynamic technique requires a homogeneous distribution of large itemsets over the intervals. In an ideal scenario, the itemset distributions in most of the intervals are very similar, and very few new candidates would be generated after the first ω intervals, where ω is the length of the maximum large itemset; hence, the amount of scanning required could be as little as one pass of the database plus ω intervals. On the other hand, if the distribution is not very uniform, new candidates may be generated at various stages, which would keep extending the scanning of the database. In the worst case, the number of passes required could be larger than ω.

The choice of the interval size is also important in dynamic generation. With respect to the whole database, an interval is similar to a sample of the database. If it is large, it is more likely that the locally large itemsets found are also globally large; hence, fewer unnecessary candidates would be generated. However, it would require more scanning of the database. On the other hand, a smaller interval would reduce the amount of scanning; however, it may not be as good a sample, and more locally large itemsets would be found, hence more unnecessary candidate sets. (A candidate is unnecessary if it is not large.)
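The following is a minimal single-processor Python sketch of this interval pipeline under simplifying assumptions (a join step without the subset prune, and a crude locally-large test over the intervals seen so far); it illustrates the scheme, not DIC exactly as specified in [4].

```python
def dic_scan(intervals, items, min_frac):
    """Minimal sketch of DIC-style dynamic counting over equal-size intervals."""
    counts = {frozenset([i]): 0 for i in items}  # initial size-1 candidates
    seen = {c: 0 for c in counts}                # intervals counted per candidate
    m = len(intervals)
    size = len(intervals[0])                     # transactions per interval
    pos = 0
    # Wrap around the database until every candidate has been counted
    # over all m intervals.
    while any(v < m for v in seen.values()):
        interval = intervals[pos % m]
        for c in counts:
            if seen[c] < m:
                counts[c] += sum(c <= t for t in interval)
                seen[c] += 1
        # Checkpoint: itemsets locally large over the intervals counted so
        # far spawn candidates one item larger (join step only; the
        # Apriori-gen subset prune is omitted for brevity).
        local_large = [c for c, v in counts.items()
                       if seen[c] and v >= min_frac * size * seen[c]]
        new = {a | b for a in local_large for b in local_large
               if len(a | b) == len(a) + 1}
        for u in new:
            if u not in counts:
                counts[u] = 0
                seen[u] = 0
        pos += 1
    return counts
```

In DIC proper, a candidate's count becomes its global support once the scan wraps back to the checkpoint at which it was created; the seen counter plays that role in this sketch.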

In APM, we have developed a technique to configure and merge the intervals to create a more homogeneous data distribution for dynamic generation. As a consequence, an interval size compatible with the homogeneous distribution can also be determined in this process.

Now let us look at the problem of error accumulation in the dynamic technique. Let ω_c be the length of the maximum candidate itemset. If c is the total number of candidate sets generated by the dynamic technique, and c_k is the number of size-k candidates, then

$$c = \sum_{k=1}^{\omega_c} c_k = \sum_{k=1}^{\omega_c} f_k(L'_{k-1}),$$

where L'_k is the set of size-k locally large itemsets found and f_k is determined by the function Apriori-gen. (In dynamic candidate generation, not all locally large itemsets are used to generate candidates at a checkpoint; only those found in that interval are used.) In the ideal case in which the data distribution is homogeneous, L'_k will be very close to L_k, and the candidate set size will be similar to that produced by Apriori-gen. In the opposite case, L'_k will be much larger than L_k, and many more candidates will be generated unnecessarily. Worse yet, the error accumulates over the intervals. In Apriori-gen, every candidate is generated from the "true" large itemsets found in the previous iteration. In the dynamic case, some locally large itemsets which are not "true" large itemsets are used to generate candidates which will never become large. These unnecessary candidates, if they turn out to be locally large subsequently, will in turn be used to generate further unnecessary candidates. This recursive effect can undesirably push up the number of false hits (unnecessary candidates) among the candidates. Because of it, the number of small-size unnecessary candidates significantly affects the number of unnecessary candidates of larger sizes, and hence the total number of unnecessary candidates. Therefore, we must control the number of small-size unnecessary candidate sets. We have found in our performance studies that if we can prune C_2 down closer to L_2, the accumulated error in dynamic candidate generation can be reduced significantly. A simple way to achieve this is to delay the dynamic generation until L_2 is found, which requires two additional passes over the database. We have taken a more effective approach: use only one additional scan to compute L_1, and at the same time adopt a virtual partition pruning technique to prune C_2. Our performance studies have shown that this technique is very effective in bringing down the accumulated error in the candidate generation.

2.3 Virtual Partition Pruning

Given a database D, we can divide it into n partitions D_i, 1 ≤ i ≤ n. For an itemset W, we use W.sup(i) to denote the support (local support) of W in D_i, 1 ≤ i ≤ n. We also use W.sup to denote the support (global support) of W in D. Suppose X is a size-k candidate set. If Y ⊂ X, then Y.sup(i) ≥ X.sup(i), 1 ≤ i ≤ n. Therefore, the local support of X in D_i, X.sup(i), is bounded by the value min{Y.sup(i) | Y ⊂ X, |Y| = k-1}. Hence, the value

$$X.maxsup = \sum_{i=1}^{n} X.maxsup(i), \quad \text{where } X.maxsup(i) = \min\{Y.sup(i) \mid Y \subset X,\ |Y| = k-1\}, \qquad (2.1)$$

is an upper bound of the global support of X.

If X.maxsup < s × |D|, where s is the minimum support threshold, then X can be pruned away. We refer to this as partition pruning, which is an extension of the global pruning proposed in [5]. Note that in a multiprocessor system, it is natural to divide the database according to the number of processors. With such a configuration, partition pruning can be performed with respect to the actual partitioning. On the other hand, the database can also be partitioned irrespective of the number of processors. We call the pruning performed in this way virtual partition pruning.

The effect of partition pruning can be illustrated by the example in Table 1. Suppose the global support count threshold is 3. Then all the itemsets in {A, B, C, D, E, F} become size-1 large itemsets. Hence, the number of size-2 candidates from Apriori-gen would be C(6,2) = 15. However, among these 15 candidates, only AB and EF survive the partition pruning. The upper bounds of the global supports of AB and EF, computed according to equation (2.1), are 31 and 3, respectively (and hence they survive the pruning); the upper bounds of the other size-2 candidates are all below 3 (and hence they are pruned away). In this example, partition pruning prunes away more than 80% of the candidates of Apriori.

Items                          A   B   C   D   E   F
local support at partition 1
local support at partition 2
local support at partition 3
global support

Table 1: Virtual Partition Pruning (the numeric entries did not survive transcription)

It can be seen from the above example that this pruning technique is more effective if the itemsets have a skewed distribution over the partitions. Intuitively, the data skewness of a partitioned database is high if most large itemsets are locally large only at a few partitions. In the following, we show analytically the effect of partition pruning on a highly skewed partitioning.

Lemma 1 Suppose a database is partitioned into n partitions. Let L_1 be the set of size-1 large itemsets, and let C_2(a) and C_2(v) be the set of size-2 candidates generated by Apriori and the set remaining after the partition pruning, respectively. Suppose that each size-1 large itemset is locally large at one and only one partition, and the number of size-1 locally large itemsets at each partition is |L_1|/n. Then

$$\frac{|C_2(v)|}{|C_2(a)|} \leq \frac{|L_1|/n - 1}{|L_1| - 1} < \frac{1}{n}.$$

Proof: Since Apriori generates as size-2 candidates all the 2-combinations of L_1, |C_2(a)| = |L_1|(|L_1| - 1)/2. As for partition pruning, if AB ∈ C_2(v), then both A and B must be locally large together at one of the partitions. Therefore, |C_2(v)| ≤ n × (|L_1|/n)(|L_1|/n - 1)/2 = (|L_1|/2)(|L_1|/n - 1). Dividing gives the claimed ratio; the ratio is less than 1/n because n(|L_1|/n - 1) = |L_1| - n < |L_1| - 1. □

Lemma 1 shows that the partition technique can prune away at least a fraction (n-1)/n of the size-2 candidates generated by Apriori on a skewed partitioning. Note that the cost of partition pruning is very low. In APM, we use it to prune the candidates in C_2; the only additional cost required is the storage of the local supports of the size-1 itemsets. The local and global supports can be counted together in one pass over the database.
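As a concrete rendering of equation (2.1), here is a small Python sketch with a hypothetical per-partition support table; the representation and names are ours.

```python
from itertools import combinations

def prune_candidates(candidates, local_sup, n, threshold):
    """Keep a size-k candidate X only if its upper bound X.maxsup, summed
    over the n partitions from the minimum local support of its size-(k-1)
    subsets (equation 2.1), reaches the support-count threshold."""
    survivors = set()
    for X in candidates:
        k = len(X)
        maxsup = sum(
            min(local_sup[i].get(frozenset(Y), 0)
                for Y in combinations(X, k - 1))
            for i in range(n))
        if maxsup >= threshold:
            survivors.add(X)
    return survivors

# Toy run (hypothetical counts): A and B are locally large together at
# partition 0 only, so AB survives while AC is pruned.
local_sup = [
    {frozenset("A"): 10, frozenset("B"): 9, frozenset("C"): 0},
    {frozenset("A"): 0, frozenset("B"): 1, frozenset("C"): 8},
]
cands = {frozenset("AB"), frozenset("AC")}
print(prune_candidates(cands, local_sup, n=2, threshold=5))
```

The bound for AB is min(10, 9) + min(0, 1) = 9, so it survives a threshold of 5; the bound for AC is 0, so it is pruned.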

3 Parallel Mining on SMP

CD (Count Distribution) is a parallel version of Apriori proposed for distributed-memory parallel systems [3]. In this model, the database D is partitioned onto the local disks of the processors. At every iteration (the k-th iteration), each processor first computes the same candidate set C_k from L_{k-1}, and then scans its own partition to compute the local supports of the candidates in C_k. All the processors then exchange their local supports by performing synchronous broadcasts. Following that, each processor computes the global supports of the candidates and finds the globally large itemsets in L_k. CD repeats these steps until no new candidate is generated. Several other parallel algorithms, such as IDD, HD and HPA, which are improvements on CD, have been proposed [7, 15]. They focus on the issue of distributing the candidates among the distributed memory. All of them are level-wise, synchronous algorithms. In the following, we explain why asynchronous algorithms could be more efficient in a shared-memory system: processors would not be forced to wait for each other in every round of communication.

3.1 Asynchronous Parallel Mining of Association Rules

In a shared-memory multiprocessor parallel system, the model generally used for computing the large itemsets is called common candidates partitioned database [17]. The database D is stored in the shared storage system and divided logically into partitions D_1, D_2, ..., D_n, where n is the number of processors. Each processor counts the supports of the common candidates against its own partition. The results of the counting are stored in a shared data structure in memory.

Figure 2: The candidate trie and its node structure (each node carries the fields item, itemid, LocalCount, InterCount, branchno and branchpointer).
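In APM this shared structure is the candidate trie of Figure 2 (the counter fields are detailed in Section 3.6). As a rough Python sketch of such a node and of candidate insertion; the class layout and the insert helper are illustrative assumptions:

```python
from dataclasses import dataclass, field
from itertools import count

_ids = count()  # itemset ids index into the per-processor counter arrays

@dataclass
class TrieNode:
    """One candidate itemset on the shared trie (fields as in Figure 2)."""
    item: str          # last item of the itemset, e.g. 'C' for ABC
    itemid: int        # id used to index the private counter arrays
    local_count: list  # per-processor support totals over scanned intervals
    inter_count: list  # per-processor number of intervals counted so far
    children: dict = field(default_factory=dict)

def insert(root, itemset, nprocs):
    """Walk the branch for `itemset` (a sorted sequence of items),
    creating a node for every prefix that is not yet on the trie."""
    node = root
    for it in itemset:
        if it not in node.children:
            node.children[it] = TrieNode(it, next(_ids),
                                         [0] * nprocs, [0] * nprocs)
        node = node.children[it]
    return node

root = TrieNode("", next(_ids), [], [])
insert(root, ("A", "B", "C"), nprocs=4)  # creates nodes for A, AB, ABC
```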

APM (Adaptive Parallel Mining) uses the dynamic candidate generation technique to generate the common candidates asynchronously. In order to store candidates of different sizes together, a trie instead of a hash tree is used to store the supports. Figure 2 shows a trie in APM. Every node of the trie is associated with a candidate itemset. For example, on the first branch of the trie, the nodes store the supports of the candidates A, AB, ABC, AC and AD.

Figure 3: Database, partitions and intervals. The database D is divided logically into partitions D_1, D_2, ..., D_n, each consisting of a sequence of intervals.

In order to generate candidates, each partition is divided into smaller intervals, and dynamic candidate generation is performed on the intervals by the corresponding processor (Figure 3). New candidates are stored on the trie once they are generated. Candidates generated by one processor are shared by the other processors once they are inserted into the trie. All processors perform the generation and counting cycle on their own partitions until all candidates on the trie have been counted by all the processors and no new candidate is generated.

3.2 The Algorithm

We present the APM algorithm in Figure 4; the program fragment is executed by each processor i on its partition D_i, 1 ≤ i ≤ n. The algorithm consists of two phases. The steps of the first phase are: (1) The database is divided into intervals, and these intervals are further divided into groups to form a skewed partitioning. (The skewness is created by clustering the intervals; details are in Section 3.3.) Each processor then scans its partition to compute the support counts of the size-1 itemsets. This yields the size-1 large itemsets L_1. (2) The size-2 candidates C_2 are computed from L_1. (3) Virtual partition pruning is performed to reduce the size of C_2 (see Section 3.3). (4) A shared trie for recording support counts is initialized with the remaining candidates in C_2. (5) Inter-partition and intra-partition interval configurations are performed to increase the homogeneity of the data distribution in the partitions (see Section 3.4). At the end of phase 1, we have a trie containing the size-2 candidates and a partitioning with high homogeneity. In phase 2, every processor performs the dynamic generation and counting in its partition with the following steps: (1) It scans intervals and computes the locally large itemsets in each interval. (2) It uses the locally large itemsets found to generate new candidates. (3) It applies virtual partition pruning to the new candidates before inserting them into the shared trie. (4) Before starting a new round of counting on the next interval, every processor traverses the trie to remove itemsets which are globally small. If all the processors have counted all the itemsets on the trie, the algorithm terminates.

/* Preprocessing:
   (1) all processors scan their partitions to compute local supports of size-1 itemsets in their intervals;
   (2) compute L_1 and generate C_2 = Apriori-gen(L_1);
   (3) perform a virtual partition pruning on C_2;
   (4) initialize the shared trie with the remaining size-2 candidates;
   (5) perform inter-partition interval configuration and intra-partition interval configuration to prepare a homogeneous distribution */

/* Parallel execution: every processor i runs the following fragment on its partition D_i */
1) while (some processor has not finished the counting of all the itemsets on the trie on its partition)
2) {  while (processor i has not finished the counting of all the itemsets on the trie on D_i)
3)    {  scan the next interval on D_i and count the supports of the itemsets on the trie in the interval scanned;
4)       find the locally large itemsets among the itemsets on the trie;
5)       generate new candidates from these locally large itemsets;
6)       perform virtual partition pruning on these candidates and insert the survivors into the trie;
7)       remove globally small itemsets from the trie;
8)    }
9) }

Figure 4: The APM Algorithm

3.3 Virtual Partition from Clustering

As discussed in Section 2.3, the initial set of candidates used in the dynamic generation of APM is important in controlling the total number of candidate sets. We use one scan of the database to compute L_1 and generate the size-2 candidates from L_1. Then we use the virtual partition technique to prune the candidates, and use the remaining sets as the initial candidates of the dynamic generation in APM. The skewness of the virtual partitioning of the database increases the effect of the pruning. We use a k-clustering technique to generate a skewed virtual partitioning. (In the k-clustering technique, the number of resulting clusters, k, is a control parameter.)

To prepare for the dynamic generation, D is divided into a set of equal-size small intervals. While computing the size-1 large itemsets, an item support vector (or support vector) is generated for every interval. A support vector contains the support counts of all the items in the interval; it represents the distribution of size-1 itemsets in the interval. In the multidimensional space of the support vectors, a set of k clusters is computed. We logically merge the intervals of every cluster to create a virtual partitioning of D. The skewness of the partitioning so generated is much higher than that of a random partitioning. The technique described in Section 2.3 is then used to generate the size-2 candidates. This candidate set will be much smaller than that of Apriori. The smaller size-2 candidate set reduces the initial error of the dynamic generation in APM. This observation has been confirmed in our performance studies.
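A minimal sketch of this clustering step, assuming scikit-learn's KMeans as the k-clustering (the paper does not prescribe a particular clustering algorithm) and toy support vectors:

```python
import numpy as np
from sklearn.cluster import KMeans

def virtual_partitions(support_vectors, k):
    """Cluster the per-interval item support vectors and logically merge
    the intervals of each cluster into one virtual partition of D.

    support_vectors: array of shape (num_intervals, num_items), where row j
    holds the support counts of all items in interval j.
    Returns a list of k lists of interval indices (no data is moved)."""
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(support_vectors)
    return [list(np.flatnonzero(labels == c)) for c in range(k)]

# Toy example: 6 intervals over 3 items, forming two obvious groups.
vecs = np.array([[9, 1, 0], [8, 2, 1], [9, 0, 1],
                 [0, 7, 9], [1, 8, 8], [0, 9, 9]])
print(virtual_partitions(vecs, k=2))  # e.g. [[0, 1, 2], [3, 4, 5]]
```

Each returned index list is one virtual partition; virtual partition pruning is then applied with respect to these logical partitions.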

3.4 Homogeneous Inter-partition Distribution

Since the database D must be divided into partitions and intervals to perform dynamic generation, we need to control two itemset distributions: we want both the inter-partition and the intra-partition distributions to be homogeneous. We use the result of the k-clustering in the first pass to create a homogeneous inter-partition distribution. The distributions of itemsets in the intervals belonging to a cluster are very similar. If every partition contains a similar number of intervals from every cluster, the itemset distribution between the partitions will be more homogeneous than under a random partitioning. We distribute intervals to partitions following this rule. Since all intervals are accessible by every processor, no physical movement of storage is required to produce these partitions; every processor just needs to know the sequence of intervals belonging to its partition.

3.5 Adaptive Interval Configuration

APM uses another adaptive technique to produce intra-partition interval configurations with high homogeneity. It shuffles the intervals in a partition and merges them iteratively until the homogeneity reaches a certain measurable level. Homogeneity would be guaranteed if we merged all the intervals into one big interval; however, we also need the intervals to be small. In order to adaptively create a near-optimal interval configuration, we need to define a measure of the homogeneity of a configuration.

Definition 1 Given two sets of large itemsets L_i and L_j, the distance between them is defined as

$$dist(L_i, L_j) = 1 - \frac{|L_i \cap L_j|}{|L_i \cup L_j|}.$$

The operator dist has the following properties:
1. dist(L_i, L_j) ∈ [0, 1];
2. dist(L_i, L_i) = 0;
3. dist(L_i, L_j) = dist(L_j, L_i);
4. dist(L_i, L_j) ≤ dist(L_i, L_k) + dist(L_k, L_j);
5. dist(L_i, L_j) = 1 if L_i ∩ L_j = ∅.

Example 1 Let L_1 = {A, B, C, AB, BC} and L_2 = {A, B, D, AB, BD} be two sets of large itemsets. The distance between L_1 and L_2 is dist(L_1, L_2) = 1 - |L_1 ∩ L_2| / |L_1 ∪ L_2| = 1 - 3/7 = 4/7.

Definition 2 A division Dv = {I_1, I_2, ..., I_m} of a partition D_i is a set of disjoint intervals dividing D_i such that D_i = ∪_{j=1}^{m} I_j. The evenness factor E_f(Dv) of Dv is defined by

$$E_f(Dv) = \frac{\sum_{j=1}^{m} dist(L_j, L)}{m},$$

where L_j is the set of locally large itemsets in I_j, 1 ≤ j ≤ m, and L is the set of large itemsets in D_i.

The evenness factor has a value in [0, 1]. It equals zero if all the intervals have the same set of large itemsets. Let Dv_a = {I_1, I_2, ..., I_p} be a division of a partition D_i. A division Dv_b = {I_{s_1}, I_{s_2}, ..., I_{s_p}} of D_i is an equivalent division of Dv_a if it is a re-ordering of the intervals in Dv_a.
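The distance of Definition 1 is the Jaccard distance between the two sets of itemsets; Definitions 1 and 2 translate directly into Python (the set representation is ours):

```python
def dist(Li, Lj):
    """Jaccard distance between two sets of large itemsets (Definition 1)."""
    if not Li and not Lj:
        return 0.0
    return 1.0 - len(Li & Lj) / len(Li | Lj)

def evenness(interval_larges, partition_large):
    """Evenness factor of a division (Definition 2): the mean distance
    between each interval's locally large itemsets and the partition's
    large itemsets; 0 when every interval agrees with the partition."""
    return (sum(dist(L, partition_large) for L in interval_larges)
            / len(interval_larges))

# Example 1 from the text: dist = 1 - 3/7 = 4/7.
L1 = {"A", "B", "C", "AB", "BC"}
L2 = {"A", "B", "D", "AB", "BD"}
print(dist(L1, L2))  # 0.5714..., i.e. 4/7
```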

Definition 3 Given a division Dv_a = {I_1, I_2, ..., I_p} of a partition D_i, a division Dv_m = {I_{m_1}, I_{m_2}, ..., I_{m_l}} of D_i is a k-merge of Dv_a if l = ⌊p/k⌋ and there exists an equivalent division Dv_b = {I_{s_1}, I_{s_2}, ..., I_{s_p}} of Dv_a such that

$$I_{m_i} = \bigcup_{j=(i-1)k+1}^{ik} I_{s_j} \quad (1 \leq i \leq l-1), \qquad I_{m_l} = \bigcup_{j=(l-1)k+1}^{p} I_{s_j}.$$

A k-merge of a division is a shuffling and merging of the intervals of the division such that each resulting interval contains k intervals of the original division. Note that the last interval I_{m_l} of the k-merge may contain more than k intervals.

Example 2 Let Dv_a = {I_1, I_2, ..., I_{10}} be a division of a partition D_i. Then Dv_b = {I_{j_1}, I_{j_2}, I_{j_3}}, where I_{j_1} = I_1 ∪ I_4 ∪ I_5, I_{j_2} = I_2 ∪ I_3 ∪ I_7, and I_{j_3} = I_6 ∪ I_8 ∪ I_9 ∪ I_{10}, is a 3-merge of Dv_a.

We use an adaptive technique to generate merges for each partition that produce a homogeneous intra-partition interval configuration. After the inter-partition configuration, each partition contains an equal number of intervals. We perform another k-clustering on the support vectors of the intervals inside each partition. Let Dv be a division of a partition D_i, and let G_1, G_2, ..., G_k be the k clusters generated from the support vectors, arranged in decreasing order of size. We again pick intervals alternately from the clusters in a round-robin fashion to re-order the intervals in Dv, and then perform a k-merge on the re-ordered Dv. What we are doing here is logically re-arranging the order of the intervals in each partition and grouping them into bigger intervals such that each of these bigger intervals contains a similar number of intervals from every cluster. As a result, the distribution of the itemsets among these bigger intervals will be more homogeneous than that over the smaller intervals; hence, the homogeneity of the new division is increased.

In order to measure the homogeneity of a resulting merge efficiently, we restrict the computation of the evenness factor to size-1 itemsets. In other words, if Dv = {I_1, I_2, ..., I_p} is a division of a partition D_i, we modify the evenness factor to

$$E_f(Dv) = \frac{\sum_{j=1}^{p} dist(L^1_j, L_1)}{p},$$

where L^1_j is the set of size-1 locally large itemsets in I_j, 1 ≤ j ≤ p, and L_1 is the set of size-1 large itemsets in D_i.

Example 3 Let Dv = {I_1, I_2} be a division of partition D_i. Suppose the set of size-1 large itemsets in D_i is L_1 = {A, C}, the set of size-1 locally large itemsets of I_1 is L^1_1 = {A, C}, and that of I_2 is L^1_2 = {A}. Then dist(L^1_1, L_1) = 1 - 2/2 = 0 and dist(L^1_2, L_1) = 1 - 1/2 = 0.5. Hence, E_f(Dv) = [dist(L^1_1, L_1) + dist(L^1_2, L_1)]/2 = 0.25.

We compute the evenness factor after a k-merge has been created. If the factor is less than a threshold, the adaptive configuration stops; otherwise, the re-configuration and merging of the intervals is repeated until one of the following three conditions becomes true: (1) the evenness factor is smaller than a threshold; (2) the rate of change of the evenness factor is less than a threshold; (3) the number of resulting intervals falls below a threshold. The first condition has been discussed. If E_{f_i} and E_{f_j} are the evenness factors of two consecutive merges, the rate of change is defined as

$$\Delta E_f = \frac{|E_{f_i} - E_{f_j}|}{E_{f_i}}.$$

A small rate of change indicates that further merging would not improve the factor; therefore, the process should stop. The third condition controls the number of intervals in the division: the merge should not proceed further if the intervals have become too large.
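A sketch of the re-order-and-merge step: intervals are drawn round-robin from the clusters and grouped k at a time, with the remainder folded into the last merged interval (Definition 3). The cluster contents below are hypothetical, and at least k intervals are assumed.

```python
def k_merge(clusters, k):
    """Round-robin re-ordering of intervals drawn from the clusters,
    followed by a k-merge: every k consecutive intervals form one bigger
    interval, and any remainder joins the last one (Definition 3).

    clusters: lists of interval ids, assumed sorted by decreasing size.
    Returns the merged division as lists of interval ids."""
    # Pick intervals alternately from the clusters in round-robin fashion.
    order = []
    pools = [list(c) for c in clusters]
    while any(pools):
        for p in pools:
            if p:
                order.append(p.pop(0))
    # Group k at a time; fold the remainder into the last merged interval.
    l = len(order) // k
    merged = [order[i * k:(i + 1) * k] for i in range(l)]
    if len(order) % k:
        merged[-1].extend(order[l * k:])
    return merged

# Example 2 shape: 10 intervals, a 3-merge gives three merged intervals,
# the last containing 4 of the originals.
print(k_merge([[1, 4, 5, 6], [2, 3, 7], [8, 9, 10]], k=3))
```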

3.6 Implementation

Itemset Counting

When a candidate itemset is first generated, a node is inserted into the trie for it. The status information associated with the node is shown in Figure 2. The Item register stores the last item of the itemset; e.g., C is stored in the node if the itemset is ABC. The ItemID register stores an id assigned to the itemset for indexing into the counter arrays in memory. Every processor has a private counter array for storing the local supports of the candidate sets in the interval being scanned. (The array is just a data structure for storing the counts of the itemsets found in an interval; it could be replaced by a hash table if space usage is a concern. We use an array to simplify the discussion.) In addition, two counter arrays, LocalCount and InterCount, are created at each node. LocalCount records, for every processor, the total local support count of the itemset over all the intervals scanned so far. InterCount records, for every processor, the number of intervals already counted for the itemset.

After the nodes of all new candidate itemsets have been created and inserted into the trie at a checkpoint, the corresponding processor scans the next interval. The supports of the candidates of the trie in the new interval are stored first in the processor's private counter array. At the next checkpoint, the supports in the private counter array are used to determine the locally large itemsets of the interval, which are then used to generate the next set of candidates. In addition, the counts in the private counter array are added to the LocalCount array to update the total support counts of the candidates. The private counter arrays are reset to zero for every interval. The values in InterCount are also updated to reflect the number of intervals counted. Note that the processors perform itemset insertion and support count updates concurrently on the trie. The only constraint is that the local support of a newly inserted itemset starts to be counted by a processor on its partition when the processor begins a new interval scanning cycle.

Pruning Candidates at Checkpoints

It is useful to prune away as many candidates as possible during the dynamic generation and counting on the trie. APM applies partition pruning at every checkpoint to prune newly generated candidates. The computation of the upper bound in partition pruning needs to be modified slightly for this purpose. Let D_i, 1 ≤ i ≤ n, be the partitions. For a given size-k itemset X, partition pruning requires the local support counts of all the size-(k-1) subsets of X to compute a bound. In the asynchronous case, when a new candidate is generated by a processor, the local supports of some of its subsets in the other partitions may not yet be available on the trie. To accommodate this, we use a more conservative bound on the support count. Suppose we want to determine whether a candidate X can be pruned at a checkpoint. Assume that each partition D_i, 1 ≤ i ≤ n, has been divided into m intervals and that each interval contains M transactions. Let Y ⊂ X. If Y has been counted over t intervals of D_i, and the accumulated support of Y over those t intervals is Y.sup(i,t), then Y.sup(i) is bounded by Y.sup(i,t) + M × (m - t). With this bound, we can compute an upper bound for X.sup. If the result is less than s × |D|, then X can be pruned away. Note that this pruning is more effective when most subsets of X have been counted completely.

At each checkpoint, a processor can traverse the trie to identify globally large and globally small itemsets. All globally small itemsets found are removed before the processor starts a counting cycle. We also remove the supersets of the small itemsets which are on the same branch. (We do not want to search the whole trie for supersets of small itemsets; only those on the same branch are removed.) For example, if BC is globally small, then its superset BCD will be removed (Figure 2).
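A sketch of the conservative checkpoint bound above: since a subset Y may not yet be fully counted in partition i, its unseen intervals are assumed full, giving Y.sup(i) ≤ Y.sup(i,t) + M × (m - t). The dictionary layout and names are our own illustration.

```python
from itertools import combinations

def checkpoint_maxsup(X, sup_so_far, intervals_counted, m, M):
    """Conservative upper bound on the global support of candidate X.

    sup_so_far[i][Y]: accumulated support of subset Y over the intervals
    of partition i counted so far (Y.sup(i,t)); intervals_counted[i][Y]:
    the number t of the m intervals of partition i counted for Y.  Each
    interval holds M transactions, so Y.sup(i) <= Y.sup(i,t) + M*(m - t)."""
    k = len(X)
    bound = 0
    for i in range(len(sup_so_far)):
        bound += min(
            sup_so_far[i].get(frozenset(Y), 0)
            + M * (m - intervals_counted[i].get(frozenset(Y), 0))
            for Y in combinations(X, k - 1))
    return bound

# X is pruned at the checkpoint if checkpoint_maxsup(...) < s * |D|.
```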

Termination Condition

APM terminates when all candidates on the trie have been counted by all the processors. A processor can determine whether it has completed the counting of an itemset by comparing the value in InterCount with the total number of intervals. Every processor i keeps a counter NodeCount(i) of the number of itemsets that it has completely counted. APM also keeps a counter NumItemsets of the number of candidates on the trie. By comparing the values of these two counters, a processor can determine at a checkpoint whether it has completely counted all the candidate itemsets. Hence, the termination condition can be verified by all the processors independently.

Optimization Cost

We discuss here the cost of the optimizations. The cost of producing a more homogeneous interval configuration involves the clustering and merging of intervals. As will be shown in the performance studies in the next section, the cost of clustering is negligible, because the data set in the clustering consists of only one vector from each interval; hence, the data set is small. Also, the merging of the intervals does not involve physical movement of data. Since the database is on the shared disk, the reconfiguration and merging of the intervals can be implemented by assigning different sequences of pages to the partitions and processors. As for the cost of virtual partition pruning on the size-2 candidates, it requires the storage of the local supports of all the size-1 items with respect to all the partitions. However, this is only required in the first iteration, and the size is proportional to the number of items and partitions only. As for the memory space requirement, the main requirement is for the trie, which is proportional to the number of candidate sets and processors. However, since the trie is shared, the amount of space required per processor remains proportional to the number of candidate sets. With the help of the pruning of candidate sets, the space requirement can be reduced substantially. Our performance studies have confirmed this observation (see the next section).

4 Performance Studies

We have carried out extensive performance studies on a 12-node Sun Enterprise 4000 shared-memory multiprocessor. Each node is an UltraSPARC 250 MHz processor running Solaris 2.6. The machine has 1 GB of main memory.

4.1 Synthetic Database Generation

We follow the methodology proposed in [2] to develop synthetic databases for the performance study. In order to test the effect of the dynamic interval configuration, we need skewed data distributions. We enhanced the procedures in [2] to generate partitions and introduced a technique to control their skewness. Table 2 lists the parameters of our synthetic databases.

Suppose we want to generate n database partitions, each of size D/n. We first generate nρ smaller data intervals, each containing D/(nρ) transactions. Their skewness is controlled by the parameter S. Subsequently, the intervals are divided into n groups, each of ρ intervals. Finally, the intervals in every group are combined to form the partitions of the test database.

D    total number of transactions in the database
T    average size of the transactions
I    average size of the maximal potentially large itemsets
L    number of maximal potentially large itemsets
N    number of items
S    interval skewness
n    number of partitions
ρ    number of intervals in each partition

Table 2: Synthetic Database Parameters

All intervals, and hence partitions, are generated from a pool of potentially large itemsets as in [2]. We first generate the relative weights of these itemsets. These weights are then broken down into nρ weights, one for each interval. In other words, every itemset in the pool has nρ weights associated with it, each corresponding to the probability of occurrence of the itemset in one interval. The weight of each itemset in the pool of size L is picked from an exponential distribution with unit mean. Then we pick a skewness level s for the itemset from a normal distribution with mean S and variance 0.1. After that we generate nρ probability values from an exponential distribution with variance equal to s, and normalize them so that their sum equals 1. These nρ probability values are randomly assigned to the nρ intervals. Eventually, the weight of the itemset is broken down into nρ weights by multiplying it with the probability values. The procedure is iterated to generate the weights of all the itemsets and their breakdowns.

The second step is to generate the itemsets in the pool. We first divide the N items into nρ disjoint equal ranges. In order to control the skewness of the itemsets, we regard the i-th (1 ≤ i ≤ nρ) probability value generated for an itemset in the previous step as the probability of choosing items from the i-th range to put into the itemset. The size of an itemset is determined by a Poisson distribution with parameter I, as in [2]. All items of the first itemset are chosen randomly from some ranges. The ranges are determined by tossing an (nρ)-sided weighted coin, where the weight of side i is the i-th of the nρ probability values of the itemset. Items are picked until their number equals the size of the itemset. Some items of each subsequent itemset are copied from the previous itemset according to the correlation level, while the remaining items are picked in the same way as for the first itemset.

The third step is to generate the transactions in all the intervals. The size of a transaction is determined by a Poisson distribution with parameter T. When generating interval i (1 ≤ i ≤ nρ), we normalize the i-th weights of all the itemsets so that their sum equals 1, and use the normalized i-th weight as the probability of occurrence of the associated itemset in interval i. Each transaction in the interval is assigned a series of large itemsets, which are chosen by tossing an L-sided weighted coin, where the weight of a side is the i-th weight of the associated itemset. We also incorporate the corruption factor of [2] to control the generation.

The last step is to combine the nρ intervals into n partitions. The nρ intervals are equally divided into n groups, and the intervals in each group are concatenated to form a new partition. The result is n database partitions, each of size D/n and containing ρ intervals.
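Under our reading of the first step, each itemset's weight is split across the nρ intervals as follows; this is a hedged numpy sketch, and the exact distributions (in particular how the variance s sets the exponential scale) are interpretations of the text, not the paper's generator.

```python
import numpy as np

rng = np.random.default_rng(0)

def itemset_interval_weights(num_itemsets, num_intervals, S):
    """Break each itemset's global weight into per-interval weights.

    Weight ~ Exp(mean 1); skewness level s ~ N(S, variance 0.1); the
    interval probabilities are exponential with variance s (scale sqrt(s)),
    normalized to sum to 1 and randomly assigned to the intervals."""
    weights = rng.exponential(1.0, size=num_itemsets)
    out = np.empty((num_itemsets, num_intervals))
    for j, w in enumerate(weights):
        s = max(rng.normal(S, np.sqrt(0.1)), 1e-3)  # keep the scale positive
        p = rng.exponential(np.sqrt(s), size=num_intervals)
        p /= p.sum()
        rng.shuffle(p)  # random assignment of the probabilities to intervals
        out[j] = w * p
    return out

# e.g. 1000 potentially large itemsets split over n*rho = 96 intervals
w = itemset_interval_weights(1000, 96, S=0.7)
```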

4.2 Relative Performance

In the studies of relative performance, we have two goals: (1) to find out how much improvement APM can deliver over the level-wise algorithm CD; (2) to separate the improvement due to dynamic generation from that due to the two optimization techniques. For these purposes, besides APM and CD, we also implemented two variants of APM. The first variant, APM-DIC, is a direct parallelization of DIC with no optimization; in other words, APM-DIC is the algorithm in Figure 4 without steps (3), (4) and (5) of the preprocessing part. The second variant, APM-AIC, adds adaptive interval configuration to APM-DIC to bring in intra-partition homogeneity. In the implementation of CD, a hash tree is shared by all the processors to store their support counts. Every processor has a private counter array to record the support counts of the candidates in its partition. At the end of each iteration, these support counts are copied from the arrays to the hash tree, and a master processor computes the large itemsets and generates the candidates for the next iteration.

Name              T    I    Partition Size
D(1024n)K.T5.I    5         MB
D(512n)K.T8.I     8         MB
D(512n)K.T10.I    10        MB
D(512n)K.T10.I    10        MB
D(256n)K.T20.I    20        MB

Table 3: Test Databases (the I values and the partition sizes did not survive transcription)

Five series of databases with different attributes have been generated for the studies; their attributes are summarized in Table 3. The names of the databases have the form Dx.Ty.Iz, where x is the total number of transactions, y is the average size of the transactions, and z is the average size of the itemsets. In Table 3, the total number of transactions is specified in terms of n, the number of partitions, and the number of transactions in a partition. This allows us to compare the performance on databases with the same characteristics but different numbers of partitions. We set the parameters of Table 2 to the following values: N = 1000, L = 1000, S = 0.7, ρ = 8, and correlation level = 0.5.

We ran CD, APM-DIC, APM-AIC and APM on the five series of databases. The minimum support threshold is 1% in the first four series and 2% in the last series. Figure 5 shows the performance of these algorithms. The ratios of the response times versus that of CD are shown in Table 4. APM is superior to CD in all cases: when n = 12, it is 2-5 times faster; when n = 8, it is about 3 times faster; when n = 1 (no partitioning), it is still about 2 times faster, except in one single case. When comparing APM-DIC with CD, we found that the gain from pure dynamic generation is not impressive. When n = 1, APM-DIC is equivalent to the serial algorithm DIC, and the results show that it performs worse than CD, as predicted. This shows that DIC without optimization cannot bring much gain in general. When we compared APM-AIC with APM-DIC, we found a significant effect from intra-partition interval configuration in some cases; however, the gain is not uniform over all the cases. In some cases, the skewness of the data distribution among the partitions caused APM-AIC to generate many unnecessary candidates. Lastly, when we compared APM with APM-AIC, we found that the effects of inter-partition interval configuration and the initial candidate set reduction produce a substantial gain consistently.


More information

Association Rules. A. Bellaachia Page: 1

Association Rules. A. Bellaachia Page: 1 Association Rules 1. Objectives... 2 2. Definitions... 2 3. Type of Association Rules... 7 4. Frequent Itemset generation... 9 5. Apriori Algorithm: Mining Single-Dimension Boolean AR 13 5.1. Join Step:...

More information

DATA MINING II - 1DL460

DATA MINING II - 1DL460 DATA MINING II - 1DL460 Spring 2013 " An second class in data mining http://www.it.uu.se/edu/course/homepage/infoutv2/vt13 Kjell Orsborn Uppsala Database Laboratory Department of Information Technology,

More information

Model for Load Balancing on Processors in Parallel Mining of Frequent Itemsets

Model for Load Balancing on Processors in Parallel Mining of Frequent Itemsets American Journal of Applied Sciences 2 (5): 926-931, 2005 ISSN 1546-9239 Science Publications, 2005 Model for Load Balancing on Processors in Parallel Mining of Frequent Itemsets 1 Ravindra Patel, 2 S.S.

More information

2 CONTENTS

2 CONTENTS Contents 5 Mining Frequent Patterns, Associations, and Correlations 3 5.1 Basic Concepts and a Road Map..................................... 3 5.1.1 Market Basket Analysis: A Motivating Example........................

More information

Apriori Algorithm. 1 Bread, Milk 2 Bread, Diaper, Beer, Eggs 3 Milk, Diaper, Beer, Coke 4 Bread, Milk, Diaper, Beer 5 Bread, Milk, Diaper, Coke

Apriori Algorithm. 1 Bread, Milk 2 Bread, Diaper, Beer, Eggs 3 Milk, Diaper, Beer, Coke 4 Bread, Milk, Diaper, Beer 5 Bread, Milk, Diaper, Coke Apriori Algorithm For a given set of transactions, the main aim of Association Rule Mining is to find rules that will predict the occurrence of an item based on the occurrences of the other items in the

More information

Parallelizing Frequent Itemset Mining with FP-Trees

Parallelizing Frequent Itemset Mining with FP-Trees Parallelizing Frequent Itemset Mining with FP-Trees Peiyi Tang Markus P. Turkia Department of Computer Science Department of Computer Science University of Arkansas at Little Rock University of Arkansas

More information

INTELLIGENT SUPERMARKET USING APRIORI

INTELLIGENT SUPERMARKET USING APRIORI INTELLIGENT SUPERMARKET USING APRIORI Kasturi Medhekar 1, Arpita Mishra 2, Needhi Kore 3, Nilesh Dave 4 1,2,3,4Student, 3 rd year Diploma, Computer Engineering Department, Thakur Polytechnic, Mumbai, Maharashtra,

More information

CHAPTER V ADAPTIVE ASSOCIATION RULE MINING ALGORITHM. Please purchase PDF Split-Merge on to remove this watermark.

CHAPTER V ADAPTIVE ASSOCIATION RULE MINING ALGORITHM. Please purchase PDF Split-Merge on   to remove this watermark. 119 CHAPTER V ADAPTIVE ASSOCIATION RULE MINING ALGORITHM 120 CHAPTER V ADAPTIVE ASSOCIATION RULE MINING ALGORITHM 5.1. INTRODUCTION Association rule mining, one of the most important and well researched

More information

Performance Evaluation of Sequential and Parallel Mining of Association Rules using Apriori Algorithms

Performance Evaluation of Sequential and Parallel Mining of Association Rules using Apriori Algorithms Int. J. Advanced Networking and Applications 458 Performance Evaluation of Sequential and Parallel Mining of Association Rules using Apriori Algorithms Puttegowda D Department of Computer Science, Ghousia

More information

Mining Association Rules with Item Constraints. Ramakrishnan Srikant and Quoc Vu and Rakesh Agrawal. IBM Almaden Research Center

Mining Association Rules with Item Constraints. Ramakrishnan Srikant and Quoc Vu and Rakesh Agrawal. IBM Almaden Research Center Mining Association Rules with Item Constraints Ramakrishnan Srikant and Quoc Vu and Rakesh Agrawal IBM Almaden Research Center 650 Harry Road, San Jose, CA 95120, U.S.A. fsrikant,qvu,ragrawalg@almaden.ibm.com

More information

An On-line Variable Length Binary. Institute for Systems Research and. Institute for Advanced Computer Studies. University of Maryland

An On-line Variable Length Binary. Institute for Systems Research and. Institute for Advanced Computer Studies. University of Maryland An On-line Variable Length inary Encoding Tinku Acharya Joseph F. Ja Ja Institute for Systems Research and Institute for Advanced Computer Studies University of Maryland College Park, MD 242 facharya,

More information

CHAPTER 3 ASSOCIATION RULE MINING WITH LEVELWISE AUTOMATIC SUPPORT THRESHOLDS

CHAPTER 3 ASSOCIATION RULE MINING WITH LEVELWISE AUTOMATIC SUPPORT THRESHOLDS 23 CHAPTER 3 ASSOCIATION RULE MINING WITH LEVELWISE AUTOMATIC SUPPORT THRESHOLDS This chapter introduces the concepts of association rule mining. It also proposes two algorithms based on, to calculate

More information

PARALLEL EXECUTION OF HASH JOINS IN PARALLEL DATABASES. Hui-I Hsiao, Ming-Syan Chen and Philip S. Yu. Electrical Engineering Department.

PARALLEL EXECUTION OF HASH JOINS IN PARALLEL DATABASES. Hui-I Hsiao, Ming-Syan Chen and Philip S. Yu. Electrical Engineering Department. PARALLEL EXECUTION OF HASH JOINS IN PARALLEL DATABASES Hui-I Hsiao, Ming-Syan Chen and Philip S. Yu IBM T. J. Watson Research Center P.O.Box 704 Yorktown, NY 10598, USA email: fhhsiao, psyug@watson.ibm.com

More information

PARALLEL & DISTRIBUTED DATABASES CS561-SPRING 2012 WPI, MOHAMED ELTABAKH

PARALLEL & DISTRIBUTED DATABASES CS561-SPRING 2012 WPI, MOHAMED ELTABAKH PARALLEL & DISTRIBUTED DATABASES CS561-SPRING 2012 WPI, MOHAMED ELTABAKH 1 INTRODUCTION In centralized database: Data is located in one place (one server) All DBMS functionalities are done by that server

More information

Egemen Tanin, Tahsin M. Kurc, Cevdet Aykanat, Bulent Ozguc. Abstract. Direct Volume Rendering (DVR) is a powerful technique for

Egemen Tanin, Tahsin M. Kurc, Cevdet Aykanat, Bulent Ozguc. Abstract. Direct Volume Rendering (DVR) is a powerful technique for Comparison of Two Image-Space Subdivision Algorithms for Direct Volume Rendering on Distributed-Memory Multicomputers Egemen Tanin, Tahsin M. Kurc, Cevdet Aykanat, Bulent Ozguc Dept. of Computer Eng. and

More information

FP-Growth algorithm in Data Compression frequent patterns

FP-Growth algorithm in Data Compression frequent patterns FP-Growth algorithm in Data Compression frequent patterns Mr. Nagesh V Lecturer, Dept. of CSE Atria Institute of Technology,AIKBS Hebbal, Bangalore,Karnataka Email : nagesh.v@gmail.com Abstract-The transmission

More information

(Preliminary Version 2 ) Jai-Hoon Kim Nitin H. Vaidya. Department of Computer Science. Texas A&M University. College Station, TX

(Preliminary Version 2 ) Jai-Hoon Kim Nitin H. Vaidya. Department of Computer Science. Texas A&M University. College Station, TX Towards an Adaptive Distributed Shared Memory (Preliminary Version ) Jai-Hoon Kim Nitin H. Vaidya Department of Computer Science Texas A&M University College Station, TX 77843-3 E-mail: fjhkim,vaidyag@cs.tamu.edu

More information

Mining Frequent Itemsets in Time-Varying Data Streams

Mining Frequent Itemsets in Time-Varying Data Streams Mining Frequent Itemsets in Time-Varying Data Streams Abstract A transactional data stream is an unbounded sequence of transactions continuously generated, usually at a high rate. Mining frequent itemsets

More information

16 Greedy Algorithms

16 Greedy Algorithms 16 Greedy Algorithms Optimization algorithms typically go through a sequence of steps, with a set of choices at each For many optimization problems, using dynamic programming to determine the best choices

More information

Cluster quality 15. Running time 0.7. Distance between estimated and true means Running time [s]

Cluster quality 15. Running time 0.7. Distance between estimated and true means Running time [s] Fast, single-pass K-means algorithms Fredrik Farnstrom Computer Science and Engineering Lund Institute of Technology, Sweden arnstrom@ucsd.edu James Lewis Computer Science and Engineering University of

More information

6 Distributed data management I Hashing

6 Distributed data management I Hashing 6 Distributed data management I Hashing There are two major approaches for the management of data in distributed systems: hashing and caching. The hashing approach tries to minimize the use of communication

More information

Extra-High Speed Matrix Multiplication on the Cray-2. David H. Bailey. September 2, 1987

Extra-High Speed Matrix Multiplication on the Cray-2. David H. Bailey. September 2, 1987 Extra-High Speed Matrix Multiplication on the Cray-2 David H. Bailey September 2, 1987 Ref: SIAM J. on Scientic and Statistical Computing, vol. 9, no. 3, (May 1988), pg. 603{607 Abstract The Cray-2 is

More information

What Is Data Mining? CMPT 354: Database I -- Data Mining 2

What Is Data Mining? CMPT 354: Database I -- Data Mining 2 Data Mining What Is Data Mining? Mining data mining knowledge Data mining is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data CMPT

More information

New Parallel Algorithms for Frequent Itemset Mining in Very Large Databases

New Parallel Algorithms for Frequent Itemset Mining in Very Large Databases New Parallel Algorithms for Frequent Itemset Mining in Very Large Databases Adriano Veloso and Wagner Meira Jr. Computer Science Dept. Universidade Federal de Minas Gerais adrianov, meira @dcc.ufmg.br

More information

Seminar on. A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm

Seminar on. A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm Seminar on A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm Mohammad Iftakher Uddin & Mohammad Mahfuzur Rahman Matrikel Nr: 9003357 Matrikel Nr : 9003358 Masters of

More information

An Efficient Algorithm for Finding Dense Regions for Mining Quantitative Association Rules

An Efficient Algorithm for Finding Dense Regions for Mining Quantitative Association Rules An Efficient Algorithm for Finding Dense Regions for Mining Quantitative Association Rules Wang Lian David W. Cheung S. M. Yiu Faculty of Information Technology Macao University of Science and Technology

More information

Data Mining: Mining Association Rules. Definitions. .. Cal Poly CSC 466: Knowledge Discovery from Data Alexander Dekhtyar..

Data Mining: Mining Association Rules. Definitions. .. Cal Poly CSC 466: Knowledge Discovery from Data Alexander Dekhtyar.. .. Cal Poly CSC 466: Knowledge Discovery from Data Alexander Dekhtyar.. Data Mining: Mining Association Rules Definitions Market Baskets. Consider a set I = {i 1,...,i m }. We call the elements of I, items.

More information

Knowledge Discovery from Web Usage Data: Research and Development of Web Access Pattern Tree Based Sequential Pattern Mining Techniques: A Survey

Knowledge Discovery from Web Usage Data: Research and Development of Web Access Pattern Tree Based Sequential Pattern Mining Techniques: A Survey Knowledge Discovery from Web Usage Data: Research and Development of Web Access Pattern Tree Based Sequential Pattern Mining Techniques: A Survey G. Shivaprasad, N. V. Subbareddy and U. Dinesh Acharya

More information

COURSE 12. Parallel DBMS

COURSE 12. Parallel DBMS COURSE 12 Parallel DBMS 1 Parallel DBMS Most DB research focused on specialized hardware CCD Memory: Non-volatile memory like, but slower than flash memory Bubble Memory: Non-volatile memory like, but

More information

Finding Frequent Patterns Using Length-Decreasing Support Constraints

Finding Frequent Patterns Using Length-Decreasing Support Constraints Finding Frequent Patterns Using Length-Decreasing Support Constraints Masakazu Seno and George Karypis Department of Computer Science and Engineering University of Minnesota, Minneapolis, MN 55455 Technical

More information

Chapter 7: Frequent Itemsets and Association Rules

Chapter 7: Frequent Itemsets and Association Rules Chapter 7: Frequent Itemsets and Association Rules Information Retrieval & Data Mining Universität des Saarlandes, Saarbrücken Winter Semester 2013/14 VII.1&2 1 Motivational Example Assume you run an on-line

More information

Eect of fan-out on the Performance of a. Single-message cancellation scheme. Atul Prakash (Contact Author) Gwo-baw Wu. Seema Jetli

Eect of fan-out on the Performance of a. Single-message cancellation scheme. Atul Prakash (Contact Author) Gwo-baw Wu. Seema Jetli Eect of fan-out on the Performance of a Single-message cancellation scheme Atul Prakash (Contact Author) Gwo-baw Wu Seema Jetli Department of Electrical Engineering and Computer Science University of Michigan,

More information

SETM*-MaxK: An Efficient SET-Based Approach to Find the Largest Itemset

SETM*-MaxK: An Efficient SET-Based Approach to Find the Largest Itemset SETM*-MaxK: An Efficient SET-Based Approach to Find the Largest Itemset Ye-In Chang and Yu-Ming Hsieh Dept. of Computer Science and Engineering National Sun Yat-Sen University Kaohsiung, Taiwan, Republic

More information

Mining Frequent Patterns with Counting Inference at Multiple Levels

Mining Frequent Patterns with Counting Inference at Multiple Levels International Journal of Computer Applications (097 7) Volume 3 No.10, July 010 Mining Frequent Patterns with Counting Inference at Multiple Levels Mittar Vishav Deptt. Of IT M.M.University, Mullana Ruchika

More information

A DISTRIBUTED ALGORITHM FOR MINING ASSOCIATION RULES

A DISTRIBUTED ALGORITHM FOR MINING ASSOCIATION RULES A DISTRIBUTED ALGORITHM FOR MINING ASSOCIATION RULES Pham Nguyen Anh Huy *, Ho Tu Bao ** * Department of Information Technology, Natural Sciences University of HoChiMinh city 227 Nguyen Van Cu Street,

More information

Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R,

Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R, Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R, statistics foundations 5 Introduction to D3, visual analytics

More information

Parallel Approach for Implementing Data Mining Algorithms

Parallel Approach for Implementing Data Mining Algorithms TITLE OF THE THESIS Parallel Approach for Implementing Data Mining Algorithms A RESEARCH PROPOSAL SUBMITTED TO THE SHRI RAMDEOBABA COLLEGE OF ENGINEERING AND MANAGEMENT, FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

More information

Scalable Parallel Data Mining for Association Rules

Scalable Parallel Data Mining for Association Rules IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 12, NO. 3, MAY/JUNE 2000 337 Scalable arallel Data Mining for Association Rules Eui-Hong (Sam) Han, George Karypis, Member, IEEE, and Vipin Kumar,

More information

A Study of Query Execution Strategies. for Client-Server Database Systems. Department of Computer Science and UMIACS. University of Maryland

A Study of Query Execution Strategies. for Client-Server Database Systems. Department of Computer Science and UMIACS. University of Maryland A Study of Query Execution Strategies for Client-Server Database Systems Donald Kossmann Michael J. Franklin Department of Computer Science and UMIACS University of Maryland College Park, MD 20742 f kossmann

More information

Parallel DBMS. Chapter 22, Part A

Parallel DBMS. Chapter 22, Part A Parallel DBMS Chapter 22, Part A Slides by Joe Hellerstein, UCB, with some material from Jim Gray, Microsoft Research. See also: http://www.research.microsoft.com/research/barc/gray/pdb95.ppt Database

More information

Chapter 2. Related Work

Chapter 2. Related Work Chapter 2 Related Work There are three areas of research highly related to our exploration in this dissertation, namely sequential pattern mining, multiple alignment, and approximate frequent pattern mining.

More information

Dynamic Itemset Counting and Implication Rules For Market Basket Data

Dynamic Itemset Counting and Implication Rules For Market Basket Data Dynamic Itemset Counting and Implication Rules For Market Basket Data Sergey Brin, Rajeev Motwani, Jeffrey D. Ullman, Shalom Tsur SIGMOD'97, pp. 255-264, Tuscon, Arizona, May 1997 11/10/00 Introduction

More information

Pincer-Search: An Efficient Algorithm. for Discovering the Maximum Frequent Set

Pincer-Search: An Efficient Algorithm. for Discovering the Maximum Frequent Set Pincer-Search: An Efficient Algorithm for Discovering the Maximum Frequent Set Dao-I Lin Telcordia Technologies, Inc. Zvi M. Kedem New York University July 15, 1999 Abstract Discovering frequent itemsets

More information

CMPUT 391 Database Management Systems. Data Mining. Textbook: Chapter (without 17.10)

CMPUT 391 Database Management Systems. Data Mining. Textbook: Chapter (without 17.10) CMPUT 391 Database Management Systems Data Mining Textbook: Chapter 17.7-17.11 (without 17.10) University of Alberta 1 Overview Motivation KDD and Data Mining Association Rules Clustering Classification

More information

ANU MLSS 2010: Data Mining. Part 2: Association rule mining

ANU MLSS 2010: Data Mining. Part 2: Association rule mining ANU MLSS 2010: Data Mining Part 2: Association rule mining Lecture outline What is association mining? Market basket analysis and association rule examples Basic concepts and formalism Basic rule measurements

More information

Timeline: Obj A: Obj B: Object timestamp events A A A B B B B 2, 3, 5 6, 1 4, 5, 6. Obj D: 7, 8, 1, 2 1, 6 D 14 1, 8, 7. (a) (b)

Timeline: Obj A: Obj B: Object timestamp events A A A B B B B 2, 3, 5 6, 1 4, 5, 6. Obj D: 7, 8, 1, 2 1, 6 D 14 1, 8, 7. (a) (b) Parallel Algorithms for Mining Sequential Associations: Issues and Challenges Mahesh V. Joshi y George Karypis y Vipin Kumar y Abstract Discovery of predictive sequential associations among events is becoming

More information

The only known methods for solving this problem optimally are enumerative in nature, with branch-and-bound being the most ecient. However, such algori

The only known methods for solving this problem optimally are enumerative in nature, with branch-and-bound being the most ecient. However, such algori Use of K-Near Optimal Solutions to Improve Data Association in Multi-frame Processing Aubrey B. Poore a and in Yan a a Department of Mathematics, Colorado State University, Fort Collins, CO, USA ABSTRACT

More information

Raunak Rathi 1, Prof. A.V.Deorankar 2 1,2 Department of Computer Science and Engineering, Government College of Engineering Amravati

Raunak Rathi 1, Prof. A.V.Deorankar 2 1,2 Department of Computer Science and Engineering, Government College of Engineering Amravati Analytical Representation on Secure Mining in Horizontally Distributed Database Raunak Rathi 1, Prof. A.V.Deorankar 2 1,2 Department of Computer Science and Engineering, Government College of Engineering

More information

APRIORI ALGORITHM FOR MINING FREQUENT ITEMSETS A REVIEW

APRIORI ALGORITHM FOR MINING FREQUENT ITEMSETS A REVIEW International Journal of Computer Application and Engineering Technology Volume 3-Issue 3, July 2014. Pp. 232-236 www.ijcaet.net APRIORI ALGORITHM FOR MINING FREQUENT ITEMSETS A REVIEW Priyanka 1 *, Er.

More information

Cost Models for Query Processing Strategies in the Active Data Repository

Cost Models for Query Processing Strategies in the Active Data Repository Cost Models for Query rocessing Strategies in the Active Data Repository Chialin Chang Institute for Advanced Computer Studies and Department of Computer Science University of Maryland, College ark 272

More information

Advanced Databases: Parallel Databases A.Poulovassilis

Advanced Databases: Parallel Databases A.Poulovassilis 1 Advanced Databases: Parallel Databases A.Poulovassilis 1 Parallel Database Architectures Parallel database systems use parallel processing techniques to achieve faster DBMS performance and handle larger

More information

AC-Close: Efficiently Mining Approximate Closed Itemsets by Core Pattern Recovery

AC-Close: Efficiently Mining Approximate Closed Itemsets by Core Pattern Recovery : Efficiently Mining Approximate Closed Itemsets by Core Pattern Recovery Hong Cheng Philip S. Yu Jiawei Han University of Illinois at Urbana-Champaign IBM T. J. Watson Research Center {hcheng3, hanj}@cs.uiuc.edu,

More information

Appropriate Item Partition for Improving the Mining Performance

Appropriate Item Partition for Improving the Mining Performance Appropriate Item Partition for Improving the Mining Performance Tzung-Pei Hong 1,2, Jheng-Nan Huang 1, Kawuu W. Lin 3 and Wen-Yang Lin 1 1 Department of Computer Science and Information Engineering National

More information

ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation

ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation Weiping Liao, Saengrawee (Anne) Pratoomtong, and Chuan Zhang Abstract Binary translation is an important component for translating

More information

Huge market -- essentially all high performance databases work this way

Huge market -- essentially all high performance databases work this way 11/5/2017 Lecture 16 -- Parallel & Distributed Databases Parallel/distributed databases: goal provide exactly the same API (SQL) and abstractions (relational tables), but partition data across a bunch

More information

3 No-Wait Job Shops with Variable Processing Times

3 No-Wait Job Shops with Variable Processing Times 3 No-Wait Job Shops with Variable Processing Times In this chapter we assume that, on top of the classical no-wait job shop setting, we are given a set of processing times for each operation. We may select

More information

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets 2011 Fourth International Symposium on Parallel Architectures, Algorithms and Programming PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets Tao Xiao Chunfeng Yuan Yihua Huang Department

More information

CHAPTER 5 WEIGHTED SUPPORT ASSOCIATION RULE MINING USING CLOSED ITEMSET LATTICES IN PARALLEL

CHAPTER 5 WEIGHTED SUPPORT ASSOCIATION RULE MINING USING CLOSED ITEMSET LATTICES IN PARALLEL 68 CHAPTER 5 WEIGHTED SUPPORT ASSOCIATION RULE MINING USING CLOSED ITEMSET LATTICES IN PARALLEL 5.1 INTRODUCTION During recent years, one of the vibrant research topics is Association rule discovery. This

More information

DESIGN AND ANALYSIS OF ALGORITHMS. Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES

DESIGN AND ANALYSIS OF ALGORITHMS. Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES DESIGN AND ANALYSIS OF ALGORITHMS Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES http://milanvachhani.blogspot.in USE OF LOOPS As we break down algorithm into sub-algorithms, sooner or later we shall

More information

Decision Making. final results. Input. Update Utility

Decision Making. final results. Input. Update Utility Active Handwritten Word Recognition Jaehwa Park and Venu Govindaraju Center of Excellence for Document Analysis and Recognition Department of Computer Science and Engineering State University of New York

More information

Towards a Memory-Efficient Knapsack DP Algorithm

Towards a Memory-Efficient Knapsack DP Algorithm Towards a Memory-Efficient Knapsack DP Algorithm Sanjay Rajopadhye The 0/1 knapsack problem (0/1KP) is a classic problem that arises in computer science. The Wikipedia entry http://en.wikipedia.org/wiki/knapsack_problem

More information

Pushing Support Constraints Into Association Rules Mining. Ke Wang y. Simon Fraser University. Yu He z. National University of Singapore.

Pushing Support Constraints Into Association Rules Mining. Ke Wang y. Simon Fraser University. Yu He z. National University of Singapore. Pushing Support Constraints Into Association Rules Mining Ke Wang y Simon Fraser University Yu He z National University of Singapore Jiawei Han x Simon Fraser University Abstract Interesting patterns often

More information

Parallel Computing. Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides)

Parallel Computing. Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides) Parallel Computing 2012 Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides) Parallel Algorithm Design Outline Computational Model Design Methodology Partitioning Communication

More information

International Journal of Foundations of Computer Science c World Scientic Publishing Company DFT TECHNIQUES FOR SIZE ESTIMATION OF DATABASE JOIN OPERA

International Journal of Foundations of Computer Science c World Scientic Publishing Company DFT TECHNIQUES FOR SIZE ESTIMATION OF DATABASE JOIN OPERA International Journal of Foundations of Computer Science c World Scientic Publishing Company DFT TECHNIQUES FOR SIZE ESTIMATION OF DATABASE JOIN OPERATIONS KAM_IL SARAC, OMER E GEC_IO GLU, AMR EL ABBADI

More information

Parallel DBMS. Parallel Database Systems. PDBS vs Distributed DBS. Types of Parallelism. Goals and Metrics Speedup. Types of Parallelism

Parallel DBMS. Parallel Database Systems. PDBS vs Distributed DBS. Types of Parallelism. Goals and Metrics Speedup. Types of Parallelism Parallel DBMS Parallel Database Systems CS5225 Parallel DB 1 Uniprocessor technology has reached its limit Difficult to build machines powerful enough to meet the CPU and I/O demands of DBMS serving large

More information

size, runs an existing induction algorithm on the rst subset to obtain a rst set of rules, and then processes each of the remaining data subsets at a

size, runs an existing induction algorithm on the rst subset to obtain a rst set of rules, and then processes each of the remaining data subsets at a Multi-Layer Incremental Induction Xindong Wu and William H.W. Lo School of Computer Science and Software Ebgineering Monash University 900 Dandenong Road Melbourne, VIC 3145, Australia Email: xindong@computer.org

More information

A mining method for tracking changes in temporal association rules from an encoded database

A mining method for tracking changes in temporal association rules from an encoded database A mining method for tracking changes in temporal association rules from an encoded database Chelliah Balasubramanian *, Karuppaswamy Duraiswamy ** K.S.Rangasamy College of Technology, Tiruchengode, Tamil

More information

Parallel Algorithms for Discovery of Association Rules

Parallel Algorithms for Discovery of Association Rules Data Mining and Knowledge Discovery, 1, 343 373 (1997) c 1997 Kluwer Academic Publishers. Manufactured in The Netherlands. Parallel Algorithms for Discovery of Association Rules MOHAMMED J. ZAKI SRINIVASAN

More information

Data Access Paths for Frequent Itemsets Discovery

Data Access Paths for Frequent Itemsets Discovery Data Access Paths for Frequent Itemsets Discovery Marek Wojciechowski, Maciej Zakrzewicz Poznan University of Technology Institute of Computing Science {marekw, mzakrz}@cs.put.poznan.pl Abstract. A number

More information

Dynamic Itemset Counting and Implication Rules. for Market Basket Data

Dynamic Itemset Counting and Implication Rules. for Market Basket Data Dynamic Itemset Counting and Implication Rules for Market Basket Data Sergey Brin Rajeev Motwani y Jerey D. Ullman z Department of Computer Science Stanford University fsergey,rajeev,ullmang@cs.stanford.edu

More information

Unsupervised Learning

Unsupervised Learning Outline Unsupervised Learning Basic concepts K-means algorithm Representation of clusters Hierarchical clustering Distance functions Which clustering algorithm to use? NN Supervised learning vs. unsupervised

More information

Association Rules Outline

Association Rules Outline Association Rules Outline Goal: Provide an overview of basic Association Rule mining techniques Association Rules Problem Overview Large/Frequent itemsets Association Rules Algorithms Apriori Sampling

More information

Object A A A B B B B D. timestamp events 2, 3, 5 6, 1 1 4, 5, 6 2 7, 8, 1, 2 1, 6 1, 8, 7

Object A A A B B B B D. timestamp events 2, 3, 5 6, 1 1 4, 5, 6 2 7, 8, 1, 2 1, 6 1, 8, 7 A Universal Formulation of Sequential Patterns Mahesh Joshi George Karypis Vipin Kumar Department of Computer Science University of Minnesota, Minneapolis fmjoshi,karypis,kumarg@cs.umn.edu Technical Report

More information

Database Applications (15-415)

Database Applications (15-415) Database Applications (15-415) DBMS Internals- Part V Lecture 13, March 10, 2014 Mohammad Hammoud Today Welcome Back from Spring Break! Today Last Session: DBMS Internals- Part IV Tree-based (i.e., B+

More information

! Parallel machines are becoming quite common and affordable. ! Databases are growing increasingly large

! Parallel machines are becoming quite common and affordable. ! Databases are growing increasingly large Chapter 20: Parallel Databases Introduction! Introduction! I/O Parallelism! Interquery Parallelism! Intraquery Parallelism! Intraoperation Parallelism! Interoperation Parallelism! Design of Parallel Systems!

More information

Chapter 20: Parallel Databases

Chapter 20: Parallel Databases Chapter 20: Parallel Databases! Introduction! I/O Parallelism! Interquery Parallelism! Intraquery Parallelism! Intraoperation Parallelism! Interoperation Parallelism! Design of Parallel Systems 20.1 Introduction!

More information