An Adaptive Algorithm for Mining Association Rules on Shared-memory Parallel Machines

David W. Cheung (Department of Computer Science, The University of Hong Kong, Hong Kong; dcheung@cs.hku.hk)
Kan Hu, Shaowei Xia (Department of Automation, Tsinghua University, Beijing; swxia@mail.tsinghua.edu.cn)

Abstract

Mining association rules from large databases is very costly. We propose to develop parallel algorithms for this task on shared-memory multiprocessors (SMP). All parallel algorithms proposed for other paradigms follow the conventional level-wise approach: they need as many iterations as the length of the maximum large itemset. To make matters worse, they impose a synchronization in every iteration, which would cause serious I/O contention on a shared-memory parallel system. We propose an adaptive, asynchronous parallel mining algorithm, APM, for SMP: all processors generate candidates dynamically and count itemset supports independently, without synchronization. Two optimization techniques are proposed to reduce the database scanning and the number of candidates. The algorithm has been implemented on a Sun Enterprise 4000 shared-memory multiprocessor with 12 nodes. The experiments show that the optimizations are very effective and that APM has a substantial performance lead over the proposed level-wise algorithms.

Keywords: Data Mining, Parallel Databases, Association Rules, Parallel Mining, Parallel Computing, Shared-memory Multiprocessors.

1 Introduction

Mining association rules in a large database is an important problem in data mining [1, 2, 4, 6, 8, 11, 12, 14]. A representative class of users of this technique is the retail industry. A sales record in a retail database typically consists of all the items bought by a customer in a transaction, together with other pieces of information such as date, card identification number, etc. Mining association rules is to discover, from a huge number of transaction records, strong associations among the items, such that the presence of one item in a transaction may imply the presence of another.

It has been shown that the problem of mining association rules can be decomposed into two subproblems [1]. The first is to find all large itemsets, i.e., the itemsets contained in a significant number of transactions with respect to a threshold minimum support. The second is to generate all the association rules from the large itemsets found, with respect to another threshold, the minimum confidence. (Background on association rule mining is available in [2]; readers not familiar with terminology such as large itemset and support threshold may refer to the brief introduction in Section 2.) Since it is straightforward to generate the association rules once the large itemsets are available, the challenge is in computing the large itemsets.
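To make the two subproblems concrete, the following is a minimal Python sketch of both steps on a toy transaction database; the data, the thresholds and the function names are illustrative assumptions, not material from the paper.

```python
from itertools import combinations

# Toy transaction database (hypothetical data for illustration only).
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
]
minsup, minconf = 0.5, 0.8  # minimum support / confidence thresholds

def support(itemset):
    """Fraction of transactions containing every item of `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Subproblem 1: find all large (frequent) itemsets.
items = sorted(set().union(*transactions))
large = [frozenset(c) for k in range(1, len(items) + 1)
         for c in combinations(items, k) if support(frozenset(c)) >= minsup]

# Subproblem 2: generate the rules X => Y with enough confidence.
for itemset in (s for s in large if len(s) > 1):
    for k in range(1, len(itemset)):
        for x in combinations(sorted(itemset), k):
            X = frozenset(x)
            conf = support(itemset) / support(X)
            if conf >= minconf:
                print(set(X), "=>", set(itemset - X), f"conf={conf:.2f}")
```

On this toy data the only rule produced is {butter} => {bread} with confidence 1.0; the expensive step in practice is subproblem 1, which the rest of the paper addresses.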

The cost of computing the large itemsets consists of two main components: (1) the I/O cost of scanning the database; (2) for each transaction read, the cost of identifying all candidate itemsets contained in the transaction. For the second component, the dominant factor is the size of the search domain, i.e., the number of candidate itemsets.

One of the earliest algorithms proposed for association rule mining is the Apriori algorithm [1, 2]. It adopts a level-wise technique which prunes candidate sets by levels. The pruning uses the large itemsets found at one level (size k) to generate the candidates of the next level (size k+1). In general this significantly reduces the number of candidate sets after the first level. The main contribution of the Apriori algorithm is the development of techniques which use the search result at one level to prune the candidates of the next level, and hence reduce the search cost. The smaller the number of candidate sets, the faster the algorithm.

Since computing large itemsets is very costly in terms of CPU and I/O resources, there is a practical need to develop parallel algorithms for this task. Many algorithms for this purpose have been proposed for parallel systems with distributed (shared-nothing) memory [3, 5, 7, 13, 15]. With its popularity and cost effectiveness, the shared-memory multiprocessor (SMP) is another important parallel computing paradigm. Mining association rules requires the storage of a large amount of intermediate values; with its large aggregated memory, an SMP is particularly well suited to this mining task. However, no efficient algorithm has been proposed for this purpose on SMP. Our study has shown that currently available solutions for sequential machines or other parallel models are not suitable; hence, our goal is to develop one that is suitable for SMP.

In a distributed shared-nothing memory parallel system, the processors rely on communication to coordinate their tasks. All proposed algorithms for mining association rules in this model focus on three issues: (1) reduction of communication; (2) pruning of candidate sets; (3) partitioning of the candidate sets across the distributed memory. They all follow the level-wise approach developed in the serial algorithm Apriori [2]. Large itemsets in this approach are computed in iterations by their lengths, i.e., the number of items they contain; thus, the database must be scanned as many times as the length of the maximum large itemset, a hard lower bound on the I/O cost. They also adopt a synchronous protocol for exchanging support counts at the end of every iteration, which may not be the best for a shared-memory system.

In an SMP system, processors communicate through shared variables and communication cost is no longer an issue; mining performance is dominated solely by I/O and computation costs. In the distributed memory model, processors access their own database partitions on their local disks independently. In an SMP, in contrast, the partitions for the different processors are stored in the same shared storage and accessed via the same I/O channel, which could very easily become a performance bottleneck. Further aggravating the problem, the synchronized access to the partitions in every iteration creates serious I/O contention among the processors. Our contribution is an adaptive and asynchronous algorithm for the shared-memory model which requires much less I/O compared with the level-wise approach.

A direct extension of the serial association rule mining algorithm Apriori to the distributed memory model has been developed.
Its implementation on the IBM SP2 system is called CD (Count Distribution), a synchronous level-wise algorithm [3]. A variant of CD was adapted to SMP in [17], which parallelizes the candidate generation; however, it suffers heavily from I/O contention. As far as we know, no asynchronous parallel algorithm has been proposed for mining association rules. In this paper, we propose a parallel algorithm APM (Adaptive Parallel Mining) for mining association rules on an SMP machine. APM has the following merits: (1) it requires much less scanning of the database than a level-wise algorithm such as CD; (2) it has a much smaller set of candidates than CD; (3) it is an asynchronous algorithm which produces less I/O contention.

The candidate generation in APM is based on the dynamic candidate generation method in the Dynamic Itemset Counting (DIC) algorithm [4]. It is a break from the conventional level-wise approach: it divides the database into intervals, and generates candidates and performs counting on the intervals instead of on the entire database. By working on intervals, it can start its counting earlier during the scanning of the database. In an ideal situation in which the itemset distribution over the intervals is homogeneous, the counting can be completed with much less database scanning. Despite its simplicity, its advantages in many cases are overwhelmed by its problems. Its performance is very sensitive to the data distribution characteristics and to the choice of the interval size. A direct implementation of dynamic counting would often lead to a large number of candidates, because the distribution in general is not highly homogeneous. Thus, the amount of database scanning required in many cases would be even worse than that in the level-wise approach. The merits of dynamic candidate generation are its potential to require less database scanning, and its provision for the processors to perform counting asynchronously over their partitions.

In APM, we have incorporated this dynamic technique and tackled its problems with two optimization techniques. (1) Adaptive interval configuration: it generates a configuration of the intervals which produces a database partitioning across the processors with highly homogeneous inter-partition and intra-partition itemset distributions. With such a homogeneous distribution, the dynamic technique performs much closer to the ideal situation. (2) Virtual partition pruning: the dynamic technique is very sensitive to an error-accumulating effect. Unlike Apriori, it does not generate candidate sets based on the large itemsets found in every iteration; instead, it uses locally large itemsets found in one interval to generate candidates for the subsequent intervals. If a large part of these locally large itemsets are not globally large (not large with respect to the entire database), then the candidates generated from them will contain many unnecessary candidates (false hits). This error-accumulating effect can push the computation cost up exponentially to an unacceptable level. The virtual partition pruning technique starts the dynamic generation with a much more accurate and smaller initial candidate set, reduces substantially the accumulated error, and hence produces a much smaller candidate set.

We have implemented APM and CD on a Sun Enterprise 4000 shared-memory multiprocessor machine with 12 processors and carried out extensive performance studies. Our results show that a direct implementation of the dynamic technique does not bring any performance gain; in fact, in some cases, it is even worse than the level-wise approach. However, with the incorporation of the two optimization techniques, APM performs consistently better than CD in all respects: it was 2 to 5 times faster than CD in most cases, reduced the I/O cost by about half, and had a candidate set 6 to 30 times smaller. In short, the results confirm that the optimizations are the main reasons for the superior performance of APM.

The rest of this paper is organized as follows. Dynamic candidate generation and virtual partition pruning are studied in Section 2. Details of the APM algorithm are in Section 3. Section 4 reports the results of an extensive performance study.
Sections 5 and 6 contain the discussion and the conclusion.

2 Dynamic Techniques for Association Rule Mining

Let I = {i_1, i_2, ..., i_m} be a set of items and D be a database of transactions, where each transaction t consists of a set of items such that t ⊆ I. An association rule is an implication of the form X ⇒ Y, where X ⊂ I, Y ⊂ I and X ∩ Y = ∅ [1]. An association rule X ⇒ Y has support s in D if the probability of a transaction in D containing both X and Y is s. The association rule X ⇒ Y holds in D with confidence c if c is the probability that a transaction in D containing X also contains Y.

The task of mining association rules is to find all the association rules whose support is larger than a given minimum support threshold and whose confidence is larger than a given minimum confidence threshold. For an itemset X, we use X.sup to denote its support count in database D, which is the number of transactions in D containing X. An itemset X ⊆ I is large (or frequent) if X.sup ≥ minsup × |D|, where minsup is the given minimum support threshold. For the purpose of presentation, we sometimes use support to stand for the support count of an itemset. It has been shown that the problem of mining association rules can be reduced to finding all large itemsets for a given minimum support threshold.

2.1 Level-wise Approach

All association rule mining algorithms have adopted, in various ways, the candidate generation technique of the Apriori algorithm [2]. It is a level-wise approach which finds large itemsets by iteration. (We use L_k and C_k to denote the set of size-k large itemsets and the set of size-k candidate itemsets, k ≥ 1, respectively.) It scans the database in the first iteration to count and find the set of size-1 large itemsets L_1. In this iteration, the candidate set C_1 consists of all the items in I. After that, the candidate set C_k in each subsequent iteration k is generated by applying a function Apriori-gen to the large itemsets L_{k-1} found in iteration k-1. Apriori-gen essentially joins the large itemsets in L_{k-1} in all possible ways to form size-k candidates and eliminates some of them with the subset constraint that all size-(k-1) subsets of a candidate must have been found large in the previous iteration. Apriori then counts the supports of the candidates in C_k by checking the transactions in D against C_k, which requires a full scan of D. In this level-wise approach, the database must be scanned as many times as the length of the maximum large itemset.

Let us analyze the cost of finding the large itemsets, which consists mainly of I/O and computation cost. If ω is the length of the maximum large itemset, then the I/O cost is determined by the ω scans of the database. The computation cost in each iteration is dominated by the counting of the supports of the candidates in C_k, in which the size of C_k is the major factor. Let the sizes of C_k and L_k be c_k and l_k, respectively. The value of c_k is determined by the Apriori-gen function and the set L_{k-1}. If c is the total number of candidate sets, then

$$c = \sum_{k=1}^{\omega} c_k = \sum_{k=1}^{\omega} f_k(L_{k-1}),$$

where f_k, the number of candidates generated, is a function of L_{k-1}. Furthermore, if the I/O and computation costs in iteration k are α_k and β_k, respectively, then the total cost Γ of computing all the large itemsets is given by

$$\Gamma = \sum_{k=1}^{\omega} (\alpha_k + \beta_k) = \sum_{k=1}^{\omega} \beta_k + \omega\alpha,$$

where α is the I/O cost of a full database scan, which is the same for all the iterations. In general, the number of candidates drops rapidly after the first few iterations, and β_k peaks in the first few iterations. Thus, there are two options for reducing the cost: (1) reduce ω, i.e., the number of rounds of scanning the database; (2) reduce the number of candidates, in particular those in the first few iterations. This observation has also been discussed in [12], in which a dynamic hashing technique is used to reduce the size-2 candidate set.
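The following is a minimal Python sketch of the Apriori-gen join-and-prune step described in this subsection; the function name and the frozenset representation are our own, not the paper's implementation.

```python
from itertools import combinations

def apriori_gen(large_prev):
    """Generate size-k candidates from the size-(k-1) large itemsets.

    large_prev: set of frozensets, all of the same size k-1.
    """
    k = len(next(iter(large_prev))) + 1
    # Join step: union pairs of size-(k-1) large itemsets into size-k sets.
    joined = {a | b for a in large_prev for b in large_prev
              if len(a | b) == k}
    # Prune step: keep a candidate only if every size-(k-1) subset is large.
    return {c for c in joined
            if all(frozenset(s) in large_prev
                   for s in combinations(c, k - 1))}

# Example: L2 = {AB, AC, BC, BD} yields candidate ABC but not ABD or BCD,
# because AD and CD are not large.
L2 = {frozenset("AB"), frozenset("AC"), frozenset("BC"), frozenset("BD")}
print(apriori_gen(L2))  # {frozenset({'A', 'B', 'C'})}
```

Reducing |C_k|, and especially |C_2|, directly reduces the β_k terms of the cost above.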

2.2 Dynamic Candidate Generation

Dynamic candidate generation is an attempt at the first option described above, reducing the number of scans [4]. It divides the database into equal-size intervals, each containing the same number of transactions, and assigns checkpoints at the boundaries between consecutive intervals (Figure 1). It then scans the database starting from the first interval, and generates candidate itemsets and counts them in a pipeline fashion. In the first interval, the initial set of candidates is the set of size-1 itemsets. During the scanning of an interval, all candidates generated and accumulated so far are counted. At the checkpoint, those found to be locally large with respect to the number of transactions in the interval are used to generate candidates for the next interval. These locally large itemsets are divided into groups by their lengths, and the function Apriori-gen is applied to each group to generate new candidates. This candidate generation and counting process is repeated on every interval. It does not stop at the last interval but wraps around the database and continues at the first interval until the termination condition becomes true.

Figure 1: DIC. The transaction database divided into intervals, with a checkpoint of the counting process at each interval boundary.

This aggressive and optimistic approach allows the candidate sets to grow very quickly. DIC will be counting size-k candidates during the scanning of the k-th interval of the first pass instead of in the k-th iteration. It also allows candidates of different sizes to be counted together once they are generated. The global support of a candidate in the entire database is found as soon as the scanning has wrapped around the database and come back to the checkpoint at which the candidate was first generated. Once all generated candidates have been counted, the counting is complete and terminates.

The dynamic technique has the following problems: (1) it is very sensitive to the data distribution characteristics; (2) it requires a proper interval size; (3) it accumulates errors recursively in generating candidates, and the number of unnecessary candidates generated is very sensitive to the number of false hits in the initial set of candidates.

The dynamic technique requires a homogeneous distribution of large itemsets over the intervals. In an ideal scenario, the itemset distributions in most of the intervals are very similar, and very few new candidates would be generated after the first ω intervals, where ω is the length of the maximum large itemset; hence, the amount of scanning required could be as little as one pass of the database plus ω intervals. On the other hand, if the distribution is not very uniform, new candidates may be generated at various stages, which would keep extending the scanning of the database. In the worst case, the number of passes required could be larger than ω.

The choice of the interval size is also important in dynamic generation. With respect to the whole database, an interval is similar to a sample of the database. If it is large, it is more likely that the locally large itemsets found are also globally large; hence, fewer unnecessary candidates would be generated. However, it would require more scanning of the database. On the other hand, a smaller interval would reduce the amount of scanning; however, it may not be as good a sample, and more locally large itemsets would be found, hence more unnecessary candidate sets. (A candidate is unnecessary if it is not large.)
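The following is a minimal single-processor Python sketch of this interval pipeline under simplifying assumptions (a join step without the subset prune, and a crude locally-large test over the intervals seen so far); it illustrates the scheme, not DIC exactly as specified in [4].

```python
def dic_scan(intervals, items, min_frac):
    """Minimal sketch of DIC-style dynamic counting over equal-size intervals."""
    counts = {frozenset([i]): 0 for i in items}  # initial size-1 candidates
    seen = {c: 0 for c in counts}                # intervals counted per candidate
    m = len(intervals)
    size = len(intervals[0])                     # transactions per interval
    pos = 0
    # Wrap around the database until every candidate has been counted
    # over all m intervals.
    while any(v < m for v in seen.values()):
        interval = intervals[pos % m]
        for c in counts:
            if seen[c] < m:
                counts[c] += sum(c <= t for t in interval)
                seen[c] += 1
        # Checkpoint: itemsets locally large over the intervals counted so
        # far spawn candidates one item larger (join step only; the
        # Apriori-gen subset prune is omitted for brevity).
        local_large = [c for c, v in counts.items()
                       if seen[c] and v >= min_frac * size * seen[c]]
        new = {a | b for a in local_large for b in local_large
               if len(a | b) == len(a) + 1}
        for u in new:
            if u not in counts:
                counts[u] = 0
                seen[u] = 0
        pos += 1
    return counts
```

In DIC proper, a candidate's count becomes its global support once the scan wraps back to the checkpoint at which it was created; the seen counter plays that role in this sketch.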

In APM, we have developed a technique to configure and merge the intervals to create a more homogeneous data distribution for dynamic generation. As a consequence, an interval size compatible with the homogeneous distribution can also be determined in this process.

Now let us look at the problem of error accumulation in the dynamic technique. Let ω_c be the length of the maximum candidate itemset. If c is the total number of candidate sets generated by the dynamic technique, and c_k is the number of size-k candidates, then

$$c = \sum_{k=1}^{\omega_c} c_k = \sum_{k=1}^{\omega_c} f_k(L'_{k-1}),$$

where L'_k is the set of size-k locally large itemsets found and f_k is determined by the function Apriori-gen. (In dynamic candidate generation, not all locally large itemsets are used to generate candidates at a checkpoint; only those found in that interval are used.) In the ideal case in which the data distribution is homogeneous, L'_k will be very close to L_k, and the candidate set size will be similar to that produced by Apriori-gen. In the opposite case, L'_k will be much larger than L_k, and many more candidates will be generated unnecessarily. Worse yet, the error accumulates over the intervals. In Apriori-gen, every candidate is generated from the "true" large itemsets found in the previous iteration. In the dynamic case, some locally large itemsets which are not "true" large itemsets are used to generate candidates which will never become large. These unnecessary candidates, if they turn out to be locally large subsequently, will in turn be used to generate further unnecessary candidates. This recursive effect can undesirably push up the number of false hits (unnecessary candidates) among the candidates. Because of it, the number of small-size unnecessary candidates significantly affects the number of unnecessary candidates of larger sizes, and hence the total number of unnecessary candidates. Therefore, we must control the number of small-size unnecessary candidate sets. We have found in our performance studies that if we can prune C_2 down closer to L_2, the accumulated error in dynamic candidate generation can be reduced significantly. A simple way to achieve this is to delay the dynamic generation until L_2 is found, which requires two additional passes over the database. We have taken a more effective approach: use only one additional scan to compute L_1, and at the same time adopt a virtual partition pruning technique to prune C_2. Our performance studies have shown that this technique is very effective in bringing down the accumulated error in the candidate generation.

2.3 Virtual Partition Pruning

Given a database D, we can divide it into n partitions D_i, 1 ≤ i ≤ n. For an itemset W, we use W.sup(i) to denote the support (local support) of W in D_i, 1 ≤ i ≤ n. We also use W.sup to denote the support (global support) of W in D. Suppose X is a size-k candidate set. If Y ⊂ X, then Y.sup(i) ≥ X.sup(i), 1 ≤ i ≤ n. Therefore, the local support of X in D_i, X.sup(i), is bounded by the value min{Y.sup(i) | Y ⊂ X, |Y| = k-1}. Hence, the value

$$X.maxsup = \sum_{i=1}^{n} X.maxsup(i), \quad \text{where } X.maxsup(i) = \min\{Y.sup(i) \mid Y \subset X,\ |Y| = k-1\}, \qquad (2.1)$$

is an upper bound of the global support of X.

If X.maxsup < s × |D|, where s is the minimum support threshold, then X can be pruned away. We refer to this as partition pruning, which is an extension of the global pruning proposed in [5]. Note that in a multiprocessor system, it is natural to divide the database according to the number of processors. With such a configuration, partition pruning can be performed with respect to the actual partitioning. On the other hand, the database can also be partitioned irrespective of the number of processors. We call the pruning performed in this way virtual partition pruning.

The effect of partition pruning can be illustrated by the example in Table 1. Suppose the global support count threshold is 3. Then all the itemsets in {A, B, C, D, E, F} become size-1 large itemsets. Hence, the number of size-2 candidates from Apriori-gen would be C(6,2) = 15. However, among these 15 candidates, only AB and EF survive the partition pruning. The upper bounds of the global supports of AB and EF, computed according to equation (2.1), are 31 and 3, respectively (and hence they survive the pruning); the upper bounds of the other size-2 candidates are all below 3 (and hence they are pruned away). In this example, partition pruning prunes away more than 80% of the candidates of Apriori.

Items                          A   B   C   D   E   F
local support at partition 1
local support at partition 2
local support at partition 3
global support

Table 1: Virtual Partition Pruning (the numeric entries did not survive transcription)

It can be seen from the above example that this pruning technique is more effective if the itemsets have a skewed distribution over the partitions. Intuitively, the data skewness of a partitioned database is high if most large itemsets are locally large only at a few partitions. In the following, we show analytically the effect of partition pruning on a highly skewed partitioning.

Lemma 1 Suppose a database is partitioned into n partitions. Let L_1 be the set of size-1 large itemsets, and let C_2(a) and C_2(v) be the set of size-2 candidates generated by Apriori and the set remaining after the partition pruning, respectively. Suppose that each size-1 large itemset is locally large at one and only one partition, and the number of size-1 locally large itemsets at each partition is |L_1|/n. Then

$$\frac{|C_2(v)|}{|C_2(a)|} \leq \frac{|L_1|/n - 1}{|L_1| - 1} < \frac{1}{n}.$$

Proof: Since Apriori generates as size-2 candidates all the 2-combinations of L_1, |C_2(a)| = |L_1|(|L_1| - 1)/2. As for partition pruning, if AB ∈ C_2(v), then both A and B must be locally large together at one of the partitions. Therefore, |C_2(v)| ≤ n × (|L_1|/n)(|L_1|/n - 1)/2 = (|L_1|/2)(|L_1|/n - 1). Dividing gives the claimed ratio; the ratio is less than 1/n because n(|L_1|/n - 1) = |L_1| - n < |L_1| - 1. □

Lemma 1 shows that the partition technique can prune away at least a fraction (n-1)/n of the size-2 candidates generated by Apriori on a skewed partitioning. Note that the cost of partition pruning is very low. In APM, we use it to prune the candidates in C_2; the only additional cost required is the storage of the local supports of the size-1 itemsets. The local and global supports can be counted together in one pass over the database.
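As a concrete rendering of equation (2.1), here is a small Python sketch with a hypothetical per-partition support table; the representation and names are ours.

```python
from itertools import combinations

def prune_candidates(candidates, local_sup, n, threshold):
    """Keep a size-k candidate X only if its upper bound X.maxsup, summed
    over the n partitions from the minimum local support of its size-(k-1)
    subsets (equation 2.1), reaches the support-count threshold."""
    survivors = set()
    for X in candidates:
        k = len(X)
        maxsup = sum(
            min(local_sup[i].get(frozenset(Y), 0)
                for Y in combinations(X, k - 1))
            for i in range(n))
        if maxsup >= threshold:
            survivors.add(X)
    return survivors

# Toy run (hypothetical counts): A and B are locally large together at
# partition 0 only, so AB survives while AC is pruned.
local_sup = [
    {frozenset("A"): 10, frozenset("B"): 9, frozenset("C"): 0},
    {frozenset("A"): 0, frozenset("B"): 1, frozenset("C"): 8},
]
cands = {frozenset("AB"), frozenset("AC")}
print(prune_candidates(cands, local_sup, n=2, threshold=5))
```

The bound for AB is min(10, 9) + min(0, 1) = 9, so it survives a threshold of 5; the bound for AC is 0, so it is pruned.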

3 Parallel Mining on SMP

CD (Count Distribution) is a parallel version of Apriori proposed for distributed-memory parallel systems [3]. In this model, the database D is partitioned onto the local disks of the processors. At every iteration (the k-th iteration), each processor first computes the same candidate set C_k from L_{k-1}, and then scans its own partition to compute the local supports of the candidates in C_k. All the processors then exchange their local supports by performing synchronous broadcasts. Following that, each processor computes the global supports of the candidates and finds the globally large itemsets in L_k. CD repeats these steps until no new candidate is generated. Several other parallel algorithms, such as IDD, HD and HPA, which are improvements on CD, have been proposed [7, 15]. They focus on the issue of distributing the candidates among the distributed memory. All of them are level-wise, synchronous algorithms. In the following, we explain why asynchronous algorithms could be more efficient in a shared-memory system: processors would not be forced to wait for each other in every round of communication.

3.1 Asynchronous Parallel Mining of Association Rules

In a shared-memory multiprocessor parallel system, the model generally used for computing the large itemsets is called common candidates partitioned database [17]. The database D is stored in the shared storage system and divided logically into partitions D_1, D_2, ..., D_n, where n is the number of processors. Each processor counts the supports of the common candidates against its own partition. The results of the counting are stored in a shared data structure in memory.

Figure 2: The candidate trie and its node structure (each node carries the fields item, itemid, LocalCount, InterCount, branchno and branchpointer).
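In APM this shared structure is the candidate trie of Figure 2 (the counter fields are detailed in Section 3.6). As a rough Python sketch of such a node and of candidate insertion; the class layout and the insert helper are illustrative assumptions:

```python
from dataclasses import dataclass, field
from itertools import count

_ids = count()  # itemset ids index into the per-processor counter arrays

@dataclass
class TrieNode:
    """One candidate itemset on the shared trie (fields as in Figure 2)."""
    item: str          # last item of the itemset, e.g. 'C' for ABC
    itemid: int        # id used to index the private counter arrays
    local_count: list  # per-processor support totals over scanned intervals
    inter_count: list  # per-processor number of intervals counted so far
    children: dict = field(default_factory=dict)

def insert(root, itemset, nprocs):
    """Walk the branch for `itemset` (a sorted sequence of items),
    creating a node for every prefix that is not yet on the trie."""
    node = root
    for it in itemset:
        if it not in node.children:
            node.children[it] = TrieNode(it, next(_ids),
                                         [0] * nprocs, [0] * nprocs)
        node = node.children[it]
    return node

root = TrieNode("", next(_ids), [], [])
insert(root, ("A", "B", "C"), nprocs=4)  # creates nodes for A, AB, ABC
```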

APM (Adaptive Parallel Mining) uses the dynamic candidate generation technique to generate the common candidates asynchronously. In order to store candidates of different sizes together, a trie instead of a hash tree is used to store the supports. Figure 2 shows a trie in APM. Every node of the trie is associated with a candidate itemset. For example, on the first branch of the trie, the nodes store the supports of the candidates A, AB, ABC, AC and AD.

Figure 3: Database, partitions and intervals. The database D is divided logically into partitions D_1, D_2, ..., D_n, each consisting of a sequence of intervals.

In order to generate candidates, each partition is divided into smaller intervals, and dynamic candidate generation is performed on the intervals by the corresponding processor (Figure 3). New candidates are stored on the trie once they are generated. Candidates generated by one processor are shared by the other processors once they are inserted into the trie. All processors perform the generation and counting cycle on their own partitions until all candidates on the trie have been counted by all the processors and no new candidate is generated.

3.2 The Algorithm

We present the APM algorithm in Figure 4; the program fragment is executed by each processor i on its partition D_i, 1 ≤ i ≤ n. The algorithm consists of two phases. The steps of the first phase are: (1) The database is divided into intervals, and these intervals are further divided into groups to form a skewed partitioning. (The skewness is created by clustering the intervals; details are in Section 3.3.) Each processor then scans its partition to compute the support counts of the size-1 itemsets. This yields the size-1 large itemsets L_1. (2) The size-2 candidates C_2 are computed from L_1. (3) Virtual partition pruning is performed to reduce the size of C_2 (see Section 3.3). (4) A shared trie for recording support counts is initialized with the remaining candidates in C_2. (5) Inter-partition and intra-partition interval configurations are performed to increase the homogeneity of the data distribution in the partitions (see Section 3.4). At the end of phase 1, we have a trie containing the size-2 candidates and a partitioning with high homogeneity. In phase 2, every processor performs the dynamic generation and counting in its partition with the following steps: (1) It scans intervals and computes the locally large itemsets in each interval. (2) It uses the locally large itemsets found to generate new candidates. (3) It applies virtual partition pruning to the new candidates before inserting them into the shared trie. (4) Before starting a new round of counting on the next interval, every processor traverses the trie to remove itemsets which are globally small. If all the processors have counted all the itemsets on the trie, the algorithm terminates.

/* Preprocessing:
   (1) all processors scan their partitions to compute local supports of size-1 itemsets in their intervals;
   (2) compute L_1 and generate C_2 = Apriori-gen(L_1);
   (3) perform a virtual partition pruning on C_2;
   (4) initialize the shared trie with the remaining size-2 candidates;
   (5) perform inter-partition interval configuration and intra-partition interval configuration to prepare a homogeneous distribution */

/* Parallel execution: every processor i runs the following fragment on its partition D_i */
1) while (some processor has not finished the counting of all the itemsets on the trie on its partition)
2) {  while (processor i has not finished the counting of all the itemsets on the trie on D_i)
3)    {  scan the next interval on D_i and count the supports of the itemsets on the trie in the interval scanned;
4)       find the locally large itemsets among the itemsets on the trie;
5)       generate new candidates from these locally large itemsets;
6)       perform virtual partition pruning on these candidates and insert the survivors into the trie;
7)       remove globally small itemsets from the trie;
8)    }
9) }

Figure 4: The APM Algorithm

3.3 Virtual Partition from Clustering

As discussed in Section 2.3, the initial set of candidates used in the dynamic generation of APM is important in controlling the total number of candidate sets. We use one scan of the database to compute L_1 and generate the size-2 candidates from L_1. Then we use the virtual partition technique to prune the candidates, and use the remaining sets as the initial candidates of the dynamic generation in APM. The skewness of the virtual partitioning of the database increases the effect of the pruning. We use a k-clustering technique to generate a skewed virtual partitioning. (In the k-clustering technique, the number of resulting clusters, k, is a control parameter.)

To prepare for the dynamic generation, D is divided into a set of equal-size small intervals. While computing the size-1 large itemsets, an item support vector (or support vector) is generated for every interval. A support vector contains the support counts of all the items in the interval; it represents the distribution of size-1 itemsets in the interval. In the multidimensional space of the support vectors, a set of k clusters is computed. We logically merge the intervals of every cluster to create a virtual partitioning of D. The skewness of the partitioning so generated is much higher than that of a random partitioning. The technique described in Section 2.3 is then used to generate the size-2 candidates. This candidate set will be much smaller than that of Apriori. The smaller size-2 candidate set reduces the initial error of the dynamic generation in APM. This observation has been confirmed in our performance studies.
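A minimal sketch of this clustering step, assuming scikit-learn's KMeans as the k-clustering (the paper does not prescribe a particular clustering algorithm) and toy support vectors:

```python
import numpy as np
from sklearn.cluster import KMeans

def virtual_partitions(support_vectors, k):
    """Cluster the per-interval item support vectors and logically merge
    the intervals of each cluster into one virtual partition of D.

    support_vectors: array of shape (num_intervals, num_items), where row j
    holds the support counts of all items in interval j.
    Returns a list of k lists of interval indices (no data is moved)."""
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(support_vectors)
    return [list(np.flatnonzero(labels == c)) for c in range(k)]

# Toy example: 6 intervals over 3 items, forming two obvious groups.
vecs = np.array([[9, 1, 0], [8, 2, 1], [9, 0, 1],
                 [0, 7, 9], [1, 8, 8], [0, 9, 9]])
print(virtual_partitions(vecs, k=2))  # e.g. [[0, 1, 2], [3, 4, 5]]
```

Each returned index list is one virtual partition; virtual partition pruning is then applied with respect to these logical partitions.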

3.4 Homogeneous Inter-partition Distribution

Since the database D must be divided into partitions and intervals to perform dynamic generation, we need to control two itemset distributions: we want both the inter-partition and the intra-partition distributions to be homogeneous. We use the result of the k-clustering in the first pass to create a homogeneous inter-partition distribution. The distributions of itemsets in the intervals belonging to a cluster are very similar. If every partition contains a similar number of intervals from every cluster, the itemset distribution between the partitions will be more homogeneous than under a random partitioning. We distribute intervals to partitions following this rule. Since all intervals are accessible by every processor, no physical movement of storage is required to produce these partitions; every processor just needs to know the sequence of intervals belonging to its partition.

3.5 Adaptive Interval Configuration

APM uses another adaptive technique to produce intra-partition interval configurations with high homogeneity. It shuffles the intervals in a partition and merges them iteratively until the homogeneity reaches a certain measurable level. Homogeneity would be guaranteed if we merged all the intervals into one big interval; however, we also need the intervals to be small. In order to adaptively create a near-optimal interval configuration, we need to define a measure of the homogeneity of a configuration.

Definition 1 Given two sets of large itemsets L_i and L_j, the distance between them is defined as

$$dist(L_i, L_j) = 1 - \frac{|L_i \cap L_j|}{|L_i \cup L_j|}.$$

The operator dist has the following properties:
1. dist(L_i, L_j) ∈ [0, 1];
2. dist(L_i, L_i) = 0;
3. dist(L_i, L_j) = dist(L_j, L_i);
4. dist(L_i, L_j) ≤ dist(L_i, L_k) + dist(L_k, L_j);
5. dist(L_i, L_j) = 1 if L_i ∩ L_j = ∅.

Example 1 Let L_1 = {A, B, C, AB, BC} and L_2 = {A, B, D, AB, BD} be two sets of large itemsets. The distance between L_1 and L_2 is dist(L_1, L_2) = 1 - |L_1 ∩ L_2| / |L_1 ∪ L_2| = 1 - 3/7 = 4/7.

Definition 2 A division Dv = {I_1, I_2, ..., I_m} of a partition D_i is a set of disjoint intervals dividing D_i such that D_i = ∪_{j=1}^{m} I_j. The evenness factor E_f(Dv) of Dv is defined by

$$E_f(Dv) = \frac{\sum_{j=1}^{m} dist(L_j, L)}{m},$$

where L_j is the set of locally large itemsets in I_j, 1 ≤ j ≤ m, and L is the set of large itemsets in D_i.

The evenness factor has a value in [0, 1]. It equals zero if all the intervals have the same set of large itemsets. Let Dv_a = {I_1, I_2, ..., I_p} be a division of a partition D_i. A division Dv_b = {I_{s_1}, I_{s_2}, ..., I_{s_p}} of D_i is an equivalent division of Dv_a if it is a re-ordering of the intervals in Dv_a.
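The distance of Definition 1 is the Jaccard distance between the two sets of itemsets; Definitions 1 and 2 translate directly into Python (the set representation is ours):

```python
def dist(Li, Lj):
    """Jaccard distance between two sets of large itemsets (Definition 1)."""
    if not Li and not Lj:
        return 0.0
    return 1.0 - len(Li & Lj) / len(Li | Lj)

def evenness(interval_larges, partition_large):
    """Evenness factor of a division (Definition 2): the mean distance
    between each interval's locally large itemsets and the partition's
    large itemsets; 0 when every interval agrees with the partition."""
    return (sum(dist(L, partition_large) for L in interval_larges)
            / len(interval_larges))

# Example 1 from the text: dist = 1 - 3/7 = 4/7.
L1 = {"A", "B", "C", "AB", "BC"}
L2 = {"A", "B", "D", "AB", "BD"}
print(dist(L1, L2))  # 0.5714..., i.e. 4/7
```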

Definition 3 Given a division Dv_a = {I_1, I_2, ..., I_p} of a partition D_i, a division Dv_m = {I_{m_1}, I_{m_2}, ..., I_{m_l}} of D_i is a k-merge of Dv_a if l = ⌊p/k⌋ and there exists an equivalent division Dv_b = {I_{s_1}, I_{s_2}, ..., I_{s_p}} of Dv_a such that

$$I_{m_i} = \bigcup_{j=(i-1)k+1}^{ik} I_{s_j} \quad (1 \leq i \leq l-1), \qquad I_{m_l} = \bigcup_{j=(l-1)k+1}^{p} I_{s_j}.$$

A k-merge of a division is a shuffling and merging of the intervals of the division such that each resulting interval contains k intervals of the original division. Note that the last interval I_{m_l} of the k-merge may contain more than k intervals.

Example 2 Let Dv_a = {I_1, I_2, ..., I_{10}} be a division of a partition D_i. Then Dv_b = {I_{j_1}, I_{j_2}, I_{j_3}}, where I_{j_1} = I_1 ∪ I_4 ∪ I_5, I_{j_2} = I_2 ∪ I_3 ∪ I_7, and I_{j_3} = I_6 ∪ I_8 ∪ I_9 ∪ I_{10}, is a 3-merge of Dv_a.

We use an adaptive technique to generate merges for each partition that produce a homogeneous intra-partition interval configuration. After the inter-partition configuration, each partition contains an equal number of intervals. We perform another k-clustering on the support vectors of the intervals inside each partition. Let Dv be a division of a partition D_i, and let G_1, G_2, ..., G_k be the k clusters generated from the support vectors, arranged in decreasing order of size. We again pick intervals alternately from the clusters in a round-robin fashion to re-order the intervals in Dv, and then perform a k-merge on the re-ordered Dv. What we are doing here is logically re-arranging the order of the intervals in each partition and grouping them into bigger intervals such that each of these bigger intervals contains a similar number of intervals from every cluster. As a result, the distribution of the itemsets among these bigger intervals will be more homogeneous than that over the smaller intervals; hence, the homogeneity of the new division is increased.

In order to measure the homogeneity of a resulting merge efficiently, we restrict the computation of the evenness factor to size-1 itemsets. In other words, if Dv = {I_1, I_2, ..., I_p} is a division of a partition D_i, we modify the evenness factor to

$$E_f(Dv) = \frac{\sum_{j=1}^{p} dist(L^1_j, L_1)}{p},$$

where L^1_j is the set of size-1 locally large itemsets in I_j, 1 ≤ j ≤ p, and L_1 is the set of size-1 large itemsets in D_i.

Example 3 Let Dv = {I_1, I_2} be a division of partition D_i. Suppose the set of size-1 large itemsets in D_i is L_1 = {A, C}, the set of size-1 locally large itemsets of I_1 is L^1_1 = {A, C}, and that of I_2 is L^1_2 = {A}. Then dist(L^1_1, L_1) = 1 - 2/2 = 0 and dist(L^1_2, L_1) = 1 - 1/2 = 0.5. Hence, E_f(Dv) = [dist(L^1_1, L_1) + dist(L^1_2, L_1)]/2 = 0.25.

We compute the evenness factor after a k-merge has been created. If the factor is less than a threshold, the adaptive configuration stops; otherwise, the re-configuration and merging of the intervals is repeated until one of the following three conditions becomes true: (1) the evenness factor is smaller than a threshold; (2) the rate of change of the evenness factor is less than a threshold; (3) the number of resulting intervals falls below a threshold. The first condition has been discussed. If E_{f_i} and E_{f_j} are the evenness factors of two consecutive merges, the rate of change is defined as

$$\Delta E_f = \frac{|E_{f_i} - E_{f_j}|}{E_{f_i}}.$$

A small rate of change indicates that further merging would not improve the factor; therefore, the process should stop. The third condition controls the number of intervals in the division: the merge should not proceed further if the intervals have become too large.
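A sketch of the re-order-and-merge step: intervals are drawn round-robin from the clusters and grouped k at a time, with the remainder folded into the last merged interval (Definition 3). The cluster contents below are hypothetical, and at least k intervals are assumed.

```python
def k_merge(clusters, k):
    """Round-robin re-ordering of intervals drawn from the clusters,
    followed by a k-merge: every k consecutive intervals form one bigger
    interval, and any remainder joins the last one (Definition 3).

    clusters: lists of interval ids, assumed sorted by decreasing size.
    Returns the merged division as lists of interval ids."""
    # Pick intervals alternately from the clusters in round-robin fashion.
    order = []
    pools = [list(c) for c in clusters]
    while any(pools):
        for p in pools:
            if p:
                order.append(p.pop(0))
    # Group k at a time; fold the remainder into the last merged interval.
    l = len(order) // k
    merged = [order[i * k:(i + 1) * k] for i in range(l)]
    if len(order) % k:
        merged[-1].extend(order[l * k:])
    return merged

# Example 2 shape: 10 intervals, a 3-merge gives three merged intervals,
# the last containing 4 of the originals.
print(k_merge([[1, 4, 5, 6], [2, 3, 7], [8, 9, 10]], k=3))
```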

3.6 Implementation

Itemset Counting

When a candidate itemset is first generated, a node is inserted into the trie for it. The status information associated with the node is shown in Figure 2. The Item register stores the last item of the itemset; e.g., C is stored in the node if the itemset is ABC. The ItemID register stores an id assigned to the itemset for indexing into the counter arrays in memory. Every processor has a private counter array for storing the local supports of the candidate sets in the interval being scanned. (The array is just a data structure for storing the counts of the itemsets found in an interval; it could be replaced by a hash table if space usage is a concern. We use an array to simplify the discussion.) In addition, two counter arrays, LocalCount and InterCount, are created at each node. LocalCount records, for every processor, the total local support count of the itemset over all the intervals scanned so far. InterCount records, for every processor, the number of intervals already counted for the itemset.

After the nodes of all new candidate itemsets have been created and inserted into the trie at a checkpoint, the corresponding processor scans the next interval. The supports of the candidates of the trie in the new interval are stored first in the processor's private counter array. At the next checkpoint, the supports in the private counter array are used to determine the locally large itemsets of the interval, which are then used to generate the next set of candidates. In addition, the counts in the private counter array are added to the LocalCount array to update the total support counts of the candidates. The private counter arrays are reset to zero for every interval. The values in InterCount are also updated to reflect the number of intervals counted. Note that the processors perform itemset insertion and support count updates concurrently on the trie. The only constraint is that the local support of a newly inserted itemset starts to be counted by a processor on its partition when the processor begins a new interval scanning cycle.

Pruning Candidates at Checkpoints

It is useful to prune away as many candidates as possible during the dynamic generation and counting on the trie. APM applies partition pruning at every checkpoint to prune newly generated candidates. The computation of the upper bound in partition pruning needs to be modified slightly for this purpose. Let D_i, 1 ≤ i ≤ n, be the partitions. For a given size-k itemset X, partition pruning requires the local support counts of all the size-(k-1) subsets of X to compute a bound. In the asynchronous case, when a new candidate is generated by a processor, the local supports of some of its subsets in the other partitions may not yet be available on the trie. To accommodate this, we use a more conservative bound on the support count. Suppose we want to determine whether a candidate X can be pruned at a checkpoint. Assume that each partition D_i, 1 ≤ i ≤ n, has been divided into m intervals and that each interval contains M transactions. Let Y ⊂ X. If Y has been counted over t intervals of D_i, and the accumulated support of Y over those t intervals is Y.sup(i,t), then Y.sup(i) is bounded by Y.sup(i,t) + M × (m - t). With this bound, we can compute an upper bound for X.sup. If the result is less than s × |D|, then X can be pruned away. Note that this pruning is more effective when most subsets of X have been counted completely.

At each checkpoint, a processor can traverse the trie to identify globally large and globally small itemsets. All globally small itemsets found are removed before the processor starts a counting cycle. We also remove the supersets of the small itemsets which are on the same branch. (We do not want to search the whole trie for supersets of small itemsets; only those on the same branch are removed.) For example, if BC is globally small, then its superset BCD will be removed (Figure 2).
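A sketch of the conservative checkpoint bound above: since a subset Y may not yet be fully counted in partition i, its unseen intervals are assumed full, giving Y.sup(i) ≤ Y.sup(i,t) + M × (m - t). The dictionary layout and names are our own illustration.

```python
from itertools import combinations

def checkpoint_maxsup(X, sup_so_far, intervals_counted, m, M):
    """Conservative upper bound on the global support of candidate X.

    sup_so_far[i][Y]: accumulated support of subset Y over the intervals
    of partition i counted so far (Y.sup(i,t)); intervals_counted[i][Y]:
    the number t of the m intervals of partition i counted for Y.  Each
    interval holds M transactions, so Y.sup(i) <= Y.sup(i,t) + M*(m - t)."""
    k = len(X)
    bound = 0
    for i in range(len(sup_so_far)):
        bound += min(
            sup_so_far[i].get(frozenset(Y), 0)
            + M * (m - intervals_counted[i].get(frozenset(Y), 0))
            for Y in combinations(X, k - 1))
    return bound

# X is pruned at the checkpoint if checkpoint_maxsup(...) < s * |D|.
```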

Termination Condition

APM terminates when all candidates on the trie have been counted by all the processors. A processor can determine whether it has completed the counting of an itemset by comparing the value in InterCount with the total number of intervals. Every processor i keeps a counter NodeCount(i) of the number of itemsets that it has completely counted. APM also keeps a counter NumItemsets of the number of candidates on the trie. By comparing the values of these two counters, a processor can determine at a checkpoint whether it has completely counted all the candidate itemsets. Hence, the termination condition can be verified by all the processors independently.

Optimization Cost

We discuss here the cost of the optimizations. The cost of producing a more homogeneous interval configuration involves the clustering and merging of intervals. As will be shown in the performance studies in the next section, the cost of clustering is negligible, because the data set in the clustering consists of only one vector from each interval; hence, the data set is small. Also, the merging of the intervals does not involve physical movement of data. Since the database is on the shared disk, the reconfiguration and merging of the intervals can be implemented by assigning different sequences of pages to the partitions and processors. As for the cost of virtual partition pruning on the size-2 candidates, it requires the storage of the local supports of all the size-1 items with respect to all the partitions. However, this is only required in the first iteration, and the size is proportional to the number of items and partitions only. As for the memory space requirement, the main requirement is for the trie, which is proportional to the number of candidate sets and processors. However, since the trie is shared, the amount of space required per processor remains proportional to the number of candidate sets. With the help of the pruning of candidate sets, the space requirement can be reduced substantially. Our performance studies have confirmed this observation (see the next section).

4 Performance Studies

We have carried out extensive performance studies on a 12-node Sun Enterprise 4000 shared-memory multiprocessor. Each node is an UltraSPARC 250 MHz processor running Solaris 2.6. The machine has 1 GB of main memory.

4.1 Synthetic Database Generation

We follow the methodology proposed in [2] to develop synthetic databases for the performance study. In order to test the effect of the dynamic interval configuration, we need skewed data distributions. We enhanced the procedures in [2] to generate partitions and introduced a technique to control their skewness. Table 2 lists the parameters of our synthetic databases.

Suppose we want to generate n database partitions, each of size D/n. We first generate nρ smaller data intervals, each containing D/(nρ) transactions. Their skewness is controlled by the parameter S. Subsequently, the intervals are divided into n groups, each of ρ intervals. Finally, the intervals in every group are combined to form the partitions of the test database.

D    total number of transactions in the database
T    average size of the transactions
I    average size of the maximal potentially large itemsets
L    number of maximal potentially large itemsets
N    number of items
S    interval skewness
n    number of partitions
ρ    number of intervals in each partition

Table 2: Synthetic Database Parameters

All intervals, and hence partitions, are generated from a pool of potentially large itemsets as in [2]. We first generate the relative weights of these itemsets. These weights are then broken down into nρ weights, one for each interval. In other words, every itemset in the pool has nρ weights associated with it, each corresponding to the probability of occurrence of the itemset in one interval. The weight of each itemset in the pool of size L is picked from an exponential distribution with unit mean. Then we pick a skewness level s for the itemset from a normal distribution with mean S and variance 0.1. After that we generate nρ probability values from an exponential distribution with variance equal to s, and normalize them so that their sum equals 1. These nρ probability values are randomly assigned to the nρ intervals. Eventually, the weight of the itemset is broken down into nρ weights by multiplying it with the probability values. The procedure is iterated to generate the weights of all the itemsets and their breakdowns.

The second step is to generate the itemsets in the pool. We first divide the N items into nρ disjoint equal ranges. In order to control the skewness of the itemsets, we regard the i-th (1 ≤ i ≤ nρ) probability value generated for an itemset in the previous step as the probability of choosing items from the i-th range to put into the itemset. The size of an itemset is determined by a Poisson distribution with parameter I, as in [2]. All items of the first itemset are chosen randomly from some ranges. The ranges are determined by tossing an (nρ)-sided weighted coin, where the weight of side i is the i-th of the nρ probability values of the itemset. Items are picked until their number equals the size of the itemset. Some items of each subsequent itemset are copied from the previous itemset according to the correlation level, while the remaining items are picked in the same way as for the first itemset.

The third step is to generate the transactions in all the intervals. The size of a transaction is determined by a Poisson distribution with parameter T. When generating interval i (1 ≤ i ≤ nρ), we normalize the i-th weights of all the itemsets so that their sum equals 1, and use the normalized i-th weight as the probability of occurrence of the associated itemset in interval i. Each transaction in the interval is assigned a series of large itemsets, which are chosen by tossing an L-sided weighted coin, where the weight of a side is the i-th weight of the associated itemset. We also incorporate the corruption factor of [2] to control the generation.

The last step is to combine the nρ intervals into n partitions. The nρ intervals are equally divided into n groups, and the intervals in each group are concatenated to form a new partition. The result is n database partitions, each of size D/n and containing ρ intervals.
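Under our reading of the first step, each itemset's weight is split across the nρ intervals as follows; this is a hedged numpy sketch, and the exact distributions (in particular how the variance s sets the exponential scale) are interpretations of the text, not the paper's generator.

```python
import numpy as np

rng = np.random.default_rng(0)

def itemset_interval_weights(num_itemsets, num_intervals, S):
    """Break each itemset's global weight into per-interval weights.

    Weight ~ Exp(mean 1); skewness level s ~ N(S, variance 0.1); the
    interval probabilities are exponential with variance s (scale sqrt(s)),
    normalized to sum to 1 and randomly assigned to the intervals."""
    weights = rng.exponential(1.0, size=num_itemsets)
    out = np.empty((num_itemsets, num_intervals))
    for j, w in enumerate(weights):
        s = max(rng.normal(S, np.sqrt(0.1)), 1e-3)  # keep the scale positive
        p = rng.exponential(np.sqrt(s), size=num_intervals)
        p /= p.sum()
        rng.shuffle(p)  # random assignment of the probabilities to intervals
        out[j] = w * p
    return out

# e.g. 1000 potentially large itemsets split over n*rho = 96 intervals
w = itemset_interval_weights(1000, 96, S=0.7)
```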

4.2 Relative Performance

In the studies of relative performance, we have two goals: (1) to find out how much improvement APM can deliver over the level-wise algorithm CD; (2) to separate the improvement due to dynamic generation from that due to the two optimization techniques. For these purposes, besides APM and CD, we also implemented two variants of APM. The first variant, APM-DIC, is a direct parallelization of DIC with no optimization; in other words, APM-DIC is the algorithm in Figure 4 without steps (3), (4) and (5) of the preprocessing part. The second variant, APM-AIC, adds adaptive interval configuration to APM-DIC to bring in intra-partition homogeneity. In the implementation of CD, a hash tree is shared by all the processors to store their support counts. Every processor has a private counter array to record the support counts of the candidates in its partition. At the end of each iteration, these support counts are copied from the arrays to the hash tree, and a master processor computes the large itemsets and generates the candidates for the next iteration.

Name              T    I    Partition Size
D(1024n)K.T5.I    5         MB
D(512n)K.T8.I     8         MB
D(512n)K.T10.I    10        MB
D(512n)K.T10.I    10        MB
D(256n)K.T20.I    20        MB

Table 3: Test Databases (the I values and the partition sizes did not survive transcription)

Five series of databases with different attributes have been generated for the studies; their attributes are summarized in Table 3. The names of the databases have the form Dx.Ty.Iz, where x is the total number of transactions, y is the average size of the transactions, and z is the average size of the itemsets. In Table 3, the total number of transactions is specified in terms of n, the number of partitions, and the number of transactions in a partition. This allows us to compare the performance on databases with the same characteristics but different numbers of partitions. We set the parameters of Table 2 to the following values: N = 1000, L = 1000, S = 0.7, ρ = 8, and correlation level = 0.5.

We ran CD, APM-DIC, APM-AIC and APM on the five series of databases. The minimum support threshold is 1% in the first four series and 2% in the last series. Figure 5 shows the performance of these algorithms. The ratios of the response times versus that of CD are shown in Table 4. APM is superior to CD in all cases: when n = 12, it is 2-5 times faster; when n = 8, it is about 3 times faster; when n = 1 (no partitioning), it is still about 2 times faster, except in one single case. When comparing APM-DIC with CD, we found that the gain from pure dynamic generation is not impressive. When n = 1, APM-DIC is equivalent to the serial algorithm DIC, and the results show that it performs worse than CD, as predicted. This shows that DIC without optimization cannot bring much gain in general. When we compared APM-AIC with APM-DIC, we found a significant effect from intra-partition interval configuration in some cases; however, the gain is not uniform over all the cases. In some cases, the skewness of the data distribution among the partitions caused APM-AIC to generate many unnecessary candidates. Lastly, when we compared APM with APM-AIC, we found that the effects of inter-partition interval configuration and the initial candidate set reduction produce a substantial gain consistently.


More information

Association Rules. A. Bellaachia Page: 1

Association Rules. A. Bellaachia Page: 1 Association Rules 1. Objectives... 2 2. Definitions... 2 3. Type of Association Rules... 7 4. Frequent Itemset generation... 9 5. Apriori Algorithm: Mining Single-Dimension Boolean AR 13 5.1. Join Step:...

More information

DATA MINING II - 1DL460

DATA MINING II - 1DL460 DATA MINING II - 1DL460 Spring 2013 " An second class in data mining http://www.it.uu.se/edu/course/homepage/infoutv2/vt13 Kjell Orsborn Uppsala Database Laboratory Department of Information Technology,

More information

Model for Load Balancing on Processors in Parallel Mining of Frequent Itemsets

Model for Load Balancing on Processors in Parallel Mining of Frequent Itemsets American Journal of Applied Sciences 2 (5): 926-931, 2005 ISSN 1546-9239 Science Publications, 2005 Model for Load Balancing on Processors in Parallel Mining of Frequent Itemsets 1 Ravindra Patel, 2 S.S.

More information

2 CONTENTS

2 CONTENTS Contents 5 Mining Frequent Patterns, Associations, and Correlations 3 5.1 Basic Concepts and a Road Map..................................... 3 5.1.1 Market Basket Analysis: A Motivating Example........................

More information

Apriori Algorithm. 1 Bread, Milk 2 Bread, Diaper, Beer, Eggs 3 Milk, Diaper, Beer, Coke 4 Bread, Milk, Diaper, Beer 5 Bread, Milk, Diaper, Coke

Apriori Algorithm. 1 Bread, Milk 2 Bread, Diaper, Beer, Eggs 3 Milk, Diaper, Beer, Coke 4 Bread, Milk, Diaper, Beer 5 Bread, Milk, Diaper, Coke Apriori Algorithm For a given set of transactions, the main aim of Association Rule Mining is to find rules that will predict the occurrence of an item based on the occurrences of the other items in the

More information

Parallelizing Frequent Itemset Mining with FP-Trees

Parallelizing Frequent Itemset Mining with FP-Trees Parallelizing Frequent Itemset Mining with FP-Trees Peiyi Tang Markus P. Turkia Department of Computer Science Department of Computer Science University of Arkansas at Little Rock University of Arkansas

More information

INTELLIGENT SUPERMARKET USING APRIORI

INTELLIGENT SUPERMARKET USING APRIORI INTELLIGENT SUPERMARKET USING APRIORI Kasturi Medhekar 1, Arpita Mishra 2, Needhi Kore 3, Nilesh Dave 4 1,2,3,4Student, 3 rd year Diploma, Computer Engineering Department, Thakur Polytechnic, Mumbai, Maharashtra,

More information

CHAPTER V ADAPTIVE ASSOCIATION RULE MINING ALGORITHM. Please purchase PDF Split-Merge on to remove this watermark.

CHAPTER V ADAPTIVE ASSOCIATION RULE MINING ALGORITHM. Please purchase PDF Split-Merge on   to remove this watermark. 119 CHAPTER V ADAPTIVE ASSOCIATION RULE MINING ALGORITHM 120 CHAPTER V ADAPTIVE ASSOCIATION RULE MINING ALGORITHM 5.1. INTRODUCTION Association rule mining, one of the most important and well researched

More information

Performance Evaluation of Sequential and Parallel Mining of Association Rules using Apriori Algorithms

Performance Evaluation of Sequential and Parallel Mining of Association Rules using Apriori Algorithms Int. J. Advanced Networking and Applications 458 Performance Evaluation of Sequential and Parallel Mining of Association Rules using Apriori Algorithms Puttegowda D Department of Computer Science, Ghousia

More information

Mining Association Rules with Item Constraints. Ramakrishnan Srikant and Quoc Vu and Rakesh Agrawal. IBM Almaden Research Center

Mining Association Rules with Item Constraints. Ramakrishnan Srikant and Quoc Vu and Rakesh Agrawal. IBM Almaden Research Center Mining Association Rules with Item Constraints Ramakrishnan Srikant and Quoc Vu and Rakesh Agrawal IBM Almaden Research Center 650 Harry Road, San Jose, CA 95120, U.S.A. fsrikant,qvu,ragrawalg@almaden.ibm.com

More information

An On-line Variable Length Binary. Institute for Systems Research and. Institute for Advanced Computer Studies. University of Maryland

An On-line Variable Length Binary. Institute for Systems Research and. Institute for Advanced Computer Studies. University of Maryland An On-line Variable Length inary Encoding Tinku Acharya Joseph F. Ja Ja Institute for Systems Research and Institute for Advanced Computer Studies University of Maryland College Park, MD 242 facharya,

More information

CHAPTER 3 ASSOCIATION RULE MINING WITH LEVELWISE AUTOMATIC SUPPORT THRESHOLDS

CHAPTER 3 ASSOCIATION RULE MINING WITH LEVELWISE AUTOMATIC SUPPORT THRESHOLDS 23 CHAPTER 3 ASSOCIATION RULE MINING WITH LEVELWISE AUTOMATIC SUPPORT THRESHOLDS This chapter introduces the concepts of association rule mining. It also proposes two algorithms based on, to calculate

More information

PARALLEL EXECUTION OF HASH JOINS IN PARALLEL DATABASES. Hui-I Hsiao, Ming-Syan Chen and Philip S. Yu. Electrical Engineering Department.

PARALLEL EXECUTION OF HASH JOINS IN PARALLEL DATABASES. Hui-I Hsiao, Ming-Syan Chen and Philip S. Yu. Electrical Engineering Department. PARALLEL EXECUTION OF HASH JOINS IN PARALLEL DATABASES Hui-I Hsiao, Ming-Syan Chen and Philip S. Yu IBM T. J. Watson Research Center P.O.Box 704 Yorktown, NY 10598, USA email: fhhsiao, psyug@watson.ibm.com

More information

PARALLEL & DISTRIBUTED DATABASES CS561-SPRING 2012 WPI, MOHAMED ELTABAKH

PARALLEL & DISTRIBUTED DATABASES CS561-SPRING 2012 WPI, MOHAMED ELTABAKH PARALLEL & DISTRIBUTED DATABASES CS561-SPRING 2012 WPI, MOHAMED ELTABAKH 1 INTRODUCTION In centralized database: Data is located in one place (one server) All DBMS functionalities are done by that server

More information

Egemen Tanin, Tahsin M. Kurc, Cevdet Aykanat, Bulent Ozguc. Abstract. Direct Volume Rendering (DVR) is a powerful technique for

Egemen Tanin, Tahsin M. Kurc, Cevdet Aykanat, Bulent Ozguc. Abstract. Direct Volume Rendering (DVR) is a powerful technique for Comparison of Two Image-Space Subdivision Algorithms for Direct Volume Rendering on Distributed-Memory Multicomputers Egemen Tanin, Tahsin M. Kurc, Cevdet Aykanat, Bulent Ozguc Dept. of Computer Eng. and

More information

FP-Growth algorithm in Data Compression frequent patterns

FP-Growth algorithm in Data Compression frequent patterns FP-Growth algorithm in Data Compression frequent patterns Mr. Nagesh V Lecturer, Dept. of CSE Atria Institute of Technology,AIKBS Hebbal, Bangalore,Karnataka Email : nagesh.v@gmail.com Abstract-The transmission

More information

(Preliminary Version 2 ) Jai-Hoon Kim Nitin H. Vaidya. Department of Computer Science. Texas A&M University. College Station, TX

(Preliminary Version 2 ) Jai-Hoon Kim Nitin H. Vaidya. Department of Computer Science. Texas A&M University. College Station, TX Towards an Adaptive Distributed Shared Memory (Preliminary Version ) Jai-Hoon Kim Nitin H. Vaidya Department of Computer Science Texas A&M University College Station, TX 77843-3 E-mail: fjhkim,vaidyag@cs.tamu.edu

More information

Mining Frequent Itemsets in Time-Varying Data Streams

Mining Frequent Itemsets in Time-Varying Data Streams Mining Frequent Itemsets in Time-Varying Data Streams Abstract A transactional data stream is an unbounded sequence of transactions continuously generated, usually at a high rate. Mining frequent itemsets

More information

16 Greedy Algorithms

16 Greedy Algorithms 16 Greedy Algorithms Optimization algorithms typically go through a sequence of steps, with a set of choices at each For many optimization problems, using dynamic programming to determine the best choices

More information

Cluster quality 15. Running time 0.7. Distance between estimated and true means Running time [s]

Cluster quality 15. Running time 0.7. Distance between estimated and true means Running time [s] Fast, single-pass K-means algorithms Fredrik Farnstrom Computer Science and Engineering Lund Institute of Technology, Sweden arnstrom@ucsd.edu James Lewis Computer Science and Engineering University of

More information

6 Distributed data management I Hashing

6 Distributed data management I Hashing 6 Distributed data management I Hashing There are two major approaches for the management of data in distributed systems: hashing and caching. The hashing approach tries to minimize the use of communication

More information

Extra-High Speed Matrix Multiplication on the Cray-2. David H. Bailey. September 2, 1987

Extra-High Speed Matrix Multiplication on the Cray-2. David H. Bailey. September 2, 1987 Extra-High Speed Matrix Multiplication on the Cray-2 David H. Bailey September 2, 1987 Ref: SIAM J. on Scientic and Statistical Computing, vol. 9, no. 3, (May 1988), pg. 603{607 Abstract The Cray-2 is

More information

What Is Data Mining? CMPT 354: Database I -- Data Mining 2

What Is Data Mining? CMPT 354: Database I -- Data Mining 2 Data Mining What Is Data Mining? Mining data mining knowledge Data mining is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data CMPT

More information

New Parallel Algorithms for Frequent Itemset Mining in Very Large Databases

New Parallel Algorithms for Frequent Itemset Mining in Very Large Databases New Parallel Algorithms for Frequent Itemset Mining in Very Large Databases Adriano Veloso and Wagner Meira Jr. Computer Science Dept. Universidade Federal de Minas Gerais adrianov, meira @dcc.ufmg.br

More information

Seminar on. A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm

Seminar on. A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm Seminar on A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm Mohammad Iftakher Uddin & Mohammad Mahfuzur Rahman Matrikel Nr: 9003357 Matrikel Nr : 9003358 Masters of

More information

An Efficient Algorithm for Finding Dense Regions for Mining Quantitative Association Rules

An Efficient Algorithm for Finding Dense Regions for Mining Quantitative Association Rules An Efficient Algorithm for Finding Dense Regions for Mining Quantitative Association Rules Wang Lian David W. Cheung S. M. Yiu Faculty of Information Technology Macao University of Science and Technology

More information

Data Mining: Mining Association Rules. Definitions. .. Cal Poly CSC 466: Knowledge Discovery from Data Alexander Dekhtyar..

Data Mining: Mining Association Rules. Definitions. .. Cal Poly CSC 466: Knowledge Discovery from Data Alexander Dekhtyar.. .. Cal Poly CSC 466: Knowledge Discovery from Data Alexander Dekhtyar.. Data Mining: Mining Association Rules Definitions Market Baskets. Consider a set I = {i 1,...,i m }. We call the elements of I, items.

More information

Knowledge Discovery from Web Usage Data: Research and Development of Web Access Pattern Tree Based Sequential Pattern Mining Techniques: A Survey

Knowledge Discovery from Web Usage Data: Research and Development of Web Access Pattern Tree Based Sequential Pattern Mining Techniques: A Survey Knowledge Discovery from Web Usage Data: Research and Development of Web Access Pattern Tree Based Sequential Pattern Mining Techniques: A Survey G. Shivaprasad, N. V. Subbareddy and U. Dinesh Acharya

More information

COURSE 12. Parallel DBMS

COURSE 12. Parallel DBMS COURSE 12 Parallel DBMS 1 Parallel DBMS Most DB research focused on specialized hardware CCD Memory: Non-volatile memory like, but slower than flash memory Bubble Memory: Non-volatile memory like, but

More information

Finding Frequent Patterns Using Length-Decreasing Support Constraints

Finding Frequent Patterns Using Length-Decreasing Support Constraints Finding Frequent Patterns Using Length-Decreasing Support Constraints Masakazu Seno and George Karypis Department of Computer Science and Engineering University of Minnesota, Minneapolis, MN 55455 Technical

More information

Chapter 7: Frequent Itemsets and Association Rules

Chapter 7: Frequent Itemsets and Association Rules Chapter 7: Frequent Itemsets and Association Rules Information Retrieval & Data Mining Universität des Saarlandes, Saarbrücken Winter Semester 2013/14 VII.1&2 1 Motivational Example Assume you run an on-line

More information

Eect of fan-out on the Performance of a. Single-message cancellation scheme. Atul Prakash (Contact Author) Gwo-baw Wu. Seema Jetli

Eect of fan-out on the Performance of a. Single-message cancellation scheme. Atul Prakash (Contact Author) Gwo-baw Wu. Seema Jetli Eect of fan-out on the Performance of a Single-message cancellation scheme Atul Prakash (Contact Author) Gwo-baw Wu Seema Jetli Department of Electrical Engineering and Computer Science University of Michigan,

More information

SETM*-MaxK: An Efficient SET-Based Approach to Find the Largest Itemset

SETM*-MaxK: An Efficient SET-Based Approach to Find the Largest Itemset SETM*-MaxK: An Efficient SET-Based Approach to Find the Largest Itemset Ye-In Chang and Yu-Ming Hsieh Dept. of Computer Science and Engineering National Sun Yat-Sen University Kaohsiung, Taiwan, Republic

More information

Mining Frequent Patterns with Counting Inference at Multiple Levels

Mining Frequent Patterns with Counting Inference at Multiple Levels International Journal of Computer Applications (097 7) Volume 3 No.10, July 010 Mining Frequent Patterns with Counting Inference at Multiple Levels Mittar Vishav Deptt. Of IT M.M.University, Mullana Ruchika

More information

A DISTRIBUTED ALGORITHM FOR MINING ASSOCIATION RULES

A DISTRIBUTED ALGORITHM FOR MINING ASSOCIATION RULES A DISTRIBUTED ALGORITHM FOR MINING ASSOCIATION RULES Pham Nguyen Anh Huy *, Ho Tu Bao ** * Department of Information Technology, Natural Sciences University of HoChiMinh city 227 Nguyen Van Cu Street,

More information

Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R,

Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R, Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R, statistics foundations 5 Introduction to D3, visual analytics

More information

Parallel Approach for Implementing Data Mining Algorithms

Parallel Approach for Implementing Data Mining Algorithms TITLE OF THE THESIS Parallel Approach for Implementing Data Mining Algorithms A RESEARCH PROPOSAL SUBMITTED TO THE SHRI RAMDEOBABA COLLEGE OF ENGINEERING AND MANAGEMENT, FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

More information

Scalable Parallel Data Mining for Association Rules

Scalable Parallel Data Mining for Association Rules IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 12, NO. 3, MAY/JUNE 2000 337 Scalable arallel Data Mining for Association Rules Eui-Hong (Sam) Han, George Karypis, Member, IEEE, and Vipin Kumar,

More information

A Study of Query Execution Strategies. for Client-Server Database Systems. Department of Computer Science and UMIACS. University of Maryland

A Study of Query Execution Strategies. for Client-Server Database Systems. Department of Computer Science and UMIACS. University of Maryland A Study of Query Execution Strategies for Client-Server Database Systems Donald Kossmann Michael J. Franklin Department of Computer Science and UMIACS University of Maryland College Park, MD 20742 f kossmann

More information

Parallel DBMS. Chapter 22, Part A

Parallel DBMS. Chapter 22, Part A Parallel DBMS Chapter 22, Part A Slides by Joe Hellerstein, UCB, with some material from Jim Gray, Microsoft Research. See also: http://www.research.microsoft.com/research/barc/gray/pdb95.ppt Database

More information

Chapter 2. Related Work

Chapter 2. Related Work Chapter 2 Related Work There are three areas of research highly related to our exploration in this dissertation, namely sequential pattern mining, multiple alignment, and approximate frequent pattern mining.

More information

Dynamic Itemset Counting and Implication Rules For Market Basket Data

Dynamic Itemset Counting and Implication Rules For Market Basket Data Dynamic Itemset Counting and Implication Rules For Market Basket Data Sergey Brin, Rajeev Motwani, Jeffrey D. Ullman, Shalom Tsur SIGMOD'97, pp. 255-264, Tuscon, Arizona, May 1997 11/10/00 Introduction

More information

Pincer-Search: An Efficient Algorithm. for Discovering the Maximum Frequent Set

Pincer-Search: An Efficient Algorithm. for Discovering the Maximum Frequent Set Pincer-Search: An Efficient Algorithm for Discovering the Maximum Frequent Set Dao-I Lin Telcordia Technologies, Inc. Zvi M. Kedem New York University July 15, 1999 Abstract Discovering frequent itemsets

More information

CMPUT 391 Database Management Systems. Data Mining. Textbook: Chapter (without 17.10)

CMPUT 391 Database Management Systems. Data Mining. Textbook: Chapter (without 17.10) CMPUT 391 Database Management Systems Data Mining Textbook: Chapter 17.7-17.11 (without 17.10) University of Alberta 1 Overview Motivation KDD and Data Mining Association Rules Clustering Classification

More information

ANU MLSS 2010: Data Mining. Part 2: Association rule mining

ANU MLSS 2010: Data Mining. Part 2: Association rule mining ANU MLSS 2010: Data Mining Part 2: Association rule mining Lecture outline What is association mining? Market basket analysis and association rule examples Basic concepts and formalism Basic rule measurements

More information

Timeline: Obj A: Obj B: Object timestamp events A A A B B B B 2, 3, 5 6, 1 4, 5, 6. Obj D: 7, 8, 1, 2 1, 6 D 14 1, 8, 7. (a) (b)

Timeline: Obj A: Obj B: Object timestamp events A A A B B B B 2, 3, 5 6, 1 4, 5, 6. Obj D: 7, 8, 1, 2 1, 6 D 14 1, 8, 7. (a) (b) Parallel Algorithms for Mining Sequential Associations: Issues and Challenges Mahesh V. Joshi y George Karypis y Vipin Kumar y Abstract Discovery of predictive sequential associations among events is becoming

More information

The only known methods for solving this problem optimally are enumerative in nature, with branch-and-bound being the most ecient. However, such algori

The only known methods for solving this problem optimally are enumerative in nature, with branch-and-bound being the most ecient. However, such algori Use of K-Near Optimal Solutions to Improve Data Association in Multi-frame Processing Aubrey B. Poore a and in Yan a a Department of Mathematics, Colorado State University, Fort Collins, CO, USA ABSTRACT

More information

Raunak Rathi 1, Prof. A.V.Deorankar 2 1,2 Department of Computer Science and Engineering, Government College of Engineering Amravati

Raunak Rathi 1, Prof. A.V.Deorankar 2 1,2 Department of Computer Science and Engineering, Government College of Engineering Amravati Analytical Representation on Secure Mining in Horizontally Distributed Database Raunak Rathi 1, Prof. A.V.Deorankar 2 1,2 Department of Computer Science and Engineering, Government College of Engineering

More information

APRIORI ALGORITHM FOR MINING FREQUENT ITEMSETS A REVIEW

APRIORI ALGORITHM FOR MINING FREQUENT ITEMSETS A REVIEW International Journal of Computer Application and Engineering Technology Volume 3-Issue 3, July 2014. Pp. 232-236 www.ijcaet.net APRIORI ALGORITHM FOR MINING FREQUENT ITEMSETS A REVIEW Priyanka 1 *, Er.

More information

Cost Models for Query Processing Strategies in the Active Data Repository

Cost Models for Query Processing Strategies in the Active Data Repository Cost Models for Query rocessing Strategies in the Active Data Repository Chialin Chang Institute for Advanced Computer Studies and Department of Computer Science University of Maryland, College ark 272

More information

Advanced Databases: Parallel Databases A.Poulovassilis

Advanced Databases: Parallel Databases A.Poulovassilis 1 Advanced Databases: Parallel Databases A.Poulovassilis 1 Parallel Database Architectures Parallel database systems use parallel processing techniques to achieve faster DBMS performance and handle larger

More information

AC-Close: Efficiently Mining Approximate Closed Itemsets by Core Pattern Recovery

AC-Close: Efficiently Mining Approximate Closed Itemsets by Core Pattern Recovery : Efficiently Mining Approximate Closed Itemsets by Core Pattern Recovery Hong Cheng Philip S. Yu Jiawei Han University of Illinois at Urbana-Champaign IBM T. J. Watson Research Center {hcheng3, hanj}@cs.uiuc.edu,

More information

Appropriate Item Partition for Improving the Mining Performance

Appropriate Item Partition for Improving the Mining Performance Appropriate Item Partition for Improving the Mining Performance Tzung-Pei Hong 1,2, Jheng-Nan Huang 1, Kawuu W. Lin 3 and Wen-Yang Lin 1 1 Department of Computer Science and Information Engineering National

More information

ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation

ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation Weiping Liao, Saengrawee (Anne) Pratoomtong, and Chuan Zhang Abstract Binary translation is an important component for translating

More information

Huge market -- essentially all high performance databases work this way

Huge market -- essentially all high performance databases work this way 11/5/2017 Lecture 16 -- Parallel & Distributed Databases Parallel/distributed databases: goal provide exactly the same API (SQL) and abstractions (relational tables), but partition data across a bunch

More information

3 No-Wait Job Shops with Variable Processing Times

3 No-Wait Job Shops with Variable Processing Times 3 No-Wait Job Shops with Variable Processing Times In this chapter we assume that, on top of the classical no-wait job shop setting, we are given a set of processing times for each operation. We may select

More information

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets 2011 Fourth International Symposium on Parallel Architectures, Algorithms and Programming PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets Tao Xiao Chunfeng Yuan Yihua Huang Department

More information

CHAPTER 5 WEIGHTED SUPPORT ASSOCIATION RULE MINING USING CLOSED ITEMSET LATTICES IN PARALLEL

CHAPTER 5 WEIGHTED SUPPORT ASSOCIATION RULE MINING USING CLOSED ITEMSET LATTICES IN PARALLEL 68 CHAPTER 5 WEIGHTED SUPPORT ASSOCIATION RULE MINING USING CLOSED ITEMSET LATTICES IN PARALLEL 5.1 INTRODUCTION During recent years, one of the vibrant research topics is Association rule discovery. This

More information

DESIGN AND ANALYSIS OF ALGORITHMS. Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES

DESIGN AND ANALYSIS OF ALGORITHMS. Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES DESIGN AND ANALYSIS OF ALGORITHMS Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES http://milanvachhani.blogspot.in USE OF LOOPS As we break down algorithm into sub-algorithms, sooner or later we shall

More information

Decision Making. final results. Input. Update Utility

Decision Making. final results. Input. Update Utility Active Handwritten Word Recognition Jaehwa Park and Venu Govindaraju Center of Excellence for Document Analysis and Recognition Department of Computer Science and Engineering State University of New York

More information

Towards a Memory-Efficient Knapsack DP Algorithm

Towards a Memory-Efficient Knapsack DP Algorithm Towards a Memory-Efficient Knapsack DP Algorithm Sanjay Rajopadhye The 0/1 knapsack problem (0/1KP) is a classic problem that arises in computer science. The Wikipedia entry http://en.wikipedia.org/wiki/knapsack_problem

More information

Pushing Support Constraints Into Association Rules Mining. Ke Wang y. Simon Fraser University. Yu He z. National University of Singapore.

Pushing Support Constraints Into Association Rules Mining. Ke Wang y. Simon Fraser University. Yu He z. National University of Singapore. Pushing Support Constraints Into Association Rules Mining Ke Wang y Simon Fraser University Yu He z National University of Singapore Jiawei Han x Simon Fraser University Abstract Interesting patterns often

More information

Parallel Computing. Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides)

Parallel Computing. Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides) Parallel Computing 2012 Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides) Parallel Algorithm Design Outline Computational Model Design Methodology Partitioning Communication

More information

International Journal of Foundations of Computer Science c World Scientic Publishing Company DFT TECHNIQUES FOR SIZE ESTIMATION OF DATABASE JOIN OPERA

International Journal of Foundations of Computer Science c World Scientic Publishing Company DFT TECHNIQUES FOR SIZE ESTIMATION OF DATABASE JOIN OPERA International Journal of Foundations of Computer Science c World Scientic Publishing Company DFT TECHNIQUES FOR SIZE ESTIMATION OF DATABASE JOIN OPERATIONS KAM_IL SARAC, OMER E GEC_IO GLU, AMR EL ABBADI

More information

Parallel DBMS. Parallel Database Systems. PDBS vs Distributed DBS. Types of Parallelism. Goals and Metrics Speedup. Types of Parallelism

Parallel DBMS. Parallel Database Systems. PDBS vs Distributed DBS. Types of Parallelism. Goals and Metrics Speedup. Types of Parallelism Parallel DBMS Parallel Database Systems CS5225 Parallel DB 1 Uniprocessor technology has reached its limit Difficult to build machines powerful enough to meet the CPU and I/O demands of DBMS serving large

More information

size, runs an existing induction algorithm on the rst subset to obtain a rst set of rules, and then processes each of the remaining data subsets at a

size, runs an existing induction algorithm on the rst subset to obtain a rst set of rules, and then processes each of the remaining data subsets at a Multi-Layer Incremental Induction Xindong Wu and William H.W. Lo School of Computer Science and Software Ebgineering Monash University 900 Dandenong Road Melbourne, VIC 3145, Australia Email: xindong@computer.org

More information

A mining method for tracking changes in temporal association rules from an encoded database

A mining method for tracking changes in temporal association rules from an encoded database A mining method for tracking changes in temporal association rules from an encoded database Chelliah Balasubramanian *, Karuppaswamy Duraiswamy ** K.S.Rangasamy College of Technology, Tiruchengode, Tamil

More information

Parallel Algorithms for Discovery of Association Rules

Parallel Algorithms for Discovery of Association Rules Data Mining and Knowledge Discovery, 1, 343 373 (1997) c 1997 Kluwer Academic Publishers. Manufactured in The Netherlands. Parallel Algorithms for Discovery of Association Rules MOHAMMED J. ZAKI SRINIVASAN

More information

Data Access Paths for Frequent Itemsets Discovery

Data Access Paths for Frequent Itemsets Discovery Data Access Paths for Frequent Itemsets Discovery Marek Wojciechowski, Maciej Zakrzewicz Poznan University of Technology Institute of Computing Science {marekw, mzakrz}@cs.put.poznan.pl Abstract. A number

More information

Dynamic Itemset Counting and Implication Rules. for Market Basket Data

Dynamic Itemset Counting and Implication Rules. for Market Basket Data Dynamic Itemset Counting and Implication Rules for Market Basket Data Sergey Brin Rajeev Motwani y Jerey D. Ullman z Department of Computer Science Stanford University fsergey,rajeev,ullmang@cs.stanford.edu

More information

Unsupervised Learning

Unsupervised Learning Outline Unsupervised Learning Basic concepts K-means algorithm Representation of clusters Hierarchical clustering Distance functions Which clustering algorithm to use? NN Supervised learning vs. unsupervised

More information

Association Rules Outline

Association Rules Outline Association Rules Outline Goal: Provide an overview of basic Association Rule mining techniques Association Rules Problem Overview Large/Frequent itemsets Association Rules Algorithms Apriori Sampling

More information

Object A A A B B B B D. timestamp events 2, 3, 5 6, 1 1 4, 5, 6 2 7, 8, 1, 2 1, 6 1, 8, 7

Object A A A B B B B D. timestamp events 2, 3, 5 6, 1 1 4, 5, 6 2 7, 8, 1, 2 1, 6 1, 8, 7 A Universal Formulation of Sequential Patterns Mahesh Joshi George Karypis Vipin Kumar Department of Computer Science University of Minnesota, Minneapolis fmjoshi,karypis,kumarg@cs.umn.edu Technical Report

More information

Database Applications (15-415)

Database Applications (15-415) Database Applications (15-415) DBMS Internals- Part V Lecture 13, March 10, 2014 Mohammad Hammoud Today Welcome Back from Spring Break! Today Last Session: DBMS Internals- Part IV Tree-based (i.e., B+

More information

! Parallel machines are becoming quite common and affordable. ! Databases are growing increasingly large

! Parallel machines are becoming quite common and affordable. ! Databases are growing increasingly large Chapter 20: Parallel Databases Introduction! Introduction! I/O Parallelism! Interquery Parallelism! Intraquery Parallelism! Intraoperation Parallelism! Interoperation Parallelism! Design of Parallel Systems!

More information

Chapter 20: Parallel Databases

Chapter 20: Parallel Databases Chapter 20: Parallel Databases! Introduction! I/O Parallelism! Interquery Parallelism! Intraquery Parallelism! Intraoperation Parallelism! Interoperation Parallelism! Design of Parallel Systems 20.1 Introduction!

More information