SS-FIM: Single Scan for Frequent Itemsets Mining in Transactional Databases


Youcef Djenouri¹, Marco Comuzzi¹(✉), and Djamel Djenouri²

¹ Ulsan National Institute of Science and Technology, Ulsan, Republic of Korea
{ydjenouri,mcomuzzi}@unist.ac.kr
² DTISI, CERIST Center Research, Algiers, Algeria
ddjenouri@acm.org

Abstract. This paper addresses the mining of frequent itemsets in a transactional database, with the aim of extracting hidden patterns from the database. Two major limitations of the Apriori algorithm are tackled: (i) the scan of the entire database at each pass to calculate the support of all generated itemsets, and (ii) its high sensitivity to variations of the user-defined minimum support threshold. To deal with these limitations, a novel approach is proposed, called Single Scan Frequent Itemsets Mining (SS-FIM), which requires a single scan of the transactional database to extract the frequent itemsets. Its distinctive feature is that it generates a fixed number of candidate itemsets, independently of the minimum support threshold, which intuitively reduces the runtime cost for large databases. SS-FIM is compared with Apriori using several standard databases. The results confirm the scalability of SS-FIM and clearly show its superiority over Apriori for medium and large databases.

Keywords: Frequent itemsets mining · Apriori heuristic · Support computing

1 Introduction

Frequent Itemsets Mining (FIM) aims to extract highly correlated items from a large transactional database. It is defined as follows. Let $T = \{T_1, T_2, \ldots, T_m\}$ be a transactional database of $m$ transactions, and $I = \{I_1, I_2, \ldots, I_n\}$ a set of $n$ different items or attributes. An itemset $X$ is a subset of the set of items ($X \subseteq I$). The support of $X$ is the number of transactions that contain $X$ divided by the number of all transactions in $T$.
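The support definition above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `support` and the toy database `DB` are assumptions made here (the five transactions are the ones the paper uses in its Table 1), and transactions are modeled as `frozenset` objects so that containment is a subset test.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    if not transactions:
        raise ValueError("the database must contain at least one transaction")
    hits = sum(1 for t in transactions if itemset <= t)  # subset test
    return hits / len(transactions)


# Toy database (assumed here for illustration): the transactions of Table 1.
DB = [
    frozenset("ab"),
    frozenset("bcd"),
    frozenset("abc"),
    frozenset("e"),
    frozenset("cde"),
]

print(support(frozenset("ab"), DB))  # 0.4
print(support(frozenset("b"), DB))   # 0.6
```

Note that each call walks the whole database once, which is exactly the per-candidate cost that the rest of the paper tries to avoid.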
The itemset $X$ is called frequent if its support is no less than a user-defined minimum support threshold [1]. Two categories of approaches have been proposed for solving the FIM problem. Approaches in the first category are based on the Apriori heuristic [1]. They first generate the $k$-sized candidate itemsets from the $(k-1)$-sized frequent itemsets and then test the frequency of the generated candidate itemsets.

© Springer International Publishing AG 2017. J. Kim et al. (Eds.): PAKDD 2017, Part II, LNAI 10235.

Approaches in the second category are based on the FPgrowth heuristic [2]. They compress the transactional database in main memory using an efficient tree structure, then recursively apply the mining process to find the frequent itemsets. Although this second heuristic reduces the number of database scans compared to Apriori, these approaches consume a large amount of memory, particularly when dealing with large database instances.

We propose in this paper a different approach called SS-FIM (Single Scan Frequent Itemsets Mining), which solves the FIM problem with only one scan of the database $T$. In SS-FIM, candidate itemsets are first generated from each transaction and stored in a hash table that maintains information about their support. If an itemset generated from a new transaction already exists in the hash table, its entry counter is simply incremented. Otherwise, a new entry is created with the counter initialized to one. In the end, the frequencies of the itemsets in the hash table are compared to the minimum support to determine which itemsets to retain as frequent.

The proposed approach has been tested on several well-known FIM instances. The results show that SS-FIM outperforms the Apriori heuristic for medium-sized and large databases. They also show the scalability of SS-FIM compared to the Apriori heuristic when varying the minimum support.

The remainder of the paper is organized as follows. Section 2 reviews existing FIM algorithms. In Sect. 3, the Apriori heuristic is presented in detail, followed by the proposed SS-FIM approach in Sect. 4. The performance evaluation is presented in Sect. 5, while Sect. 6 draws the conclusions.

2 Related Work

Deterministic optimal strategies for solving the FIM problem can be divided into two categories.
The first one is the generate and test strategy, where the itemsets are first generated and then their frequency is tested. The second one is the divide and conquer strategy. Solutions based on this strategy compress the database in an efficient tree structure and then recursively apply the mining process to extract the frequent itemsets. In the following, we discuss the existing FIM approaches of both categories in more detail.

The first algorithm we cite within the generate and test category is Apriori, by Agrawal et al. [1]. In this reference algorithm, candidate itemsets are generated incrementally and recursively. To generate candidate itemsets of size $k$, the algorithm computes and combines the frequent itemsets of size $(k-1)$. This process is repeated until an iteration yields an empty set of candidate itemsets. Many FIM algorithms are based on Apriori. The Dynamic Itemsets Counting (DIC) algorithm has been proposed by Brin et al. [3] as a generalization of Apriori where the database is split into $P$ equally sized partitions such that each of them fits in memory. DIC then gathers the support of single items for the first partition. Locally found frequent items are used to generate candidate 2-sized itemsets. Then, the second partition is read to find the support of all current candidates. This process

is repeated for the remaining partitions. DIC terminates if no new candidates are generated from the current partition and all previous candidates have been counted. Mueller [4] has proposed a sequential FIM algorithm that is similar to Apriori, except that it stores candidates in a prefix tree instead of a hash tree. This structure enables fast testing of whether subsets of prospective candidates are frequent or not. However, both candidates and frequent itemsets are stored in the same structure, which degrades the performance of the algorithm in terms of memory footprint. Zaki et al. [5] have proposed the Eclat algorithm, which uses vertical tid-lists of itemsets. Frequent $k$-sized itemsets are organized into disjoint equivalence classes by common $(k-1)$-sized prefixes, so that candidate $(k+1)$-sized itemsets can be generated by joining pairs of frequent $k$-sized itemsets from the same class. The support of a candidate itemset can then be computed simply by intersecting the tid-lists of its two component subsets. In [6], a data structure is proposed to store and compress the transactions in an efficient tid-list. With this structure, the number of scans of the transactional database is reduced. However, only regular frequent itemsets can be extracted.

For the divide and conquer strategy, we start with the FPgrowth algorithm [2], which uses a compressed FP-tree structure for mining a complete set of frequent itemsets without candidate itemset generation. The algorithm is divided into two phases: (i) construct an FP-tree that encodes the dataset by reading the database and mapping each transaction onto a path in the FP-tree, while simultaneously counting the support of each item, and (ii) extract frequent itemsets directly from the FP-tree using a bottom-up strategy to find all possible frequent itemsets that end with a particular item. Cerf et al. [8] have proposed the NFP-growth algorithm.
It improves the original FP-growth by constructing an independent head table, which allows the frequent pattern tree to be created only once. This dramatically increases the processing speed. In [7], the authors proposed a new FPGrowth algorithm for mining uncertain data. They develop a tree structure to store uncertain data, in which the occurrence count of a node is at least the sum of the occurrence counts of all its children nodes. This allows the support of each candidate itemset to be counted rapidly. In [9], an FP-array technique that reduces the need to traverse FP-trees is proposed. This structure is adopted to mine several types of frequent itemsets, such as maximal, closed and categorical frequent itemsets. A more detailed survey of most existing FIM algorithms can be found in [10].

The generate and test strategy requires multiple scans of the database to generate all frequent itemsets, whereas the divide and conquer strategy requires only two scans of the database. Divide and conquer approaches, however, are highly memory consuming because of the need to compress the database into a tree structure. Nowadays, transactional databases are very large and may extend to several million transactions [11]. Storing these transactions in an efficient tree structure is a very challenging problem. This makes the divide and conquer approaches inefficient for large transactional databases. Recently, some bio-inspired approaches have been proposed to reduce the number of scans of the transactional database. Among these, we cite BSO-ARM [12], PeARM [13] and

PGARM [14], to quote just a few. These approaches deal with FIM in reasonable time. However, the quality of their mining is limited, i.e., they discover only a part of the frequent itemsets and miss many.

3 Apriori Heuristic

The goal of the Apriori heuristic is to reduce the search space of frequent itemsets by recursively exploring the candidate itemsets. The heuristic relies on the fact that an itemset of size $k$ can be frequent only if all its subsets are frequent. Thus, at each iteration $k$, the candidate itemsets of size $k$ are generated by joining two frequent itemsets of size $(k-1)$. This process is repeated until the set of candidate itemsets of size $k$ is empty. To determine the frequent itemsets at each iteration from the candidates, the support of every candidate itemset is computed. If it is no less than the minimum support threshold, then the candidate is added to the set of frequent itemsets. The support of each itemset $t$ is calculated as the ratio between the number of transactions that contain $t$ and the total number of transactions in $T$, i.e., the frequency of the itemset $t$ in the database $T$. To compute the support of $t$, the entire transactional database $T$ is scanned, and $t$ is checked against each transaction $T_i$. If $t$ is contained in $T_i$, then the numerator of the frequency ratio is incremented by one.

Let us consider the example of a transactional database with 5 transactions $\{T_1, T_2, T_3, T_4, T_5\}$ and 5 items {a, b, c, d, e}, as illustrated in Table 1.

Table 1. Illustrative example of a transactional database
TID   | T1   | T2      | T3      | T4 | T5
Items | a, b | b, c, d | a, b, c | e  | c, d, e

Figure 1 illustrates the results of the Apriori algorithm when applied with the minimum support $\sigma_{sup}$ set to 0.4. The transactional database is first scanned to calculate the support of each candidate itemset of size 1 (candidate itemsets containing only one item). The frequent itemsets of size 1 are then extracted.
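The level-wise procedure described above can be sketched as follows. This is a minimal illustration and not the implementation evaluated in the paper: the names `apriori` and `is_frequent` are assumptions, and the sketch adds the usual subset-pruning step, so its intermediate candidate sets may differ from those listed in the paper's illustration even though the final result is the same.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return all itemsets whose support is no less than min_support."""
    m = len(transactions)

    def is_frequent(candidate):
        # One full pass over the database per candidate, as in the text.
        count = sum(1 for t in transactions if candidate <= t)
        return count / m >= min_support

    items = sorted({item for t in transactions for item in t})
    level = [frozenset([i]) for i in items if is_frequent(frozenset([i]))]
    frequent = list(level)
    k = 2
    while level:
        previous = set(level)
        # Join step: unions of two frequent (k-1)-itemsets that have size k.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in previous for s in combinations(c, k - 1))
        }
        level = [c for c in candidates if is_frequent(c)]
        frequent.extend(level)
        k += 1
    return frequent


# Transactions of Table 1.
DB = [frozenset("ab"), frozenset("bcd"), frozenset("abc"),
      frozenset("e"), frozenset("cde")]

result = apriori(DB, 0.4)
print(sorted("".join(sorted(s)) for s in result))
# ['a', 'ab', 'b', 'bc', 'c', 'cd', 'd', 'e']
```

The inner `is_frequent` helper makes the cost visible: every surviving candidate triggers another scan of `transactions`.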
In this example, all candidate itemsets of size 1 are frequent because their supports are no less than 0.4. In the second iteration, the candidate itemsets of size 2 are generated by joining the frequent itemsets of size 1. The support of each candidate itemset of size 2 is computed and then the frequent itemsets of size 2 are extracted, i.e., {ab, bc, cd}. The itemsets {abc, abd, bcd} are candidates for size 3, but as their support is less than 0.4, they are not retained, and the process terminates. The set of frequent itemsets with support no less than 0.4 is the union of the frequent itemsets of size 1 and 2, that is, {a, b, c, d, e, ab, bc, cd}. The Apriori algorithm has two limitations:

1. Multiple scans of the transactional database are required: To compute the support of candidate itemsets, all existing approaches based on the Apriori heuristic scan the entire transactional database. Thus, the number of database scans is proportional to the number of generated candidate itemsets, which tends to be high for large databases.
2. Setting the user's minimum support threshold is challenging: The Apriori heuristic is very sensitive to variations of the minimum support. When a low minimum support is chosen, a high number of candidate itemsets is obtained, which worsens the runtime of the algorithm, as each candidate itemset requires a scan of the entire database.

Fig. 1. Apriori heuristic illustration

4 Single Scan Frequent Itemset Mining (SS-FIM)

This section presents our proposed algorithm, i.e., Single Scan Frequent Itemset Mining (SS-FIM). The algorithm description is followed by a theoretical analysis of SS-FIM in comparison to the Apriori heuristic.

4.1 SS-FIM Algorithm Description

The aim of SS-FIM is to minimize the number of database scans and the number of generated candidates while discovering frequent itemsets, in order to overcome the limitations of the Apriori heuristic. The main idea of SS-FIM is to generate all possible itemsets for each transaction. If a generated itemset $t$ has already been created when processing a previous transaction, then its support counter is incremented by one. Otherwise, its support counter is created and initialized to one. The process is repeated until all the transactions in the database have been processed. SS-FIM finds all frequent itemsets with a single scan of the transactional database. SS-FIM is also complete, because the frequent itemsets

are extracted directly from the transactional database, and a given itemset is frequent iff it is found at least $\sigma_{sup} \times m$ times in the transactional database. Consequently, no information is lost in the itemset generation process. Algorithm 1 describes SS-FIM in detail.

Algorithm 1. SS-FIM Algorithm
Input: $T$: transactional database. $\sigma_{sup}$: user's minimum support threshold.
Output: $F$: the set of frequent itemsets.
for each transaction $T_i$ do
    $S \leftarrow$ GenerateAllItemsets($T_i$)
    for each itemset $t \in S$ do
        if $t \in h$ then
            $h(t) \leftarrow h(t) + 1$
        else
            $h(t) \leftarrow 1$
        end if
    end for
end for
$F \leftarrow \emptyset$
for each itemset $t \in h$ do
    if $h(t) \geq \sigma_{sup}$ then
        $F \leftarrow F \cup \{t\}$
    end if
end for
return $F$

SS-FIM takes as input the transactional database, $T$, and the minimum support value, $\sigma_{sup}$. It also uses an internal data structure, a hash table $h$, to store all generated itemsets with their partial number of occurrences. The algorithm returns the set of all frequent itemsets, $F$. First, the set of itemsets, $S$, is computed from each transaction in $T$. For instance, if the transaction $T_i$ contains the items a, b, and c, then $S$ contains the itemsets a, b, c, ab, ac, bc, and abc. Afterwards, each itemset $t \in S$ is stored in the hash table $h$. If $t$ already exists as a key in $h$, then the entry with key $t$ in $h$, i.e., $h(t)$, is increased by one. Otherwise, a new entry with key $t$ is created in $h$ and initialized to one. Finally, each entry $t \in h$ whose support is no less than the minimum support $\sigma_{sup}$, i.e., whose counter $h(t)$ is at least $\sigma_{sup} \times m$, is added to the set of frequent itemsets $F$.

4.2 Illustration

Figure 2 shows the execution of the SS-FIM algorithm using the example of Table 1 with $\sigma_{sup}$ set to 0.4. SS-FIM starts by scanning the first transaction {a, b} and extracting from it all possible candidate itemsets, i.e., {a, b, ab}. The hash table, $h$, is empty at this stage, so for each candidate itemset, an entry in $h$ is created and initialized to one.
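Algorithm 1 can be sketched in Python as follows. This is an illustrative reading of the pseudocode rather than the authors' C++ implementation: a `defaultdict` stands in for the hash table $h$, the helper name `generate_all_itemsets` mirrors GenerateAllItemsets, and the final test compares the relative frequency `count / m` with the threshold, which is one assumed way of relating the counter to $\sigma_{sup}$.

```python
from collections import defaultdict
from itertools import combinations


def generate_all_itemsets(transaction):
    """Yield every non-empty subset of one transaction."""
    items = sorted(transaction)
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def ss_fim(transactions, min_support):
    """Single pass over the transactions, counting itemsets in a hash table."""
    h = defaultdict(int)  # itemset -> number of transactions containing it
    m = 0
    for t in transactions:  # the only scan of the database
        m += 1
        for itemset in generate_all_itemsets(t):
            h[itemset] += 1
    # Keep the itemsets whose relative frequency reaches the threshold.
    return {s for s, count in h.items() if count / m >= min_support}


# Transactions of Table 1.
DB = [frozenset("ab"), frozenset("bcd"), frozenset("abc"),
      frozenset("e"), frozenset("cde")]

print(sorted("".join(sorted(s)) for s in ss_fim(DB, 0.4)))
# ['a', 'ab', 'b', 'bc', 'c', 'cd', 'd', 'e']
```

Because every subset of every transaction is enumerated, the work per transaction grows exponentially with its length, so a sketch like this is only practical when transactions are short.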
For the second transaction {b, c, d}, SS-FIM determines all

possible candidate itemsets, i.e., {b, c, d, bc, bd, cd, bcd}. The itemset {b} already exists in $h$, hence its entry is increased by one. As the remaining candidate itemsets are not in $h$, their entries are created and initialized to one. The same process is repeated for all remaining transactions $\{T_3, T_4, T_5\}$. In the end, the itemsets in $h$ with supports no less than 0.4 are selected. The returned set of frequent itemsets in this example is {a, b, c, d, e, ab, bc, cd}, the same result as that of the Apriori heuristic.

Fig. 2. SS-FIM approach illustration

4.3 Theoretical Analysis

The runtime cost of SS-FIM is the sum of (i) the cost of generating itemsets, and (ii) the cost of determining the frequent itemsets. Regarding the former, the number of all candidates generated from a transaction $T_i$ is $2^{|T_i|} - 1$, where $|T_i|$ represents the number of items of $T_i$. The total number of generated candidate itemsets is thus $\sum_{i=1}^{m} (2^{|T_i|} - 1)$, where $m$ is the number of transactions in the database $T$. If $p$ is the maximum number of items per transaction, then the number of candidate itemsets is at most $m(2^p - 1)$. The complexity of the operations needed for the generation of itemsets is then $O(m(2^p - 1))$. For determining the frequent itemsets, the hash table has to be scanned for each candidate itemset, to evaluate its frequency against $\sigma_{sup}$. This operation is $O(m(2^p - 1))$. Consequently, the runtime cost of SS-FIM is:

$$O(2m(2^p - 1)) = O(m 2^p). \qquad (1)$$
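As a quick check of the counting argument, the exact number of generated candidates and its upper bound $m(2^p - 1)$ can be computed for the Table 1 example. The function names below are assumptions introduced for this sketch, and transactions are assumed to be duplicate-free sets.

```python
def candidates_generated(transactions):
    """Exact count: sum over transactions of 2^|T_i| - 1."""
    return sum(2 ** len(t) - 1 for t in transactions)


def candidate_upper_bound(transactions):
    """Upper bound m * (2^p - 1), with p the longest transaction length."""
    m = len(transactions)
    p = max((len(t) for t in transactions), default=0)
    return m * (2 ** p - 1)


# Transactions of Table 1: sizes 2, 3, 3, 1, 3.
DB = [frozenset("ab"), frozenset("bcd"), frozenset("abc"),
      frozenset("e"), frozenset("cde")]

print(candidates_generated(DB))   # 25  (3 + 7 + 7 + 1 + 7)
print(candidate_upper_bound(DB))  # 35  (5 * (2**3 - 1))
```

Neither quantity depends on the minimum support, which is the property the paper relies on when arguing that SS-FIM is insensitive to the threshold.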

According to the theoretical study of Hegland [15], the complexity of the Apriori algorithm is:

$$O(m \times n^2), \qquad (2)$$

where $n$ is the number of items in the database. Although Eq. 1 has an exponential form, while Eq. 2 has a polynomial form, Eq. 1 generally yields lower values than Eq. 2 for most existing transactional databases. In fact, Eq. 1 is exponential with respect to the parameter $p$, that is, the maximum number of items in a transaction, not with respect to the problem size, i.e., the number of transactions in the database. In practice, the value of $p$ is usually much lower than the number of items in the database, $n$. For instance, in the well-known case of supermarket basket analysis, the number of products sold by a supermarket can be several thousand, whereas the average number of products bought by each client hardly exceeds a few dozen.

Table 2. Theoretical runtime complexity comparison of SS-FIM and Apriori using standard databases.
Data set type | Data set name | m | n | p | SS-FIM cost/m | Apriori cost/m
Small         | Bolts         |   |   |   |               |
Small         | Sleep         |   |   |   |               |
Small         | Pollution     |   |   |   |               |
Small         | Basket ball   |   |   |   |               |
Small         | Quake         |   |   |   |               |
Average       | BMS-WebView   |   |   |   |               |
Average       | BMS-WebView   |   |   |   |               |
Average       | retail        |   |   |   |               |
Average       | Connect       |   |   |   |               |
Large         | BMP-POS       |   |   |   |               |

Table 2 presents a comparison between SS-FIM and Apriori using the standard FIM datasets described in [16]. The columns SS-FIM cost/m and Apriori cost/m, in particular, show an estimate of the number of CPU operations required per transaction, based on the theoretical study of the two algorithms presented in this section. The table reveals that for small instances, the Apriori algorithm gives better results than SS-FIM in terms of number of CPU operations. However, for medium and large instances, SS-FIM clearly outperforms Apriori. These results are confirmed by the experimental study presented in the next section. To conclude, SS-FIM is more scalable than Apriori and has a lower computation cost for databases with a medium or large number of items $n$.
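The trade-off between Eq. 1 and Eq. 2 can be explored numerically with a small sketch. The per-transaction estimates below are simply the right-hand sides $2^p$ and $n^2$ of the two equations divided by $m$; the function names and the sample $(n, p)$ pairs are hypothetical values chosen for illustration, not figures taken from Table 2, and asymptotic bounds ignore constant factors, so the comparison is only indicative.

```python
def ss_fim_cost_per_transaction(p):
    """Estimate from Eq. 1 divided by m: 2^p."""
    return 2 ** p


def apriori_cost_per_transaction(n):
    """Estimate from Eq. 2 divided by m: n^2."""
    return n ** 2


def ss_fim_is_cheaper(n, p):
    return ss_fim_cost_per_transaction(p) < apriori_cost_per_transaction(n)


# Hypothetical (n, p) pairs: a small dense database and a large sparse one.
for n, p in [(10, 10), (500, 10)]:
    print(n, p, ss_fim_cost_per_transaction(p),
          apriori_cost_per_transaction(n), ss_fim_is_cheaper(n, p))
# 10 10 1024 100 False
# 500 10 1024 250000 True
```

Under these estimates SS-FIM comes out ahead roughly when $p < 2 \log_2 n$, which matches the paper's observation that it suits databases with many items but short transactions.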

5 Experimental Results

To evaluate the SS-FIM algorithm, several experiments have been carried out using three types of well-known database instances [16]. The first one is a collection of 5 small instances with the number of transactions ranging between 40 and 2178, the number of items ranging between 8 and 16, and the average size of transactions between 8 and 16. The second one is a collection of 4 medium-sized database instances, with number of transactions ranging between and transactions, the number of items between 500 and items, and the average size of transactions between 2 and 10 items. The third type is a large-sized database instance, named BMP-POS, which contains more than transactions and more than 1600 items, with an average transaction size of 2.5. All algorithms in the experiments have been implemented in C++, and the experiments were run on a desktop machine equipped with an Intel i3 processor and 4 GB of memory.

Table 3. Runtime (Sec) of SS-FIM and Apriori using standard databases.
Data set name | SS-FIM | Apriori
Bolts         |        |
Sleep         |        |
Pollution     |        |
Basket ball   |        |
Quake         |        |
BMS-WebView   |        |
BMS-WebView   |        |
retail        |        |
Connect       |        |
BMP-POS       |        |

Table 3 presents the runtime performance of the Apriori heuristic and SS-FIM using the standard FIM datasets described above. This table shows that, for small instances, the Apriori algorithm outperforms SS-FIM. However, for medium and large instances, SS-FIM clearly outperforms Apriori. This result confirms that our approach is better than Apriori when dealing with non-dense and large transactional databases. Apriori, however, outperforms our approach when dealing with dense but small transactional databases.

The second experiment focuses on the sensitivity of both approaches to variations of the minimum support. Figure 3 shows the runtime performance of the

Apriori and SS-FIM approaches using the BMP-POS instance with variable minimum support. As the minimum support decreases from 100% to 10%, the execution time of the Apriori algorithm increases sharply, while that of SS-FIM remains stable. These results confirm that SS-FIM is not sensitive to variations of the minimum support threshold. This can be explained by considering that SS-FIM is a transaction-based approach, in which the number of generated candidate itemsets is fixed regardless of the support given as input. Conversely, the Apriori heuristic is an item-based approach, in which the number of generated candidates increases when the minimum support is reduced.

Fig. 3. Runtime (Sec) of SS-FIM and Apriori approaches for different minimum support (%) using the BMP-POS instance. (Axes: CPU Time (Sec) versus Minimum Support; series: SS-FIM, Apriori.)

6 Conclusions

This paper has proposed SS-FIM, a new frequent itemsets mining algorithm. SS-FIM extracts frequent itemsets with only one scan of the database. Candidate itemsets are first generated from each transaction, and a hash table is used to keep track of the partial frequency of occurrence of candidate itemsets while processing transactions. Both the theoretical and the experimental evaluation reveal that SS-FIM outperforms the Apriori heuristic for large and non-dense database instances. The scalability of SS-FIM has also been demonstrated when varying the minimum support constraint. Motivated by the promising results shown in this paper, we plan to extend SS-FIM to solve domain-specific big data problems, such as in the fields of business intelligence, e.g., process mining based on process event logs, or the Internet of Things, e.g., mining of real-time sensor data.

References

1. Agrawal, R., Imielinski, T., Swami, A.: Mining association rules between sets of items in large databases. In: ACM SIGMOD Record, vol. 22, no. 2. ACM, June
2. Han, J., Pei, J., Yin, Y.: Mining frequent patterns without candidate generation. In: ACM SIGMOD Record, vol. 29, no. 2. ACM, May
3. Brin, S., Motwani, R., Ullman, J.D., Tsur, S.: Dynamic itemset counting and implication rules for market basket data. In: ACM SIGMOD Record, vol. 26, no. 2. ACM, June
4. Mueller, A.: Fast sequential and parallel algorithms for association rule mining: a comparison. Technical report CS-TR-3515, University of Maryland, College Park, August
5. Zaki, M.J., Parthasarathy, S., Ogihara, M., Li, W.: New algorithms for fast discovery of association rules. In: Third International Conference Knowledge Discovery and Data Mining (1997)
6. Amphawan, K., Lenca, P., Surarerks, A.: Efficient mining top-k regular-frequent itemset using compressed tidsets. In: Cao, L., Huang, J.Z., Bailey, J., Koh, Y.S., Luo, J. (eds.) PAKDD. LNCS (LNAI), vol. 7104. Springer, Heidelberg (2012)
7. Leung, C.K.-S., Mateo, M.A.F., Brajczuk, D.A.: A tree-based approach for frequent pattern mining from uncertain data. In: Washio, T., Suzuki, E., Ting, K.M., Inokuchi, A. (eds.) PAKDD. LNCS (LNAI), vol. 5012. Springer, Heidelberg (2008)
8. Cerf, L., Besson, J., Robardet, C., Boulicaut, J.F.: Closed patterns meet n-ary relations. ACM Trans. Knowl. Discov. Data (TKDD) 3(1), 3 (2009)
9. Grahne, G., Zhu, J.: Fast algorithms for frequent itemset mining using FP-trees. IEEE Trans. Knowl. Data Eng. 17(10) (2005)
10. Borgelt, C.: Frequent itemset mining. Wiley Interdisc. Rev.: Data Min. Knowl. Discov. 2(6) (2012)
11. Djenouri, Y., Bendjoudi, A., Mehdi, M., Nouali-Taboudjemat, N., Habbas, Z.: GPU-based bees swarm optimization for association rules mining. J. Supercomput. 71(4) (2015)
12. Djenouri, Y., Drias, H., Habbas, Z.: Bees swarm optimisation using multiple strategies for association rule mining. Int. J. Bio-Inspired Comput. 6(4) (2014)
13. Gheraibia, Y., Moussaoui, A., Djenouri, Y., Kabir, S., Yin, P.Y.: Penguins search optimisation algorithm for association rules mining. CIT J. Comput. Inf. Technol. 24(2) (2016)
14. Luna, J.M., Pechenizkiy, M., Ventura, S.: Mining exceptional relationships with grammar-guided genetic programming. Knowl. Inf. Syst. 47(3) (2016)
15. Hegland, M.: The apriori algorithm tutorial. Math. Comput. Imaging Sci. Inf. Process. 11 (2005)
16. Guvenir, H.A., Uysal, I.: Bilkent university function approximation repository (2000). Accessed 12 Mar 2012
