A Trie-based APRIORI Implementation for Mining Frequent Item Sequences

Size: px

Start display at page:

Download "A Trie-based APRIORI Implementation for Mining Frequent Item Sequences"

Gwen Webster
5 years ago
Views:

1 A Trie-based APRIORI Implementation for Mining Frequent Item Sequences Ferenc Bodon Department of Computer Science and Information Theory, Budapest University of Technology and Economics supervisor: Lajos Rónyai Open Source Data Mining Workshop on Frequent Pattern Mining Implementations p.1/14

2 The hierarchy of the pattern types sequence of itemsets itemset item sequence without duplicates item sequence with duplicates rooted tree tree graph Open Source Data Mining Workshop on Frequent Pattern Mining Implementations p.2/14

3 The hierarchy of the pattern types ordered/unordered sequence of itemsets itemset item sequence without duplicates item sequence with duplicates rooted tree tree graph Open Source Data Mining Workshop on Frequent Pattern Mining Implementations p.2/14

4 The hierarchy of the pattern types ordered/unordered directed/undirected sequence of itemsets itemset item sequence without duplicates item sequence with duplicates rooted tree tree graph Open Source Data Mining Workshop on Frequent Pattern Mining Implementations p.2/14

5 The hierarchy of the pattern types ordered/unordered directed/undirected sequence of itemsets vertex labeled/unlabeled edge labeled/unlabeled itemset item sequence without duplicates item sequence with duplicates rooted tree tree graph Open Source Data Mining Workshop on Frequent Pattern Mining Implementations p.2/14

6 The hierarchy of the pattern types ordered/unordered directed/undirected sequence of itemsets vertex labeled/unlabeled edge labeled/unlabeled itemset item sequence without duplicates item sequence with duplicates rooted tree tree graph At each generalization step we have to examine new problems, that come into play, applicability of the existing techniques at previous level. Open Source Data Mining Workshop on Frequent Pattern Mining Implementations p.2/14

7 The hierarchy of the pattern types ordered/unordered directed/undirected sequence of itemsets vertex labeled/unlabeled edge labeled/unlabeled itemset item sequence without duplicates item sequence with duplicates rooted tree tree graph We provide an efficient, open-source, trie-based APRIORI implementation for mining frequent sequences of items in a transactional database Open Source Data Mining Workshop on Frequent Pattern Mining Implementations p.2/14

8 The background We developed a FIM template library that includes a fast IO framework, some basic functions (e.g. fast subset enumerator), modularized Apriori, eclat, fp-growth, nonord-fp algorithms, tester classes, a benchmark environment. In Apriori of FIM itemset are converted to item sequences, hence it is natural to extend FIM approach. Open Source Data Mining Workshop on Frequent Pattern Mining Implementations p.3/14

9 General Properties of the Trie The representation of the list of edges label ordered list of (label, pointer) pairs, tabular representation (also with offset-index trick), hybrid solution is set by a template class no parent pointers are stored, dead-end branches are removed asap, classes that do support counting, candidate generation are set by template parameters. Open Source Data Mining Workshop on Frequent Pattern Mining Implementations p.4/14

10 Investigated techniques 1. Routing strategy at the nodes, 2. candidate generation, 3. equisupport extension, 4. filtered transaction caching. Our environment makes it possible to examine each technique separately, and also to examine the effect on each other. Open Source Data Mining Workshop on Frequent Pattern Mining Implementations p.5/14

11 1/4. Routing Strategy How to find the edge to follow in Apriori? Given a node with a list of n edges and a part of the filtered transaction (t ), find matching labels. search for corresponding item if transaction is stored in a list: nt comparisons, indexarray solution, search for corresponding label tabular representation of edge: t comparisons, binary search: t log n comparisons, linear search: t n comparisons, intelligent linear search: t n/2 comparisons, simultaneous traversal (merge) first sort, then remove duplicates, first remove duplicates, then sort. Open Source Data Mining Workshop on Frequent Pattern Mining Implementations p.6/14

12 1/4. Routing Strategy No single best routing strategy. The best routing strategy depends on size of transaction, the number of duplicates, min_supp Strategies with large overhead are not competitive at large support thresholds. Search corresponding item performed always good. It is more important how does the strategy suit to the features (prefetching, pipelining, etc.) of the modern processors than the worst case run-times. Open Source Data Mining Workshop on Frequent Pattern Mining Implementations p.7/14

13 2/4. Candidate Generation Same as in FIM, there solutions can be applied 1. complete pruning with subset checks, 2. complete pruning with an intersection-based solution, 3. omit complete pruning. Open Source Data Mining Workshop on Frequent Pattern Mining Implementations p.8/14

14 2/4. Candidate Generation Same as in FIM, there solutions can be applied 1. complete pruning with subset checks, 2. complete pruning with an intersection-based solution, 3. omit complete pruning. Our expectation: Complete pruning in frequent item sequence mining is more important than in FIM. Open Source Data Mining Workshop on Frequent Pattern Mining Implementations p.8/14

15 2/4. Candidate Generation Same as in FIM, there solutions can be applied 1. complete pruning with subset checks, 2. complete pruning with an intersection-based solution, 3. omit complete pruning. Our expectation: Complete pruning in frequent item sequence mining is more important than in FIM. Reasoning: Support counting is slower, while determining if a candidate s subsequence is contained in a trie is achieved as fast as in FIM. Open Source Data Mining Workshop on Frequent Pattern Mining Implementations p.8/14

16 2/4. Candidate Generation Same as in FIM, there solutions can be applied 1. complete pruning with subset checks, 2. complete pruning with an intersection-based solution, 3. omit complete pruning. Our expectation: Complete pruning in frequent item sequence mining is more important than in FIM. Reasoning: Support counting is slower, while determining if a candidate s subsequence is contained in a trie is achieved as fast as in FIM. Experiments: Support our hypothesis. Omitting complete pruning never resulted in a faster Apriori than intersection-based Apriori. Open Source Data Mining Workshop on Frequent Pattern Mining Implementations p.8/14

17 3/4. Equisupport Extension Omitting equisupport extension is the most widely used technique in FIM. Property 1. Let X Y I. If supp(x) = supp(y ) then supp(y Z) = supp(x Z) for any Z I \ Y. Usage: It is not necessary to determine the support of Y Z. Open Source Data Mining Workshop on Frequent Pattern Mining Implementations p.9/14

18 3/4. Equisupport Extension Omitting equisupport extension is the most widely used technique in FIM. Property 1. Let X Y I. If supp(x) = supp(y ) then supp(y Z) = supp(x Z) for any Z I \ Y. Usage: It is not necessary to determine the support of Y Z Database: connect NOPRUNE INTERSECT-PRUNE NOPRUNE-NEE time (sec) Support threshold Open Source Data Mining Workshop on Frequent Pattern Mining Implementations p.9/14

19 3/4. Equisupport Extension Omitting equisupport extension is the most widely used technique in FIM. Property 1. Let X Y I. If supp(x) = supp(y ) then supp(y Z) = supp(x Z) for any Z I \ Y. Usage: It is not necessary to determine the support of Y Z. Equisupport pruning in Apriori for FIM: 1. in infrequent candidate removal phase: X = Y 1 and X is prefix of Y, then do not extend Y, 2. in candidate generation s subset check phase: Y Z is potential candidate, Y is non-prefix of Y Z, X is prefix of Y then the support of Y Z is not determined. Open Source Data Mining Workshop on Frequent Pattern Mining Implementations p.9/14

20 3/4. Equisupport Extension Omitting equisupport extension is the most widely used technique in FIM. Property 1. Let X Y I. If supp(x) = supp(y ) then supp(y Z) = supp(x Z) for any Z I \ Y. Usage: It is not necessary to determine the support of Y Z. Equisupport pruning in Apriori for FIM: 1. in infrequent candidate removal phase: X = Y 1 and X is prefix of Y, then do not extend Y, 2. in candidate generation s subset check phase: Y Z is potential candidate, Y is non-prefix of Y Z, X is prefix of Y then the support of Y Z is not determined. Unfortunately, no equisupport extension pruning can be applied in frequent item sequence mining. Open Source Data Mining Workshop on Frequent Pattern Mining Implementations p.9/14

21 3/4. Equisupport Extension Omitting equisupport extension is the most widely used technique in FIM. Property 1. Let X Y I. If supp(x) = supp(y ) then supp(y Z) = supp(x Z) for any Z I \ Y. Usage: It is not necessary to determine the support of Y Z. Equisupport pruning in Apriori for FIM: 1. in infrequent candidate removal phase: X = Y 1 and X is prefix of Y, then do not extend Y, Proof. By contradiction. Let the database T be { A, B, A }. Then supp( ) = supp( A ) = 2 but supp( B ) = 1 0 = supp( A, B ). Unfortunately, no equisupport extension pruning can be applied in frequent item sequence mining. Open Source Data Mining Workshop on Frequent Pattern Mining Implementations p.9/14

22 4/4. Filtered transaction caching Let t be a transaction. filtered t: infrequent items are removed from t. Collect the same filtered transaction and store them in memory. Advantages: IO cost is reduced, Open Source Data Mining Workshop on Frequent Pattern Mining Implementations p.10/14

23 4/4. Filtered transaction caching Let t be a transaction. filtered t: infrequent items are removed from t. Collect the same filtered transaction and store them in memory. Advantages: IO cost is reduced, parsing costs are reduced, Open Source Data Mining Workshop on Frequent Pattern Mining Implementations p.10/14

24 4/4. Filtered transaction caching Let t be a transaction. filtered t: infrequent items are removed from t. Collect the same filtered transaction and store them in memory. Advantages: IO cost is reduced, parsing costs are reduced, the number of support count method calls is reduced. Open Source Data Mining Workshop on Frequent Pattern Mining Implementations p.10/14

25 4/4. Filtered transaction caching Let t be a transaction. filtered t: infrequent items are removed from t. Collect the same filtered transaction and store them in memory. Advantages: IO cost is reduced, parsing costs are reduced, the number of support count method calls is reduced. Disadvantage: needs extra memory. requires CPU time to collect the same filtered transactions Possible data structures: ordered list, trie, red-black tree, patricia tree Open Source Data Mining Workshop on Frequent Pattern Mining Implementations p.10/14

26 4/4. Filtered transaction caching Expectation: Transaction caching is not such an effective technique as it is in FIM. Reasoning: For two item sequences to be equal, not just the items, but the indices of the same items have to be equal as well. Experiments: never resulted in a faster algorithm, sometimes it significantly increases memory need. Open Source Data Mining Workshop on Frequent Pattern Mining Implementations p.11/14

27 Is our implementation competitive? Database: BMS-POS Database: kosarak100_infty time (sec) Apriori PrefixSpan time (sec) Apriori PrefixSpan Support threshold Support threshold Database: kosarak2_10_2 Database: kosarak2_100_infty time (sec) Apriori PrefixSpan time (sec) Apriori PrefixSpan Support threshold Support threshold Run-times Apriori vs. prefixspan by Taku Kudo Open Source Data Mining Workshop on Frequent Pattern Mining Implementations p.12/14

28 Is our implementation competitive? Database: BMS-POS Database: kosarak100_infty mem (MB) Apriori PrefixSpan mem (MB) Apriori PrefixSpan Support threshold Support threshold Database: kosarak2_10_2 Database: kosarak2_100_infty mem (MB) Apriori PrefixSpan mem (MB) Apriori PrefixSpan Support threshold Support threshold Memory needs Apriori vs. prefixspan by Taku Kudo Open Source Data Mining Workshop on Frequent Pattern Mining Implementations p.13/14

29 Conclusion When develping solutions for frequent item sequence mining it is useful to understand techniques used in FIM. Thank you for your attention! Open Source Data Mining Workshop on Frequent Pattern Mining Implementations p.14/14

The Relation of Closed Itemset Mining, Complete Pruning Strategies and Item Ordering in Apriori-based FIM algorithms (Extended version)

The Relation of Closed Itemset Mining, Complete Pruning Strategies and Item Ordering in Apriori-based FIM algorithms (Extended version) Ferenc Bodon 1 and Lars Schmidt-Thieme 2 1 Department of Computer