DATA MINING II - 1DL460, Spring 2013
A second class in data mining
http://www.it.uu.se/edu/course/homepage/infoutv2/vt13
Kjell Orsborn, Uppsala Database Laboratory, Department of Information Technology, Uppsala University, Uppsala, Sweden
4/18/13
Data Mining: Alternative Association Analysis (Tan, Steinbach, Kumar ch. 6)
Kjell Orsborn, Department of Information Technology, Uppsala University, Uppsala, Sweden
Alternative methods for frequent itemset generation
Traversal of the itemset lattice: general-to-specific vs. specific-to-general.
(Figure: itemset lattices from the empty set down to {a1, a2, ..., an}, with the frequent itemset border marked; panels (a) general-to-specific, (b) specific-to-general, (c) bidirectional.)
Alternative methods for frequent itemset generation
Traversing the itemset lattice as a prefix tree or a suffix tree induces different equivalence classes:
Left: prefix tree, with equivalence classes defined by prefixes of length k = 1.
Right: suffix tree, with equivalence classes defined by suffixes of length k = 1.
(Figure: (a) prefix tree and (b) suffix tree over the itemsets A, B, C, D and their combinations up to ABCD.)
Alternative methods for frequent itemset generation
Traversal of the itemset lattice: breadth-first vs. depth-first.
(Figure: (a) breadth-first and (b) depth-first traversal of the lattice.)
Alternative methods for frequent itemset generation
Representation of the database: horizontal vs. vertical data layout.

Horizontal data layout:
TID  Items
1    A,B,E
2    B,C,D
3    C,E
4    A,C,D
5    A,B,C,D
6    A,E
7    A,B
8    A,B,C
9    A,C,D
10   B

Vertical data layout (each item mapped to its list of TIDs):
A: 1,4,5,6,7,8,9
B: 1,2,5,7,8,10
C: 2,3,4,5,8,9
D: 2,4,5,9
E: 1,3,6
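The conversion between the two layouts can be sketched in a few lines of Python (an illustrative sketch, not from the course material; names are my own). With the vertical layout, the support of an itemset is simply the size of the intersection of the tid-lists of its items.

```python
# Sketch: convert the horizontal layout above into the vertical
# (tid-list) layout; each item maps to the set of TIDs containing it.
from collections import defaultdict

horizontal = {
    1: {"A", "B", "E"}, 2: {"B", "C", "D"}, 3: {"C", "E"},
    4: {"A", "C", "D"}, 5: {"A", "B", "C", "D"}, 6: {"A", "E"},
    7: {"A", "B"}, 8: {"A", "B", "C"}, 9: {"A", "C", "D"}, 10: {"B"},
}

def to_vertical(db):
    """Map each item to the set of TIDs whose transaction contains it."""
    vertical = defaultdict(set)
    for tid, items in db.items():
        for item in items:
            vertical[item].add(tid)
    return dict(vertical)

vertical = to_vertical(horizontal)
# Support of {A, B} is the size of the intersection of the two tid-lists.
support_AB = len(vertical["A"] & vertical["B"])
```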
Characteristics of the Apriori algorithm
Breadth-first search: all frequent itemsets of a given size are kept in the algorithm's processing queue.
General-to-specific search: start with itemsets of large support and work towards the lower-support region.
Generate-and-test strategy: generate candidate itemsets, then test them by support counting.
(Figure: the itemset lattice over items A-E, from the single items down to ABCDE.)
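The level-wise generate-and-test strategy can be sketched as follows (a minimal illustration, not the course's reference implementation; the database is the ten-transaction example used in the FP-tree slides later on):

```python
# Minimal level-wise Apriori sketch: candidates of size k+1 are joined
# from frequent k-itemsets, then tested by one counting pass per level.
from itertools import combinations

def apriori(transactions, minsup):
    """Return {frozenset(itemset): support} for all frequent itemsets."""
    items = sorted({i for t in transactions for i in t})
    level = {frozenset([i]) for i in items}  # size-1 candidates
    frequent = {}
    k = 1
    while level:
        # Test step: one pass over the database per level.
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        survivors = {c: n for c, n in counts.items() if n >= minsup}
        frequent.update(survivors)
        # Generate step: join frequent k-itemsets, then prune any
        # candidate with an infrequent k-subset (the Apriori principle).
        level = set()
        for a, b in combinations(list(survivors), 2):
            cand = a | b
            if len(cand) == k + 1 and all(
                frozenset(s) in survivors for s in combinations(cand, k)
            ):
                level.add(cand)
        k += 1
    return frequent

db = [set("AB"), set("BCD"), set("ACDE"), set("ADE"), set("ABC"),
      set("ABCD"), set("A"), set("ABC"), set("ABD"), set("BCE")]
freq = apriori(db, minsup=2)
```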
Weaknesses of Apriori
Apriori was one of the first algorithms to successfully tackle the exponential size of the frequent itemset space. Nevertheless, Apriori suffers from two main weaknesses:
High I/O overhead from the generate-and-test strategy: several passes over the database are required to find the frequent itemsets.
Performance can degrade significantly on dense databases, as a large portion of the itemset lattice becomes frequent.
FP-growth algorithm
FP-growth mines frequent patterns without candidate generation, using a frequent-pattern (FP) tree.
FP-growth avoids Apriori's repeated database scans by working on a compressed representation of the transaction database, a data structure called the FP-tree. Once the FP-tree has been constructed, a recursive divide-and-conquer approach mines the frequent itemsets from it.
The FP-tree is a compressed representation of the transaction database:
- Each transaction is mapped onto a path in the tree.
- Each node holds an item and a support count: the number of transactions whose prefix corresponds to the path from the root to that node.
- Nodes with the same item label are cross-linked; these links help in finding all frequent itemsets ending with a particular item.
FP-tree construction
TID  Items
1    {A,B}
2    {B,C,D}
3    {A,C,D,E}
4    {A,D,E}
5    {A,B,C}
6    {A,B,C,D}
7    {A}
8    {A,B,C}
9    {A,B,D}
10   {B,C,E}
The tree is built incrementally, one transaction at a time, with the items of each transaction inserted in decreasing support order (A, B, C, D, E). After reading TID=1 the tree is the single path A:1 -> B:1; after TID=2 a second branch B:1 -> C:1 -> D:1 hangs off the root. Each later transaction either follows an existing prefix path, incrementing the counts along it, or branches off at the point where it diverges.
(Figure: snapshots of the FP-tree after reading each of TID=1 through TID=10.)
FP-tree construction: the complete tree
After all ten transactions, the root has two children, A:8 and B:2. Under A:8: B:5 (with children C:3 -> D:1 and D:1), C:1 -> D:1 -> E:1, and D:1 -> E:1. Under B:2: C:2, with children D:1 and E:1.
A header table lists each item (A, B, C, D, E) with a pointer to the first node carrying that item; nodes with the same item label are chained together. These pointers are used to assist frequent itemset generation.
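The construction can be sketched in Python (an illustrative sketch; the class and variable names are my own, not the textbook's), inserting items in the fixed decreasing-support order A > B > C > D > E and chaining same-item nodes in a header table:

```python
# Sketch of FP-tree construction for the example transaction database.
class Node:
    def __init__(self, item, parent):
        self.item, self.count, self.parent = item, 1, parent
        self.children = {}

def build_fptree(transactions, order):
    root = Node(None, None)
    links = {}  # header table: item -> list of nodes (the cross-links)
    for t in transactions:
        node = root
        for item in sorted(t, key=order.index):
            if item in node.children:
                node.children[item].count += 1  # shared prefix path
            else:
                node.children[item] = Node(item, node)  # new branch
                links.setdefault(item, []).append(node.children[item])
            node = node.children[item]
    return root, links

db = [set("AB"), set("BCD"), set("ACDE"), set("ADE"), set("ABC"),
      set("ABCD"), set("A"), set("ABC"), set("ABD"), set("BCE")]
root, links = build_fptree(db, order="ABCDE")
# sigma(E) is the sum of the counts along E's cross-links.
sigma_E = sum(n.count for n in links["E"])
```

The root's children come out as A:8 and B:2, matching the figure.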
Observations about the FP-tree
The size of the FP-tree depends on how the items are ordered. In the previous example, if the ordering is done in increasing support order instead, the resulting FP-tree is different and, for this example, denser (wider): the branching factor at the root increases from 2 to 5, as shown on the next slide. Also, ordering by decreasing support count doesn't always lead to the smallest tree.
FP-tree size
The size of an FP-tree is typically smaller than that of the uncompressed data because many transactions often share items.
Best-case scenario: all transactions contain the same set of items, and the FP-tree consists of a single branch of nodes.
Worst-case scenario: every transaction has a unique set of items; since no transactions share any items, the FP-tree is effectively the same size as the original data.
The size of the FP-tree also depends on how the items are ordered. If the ordering scheme in the preceding example is reversed, i.e., from lowest to highest support, the resulting FP-tree is probably denser (shown on the next slide). Not always, though: the ordering is just a heuristic.
FP-tree vs. the original database
If the transactions share a significant number of items, the FP-tree can be considerably smaller, as common subsets of items are likely to share paths. There is a storage overhead from the node links as well as from the support counts, so in the worst case the FP-tree may even be larger than the original database.
Frequent itemset generation in FP-growth
The algorithm generates frequent itemsets from the FP-tree by traversing it in a bottom-up fashion: it extracts the frequent itemsets ending in E first, then those ending in D, C, B and A. Since every transaction is mapped onto a single path in the FP-tree, the frequent itemsets ending in, say, E can be found by examining only the paths containing a node labeled E.
Mining frequent patterns using the FP-tree
General idea (divide-and-conquer):
- Recursively grow frequent patterns: mine shorter patterns recursively, then concatenate the suffix.
- For each frequent item, construct its conditional pattern base and then its conditional FP-tree.
- Repeat the process on each newly created conditional FP-tree until the resulting FP-tree is empty.
Major steps of the FP-growth algorithm
Step 1: Construct the conditional pattern base for each item in the header table.
- Start at the bottom of the frequent-item header table of the FP-tree.
- Traverse the FP-tree by following the node links of each frequent item.
- Accumulate all transformed prefix paths of that item to form its conditional pattern base.
Step 2: Construct the conditional FP-tree from each conditional pattern base.
- For each pattern base, accumulate the count of each item in the base.
- Construct the conditional FP-tree over the frequent items of the pattern base.
Step 3: Recursively mine the conditional FP-trees and grow the frequent patterns obtained so far.
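Step 1 can be sketched as follows (illustrative code with my own names: it rebuilds the FP-tree for the example database, then follows an item's node links and walks up to the root to collect its conditional pattern base):

```python
# Sketch of Step 1: collect the conditional pattern base of an item
# by following its cross-links and walking each node up to the root.
class Node:
    def __init__(self, item, parent):
        self.item, self.count, self.parent = item, 1, parent
        self.children = {}

def build_fptree(transactions, order):
    root, links = Node(None, None), {}
    for t in transactions:
        node = root
        for item in sorted(t, key=order.index):
            if item in node.children:
                node.children[item].count += 1
            else:
                node.children[item] = Node(item, node)
                links.setdefault(item, []).append(node.children[item])
            node = node.children[item]
    return root, links

def conditional_pattern_base(links, item):
    """Each prefix path of `item`, weighted by the item node's count."""
    base = []
    for node in links[item]:
        path, p = [], node.parent
        while p.item is not None:  # stop at the root
            path.append(p.item)
            p = p.parent
        base.append((frozenset(path), node.count))
    return base

db = [set("AB"), set("BCD"), set("ACDE"), set("ADE"), set("ABC"),
      set("ABCD"), set("A"), set("ABC"), set("ABD"), set("BCE")]
root, links = build_fptree(db, order="ABCDE")
base_E = conditional_pattern_base(links, "E")
```

For E this produces the three weighted prefix paths (A,C,D):1, (A,D):1 and (B,C):1.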
Frequent itemset generation in FP-growth
FP-growth uses a divide-and-conquer approach to find frequent itemsets. It searches for frequent itemsets ending with item E first, then itemsets ending with D, C, B and A; that is, it uses equivalence classes based on length-1 suffixes. The paths corresponding to the different suffixes are extracted from the FP-tree.
Frequent itemset generation in FP-growth
To find all frequent itemsets ending with a given last item (e.g. E), first compute the support of that item: the sum of the support counts of all nodes labeled with the item (here σ(E) = 3), found by following the cross-links connecting the nodes with the same label. If the last item is frequent, FP-growth next iteratively looks for all frequent itemsets ending with a given length-2 suffix (DE, CE, BE and AE), then recursively with length-3, length-4, ... suffixes, until no more frequent itemsets are found. A conditional FP-tree is constructed for each suffix to speed up the computation.
Frequent itemset generation in FP-growth
Paths containing node E: A:8 -> C:1 -> D:1 -> E:1, A:8 -> D:1 -> E:1, and B:2 -> C:2 -> E:1.
(Figure: the FP-tree with only the paths ending in E retained.)
Frequent itemset generation in FP-growth
Paths containing node D: A:8 -> B:5 -> C:3 -> D:1, A:8 -> B:5 -> D:1, A:8 -> C:1 -> D:1, A:8 -> D:1, and B:2 -> C:2 -> D:1.
(Figure: the FP-tree with only the paths ending in D retained.)
Frequent itemset generation in FP-growth
Paths containing node C: A:8 -> B:5 -> C:3, A:8 -> C:1, and B:2 -> C:2.
Paths containing node B: A:8 -> B:5 and B:2.
Paths containing node A: A:8.
(Figure: the corresponding subtrees of the FP-tree.)
Frequent itemset generation for paths ending in E
Prefix paths ending in E: A:8 -> C:1 -> D:1 -> E:1, A:8 -> D:1 -> E:1, and B:2 -> C:2 -> E:1.
Conditional pattern base for E: P = {(A,C,D):1, (A,D):1, (B,C):1}.
Conditional FP-tree for E: B is pruned (its count in the base, 1, is below the support threshold), leaving A:2 (with children C:1 -> D:1 and D:1) and C:1.
Recursively applying FP-growth on P yields the frequent itemsets (with sup > 1): E, DE, ADE, CE, AE.
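The pruning in Step 2 can be illustrated on E's conditional pattern base (a small sketch with my own variable names): accumulate each item's count over the weighted paths and drop the items below the support threshold before building the conditional FP-tree.

```python
# Sketch of Step 2 for suffix E: item counts over E's conditional
# pattern base; B has count 1 < minsup and is pruned from the base.
from collections import Counter

base_E = [({"A", "C", "D"}, 1), ({"A", "D"}, 1), ({"B", "C"}, 1)]
minsup = 2

counts = Counter()
for path, n in base_E:
    for item in path:
        counts[item] += n

kept = {i for i, n in counts.items() if n >= minsup}
pruned_base = [(path & kept, n) for path, n in base_E]
```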
Frequent itemset generation for paths ending in E
Prefix paths ending in DE (within E's conditional FP-tree): (A,C):1 and (A):1.
Conditional FP-tree for DE: C is pruned (count 1), leaving A:2. Hence DE and ADE are frequent.
Frequent itemset generation for paths ending in E
Prefix paths ending in CE: (A):1 and ():1.
Conditional FP-tree for CE: A has count 1 and is pruned, so the tree is empty; only CE itself is frequent.
Frequent itemset generation for paths ending in E
Since B was already pruned from E's conditional FP-tree, the suffix BE need not be considered, and only AE is left:
Prefix paths ending in AE: A:2. The conditional FP-tree for AE is empty; AE is frequent with support 2.
Is FP-growth faster than Apriori?
As the support threshold goes down, the number of frequent itemsets increases dramatically, and FP-growth benefits from not having to generate candidates and test them.
(Figure: runtime of FP-growth vs. Apriori as the support threshold decreases.)
Is FP-growth faster than Apriori?
Both FP-growth and Apriori scale linearly with the number of transactions, but FP-growth is more efficient.
(Figure: runtime vs. number of transactions.)
Frequent itemset generation for paths ending in D
Prefix paths ending in D: A:8 -> B:5 -> C:3 -> D:1, A:8 -> B:5 -> D:1, A:8 -> C:1 -> D:1, A:8 -> D:1, and B:2 -> C:2 -> D:1.
Conditional pattern base for D: P = {(A,B,C):1, (A,B):1, (A,C):1, (A):1, (B,C):1}.
Conditional FP-tree for D: A:4 (with children B:2 -> C:1 and C:1) and B:1 -> C:1.
Recursively applying FP-growth on P yields the frequent itemsets (sup > 1): D, CD, BCD, ACD, BD, ABD, AD.
Frequent itemset generation for paths ending in D
Prefix paths ending in CD (within D's conditional FP-tree): (A,B):1, (A):1, and (B):1.
Conditional FP-tree for CD: A:2 -> B:1 and B:1. This yields the frequent itemsets CD, BCD and ACD.
Frequent itemset generation for paths ending in D
Prefix paths ending in BCD: (A):1 and ():1.
Conditional FP-tree for BCD: A has count 1 and is pruned, so the tree is empty; BCD itself is frequent with support 2.
Frequent itemset generation for paths ending in D
Prefix paths ending in ACD: A:2. The conditional FP-tree for ACD is empty; ACD is frequent with support 2.
Frequent itemset generation for paths ending in D
Prefix paths ending in BD: (A):2 and ():1.
Conditional FP-tree for BD: A:2. BD is frequent with support 3, and ABD with support 2.
Frequent itemset generation for paths ending in D
Prefix paths ending in AD: A:4. The conditional FP-tree for AD is empty; AD is frequent with support 4.
Frequent itemset generation in FP-growth
(Figures: summary of the suffix-by-suffix generation of frequent itemsets.)
The tree projection algorithm
Generates frequent itemsets by successively constructing the nodes of a lexicographic tree of itemsets.
(Fig. 1: the lexicographic tree.)