EFIM: A Fast and Memory Efficient Algorithm for High-Utility Itemset Mining

Size: px

Start display at page:

Download "EFIM: A Fast and Memory Efficient Algorithm for High-Utility Itemset Mining"

Clara Holt
5 years ago
Views:

1 EFIM: A Fast and Memory Efficient Algorithm for High-Utility Itemset Mining 1

2 High-utility itemset mining Input a transaction database a unit profit table minutil: a minimum utility threshold set by the user (a positive integer) 2

3 High-utility itemset mining Input a transaction database a unit profit table minutil: a minimum utility threshold set by the user (a positive integer) Output All high-utility itemsets (itemsets having a utility minutil) For example, if minutil = 33$, the high-utility itemsets are: {b,d,e} 36$ 2 transactions {b,c,d,e} 40$ 2 transactions {b,c,d} 34$ 2 transactions {b,c,e} 37 $ 3 transactions 3

4 Input a transaction database Utility calculation a unit profit table The utility of the itemset {b,d,e} is calculated as follows: u({b,d,e}) = (5x2)+(3x2)+(3x1) + (4x2)+(2x3)+(1x3) = 36$ utility in transaction T 1 utility in transaction T 2 Challenge: utility is not anti-monotonic 4

5 Contribution A new algorithm named EFIM Three main ideas: High-utility Database Projection High-utility Transaction Merging Utility-Bin Array to calculate utility and upperbounds 5

6 The EFIM algorithm An algorithm for mining high utility-itemsets It performs a depth-first search. It applies pruning strategies to prune the search space based on upper-bounds on the utility. 6

7 7

8 Database projection Original database Projected database of {c} EFIM is a pattern-growth algorithm It scans the horizontal database to calculate the utility and upper-bound Using projected databases reduce the cost of database scans. To reduce memory usage, EFIM performs pseudo-projections. 8

9 Database merging Original database To further reduce the database, we can merge transactions: - In the original database - In each projected database Projected database of {c} Merged database of {c} 9

10 Optimization 1 - merging An idea: merge identical transactions to reduce the size of the database. This can reduce the size of the database and the cost of reading the database. Items T1 {A,B,C,D,E,F} 1 Weight Items T2 {A,B,C,D,E,F} 1 TX {A,B,C,D,E,F} 2 T3 {C,D,E, F} 1 T3 {C,D,E, F} 1 T4 {B, E, F} 1 T4 {B, E, F} 1 If we apply transaction merging only on the original database, this optimization will not make the algorithm much faster. Weight 10

11 Optimization 1 merging A better idea Merge transactions avec each projection. For example, after having projected the database with an item D: Prefix = {D} Items Weight T1 {A,B,C,D,E,F} 1 T2 {A,B,C,D,E,F} 1 T3 {C,D,E, F} 1 T4 {B, E, F} 1 11

12 Optimization 1 merging A better idea Merge transactions avec each projection. For example, after having projected the database with an item D: Prefix = {D} Items Weight T1 {A,B,C,D,E,F} 1 T2 {A,B,C,D,E,F} 1 T3 {C,D,E, F} 1 Items T1 {E,F} 4 Weight T4 {B, E, F} 1 Good, but we must be able to implement transaction merging efficiently! 12

13 Optimization 1 merging Solution: Sort transactions by the total order when transactions are read backward. e.g. Items Weight T1 {A,B,C,D,E,F} 1 T2 {A,B,C,D,E,F} 1 T3 {C,D,E, F} 1 T4 {D, A, G} 1 T5 {B, C, A, G} 1 T6 {A, E, F} 1 T7 {B, C, F, G} 1 T8 {A, C, D, F, G} 1 sort Items Weight T6 {A, E, F} 1 T3 {C,D,E, F} 1 T1 {A,B,C,D,E,F} 1 T2 {A,B,C,D,E,F} 1 T5 {B, C, A, G} 1 T4 {D, A, G} 1 T7 {B, C, F, G} 1 T8 {A, C, D, F, G} 1 13

14 Optimization 1 merging Solution: Sort transactions by the total order when transactions are read backward. e.g. Property: After the sort, no matter where we «cut» transactions, transactions to be merged will always appear one after the other. e.g. consider prefix = {a} Items Weight T6 {A, E, F} 1 T3 {C,D,E, F} 1 T1 {A,B,C,D,E,F} 1 T2 {A,B,C,D,E,F} 1 T5 {B, C, A, G} 1 T4 {D, A, G} 1 T7 {B, C, F, G} 1 T8 {A, C, D, F, G} 1 14

15 Optimization 1 merging Solution: Sort transactions by the total order, but by reading the transactions backwards. e.g. Property: After the sort, no matter where we «cut» transactions, transactions to be merged will always appear one after the other. e.g. consider prefix = {a} Items Weight T6 {A, E, F} 1 T3 {C,D,E, F} 1 T1 {A,B,C,D,E,F} 1 T2 {A,B,C,D,E,F} 1 T5 {B, C, A, G} 1 T4 {D, A, G} 1 T7 {B, C, F, G} 1 T8 {A, C, D, F, G} 1 15

16 Optimization 1 merging Solution: Sort transactions by the total order, but by reading the transactions backwards. e.g. Property: After the sort, no matter where we «cut» transactions, transactions to be merged will always appear one after the other. Items Weight T6 {A, E, F} 1 TX {A,B,C,D,E,F} 2 TY {B, C, A, G} 2 T8 {A, C, D, F, G} 1 e.g. consider prefix = {a} 16

17 17

18 18

19 Pruning the search space using the utility Merged database of {c} Sub-tree utility upper-bound: An upper-bound on the utility of itemsets that can be obtained starting with {c} is the utility of {c} plus the profit of items appearing after {c} that can be appended to {c}. 19

20 Upper-bound local utility 20

21 Upper-bound subtree utility 21

22 22

23 Redefined sub-tree utility 23

24 Redefined sub-tree utility 24

25 Counting upper-bounds, utility and Merged database of {c} support using arrays Utility a b c d e f g Sub-tree upper-bound Support a b c d e f g a b c d e f g

26 26

27 EFIM 27

28 EFIM (2) 28

29 Dataset Experimental Evaluation Datasets characterictics transaction count distinct item count Accidents 340, BMS 59, Chess 3, Connect 67, Foodmart 4,141 1,559 1,559 Mushroom 8, average transaction length Foodmart is a real-life transaction datasets from retail stores. External and internal utility values have been generated in the [1, 000] and [1, 5] intervals using a log-normal distribution 29

30 Experimental Evaluation We compared the performance of EFIM-Closed with the state-of-the-art CHUD algorithm two versions of EFIM-Closed without optimizations named EFIM-Closed(lu) and EFIM-Closed(nop) We varied the minutil threshold and compared execution time and memory usage Computer with 12 GB of RAM, Java, Windows 7, 64 bit Core i5 Processor 30

31 31

32 32

33 33

34 34

35 Thank you. Questions? Open source Java data mining software, 120 algorithms 35

36 EFIM-CLOSED 36

37 What is a closed high utility itemset? It is a high-utility itemset that has no proper superset having the same support (frequency). For example: {b,d,e} 36$ 2 transactions {b,c,d,e} 40$ 2 transactions {b,c,d} 34$ 2 transactions {b,c,e} 37 $ 3 transactions Interesting properties: A closed itemset is the largest group of items bought by a groups of customers. A closed pattern is always more profitable than its corresponding non closed patterns. 37

38 The EFIM-Closed algorithm An algorithm for mining closed high utility-itemsets It performs a depth-first search. It applies pruning strategies to prune the search space based on upper-bounds on the utility. 38

39 Forward/Backward extension checking To determine if an itemset X is closed, check if there exists an item not in X that appears in all transactions where X appears. If yes, then X is not closed e.g. {b,d,e} is not closed because of item c 39

40 Closure jumping For an itemset X, if all items not in X that can be appended to X have the same support as X, they can be directly appended to X to obtain a closed itemset. TID Items T1 {a,b,c, d} T2 {a,b,c, d} T3 {b, d} 40

41 Pseudocode 41

42 Accident Execution times BMS Chess Connect Foodmart Mushroom EFIM-Closed is up to 71 times faster than CHUD 42

43 Maximum Memory usage (MB) Dataset EFIM-Closed CHUD Accidents 895 2,603 BMS Chess Connect 385 1,504 Foodmart Mushroom 71 1,308 EFIM-Closed consumes up to 18 times less memory 43

44 Number of visited nodes Dataset EFIM-Closed CHUD Accidents 1,341 29,932 BMS 7 27 Chess 348,633 7,759,252 Connect 19, ,059 Foodmart 6,680 6,680 Mushroom 8,017 17,621 EFIM-Closed is generally more effective at pruning the search space. On Chess, 22 times less nodes are visited by EFIM-Closed 44

45 Contribution: Conclusion New algorithm for mining closed high utility itemsets named EFIM-Closed Experimental results: EFIM-Closed is up to 71 times faster and consumes up to 18 times less memory than the state-of-the-art CHUD algorithm Source code and datasets available as part of the SPMF data mining library (GPL 3). Open source Java data mining software, 120 algorithms 45

PFPM: Discovering Periodic Frequent Patterns with Novel Periodicity Measures

PFPM: Discovering Periodic Frequent Patterns with Novel Periodicity Measures 1 Introduction Frequent itemset mining is a popular data mining task. It consists of discovering sets of items (itemsets) frequently