DATA MINING II - 1DL460


1 DATA MINING II - 1DL460, Spring 2014
"A second class in data mining"
Kjell Orsborn, Uppsala Database Laboratory, Department of Information Technology, Uppsala University, Uppsala, Sweden
04/05/14 1

2 Data Mining: Alternative Association Analysis (Tan, Steinbach, Kumar ch. 6)
Kjell Orsborn, Department of Information Technology, Uppsala University, Uppsala, Sweden

3 Alternative methods for frequent itemset generation
Traversal of the itemset lattice, general-to-specific vs. specific-to-general:
[Figure: three itemset lattices over {a1, a2, ..., an}, showing the frequent itemset border under (a) general-to-specific, (b) specific-to-general, and (c) bidirectional traversal]

4 Alternative methods for frequent itemset generation
Traversal of the itemset lattice as prefix or suffix trees implies different equivalence classes:
- Left: prefix tree, with equivalence classes defined by prefixes of length k = 1
- Right: suffix tree, with equivalence classes defined by suffixes of length k = 1
[Figure: (a) prefix tree and (b) suffix tree over the itemsets of {A, B, C, D}]

5 Alternative methods for frequent itemset generation
Traversal of the itemset lattice, breadth-first vs. depth-first:
[Figure: (a) breadth-first and (b) depth-first traversal of the itemset lattice]

6 Alternative methods for frequent itemset generation
Representation of the database, horizontal vs. vertical data layout:
Horizontal data layout (TID: items): 1: A,B,E; 2: B,C,D; 3: C,E; 4: A,C,D; 5: A,B,C,D; 6: A,E; 7: A,B; 8: A,B,C; 9: A,C,D; 10: B
Vertical data layout (item: TID-list): A: 1,4,5,6,7,8,9; B: 1,2,5,7,8,10; C: 2,3,4,5,8,9; D: 2,4,5,9; E: 1,3,6
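The conversion between the two layouts can be sketched in a few lines. This is a minimal illustration of the idea above; the helper name `to_vertical` is illustrative, not from any particular library.

```python
# Convert the horizontal layout (TID -> items) into the vertical
# layout (item -> set of TIDs), using the ten transactions on the slide.
from collections import defaultdict

def to_vertical(horizontal):
    """Map each item to the set of TIDs of the transactions containing it."""
    vertical = defaultdict(set)
    for tid, items in horizontal.items():
        for item in items:
            vertical[item].add(tid)
    return dict(vertical)

db = {1: {"A", "B", "E"}, 2: {"B", "C", "D"}, 3: {"C", "E"},
      4: {"A", "C", "D"}, 5: {"A", "B", "C", "D"}, 6: {"A", "E"},
      7: {"A", "B"}, 8: {"A", "B", "C"}, 9: {"A", "C", "D"}, 10: {"B"}}

vertical = to_vertical(db)
print(sorted(vertical["A"]))   # TID-list of item A: [1, 4, 5, 6, 7, 8, 9]
```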

7 Characteristics of the Apriori algorithm
- Breadth-first search: all frequent itemsets of a given size are kept in the algorithm's processing queue
- General-to-specific search: start with itemsets of large support and work towards the lower-support region
- Generate-and-test strategy: generate candidate itemsets, then test them by support counting
[Figure: the full itemset lattice over {A, B, C, D, E}]

8 Weaknesses of Apriori
Apriori was one of the first algorithms to successfully tackle the exponential size of the frequent itemset space. Nevertheless, Apriori suffers from two main weaknesses:
- High I/O overhead from the generate-and-test strategy: several passes over the database are required to find the frequent itemsets
- Performance can degrade significantly on dense databases, as a large portion of the itemset lattice becomes frequent

9 FP-growth algorithm
FP-growth mines frequent patterns without candidate generation, using a frequent-pattern (FP) tree:
- FP-growth avoids Apriori's repeated database scans by working on a compressed representation of the transaction database, a data structure called the FP-tree
- Once the FP-tree has been constructed, a recursive divide-and-conquer approach mines the frequent itemsets
- Each transaction is mapped onto a path in the tree
- Each node contains an item and a support count: the number of transactions whose prefix corresponds to the path from the root to that node
- Nodes with the same item label are cross-linked, which helps in finding the frequent itemsets ending with a particular item

10 FP-tree construction
TID Items: 1 {A,B}; 2 {B,C,D}; 3 {A,C,D,E}; 4 {A,D,E}; 5 {A,B,C}; 6 {A,B,C,D}; 7 {A}; 8 {A,B,C}; 9 {A,B,D}; 10 {B,C,E}
After reading TID=1: path A:1 → B:1
After reading TID=2: a second branch B:1 → C:1 → D:1 is added under the root

11 FP-tree construction
After reading TID=3: A:2, with a new path A → C:1 → D:1 → E:1

12 FP-tree construction
After reading TID=4: A:3, with a new path A → D:1 → E:1

13 FP-tree construction
After reading TID=5: A:4 and B:2 on the A→B path, with a new node C:1 under that B

14 FP-tree construction
After reading TID=6: A:5, B:3, C:2 on the A→B→C path, with a new node D:1 under that C

15 FP-tree construction
After reading TID=7: A:6

16 FP-tree construction
After reading TID=8: A:7, B:4, C:3 on the A→B→C path

17 FP-tree construction
After reading TID=9: A:8 and B:5 on the A→B path, with a new node D:1 under that B

18 FP-tree construction
After reading TID=10: B:2 and C:2 on the root→B path, with a new node E:1 under that C

19 FP-tree construction
The completed FP-tree, with a header table (items A, B, C, D, E) whose pointers cross-link the nodes carrying the same item. The pointers are used to assist frequent itemset generation.
[Figure: the final FP-tree, with root branches A:8 (B:5 (C:3 (D:1), D:1), C:1 (D:1 (E:1)), D:1 (E:1)) and B:2 (C:2 (D:1, E:1))]
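The construction above can be sketched compactly. This is a minimal sketch, assuming items are already ordered by decreasing support (A > B > C > D > E); the class and function names are illustrative, and node cross-links are omitted for brevity.

```python
# Build an FP-tree for the ten transactions on the slides: each
# transaction is sorted by the global item order and inserted as a path,
# incrementing the support count of every node it passes through.
class FPNode:
    def __init__(self, item):
        self.item = item
        self.count = 0
        self.children = {}          # item label -> child FPNode

def build_fp_tree(transactions, order):
    root = FPNode(None)
    for t in transactions:
        node = root
        for item in sorted(t, key=order.index):   # global frequency order
            node = node.children.setdefault(item, FPNode(item))
            node.count += 1
    return root

db = [{"A","B"}, {"B","C","D"}, {"A","C","D","E"}, {"A","D","E"},
      {"A","B","C"}, {"A","B","C","D"}, {"A"}, {"A","B","C"},
      {"A","B","D"}, {"B","C","E"}]
root = build_fp_tree(db, ["A", "B", "C", "D", "E"])
print(root.children["A"].count)                  # 8
print(root.children["A"].children["B"].count)    # 5
```

The printed counts match the final tree on the slide: A:8 with B:5 below it, and B:2 on the second root branch.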

20 Observations about the FP-tree
The size of the FP-tree depends on how the items are ordered. In the previous example, if the ordering is done in increasing order of support, the resulting FP-tree will be different and, for this example, denser (wider): at the root node the branching factor increases from 2 to 5, as shown on the next slide. Also, ordering by decreasing support count doesn't always lead to the smallest tree.

21 FP-tree size
The size of an FP-tree is typically smaller than the size of the uncompressed data, because many transactions often share items in common.
- Best case: all transactions have the same set of items, and the FP-tree contains only a single branch of nodes.
- Worst case: every transaction has a unique set of items. As no transactions have any items in common, the size of the FP-tree is effectively the same as the size of the original data.
The size of an FP-tree also depends on how the items are ordered. If the ordering scheme in the preceding example is reversed, i.e., from lowest to highest support, the resulting FP-tree is probably denser (shown on the next slide). Not always, though; the ordering is just a heuristic.

22 FP-tree vs. original database
If the transactions share a significant number of items, the FP-tree can be considerably smaller, as the common subsets of items are likely to share paths. There is a storage overhead from the links as well as from the support counts, so in the worst case the FP-tree may even be larger than the original database.

23 Frequent itemset generation in FP-growth
The algorithm generates frequent itemsets from the FP-tree by traversing it in a bottom-up fashion: it extracts frequent itemsets ending in E first, and then those ending in D, C, B and A. As every transaction is mapped onto a single path in the FP-tree, frequent itemsets ending in, say, E can be found by investigating only the paths containing node E.

24 Mining frequent patterns using the FP-tree
General idea (divide-and-conquer):
- Recursively grow frequent patterns using the FP-tree: look for shorter patterns recursively and then concatenate the suffix
- For each frequent item, construct its conditional pattern base and then its conditional FP-tree
- Repeat the process on each newly created conditional FP-tree until the resulting FP-tree is empty

25 Major steps of the FP-growth algorithm
Step 1: Construct the conditional pattern base for each item in the header table:
- Start at the bottom of the frequent-item header table of the FP-tree
- Traverse the FP-tree by following the links of each frequent item
- Accumulate all transformed prefix paths of that item to form its conditional pattern base
Step 2: Construct the conditional FP-tree from each conditional pattern base:
- Accumulate the count for each item in the base
- Construct the conditional FP-tree for the frequent items of the pattern base
Step 3: Recursively mine the conditional FP-trees and grow the frequent patterns obtained so far.
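The three steps above can be sketched with a simplification: instead of an actual FP-tree, the conditional pattern bases are kept as explicit lists of prefix paths, which makes the recursion easy to follow. This is an illustrative sketch of the pattern-growth idea, not the tree-based implementation; the function name `fp_growth` and the path representation are assumptions.

```python
# Pattern growth over conditional pattern bases, on the slides' example.
# Each path is (items_tuple, count) with items in global frequency order.
from collections import Counter

def fp_growth(paths, suffix, minsup, result):
    # Steps 1-2: count each item's support in the conditional pattern base.
    counts = Counter()
    for items, count in paths:
        for item in items:
            counts[item] += count
    for item, sup in counts.items():
        if sup < minsup:
            continue                       # prune infrequent items
        new_suffix = (item,) + suffix
        result[new_suffix] = sup
        # Conditional pattern base for the extended suffix: the part of
        # each path strictly before `item`.
        cond = [(items[:items.index(item)], count)
                for items, count in paths if item in items]
        # Step 3: recurse on the new conditional pattern base.
        fp_growth(cond, new_suffix, minsup, result)

db = [{"A","B"}, {"B","C","D"}, {"A","C","D","E"}, {"A","D","E"},
      {"A","B","C"}, {"A","B","C","D"}, {"A"}, {"A","B","C"},
      {"A","B","D"}, {"B","C","E"}]
order = ["A", "B", "C", "D", "E"]          # decreasing support order
paths = [(tuple(sorted(t, key=order.index)), 1) for t in db]
result = {}
fp_growth(paths, (), 2, result)
print(result[("A", "D", "E")])             # ADE has support 2
```

With min_sup = 2 this reproduces the results derived on the following slides, e.g. the itemsets ending in E are E:3, AE:2, CE:2, DE:2 and ADE:2, and 19 frequent itemsets are found in total.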

26 Frequent itemset generation in FP-growth
FP-growth uses a divide-and-conquer approach to find frequent itemsets. It searches for frequent itemsets ending with item E first, then itemsets ending with D, C, B, and A. That is, it uses equivalence classes based on length-1 suffixes. The paths corresponding to the different suffixes are extracted from the FP-tree.

27 Frequent itemset generation in FP-growth
To find all frequent itemsets ending with a given last item (e.g. E), we first need to compute the support of the item. This is given by the sum of the support counts of all nodes labeled with that item (σ(E) = 3), found by following the cross-links connecting the nodes with the same item. If the last item is frequent, FP-growth next iteratively looks for all frequent itemsets ending with a given length-2 suffix (DE, CE, BE, and AE), and recursively with length-3 suffixes, length-4 suffixes, and so on, until no more frequent itemsets are found. A conditional FP-tree is constructed for each suffix to speed up the computation.

28 Frequent itemset generation in FPG
Paths containing node E:
[Figure: the FP-tree paths A:8 → C:1 → D:1 → E:1, A:8 → D:1 → E:1, and B:2 → C:2 → E:1]

29 Frequent itemset generation in FPG
Paths containing node D:
[Figure: the FP-tree paths ending in a D node]

30 Frequent itemset generation in FPG
Paths containing node C, paths containing node B (A:8 → B:5 and B:2), and paths containing node A (A:8)
[Figure: the corresponding subtrees of the FP-tree]

31 Frequent itemset generation for paths ending in E
Prefix paths ending in E: (A:8, C:1, D:1, E:1), (A:8, D:1, E:1), (B:2, C:2, E:1). E:3 is frequent, thus recursion.
Conditional FP-tree for E (the infrequent B is pruned): branches A:2 → C:1 → D:1, A:2 → D:1, and C:1; header list D:2, C:2, A:2.
Conditional pattern base for E: P = {(A:1, C:1, D:1), (A:1, D:1), (B:1, C:1)}
Recursively applying FP-growth on P yields the frequent itemsets (with sup > 1): E, DE, ADE, CE, AE

32 Frequent itemset generation for paths ending in E
Prefix paths ending in DE: (A:2, C:1, D:1) and (A:2, D:1). DE:2 is frequent.
Conditional FP-tree for DE (C:1 is pruned): A:2; header list A:2. ADE:2 is frequent, thus end of the current subpath.

33 Frequent itemset generation for paths ending in E
Prefix paths ending in CE: (A:1, C:1) and (C:1). CE:2 is frequent.
Conditional FP-tree for CE (A:1 is pruned): empty header list, thus end of the current subpath.

34 Frequent itemset generation for paths ending in E
Note: since B has already been pruned, there is no need to cover that case, and only AE is left.
Prefix paths ending in AE: A:2. Conditional FP-tree for AE: empty header list. AE:2 is frequent, thus end of path E.
Conditional pattern base for E: P = {(A:1, C:1, D:1), (A:1, D:1), (B:1, C:1)}
Recursively applying FP-growth on P yields the frequent itemsets (with sup > 1): E, DE, ADE, CE, AE

35 Is FP-growth faster than Apriori?
As the support threshold goes down, the number of itemsets increases dramatically. FP-growth does not need to generate candidates and test them (Han et al., Mining Frequent Patterns without Candidate Generation, SIGMOD 2000).

36 Is FP-growth faster than Apriori?
Both FP-growth and Apriori scale linearly with the number of transactions. In the case studied by Han et al. (SIGMOD 2000), FP-growth is more efficient, but this cannot be generalized to all situations: the relative performance depends in general on the support threshold, the density of the data, and the data set size.

37 Frequent itemset generation for paths ending in D
Prefix paths ending in D: (A:8, B:5, C:3, D:1), (A:8, B:5, D:1), (A:8, C:1, D:1), (A:8, D:1), (B:2, C:2, D:1). D:5 is frequent, thus recursion.
Conditional FP-tree for D: branches A:4 (B:2 (C:1), C:1) and B:1 (C:1); header list C:3, B:3, A:4.
Conditional pattern base for D: P = {(A:1, B:1, C:1), (A:1, B:1), (A:1, C:1), (A:1), (B:1, C:1)}
Recursively applying FP-growth on P yields the frequent itemsets (sup > 1): D, CD, BCD, ACD, BD, ABD, AD

38 Frequent itemset generation for paths ending in D
Prefix paths ending in CD: (A:2, B:1, C:1), (A:2, C:1), (B:1, C:1). CD:3 is frequent, thus recursion.
Conditional FP-tree for CD: A:2 (B:1) and B:1; header list B:2, A:2.

39 Frequent itemset generation for paths ending in D
Prefix paths ending in BCD: (A:1, B:1) and (B:1). BCD:2 is frequent.
Conditional FP-tree for BCD (A:1 is pruned): empty header list, thus end of this subpath.

40 Frequent itemset generation for paths ending in D
Prefix paths ending in ACD: A:2. ACD:2 is frequent.
Conditional FP-tree for ACD: empty header list, thus end of this subpath.

41 Frequent itemset generation for paths ending in D
Prefix paths ending in BD: (A:2, B:2) and (B:1). BD:3 is frequent.
Conditional FP-tree for BD: A:2; header list A:2. ABD:2 is frequent, thus end of this subpath.

42 Frequent itemset generation for paths ending in D
Prefix paths ending in AD: A:4. AD:4 is frequent.
Conditional FP-tree for AD: empty header list, thus end of path D.
Conditional pattern base for D: P = {(A:1, B:1, C:1), (A:1, B:1), (A:1, C:1), (A:1), (B:1, C:1)}
Recursively applying FP-growth on P yields the frequent itemsets (sup > 1): D, CD, BCD, ACD, BD, ABD, AD

43 Frequent itemset generation for paths ending in C
Paths containing node C: (A:8, B:5, C:3), (A:8, C:1), (B:2, C:2). C:6 is frequent, thus recursion.
Conditional FP-tree for C (*): B:5 (A:3) and A:1; header list B:5, A:4.
(*) The conditional tree is restructured according to its own header list (B:5 before A:4).
Conditional pattern base for C: P = {(A:3, B:3), (A:1), (B:2)}
Recursively applying FP-growth on P yields the frequent itemsets (sup > 1): C:6, AC:4, BAC:3, BC:5

44 Frequent itemset generation for paths ending in C
Prefix paths ending in AC: (B:3, A:3) and (A:1). AC:4 is frequent.
Conditional FP-tree for AC: B:3. BAC:3 is frequent, thus end of this subpath.

45 Frequent itemset generation for paths ending in C
Prefix paths ending in BC: B:5. BC:5 is frequent.
Conditional FP-tree for BC: empty header list, thus end of path C.
Conditional pattern base for C: P = {(A:3, B:3), (A:1), (B:2)}
Recursively applying FP-growth on P yields the frequent itemsets (sup > 1): C:6, AC:4, BAC:3, BC:5

46 Frequent itemset generation for paths ending in B
Prefix paths ending in B: (A:8, B:5) and (B:2). B:7 is frequent.
Conditional FP-tree for B: A:5; header list A:5. AB:5 is frequent, thus end of path B.
Conditional pattern base for B: P = {(A:5)}
Recursively applying FP-growth on P yields the frequent itemsets (sup > 1): B:7, AB:5

47 Frequent itemset generation for paths ending in A
Prefix paths ending in A: A:8. A:8 is frequent, thus end of path A.
Conditional FP-tree for A: empty header list.
Conditional pattern base for A: P = {}
Recursively applying FP-growth on P yields the frequent itemsets (sup > 1): A:8

48 Frequent itemset generation in FPG
[Figure]

49 Frequent itemset generation in FPG
[Figure]

50 The tree projection algorithm
Generation of frequent itemsets by successive construction of the nodes of a lexicographic tree of itemsets.
FIG. 1. The lexicographic tree.

51 The tree projection algorithm cont.
Agarwal R, Aggarwal CC, Prasad VVV (2001): A tree projection algorithm for generation of frequent itemsets. Journal of Parallel and Distributed Computing 61.
An alternative to the FP-growth algorithm that can apply different strategies for generating and traversing the lexicographic tree, including breadth-first search, depth-first search, and a combination of both. The innovation brought by this algorithm is the use of a lexicographic tree, which requires substantially less memory than a hash tree. The support of the frequent itemsets is counted by projecting the transactions onto the nodes of this tree. Methods for parallelizing TreeProjection are also suggested.

52 Tree projection - the lexicographic tree
Possible extensions: E(A) = {B,C,D,E}; E(ABC) = {D,E}
[Figure: the lexicographic tree over {A, B, C, D, E}, with the possible extension sets annotated at nodes A and ABC]

53 Tree projection - the lexicographic tree cont.
Items are listed in lexicographic order. At each node, the following information is stored:
- The itemset P at that node
- The set of possible lexicographic tree extensions E(P) of that node
- A pointer to the projected database of its ancestor node
- A bitvector recording which transactions in the projected database contain the itemset

54 Projected database
Original database (TID: items): 1 {A,B}; 2 {B,C,D}; 3 {A,C,D,E}; 4 {A,D,E}; 5 {A,B,C}; 6 {A,B,C,D}; 7 {B,C}; 8 {A,B,C}; 9 {A,B,D}; 10 {B,C,E}
Projected database for node A (TID: items): 1 {B}; 2 {}; 3 {C,D,E}; 4 {D,E}; 5 {B,C}; 6 {B,C,D}; 7 {}; 8 {B,C}; 9 {B,D}; 10 {}
For each transaction T, the projected transaction at node A is T ∩ E(A).
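The projection rule T ∩ E(A) can be sketched directly. This is a minimal illustration of the definition above; the function name `project` is illustrative, and unlike the slide's table, transactions not containing A are dropped rather than listed with an empty projection.

```python
# Project the transactions that contain a given item onto the node's
# possible-extension set, as in T ∩ E(A) on the slide.
def project(db, item, extensions):
    """Keep transactions containing `item`, intersected with `extensions`."""
    return {tid: items & extensions
            for tid, items in db.items() if item in items}

db = {1: {"A","B"}, 2: {"B","C","D"}, 3: {"A","C","D","E"},
      4: {"A","D","E"}, 5: {"A","B","C"}, 6: {"A","B","C","D"},
      7: {"B","C"}, 8: {"A","B","C"}, 9: {"A","B","D"}, 10: {"B","C","E"}}
E_A = {"B", "C", "D", "E"}                 # possible extensions of node A
print(sorted(project(db, "A", E_A)[3]))    # TID 3 projects to C, D, E
```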

55 Creation of the lexicographic tree
FIG. 3. Breadth-first creation of the lexicographic tree.

56 Support counting using matrices
At the null node (level 0) with level-1 children A-F and extension sets E(A) = {B,C,D,E,F}, E(B) = {C,D,E,F}, E(C) = {D,E,F}, E(D) = {E,F}, E(E) = {F}, E(F) = {}, a matrix for counting the supports of the candidate level-2 nodes is maintained at the node. For the four transactions ACDF, ABCDEF, BDE and CDEF, the matrix counts are:
AB(1) AC(2) BC(1) AD(2) BD(2) CD(3) AE(1) BE(2) CE(2) DE(3) AF(2) BF(1) CF(3) DF(3) EF(2)

57 Equivalence CLAss Transformation (ECLAT) algorithm
The ECLAT algorithm explores the vertical data format:
- The first scan of the database builds the TID_set of each single item.
- Starting with single items (k = 1), the frequent (k+1)-itemsets grown from a previous k-itemset can be generated according to the Apriori property, in a depth-first computation order (variants using hybrid search exist as well).
- The computation intersects the TID_sets of the frequent k-itemsets to compute the TID_sets of the corresponding (k+1)-itemsets.
- This process repeats until no frequent itemsets or no candidate itemsets can be found.
Besides taking advantage of the Apriori property in the generation of candidate (k+1)-itemsets from frequent k-itemsets, another merit of this method is that there is no need to scan the database to find the support of (k+1)-itemsets (for k ≥ 1), because the TID_set of each k-itemset carries the complete information required for counting that support.

58 ECLAT algorithm cont.
According to Zaki MJ (2000), the original ECLAT algorithm uses prefix-based equivalence classes and was evaluated applying bottom-up search in a breadth-first traversal manner.
Zaki MJ (2000): Scalable algorithms for association mining. IEEE Transactions on Knowledge and Data Engineering 12.

59 ECLAT algorithm cont.
ECLAT stores a list of transaction ids (tids) for each item, i.e., it applies a vertical data layout:
Horizontal data layout (TID: items): 1: A,B,E; 2: B,C,D; 3: C,E; 4: A,C,D; 5: A,B,C,D; 6: A,E; 7: A,B; 8: A,B,C; 9: A,C,D; 10: B
Vertical data layout (item: TID-list): A: 1,4,5,6,7,8,9; B: 1,2,5,7,8,10; C: 2,3,4,5,8,9; D: 2,4,5,9; E: 1,3,6

60 ECLAT
Determine the support of any k-itemset by intersecting the tid-lists of two of its (k-1)-subsets (e.g. the tid-list of AB from the tid-lists of A and B).
Traversal approaches: top-down, bottom-up and hybrid.
Advantage: very fast support counting. Disadvantage: intermediate tid-lists may become too large for memory.
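The intersection step above is a one-liner with sets. This is a minimal sketch of ECLAT-style support counting on the tid-lists from the previous slide; it is an illustration, not a full ECLAT implementation.

```python
# ECLAT-style support counting: the tid-list of a k-itemset is the
# intersection of the tid-lists of two of its (k-1)-subsets, and the
# support is the length of that tid-list.
tidlists = {
    "A": {1, 4, 5, 6, 7, 8, 9},
    "B": {1, 2, 5, 7, 8, 10},
    "C": {2, 3, 4, 5, 8, 9},
    "D": {2, 4, 5, 9},
    "E": {1, 3, 6},
}

tid_AB = tidlists["A"] & tidlists["B"]     # tid-list of itemset {A,B}
print(len(tid_AB))                         # support of {A,B} = 4
tid_ABC = tid_AB & tidlists["C"]           # grow to {A,B,C} by one more intersection
print(sorted(tid_ABC))                     # [5, 8]
```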

61 Alternative SQL-related approaches for frequent itemset mining (I)
A work which mines the frequent itemsets with the vertical data format is Holsheimer et al. (1995):
Holsheimer M, Kersten M, Mannila H, Toivonen H (1995): A perspective on databases and data mining. In: Proceedings of the 1995 international conference on knowledge discovery and data mining (KDD 95), Montreal, Canada.
This work demonstrated that one can also explore the potential of solving data mining problems using general-purpose database management systems (DBMSs), with quite good results.

62 Alternative SQL-related approaches for frequent itemset mining (II)
PROjection PAttern Discovery, PROPAD, applies a frequent-pattern-growth approach to frequent itemset mining using a vertical data format.
X. Shang, K.-U. Sattler, and I. Geist (2004): Efficient Frequent Pattern Mining in Relational Databases. 5. Workshop des GI-Arbeitskreis Knowledge Discovery (AK KD) im Rahmen der LWA.
The PROPAD method has the following merits:
- Avoids repeatedly scanning the transaction table; only one scan is needed to generate the transformed transaction table
- Avoids complex joins between candidate-itemset tables and transaction tables, replacing them with simple joins between smaller projected transaction tables and frequent-itemset tables
- Avoids the cost of materializing frequent-itemset tree tables

63 Alternative SQL-related approaches for frequent itemset mining (II) continued
PROPAD continued:
Idea: instead of bottom-up, extend a frequent prefix in a top-down fashion by adding a single locally frequent item to it.
Question: what does "locally" mean?
Answer: to find the frequent itemsets that contain an item i, the only transactions that need to be considered are the transactions that contain i.
Definition: a frequent-item i-related projected transaction table, denoted PT_i, contains all frequent items (larger than i) in the transactions that contain i.
Let's look at an example!
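The definition of PT_i can be sketched in Python rather than SQL. This is a hypothetical sketch under stated assumptions: "larger than i" is interpreted as coming after i in a given item order, and the function name `projected_table` is illustrative, not PROPAD's actual relational operator.

```python
# Build the frequent-item i-related projected transaction table PT_i:
# for each transaction containing i, keep only the frequent items that
# come after i in the item order.
from collections import Counter

def projected_table(db, i, minsup, order):
    counts = Counter(item for items in db.values() for item in items)
    keep = {x for x, c in counts.items()
            if c >= minsup and order.index(x) > order.index(i)}
    return {tid: items & keep for tid, items in db.items() if i in items}

db = {1: {"A","B"}, 2: {"B","C","D"}, 3: {"A","C","D","E"},
      4: {"A","D","E"}, 5: {"A","B","C"}, 6: {"A","B","C","D"},
      7: {"A"}, 8: {"A","B","C"}, 9: {"A","B","D"}, 10: {"B","C","E"}}
pt_A = projected_table(db, "A", 2, ["A", "B", "C", "D", "E"])
print(sorted(pt_A[3]))       # frequent items after A in TID 3: C, D, E
```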

64 PROPAD example
[Figure: the example pipeline - transactions in relational format; frequent items; filtered transactions; items co-occurring with item 1; projected table on item 1; items co-occurring with item 3 (and 1); projected table on item 3 (and 1)]

65 PROPAD frequent itemset tree
Discover all frequent itemsets by recursively filtering and projecting transactions in a depth-first-search manner, until there are no frequent items left in the filtered/projected table.

66 Alternative SQL-related approaches for frequent itemset mining (III)
SQL-based frequent pattern mining with FP-growth implements an SQL-database version of an FP-growth-like approach.
Xuequn Shang, Kai-Uwe Sattler, Ingolf Geist: SQL Based Frequent Pattern Mining with FP-Growth. Lecture Notes in Computer Science (January 2005).
- The FP-tree is represented using a relational table
- Shows better performance than Apriori on large data sets or large patterns

67 Overview of frequent itemset mining
The following overview of frequent itemset mining is mainly based upon the survey: Frequent pattern mining: current status and future directions, by Jiawei Han, Hong Cheng, Dong Xin and Xifeng Yan, Data Mining and Knowledge Discovery (2007) 15.

68 Apriori principle, Apriori algorithm and its extensions
Agrawal and Srikant (1994) observed an interesting downward closure property among frequent k-itemsets, called Apriori: a k-itemset is frequent only if all of its sub-itemsets are frequent. This implies that frequent itemsets can be mined by first scanning the database to find the frequent 1-itemsets, then using the frequent 1-itemsets to generate candidate frequent 2-itemsets and checking them against the database to obtain the frequent 2-itemsets. The process iterates until no more frequent k-itemsets can be generated for some k. This is the essence of the Apriori algorithm (Agrawal and Srikant 1994) and its alternative (Mannila et al. 1994).
Agrawal R, Srikant R (1994): Fast algorithms for mining association rules. In: Proceedings of the 1994 international conference on very large data bases (VLDB 94), Santiago, Chile.
Mannila H, Toivonen H, Verkamo AI (1994): Efficient algorithms for discovering association rules. In: Proceedings of the AAAI 94 workshop on knowledge discovery in databases (KDD 94), Seattle, WA.
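The level-wise iteration described above can be sketched compactly. This is a minimal illustration of the Apriori idea, not the paper's optimized implementation; run on the FP-growth example database from the earlier slides, it finds the same frequent itemsets.

```python
# Level-wise Apriori: count candidates against the database, keep the
# frequent ones, then join and prune (downward closure) to form the
# next level of candidates.
from itertools import combinations

def apriori(db, minsup):
    frequent = {}
    items = sorted({i for t in db for i in t})
    level = [frozenset([i]) for i in items]        # candidate 1-itemsets
    while level:
        counts = {c: sum(1 for t in db if c <= t) for c in level}
        level_freq = {c for c, n in counts.items() if n >= minsup}
        frequent.update({c: counts[c] for c in level_freq})
        k = len(next(iter(level))) + 1
        # Join step: unions of two frequent k-itemsets of size k+1.
        candidates = {a | b for a in level_freq for b in level_freq
                      if len(a | b) == k}
        # Prune step: every (k-1)-subset must itself be frequent.
        level = [c for c in candidates
                 if all(frozenset(s) in level_freq
                        for s in combinations(c, k - 1))]
    return frequent

db = [{"A","B"}, {"B","C","D"}, {"A","C","D","E"}, {"A","D","E"},
      {"A","B","C"}, {"A","B","C","D"}, {"A"}, {"A","B","C"},
      {"A","B","D"}, {"B","C","E"}]
freq = apriori(db, 2)
print(freq[frozenset({"A", "D", "E"})])    # 2, matching the FP-growth result
```

With min_sup = 2 this yields 19 frequent itemsets, the same set FP-growth produced on the earlier slides.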

69 Apriori principle, Apriori algorithm and its extensions (II)
Since the Apriori algorithm was proposed, there have been extensive studies on improvements or extensions of Apriori:
- hashing technique (Park et al. 1995)
- partitioning technique (Savasere et al. 1995)
- sampling approach (Toivonen 1996)
- dynamic itemset counting (Brin et al. 1997)
- incremental mining (Cheung et al. 1996)
- parallel and distributed mining (Park et al. 1995; Agrawal and Shafer 1996; Cheung et al. 1996; Zaki et al. 1997)
- integrating mining with relational database systems (Sarawagi et al. 1998)
Park JS, Chen MS, Yu PS (1995): An effective hash-based algorithm for mining association rules. In: Proceedings of the 1995 ACM-SIGMOD international conference on management of data (SIGMOD 95), San Jose, CA.
Savasere A, Omiecinski E, Navathe S (1995): An efficient algorithm for mining association rules in large databases. In: Proceedings of the 1995 international conference on very large data bases (VLDB 95), Zurich, Switzerland.

70 Apriori principle, Apriori algorithm and its extensions (III)
Toivonen H (1996): Sampling large databases for association rules. In: Proceedings of the 1996 international conference on very large data bases (VLDB 96), Bombay, India.
Brin S, Motwani R, Ullman JD, Tsur S (1997): Dynamic itemset counting and implication rules for market basket analysis. In: Proceedings of the 1997 ACM-SIGMOD international conference on management of data (SIGMOD 97), Tucson, AZ.
Cheung DW, Han J, Ng V, Wong CY (1996): Maintenance of discovered association rules in large databases: an incremental updating technique. In: Proceedings of the 1996 international conference on data engineering (ICDE 96), New Orleans, LA.
Park JS, Chen MS, Yu PS (1995): Efficient parallel mining for association rules. In: Proceedings of the 4th international conference on information and knowledge management, Baltimore, MD.
Agrawal R, Shafer JC (1996): Parallel mining of association rules: design, implementation, and experience. IEEE Transactions on Knowledge and Data Engineering 8.
Cheung DW, Han J, Ng V, Fu A, Fu Y (1996): A fast distributed algorithm for mining association rules. In: Proceedings of the 1996 international conference on parallel and distributed information systems, Miami Beach, FL.
Zaki MJ, Parthasarathy S, Ogihara M, Li W (1997): Parallel algorithm for discovery of association rules. Data Mining and Knowledge Discovery 1.
Sarawagi S, Thomas S, Agrawal R (1998): Integrating association rule mining with relational database systems: alternatives and implications. In: Proceedings of the 1998 ACM-SIGMOD international conference on management of data (SIGMOD 98), Seattle, WA.

71 Mining frequent itemsets without candidate generation (I)
In many cases, the Apriori algorithm significantly reduces the size of the candidate sets using the Apriori principle. However, it can suffer from two nontrivial costs: (1) generating a huge number of candidate sets, and (2) repeatedly scanning the database and checking the candidates by pattern matching.
Han et al. (2000) devised the FP-growth method, which mines the complete set of frequent itemsets without candidate generation. FP-growth works in a divide-and-conquer way. The first scan of the database derives a list of frequent items, ordered by descending frequency. According to this list, the database is compressed into a frequent-pattern tree, or FP-tree, which retains the itemset association information. The FP-tree is mined by starting from each frequent length-1 pattern (as an initial suffix pattern), constructing its conditional pattern base (a sub-database consisting of the set of prefix paths in the FP-tree co-occurring with the suffix pattern), then constructing its conditional FP-tree and performing mining recursively on that tree. The pattern growth is achieved by concatenating the suffix pattern with the frequent patterns generated from a conditional FP-tree.
The FP-growth algorithm transforms the problem of finding long frequent patterns into searching for shorter ones recursively and then concatenating the suffix. It uses the least frequent items as suffixes, offering good selectivity. Performance studies demonstrate that the method substantially reduces search time.

72 Mining frequent itemsets without candidate generation (II)
There are many alternatives and extensions to the FP-growth approach:
- Depth-first generation of frequent itemsets by Agarwal et al. (2001), i.e. the tree projection algorithm introduced previously. Agarwal R, Aggarwal CC, Prasad VVV (2001): A tree projection algorithm for generation of frequent itemsets. Journal of Parallel and Distributed Computing 61.
- H-Mine, by Pei et al. (2001), which explores a hyper-structure mining of frequent patterns. Jian Pei, Jiawei Han, Hongjun Lu, Shojiro Nishio, Shiwei Tang and Dongqing Yang: H-Mine: Hyper-Structure Mining of Frequent Patterns in Large Databases. In: Proceedings of the 2001 IEEE international conference on data mining (ICDM 01).
- Building alternative trees, and exploring top-down and bottom-up traversal of such trees in pattern-growth mining, by Liu et al. (2002; 2003). Liu J, Pan Y, Wang K, Han J (2002): Mining frequent item sets by opportunistic projection. In: Proceedings of the 2002 ACM SIGKDD international conference on knowledge discovery in databases (KDD 02), Edmonton, Canada. Liu G, Lu H, Lou W, Yu JX (2003): On computing, storing and querying frequent patterns. In: Proceedings of the 2003 ACM SIGKDD international conference on knowledge discovery and data mining (KDD 03), Washington, DC.
- FPgrowth*, an array-based implementation of a prefix-tree structure for efficient pattern-growth mining, by Grahne and Zhu (2003). Grahne G, Zhu J (2003): Efficiently using prefix-trees in mining frequent itemsets. In: Proceedings of the ICDM 03 international workshop on frequent itemset mining implementations (FIMI 03), Melbourne, FL.

73 Mining frequent itemsets using the vertical data format
Both the Apriori and FP-growth methods mine frequent patterns from a set of transactions in horizontal data format (i.e., {TID: itemset}), where TID is a transaction id and itemset is the set of items bought in transaction TID. Alternatively, mining can also be performed with data presented in vertical data format (i.e., {item: TID_set}).
Zaki (2000) proposed the Equivalence CLAss Transformation (ECLAT) algorithm, which explores the vertical data format (shown in earlier slides). Besides taking advantage of the Apriori property in the generation of candidate (k+1)-itemsets from frequent k-itemsets, another merit of this method is that there is no need to scan the database to find the support of (k+1)-itemsets (for k ≥ 1), because the TID_set of each k-itemset carries the complete information required for counting that support.
Another related work which mines the frequent itemsets with the vertical data format is Holsheimer et al. (1995). This work demonstrated that, though impressive results have been achieved for some data mining problems using highly specialized and clever data structures, one can also explore the potential of solving data mining problems using general-purpose database management systems.

74 Mining closed and maximal frequent itemsets (I)
A major challenge in mining frequent patterns from a large data set is that such mining often generates a huge number of patterns satisfying the min_sup threshold, especially when min_sup is set low: a large pattern contains an exponential number of smaller, frequent sub-patterns. To overcome this problem, closed frequent pattern mining and maximal frequent pattern mining were proposed.
The mining of frequent closed itemsets was proposed by Pasquier et al. (1999), who presented an Apriori-based algorithm called A-Close for such mining. Other closed pattern mining algorithms include CLOSET (Pei et al. 2000), CHARM (Zaki and Hsiao 2002), CLOSET+ (Wang et al. 2003), FPClose (Grahne and Zhu 2003) and AFOPT (Liu et al. 2003). A more recent efficient algorithm, CFPgrowth (Schlegel et al. 2011), introduces a compact representation, the CFP-tree, for representing closed frequent patterns.
The main challenge in closed (maximal) frequent pattern mining is to check whether a pattern is closed (maximal). There are two strategies to approach this issue: (1) keep track of the TID list of a pattern and index the pattern by hashing its TID values; this method is used by CHARM, which maintains a compact TID list called a diffset. (2) Maintain the discovered patterns in a pattern-tree similar to the FP-tree; this method is exploited by CLOSET+, AFOPT and FPClose.
Mining closed itemsets provides an interesting and important alternative to mining frequent itemsets, since it retains the same analytical power but generates a much smaller set of results, achieving better scalability and interpretability.
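The definitions themselves are easy to state in code. The following naive post-processing check (a sketch, not CHARM or CLOSET, which perform these checks during mining) makes the distinction concrete:

```python
def closed_and_maximal(supports):
    """Identify closed and maximal itemsets among the frequent ones.

    supports: {frozenset_itemset: support} for ALL frequent itemsets.
    Closed:  no proper superset has the same support.
    Maximal: no proper superset is frequent at all.
    """
    closed, maximal = set(), set()
    for itemset, sup in supports.items():
        supersets = [s for s in supports if itemset < s]   # proper supersets
        if all(supports[s] < sup for s in supersets):
            closed.add(itemset)
        if not supersets:            # nothing frequent strictly above it
            maximal.add(itemset)
    return closed, maximal
```

Every maximal itemset is closed, but not vice versa: an itemset whose supersets are all strictly less frequent is closed while still having frequent supersets.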

75 Mining closed and maximal frequent itemsets (II)
Mining max-patterns was first studied by Bayardo (1998), where MaxMiner, an Apriori-based, level-wise, breadth-first search method, was proposed to find maximal itemsets by performing superset frequency pruning and subset infrequency pruning to reduce the search space.
Another efficient method, MAFIA, proposed by Burdick et al. (2001), uses vertical bitmaps to compress the transaction id lists, thus improving counting efficiency.
Yang (2004) provided a theoretical analysis of the (worst-case) complexity of mining max-patterns; enumerating maximal itemsets is shown to be NP-hard. Ramesh et al. (2003) characterized the length distributions of frequent and maximal frequent itemset collections, and characterized the conditions under which such distributions can be embedded in a database.

76 Mining multilevel, multidimensional, and quantitative association rules
Since data items and transactions are (conceptually) organized in multilevel and/or multidimensional space, it is natural to extend the mining of frequent itemsets and their corresponding association rules to multilevel and multidimensional space. Multilevel association rules involve concepts at different levels of abstraction, whereas multidimensional association rules involve more than one dimension or predicate.
Multilevel association rules can be mined efficiently using concept hierarchies under a support-confidence framework. For example, if the min_sup threshold is uniform across levels, one can first mine higher-level frequent itemsets and then mine only those lower-level itemsets whose corresponding higher-level itemsets are frequent (Srikant and Agrawal 1995; Han and Fu 1995). Moreover, redundant rules can be filtered out if the lower-level rules can essentially be derived from higher-level rules and the corresponding item distributions (Srikant and Agrawal 1995). Efficient mining methods can also be derived when min_sup varies across levels (Han and Kamber 2006).
This methodology can be extended to mining multidimensional association rules when the data or transactions are located in multidimensional space, such as in a relational database or data warehouse (Kamber et al. 1997).
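The top-down pruning idea for a uniform min_sup can be sketched for two levels as follows. The hierarchy mapping and the category names are hypothetical, and only 1-itemset counting is shown; the same pruning carries over to larger itemsets:

```python
def multilevel_frequent(transactions, parent, min_sup):
    """Two-level sketch of top-down multilevel mining with uniform min_sup:
    mine high-level categories first, then count only those items whose
    category is frequent. `parent` maps each item to its category.
    """
    cat_count, item_count = {}, {}
    for t in transactions:
        for c in {parent[i] for i in t}:   # count each category once per transaction
            cat_count[c] = cat_count.get(c, 0) + 1
    freq_cats = {c for c, n in cat_count.items() if n >= min_sup}
    for t in transactions:
        for i in t:
            if parent[i] in freq_cats:     # prune items under infrequent categories
                item_count[i] = item_count.get(i, 0) + 1
    freq_items = {i for i, n in item_count.items() if n >= min_sup}
    return freq_cats, freq_items
```

The pruning is sound because an item's support can never exceed that of its ancestor category.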

77 Mining multilevel, multidimensional, and quantitative association rules (II)
The frequent patterns and association rules discussed so far are over discrete items, such as item name, product category, and location. However, one may also want to find frequent patterns and associations over numerical attributes, such as salary, age, and scores. For numerical attributes, quantitative association rules can be mined with several alternative methods, including:
Exploring the notion of partial completeness, by Srikant and Agrawal (1996);
Mining binned intervals and then clustering the mined quantitative association rules for a concise representation, as in the ARCS system, by Lent et al. (1997);
Methods based on x-monotone and rectilinear regions, by Fukuda et al. (1996) and Yoda et al. (1997); or
Using a distance-based (clustering) method over interval data, by Miller and Yang (1997).
Mining quantitative association rules based on a statistical theory, to present only those that deviate substantially from normal data, was studied by Aumann and Lindell (1999). Zhang et al. (2004) considered mining statistical quantitative rules.
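The common first step behind several of these methods is discretization: turning a numeric attribute into interval "items" so that boolean itemset mining applies. A minimal equi-width binning sketch (the simplest strategy, not ARCS itself):

```python
def equi_width_bins(values, n_bins):
    """Map each numeric value to an equi-width interval (lo, hi).

    After binning, each interval acts as a boolean item, e.g.
    age in [20, 30) becomes the item "age:20-30", and ordinary
    association rule mining can be applied.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    labels = []
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)   # clamp the maximum value
        labels.append((lo + idx * width, lo + (idx + 1) * width))
    return labels
```

Equi-width bins are sensitive to outliers and bin count, which is exactly why the more refined approaches above (partial completeness, clustering, region-based methods) were developed.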

78 Mining high-dimensional datasets and mining colossal patterns (I)
The growth of bioinformatics has resulted in datasets with new characteristics. Microarray and mass spectrometry technologies, used for measuring gene expression levels and in cancer research respectively, typically generate only tens or hundreds of rows of very high-dimensional data (e.g., 10,000–100,000 columns). Such datasets pose a great challenge for existing (closed) frequent itemset mining algorithms, since the number of item combinations is exponential in the row length.
Pan et al. (2003) proposed CARPENTER, a method for finding closed patterns in high-dimensional biological datasets, which integrates the advantages of vertical data formats and pattern-growth methods. By converting the data into vertical data format {item: TID_set}, each TID_set can be viewed as a rowset, and the FP-tree so constructed can be viewed as a row enumeration tree. CARPENTER conducts a depth-first traversal of the row enumeration tree and checks the rowset corresponding to each visited node to see whether it is frequent and closed.
Pan et al. (2004) proposed COBBLER, which finds frequent closed itemsets by integrating row enumeration with column enumeration. Its efficiency has been demonstrated in experiments on a data set with high dimensionality and a relatively large number of rows.
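The key observation behind row enumeration is that with few rows and many columns, there are at most 2^(#rows) rowsets to consider, far fewer than the 2^(#columns) itemsets. A sketch of the data transposition and of the closure of a rowset (the building blocks of this idea, not CARPENTER itself):

```python
def to_rowsets(rows):
    """Vertical view used by row-enumeration miners: item -> set of row ids.

    rows: list of item sets, one per row (e.g. one per tissue sample).
    """
    rowsets = {}
    for rid, row in enumerate(rows):
        for item in row:
            rowsets.setdefault(item, set()).add(rid)
    return rowsets

def closure_of_rowset(rows, rowset):
    """The closed itemset supported by a rowset is the intersection of its
    rows: the largest itemset contained in every one of those rows."""
    return set.intersection(*(rows[r] for r in rowset))
```

A row-enumeration miner walks over rowsets depth-first and emits `closure_of_rowset` for each frequent rowset visited, checking for closedness against patterns found so far.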

79 Mining high-dimensional datasets and mining colossal patterns (II)
Liu et al. (2006) proposed TD-Close to find the complete set of frequent closed patterns in high-dimensional data. It exploits a new search strategy, top-down mining, starting from the maximal rowset, integrated with a novel row enumeration tree, which makes full use of the pruning power of the min_sup threshold to cut down the search space. Furthermore, an effective closeness-checking method is developed that avoids scanning the dataset multiple times.
Even with various kinds of enhancements, the above frequent, closed and maximal pattern mining algorithms still encounter challenges when mining rather large (so-called colossal) patterns, since the process needs to generate an explosive number of smaller frequent patterns. Colossal patterns are critical to many applications, especially in domains like bioinformatics.
Zhu et al. (2007) investigated a novel mining approach, called Pattern-Fusion, to efficiently find a good approximation to the colossal patterns. With Pattern-Fusion, a colossal pattern is discovered by fusing its small fragments in one step, whereas incremental pattern-growth mining strategies, such as those adopted in Apriori and FP-growth, have to examine a large number of mid-sized ones.

80 Mining sequential patterns (I)
A sequence database consists of ordered elements or events, recorded with or without a concrete notion of time. There are many applications involving sequence data, such as customer shopping sequences, Web clickstreams, and biological sequences.
Sequential pattern mining, the mining of frequently occurring ordered events or subsequences as patterns, was first introduced by Agrawal and Srikant (1995).
Generalized Sequential Patterns (GSP), a representative Apriori-based sequential pattern mining algorithm proposed by Srikant and Agrawal (1996), uses the downward-closure property of sequential patterns and adopts a multiple-pass, candidate generate-and-test approach. GSP also generalized their earlier notion from Agrawal and Srikant (1995) to include time constraints, a sliding time window, and user-defined taxonomies.
Zaki (2001) developed a vertical-format-based sequential pattern mining method called SPADE, an extension of vertical-format-based frequent itemset mining methods such as Eclat and CHARM (Zaki 1998; Zaki and Hsiao 2002).

81 Mining sequential patterns (II)
In vertical data format, the database becomes a set of tuples of the form itemset : (sequence_id, event_id). The set of ID pairs for a given itemset forms the ID_list of the itemset. To discover a length-k sequence, SPADE joins the ID_lists of any two of its length-(k−1) subsequences. The length of the resulting ID_list equals the support of the length-k sequence. The procedure stops when no frequent sequences can be found or no sequences can be formed by such joins. The use of the vertical data format reduces scans of the sequence database, since the ID_lists carry all the information necessary to compute the support of candidates.
The basic search methodology of SPADE and GSP is breadth-first search with Apriori pruning. Both algorithms have to generate large sets of candidates in order to grow longer sequences.
PrefixSpan, a pattern-growth approach to sequential pattern mining, was developed by Pei et al. (2001, 2004). PrefixSpan works in a divide-and-conquer way. The first scan of the database derives the set of length-1 sequential patterns. Each sequential pattern is then treated as a prefix, and the complete set of sequential patterns is partitioned into different subsets according to the different prefixes. To mine these subsets of sequential patterns, the corresponding projected databases are constructed and mined recursively.
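The ID_list join described above can be illustrated with a simplified temporal join (a sketch of the idea behind SPADE's sequence join, not Zaki's implementation, which distinguishes several join cases and uses prefix-based equivalence classes):

```python
def temporal_join(idlist_a, idlist_b):
    """Given the (sid, eid) ID_lists of two patterns that share all but their
    last event, produce the ID_list of the pattern 'a then b': b must occur
    strictly after a within the same sequence."""
    joined = set()
    for sid_a, eid_a in idlist_a:
        for sid_b, eid_b in idlist_b:
            if sid_a == sid_b and eid_b > eid_a:
                joined.add((sid_b, eid_b))   # record position of the new last event
    return sorted(joined)

def support(idlist):
    """Support of a sequential pattern = number of distinct sequence ids."""
    return len({sid for sid, _ in idlist})
```

No scan of the original sequence database is needed: the joined ID_list alone determines the support of the longer pattern.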

82 Mining sequential patterns (III)
A performance comparison of GSP, SPADE, and PrefixSpan shows that PrefixSpan has the best overall performance (Pei et al. 2004). SPADE, although weaker than PrefixSpan in most cases, outperforms GSP. The comparison also found that when there is a large number of frequent subsequences, all three algorithms run slowly. The problem can be partially solved by closed sequential pattern mining, where the closed subsequences are those sequential patterns having no supersequence with the same support.
The CloSpan algorithm for mining closed sequential patterns was proposed by Yan et al. (2003). The method is based on a property of sequence databases, called equivalence of projected databases, stated as follows: two projected sequence databases S_α and S_β, with α a subsequence of β, are equivalent (S_α = S_β) if and only if the total number of items in S_α is equal to the total number of items in S_β, where S_α is the projected database with respect to the prefix α. Based on this property, CloSpan can prune non-closed sequences from further consideration during the mining process.
The later BIDE algorithm, a bidirectional search for mining frequent closed sequences, was developed by Wang and Han (2004); it further optimizes this process by projecting sequence datasets in two directions.
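The definition of a closed subsequence can be made concrete with a naive post-filter (for illustration only; CloSpan and BIDE avoid this brute-force check by pruning during mining):

```python
def closed_sequences(patterns):
    """Keep only closed sequential patterns: drop p if some other pattern q
    is a proper supersequence of p with the same support.

    patterns: {tuple_of_events: support}.
    """
    def is_subseq(a, b):                 # is a a (not necessarily contiguous)
        it = iter(b)                     # subsequence of b?
        return all(x in it for x in a)   # each element found, in order
    return {p: s for p, s in patterns.items()
            if not any(q != p and s == sq and is_subseq(p, q)
                       for q, sq in patterns.items())}
```

For example, if <a>, <b> and <a b> all have support 3, only <a b> is closed: the shorter patterns carry no extra information.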

83 Mining sequential patterns (IV)
The studies of sequential pattern mining have been extended in several different ways:
Mannila et al. (1997) considered frequent episodes in sequences, where episodes are essentially acyclic graphs of events whose edges specify the temporal before-and-after relationship but without timing-interval restrictions.
Sequential pattern mining for plan failures was proposed by Zaki et al. (1998).
Garofalakis et al. (1999) proposed the use of regular expressions as a flexible constraint specification tool that enables user-controlled focus to be incorporated into the sequential pattern mining process.
Pinto et al. (2001) proposed embedding multidimensional, multilevel information into a transformed sequence database for sequential pattern mining.
Pei et al. (2002) studied issues regarding constraint-based sequential pattern mining.
CLUSEQ, a sequence clustering algorithm, was developed by Yang and Wang (2003).
An incremental sequential pattern mining algorithm, IncSpan, was proposed by Cheng et al. (2004).
SeqIndex, efficient sequence indexing by frequent and discriminative analysis of sequential patterns, was studied by Cheng et al. (2005).
A method for parallel mining of closed sequential patterns was proposed by Cong et al. (2005).
A method, MSPX, for mining maximal sequential patterns using multiple samples, was proposed by Luo and Chung (2005).

84 Mining sequential patterns (V)
Periodicity analysis has been an interesting theme in data mining.
Özden et al. (1998) studied methods for mining periodic, or cyclic, association rules.
Lu et al. (1998) proposed intertransaction association rules, which are implication rules whose two sides are totally ordered episodes with timing-interval restrictions (on the events in the episodes and on the two sides). Bettini et al. (1998) considered a generalization of intertransaction association rules.
The notion of mining partial periodicity was first proposed by Han, Dong, and Yin, together with a max-subpattern hit set method (Han et al. 1999). Ma and Hellerstein (2001) proposed a method for mining partially periodic event patterns with unknown periods. Yang et al. (2003) studied mining asynchronous periodic patterns in time-series data.
Mining partial orders from unordered 0-1 data was studied by Gionis et al. (2003) and Ukkonen et al. (2005). Pei et al. (2005) proposed an algorithm for mining frequent closed partial orders from string sequences.

85 Mining structural patterns: graphs, trees and lattices
Many scientific and commercial applications need patterns that are more complicated than frequent itemsets and sequential patterns. Such sophisticated patterns go beyond sets and sequences, toward trees, lattices, and graphs. As a general data structure, graphs have become increasingly important for modeling sophisticated structures and their interactions, with broad applications including chemical informatics, bioinformatics, computer vision, video indexing, text retrieval, and Web analysis.
Frequent substructures are the most basic patterns that can be discovered in a collection of graphs, and recent studies have developed several frequent substructure mining methods. Washio and Motoda (2003) conducted a survey on graph-based data mining. Holder et al. (1994) proposed SUBDUE for approximate substructure pattern discovery based on minimum description length and background knowledge. Dehaspe et al. (1998) applied inductive logic programming to predict chemical carcinogenicity by mining frequent substructures.
Besides these studies, there are two basic approaches to the frequent substructure mining problem: the Apriori-based approach and the pattern-growth approach.

86 Mining interesting frequent patterns
Although numerous scalable methods have been developed for mining frequent patterns and closed (maximal) patterns, such mining often generates a huge number of frequent patterns, while people would like to see or use only the interesting ones. What are interesting patterns, and how can they be mined efficiently? To answer such questions, many recent studies have contributed to mining interesting patterns or rules, including constraint-based mining, mining incomplete or compressed patterns, and interestingness measures with correlation analysis.
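A standard correlation-based interestingness measure is lift, which compares the observed joint support of a rule's sides against what independence would predict. A minimal sketch:

```python
def lift(sup_ab, sup_a, sup_b, n_transactions):
    """Lift of rule A -> B, computed from absolute support counts:

        lift(A, B) = P(A and B) / (P(A) * P(B))

    > 1: A and B are positively correlated,
    = 1: A and B are independent,
    < 1: A and B are negatively correlated.
    """
    n = n_transactions
    return (sup_ab / n) / ((sup_a / n) * (sup_b / n))
```

A rule can have high support and confidence and still have lift near (or below) 1, which is why support-confidence filtering alone does not guarantee interesting rules.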


More information

Discovery of Multi-level Association Rules from Primitive Level Frequent Patterns Tree

Discovery of Multi-level Association Rules from Primitive Level Frequent Patterns Tree Discovery of Multi-level Association Rules from Primitive Level Frequent Patterns Tree Virendra Kumar Shrivastava 1, Parveen Kumar 2, K. R. Pardasani 3 1 Department of Computer Science & Engineering, Singhania

More information

An improved approach of FP-Growth tree for Frequent Itemset Mining using Partition Projection and Parallel Projection Techniques

An improved approach of FP-Growth tree for Frequent Itemset Mining using Partition Projection and Parallel Projection Techniques An improved approach of tree for Frequent Itemset Mining using Partition Projection and Parallel Projection Techniques Rana Krupali Parul Institute of Engineering and technology, Parul University, Limda,

More information

APPLYING BIT-VECTOR PROJECTION APPROACH FOR EFFICIENT MINING OF N-MOST INTERESTING FREQUENT ITEMSETS

APPLYING BIT-VECTOR PROJECTION APPROACH FOR EFFICIENT MINING OF N-MOST INTERESTING FREQUENT ITEMSETS APPLYIG BIT-VECTOR PROJECTIO APPROACH FOR EFFICIET MIIG OF -MOST ITERESTIG FREQUET ITEMSETS Zahoor Jan, Shariq Bashir, A. Rauf Baig FAST-ational University of Computer and Emerging Sciences, Islamabad

More information

AC-Close: Efficiently Mining Approximate Closed Itemsets by Core Pattern Recovery

AC-Close: Efficiently Mining Approximate Closed Itemsets by Core Pattern Recovery : Efficiently Mining Approximate Closed Itemsets by Core Pattern Recovery Hong Cheng Philip S. Yu Jiawei Han University of Illinois at Urbana-Champaign IBM T. J. Watson Research Center {hcheng3, hanj}@cs.uiuc.edu,

More information

Chapter 6: Association Rules

Chapter 6: Association Rules Chapter 6: Association Rules Association rule mining Proposed by Agrawal et al in 1993. It is an important data mining model. Transaction data (no time-dependent) Assume all data are categorical. No good

More information

Chapter 6: Basic Concepts: Association Rules. Basic Concepts: Frequent Patterns. (absolute) support, or, support. (relative) support, s, is the

Chapter 6: Basic Concepts: Association Rules. Basic Concepts: Frequent Patterns. (absolute) support, or, support. (relative) support, s, is the Chapter 6: What Is Frequent ent Pattern Analysis? Frequent pattern: a pattern (a set of items, subsequences, substructures, etc) that occurs frequently in a data set frequent itemsets and association rule

More information

Parallel Mining of Maximal Frequent Itemsets in PC Clusters

Parallel Mining of Maximal Frequent Itemsets in PC Clusters Proceedings of the International MultiConference of Engineers and Computer Scientists 28 Vol I IMECS 28, 19-21 March, 28, Hong Kong Parallel Mining of Maximal Frequent Itemsets in PC Clusters Vong Chan

More information

DESIGN AND CONSTRUCTION OF A FREQUENT-PATTERN TREE

DESIGN AND CONSTRUCTION OF A FREQUENT-PATTERN TREE DESIGN AND CONSTRUCTION OF A FREQUENT-PATTERN TREE 1 P.SIVA 2 D.GEETHA 1 Research Scholar, Sree Saraswathi Thyagaraja College, Pollachi. 2 Head & Assistant Professor, Department of Computer Application,

More information

Keywords: Mining frequent itemsets, prime-block encoding, sparse data

Keywords: Mining frequent itemsets, prime-block encoding, sparse data Computing and Informatics, Vol. 32, 2013, 1079 1099 EFFICIENTLY USING PRIME-ENCODING FOR MINING FREQUENT ITEMSETS IN SPARSE DATA Karam Gouda, Mosab Hassaan Faculty of Computers & Informatics Benha University,

More information

Frequent Pattern Mining S L I D E S B Y : S H R E E J A S W A L

Frequent Pattern Mining S L I D E S B Y : S H R E E J A S W A L Frequent Pattern Mining S L I D E S B Y : S H R E E J A S W A L Topics to be covered Market Basket Analysis, Frequent Itemsets, Closed Itemsets, and Association Rules; Frequent Pattern Mining, Efficient

More information

Memory issues in frequent itemset mining

Memory issues in frequent itemset mining Memory issues in frequent itemset mining Bart Goethals HIIT Basic Research Unit Department of Computer Science P.O. Box 26, Teollisuuskatu 2 FIN-00014 University of Helsinki, Finland bart.goethals@cs.helsinki.fi

More information

Graph Based Approach for Finding Frequent Itemsets to Discover Association Rules

Graph Based Approach for Finding Frequent Itemsets to Discover Association Rules Graph Based Approach for Finding Frequent Itemsets to Discover Association Rules Manju Department of Computer Engg. CDL Govt. Polytechnic Education Society Nathusari Chopta, Sirsa Abstract The discovery

More information

and maximal itemset mining. We show that our approach with the new set of algorithms is efficient to mine extremely large datasets. The rest of this p

and maximal itemset mining. We show that our approach with the new set of algorithms is efficient to mine extremely large datasets. The rest of this p YAFIMA: Yet Another Frequent Itemset Mining Algorithm Mohammad El-Hajj, Osmar R. Zaïane Department of Computing Science University of Alberta, Edmonton, AB, Canada {mohammad, zaiane}@cs.ualberta.ca ABSTRACT:

More information

An Improved Algorithm for Mining Association Rules Using Multiple Support Values

An Improved Algorithm for Mining Association Rules Using Multiple Support Values An Improved Algorithm for Mining Association Rules Using Multiple Support Values Ioannis N. Kouris, Christos H. Makris, Athanasios K. Tsakalidis University of Patras, School of Engineering Department of

More information

Mining Frequent Patterns from Very High Dimensional Data: A Top-Down Row Enumeration Approach *

Mining Frequent Patterns from Very High Dimensional Data: A Top-Down Row Enumeration Approach * Mining Frequent Patterns from Very High Dimensional Data: A Top-Down Row Enumeration Approach * Hongyan Liu 1 Jiawei Han 2 Dong Xin 2 Zheng Shao 2 1 Department of Management Science and Engineering, Tsinghua

More information

Association Rule Mining from XML Data

Association Rule Mining from XML Data 144 Conference on Data Mining DMIN'06 Association Rule Mining from XML Data Qin Ding and Gnanasekaran Sundarraj Computer Science Program The Pennsylvania State University at Harrisburg Middletown, PA 17057,

More information

Data Mining Techniques

Data Mining Techniques Data Mining Techniques CS 6220 - Section 3 - Fall 2016 Lecture 16: Association Rules Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang, Tan et al., Leskovec et al.) Apriori: Summary All items Count

More information

A Survey on Moving Towards Frequent Pattern Growth for Infrequent Weighted Itemset Mining

A Survey on Moving Towards Frequent Pattern Growth for Infrequent Weighted Itemset Mining A Survey on Moving Towards Frequent Pattern Growth for Infrequent Weighted Itemset Mining Miss. Rituja M. Zagade Computer Engineering Department,JSPM,NTC RSSOER,Savitribai Phule Pune University Pune,India

More information

WIP: mining Weighted Interesting Patterns with a strong weight and/or support affinity

WIP: mining Weighted Interesting Patterns with a strong weight and/or support affinity WIP: mining Weighted Interesting Patterns with a strong weight and/or support affinity Unil Yun and John J. Leggett Department of Computer Science Texas A&M University College Station, Texas 7783, USA

More information

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler BBS654 Data Mining Pinar Duygulu Slides are adapted from Nazli Ikizler 1 Sequence Data Sequence Database: Timeline 10 15 20 25 30 35 Object Timestamp Events A 10 2, 3, 5 A 20 6, 1 A 23 1 B 11 4, 5, 6 B

More information

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING CSE 5243 INTRO. TO DATA MINING Mining Frequent Patterns and Associations: Basic Concepts (Chapter 6) Huan Sun, CSE@The Ohio State University Slides adapted from Prof. Jiawei Han @UIUC, Prof. Srinivasan

More information

H-Mine: Hyper-Structure Mining of Frequent Patterns in Large Databases. Paper s goals. H-mine characteristics. Why a new algorithm?

H-Mine: Hyper-Structure Mining of Frequent Patterns in Large Databases. Paper s goals. H-mine characteristics. Why a new algorithm? H-Mine: Hyper-Structure Mining of Frequent Patterns in Large Databases Paper s goals Introduce a new data structure: H-struct J. Pei, J. Han, H. Lu, S. Nishio, S. Tang, and D. Yang Int. Conf. on Data Mining

More information

Iliya Mitov 1, Krassimira Ivanova 1, Benoit Depaire 2, Koen Vanhoof 2

Iliya Mitov 1, Krassimira Ivanova 1, Benoit Depaire 2, Koen Vanhoof 2 Iliya Mitov 1, Krassimira Ivanova 1, Benoit Depaire 2, Koen Vanhoof 2 1: Institute of Mathematics and Informatics BAS, Sofia, Bulgaria 2: Hasselt University, Belgium 1 st Int. Conf. IMMM, 23-29.10.2011,

More information

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING CSE 5243 INTRO. TO DATA MINING Mining Frequent Patterns and Associations: Basic Concepts (Chapter 6) Huan Sun, CSE@The Ohio State University Slides adapted from Prof. Jiawei Han @UIUC, Prof. Srinivasan

More information

D Data Mining: Concepts and and Tech Techniques

D Data Mining: Concepts and and Tech Techniques Data Mining: Concepts and Techniques (3 rd ed.) Chapter 5 Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign & Simon Fraser University 2009 Han, Kamber & Pei. All rights

More information

International Journal of Computer Sciences and Engineering. Research Paper Volume-5, Issue-8 E-ISSN:

International Journal of Computer Sciences and Engineering. Research Paper Volume-5, Issue-8 E-ISSN: International Journal of Computer Sciences and Engineering Open Access Research Paper Volume-5, Issue-8 E-ISSN: 2347-2693 Comparative Study of Top Algorithms for Association Rule Mining B. Nigam *, A.

More information

Knowledge Discovery from Web Usage Data: Research and Development of Web Access Pattern Tree Based Sequential Pattern Mining Techniques: A Survey

Knowledge Discovery from Web Usage Data: Research and Development of Web Access Pattern Tree Based Sequential Pattern Mining Techniques: A Survey Knowledge Discovery from Web Usage Data: Research and Development of Web Access Pattern Tree Based Sequential Pattern Mining Techniques: A Survey G. Shivaprasad, N. V. Subbareddy and U. Dinesh Acharya

More information

Data Structure for Association Rule Mining: T-Trees and P-Trees

Data Structure for Association Rule Mining: T-Trees and P-Trees IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 16, NO. 6, JUNE 2004 1 Data Structure for Association Rule Mining: T-Trees and P-Trees Frans Coenen, Paul Leng, and Shakil Ahmed Abstract Two new

More information

CS6220: DATA MINING TECHNIQUES

CS6220: DATA MINING TECHNIQUES CS6220: DATA MINING TECHNIQUES Set Data: Frequent Pattern Mining Instructor: Yizhou Sun yzsun@ccs.neu.edu November 1, 2015 Midterm Reminder Next Monday (Nov. 9), 2-hour (6-8pm) in class Closed-book exam,

More information

A Quantified Approach for large Dataset Compression in Association Mining

A Quantified Approach for large Dataset Compression in Association Mining IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661, p- ISSN: 2278-8727Volume 15, Issue 3 (Nov. - Dec. 2013), PP 79-84 A Quantified Approach for large Dataset Compression in Association Mining

More information

Frequent pattern mining: current status and future directions

Frequent pattern mining: current status and future directions Data Min Knowl Disc (2007) 15:55 86 DOI 10.1007/s10618-006-0059-1 Frequent pattern mining: current status and future directions Jiawei Han Hong Cheng Dong Xin Xifeng Yan Received: 22 June 2006 / Accepted:

More information

Parallel Algorithms for Discovery of Association Rules

Parallel Algorithms for Discovery of Association Rules Data Mining and Knowledge Discovery, 1, 343 373 (1997) c 1997 Kluwer Academic Publishers. Manufactured in The Netherlands. Parallel Algorithms for Discovery of Association Rules MOHAMMED J. ZAKI SRINIVASAN

More information

FastLMFI: An Efficient Approach for Local Maximal Patterns Propagation and Maximal Patterns Superset Checking

FastLMFI: An Efficient Approach for Local Maximal Patterns Propagation and Maximal Patterns Superset Checking FastLMFI: An Efficient Approach for Local Maximal Patterns Propagation and Maximal Patterns Superset Checking Shariq Bashir National University of Computer and Emerging Sciences, FAST House, Rohtas Road,

More information

Finding frequent closed itemsets with an extended version of the Eclat algorithm

Finding frequent closed itemsets with an extended version of the Eclat algorithm Annales Mathematicae et Informaticae 48 (2018) pp. 75 82 http://ami.uni-eszterhazy.hu Finding frequent closed itemsets with an extended version of the Eclat algorithm Laszlo Szathmary University of Debrecen,

More information

Mining Closed Itemsets: A Review

Mining Closed Itemsets: A Review Mining Closed Itemsets: A Review 1, 2 *1 Department of Computer Science, Faculty of Informatics Mahasarakham University,Mahasaraham, 44150, Thailand panida.s@msu.ac.th 2 National Centre of Excellence in

More information

TEMPORAL SEQUENTIAL PATTERN IN DATA MINING TASKS

TEMPORAL SEQUENTIAL PATTERN IN DATA MINING TASKS TEMPORAL SEQUENTIAL PATTERN Abstract IN DATA MINING TASKS DR. NAVEETA MEHTA Asst. Prof., MMICT&BM, M.M. University, Mullana navita80@gmail.com MS. SHILPA DANG Lecturer, MMICT&BM, M.M. University, Mullana

More information

A Graph-Based Approach for Mining Closed Large Itemsets

A Graph-Based Approach for Mining Closed Large Itemsets A Graph-Based Approach for Mining Closed Large Itemsets Lee-Wen Huang Dept. of Computer Science and Engineering National Sun Yat-Sen University huanglw@gmail.com Ye-In Chang Dept. of Computer Science and

More information

Advance Association Analysis

Advance Association Analysis Advance Association Analysis 1 Minimum Support Threshold 3 Effect of Support Distribution Many real data sets have skewed support distribution Support distribution of a retail data set 4 Effect of Support

More information

Association Rule Mining

Association Rule Mining Association Rule Mining Generating assoc. rules from frequent itemsets Assume that we have discovered the frequent itemsets and their support How do we generate association rules? Frequent itemsets: {1}

More information

A Modern Search Technique for Frequent Itemset using FP Tree

A Modern Search Technique for Frequent Itemset using FP Tree A Modern Search Technique for Frequent Itemset using FP Tree Megha Garg Research Scholar, Department of Computer Science & Engineering J.C.D.I.T.M, Sirsa, Haryana, India Krishan Kumar Department of Computer

More information

On Frequent Itemset Mining With Closure

On Frequent Itemset Mining With Closure On Frequent Itemset Mining With Closure Mohammad El-Hajj Osmar R. Zaïane Department of Computing Science University of Alberta, Edmonton AB, Canada T6G 2E8 Tel: 1-780-492 2860 Fax: 1-780-492 1071 {mohammad,

More information

CS145: INTRODUCTION TO DATA MINING

CS145: INTRODUCTION TO DATA MINING CS145: INTRODUCTION TO DATA MINING Sequence Data: Sequential Pattern Mining Instructor: Yizhou Sun yzsun@cs.ucla.edu November 27, 2017 Methods to Learn Vector Data Set Data Sequence Data Text Data Classification

More information

SeqIndex: Indexing Sequences by Sequential Pattern Analysis

SeqIndex: Indexing Sequences by Sequential Pattern Analysis SeqIndex: Indexing Sequences by Sequential Pattern Analysis Hong Cheng Xifeng Yan Jiawei Han Department of Computer Science University of Illinois at Urbana-Champaign {hcheng3, xyan, hanj}@cs.uiuc.edu

More information

2. Department of Electronic Engineering and Computer Science, Case Western Reserve University

2. Department of Electronic Engineering and Computer Science, Case Western Reserve University Chapter MINING HIGH-DIMENSIONAL DATA Wei Wang 1 and Jiong Yang 2 1. Department of Computer Science, University of North Carolina at Chapel Hill 2. Department of Electronic Engineering and Computer Science,

More information

Frequent Pattern Mining. Based on: Introduction to Data Mining by Tan, Steinbach, Kumar

Frequent Pattern Mining. Based on: Introduction to Data Mining by Tan, Steinbach, Kumar Frequent Pattern Mining Based on: Introduction to Data Mining by Tan, Steinbach, Kumar Item sets A New Type of Data Some notation: All possible items: Database: T is a bag of transactions Transaction transaction

More information

Incremental Mining of Frequent Patterns Without Candidate Generation or Support Constraint

Incremental Mining of Frequent Patterns Without Candidate Generation or Support Constraint Incremental Mining of Frequent Patterns Without Candidate Generation or Support Constraint William Cheung and Osmar R. Zaïane University of Alberta, Edmonton, Canada {wcheung, zaiane}@cs.ualberta.ca Abstract

More information

Improved Algorithm for Frequent Item sets Mining Based on Apriori and FP-Tree

Improved Algorithm for Frequent Item sets Mining Based on Apriori and FP-Tree Global Journal of Computer Science and Technology Software & Data Engineering Volume 13 Issue 2 Version 1.0 Year 2013 Type: Double Blind Peer Reviewed International Research Journal Publisher: Global Journals

More information

AN ENHANCED SEMI-APRIORI ALGORITHM FOR MINING ASSOCIATION RULES

AN ENHANCED SEMI-APRIORI ALGORITHM FOR MINING ASSOCIATION RULES AN ENHANCED SEMI-APRIORI ALGORITHM FOR MINING ASSOCIATION RULES 1 SALLAM OSMAN FAGEERI 2 ROHIZA AHMAD, 3 BAHARUM B. BAHARUDIN 1, 2, 3 Department of Computer and Information Sciences Universiti Teknologi

More information