Frequent Pattern Mining


Overview

Frequent Pattern Mining comprises:
- Frequent Item Set Mining and Association Rule Induction
- Frequent Sequence Mining
- Frequent Tree Mining
- Frequent Graph Mining

Application areas of Frequent Pattern Mining include:
- Market Basket Analysis
- Click Stream Analysis
- Web Link Analysis
- Genome Analysis
- Drug Design (Molecular Fragment Mining)

Christian Borgelt
School of Computer Science, University of Konstanz
Universitätsstraße, Konstanz, Germany

Frequent Item Set Mining: Motivation

Frequent Item Set Mining is a method for market basket analysis. It aims at finding regularities in the shopping behavior of customers of supermarkets, mail-order companies, on-line shops etc. More specifically: find sets of products that are frequently bought together.

Possible applications of found frequent item sets:
- Improve the arrangement of products in shelves, on a catalog's pages etc.
- Support cross-selling (suggestion of other products), product bundling.
- Fraud detection, technical dependence analysis etc.

Often the found patterns are expressed as association rules, for example: If a customer buys bread and wine, then she/he will probably also buy cheese.

Frequent Item Set Mining: Basic Notions

Let B = {i_1, ..., i_m} be a set of items. This set is called the item base. Items may be products, special equipment items, service options etc. Any subset I ⊆ B is called an item set. An item set may be any set of products that can be bought (together).

Let T = (t_1, ..., t_n) with ∀k, 1 ≤ k ≤ n: t_k ⊆ B be a tuple of transactions over B. This tuple is called the transaction database. A transaction database can list, for example, the sets of products bought by the customers of a supermarket in a given period of time. Every transaction is an item set, but some item sets may not appear in T. Transactions need not be pairwise different: it may be t_j = t_k for j ≠ k. T may also be defined as a bag or multiset of transactions. The item base B may not be given explicitly, but only implicitly as B = ⋃_{k=1}^n t_k.

Let I ⊆ B be an item set and T a transaction database over B. A transaction t ∈ T covers the item set I (or: the item set I is contained in a transaction t ∈ T) iff I ⊆ t. The set K_T(I) = {k ∈ {1, ..., n} | I ⊆ t_k} is called the cover of I w.r.t. T. The cover of an item set is the index set of the transactions that cover it. It may also be defined as a tuple of all transactions that cover it (which, however, is complicated to write in a formally correct way).

The value s_T(I) = |K_T(I)| is called the (absolute) support of I w.r.t. T. The value σ_T(I) = (1/n) |K_T(I)| is called the relative support of I w.r.t. T. The support of I is the number or fraction of transactions that contain it. Sometimes σ_T(I) is also called the (relative) frequency of I w.r.t. T.

Alternative Definition of Transactions: A transaction over an item base B is a pair t = (tid, J), where tid is a unique transaction identifier and J ⊆ B is an item set. A transaction database T = {t_1, ..., t_n} is a set of transactions.
A simple set can be used, because transactions differ at least in their identifier. A transaction t = (tid, J) covers an item set I iff I ⊆ J. The set K_T(I) = {tid | ∃J ⊆ B: ∃t ∈ T: t = (tid, J) ∧ I ⊆ J} is the cover of I w.r.t. T. Remark: If the transaction database is defined as a tuple, there is an implicit transaction identifier, namely the position/index of the transaction in the tuple.

Frequent Item Set Mining: Formal Definition

Given:
- a set B = {i_1, ..., i_m} of items, the item base,
- a tuple T = (t_1, ..., t_n) of transactions over B, the transaction database,
- a number s_min ∈ IN, 0 < s_min ≤ n, or (equivalently) a number σ_min ∈ IR, 0 < σ_min ≤ 1, the minimum support.

Desired: the set of frequent item sets, that is,
the set F_T(s_min) = {I ⊆ B | s_T(I) ≥ s_min} or (equivalently) the set Φ_T(σ_min) = {I ⊆ B | σ_T(I) ≥ σ_min}.

Note that with the relations s_min = ⌈n σ_min⌉ and σ_min = (1/n) s_min the two versions can easily be transformed into each other.
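The notions of cover and support translate directly into code. The following is a minimal illustrative sketch (the function names `cover` and `support` are my own, not from the slides); it uses the ten-transaction example database that appears later in these slides:

```python
def cover(I, T):
    """Index set K_T(I): 1-based positions of the transactions containing I."""
    return {k for k, t in enumerate(T, start=1) if I <= t}

def support(I, T):
    """Absolute support s_T(I) = |K_T(I)|."""
    return len(cover(I, T))

# the example transaction database from these slides
T = [{"a","d","e"}, {"b","c","d"}, {"a","c","e"}, {"a","c","d","e"},
     {"a","e"}, {"a","c","d"}, {"b","c"}, {"a","c","d","e"},
     {"b","c","e"}, {"a","d","e"}]

print(cover({"a", "d"}, T))              # {1, 4, 6, 8, 10}
print(support({"a", "d"}, T))            # 5
print(support({"a", "d"}, T) / len(T))   # relative support: 0.5
```

The relative support is obtained by dividing by n = |T|, exactly as in the definition of σ_T(I).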

Frequent Item Sets: Example

transaction database:
1: {a, d, e}
2: {b, c, d}
3: {a, c, e}
4: {a, c, d, e}
5: {a, e}
6: {a, c, d}
7: {b, c}
8: {a, c, d, e}
9: {b, c, e}
10: {a, d, e}

frequent item sets (with their supports):
0 items: ∅: 10
1 item:  {a}: 7, {b}: 3, {c}: 7, {d}: 6, {e}: 7
2 items: {a, c}: 4, {a, d}: 5, {a, e}: 6, {b, c}: 3, {c, d}: 4, {c, e}: 4, {d, e}: 4
3 items: {a, c, d}: 3, {a, c, e}: 3, {a, d, e}: 4

In this example, the minimum support is s_min = 3 or σ_min = 0.3 = 30%. There are 2^5 = 32 possible item sets over B = {a, b, c, d, e}. There are 16 frequent item sets (but only 10 transactions).

Properties of the Support of Item Sets

A brute force approach that traverses all possible item sets, determines their support, and discards infrequent item sets is usually infeasible: the number of possible item sets grows exponentially with the number of items, and a typical supermarket offers (tens of) thousands of different products.

Idea: Consider the properties of an item set's cover and support, in particular:
∀I: ∀J ⊇ I: K_T(J) ⊆ K_T(I).
This property holds, since ∀t: ∀I: ∀J ⊇ I: J ⊆ t ⇒ I ⊆ t. Each additional item is another condition that a transaction has to satisfy. Transactions that do not satisfy this condition are removed from the cover.

It follows: ∀I: ∀J ⊇ I: s_T(J) ≤ s_T(I). That is: If an item set is extended, its support cannot increase. One also says that support is anti-monotone or downward closed.

From ∀I: ∀J ⊇ I: s_T(J) ≤ s_T(I) it follows immediately:
∀s_min: ∀I: ∀J ⊇ I: s_T(I) < s_min ⇒ s_T(J) < s_min.
That is: No superset of an infrequent item set can be frequent. This property is often referred to as the Apriori Property. Rationale: Sometimes we can know a priori, that is, before checking its support by accessing the given transaction database, that an item set cannot be frequent.

Of course, the contraposition of this implication also holds:
∀s_min: ∀J: ∀I ⊆ J: s_T(J) ≥ s_min ⇒ s_T(I) ≥ s_min.
That is: All subsets of a frequent item set are frequent.
This suggests a compressed representation of the set of frequent item sets (which will be explored later: maximal and closed frequent item sets).
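For the five-item example the brute force approach is still feasible (2^5 = 32 item sets) and reproduces the 16 frequent item sets listed above. An illustrative sketch (function name hypothetical):

```python
from itertools import combinations

def frequent_item_sets(B, T, s_min):
    """F_T(s_min): all I subset of B with s_T(I) >= s_min, by brute force over 2^B."""
    items = sorted(B)
    result = []
    for k in range(len(items) + 1):          # enumerate item sets by size
        for I in combinations(items, k):
            s = sum(1 for t in T if set(I) <= t)   # absolute support
            if s >= s_min:
                result.append((frozenset(I), s))
    return result

T = [{"a","d","e"}, {"b","c","d"}, {"a","c","e"}, {"a","c","d","e"},
     {"a","e"}, {"a","c","d"}, {"b","c"}, {"a","c","d","e"},
     {"b","c","e"}, {"a","d","e"}]

F = frequent_item_sets({"a","b","c","d","e"}, T, s_min=3)
print(len(F))  # 16 frequent item sets (including the empty set)
```

This exponential enumeration is exactly what the search strategies in the rest of the lecture avoid.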

Reminder: Partially Ordered Sets

A partial order is a binary relation ≤ over a set S which satisfies ∀a, b, c ∈ S:
- a ≤ a (reflexivity)
- a ≤ b ∧ b ≤ a ⇒ a = b (anti-symmetry)
- a ≤ b ∧ b ≤ c ⇒ a ≤ c (transitivity)

A set with a partial order is called a partially ordered set (or poset for short). Let a and b be two distinct elements of a partially ordered set (S, ≤):
- If a ≤ b or b ≤ a, then a and b are called comparable.
- If neither a ≤ b nor b ≤ a, then a and b are called incomparable.

If all pairs of elements of the underlying set S are comparable, the order ≤ is called a total order or a linear order. In a total order the reflexivity axiom is replaced by the stronger axiom: a ≤ b ∨ b ≤ a (totality).

Properties of the Support of Item Sets

Monotonicity in Calculus and Mathematical Analysis: A function f: IR → IR is called monotonically non-decreasing if ∀x, y: x ≤ y ⇒ f(x) ≤ f(y). A function f: IR → IR is called monotonically non-increasing if ∀x, y: x ≤ y ⇒ f(x) ≥ f(y).

Monotonicity in Order Theory: Order theory is concerned with arbitrary partially ordered sets. The terms increasing and decreasing are avoided, because they lose their pictorial motivation as soon as sets are considered that are not totally ordered. A function f: S → R, where S and R are two partially ordered sets, is called monotone or order-preserving if ∀x, y ∈ S: x ≤_S y ⇒ f(x) ≤_R f(y). A function f: S → R is called anti-monotone or order-reversing if ∀x, y ∈ S: x ≤_S y ⇒ f(x) ≥_R f(y). In this sense the support of item sets is anti-monotone.
Properties of Frequent Item Sets

A subset R of a partially ordered set (S, ≤) is called downward closed if for any element of the set all smaller elements are also in it: ∀x ∈ R: ∀y ∈ S: y ≤ x ⇒ y ∈ R. In this case the subset R is also called a lower set. The notions of upward closed and upper set are defined analogously.

For every s_min the set of frequent item sets F_T(s_min) is downward closed w.r.t. the partially ordered set (2^B, ⊆), where 2^B denotes the powerset of B:
∀s_min: ∀X ∈ F_T(s_min): ∀Y ⊆ B: Y ⊆ X ⇒ Y ∈ F_T(s_min).

Since the set of frequent item sets is induced by the support function, the notions of up- or downward closed are transferred to the support function: any set of item sets induced by a support threshold s_min is up- or downward closed:
- F_T(s_min) = {S ⊆ B | s_T(S) ≥ s_min} (frequent item sets) is downward closed,
- G_T(s_min) = {S ⊆ B | s_T(S) < s_min} (infrequent item sets) is upward closed.

Reminder: Partially Ordered Sets and Hasse Diagrams

A finite partially ordered set (S, ≤) can be depicted as a (directed) acyclic graph G, which is called a Hasse diagram. G has the elements of S as vertices. The edges are selected according to: if x and y are elements of S with x < y (that is, x ≤ y and not x = y) and there is no element between x and y (that is, no z ∈ S with x < z < y), then there is an edge from x to y. Since the graph is acyclic (there is no directed cycle), the graph can always be depicted such that all edges lead downward. The Hasse diagram of a total order (or linear order) is a chain.

[Figure: Hasse diagram of (2^{a,b,c,d,e}, ⊆). Edge directions are omitted; all edges lead downward.]
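Both closure properties can be verified exhaustively on the small example database. The following is an illustrative brute-force check (only feasible for a tiny item base), not part of the slides:

```python
from itertools import combinations

T = [{"a","d","e"}, {"b","c","d"}, {"a","c","e"}, {"a","c","d","e"},
     {"a","e"}, {"a","c","d"}, {"b","c"}, {"a","c","d","e"},
     {"b","c","e"}, {"a","d","e"}]
items = ["a", "b", "c", "d", "e"]
s_min = 3

def s(I):
    """Absolute support s_T(I)."""
    return sum(1 for t in T if set(I) <= t)

# all 2^5 item sets, split by the support threshold
all_sets = [frozenset(c) for k in range(6) for c in combinations(items, k)]
F = {I for I in all_sets if s(I) >= s_min}   # frequent item sets
G = {I for I in all_sets if s(I) < s_min}    # infrequent item sets

# F is downward closed, G is upward closed
assert all(J in F for I in F for J in all_sets if J <= I)
assert all(J in G for I in G for J in all_sets if J >= I)
print(len(F), len(G))  # 16 16
```

Here exactly half of the 32 item sets are frequent; the other half forms the upward-closed complement.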

Searching for Frequent Item Sets

The standard search procedure is an enumeration approach that enumerates candidate item sets and checks their support. It improves over the brute force approach by exploiting the apriori property to skip item sets that cannot be frequent because they have an infrequent subset.

Idea: Use the properties of the support to organize the search for all frequent item sets, especially the apriori property:
∀I: ∀J ⊃ I: s_T(I) < s_min ⇒ s_T(J) < s_min.
Since these properties relate the support of an item set to the support of its subsets and supersets, it is reasonable to organize the search based on the structure of the partially ordered set (2^B, ⊆).

The search space is the partially ordered set (2^B, ⊆). Its structure helps to identify those item sets that can be skipped due to the apriori property: top-down search (from the empty set/one-element sets to larger sets). Since a partially ordered set can conveniently be depicted by a Hasse diagram, we will use such diagrams to illustrate the search. Note that the search may have to visit an exponential number of item sets. In practice, however, the search times are often bearable, at least if the minimum support is not chosen too low.

[Figure: Hasse diagram of (2^B, ⊆) for the five items B = {a, b, c, d, e}.]

Hasse Diagrams and Frequent Item Sets

For the example transaction database (as before) and s_min = 3, the Hasse diagram can be annotated with the search result: blue boxes are frequent item sets, white boxes are infrequent item sets.

The Apriori Algorithm [Agrawal and Srikant 1994]

Searching for Frequent Item Sets

Possible scheme for the search:
- Determine the support of the one-element item sets (a.k.a. singletons) and discard the infrequent items / item sets.
- Form candidate item sets with two items (both items must be frequent), determine their support, and discard the infrequent item sets.
- Form candidate item sets with three items (all contained pairs must be frequent), determine their support, and discard the infrequent item sets.
- Continue by forming candidate item sets with four, five etc. items until no candidate item set is frequent.

This is the general scheme of the Apriori Algorithm. It is based on two main steps: candidate generation and pruning. All enumeration algorithms are based on these two steps in some form.

The Apriori Algorithm

function apriori (B, T, s_min)
begin                                    ( Apriori algorithm )
  k := 1;                                ( initialize the item set size )
  E_k := ⋃_{i ∈ B} {{i}};                ( start with single element sets )
  F_k := prune(E_k, T, s_min);           ( and determine the frequent ones )
  while F_k ≠ ∅ do begin                 ( while there are frequent item sets )
    E_{k+1} := candidates(F_k);          ( create candidates with one item more )
    F_{k+1} := prune(E_{k+1}, T, s_min); ( and determine the frequent item sets )
    k := k + 1;                          ( increment the item counter )
  end;
  return ⋃_{j=1}^{k} F_j;                ( return the frequent item sets )
end ( apriori )

E_j: candidate item sets of size j, F_j: frequent item sets of size j.
The Apriori Algorithm

function candidates (F_k)
begin                                      ( generate candidates with k+1 items )
  E := ∅;                                  ( initialize the set of candidates )
  forall f_1, f_2 ∈ F_k                    ( traverse all pairs of frequent item sets )
  with f_1 = {i_1, ..., i_{k-1}, i_k}      ( that differ only in one item and )
  and  f_2 = {i_1, ..., i_{k-1}, i'_k}     ( are in a lexicographic order )
  and  i_k < i'_k do begin                 ( (this order is arbitrary, but fixed) )
    f := f_1 ∪ f_2 = {i_1, ..., i_{k-1}, i_k, i'_k}; ( union has k+1 items )
    if ∀i ∈ f: f − {i} ∈ F_k               ( if all subsets with k items are frequent, )
    then E := E ∪ {f};                     ( add the new item set to the candidates )
  end;                                     ( (otherwise it cannot be frequent) )
  return E;                                ( return the generated candidates )
end ( candidates )

function prune (E, T, s_min)
begin                                      ( prune infrequent candidates )
  forall e ∈ E do                          ( initialize the support counters )
    s_T(e) := 0;                           ( of all candidates to be checked )
  forall t ∈ T do                          ( traverse the transactions )
    forall e ∈ E do                        ( traverse the candidates )
      if e ⊆ t                             ( if the transaction contains the candidate, )
      then s_T(e) := s_T(e) + 1;           ( increment the support counter )
  F := ∅;                                  ( initialize the set of frequent candidates )
  forall e ∈ E do                          ( traverse the candidates )
    if s_T(e) ≥ s_min                      ( if a candidate is frequent, )
    then F := F ∪ {e};                     ( add it to the set of frequent item sets )
  return F;                                ( return the pruned set of candidates )
end ( prune )
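The pseudocode above can be transcribed almost literally into Python. The following is a didactic sketch (frozensets as item sets, the example database from the slides), not an optimized implementation:

```python
from itertools import combinations

def candidates(F_k):
    """Join frequent k-sets that share a (k-1)-prefix; keep only candidates
    whose k-subsets are all frequent (a priori pruning)."""
    E = set()
    F_sorted = sorted(tuple(sorted(f)) for f in F_k)
    for f1, f2 in combinations(F_sorted, 2):
        if f1[:-1] == f2[:-1]:                  # differ only in the last item
            f = frozenset(f1) | frozenset(f2)   # union has k+1 items
            if all(f - {i} in F_k for i in f):  # all k-subsets frequent?
                E.add(f)
    return E

def prune(E, T, s_min):
    """Count supports in one pass over the transactions; keep frequent sets."""
    s = {e: 0 for e in E}
    for t in T:
        for e in E:
            if e <= t:
                s[e] += 1
    return {e: n for e, n in s.items() if n >= s_min}

def apriori(B, T, s_min):
    F = prune({frozenset({i}) for i in B}, T, s_min)   # frequent singletons
    result = dict(F)
    while F:                                           # level by level
        F = prune(candidates(set(F)), T, s_min)
        result.update(F)
    return result

T = [{"a","d","e"}, {"b","c","d"}, {"a","c","e"}, {"a","c","d","e"},
     {"a","e"}, {"a","c","d"}, {"b","c"}, {"a","c","d","e"},
     {"b","c","e"}, {"a","d","e"}]

freq = apriori({"a","b","c","d","e"}, T, 3)
print(len(freq))                        # 15 non-empty frequent item sets
print(freq[frozenset({"a","d","e"})])   # 4
```

The lexicographic join in `candidates` implements the "differ only in the last item" condition of the pseudocode; since only frequent sets are in F_k, the `all(...)` test is exactly the a priori check.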

Searching for Frequent Item Sets

The Apriori algorithm searches the partial order top-down level by level.

Improving the Candidate Generation

Collecting the frequent item sets of size k in a set F_k has drawbacks: a frequent item set of size k+1 can be formed in k(k+1)/2 possible ways (for infrequent item sets the number may be smaller). As a consequence, the candidate generation step may carry out a lot of redundant work, since it suffices to generate each candidate item set once.

Question: Can we reduce or even eliminate this redundant work? More generally: How can we make sure that any candidate item set is generated at most once? Idea: Assign to each item set a unique parent item set, from which this item set is to be generated.

A core problem is that an item set of size k (that is, with k items) can be generated in k! different ways (on k! paths in the Hasse diagram), because in principle the items may be added in any order. If we consider an item by item process of building an item set (which can be imagined as a levelwise traversal of the partial order), there are k possible ways of forming an item set of size k from item sets of size k−1 by adding the remaining item.

It is obvious that it suffices to consider each item set at most once in order to find the frequent ones (infrequent item sets need not be generated at all). Question: Can we reduce or even eliminate this variety? More generally: How can we make sure that any candidate item set is generated at most once? Idea: Assign to each item set a unique parent item set, from which this item set is to be generated.

We have to search the partially ordered set (2^B, ⊆) or its Hasse diagram. Assigning unique parents turns the Hasse diagram into a tree. Traversing the resulting tree explores each item set exactly once.
[Figure: Hasse diagram of (2^{a,b,c,d,e}, ⊆) and a possible tree for the five items, obtained by assigning unique parents.]

Searching with Unique Parents

Principle of a Search Algorithm based on Unique Parents:
- Base Loop: Traverse all one-element item sets (their unique parent is the empty set). Recursively process all one-element item sets that are frequent.
- Recursive Processing: For a given frequent item set I: Generate all extensions J of I by one item (that is, J ⊃ I, |J| = |I| + 1) for which the item set I is the chosen unique parent. For all J: if J is frequent, process J recursively, otherwise discard J.

Questions: How can we formally assign unique parents? How can we make sure that we generate only those extensions for which the item set that is extended is the chosen unique parent?

Assigning Unique Parents

Formally, the set of all possible/candidate parents of an item set I is
Π(I) = {J ⊂ I | ¬∃K: J ⊂ K ⊂ I}.
In other words, the possible parents of I are its maximal proper subsets. In order to single out one element of Π(I), the canonical parent π_c(I), we can simply define an (arbitrary, but fixed) global order of the items: i_1 < i_2 < i_3 < ... < i_n. Then the canonical parent of an item set I can be defined as the item set
π_c(I) = I − {max_{i ∈ I} i}   (or π_c(I) = I − {min_{i ∈ I} i}),
where the maximum (or minimum) is taken w.r.t. the chosen order of the items.

Even though this approach is straightforward and simple, we reformulate it now in terms of a canonical form of an item set, in order to lay the foundations for the study of frequent (sub)graph mining.

Canonical Forms

The meaning of the word canonical (source: Oxford Advanced Learner's Dictionary, Encyclopedic Edition):
canon n: general rule, standard or principle, by which sth is judged: This film offends against all the canons of good taste. ...
canonical adj: ... standard; accepted. ...

Canonical Forms of Item Sets

A canonical form of something is a standard representation of it. The canonical form must be unique (otherwise it could not be standard).
Nevertheless there are often several possible choices for a canonical form. However, one must fix one of them for a given application. In the following we will define a standard representation of an item set, and later standard representations of a graph, a sequence, a tree etc. This canonical form will be used to assign unique parents to all item sets.

A Canonical Form for Item Sets

An item set is represented by a code word; each letter represents an item. The code word is a word over the alphabet B, the item base. There are k! possible code words for an item set of size k, because the items may be listed in any order. By introducing an (arbitrary, but fixed) order of the items, and by comparing code words lexicographically w.r.t. this order, we can define an order on these code words. Example: abc < bac < bca < cab etc. for the item set {a, b, c} and a < b < c. The lexicographically smallest (or, alternatively, greatest) code word for an item set is defined to be its canonical code word. Obviously the canonical code word lists the items in the chosen, fixed order.

Remark: These explanations may appear obfuscated, since the core idea and the result are very simple. However, the view developed here will help us a lot when we turn to frequent (sub)graph mining.

Canonical Forms and Canonical Parents

Let I be an item set and w_c(I) its canonical code word. The canonical parent π_c(I) of the item set I is the item set described by the longest proper prefix of the code word w_c(I). Since the canonical code word of an item set lists its items in the chosen order, this definition is equivalent to π_c(I) = I − {max_{i ∈ I} i}.

General Recursive Processing with Canonical Forms: For a given frequent item set I: Generate all possible extensions J of I by one item (J ⊃ I, |J| = |I| + 1). Form the canonical code word w_c(J) of each extended item set J. For each J: if the last letter of w_c(J) is the item added to I to form J and J is frequent, process J recursively, otherwise discard J.
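With a fixed item order, canonical code words and canonical parents are easy to compute. A minimal sketch (function names hypothetical; the alphabetical order on the item base is assumed, as in the slides):

```python
def canonical_code_word(I, order="abcde"):
    """Lexicographically smallest code word: the items listed in the fixed order."""
    return "".join(sorted(I, key=order.index))

def canonical_parent(I, order="abcde"):
    """pi_c(I): the item set of the longest proper prefix of the code word,
    i.e. I minus its maximal item w.r.t. the chosen order."""
    w = canonical_code_word(I, order)
    return set(w[:-1])

print(canonical_code_word({"e", "b", "a", "d"}))      # 'abde'
print(sorted(canonical_parent({"e", "b", "a", "d"}))) # ['a', 'b', 'd']
```

This reproduces the equivalence stated above: dropping the last letter of the canonical code word is the same as removing the maximal item.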
The Prefix Property

Note that the considered item set coding scheme has the prefix property: the longest proper prefix of the canonical code word of any item set is a canonical code word itself. With the longest proper prefix of the canonical code word of an item set I we not only know the canonical parent of I, but also its canonical code word.

Example: Consider the item set I = {a, b, d, e}: The canonical code word of I is abde. The longest proper prefix of abde is abd. The code word abd is the canonical code word of π_c(I) = {a, b, d}.

Note that the prefix property immediately implies: every prefix of a canonical code word is a canonical code word itself. (In the following both statements are called the prefix property, since they are obviously equivalent.)

Searching with the Prefix Property

The prefix property allows us to simplify the search scheme: The general recursive processing scheme with canonical forms requires us to construct the canonical code word of each created item set in order to decide whether it has to be processed recursively or not. We know the canonical code word of every item set that is processed recursively. With this code word we know, due to the prefix property, the canonical code words of all child item sets that have to be explored in the recursion, with the exception of the last letter (that is, the added item). We only have to check whether the code word that results from appending the added item to the given canonical code word is canonical or not. Advantage: Checking whether a given code word is canonical can be simpler/faster than constructing a canonical code word from scratch.

Searching with the Prefix Property

Principle of a Search Algorithm based on the Prefix Property:
- Base Loop: Traverse all possible items, that is, the canonical code words of all one-element item sets. Recursively process each code word that describes a frequent item set.
- Recursive Processing: For a given (canonical) code word of a frequent item set: Generate all possible extensions by one item. This is done by simply appending the item to the code word. Check whether the extended code word is the canonical code word of the item set that is described by the extended code word (and, of course, whether the described item set is frequent). If it is, process the extended code word recursively, otherwise discard it.

Searching with the Prefix Property: Examples

Suppose the item base is B = {a, b, c, d, e} and let us assume that we simply use the alphabetical order to define a canonical form (as before).

Consider the recursive processing of the code word acd (this code word is canonical, because its letters are in alphabetical order): Since acd contains neither b nor e, its extensions are acdb and acde. The code word acdb is not canonical and thus it is discarded (because d > b; note that it suffices to compare the last two letters). The code word acde is canonical and therefore it is processed recursively.

Consider the recursive processing of the code word bc: The extended code words are bca, bcd and bce. bca is not canonical and thus discarded. bcd and bce are canonical and therefore processed recursively.
Searching with the Prefix Property — Exhaustive Search

The prefix property is a necessary condition for ensuring that all canonical code words can be constructed in the search by appending extensions (items) to visited canonical code words. Suppose the prefix property would not hold. Then: there exist a canonical code word w and a (proper) prefix v of w, such that v is not a canonical code word. Forming w by repeatedly appending items must form v first (otherwise the prefix would differ). When v is constructed in the search, it is discarded, because it is not canonical. As a consequence, the canonical code word w can never be reached. The simplified search scheme can be exhaustive only if the prefix property holds.

Searching with Canonical Forms

Straightforward Improvement of the Extension Step: The considered canonical form lists the items in the chosen item order.
- If the added item succeeds all already present items in the chosen order, the result is in canonical form.
- If the added item precedes any of the already present items in the chosen order, the result is not in canonical form.
As a consequence, we have a very simple canonical extension rule (that is, a rule that generates all children and only canonical code words).

Applied to the Apriori algorithm, this means that we generate candidates of size k+1 by combining two frequent item sets f_1 = {i_1, ..., i_{k-1}, i_k} and f_2 = {i_1, ..., i_{k-1}, i'_k} only if i_k < i'_k and ∀j, 1 ≤ j < k: i_j < i_{j+1}. Note that it suffices to compare the last letters/items i_k and i'_k if all frequent item sets are represented by canonical code words.
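The canonical extension rule yields a simple depth-first miner that generates each candidate exactly once: a code word is only extended by items that succeed its last letter in the item order. A sketch (names hypothetical) that combines the rule with support-based pruning:

```python
def dfs_frequent(T, items, s_min):
    """Depth-first search with the canonical extension rule: extend a code
    word only by items succeeding its last letter in the item order."""
    result = {}

    def recurse(prefix, covered):
        # prefix:  canonical code word of the current item set (a string)
        # covered: the transactions containing that item set (its cover)
        start = items.index(prefix[-1]) + 1 if prefix else 0
        for i in items[start:]:                 # only succeeding items
            word = prefix + i                   # extended code word is canonical
            cov = [t for t in covered if i in t]
            if len(cov) >= s_min:               # support-based pruning
                result[word] = len(cov)
                recurse(word, cov)

    recurse("", T)
    return result

T = [{"a","d","e"}, {"b","c","d"}, {"a","c","e"}, {"a","c","d","e"},
     {"a","e"}, {"a","c","d"}, {"b","c"}, {"a","c","d","e"},
     {"b","c","e"}, {"a","d","e"}]

freq = dfs_frequent(T, "abcde", 3)
print(len(freq))    # 15
print(freq["ade"])  # 4
```

Every canonical code word is visited at most once, and passing the cover down the recursion means each extension is counted only within the transactions that already contain the prefix.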

Searching with Canonical Forms

Final Search Algorithm based on Canonical Forms:
- Base Loop: Traverse all possible items, that is, the canonical code words of all one-element item sets. Recursively process each code word that describes a frequent item set.
- Recursive Processing: For a given (canonical) code word of a frequent item set: Generate all possible extensions by a single item, where this item succeeds the last letter (item) of the given code word. This is done by simply appending the item to the code word. If the item set described by the resulting extended code word is frequent, process the code word recursively, otherwise discard it.

This search scheme generates each candidate item set at most once.

Canonical Parents and Prefix Trees

Item sets whose canonical code words share the same longest proper prefix are siblings, because they have (by definition) the same canonical parent. This allows us to represent the canonical parent tree as a prefix tree or trie.

[Figure: canonical parent tree/prefix tree and prefix tree with merged siblings for the five items a, b, c, d, e.]

A (full) prefix tree for the five items a, b, c, d, e:
- Based on a global order of the items (which can be arbitrary).
- The item sets counted in a node consist of all items labeling the edges to the node (common prefix) and one item following the last edge label in the item order.

Search Tree Pruning

In applications the search tree tends to get very large, so pruning is needed:
- Structural Pruning: Extensions based on canonical code words remove superfluous paths. This explains the unbalanced structure of the full prefix tree.
- Support Based Pruning: No superset of an infrequent item set can be frequent (apriori property). No counters for item sets having an infrequent subset are needed.
- Size Based Pruning: Prune the tree if a certain depth (a certain size of the item sets) is reached. Idea: sets with too many items can be difficult to interpret.

The Order of the Items

The structure of the (structurally pruned) prefix tree obviously depends on the chosen order of the items. In principle, the order is arbitrary (that is, any order can be used). However, the number and the size of the nodes that are visited in the search differ considerably depending on the order. As a consequence, the execution times of frequent item set mining algorithms can differ considerably depending on the item order. Which order of the items is best (leads to the fastest search) can depend on the frequent item set mining algorithm used. Advanced methods even adapt the order of the items during the search (that is, use different, but compatible orders in different branches). Heuristics for choosing an item order are usually based on (conditional) independence assumptions.

Heuristics for Choosing the Item Order

Basic Idea (independence assumption): It is plausible that frequent item sets consist of frequent items.
- Sort the items w.r.t. their support (frequency of occurrence).
- Sort descendingly: the prefix tree has fewer, but larger nodes.
- Sort ascendingly: the prefix tree has more, but smaller nodes.

Extension of this Idea: Sort items w.r.t. the sum of the sizes of the transactions that cover them. Idea: the sum of transaction sizes also captures implicitly the frequency of pairs, triplets etc. (though, of course, only to some degree). Empirical evidence: better performance than simple frequency sorting.

Searching the Prefix Tree

[Figure: two structurally pruned prefix trees for the five items a, b, c, d, e.]

Searching the Prefix Tree Levelwise (Apriori Algorithm Revisited)
- Apriori: Breadth-first/levelwise search (item sets of same size). Subset tests on transactions to find the support of item sets.
- Eclat: Depth-first search (item sets with same prefix). Intersection of transaction lists to find the support of item sets.
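The two ordering heuristics above can be sketched as follows (function names hypothetical; ties between items of equal key are left to the sort's stability):

```python
T = [{"a","d","e"}, {"b","c","d"}, {"a","c","e"}, {"a","c","d","e"},
     {"a","e"}, {"a","c","d"}, {"b","c"}, {"a","c","d","e"},
     {"b","c","e"}, {"a","d","e"}]

def order_by_support(T):
    """Sort items ascendingly by their support (frequency of occurrence)."""
    items = sorted(set().union(*T))
    return sorted(items, key=lambda i: sum(1 for t in T if i in t))

def order_by_transaction_size_sum(T):
    """Sort items ascendingly by the sum of the sizes of the transactions
    covering them; this implicitly captures pair/triplet frequencies."""
    items = sorted(set().union(*T))
    return sorted(items, key=lambda i: sum(len(t) for t in T if i in t))

print(order_by_support(T))  # ['b', 'd', 'a', 'c', 'e']
```

For the example database both heuristics put the rare item b first; reversing the returned list gives the descending variant.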
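Eclat's transaction-list intersection can be sketched as follows; this is a didactic Python version that uses transaction index sets (tid lists), not Borgelt's implementation:

```python
def eclat(T, s_min):
    """Depth-first search; the support of an extension is obtained by
    intersecting the transaction (index) lists of the item sets."""
    items = sorted(set().union(*T))
    tidlists = {i: {k for k, t in enumerate(T) if i in t} for i in items}
    result = {}

    def recurse(prefix, prefix_tids, suffix_items):
        for n, i in enumerate(suffix_items):
            tids = prefix_tids & tidlists[i]     # transaction-list intersection
            if len(tids) >= s_min:               # support-based pruning
                item_set = prefix + (i,)
                result[item_set] = len(tids)
                recurse(item_set, tids, suffix_items[n + 1:])

    recurse((), set(range(len(T))), items)
    return result

T = [{"a","d","e"}, {"b","c","d"}, {"a","c","e"}, {"a","c","d","e"},
     {"a","e"}, {"a","c","d"}, {"b","c"}, {"a","c","d","e"},
     {"b","c","e"}, {"a","d","e"}]

freq = eclat(T, 3)
print(len(freq))               # 15
print(freq[("a", "d", "e")])   # 4
```

Restricting each recursion to the succeeding suffix items is the same canonical extension rule as before, so each item set is generated exactly once; the vertical (tid-list) representation replaces the subset tests of Apriori.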

Apriori: Basic Ideas

- The item sets are checked in the order of increasing size (breadth-first/levelwise traversal of the prefix tree).
- The canonical form of item sets and the induced prefix tree are used to ensure that each candidate item set is generated at most once.
- The already generated levels are used to execute a priori pruning of the candidate item sets (using the apriori property). (a priori: before accessing the transaction database to determine the support)
- Transactions are represented as simple arrays of items (so-called horizontal transaction representation, see also below).
- The support of a candidate item set is computed by checking whether the candidates are subsets of a transaction or by generating subsets of a transaction and finding them among the candidates.

Apriori: Levelwise Search

Example transaction database (as before) with 5 items and 10 transactions. Minimum support: 30%, that is, at least 3 transactions must contain the item set.

First level (singletons): {a}: 7, {b}: 3, {c}: 7, {d}: 6, {e}: 7. All sets with one item are frequent, so the full second level is needed.

Second level (candidate pairs and their supports):
{a, b}: 0, {a, c}: 4, {a, d}: 5, {a, e}: 6, {b, c}: 3, {b, d}: 1, {b, e}: 1, {c, d}: 4, {c, e}: 4, {d, e}: 4

Determining the support of item sets: For each item set traverse the database and count the transactions that contain it (highly inefficient). Better: traverse the tree for each transaction and find the item sets it contains (efficient: can be implemented as a simple (doubly) recursive procedure).
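The second counting strategy mentioned above, finding per transaction the candidates it contains, can be sketched by generating the k-subsets of each transaction and looking them up among the candidates (illustrative Python, shown here for the second level):

```python
from itertools import combinations

T = [{"a","d","e"}, {"b","c","d"}, {"a","c","e"}, {"a","c","d","e"},
     {"a","e"}, {"a","c","d"}, {"b","c"}, {"a","c","d","e"},
     {"b","c","e"}, {"a","d","e"}]

# candidate pairs: all 2-subsets of the frequent items (here: all items)
cands = {frozenset(p): 0 for p in combinations("abcde", 2)}

for t in T:                                  # one pass over the transactions:
    for p in combinations(sorted(t), 2):     # generate the 2-subsets of t
        p = frozenset(p)
        if p in cands:                       # ... and look them up among the candidates
            cands[p] += 1

print(cands[frozenset({"a", "d"})])  # 5
```

This inverts the loop structure of the naive approach: the database is traversed once, and each transaction only touches the counters of the candidates it actually contains.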
Infrequent item sets: {a, b}, {b, d}, {b, e}. The subtrees starting at these item sets can be pruned. (a posteriori: after accessing the transaction database to determine the support)
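The levelwise search with a priori and a posteriori pruning can be condensed into a short program. The following is an illustrative sketch (not Borgelt's implementation; the names `apriori` and `smin` are ours), using plain Python sets for the transactions instead of a prefix tree:

```python
from itertools import combinations

def apriori(transactions, smin):
    """Levelwise (breadth-first) search for frequent item sets."""
    items = sorted({i for t in transactions for i in t})
    freq, level = {}, []
    for i in items:                       # count and filter the singletons
        s = sum(1 for t in transactions if i in t)
        if s >= smin:
            freq[frozenset([i])] = s
            level.append(frozenset([i]))
    k = 1
    while level:
        # candidate generation: merge frequent sets differing in one item
        cands = {a | b for a in level for b in level if len(a | b) == k + 1}
        # a priori pruning: all k-subsets of a candidate must be frequent
        cands = [c for c in cands
                 if all(frozenset(s) in freq for s in combinations(c, k))]
        level = []
        for c in cands:                   # support counting (a posteriori pruning)
            s = sum(1 for t in transactions if c <= t)
            if s >= smin:
                freq[c] = s
                level.append(c)
        k += 1
    return freq
```

With the example database and a minimum support of 3, this yields the 15 frequent item sets discussed on the following slides.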

Apriori: Levelwise Search

Generate candidate item sets with 3 items (their parents must be frequent). Before counting, check whether the candidates contain an infrequent item set. An item set with k items has k subsets of size k-1; the parent item set is only one of these subsets. The item sets {b, c, d} and {b, c, e} can be pruned, because {b, c, d} contains the infrequent item set {b, d} and {b, c, e} contains the infrequent item set {b, e}. (a priori: before accessing the transaction database to determine the support)

Only the remaining four candidate item sets of size 3 are evaluated; no other item sets of size 3 can be frequent. The transaction database is accessed to determine their support. Minimum support: 30%, that is, at least 3 transactions must contain the item set. The infrequent item set {c, d, e} is pruned. (a posteriori: after accessing the transaction database to determine the support)

Blue: a priori pruning. Red: a posteriori pruning.

Apriori: Levelwise Search

Generate candidate item sets with 4 items (their parents must be frequent). Before counting, check whether the candidates contain an infrequent item set (a priori pruning). The item set {a, c, d, e} can be pruned, because it contains the infrequent item set {c, d, e}. Consequence: No candidate item sets with four items remain, so a fourth access to the transaction database is not necessary.

Apriori: Node Organization

Idea: Optimize the organization of the counters and the child pointers.

Direct Indexing: Each node is a simple array of counters. An item is used as a direct index to find its counter. Advantage: Counter access is extremely fast. Disadvantage: Memory usage can be high due to gaps in the index space.

Sorted Vectors: Each node is a (sorted) array of item/counter pairs. A binary search is necessary to find the counter for an item. Advantage: Memory usage may be smaller, since there are no unnecessary counters. Disadvantage: Counter access is slower due to the binary search.

Hash Tables: Each node is an array of item/counter pairs (closed hashing). The index of a counter is computed from the item code. Advantage: Faster counter access than with binary search. Disadvantage: Higher memory usage than sorted arrays (pairs, fill rate); the order of the items cannot be exploited.

Child Pointers: The deepest level of the item set tree does not need child pointers. Since fewer child pointers than counters are needed, it pays to represent the child pointers in a separate array.
The sorted array of item/counter pairs can be reused for a binary search.
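The trade-off between direct indexing and sorted vectors can be illustrated in a few lines (a hypothetical sketch; the function names are ours):

```python
import bisect

def counter_direct(node, item):
    """Direct indexing: the node is a plain counter array,
    the integer item code is used directly as the index."""
    return node[item]

def counter_sorted(node, item):
    """Sorted vector: the node is a sorted array of (item, counter)
    pairs; a binary search locates the counter (0 if absent)."""
    pos = bisect.bisect_left(node, (item,))
    if pos < len(node) and node[pos][0] == item:
        return node[pos][1]
    return 0
```

The direct index wastes a slot for every item code that has no counter in the node (here item 2), which is exactly the memory/speed trade-off described above.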

Apriori: Item Coding

Items are coded as consecutive integers starting with 0 (needed for the direct indexing approach). The size and the number of the gaps in the index space depend on how the items are coded.

Idea: It is plausible that frequent item sets consist of frequent items. Sort the items w.r.t. their frequency (group frequent items). Sorting descendingly yields a prefix tree with fewer nodes; sorting ascendingly yields fewer and smaller index gaps. Empirical evidence: sorting ascendingly is better.

Extension: Sort items w.r.t. the sum of the sizes of the transactions that cover them. Empirical evidence: better than simple item frequencies.

Apriori: Recursive Counting

The items in a transaction are sorted (ascending item codes). Processing a transaction is a (doubly) recursive procedure. To process a transaction for a node of the item set tree:

Go to the child corresponding to the first item in the transaction and count the suffix of the transaction recursively for that child. (In the currently deepest level of the tree, increment the counter corresponding to the item instead of going to the child node.)

Discard the first item of the transaction and process the remaining suffix recursively for the node itself.

Optimizations: Directly skip all items preceding the first item in the node. Abort the recursion if the first item is beyond the last one in the node. Abort the recursion if a transaction is too short to reach the deepest level.

Apriori: Recursive Counting

Example: counting the transaction {a, c, d, e} against the current candidate tree. (The original slides step through this example as a sequence of tree diagrams, which are omitted here.)

Apriori: Recursive Counting

(Further steps of the counting example; the tree diagrams are omitted. Recursion branches that cannot reach the deepest level are skipped: "too few items".)

Processing a transaction (suffix) in a node is easily implemented as a simple loop. For each item the remaining suffix is processed in the corresponding child. If the (currently) deepest tree level is reached, counters are incremented for each item in the transaction (suffix). If the remaining transaction (suffix) is too short to reach the (currently) deepest level, the recursion is terminated.
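The doubly recursive counting procedure can be sketched as follows (an illustrative sketch, assuming the candidate tree is built from nested dicts with integer counters at the deepest level; all names are ours):

```python
def count(node, trans, depth, k):
    """Count a (sorted) transaction suffix in a candidate prefix tree.

    node:  dict mapping an item to a child dict (inner levels)
           or to an integer counter (at depth k - 1, the deepest level)
    trans: tuple of the remaining transaction items, ascending order
    k:     size of the candidate item sets (depth of the tree)
    """
    if len(trans) < k - depth:            # too short to reach the deepest level
        return
    for pos, item in enumerate(trans):    # the loop realizes the second recursion
        if item in node:                  # (discard the first item, process the rest)
            if depth == k - 1:
                node[item] += 1           # deepest level: increment the counter
            else:                         # inner level: count the suffix in the child
                count(node[item], trans[pos + 1:], depth + 1, k)
```

Counting the ten example transactions against the full second level reproduces the pair supports from the slides, e.g. {a, e}: 6 and {a, b}: 0.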

Apriori: Transaction Representation

Direct Representation: Each transaction is represented as an array of items. The transactions are stored in a simple list or array.

Organization as a Prefix Tree: The items in each transaction are sorted (in an arbitrary, but fixed order). Transactions with the same prefix are grouped together. Advantage: a common prefix is processed only once in the support counting. The gains from this organization depend on how the items are coded: common transaction prefixes are more likely if the items are sorted with descending frequency. However, an ascending order is better for the search, and this dominates the execution time (empirical evidence).

Apriori: Transactions as a Prefix Tree

The example transaction database, lexicographically sorted:
{a, c, d}, {a, c, d, e}, {a, c, d, e}, {a, c, e}, {a, d, e}, {a, d, e}, {a, e}, {b, c}, {b, c, d}, {b, c, e}

Items in transactions are sorted w.r.t. some arbitrary order, the transactions are sorted lexicographically, then a prefix tree is constructed (diagram omitted). Advantage: identical transaction prefixes are processed only once.

Summary: Apriori

Basic Processing Scheme: Breadth-first/levelwise traversal of the partially ordered set (2^B, ⊆). Candidates are formed by merging item sets that differ in only one item. Support counting can be done with a (doubly) recursive procedure.

Advantages: Perfect pruning of infrequent candidate item sets (with infrequent subsets).

Disadvantages: Can require a lot of memory (since all frequent item sets are represented). Support counting takes very long for large transactions.

Software

Searching the Prefix Tree Depth-First (Eclat, FP-growth and other algorithms)

Depth-First Search and Conditional Databases

A depth-first search can also be seen as a divide-and-conquer scheme: First find all frequent item sets that contain a chosen item, then all frequent item sets that do not contain it.

General search procedure: Let the item order be a < b < c < ... . Restrict the transaction database to those transactions that contain a. This is the conditional database for the prefix a. Recursively search this conditional database for frequent item sets and add the prefix a to all frequent item sets found in the recursion. Then remove the item a from the transactions in the full transaction database. This is the conditional database for item sets without a. Recursively search this conditional database for frequent item sets.

With this scheme only frequent one-element item sets have to be determined. Larger item sets result from adding possible prefixes.

(The slides illustrate the split on the subset lattice of {a, b, c, d, e}; diagram omitted.)

Split into subproblems w.r.t. item a:
blue: the item set containing only item a.
green: item sets containing item a (and at least one other item).
red: item sets not containing item a (but at least one other item).
green: needs the conditional database with transactions containing item a.
red: needs the conditional database with all transactions, but with item a removed.

Depth-First Search and Conditional Databases

Split into subproblems w.r.t. item b (within the branch for item a):
blue: the item sets {a} and {a, b}.
green: item sets containing both items a and b (and at least one other item).
red: item sets containing item a (and at least one other item), but not item b.
green: needs the database with transactions containing both items a and b.
red: needs the database with transactions containing item a, but with item b removed.

Split into subproblems w.r.t. item b (within the branch without item a):
blue: the item set containing only item b.
green: item sets containing item b (and at least one other item), but not item a.
red: item sets containing neither item a nor b (but at least one other item).
green: needs the database with transactions containing item b, but with item a removed.
red: needs the database with all transactions, but with both items a and b removed.

Formal Description of the Divide-and-Conquer Scheme

Generally, a divide-and-conquer scheme can be described as a set of (sub)problems. The initial (sub)problem is the actual problem to solve. A subproblem is processed by splitting it into smaller subproblems, which are then processed recursively.

All subproblems that occur in frequent item set mining can be defined by a conditional transaction database and a prefix (of items). The prefix is a set of items that has to be added to all frequent item sets that are discovered in the conditional transaction database.

Formally, all subproblems are tuples S = (T, P), where T is a conditional transaction database and P ⊆ B is a prefix. The initial problem, with which the recursion is started, is S = (T, ∅), where T is the transaction database to mine and the prefix is empty.

A subproblem S₀ = (T₀, P₀) is processed as follows:

Choose an item i ∈ B₀, where B₀ is the set of items occurring in T₀.

If s_{T₀}(i) ≥ s_min (where s_{T₀}(i) is the support of the item i in T₀): Report the item set P₀ ∪ {i} as frequent with the support s_{T₀}(i). Form the subproblem S₁ = (T₁, P₁) with P₁ = P₀ ∪ {i}; T₁ comprises all transactions in T₀ that contain the item i, but with the item i removed (and empty transactions removed). If T₁ is not empty, process S₁ recursively.

In any case (that is, regardless of whether s_{T₀}(i) ≥ s_min or not): Form the subproblem S₂ = (T₂, P₂) with P₂ = P₀; T₂ comprises all transactions in T₀ (whether they contain the item i or not), but again with the item i removed (and empty transactions removed). If T₂ is not empty, process S₂ recursively.
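This processing scheme translates directly into code. A minimal sketch (names are ours; for simplicity the split item is the smallest item occurring in the conditional database, not chosen heuristically):

```python
def mine(tdb, prefix, smin, report):
    """Divide-and-conquer frequent item set search.

    tdb:    conditional transaction database, a list of frozensets
    prefix: the set P of prefix items for this subproblem
    """
    if not tdb:
        return
    i = min(frozenset.union(*tdb))      # choose an item i occurring in tdb
    s = sum(1 for t in tdb if i in t)   # support of i in the conditional db
    if s >= smin:
        report(prefix | {i}, s)         # report P ∪ {i} as frequent
        # subproblem S1: transactions containing i, with i removed
        # (transactions that become empty are dropped)
        t1 = [t - {i} for t in tdb if i in t and len(t) > 1]
        mine(t1, prefix | {i}, smin, report)
    # subproblem S2: all transactions, with i removed (empties dropped)
    t2 = [t - {i} for t in tdb if t - {i}]
    mine(t2, prefix, smin, report)
```

On the example database with s_min = 3 this reports exactly the 15 frequent item sets found by Apriori.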

Perfect Extensions

The search can easily be improved with so-called perfect extension pruning.

Let T be a transaction database over an item base B. Given an item set I, an item i ∉ I is called a perfect extension of I w.r.t. T iff the item sets I and I ∪ {i} have the same support: s_T(I) = s_T(I ∪ {i}) (that is, if all transactions containing the item set I also contain the item i).

Perfect extensions have the following properties:

If the item i is a perfect extension of an item set I, then i is also a perfect extension of any item set J ⊇ I (provided i ∉ J). This can most easily be seen by considering that K_T(I) ⊆ K_T({i}) and hence K_T(J) ⊆ K_T({i}), since K_T(J) ⊆ K_T(I).

If X_T(I) is the set of all perfect extensions of an item set I w.r.t. T (that is, if X_T(I) = {i ∈ B − I | s_T(I ∪ {i}) = s_T(I)}), then all sets I ∪ J with J ∈ 2^{X_T(I)} have the same support as I (where 2^M denotes the power set of a set M).

Perfect Extensions: Examples

Transaction database (as before) and the frequent item sets for s_min = 3:
0 items: ∅: 10
1 item: {a}: 7, {b}: 3, {c}: 7, {d}: 6, {e}: 7
2 items: {a, c}: 4, {a, d}: 5, {a, e}: 6, {b, c}: 3, {c, d}: 4, {c, e}: 4, {d, e}: 4
3 items: {a, c, d}: 3, {a, c, e}: 3, {a, d, e}: 4

c is a perfect extension of {b}, since {b} and {b, c} both have support 3. a is a perfect extension of {d, e}, since {d, e} and {a, d, e} both have support 4. There are no other perfect extensions in this example for a minimum support of s_min = 3.

Perfect Extension Pruning

Consider again the original divide-and-conquer scheme: A subproblem S₀ = (T₀, P₀) is split into a subproblem S₁ = (T₁, P₁) to find all frequent item sets that do contain an item i ∈ B₀ and a subproblem S₂ = (T₂, P₂) to find all frequent item sets that do not contain the item i. Suppose the item i is a perfect extension of the prefix P₀.
Let F₁ and F₂ be the sets of frequent item sets that are reported when processing S₁ and S₂, respectively. Then I ∪ {i} ∈ F₁ iff I ∈ F₂. The reason is that generally P₁ = P₂ ∪ {i}, and in this case T₁ = T₂, because all transactions in T₀ contain the item i (as i is a perfect extension). Therefore it suffices to solve one subproblem (namely S₂); the solution of the other subproblem (S₁) is constructed by adding the item i.

Perfect extensions can be exploited by collecting these items in the recursion, in a third element of a subproblem description. Formally, a subproblem is then a triplet S = (T, P, X), where T is a conditional transaction database, P is the set of prefix items for T, and X is the set of perfect extension items.

Once identified, perfect extension items are no longer processed in the recursion, but are only used to generate all supersets of the prefix having the same support. Consequently, they are removed from the conditional transaction databases. This technique is also known as hypercube decomposition.

The divide-and-conquer scheme has basically the same structure as without perfect extension pruning. However, the exact way in which perfect extensions are collected can depend on the specific algorithm used.
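The perfect extensions of a given item set can be found directly from the definition: they are the items contained in every transaction of the cover of I. An illustrative sketch (the function name is ours; it assumes s_T(I) > 0):

```python
def perfect_extensions(tdb, itemset):
    """Items i not in I with s_T(I ∪ {i}) = s_T(I), i.e. items that
    occur in every transaction of the cover K_T(I)."""
    cover = [t for t in tdb if itemset <= t]    # transactions containing I
    others = frozenset.union(*tdb) - itemset    # candidate extension items
    return {i for i in others if all(i in t for t in cover)}
```

On the example database this reproduces the two perfect extensions from the slides: c for {b} and a for {d, e}.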

Reporting Frequent Item Sets

With the described divide-and-conquer scheme, item sets are reported in lexicographic order. This can be exploited for efficient item set reporting: The prefix P is a string, which is extended when an item is added to P. Thus only one item needs to be formatted per reported frequent item set; the prefix is already formatted in the string. Backtracking the search (return from recursion) removes an item from the prefix string. This scheme can speed up the output considerably.

Example (item set and support): a (7), a c (4), a c d (3), a c e (3), a d (5), a d e (4), a e (6), b (3), b c (3), c (7), c d (4), c e (4), d (6), d e (4), e (7).

Global and Local Item Order

Up to now we assumed that the item order is (globally) fixed and determined at the very beginning based on heuristics. However, the described divide-and-conquer scheme shows that a globally fixed item order is more restrictive than necessary: The item used to split the current subproblem can be any item that occurs in the conditional transaction database of the subproblem. There is no need to choose the same item for splitting sibling subproblems (as a global item order would require us to do).

The same heuristics used for determining a global item order suggest that the split item for a given subproblem should be selected from the (conditionally) least frequent item(s). As a consequence, the item orders may differ for every branch of the search tree. However, two subproblems must share the item order that is fixed by the common part of their paths from the root (initial subproblem).

Item Order: Divide-and-Conquer Recursion

(Subproblem tree diagram omitted.) All local item orders start with a < ... ; all subproblems on the left share a < b < ..., all subproblems on the right share a < c < ... .
Global and Local Item Order

Local item orders have advantages and disadvantages:

Advantage: In some data sets the order of the conditional item frequencies differs considerably from the global order. Such data sets can sometimes be processed significantly faster with local item orders (depending on the algorithm).

Disadvantage: The data structure of the conditional databases must allow us to determine conditional item frequencies quickly. Not having a globally fixed item order can make it more difficult to determine the conditional transaction databases w.r.t. the split items (depending on the employed data structure). The gains from the better item order may then be lost again due to the more complex processing/conditioning scheme.

Transaction Database Representation

Eclat, FP-growth and several other frequent item set mining algorithms rely on the described basic divide-and-conquer scheme. They differ mainly in how they represent the conditional transaction databases. The main approaches are horizontal and vertical representations:

In a horizontal representation, the database is stored as a list (or array) of transactions, each of which is a list (or array) of the items contained in it.

In a vertical representation, a database is represented by first referring with a list (or array) to the different items. For each item a list (or array) of identifiers is stored, which indicate the transactions that contain the item.

However, this distinction is not pure, since there are many algorithms that use a combination of the two forms of representing a transaction database. Frequent item set mining algorithms also differ in how they construct new conditional transaction databases from a given one.

Transaction Database Representation

The Apriori algorithm uses a horizontal transaction representation: each transaction is an array of the contained items. Note that the alternative prefix tree organization is still an essentially horizontal representation.

The alternative is a vertical transaction representation: For each item a transaction (index/identifier) list is created. The transaction list of an item i indicates the transactions that contain it, that is, it represents its cover K_T({i}). Advantage: the transaction list for a pair of items can be computed by intersecting the transaction lists of the individual items. Generally, a vertical transaction representation can exploit

∀ I, J ⊆ B: K_T(I ∪ J) = K_T(I) ∩ K_T(J).

A combined representation is the frequent pattern tree (to be discussed later).
Horizontal representation: list the items for each transaction. Vertical representation: list the transactions for each item. (The slide shows the example database side by side in horizontal, vertical, and matrix representation; the diagrams are omitted.)
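The vertical representation and the cover identity K_T(I ∪ J) = K_T(I) ∩ K_T(J) can be sketched as follows (illustrative; the names `vertical` and `support` are ours):

```python
def vertical(tdb):
    """Build the vertical representation: item -> set of transaction
    identifiers, i.e. the covers K_T({i}) of the individual items."""
    cov = {}
    for tid, t in enumerate(tdb, start=1):
        for item in t:
            cov.setdefault(item, set()).add(tid)
    return cov

def support(cov, itemset):
    """Support of an item set = size of the intersection of its item covers."""
    return len(set.intersection(*(cov[i] for i in itemset)))
```

For the example database, intersecting the covers of a and c yields the cover {3, 4, 6, 8} of {a, c}, so no subset tests on transactions are needed.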

Transaction Database Representation

Note that a prefix tree representation (as shown before for the lexicographically sorted transaction database) is a compressed horizontal representation. Principle: equal prefixes of transactions are merged. This is most effective if the items are sorted descendingly w.r.t. their support.

The Eclat Algorithm [Zaki, Parthasarathy, Ogihara, and Li 1997]

Eclat: Basic Ideas

The item sets are checked in lexicographic order (depth-first traversal of the prefix tree). The search scheme is the same as the general scheme for searching with canonical forms having the prefix property and possessing a perfect extension rule (generate only canonical extensions).

Eclat generates more candidate item sets than Apriori, because it (usually) does not store the support of all visited item sets. As a consequence it cannot fully exploit the Apriori property for pruning.

Eclat uses a purely vertical transaction representation. No subset tests and no subset generation are needed to compute the support. The support of item sets is rather determined by intersecting transaction lists.

Note that Eclat cannot fully exploit the Apriori property because it does not store the support of all explored item sets, not because it cannot know it. If all computed support values were stored, it could be implemented in such a way that all support values needed for full a priori pruning are available.

Eclat: Subproblem Split
(Transaction list diagrams omitted.) The first subproblem uses the conditional database for the prefix a; the second subproblem uses the conditional database with item a removed.

Eclat: Depth-First Search

Form a transaction list for each item (here shown as a bit array: gray means the item is contained in the transaction, white means it is not). The transaction database is needed only once, for the single item transaction lists.

Intersect the transaction list for item a with the transaction lists of all other items (conditional database for item a). Count the number of bits that are set (the number of containing transactions). This yields the support of all item sets with the prefix a.

The item set {a, b} is infrequent and can be pruned. All other item sets with the prefix a are frequent and are therefore kept and processed recursively.

Intersect the transaction list for the item set {a, c} with the transaction lists of the item sets {a, x}, x ∈ {d, e}. Result: transaction lists for the item sets {a, c, d} and {a, c, e}. Count the number of bits that are set (the number of containing transactions). This yields the support of all item sets with the prefix a c.

Eclat: Depth-First Search

Intersect the transaction lists for the item sets {a, c, d} and {a, c, e}. Result: the transaction list for the item set {a, c, d, e}. With Apriori this item set could be pruned before counting, because it was known that {c, d, e} is infrequent.

The item set {a, c, d, e} is not frequent (support 2/20%) and is therefore pruned. Since there is no transaction list left (and thus no intersection possible), the recursion is terminated and the search backtracks.

The search backtracks to the second level of the search tree and intersects the transaction lists for the item sets {a, d} and {a, e}. Result: the transaction list for the item set {a, d, e}. Since there is only one transaction list left (and thus no intersection possible), the recursion is terminated and the search backtracks again.

The search backtracks to the first level of the search tree and intersects the transaction list for b with the transaction lists for c, d, and e. Result: transaction lists for the item sets {b, c}, {b, d}, and {b, e}.

Eclat: Depth-First Search

Only one item set, {b, c}, has sufficient support, so all other subtrees are pruned. Since there is only one transaction list left (and thus no intersection possible), the recursion is terminated and the search backtracks again.

Backtrack to the first level of the search tree and intersect the transaction list for c with the transaction lists for d and e. Result: transaction lists for the item sets {c, d} and {c, e}.

Intersect the transaction lists for the item sets {c, d} and {c, e}. Result: the transaction list for the item set {c, d, e}. The item set {c, d, e} is not frequent (support 2/20%) and is therefore pruned. Since there is no transaction list left (and thus no intersection possible), the recursion is terminated and the search backtracks.

Eclat: Depth-First Search

The search backtracks to the first level of the search tree and intersects the transaction list for d with the transaction list for e. Result: the transaction list for the item set {d, e}. With this step the search is complete.

The found frequent item sets coincide, of course, with those found by the Apriori algorithm. However, a fundamental difference is that Eclat usually only writes found frequent item sets to an output file, while Apriori keeps the whole search tree in main memory.

Note that the item set {a, c, d, e} could be pruned by Apriori without computing its support, because the item set {c, d, e} is infrequent. The same can be achieved with Eclat if the depth-first traversal of the prefix tree is carried out from right to left and computed support values are stored. It is debatable whether the potential gains justify the memory requirement.

Eclat: Representing Transaction Identifier Lists

Bit Matrix Representations: Represent the transactions as a bit matrix: each column corresponds to an item, each row corresponds to a transaction. Normal and sparse representation of bit matrices: Normal: one memory bit per matrix bit (zeros are represented). Sparse: lists of row indices of set bits (transaction identifier lists).
(zeros are not represented). Which representation is preferable depends on the ratio of set bits to cleared bits. In most cases a sparse representation is preferable, because the intersections clear more and more bits.
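Both representations can be tried out with Python integers as bit vectors (a toy illustration of my own; the tid values are the covers of items a and c in the example database):

```python
def to_mask(tids):
    # normal (dense) representation: bit k is set iff transaction k is in the cover
    m = 0
    for k in tids:
        m |= 1 << k
    return m

# sparse representation: sorted transaction identifier lists
tids_a = [1, 3, 4, 5, 6, 8, 10]    # cover of item a
tids_c = [2, 3, 4, 6, 7, 8, 9]     # cover of item c

# in the dense representation the intersection is a single bitwise AND,
# and the support is the population count of the result
mask_ac = to_mask(tids_a) & to_mask(tids_c)
support_ac = bin(mask_ac).count("1")
```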

Eclat: Intersecting Transaction Lists

function isect (src1, src2: tidlist) begin ( intersect two transaction id lists )
  var dst : tidlist;                   ( created intersection )
  while both src1 and src2 are not empty do begin
    if head(src1) < head(src2)         ( skip transaction identifiers that are )
    then src1 := tail(src1);           ( unique to the first source list )
    elseif head(src1) > head(src2)     ( skip transaction identifiers that are )
    then src2 := tail(src2);           ( unique to the second source list )
    else begin                         ( if transaction id is in both sources )
      dst.append(head(src1));          ( append it to the output list )
      src1 := tail(src1); src2 := tail(src2);
    end;                               ( remove the transferred transaction id )
  end;                                 ( from both source lists )
  return dst;                          ( return the created intersection )
end; ( function isect() )

Eclat: Filtering Transaction Lists

function filter (transdb: list of tidlist) begin ( filter a transaction database )
  var condb : list of tidlist;         ( created conditional transaction database )
      out   : tidlist;                 ( filtered tidlist of other item )
  for tid in head(transdb) do          ( traverse the tidlist of the split item )
    contained[tid] := true;            ( and set flags for contained tids )
  for inp in tail(transdb) do begin    ( traverse tidlists of the other items )
    out := new tidlist;                ( create an output tidlist and )
    condb.append(out);                 ( append it to the conditional database )
    for tid in inp do                  ( collect tids shared with split item )
      if contained[tid] then out.append(tid);
  end                                  ( "contained" is a global boolean array )
  for tid in head(transdb) do          ( traverse the tidlist of the split item )
    contained[tid] := false;           ( and clear flags for contained tids )
  return condb;                        ( return the created conditional database )
end; ( function filter() )

Eclat: Item Order

Consider Eclat with transaction identifier lists (sparse representation): each computation of a conditional transaction database intersects the transaction list for an item (let this be list L) with all transaction lists for items following in the item order.
The lists resulting from the intersections cannot be longer than the list L. (This is another form of the fact that support is anti-monotone.)

If the items are processed in the order of increasing frequency (that is, if they are chosen as split items in this order):
Short lists (less frequent items) are intersected with many other lists, creating a conditional transaction database with many short lists.
Longer lists (more frequent items) are intersected with few other lists, creating a conditional transaction database with few long lists.

Consequence: The average size of conditional transaction databases is reduced, which leads to faster processing / search.

[Figure: conditional database for prefix a (1st subproblem) and conditional database with item a removed (2nd subproblem); likewise for prefix b and with item b removed.]
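The intersection and filtering routines from the pseudo-code above translate directly into Python (a sketch; a set replaces the global boolean flag array, and the function names are my own):

```python
def isect(src1, src2):
    # merge-style intersection of two sorted transaction id lists
    dst, i, j = [], 0, 0
    while i < len(src1) and j < len(src2):
        if src1[i] < src2[j]:
            i += 1                    # tid unique to the first list: skip it
        elif src1[i] > src2[j]:
            j += 1                    # tid unique to the second list: skip it
        else:
            dst.append(src1[i])       # tid in both lists: transfer to output
            i += 1; j += 1
    return dst

def filter_db(transdb):
    # transdb[0] is the tid list of the split item, the rest belong
    # to the other items; keep only tids shared with the split item
    contained = set(transdb[0])
    return [[tid for tid in lst if tid in contained] for lst in transdb[1:]]
```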

Reminder (Apriori): Transactions as a Prefix Tree

[Figure: the transaction database, its lexicographically sorted form, and the resulting prefix tree representation.]

Items in transactions are sorted w.r.t. some arbitrary order, transactions are sorted lexicographically, then a prefix tree is constructed. Advantage: identical transaction prefixes are processed only once.

Eclat: Transaction Ranges

[Figure: the transaction database with item frequencies a: 7, b: 3, c: 7, d: 6, e: 7; transactions re-sorted by item frequency and then lexicographically, yielding a range of consecutive transaction identifiers per item.]

The transaction lists can be compressed by combining consecutive transaction identifiers into ranges. Exploit item frequencies and ensure subset relations between ranges from lower to higher frequencies, so that intersecting the lists is easy.

Eclat: Transaction Ranges / Prefix Tree

[Figure: the transaction database sorted by item frequency and lexicographically, and the corresponding prefix tree representation.]

Items in transactions are sorted by frequency, transactions are sorted lexicographically, then a prefix tree is constructed. The transaction ranges reflect the structure of this prefix tree.

Eclat: Difference Sets (Diffsets)
In a conditional database, all transaction lists are filtered by the prefix: only transactions contained in the transaction identifier list for the prefix can be in the transaction identifier lists of the conditional database.

This suggests the idea to use diffsets to represent conditional databases:

  ∀I: ∀a ∉ I:   D_T(a | I) = K_T(I) \ K_T(I ∪ {a})

D_T(a | I) contains the identifiers of the transactions that contain I but not a. The support of direct supersets of I can now be computed as

  ∀I: ∀a ∉ I:   s_T(I ∪ {a}) = s_T(I) − |D_T(a | I)|.

The diffsets for the next level can be computed by

  ∀I: ∀a, b ∉ I, a ≠ b:   D_T(b | I ∪ {a}) = D_T(b | I) \ D_T(a | I)

For some transaction databases, using diffsets speeds up the search considerably.
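The three diffset identities can be checked directly on the example database with Python sets (a sketch of my own; `cover` and `diffset` are helper names I chose):

```python
# the example transaction database (transactions 1..10)
db = {1: {'a','d','e'}, 2: {'b','c','d'}, 3: {'a','c','e'}, 4: {'a','c','d','e'},
      5: {'a','e'},     6: {'a','c','d'}, 7: {'b','c'},     8: {'a','c','d','e'},
      9: {'b','c','e'}, 10: {'a','d','e'}}

def cover(item_set):
    # K_T(I): identifiers of the transactions that contain I
    return {k for k, t in db.items() if item_set <= t}

def diffset(a, item_set):
    # D_T(a | I) = K_T(I) \ K_T(I ∪ {a})
    return cover(item_set) - cover(item_set | {a})

# support via diffsets: s_T(I ∪ {a}) = s_T(I) − |D_T(a | I)|
support_cd = len(cover({'c'})) - len(diffset('d', {'c'}))
# next-level diffsets: D_T(b | I ∪ {a}) = D_T(b | I) \ D_T(a | I)
next_level = diffset('e', {'c'}) - diffset('d', {'c'})
```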

Eclat: Diffsets

Proof of the Formula for the Next Level:

D_T(b | I ∪ {a}) = K_T(I ∪ {a}) \ K_T(I ∪ {a, b})
  = {k | I ∪ {a} ⊆ t_k} \ {k | I ∪ {a, b} ⊆ t_k}
  = {k | I ⊆ t_k ∧ a ∈ t_k} \ {k | I ⊆ t_k ∧ a ∈ t_k ∧ b ∈ t_k}
  = {k | I ⊆ t_k ∧ a ∈ t_k ∧ b ∉ t_k}
  = {k | I ⊆ t_k ∧ b ∉ t_k} \ {k | I ⊆ t_k ∧ b ∉ t_k ∧ a ∉ t_k}
  = {k | I ⊆ t_k ∧ b ∉ t_k} \ {k | I ⊆ t_k ∧ a ∉ t_k}
  = ({k | I ⊆ t_k} \ {k | I ∪ {b} ⊆ t_k}) \ ({k | I ⊆ t_k} \ {k | I ∪ {a} ⊆ t_k})
  = D_T(b | I) \ D_T(a | I)

Summary Eclat

Basic Processing Scheme:
Depth-first traversal of the prefix tree (divide-and-conquer scheme).
Data is represented as lists of transaction identifiers (one per item).
Support counting is done by intersecting lists of transaction identifiers.

Advantages:
Depth-first search reduces memory requirements.
Usually (considerably) faster than Apriori.

Disadvantages:
With a sparse transaction list representation (row indices) intersections are difficult to execute for modern processors (branch prediction).

Software

The LCM Algorithm

Linear Closed Item Set Miner
[Uno, Asai, Uchida, and Arimura 2003] (version 1)
[Uno, Kiyomi, and Arimura 2004, 2005] (versions 2 & 3)

LCM: Basic Ideas

The item sets are checked in lexicographic order (depth-first traversal of the prefix tree).
Standard divide-and-conquer scheme (include/exclude items); recursive processing of the conditional transaction databases.
Closely related to the Eclat algorithm.
Maintains both a horizontal and a vertical representation of the transaction database in parallel:
Uses the vertical representation to filter the transactions with the chosen split item.
Uses the horizontal representation to fill the vertical representation for the next recursion step (no intersection as in Eclat).
Usually traverses the search tree from right to left in order to reuse the memory for the vertical representation (fixed memory requirement, proportional to database size).
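That filling step (LCM's "occurrence deliver", shown on the next slides) is easy to sketch in Python (my rendering: one pass over the split item's transactions fills the conditional tid lists, with no intersection):

```python
def occurrence_deliver(split_tids, horizontal, items_after):
    # horizontal: tid -> transaction (the parallel horizontal representation);
    # fill the conditional database's tid lists in a single pass
    cond = {i: [] for i in items_after}
    for tid in split_tids:
        for item in horizontal[tid]:
            if item in cond:
                cond[item].append(tid)
    return cond

example_db = {1: {'a','d','e'}, 2: {'b','c','d'}, 3: {'a','c','e'},
              4: {'a','c','d','e'}, 5: {'a','e'}, 6: {'a','c','d'},
              7: {'b','c'}, 8: {'a','c','d','e'}, 9: {'b','c','e'},
              10: {'a','d','e'}}
# conditional database for split item b (its cover is [2, 7, 9])
cond_b = occurrence_deliver([2, 7, 9], example_db, ['c', 'd', 'e'])
```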

LCM: Occurrence Deliver

[Figure: occurrence deliver on the example database; the transaction lists of the conditional database are filled by one traversal of the split item's transactions.]

Occurrence deliver scheme used by LCM to find the conditional transaction database for the first subproblem (needs a horizontal representation in parallel).

LCM: Left to Right Processing

[Figure: successive states of the vertical representation; black: unprocessed part, blue: split item, red: conditional database.]

The second subproblem (exclude split item) is solved before the first subproblem (include split item). The algorithm is executed only on the memory that stores the initial vertical representation (plus the horizontal representation). If the transaction database can be loaded, the frequent item sets can be found.

Summary LCM

Basic Processing Scheme:
Depth-first traversal of the prefix tree (divide-and-conquer scheme).
Parallel horizontal and vertical transaction representation.
Support counting is done during the occurrence deliver process.

Advantages:
Fairly simple data structure and processing scheme.
Very fast if implemented properly (and with additional tricks).

Disadvantages:
Simple, straightforward implementation is relatively slow.

Software (option -Ao)

The SaM Algorithm

Split and Merge Algorithm [Borgelt]

SaM: Basic Ideas

The item sets are checked in lexicographic order (depth-first traversal of the prefix tree).
Standard divide-and-conquer scheme (include/exclude items). Recursive processing of the conditional transaction databases.
While Eclat uses a purely vertical transaction representation, SaM uses a purely horizontal transaction representation. This demonstrates that the traversal order for the prefix tree and the representation form of the transaction database can be combined freely.
The data structure used is simply an array of transactions. The two conditional databases for the two subproblems formed in each step are created with a split step and a merge step. Due to these steps the algorithm is called Split and Merge (SaM).

SaM: Preprocessing the Transaction Database

[Figure: preprocessing example over the items a-g with s_min = 3.]

1. Original transaction database.
2. Frequency of individual items.
3. Items in transactions sorted ascendingly w.r.t. their frequency.
4. Transactions sorted lexicographically in descending order (comparison of items inverted w.r.t. preceding step).
5. Data structure used by the algorithm.

SaM: Basic Operations

[Figure: split step (left) and merge step (right) on an example transaction array.]

Split Step: (on the left; for first subproblem)
Move all transactions starting with the same item to a new array.
Remove the common leading item (advance pointer into transaction).

Merge Step: (on the right; for second subproblem)
Merge the remainder of the transaction array and the copied transactions.
The merge operation is similar to a mergesort phase.
SaM: Pseudo-Code

function SaM (a: array of transactions, ( conditional database to process )
              p: set of items,          ( prefix of the conditional database a )
              s_min: int)               ( minimum support of an item set )
var i: item;                     ( buffer for the split item )
    b: array of transactions;    ( split result )
begin                            ( split and merge recursion )
  while a is not empty do        ( while the database is not empty )
    i := a[0].items[0];          ( get leading item of first transaction )
    move transactions starting with i to b;   ( split step: first subproblem )
    merge b and the remainder of a into a;    ( merge step: second subproblem )
    if s(i) ≥ s_min then         ( if the split item is frequent: )
      p := p ∪ {i};              ( extend the prefix item set and )
      report p with support s(i);  ( report the found frequent item set )
      SaM(b, p, s_min);          ( process the split result recursively, )
      p := p − {i};              ( then restore the original prefix )
    end;
  end;
end; ( function SaM() )
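A runnable Python sketch of this split-and-merge scheme (my rendering, not the slides' code: items are compared in plain ascending order instead of the frequency-based descending order used above, which changes only the processing order, not the result):

```python
from collections import Counter

def merge(c, b):
    # mergesort-like phase: equal transactions are combined (weights summed)
    a, i, j = [], 0, 0
    while i < len(c) and j < len(b):
        if c[i][1] < b[j][1]:
            a.append(list(c[i])); i += 1
        elif c[i][1] > b[j][1]:
            a.append(list(b[j])); j += 1
        else:
            a.append([c[i][0] + b[j][0], c[i][1]]); i += 1; j += 1
    a.extend(list(e) for e in c[i:])
    a.extend(list(e) for e in b[j:])
    return a

def sam(db, s_min, prefix, out):
    # db: list of [weight, transaction-tuple], sorted lexicographically,
    # so all transactions with the same leading item are adjacent in front
    while db:
        item = db[0][1][0]
        s, b = 0, []                       # split step: first subproblem
        while db and db[0][1][0] == item:
            w, t = db.pop(0)
            s += w                         # split also determines the support
            if len(t) > 1:
                b.append([w, t[1:]])
        db = merge(db, b)                  # merge step: second subproblem
        if s >= s_min:
            out[prefix | {item}] = s
            sam(b, s_min, prefix | {item}, out)

def sam_mine(transactions, s_min):
    weights = Counter(tuple(sorted(t)) for t in transactions)
    db = sorted(([w, t] for t, w in weights.items()), key=lambda e: e[1])
    out = {}
    sam(db, s_min, frozenset(), out)
    return out

example = [{'a','d','e'}, {'b','c','d'}, {'a','c','e'}, {'a','c','d','e'},
           {'a','e'}, {'a','c','d'}, {'b','c'}, {'a','c','d','e'},
           {'b','c','e'}, {'a','d','e'}]
found = sam_mine(example, 3)
```

On the running example database this yields the same 15 frequent item sets as Eclat.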

SaM: Pseudo-Code (Split Step)

var i: item;                  ( buffer for the split item )
    s: int;                   ( support of the split item )
    b: array of transactions; ( split result )
begin                         ( split step )
  b := empty; s := 0;         ( initialize split result and item support )
  i := a[0].items[0];         ( get leading item of first transaction )
  while a is not empty        ( while database is not empty and )
  and a[0].items[0] = i do    ( next transaction starts with same item )
    s := s + a[0].wgt;        ( sum occurrences (compute support) )
    remove i from a[0].items; ( remove split item from transaction )
    if a[0].items is not empty  ( if transaction has not become empty )
    then remove a[0] from a and append it to b;
    else remove a[0] from a; end;  ( move it to the conditional database, )
  end;                        ( otherwise simply remove it: )
end;                          ( empty transactions are eliminated )

Note that the split step also determines the support of the item i.

SaM: Pseudo-Code (Merge Step)

var c: array of transactions; ( buffer for remainder of source array )
begin                         ( merge step )
  c := a; a := empty;         ( initialize the output array )
  while b and c are both not empty do  ( merge split and remainder of database )
    if c[0].items > b[0].items         ( copy lex. smaller transaction from c )
    then remove c[0] from c and append it to a;
    else if c[0].items < b[0].items    ( copy lex. smaller transaction from b )
    then remove b[0] from b and append it to a;
    else b[0].wgt := b[0].wgt + c[0].wgt;  ( sum the occurrences/weights )
      remove b[0] from b and append it to a;
      remove c[0] from c;     ( move combined transaction and )
    end;                      ( delete the other, equal transaction: )
  end;                        ( keep only one copy per transaction )
  while c is not empty do     ( copy remaining transactions in c )
    remove c[0] from c and append it to a; end;
  while b is not empty do     ( copy remaining transactions in b )
    remove b[0] from b and append it to a; end;
end;                          ( second recursion: executed by loop )

SaM: Optimization

If the transaction database is sparse, the two transaction arrays to merge can differ substantially in size. In this case SaM can become fairly slow, because the merge step processes many more transactions than the split step.

Intuitive explanation (extreme case): Suppose mergesort always merged a single element with the recursively sorted remainder of the array (or list). This version of mergesort would be equivalent to insertion sort. As a consequence the time complexity worsens from O(n log n) to O(n²).

Possible optimization: Modify the merge step if the arrays to merge differ significantly in size. Idea: use the same optimization as in binary search based insertion sort.
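That binary-search idea can be sketched in Python before turning to the pseudo-code that follows (my rendering; `bisect_left` over a separate key list stands in for the manual search):

```python
from bisect import bisect_left

def merge_bsearch(a, b):
    # a: the longer array, b: the shorter one; entries are [weight, transaction]
    # and both arrays are sorted by transaction
    keys = [t for _, t in a]          # search keys for the binary search
    c, i = [], 0
    for wb, tb in b:
        j = bisect_left(keys, tb, i)  # insertion position of tb in a[i:]
        c.extend(a[i:j])              # copy the lexicographically smaller block
        i = j
        if i < len(a) and keys[i] == tb:
            c.append([wb + a[i][0], tb])   # equal transactions: sum the weights
            i += 1
        else:
            c.append([wb, tb])
    c.extend(a[i:])                   # copy the remainder of a
    return c
```

Each transaction of the short array costs one binary search plus a block copy, instead of an element-by-element walk over the long array.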
SaM: Pseudo-Code (Binary Search Based Merge)

function merge (a, b: array of transactions) : array of transactions
var l, m, r: int;             ( binary search variables )
    c: array of transactions; ( output transaction array )
begin                         ( binary search based merge )
  c := empty;                 ( initialize the output array )
  while a and b are both not empty do  ( merge the two transaction arrays )
    l := 0; r := length(a);   ( initialize the binary search range )
    while l < r do            ( while the search range is not empty )
      m := ⌊(l + r) / 2⌋;     ( compute the middle index )
      if a[m] < b[0]          ( compare the transaction to insert )
      then l := m + 1; else r := m; end;  ( and adapt the binary search range )
    end;                      ( according to the comparison result )
    while l > 0 do            ( while still before insertion position )
      remove a[0] from a and append it to c;
      l := l − 1;             ( copy lex. larger transaction and )
    end;                      ( decrement the transaction counter )
    ...

SaM: Pseudo-Code (Binary Search Based Merge)

    ...
    remove b[0] from b and append it to c;  ( copy the transaction to insert and )
    i := length(c) − 1;       ( get its index in the output array )
    if a is not empty and a[0].items = c[i].items
    then c[i].wgt := c[i].wgt + a[0].wgt;   ( if there is another transaction )
      remove a[0] from a;     ( that is equal to the one just copied, )
    end;                      ( then sum the transaction weights )
  end;                        ( and remove the transaction from the array )
  while a is not empty do     ( copy remainder of transactions in a )
    remove a[0] from a and append it to c; end;
  while b is not empty do     ( copy remainder of transactions in b )
    remove b[0] from b and append it to c; end;
  return c;                   ( return the merge result )
end; ( function merge() )

Applying this merge procedure if the length ratio of the transaction arrays exceeds a certain threshold accelerates the execution on sparse data sets.

SaM: Optimization and External Storage

Accepting a slightly more complicated processing scheme, one may work with double source buffering:
Initially, one source is the input database and the other source is empty.
A split result, which has to be created by moving and merging transactions from both sources, is always merged to the smaller source.
If both sources have become large, they may be merged in order to empty one source.

Note that SaM can easily be implemented to work on external storage:
In principle, the transactions need not be loaded into main memory.
Even the transaction array can easily be stored on external storage or as a relational database table.
The fact that the transaction array is processed linearly is advantageous for external storage operations.

Summary SaM

Basic Processing Scheme:
Depth-first traversal of the prefix tree (divide-and-conquer scheme).
Data is represented as an array of transactions (purely horizontal representation).
Support counting is done implicitly in the split step.

Advantages:
Very simple data structure and processing scheme.
Easy to implement for operation on external storage / relational databases.

Disadvantages:
Can be slow on sparse transaction databases due to the merge step.

Software

The RElim Algorithm

Recursive Elimination Algorithm [Borgelt]

Recursive Elimination: Basic Ideas

The item sets are checked in lexicographic order (depth-first traversal of the prefix tree).
Standard divide-and-conquer scheme (include/exclude items). Recursive processing of the conditional transaction databases.
Avoids the main problem of the SaM algorithm: does not use a merge operation to group transactions with the same leading item.
RElim rather maintains one list of transactions per item, thus employing the core idea of radix sort. However, only transactions starting with an item are in the corresponding list.
After an item has been processed, transactions are reassigned to other lists (based on the next item in the transaction).
RElim is in several respects similar to the LCM algorithm (as discussed before) and closely related to the H-mine algorithm (not covered in this lecture).

RElim: Preprocessing the Transaction Database

Steps 1-4 are the same as for SaM:
1. Original transaction database.
2. Frequency of individual items.
3. Items in transactions sorted ascendingly w.r.t. their frequency.
4. Transactions sorted lexicographically in descending order (comparison of items inverted w.r.t. preceding step).
5. Data structure used by the algorithm (leading items implicit in list).

RElim: Subproblem Split

[Figure: the initial database as one transaction list per item; the rightmost list is reassigned once to an initially empty list array (conditional database for the prefix e, top right) and once to the original list array (eliminating item e, bottom left).]

The subproblem split of the RElim algorithm. The rightmost list is traversed and reassigned: once to an initially empty list array (conditional database for the prefix e, see top right) and once to the original list array (eliminating item e, see bottom left). These two databases are then both processed recursively. Note that after a simple reassignment there may be duplicate list elements.

RElim: Pseudo-Code

function RElim (a: array of transaction lists, ( cond. database to process )
                p: set of items,    ( prefix of the conditional database a )
                s_min: int) : int   ( minimum support of an item set )
var i, k: item;                     ( buffer for the current item )
    s: int;                         ( support of the current item )
    n: int;                         ( number of found frequent item sets )
    b: array of transaction lists;  ( conditional database for current item )
    t, u: transaction list element; ( to traverse the transaction lists )
begin                               ( recursive elimination )
  n := 0;                           ( initialize the number of found item sets )
  while a is not empty do           ( while conditional database is not empty )
    i := last item of a; s := a[i].wgt;  ( get the next item to process )
    if s ≥ s_min then               ( if the current item is frequent: )
      p := p ∪ {i};                 ( extend the prefix item set and )
      report p with support s;      ( report the found frequent item set )
      ...                           ( create conditional database for i )
      p := p − {i};                 ( and process it recursively, )
    end;                            ( then restore the original prefix )

RElim: Pseudo-Code

  if s ≥ s_min then                 ( if the current item is frequent: )
    ...                             ( report the found frequent item set )
    b := array of transaction lists;  ( create an empty list array )
    t := a[i].head;                 ( get the list associated with the item )
    while t ≠ nil do                ( while not at the end of the list )
      u := copy of t; t := t.succ;  ( copy the transaction list element, )
      k := u.items[0];              ( go to the next list element, and )
      remove k from u.items;        ( remove the leading item from the copy )
      if u.items is not empty       ( add the copy to the conditional database )
      then u.succ := b[k].head; b[k].head := u; end;
      b[k].wgt := b[k].wgt + u.wgt; ( sum the transaction weight )
    end;                            ( in the list weight/transaction counter )
    n := n + 1 + RElim(b, p, s_min);  ( process the created database recursively )
    ...                             ( and sum the found frequent item sets, )
  end;                              ( then restore the original item set prefix )
  ...                               ( go on by reassigning )
                                    ( the processed transactions )
  t := a[i].head;                   ( get the list associated with the item )
  while t ≠ nil do                  ( while not at the end of the list )
    u := t; t := t.succ;            ( note the current list element, )
    k := u.items[0];                ( go to the next list element, and )
    remove k from u.items;          ( remove the leading item from current )
    if u.items is not empty         ( reassign the noted list element )
    then u.succ := a[k].head; a[k].head := u; end;
    a[k].wgt := a[k].wgt + u.wgt;   ( sum the transaction weight )
  end;                              ( in the list weight/transaction counter )
  remove a[i] from a;               ( remove the processed list )
end;
return n;                           ( return the number of frequent item sets )
end; ( function RElim() )

In order to remove duplicate elements, it is usually advisable to sort and compress the next transaction list before it is processed.

The k-items Machine

Introduced with the LCM algorithm (see above) to combine equal transaction suffixes.
Idea: If the number of items is small, a bucket/bin sort scheme can be used to perfectly combine equal transaction suffixes.
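A compact Python sketch of the RElim scheme in the pseudo-code above (my rendering: a dict keyed by the leading item replaces the list array, and items are processed in ascending frequency order, so that every transaction containing the current item leads with it):

```python
from collections import Counter

def relim(lists, order, prefix, s_min, out):
    # lists: {item: [(weight, suffix), ...]}, one transaction list per item,
    # keyed by the leading item; suffix = rest of the transaction
    for pos, item in enumerate(order):
        tlist = lists.pop(item, [])
        if not tlist:
            continue
        s = sum(w for w, _ in tlist)         # support = total list weight
        if s >= s_min:
            out[prefix | {item}] = s
            cond = {}                        # conditional database for `item`
            for w, t in tlist:
                if t:
                    cond.setdefault(t[0], []).append((w, t[1:]))
            relim(cond, order[pos + 1:], prefix | {item}, s_min, out)
        for w, t in tlist:                   # eliminate `item`: reassign its
            if t:                            # transactions to the other lists
                lists.setdefault(t[0], []).append((w, t[1:]))

def relim_mine(transactions, s_min):
    freq = Counter(i for t in transactions for i in t)
    order = sorted(freq, key=lambda i: (freq[i], i))   # ascending frequency
    rank = {i: r for r, i in enumerate(order)}
    lists = {}
    for t in transactions:
        t = tuple(sorted(t, key=rank.get))
        lists.setdefault(t[0], []).append((1, t[1:]))
    out = {}
    relim(lists, order, frozenset(), s_min, out)
    return out

example = [{'a','d','e'}, {'b','c','d'}, {'a','c','e'}, {'a','c','d','e'},
           {'a','e'}, {'a','c','d'}, {'b','c'}, {'a','c','d','e'},
           {'b','c','e'}, {'a','d','e'}]
found = relim_mine(example, 3)
```

Note that no merge is performed; support counting is implicit in the list weights, as in the pseudo-code.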
This scheme leads to the k-items machine (for small k):
All possible transaction suffixes are represented as bit patterns; one bucket/bin is created for each possible bit pattern.
A RElim-like processing scheme is employed (on a fixed data structure).
Leading items are extracted with a table that is indexed with the bit pattern. Items are eliminated with a bit mask.

[Figure: a 4-items machine over the items a-d: transaction weights/multiplicities and transaction lists (one per item), shown empty and after inserting the transactions of the example database; table of highest set bits for a 4-items machine (special instructions: bsr / lzcount).]

In this state the 4-items machine represents a special form of the initial transaction database of the RElim algorithm.
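A toy sketch of the bucket idea (assumptions of mine: items a-d mapped to bits 0-3, Python's `int.bit_length` standing in for bsr/lzcount, and the example transactions restricted to the items a-d):

```python
K = 4                                  # a 4-items machine over items a, b, c, d
BIT = {'a': 0, 'b': 1, 'c': 2, 'd': 3}

buckets = [0] * (1 << K)               # one weight counter per possible bit pattern

def insert(transaction, wgt=1):
    # equal transactions fall into the same bucket: perfect combination
    mask = 0
    for item in transaction:
        mask |= 1 << BIT[item]
    buckets[mask] += wgt

def leading_item(mask):
    # highest set bit = leading item (bsr / lzcount in hardware)
    return mask.bit_length() - 1

for t in [{'a','d'}, {'b','c','d'}, {'a','c'}, {'a','c','d'}, {'a'},
          {'a','c','d'}, {'b','c'}, {'a','c','d'}, {'b','c'}, {'a','d'}]:
    insert(t)    # the example transactions restricted to the items a-d
```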

The k-items Machine

[Figure: the 4-items machine after inserting the transactions, and after propagating the transaction lists.]

Propagating the transaction lists is equivalent to occurrence deliver. Conditional transaction databases are created as in RElim plus propagation.

Summary RElim

Basic Processing Scheme:
Depth-first traversal of the prefix tree (divide-and-conquer scheme).
Data is represented as lists of transactions (one per item).
Support counting is implicit in the (re)assignment step.

Advantages:
Fairly simple data structures and processing scheme.
Competitive with the fastest algorithms despite this simplicity.

Disadvantages:
RElim is usually outperformed by LCM and FP-growth (discussed later).

Software

The FP-Growth Algorithm

Frequent Pattern Growth Algorithm [Han, Pei, and Yin 2000]

FP-Growth: Basic Ideas

FP-Growth means Frequent Pattern Growth.
The item sets are checked in lexicographic order (depth-first traversal of the prefix tree).
Standard divide-and-conquer scheme (include/exclude items). Recursive processing of the conditional transaction databases.
The transaction database is represented as an FP-tree. An FP-tree is basically a prefix tree with additional structure: nodes of this tree that correspond to the same item are linked into lists. This combines a horizontal and a vertical database representation.
This data structure is used to compute conditional databases efficiently: all transactions containing a given item can easily be found by the links between the nodes corresponding to this item.

FP-Growth: Preprocessing the Transaction Database

[Figure: preprocessing example over the items a-g.]

1. Original transaction database.
2. Frequency of individual items.
3. Items in transactions sorted descendingly w.r.t. their frequency and infrequent items removed.
4. Transactions sorted lexicographically in ascending order (comparison of items is the same as in preceding step).
5. Data structure used by the algorithm (details on next slide).

Transaction Representation: FP-Tree

Build a frequent pattern tree (FP-tree) from the transactions (basically a prefix tree with links between the branches that link nodes with the same item and a header table for the resulting item lists). Frequent single item sets can be read directly from the FP-tree.

[Figure: simple example database and the resulting frequent pattern tree.]

An FP-tree combines a horizontal and a vertical transaction representation:
Horizontal representation: prefix tree of transactions.
Vertical representation: links between the prefix tree branches.
Note: the prefix tree is inverted, i.e. there are only parent pointers. Child pointers are not needed due to the processing scheme (to be discussed). In principle, all nodes referring to the same item can be stored in an array rather than a list.

Recursive Processing

The initial FP-tree is projected w.r.t. the item corresponding to the rightmost level in the tree (let this item be i). This yields an FP-tree of the conditional transaction database (database of transactions containing the item i, but with this item removed; it is implicit in the FP-tree and recorded as a common prefix).
From the projected FP-tree the frequent item sets containing item i can be read directly.
The rightmost level of the original (unprojected) FP-tree is removed (the item i is removed from the database: exclude split item).
The projected FP-tree is processed recursively; the item i is noted as a prefix that is to be added in deeper levels of the recursion.
Afterward the reduced original FP-tree is further processed by working on the next level leftward.
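The structure just described (an inverted prefix tree with parent pointers plus a header table) and the path extraction used for projection can be sketched in Python (class and helper names are my own; the item order is descending frequency on the example database):

```python
class Node:
    __slots__ = ("item", "cnt", "parent")
    def __init__(self, item, parent):
        self.item, self.cnt, self.parent = item, 0, parent

def build_fptree(transactions, order):
    # inverted prefix tree: nodes keep only parent pointers;
    # header table: item -> list of the nodes that carry this item
    rank = {i: r for r, i in enumerate(order)}
    root = Node(None, None)
    header = {i: [] for i in order}
    children = {}                   # child lookup, needed only while building
    for t in transactions:
        node = root
        for item in sorted(t, key=rank.get):
            key = (id(node), item)
            child = children.get(key)
            if child is None:
                child = Node(item, node)
                children[key] = child
                header[item].append(child)
            child.cnt += 1
            node = child
    return root, header

def project(header, item):
    # collect (count, path-to-root) for every node of `item`: these are the
    # (reduced) transactions of the conditional database for `item`
    result = []
    for node in header[item]:
        path, p = [], node.parent
        while p is not None and p.item is not None:
            path.append(p.item)
            p = p.parent
        result.append((node.cnt, tuple(reversed(path))))
    return result

example = [{'a','d','e'}, {'b','c','d'}, {'a','c','e'}, {'a','c','d','e'},
           {'a','e'}, {'a','c','d'}, {'b','c'}, {'a','c','d','e'},
           {'b','c','e'}, {'a','d','e'}]
root, header = build_fptree(example, ['a', 'c', 'e', 'd', 'b'])
```

Summing the counters along an item's node list gives its support; the extracted paths are exactly the reduced transactions that would be inserted into the conditional FP-tree.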

Projecting an FP-Tree

[Figure: FP-tree with attached projection, and the detached projection.]

By traversing the node list for the rightmost item, all transactions containing this item can be found. The FP-tree of the conditional database for this item is created by copying the nodes on the paths to the root.

A simpler, but usually equally efficient projection scheme is to extract a path to the root as a (reduced) transaction and to insert this transaction into a new FP-tree. For the insertion into the new tree, there are two approaches:
Apart from a parent pointer (which is needed for the path extraction), each node possesses a pointer to its first child and right sibling. These pointers allow to insert a new transaction top-down.
If the initial FP-tree has been built from a lexicographically sorted transaction database, the traversal of the item lists yields the (reduced) transactions in lexicographical order. This can be exploited to insert a transaction using only the header table.

By processing an FP-tree from left to right (or from top to bottom w.r.t. the prefix tree), the projection may even reuse the already present nodes and the already processed part of the header table (top-down FP-growth). In this way the algorithm can be executed on a fixed amount of memory.

Reducing the Original FP-Tree

[Figure: the FP-tree before and after removing the rightmost level.]

The original FP-tree is reduced by removing the rightmost level. This yields the conditional database for item sets not containing the item corresponding to the rightmost level.

FP-growth: Divide-and-Conquer
[Figure: conditional database for prefix e (first subproblem) and conditional database with item e removed (second subproblem).]

Pruning a Projected FP-Tree

Trivial case: If the item corresponding to the rightmost level is infrequent, the item and the FP-tree level are removed without projection.
More interesting case: An item corresponding to a middle level is infrequent, but an item on a level further to the right is frequent.

[Figure: example FP-tree with an infrequent item on a middle level, before and after pruning.]

So-called α-pruning or Bonsai pruning of a (projected) FP-tree. Implemented by left-to-right levelwise merging of nodes with same parents. Not needed if projection works by extraction and insertion.

FP-growth: Implementation Issues

Chains: If an FP-tree has been reduced to a chain, no projections are computed anymore. Rather all subsets of the set of items in the chain are formed and reported.

Rebuilding the FP-tree: An FP-tree may be projected by extracting the (reduced) transactions described by the paths to the root and inserting them into a new FP-tree (see above). This makes it possible to change the item order, with the following advantages:
No need for α- or Bonsai pruning, since the items can be reordered so that all conditionally frequent items appear on the left.
No need for perfect extension pruning, because the perfect extensions can be moved to the left and are processed at the end with the chain optimization.
However, there are also disadvantages: Either the FP-tree has to be traversed twice or pair frequencies have to be determined to reorder the items according to their conditional frequency.

FP-growth: Implementation Issues

The initial FP-tree is built from an array-based main memory representation of the transaction database (eliminates the need for child pointers). This has the disadvantage that the memory savings often resulting from an FP-tree representation cannot be fully exploited.
However, it has the advantage that no child and sibling pointers are needed and that the transactions can be inserted in lexicographic order.

Each FP-tree node has a constant size (a fixed number of integers and pointers). Allocating these through the standard memory management is wasteful. (Allocating many small memory objects is highly inefficient.) Solution: The nodes are allocated in one large array per FP-tree. As a consequence, each FP-tree resides in a single memory block. There is no allocation and deallocation of individual nodes. (This may waste some memory, but is highly efficient.)

An FP-tree can be implemented with only two integer arrays [Rácz]: one array contains the transaction counters (support values) and one array contains the parent pointers (as the indices of array elements). This reduces the memory requirements to two integers per node. Such a memory structure has advantages due to the way in which modern processors access the main memory: Linear memory accesses are faster than random accesses. Main memory is organized as a table with rows and columns. First the row is addressed and then, after some delay, the column. Accesses to different columns in the same row can skip the row addressing. However, there are also disadvantages: Programming projection and α- or Bonsai pruning becomes more complex, because less structure is available. Reordering the items is virtually ruled out.

Summary FP-Growth

Basic Processing Scheme:
- The transaction database is represented as a frequent pattern tree.
- An FP-tree is projected to obtain a conditional database.
- Recursive processing of the conditional database.
Advantages:
- Often the fastest algorithm or among the fastest algorithms.
Disadvantages:
- More difficult to implement than other approaches, complex data structure.
- An FP-tree can need more memory than a list or array of transactions.
Software

Experimental Comparison

Experiments: Data Sets

chess: A data set listing chess end game positions for king vs. king and rook. This data set is part of the UCI machine learning repository.

census (a.k.a. Adult): A data set derived from an extract of the US census bureau data of 1994, which was preprocessed by discretizing numeric attributes. This data set is part of the UCI machine learning repository.

T10I4D100K: An artificial data set generated with IBM's data generator. The name is formed from the parameters given to the generator (for example: 100K = 100,000 transactions, T10 = 10 items per transaction on average).

BMS-Webview-1: A web click stream from a leg-care company that no longer exists. It has been used in the KDD Cup 2000 and is a popular benchmark.

The density of a transaction database is the average fraction of all items occurring per transaction: density = average transaction size / number of items.

Experiments: Programs and Test System

All programs are my own implementations. All use the same code for reading the transaction database and for writing the found frequent item sets. Therefore differences in speed can only be the effect of the processing schemes. These programs and their source code can be found on my web site: Apriori, Eclat & LCM, FP-Growth, RElim, SaM. All tests were run on an Intel Core Quad machine running Ubuntu Linux LTS; the programs were compiled with GCC.

Experiments: Execution Times

[Figure: execution times of Apriori, Eclat, LCM, FP-growth, SaM and RElim on chess, census, T10I4D100K and webview; decimal logarithm of execution time in seconds over absolute minimum support.]

Experiments: k-items Machine

[Figure: execution times of Apriori, Eclat, LCM and FP-growth with and without the k-items machine on chess, census, T10I4D100K and webview; decimal logarithm of execution time in seconds over absolute minimum support.]

Reminder: Perfect Extensions

The search can be improved with so-called perfect extension pruning. Given an item set I, an item i ∉ I is called a perfect extension of I iff I and I ∪ {i} have the same support (all transactions containing I contain i). Perfect extensions have the following properties:
- If the item i is a perfect extension of an item set I, then i is also a perfect extension of any item set J ⊇ I (as long as i ∉ J).
- If I is a frequent item set and X is the set of all perfect extensions of I, then all sets I ∪ J with J ∈ 2^X (where 2^X denotes the power set of X) are also frequent and have the same support as I.
This can be exploited by collecting perfect extension items in the recursion, in a third element of a subproblem description: S = (T, P, X).
Once identified, perfect extension items are no longer processed in the recursion, but are only used to generate all supersets of the prefix having the same support.

Experiments: Perfect Extension Pruning

[Figure: execution times of Apriori, Eclat, LCM and FP-growth with and without perfect extension pruning (with and without the k-items machine) on chess, census, T10I4D100K and webview; decimal logarithm of execution time in seconds over absolute minimum support.]

Reducing the Output: Closed and Maximal Item Sets

Maximal Item Sets

Consider the set of maximal (frequent) item sets:
M_T(s_min) = { I ⊆ B | s_T(I) ≥ s_min ∧ ∀ J ⊃ I : s_T(J) < s_min }.
That is: An item set is maximal if it is frequent, but none of its proper supersets is frequent. Since with this definition we know that
∀ s_min : ∀ I ∈ F_T(s_min) : I ∉ M_T(s_min) ⇒ ∃ J ⊃ I : s_T(J) ≥ s_min,
it follows (can easily be proven by successively extending the item set I)
∀ s_min : ∀ I ∈ F_T(s_min) : ∃ J ∈ M_T(s_min) : I ⊆ J.
That is: Every frequent item set has a maximal superset. Therefore:
∀ s_min : F_T(s_min) = ⋃_{I ∈ M_T(s_min)} 2^I.

Mathematical Excursion: Maximal Elements

Let R be a subset of a partially ordered set (S, ≤). An element x ∈ R is called maximal or a maximal element of R if ∀ y ∈ R : y ≥ x ⇒ y = x. The notions minimal and minimal element are defined analogously. Maximal elements need not be unique, because there may be elements x, y ∈ R with neither x ≤ y nor y ≤ x. Infinite partially ordered sets need not possess a maximal/minimal element.

Here we consider the set F_T(s_min) as a subset of the partially ordered set (2^B, ⊆): The maximal (frequent) item sets are the maximal elements of F_T(s_min):
M_T(s_min) = { I ∈ F_T(s_min) | ∀ J ∈ F_T(s_min) : J ⊇ I ⇒ J = I }.
That is, no superset of a maximal (frequent) item set is frequent.

Maximal Item Sets: Example

transaction database:
1: {a, d, e}    2: {b, c, d}    3: {a, c, e}    4: {a, c, d, e}    5: {a, e}
6: {a, c, d}    7: {b, c}       8: {a, c, d, e}    9: {b, c, e}    10: {a, d, e}

frequent item sets (s_min = 3):
0 items: ∅: 10
1 item:  {a}: 7, {b}: 3, {c}: 7, {d}: 6, {e}: 7
2 items: {a, c}: 4, {a, d}: 5, {a, e}: 6, {b, c}: 3, {c, d}: 4, {c, e}: 4, {d, e}: 4
3 items: {a, c, d}: 3, {a, c, e}: 3, {a, d, e}: 4

The maximal item sets are: {b, c}, {a, c, d}, {a, c, e}, {a, d, e}. Every frequent item set is a subset of at least one of these sets.

Hasse Diagram and Maximal Item Sets

[Figure: Hasse diagram of the subset lattice over {a, b, c, d, e}; red boxes are maximal item sets (s_min = 3), white boxes are infrequent item sets.]

Limits of Maximal Item Sets

The set of maximal item sets captures the set of all frequent item sets, but then we know at most the support of the maximal item sets exactly. About the support of a non-maximal frequent item set we only know:
∀ s_min : ∀ I ∈ F_T(s_min) − M_T(s_min) : s_T(I) ≥ max_{J ∈ M_T(s_min), J ⊃ I} s_T(J).
This relation follows immediately from ∀ I : ∀ J ⊇ I : s_T(I) ≥ s_T(J), that is, an item set cannot have a lower support than any of its supersets. Note that in general this bound is not tight: s_T(I) > max_{J ∈ M_T(s_min), J ⊃ I} s_T(J) is possible, so the exact support of I cannot be recovered from the maximal item sets. Question: Can we find a subset of the set of all frequent item sets which also preserves knowledge of all support values?

Closed Item Sets

Consider the set of closed (frequent) item sets:
C_T(s_min) = { I ⊆ B | s_T(I) ≥ s_min ∧ ∀ J ⊃ I : s_T(J) < s_T(I) }.
That is: An item set is closed if it is frequent, but none of its proper supersets has the same support. Since with this definition we know that
∀ s_min : ∀ I ∈ F_T(s_min) : I ∉ C_T(s_min) ⇒ ∃ J ⊃ I : s_T(J) = s_T(I),
it follows (can easily be proven by successively extending the item set I)
∀ s_min : ∀ I ∈ F_T(s_min) : ∃ J ∈ C_T(s_min) : I ⊆ J.
That is: Every frequent item set has a closed superset. Therefore:
∀ s_min : F_T(s_min) = ⋃_{I ∈ C_T(s_min)} 2^I.
However, not only has every frequent item set a closed superset, but it has a closed superset with the same support:
∀ s_min : ∀ I ∈ F_T(s_min) : ∃ J ⊇ I : J ∈ C_T(s_min) ∧ s_T(J) = s_T(I).
(Proof: see (also) the considerations on the next slide.)
The set of all closed item sets preserves knowledge of all support values:
∀ s_min : ∀ I ∈ F_T(s_min) : s_T(I) = max_{J ∈ C_T(s_min), J ⊇ I} s_T(J).
Note that the weaker statement
∀ s_min : ∀ I ∈ F_T(s_min) : s_T(I) ≥ max_{J ∈ C_T(s_min), J ⊇ I} s_T(J)
follows immediately from ∀ I : ∀ J ⊇ I : s_T(I) ≥ s_T(J), that is, an item set cannot have a lower support than any of its supersets.

Alternative characterization of closed (frequent) item sets:
I closed  ⟺  s_T(I) ≥ s_min ∧ I = ⋂_{k ∈ K_T(I)} t_k.
Reminder: K_T(I) = { k ∈ {1, ..., n} | I ⊆ t_k } is the cover of I w.r.t. T.
This is derived as follows: Since ∀ k ∈ K_T(I) : I ⊆ t_k, it is obvious that ∀ s_min : ∀ I ∈ F_T(s_min) : I ⊆ ⋂_{k ∈ K_T(I)} t_k. If I ⊂ ⋂_{k ∈ K_T(I)} t_k, it is not closed, since ⋂_{k ∈ K_T(I)} t_k has the same support. On the other hand, no superset of ⋂_{k ∈ K_T(I)} t_k has the cover K_T(I).
Note that the above characterization allows us to construct for any item set the (uniquely determined) closed superset that has the same support.
Closed Item Sets: Example

transaction database and frequent item sets (s_min = 3) as in the example above:
1 item:  {a}: 7, {b}: 3, {c}: 7, {d}: 6, {e}: 7
2 items: {a, c}: 4, {a, d}: 5, {a, e}: 6, {b, c}: 3, {c, d}: 4, {c, e}: 4, {d, e}: 4
3 items: {a, c, d}: 3, {a, c, e}: 3, {a, d, e}: 4

All frequent item sets are closed with the exception of {b} and {d, e}. {b} is a subset of {b, c}; both have a support of 3 ≙ 30%. {d, e} is a subset of {a, d, e}; both have a support of 4 ≙ 40%.

Hasse Diagram and Closed Item Sets

[Figure: Hasse diagram of the subset lattice over {a, b, c, d, e}; red boxes are closed item sets (s_min = 3), white boxes are infrequent item sets.]

Reminder: Perfect Extensions

The search can be improved with so-called perfect extension pruning. Given an item set I, an item i ∉ I is called a perfect extension of I iff I and I ∪ {i} have the same support (all transactions containing I contain i). Perfect extensions have the following properties:
- If the item i is a perfect extension of an item set I, then i is also a perfect extension of any item set J ⊇ I (as long as i ∉ J).
- If I is a frequent item set and X is the set of all perfect extensions of I, then all sets I ∪ J with J ∈ 2^X (where 2^X denotes the power set of X) are also frequent and have the same support as I.
This can be exploited by collecting perfect extension items in the recursion, in a third element of a subproblem description: S = (T, P, X). Once identified, perfect extension items are no longer processed in the recursion, but are only used to generate all supersets of the prefix having the same support.

Closed Item Sets and Perfect Extensions

For the example database above: c is a perfect extension of {b}, as {b} and {b, c} both have support 3. a is a perfect extension of {d, e}, as {d, e} and {a, d, e} both have support 4. Non-closed item sets possess at least one perfect extension; closed item sets do not possess any perfect extension.
Relation of Maximal and Closed Item Sets

[Figure: two schematic diagrams of the lattice between the empty set and the item base, one marking the maximal (frequent) item sets, one marking the closed (frequent) item sets.]

The set of closed item sets is the union of the sets of maximal item sets for all minimum support values at least as large as s_min:
C_T(s_min) = ⋃_{s ∈ {s_min, s_min+1, ..., n−1, n}} M_T(s).
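The union formula above can be justified in two lines; a proof sketch in the notation of the surrounding text:

```latex
C_T(s_{\min}) \;=\; \bigcup_{s \in \{s_{\min},\, s_{\min}+1,\, \dots,\, n\}} M_T(s)
% "⊇": if I ∈ M_T(s) for some s ≥ s_min, then s_T(I) ≥ s and every
%      proper superset J ⊃ I satisfies s_T(J) < s ≤ s_T(I);
%      hence I is frequent and closed, i.e. I ∈ C_T(s_min).
% "⊆": if I ∈ C_T(s_min), choose s = s_T(I) ≥ s_min; every proper
%      superset has support < s_T(I) = s, hence I ∈ M_T(s).
```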

Mathematical Excursion: Closure Operators

A closure operator on a set S is a function cl : 2^S → 2^S which satisfies the following conditions ∀ X, Y ⊆ S:
- X ⊆ cl(X) (cl is extensive),
- X ⊆ Y ⇒ cl(X) ⊆ cl(Y) (cl is increasing or monotone),
- cl(cl(X)) = cl(X) (cl is idempotent).
A set R ⊆ S is called closed if it is equal to its closure: R is closed ⟺ R = cl(R).
The closed (frequent) item sets are induced by the closure operator cl(I) = ⋂_{k ∈ K_T(I)} t_k, restricted to the set of frequent item sets:
C_T(s_min) = { I ∈ F_T(s_min) | I = cl(I) }.

Mathematical Excursion: Galois Connections

Let (X, ⪯_X) and (Y, ⪯_Y) be two partially ordered sets. A function pair (f₁, f₂) with f₁ : X → Y and f₂ : Y → X is called a (monotone) Galois connection iff
- ∀ A₁, A₂ ∈ X : A₁ ⪯_X A₂ ⇒ f₁(A₁) ⪯_Y f₁(A₂),
- ∀ B₁, B₂ ∈ Y : B₁ ⪯_Y B₂ ⇒ f₂(B₁) ⪯_X f₂(B₂),
- ∀ A ∈ X : ∀ B ∈ Y : A ⪯_X f₂(B) ⟺ B ⪯_Y f₁(A).
A function pair (f₁, f₂) with f₁ : X → Y and f₂ : Y → X is called an anti-monotone Galois connection iff
- ∀ A₁, A₂ ∈ X : A₁ ⪯_X A₂ ⇒ f₁(A₁) ⪰_Y f₁(A₂),
- ∀ B₁, B₂ ∈ Y : B₁ ⪯_Y B₂ ⇒ f₂(B₁) ⪰_X f₂(B₂),
- ∀ A ∈ X : ∀ B ∈ Y : A ⪯_X f₂(B) ⟺ B ⪯_Y f₁(A).
In a monotone Galois connection both f₁ and f₂ are monotone; in an anti-monotone Galois connection both f₁ and f₂ are anti-monotone.

Let the two sets X and Y be power sets of some sets U and V, respectively, and let the partial orders be the subset relations on these power sets, that is, let (X, ⪯_X) = (2^U, ⊆) and (Y, ⪯_Y) = (2^V, ⊆). Then the combination f₂ ∘ f₁ : X → X of the functions of a Galois connection is a closure operator (as well as the combination f₁ ∘ f₂ : Y → Y).

(i) ∀ A ⊆ U : A ⊆ f₂(f₁(A)) (a closure operator is extensive): Since (f₁, f₂) is a Galois connection, we know
∀ A ⊆ U : ∀ B ⊆ V : A ⊆ f₂(B) ⟺ B ⊆ f₁(A).
Choose B = f₁(A): then A ⊆ f₂(f₁(A)) ⟺ f₁(A) ⊆ f₁(A), which is true.
Choose A = f₂(B): then f₂(B) ⊆ f₂(B), which is true, ⟺ B ⊆ f₁(f₂(B)).
(ii) ∀ A₁, A₂ ⊆ U : A₁ ⊆ A₂ ⇒ f₂(f₁(A₁)) ⊆ f₂(f₁(A₂)) (a closure operator is increasing or monotone): This property follows immediately from the fact that the functions f₁ and f₂ are both (anti-)monotone. If f₁ and f₂ are both monotone, we have
∀ A₁, A₂ ⊆ U : A₁ ⊆ A₂ ⇒ f₁(A₁) ⊆ f₁(A₂) ⇒ f₂(f₁(A₁)) ⊆ f₂(f₁(A₂)).
If f₁ and f₂ are both anti-monotone, we have
∀ A₁, A₂ ⊆ U : A₁ ⊆ A₂ ⇒ f₁(A₁) ⊇ f₁(A₂) ⇒ f₂(f₁(A₁)) ⊆ f₂(f₁(A₂)).

49 Mathematical Excursion: Galois onnections Galois onnections in Frequent Item Set Mining (iii) A U : f (f (f (f (A)))) = f (f (A)) (a closure operator is iempotent): onsier the partially orere sets ( B, ) an ( {,...,n}, ). Since both f f an f f are extensive (see above), we know A V : B V : A f (f (A)) f (f (f (f (A)))) B f (f (B)) f (f (f (f (B)))) hoosing B = f (A ) with A U, we obtain A U : f (A ) f (f (f (f (f (A ))))). Since (f, f ) is a Galois connection, we know A U : B V : A f (B) B f (A). hoosing A = f (f (f (f (A )))) an B = f (A ), we obtain A U : f (f (f (f (A )))) f (f (A )) f (A ) f (f (f (f (f (A ))))). }{{} =true (see above) Let f : B {,...,n}, I K T (I) = {k {,..., n} I t k } an f : {,...,n} B, J j J t j = {i B j J : i t j }. The function pair (f, f ) is an anti-monotone Galois connection: I, I B : I I f (I ) = K T (I ) K T (I ) = f (I ), J, J {,...,n} : J J f (J ) = k J t k k J t k = f (J ), I B : J {,...,n} : I f (J) = j J t j J f (I) = K T (I). As a consequence f f : B B, I k K T (I) t k is a closure operator. hristian Borgelt Frequent Pattern Mining 9 hristian Borgelt Frequent Pattern Mining 9 Galois onnections in Frequent Item Set Mining lose Item Sets / Transaction Inex Sets Likewise f f : {,...,n} {,...,n}, J K T ( j J t j ) is also a closure operator. Furthermore, if we restrict our consierations to the respective sets of close sets in both omains, that is, to the sets B = {I B I = f (f (I)) = k K T (I) t k} an T = {J {,..., n} J = f (f (J)) = K T ( j J t j )}, there exists a -to- relationship between these two sets, which is escribe by the Galois connection: f = f B is a bijection with f = f = f T. (This follows immeiately from the facts that the Galois connection escribes closure operators an that a closure operator is iempotent.) Therefore fining close item sets with a given minimum support is equivalent to fining close sets of transaction inices of a given minimum size. 
Fining close item sets with a given minimum support is equivalent to fining close sets of transaction inices of a given minimum size. lose in the item set omain B : an item set I is close if aing an item to I reuces the support compare to I; aing an item to I loses at least one trans. in K T (I) = {k {,..., n} I t k }; there is no perfect extension, that is, no (other) item that is containe in all transactions t k, k K T (I). lose in the transaction inex set omain {,...,n} : a transaction inex set K is close if aing a transaction inex to K reuces the size of the transaction intersection I K = k K t k compare to K; aing a transaction inex to K loses at least one item in I K = k K t k ; there is no perfect extension, that is, no (other) transaction that contains all items in I K = k K t k. hristian Borgelt Frequent Pattern Mining 9 hristian Borgelt Frequent Pattern Mining 9

50 Types of Frequent Item Sets: Summary Types of Frequent Item Sets: Summary Frequent Item Set Any frequent item set (support is higher than the minimal support): I frequent s T (I) s min lose (Frequent) Item Set A frequent item set is calle close if no superset has the same support: I close s T (I) s min J I : s T (J) < s T (I) Maximal (Frequent) Item Set A frequent item set is calle maximal if no superset is frequent: I maximal s T (I) s min J I : s T (J) < s min bvious relations between these types of item sets: All maximal item sets an all close item sets are frequent. All maximal item sets are close. items item items items + : {a} + : {a, c} + : {a, c, } + : {b}: {a, } + : {a, c, e} + : {c} + : {a, e} + : {a,, e} + : {} + : {b, c} + : {e} + : {c, } + : {c, e} + : {, e}: Frequent Item Set Any frequent item set (support is higher than the minimal support). lose (Frequent) Item Set (marke with + ) A frequent item set is calle close if no superset has the same support. Maximal (Frequent) Item Set (marke with ) A frequent item set is calle maximal if no superset is frequent. hristian Borgelt Frequent Pattern Mining 9 hristian Borgelt Frequent Pattern Mining 9 Searching for lose Frequent Item Sets We know that it suffices to fin the close item sets together with their support: from them all frequent item sets an their support can be retrieve. Searching for lose an Maximal Item Sets The characterization of close item sets by I close s T (I) s min I = t k k K T (I) suggests to fin them by forming all possible intersections of the transactions (of at least s min transactions). However, on stanar ata sets, approaches using this iea are rarely competitive with other methos. Special cases in which they are competitive are omains with few transactions an very many items. Examples of such a omains are gene expression analysis an the analysis of ocument collections. hristian Borgelt Frequent Pattern Mining 99 hristian Borgelt Frequent Pattern Mining

51 arpenter: Enumerating Transaction Sets The arpenter algorithm implements the intersection approach by enumerating sets of transactions (or, equivalently, sets of transaction inices), intersecting them, an removing/pruning possible uplicates (ensuring close transaction inex sets). arpenter [Pan, ong, Tung, Yang, an Zaki ] This is one with basically the same ivie-an-conquer scheme as for the item set enumeration approaches, only that it is applie to transactions (that is, items an transactions exchange their meaning [Rioult et al. ]). The task to enumerate all transaction inex sets is split into two sub-tasks: enumerate all transaction inex sets that contain the inex enumerate all transaction inex sets that o not contain the inex. These sub-tasks are then further ivie w.r.t. the transaction inex : enumerate all transaction inex sets containing both inices an, inex, but not inex, inex, but not inex, neither inex nor inex, an so on recursively. hristian Borgelt Frequent Pattern Mining hristian Borgelt Frequent Pattern Mining arpenter: Enumerating Transaction Sets arpenter: List-base Implementation All subproblems in the recursion can be escribe by triplets S = (I, K, k). K {,..., n} is a set of transaction inices, I = k K t k is their intersection, an k is a transaction inex, namely the inex of the next transaction to consier. The initial problem, with which the recursion is starte, is S = (B,, ), where B is the item base an no transactions have been intersecte yet. A subproblem S = (I, K, k ) is processe as follows: Let K = K {k } an form the intersection I = I t k. If I =, o nothing (return from recursion). If K s min, an there is no transaction t j with j {,..., n} K such that I t j (that is, K is close), report I with support s T (I ) = K. Let k = k +. If k n, then form the subproblems S = (I, K, k ) an S = (I, K, k ) an process them recursively. 
Transaction ientifier lists are use to represent the current item set I (vertical transaction representation, as in the Eclat algorithm). The intersection consists in collecting all lists with the next transaction inex k. Example: transaction atabase t t t t t t t t a b c a e b c a b c b c a b e c e transaction ientifier lists a b c e collection for K = {} a b c for K = {, }, {, } a b c hristian Borgelt Frequent Pattern Mining hristian Borgelt Frequent Pattern Mining

52 arpenter: Table-/Matrix-base Implementation arpenter: Duplicate Removal/loseness heck Represent the ata set by a n B matrix M as follows [Borgelt et al. ] {, if item i / tk, m ki = {j {k,..., n} i t j }, otherwise. Example: transaction atabase t t t t t t t t a b c a e b c a b c b c a b e c e matrix representation a b c e t t t t t t t t The current item set I is simply represente by the containe items. An intersection collects all items i I with m ki > max{, s min K }. The intersection of several transaction inex sets can yiel the same item set. The support of the item set is the size of the largest transaction inex set that yiels the item set; smaller transaction inex sets can be skippe/ignore. This is the reason for the check whether there exists a transaction t j with j {,..., n} K such that I t j. This check is split into the two checks whether there exists such a transaction t j with j > k an with j {,..., k } K. The first check is easy, because such transactions are consiere in the recursive processing which can return whether one exists. The problematic secon check is solve by maintaining a repository of alreay foun close frequent item sets. In orer to make the look-up in the repository efficient, it is lai out as a prefix tree with a flat array top level. hristian Borgelt Frequent Pattern Mining hristian Borgelt Frequent Pattern Mining Summary arpenter Basic Processing Scheme Enumeration of transactions sets (transaction ientifier sets). Intersection of the transactions in any set yiels a close item set. Duplicate removal/closeness check is one with a repository (prefix tree). Avantages Effectively linear in the number of items. Very fast for transaction atabases with many more items than transactions. Disavantages Exponential in the number of transactions. Very slow for transaction atabases with many more transactions than items. 
IsTa: Intersecting Transactions
[Mielikäinen] (simple repository, no prefix tree)
[Borgelt, Yang, Nogales-Cadenas, Carmona-Saez, and Pascual-Montano]
Software

53 Ista: umulative Transaction Intersections Ista: umulative Transaction Intersections Alternative approach: maintain a repository of all close item sets, which is upate by intersecting it with the next transaction [Mielikainen ]. To justify this approach formally, we consier the set of all close frequent item sets for s min =, that is, the set T () = {I B S T : S I = t S t}. The set T () satisfies the following simple recursive relation: () =, T {t} () = T () {t} {I s T () : I = s t}. Therefore we can start the proceure with an empty set of close item sets an then process the transactions one by one. In each step upate the set of close item sets by aing the new transaction t an the aitional close item sets that result from intersecting it with T (). In aition, the support of alreay known close item sets may have to be upate. The core implementation problem is to fin a ata structure for storing the close item sets that allows to quickly compute the intersections with a new transaction an to merge the result with the alreay store close item sets. For this we rely on a prefix tree, each noe of which represents an item set. The algorithm works on the prefix tree as follows: At the beginning an empty tree is create (ummy root noe); then the transactions are processe one by one. Each new transaction is first simply ae to the prefix tree. Any new noes create in this step are initialize with a support of zero. In the next step we compute the intersections of the new transaction with all item sets represente by the current prefix tree. A recursive proceure traverses the prefix tree selectively (epth-first) an matches the items in the tree noes with the items of the transaction. Intersecting with an inserting into the tree can be combine. 
hristian Borgelt Frequent Pattern Mining 9 hristian Borgelt Frequent Pattern Mining Ista: umulative Transaction Intersections Ista: Data Structure transaction atabase t t t e c a e b c b a : : e c a : e b c a.: e b c a c b a typeef struct noe { /* a prefix tree noe */ int step; /* most recent upate step */ int item; /* assoc. item (last in set) */ int supp; /* support of item set */ struct noe *sibling; /* successor in sibling list */ struct noe *chilren; /* list of chil noes */ } DE;.: e.: e c.: e c Stanar first chil / right sibling noe structure. Fixe size of each noe allows for optimize allocation. Flexible structure that can easily be extene c c b c c b a c c b a The step fiel inicates whether the support fiel was alreay upate. b a b a b a b a b a b a The step fiel is an incremental marker, so that it nee not be cleare in a separate traversal of the prefix tree. hristian Borgelt Frequent Pattern Mining hristian Borgelt Frequent Pattern Mining

54 Ista: Pseuo-oe Ista: Pseuo-oe voi isect (DE noe, DE **ins) { /* intersect with transaction */ int i; /* buffer for current item */ DE *; /* to allocate new noes */ while (noe) { /* traverse the sibling list */ i = noe->item; /* get the current item */ if (trans[i]) { /* if item is in intersection */ while (( = *ins) && ( item > i)) ins = &->sibling; /* fin the insertion position */ if ( /* if an intersection noe with */ && (->item == i)) { /* the item alreay exists */ if (->step >= step) ->supp--; if (->supp < noe->supp) ->supp = noe->supp; ->supp++; /* upate intersection support */ ->step = step; } /* an set current upate step */ else { /* if there is no corresp. noe */ = malloc(sizeof(de)); ->step = step; /* create a new noe an */ ->item = i; /* set item an support */ ->supp = noe->supp+; ->sibling = *ins; *ins = ; ->chilren = ULL; } /* insert noe into the tree */ if (i <= imin) return; /* if beyon last item, abort */ isect(noe->chilren, &->chilren); } else { /* if item is not in intersection */ if (i <= imin) return; /* if beyon last item, abort */ isect(noe->chilren, ins); } /* intersect with subtree */ noe = noe->sibling; /* go to the next sibling */ } /* en of while (noe) */ } /* isect() */ hristian Borgelt Frequent Pattern Mining hristian Borgelt Frequent Pattern Mining Ista: Keeping the Repository Small Ista: Keeping the Repository Small In practice we will not work with a minimum support s min =. Removing intersections early, because they o not reach the minimum support is ifficult: in principle, enough of the transactions to be processe in the future coul contain the item set uner consieration. Improve processing with item occurrence counters: In an initial pass the frequency of the iniviual items is etermine. The obtaine counters are upate with each processe transaction. They always represent the item occurrences in the unprocesse transactions. 
Base on these counters, we can apply the following pruning scheme: Suppose that after having processe k of a total of n transactions the support of a close item set I is s Tk (I) = x. Let y be the minimum of the counter values for the items containe in I. If x + y < s min, then I can be iscare, because it cannot reach s min. ne has to be careful, though, because I may be neee in orer to form subsets, namely those that result from intersections of it with new transactions. These subsets may still be frequent, even though I is not. As a consequence, an item set I is not simply remove, but those items are selectively remove from it that o not occur frequently enough in the remaining transactions. Although in this way non-close item sets may be constructe, no problems for the final output are create: either the reuce item set also occurs as the intersection of enough transactions an thus is close, or it will not reach the minimum support threshol an then it will not be reporte. hristian Borgelt Frequent Pattern Mining hristian Borgelt Frequent Pattern Mining

55 Summary Ista Experimental omparison: Data Sets Basic Processing Scheme umulative intersection of transactions (incremental/on-line/stream mining). ombine intersection an repository extensions (one traversal). Aitional pruning is possible for batch processing. Avantages Effectively linear in the number of items. Very fast for transaction atabases with many more items than transactions. Disavantages Exponential in the number of transactions. Very slow for transaction atabases with many more transactions than items Software Yeast Gene expression ata for baker s yeast (saccharomyces cerevisiae). transactions (experimental conitions), about, items (genes) I Gene expression ata from the Stanfor I ancer Microarray Project. transactions (experimental conitions), about, items (genes) Thrombin hemical fingerprints of compouns (not) bining to Thrombin (a.k.a. fibrinogenase, (activate) bloo-coagulation factor II etc.). 99 transactions (compouns), 9, items (binary features) BMS-Webview- transpose A web click stream from a leg-care company that no longer exists. 9 transactions (originally items), 9 items (originally transactions). hristian Borgelt Frequent Pattern Mining hristian Borgelt Frequent Pattern Mining Experimental omparison: Programs an Test System Experimental omparison: Execution Times The arpenter an IsTa programs are my own implementations. Both use the same coe for reaing the transaction atabase an for writing the foun frequent item sets. These programs an their source coe can be foun on my web site: arpenter IsTa yeast IsTa arp. table arp. lists FP-close LM nci IsTa arp. table arp. lists The versions of FP-close (FP-growth with filtering for close frequent item sets) an LM have been taken from the Frequent Itemset Mining Implementations (FIMI) Repository (see FP-close won the FIMI Workshop competition in, LM in. thrombin IsTa arp. table arp. lists FP-close LM webview tpo. IsTa arp. table arp. 
lists FP-close LM All tests were run on an Intel ore Qua Q9@GHz with GB memory running Ubuntu Linux. LTS ( bit); programs were compile with G... Decimal logarithm of execution time in secons over absolute minimum support. hristian Borgelt Frequent Pattern Mining 9 hristian Borgelt Frequent Pattern Mining

Searching for Closed and Maximal Item Sets with Item Set Enumeration

Filtering Frequent Item Sets: If only closed item sets or only maximal item sets are to be found with item set enumeration approaches, the found frequent item sets have to be filtered.

Some useful notions for filtering and pruning:

The head H ⊆ B of a search tree node is the set of items on the path leading to it. It is the prefix of the conditional database for this node.

The tail L ⊆ B of a search tree node is the set of items that are frequent in its conditional database. They are the possible extensions of H. Note that ∀h ∈ H: ∀l ∈ L: h < l (provided the split items are chosen according to a fixed order).

E = {i ∈ B − H | ∃h ∈ H: h > i} is the set of excluded items. These items are not considered anymore in the corresponding subtree.

Note that the items in the tail and their support in the conditional database are known, at least after the search returns from the recursive processing.

Head, Tail and Excluded Items

A (full) prefix tree for the five items a, b, c, d, e; the blue boxes are the frequent item sets. For the encircled search tree nodes we have:
red: head H = {b}, tail L = {c}, excluded items E = {a};
green: head H = {a, c}, tail L = {d, e}, excluded items E = {b}.

Closed and Maximal Item Sets

When filtering frequent item sets for closed and maximal item sets the following conditions are easy and efficient to check:

If the tail of a search tree node is not empty, its head is not a maximal item set.

If an item in the tail of a search tree node has the same support as the head, the head is not a closed item set.

However, the inverse implications need not hold:

If the tail of a search tree node is empty, its head is not necessarily a maximal item set.

If no item in the tail of a search tree node has the same support as the head, the head is not necessarily a closed item set.
The problem is the excluded items, which can still render the head non-closed or non-maximal.
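The two easy tail-based checks can be sketched as follows (a hypothetical helper; it yields only necessary conditions, since, as noted above, the excluded items are not inspected):

```python
def tail_filter(head_support, tail_supports):
    """Tail-based filtering at a search tree node (illustrative sketch).

    head_support  : support of the head item set H
    tail_supports : dict mapping each tail item i to the support of H + {i}

    Returns (maybe_maximal, maybe_closed). Both flags are necessary
    conditions only: the excluded items can still render H non-maximal
    or non-closed.
    """
    # a non-empty tail means H has a frequent proper superset -> not maximal
    maybe_maximal = len(tail_supports) == 0
    # a tail item with the same support as H means H is not closed
    maybe_closed = all(s < head_support for s in tail_supports.values())
    return maybe_maximal, maybe_closed
```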

Closed and Maximal Item Sets

Check the Defining Condition Directly:

Closed Item Sets: Check whether ∃i ∈ E: K_T(H) ⊆ K_T({i}), or check whether ∩_{k ∈ K_T(H)} (t_k − H) ≠ ∅. If either is the case, H is not closed, otherwise it is. Note that the intersection can be computed transaction by transaction. It can be concluded that H is closed as soon as the intersection becomes empty.

Maximal Item Sets: Check whether ∃i ∈ E: s_T(H ∪ {i}) ≥ s_min. If this is the case, H is not maximal, otherwise it is.

Checking the defining condition directly is trivial for the tail items, as their support values are available from the conditional transaction databases. As a consequence, all item set enumeration approaches for closed and maximal item sets check the defining condition for the tail items. However, checking the defining condition can be difficult for the excluded items, since additional data (beyond the conditional transaction database) is needed to determine their occurrences in the transactions or their support values. It can depend on the database structure used whether a check of the defining condition is efficient for the excluded items or not. As a consequence, some item set enumeration algorithms do not check the defining condition for the excluded items, but rely on a repository of already found closed or maximal item sets. With such a repository it can be checked in an indirect way whether an item set is closed or maximal.

Checking the Excluded Items: Repository

Each found maximal or closed item set is stored in a repository. (Preferred data structure for the repository: prefix tree.) It is checked whether a superset of the head H with the same support has already been found. If yes, the head H is neither closed nor maximal. Even more: the head H need not be processed recursively, because the recursion cannot yield any closed or maximal item sets. Therefore the current subtree of the search tree can be pruned.
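The direct checks of the defining conditions can be sketched as follows (illustrative brute-force helpers, not Borgelt's implementation; `covering_transactions` stands for the transactions indexed by the cover K_T(H)):

```python
def is_closed(head, covering_transactions):
    """H is closed iff the intersection of (t_k - H) over its cover is empty.

    head : item set H as a Python set
    covering_transactions : all transactions t_k that contain H
    """
    extension = None  # items common to all covering transactions besides H
    for t in covering_transactions:
        rest = set(t) - head
        extension = rest if extension is None else extension & rest
        if not extension:        # intersection already empty -> H is closed
            return True
    # a non-empty intersection means a perfect extension exists -> not closed
    return not extension

def is_maximal(extension_supports, s_min):
    """H is maximal iff no single-item extension reaches s_min.

    extension_supports : supports s_T(H + {i}) for all items i not in H
    """
    return all(s < s_min for s in extension_supports)
```

As the text notes, the intersection in `is_closed` can be evaluated transaction by transaction, allowing an early positive answer.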
Note that with a repository the depth-first search has to proceed from left to right. We need the repository to check for possibly existing closed or maximal supersets that contain one or more excluded item(s). Item sets containing excluded items are considered only in search tree branches to the left of the considered node. Therefore these branches must already have been processed in order to ensure that possible supersets have already been recorded.

A (full) prefix tree for the five items a, b, c, d, e. Suppose the prefix tree were traversed from right to left. For none of the frequent item sets {d, e}, {c, d} and {c, e} could it be determined with the help of a repository that they are not maximal, because the maximal item sets {a, c, d}, {a, c, e}, {a, d, e} would not have been processed yet.
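The repository check itself can be sketched as follows (an illustrative flat-list version; as stated above, a real implementation would prefer a prefix tree):

```python
def known_superset_with_same_support(repository, head, head_support):
    """Repository-based indirect check (illustrative sketch).

    repository   : list of (item_set, support) pairs for already found
                   closed or maximal item sets
    head         : current head H as a Python set
    head_support : support s_T(H)

    Returns True if a proper superset of H with the same support is
    already recorded -- then H is neither closed nor maximal and the
    current subtree of the search tree can be pruned.
    """
    return any(head < s and sup == head_support   # proper superset, equal support
               for s, sup in repository)
```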

Checking the Excluded Items: Repository

If a superset of the current head H with the same support has already been found, the head H need not be processed, because it cannot yield any maximal or closed item sets. The reason is that a found proper superset I ⊃ H with s_T(I) = s_T(H) contains at least one item i ∈ I − H that is a perfect extension of H. The item i is an excluded item, that is, i ∉ L (item i is not in the tail). (If i were in L, the set I would not be in the repository already.) If the item i is a perfect extension of the head H, it is a perfect extension of all supersets J ⊇ H with i ∉ J. All item sets explored from the search tree node with head H and tail L are subsets of H ∪ L (because only the items in L are conditionally frequent). Consequently, the item i is a perfect extension of all item sets explored from the search tree node with head H and tail L, and therefore none of them can be closed.

It is usually advantageous to use not just a single, global repository, but to create conditional repositories for each recursive call, which contain only the found closed item sets that contain H. With conditional repositories the check for a known superset reduces to the check whether the conditional repository contains an item set with the next split item and the same support as the current head. (Note that the check is executed before going into recursion, that is, before constructing the extended head of a child node. If the check finds a superset, the child node is pruned.) The conditional repositories are obtained by basically the same operation as the conditional transaction databases (projecting/conditioning on the split item). A popular structure for the repository is an FP-tree, because it allows for simple and efficient projection/conditioning. However, a simple prefix tree that is projected top-down may also be used.
Closed and Maximal Item Sets: Pruning

If only closed item sets or only maximal item sets are to be found, additional pruning of the search tree becomes possible.

Perfect Extension Pruning / Parent Equivalence Pruning (PEP)

Given an item set I, an item i ∉ I is called a perfect extension of I iff the item sets I and I ∪ {i} have the same support: s_T(I) = s_T(I ∪ {i}) (that is, if all transactions containing I also contain the item i). Then we know: ∀J ⊇ I: s_T(J ∪ {i}) = s_T(J). As a consequence, no superset J ⊇ I with i ∉ J can be closed. Hence i can be added directly to the prefix of the conditional database. Let X_T(I) = {i | i ∉ I ∧ s_T(I ∪ {i}) = s_T(I)} be the set of all perfect extension items. Then the whole set X_T(I) can be added to the prefix. Perfect extension / parent equivalence pruning can be applied for both closed and maximal item sets, since all maximal item sets are closed.

Head Union Tail Pruning

If only maximal item sets are to be found, even more additional pruning of the search tree becomes possible.

General idea: All frequent item sets in the subtree rooted at a node with head H and tail L are subsets of H ∪ L.

Maximal Item Set Contains Head ∪ Tail Pruning (MFIHUT): If we find out that H ∪ L is a subset of an already found maximal item set, the whole subtree can be pruned. This pruning method requires a left to right traversal of the prefix tree.

Frequent Head ∪ Tail Pruning (FHUT): If H ∪ L is not a subset of an already found maximal item set and by some clever means we discover that H ∪ L is frequent, H ∪ L can immediately be recorded as a maximal item set.
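The computation of the perfect extension items X_T(I) can be sketched as follows (a hypothetical helper; it assumes the supports of all single-item extensions of the current prefix are already known from the conditional database):

```python
def perfect_extensions(prefix_support, extension_supports):
    """Perfect extension / parent equivalence pruning (illustrative sketch).

    prefix_support     : support s_T(I) of the current prefix I
    extension_supports : dict mapping each item i not in I to s_T(I + {i})

    Items whose addition does not change the support are perfect
    extensions; they can be moved into the prefix directly instead of
    being branched on.
    """
    return {i for i, s in extension_supports.items() if s == prefix_support}
```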

Alternative Description of Closed Item Set Mining

In order to avoid redundant search in the partially ordered set (2^B, ⊆), we assigned a unique parent item set to each item set (except the empty set). Analogously, we may structure the set of closed item sets by assigning unique closed parent item sets. [Uno et al.]

Let ≤ be an item order and let I be a closed item set with I ⊃ ∩_{1≤k≤n} t_k. Let i* ∈ I be the (uniquely determined) item satisfying s_T({i ∈ I | i < i*}) > s_T(I) and s_T({i ∈ I | i ≤ i*}) = s_T(I). Intuitively, the item i* is the greatest item in I that is not a perfect extension. (All items greater than i* can be removed without affecting the support.) Let I* = {i ∈ I | i < i*} and X_T(I) = {i ∈ B − I | s_T(I ∪ {i}) = s_T(I)}. Then the canonical parent π_c(I) of I is the item set π_c(I) = I* ∪ {i ∈ X_T(I*) | i > i*}. Intuitively, to find the canonical parent of the item set I, the reduced item set I* is enhanced by all perfect extension items following the item i*.

Note that ∩_{1≤k≤n} t_k is the smallest closed item set for a given database T. Note also that the set {i ∈ X_T(I*) | i > i*} need not contain all items i > i*, because a perfect extension of I* ∪ {i*} need not be a perfect extension of I*, since K_T(I*) ⊇ K_T(I* ∪ {i*}).

For the recursive search, the following formulation is useful: Let I ⊆ B be a closed item set. The canonical children of I (that is, the closed item sets that have I as their canonical parent) are the item sets J = I ∪ {i} ∪ {j ∈ X_T(I ∪ {i}) | j > i} with ∀j ∈ I: i > j and {j ∈ X_T(I ∪ {i}) | j < i} = ∅. The union with {j ∈ X_T(I ∪ {i}) | j > i} represents perfect extension or parent equivalence pruning: all perfect extensions in the tail of I ∪ {i} are immediately added. The condition {j ∈ X_T(I ∪ {i}) | j < i} = ∅ expresses that there must not be any perfect extensions among the excluded items.

Experiments: Reminder

Chess: A data set listing chess end game positions for king vs. king and rook. This data set is part of the UCI machine learning repository.

Census: A data set derived from an extract of the US census bureau data, which was preprocessed by discretizing numeric attributes. This data set is part of the UCI machine learning repository.

TIDK: An artificial data set generated with IBM's data generator. The name is formed from the parameters given to the generator (among them the number of transactions).

BMS-Webview: A web click stream from a leg-care company that no longer exists. It has been used in the KDD cup and is a popular benchmark.

All tests were run on an Intel Core Quad Q9 system running Ubuntu Linux LTS; programs were compiled with GCC.

Types of Frequent Item Sets: Plots show the decimal logarithm of the number of frequent, closed, and maximal item sets over absolute minimum support for the chess, census, TIDK, and webview data sets.
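Returning to the canonical parent π_c(I) of a closed item set defined earlier, a small brute-force sketch (helper names are mine; items are assumed to be totally ordered, e.g. strings, and I is assumed to be a closed set with positive support that properly contains the intersection of all transactions):

```python
def support(itemset, transactions):
    """Absolute support s_T(I): number of transactions containing I."""
    return sum(1 for t in transactions if itemset <= t)

def canonical_parent(I, transactions):
    """Canonical parent of a closed item set (sketch following the slides)."""
    items = sorted(I)
    s_I = support(I, transactions)
    # i* is the item where the support of the growing prefix of I
    # first drops to s_T(I); all items after i* are perfect extensions
    for k in range(len(items)):
        if support(set(items[:k + 1]), transactions) == s_I:
            i_star = items[k]
            break
    I_star = set(items[:items.index(i_star)])      # I* = {i in I | i < i*}
    s_I_star = support(I_star, transactions)
    candidates = set().union(*transactions) - I_star
    perfect = {i for i in candidates
               if support(I_star | {i}, transactions) == s_I_star}
    # parent = I* plus the perfect extensions of I* that follow i*
    return I_star | {i for i in perfect if i > i_star}
```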

Experiments: Mining Closed Item Sets / Mining Maximal Item Sets

Plots compare Apriori, Eclat, LCM, and FPgrowth on the chess, census, TIDK, and webview data sets; shown is the decimal logarithm of execution time in seconds over absolute minimum support.

Additional Frequent Item Set Filtering

General problem of frequent item set mining: The number of frequent item sets, even the number of closed or maximal item sets, can exceed the number of transactions in the database by far.

Therefore: Additional filtering is necessary to find the relevant or interesting frequent item sets.

General idea: Compare support to expectation. Item sets consisting of items that appear frequently are likely to have a high support. However, this is not surprising: we expect this even if the occurrence of the items is independent. Additional filtering should remove item sets with a support close to the support expected from an independent occurrence.

Additional Frequent Item Set Filtering

Full Independence: Evaluate item sets with

fi(I) = s_T(I) · n^{|I|−1} / ∏_{i∈I} s_T({i}) = p̂_T(I) / ∏_{i∈I} p̂_T({i})

and require a minimum value for this measure. (p̂_T is the probability estimate based on T.) This assumes full independence of the items in order to form an expectation about the support of an item set.

Advantage: Can be computed from only the support of the item set and the support values of the individual items.

Disadvantage: If some item set I scores high on this measure, then all J ⊇ I are also likely to score high, even if the items in J − I are independent of I.

Incremental Independence: Evaluate item sets with

ii(I) = min_{i∈I} n · s_T(I) / (s_T(I − {i}) · s_T({i})) = min_{i∈I} p̂_T(I) / (p̂_T(I − {i}) · p̂_T({i}))

and require a minimum value for this measure.

Advantage: If I contains independent items, the minimum ensures a low value.

Disadvantages: We need to know the support values of all subsets I − {i}. If there exist high scoring independent subsets I₁ and I₂ with |I₁| > 1, |I₂| > 1, I₁ ∩ I₂ = ∅ and I₁ ∪ I₂ = I, the item set I still receives a high evaluation.

Subset Independence: Evaluate item sets with

si(I) = min_{J⊂I, J≠∅} n · s_T(I) / (s_T(I − J) · s_T(J)) = min_{J⊂I, J≠∅} p̂_T(I) / (p̂_T(I − J) · p̂_T(J))

and require a minimum value for this measure.

Advantage: Detects all cases where a decomposition is possible and evaluates them with a low value.

Disadvantages: We need to know the support values of all proper subsets J.

Improvement: Use incremental independence and in the minimum consider only items {i} for which I − {i} has been evaluated high. This captures subset independence incrementally.
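The three filtering measures can be sketched with brute-force support counting (illustrative helpers, not an official implementation; I is assumed to have positive support so that no division by zero occurs):

```python
from itertools import combinations

def _supp(itemset, transactions):
    """Absolute support s_T(I) of an item set."""
    return sum(1 for t in transactions if itemset <= t)

def fi(I, transactions):
    """Full independence: observed probability of I divided by the
    probability expected if all its items occurred independently."""
    n = len(transactions)
    expected = 1.0
    for i in I:
        expected *= _supp({i}, transactions) / n
    return (_supp(I, transactions) / n) / expected

def ii(I, transactions):
    """Incremental independence: minimum lift over removing one item."""
    n = len(transactions)
    return min(n * _supp(I, transactions)
               / (_supp(I - {i}, transactions) * _supp({i}, transactions))
               for i in I)

def si(I, transactions):
    """Subset independence: minimum over all non-empty proper subsets J."""
    n = len(transactions)
    s_I = _supp(I, transactions)
    items = list(I)
    return min(n * s_I
               / (_supp(I - set(J), transactions) * _supp(set(J), transactions))
               for r in range(1, len(items))
               for J in combinations(items, r))
```

For a pair of items that occur exactly as often together as independence predicts, all three measures evaluate to 1.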
Summary: Frequent Item Set Mining

With a canonical form of an item set the Hasse diagram can be turned into a much simpler prefix tree (divide-and-conquer scheme using conditional databases).

Item set enumeration algorithms differ in:
the traversal order of the prefix tree (breadth-first/levelwise versus depth-first traversal);
the transaction representation: horizontal (item arrays) versus vertical (transaction lists) versus specialized data structures like FP-trees;
the types of frequent item sets found: frequent versus closed versus maximal item sets (additional pruning methods for closed and maximal item sets).

An alternative is transaction set enumeration or intersection algorithms. Additional filtering is necessary to reduce the size of the output.

Example Application: Finding Neuron Assemblies in Neural Spike Data

Biological Background

Diagram of a typical myelinated vertebrate motoneuron (source: Wikipedia, Ruiz-Villarreal), showing the main parts involved in its signaling activity like the dendrites, the axon, and the synapses.

Structure of a prototypical neuron (simplified), with terminal buttons, synapses, dendrites, nucleus, axon, myelin sheath, and cell body (soma).

(Very) simplified description of neural information processing:

The axon terminal releases chemicals, called neurotransmitters. These act on the membrane of the receptor dendrite to change its polarization. (The inside is usually more negative than the outside.) Decrease in potential difference: excitatory synapse. Increase in potential difference: inhibitory synapse.

If there is enough net excitatory input, the axon is depolarized. The resulting action potential travels along the axon. (Speed depends on the degree to which the axon is covered with myelin.) When the action potential reaches the terminal buttons, it triggers the release of neurotransmitters.


More information

Loop Scheduling and Partitions for Hiding Memory Latencies

Loop Scheduling and Partitions for Hiding Memory Latencies Loop Scheuling an Partitions for Hiing Memory Latencies Fei Chen Ewin Hsing-Mean Sha Dept. of Computer Science an Engineering University of Notre Dame Notre Dame, IN 46556 Email: fchen,esha @cse.n.eu Tel:

More information

Polygon Simplification by Minimizing Convex Corners

Polygon Simplification by Minimizing Convex Corners Polygon Simplification by Minimizing Convex Corners Yeganeh Bahoo 1, Stephane Durocher 1, J. Mark Keil 2, Saee Mehrabi 3, Sahar Mehrpour 1, an Debajyoti Monal 1 1 Department of Computer Science, University

More information

A shortest path algorithm in multimodal networks: a case study with time varying costs

A shortest path algorithm in multimodal networks: a case study with time varying costs A shortest path algorithm in multimoal networks: a case stuy with time varying costs Daniela Ambrosino*, Anna Sciomachen* * Department of Economics an Quantitative Methos (DIEM), University of Genoa Via

More information

Throughput Characterization of Node-based Scheduling in Multihop Wireless Networks: A Novel Application of the Gallai-Edmonds Structure Theorem

Throughput Characterization of Node-based Scheduling in Multihop Wireless Networks: A Novel Application of the Gallai-Edmonds Structure Theorem Throughput Characterization of Noe-base Scheuling in Multihop Wireless Networks: A Novel Application of the Gallai-Emons Structure Theorem Bo Ji an Yu Sang Dept. of Computer an Information Sciences Temple

More information

Chapter 6: Association Rules

Chapter 6: Association Rules Chapter 6: Association Rules Association rule mining Proposed by Agrawal et al in 1993. It is an important data mining model. Transaction data (no time-dependent) Assume all data are categorical. No good

More information

Lecture 1 September 4, 2013

Lecture 1 September 4, 2013 CS 84r: Incentives an Information in Networks Fall 013 Prof. Yaron Singer Lecture 1 September 4, 013 Scribe: Bo Waggoner 1 Overview In this course we will try to evelop a mathematical unerstaning for the

More information

Recitation Caches and Blocking. 4 March 2019

Recitation Caches and Blocking. 4 March 2019 15-213 Recitation Caches an Blocking 4 March 2019 Agena Reminers Revisiting Cache Lab Caching Review Blocking to reuce cache misses Cache alignment Reminers Due Dates Cache Lab (Thursay 3/7) Miterm Exam

More information

Association Rule Mining: FP-Growth

Association Rule Mining: FP-Growth Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong We have already learned the Apriori algorithm for association rule mining. In this lecture, we will discuss a faster

More information

Solution Representation for Job Shop Scheduling Problems in Ant Colony Optimisation

Solution Representation for Job Shop Scheduling Problems in Ant Colony Optimisation Solution Representation for Job Shop Scheuling Problems in Ant Colony Optimisation James Montgomery, Carole Faya 2, an Sana Petrovic 2 Faculty of Information & Communication Technologies, Swinburne University

More information

CS269I: Incentives in Computer Science Lecture #8: Incentives in BGP Routing

CS269I: Incentives in Computer Science Lecture #8: Incentives in BGP Routing CS269I: Incentives in Computer Science Lecture #8: Incentives in BGP Routing Tim Roughgaren October 19, 2016 1 Routing in the Internet Last lecture we talke about elay-base (or selfish ) routing, which

More information

Overlap Interval Partition Join

Overlap Interval Partition Join Overlap Interval Partition Join Anton Dignös Department of Computer Science University of Zürich, Switzerlan aignoes@ifi.uzh.ch Michael H. Böhlen Department of Computer Science University of Zürich, Switzerlan

More information

Almost Disjunct Codes in Large Scale Multihop Wireless Network Media Access Control

Almost Disjunct Codes in Large Scale Multihop Wireless Network Media Access Control Almost Disjunct Coes in Large Scale Multihop Wireless Network Meia Access Control D. Charles Engelhart Anan Sivasubramaniam Penn. State University University Park PA 682 engelhar,anan @cse.psu.eu Abstract

More information

PERFECT ONE-ERROR-CORRECTING CODES ON ITERATED COMPLETE GRAPHS: ENCODING AND DECODING FOR THE SF LABELING

PERFECT ONE-ERROR-CORRECTING CODES ON ITERATED COMPLETE GRAPHS: ENCODING AND DECODING FOR THE SF LABELING PERFECT ONE-ERROR-CORRECTING CODES ON ITERATED COMPLETE GRAPHS: ENCODING AND DECODING FOR THE SF LABELING PAMELA RUSSELL ADVISOR: PAUL CULL OREGON STATE UNIVERSITY ABSTRACT. Birchall an Teor prove that

More information

A Plane Tracker for AEC-automation Applications

A Plane Tracker for AEC-automation Applications A Plane Tracker for AEC-automation Applications Chen Feng *, an Vineet R. Kamat Department of Civil an Environmental Engineering, University of Michigan, Ann Arbor, USA * Corresponing author (cforrest@umich.eu)

More information

Distributed Planning in Hierarchical Factored MDPs

Distributed Planning in Hierarchical Factored MDPs Distribute Planning in Hierarchical Factore MDPs Carlos Guestrin Computer Science Dept Stanfor University guestrin@cs.stanfor.eu Geoffrey Goron Computer Science Dept Carnegie Mellon University ggoron@cs.cmu.eu

More information

An Algorithm for Building an Enterprise Network Topology Using Widespread Data Sources

An Algorithm for Building an Enterprise Network Topology Using Widespread Data Sources An Algorithm for Builing an Enterprise Network Topology Using Wiesprea Data Sources Anton Anreev, Iurii Bogoiavlenskii Petrozavosk State University Petrozavosk, Russia {anreev, ybgv}@cs.petrsu.ru Abstract

More information

DATA MINING II - 1DL460

DATA MINING II - 1DL460 DATA MINING II - 1DL460 Spring 2013 " An second class in data mining http://www.it.uu.se/edu/course/homepage/infoutv2/vt13 Kjell Orsborn Uppsala Database Laboratory Department of Information Technology,

More information

Backpressure-based Packet-by-Packet Adaptive Routing in Communication Networks

Backpressure-based Packet-by-Packet Adaptive Routing in Communication Networks 1 Backpressure-base Packet-by-Packet Aaptive Routing in Communication Networks Eleftheria Athanasopoulou, Loc Bui, Tianxiong Ji, R. Srikant, an Alexaner Stolyar Abstract Backpressure-base aaptive routing

More information

A Classification of 3R Orthogonal Manipulators by the Topology of their Workspace

A Classification of 3R Orthogonal Manipulators by the Topology of their Workspace A Classification of R Orthogonal Manipulators by the Topology of their Workspace Maher aili, Philippe Wenger an Damien Chablat Institut e Recherche en Communications et Cybernétique e Nantes, UMR C.N.R.S.

More information

Figure 1: 2D arm. Figure 2: 2D arm with labelled angles

Figure 1: 2D arm. Figure 2: 2D arm with labelled angles 2D Kinematics Consier a robotic arm. We can sen it commans like, move that joint so it bens at an angle θ. Once we ve set each joint, that s all well an goo. More interesting, though, is the question of

More information

Mining Association Rules in Large Databases

Mining Association Rules in Large Databases Mining Association Rules in Large Databases Association rules Given a set of transactions D, find rules that will predict the occurrence of an item (or a set of items) based on the occurrences of other

More information

Offloading Cellular Traffic through Opportunistic Communications: Analysis and Optimization

Offloading Cellular Traffic through Opportunistic Communications: Analysis and Optimization 1 Offloaing Cellular Traffic through Opportunistic Communications: Analysis an Optimization Vincenzo Sciancalepore, Domenico Giustiniano, Albert Banchs, Anreea Picu arxiv:1405.3548v1 [cs.ni] 14 May 24

More information

PART 2. Organization Of An Operating System

PART 2. Organization Of An Operating System PART 2 Organization Of An Operating System CS 503 - PART 2 1 2010 Services An OS Supplies Support for concurrent execution Facilities for process synchronization Inter-process communication mechanisms

More information

Distributed Line Graphs: A Universal Technique for Designing DHTs Based on Arbitrary Regular Graphs

Distributed Line Graphs: A Universal Technique for Designing DHTs Based on Arbitrary Regular Graphs IEEE TRANSACTIONS ON KNOWLEDE AND DATA ENINEERIN, MANUSCRIPT ID Distribute Line raphs: A Universal Technique for Designing DHTs Base on Arbitrary Regular raphs Yiming Zhang an Ling Liu, Senior Member,

More information

Robust Camera Calibration for an Autonomous Underwater Vehicle

Robust Camera Calibration for an Autonomous Underwater Vehicle obust Camera Calibration for an Autonomous Unerwater Vehicle Matthew Bryant, Davi Wettergreen *, Samer Aballah, Alexaner Zelinsky obotic Systems Laboratory Department of Engineering, FEIT Department of

More information

Mining Frequent Patterns without Candidate Generation

Mining Frequent Patterns without Candidate Generation Mining Frequent Patterns without Candidate Generation Outline of the Presentation Outline Frequent Pattern Mining: Problem statement and an example Review of Apriori like Approaches FP Growth: Overview

More information

1 Surprises in high dimensions

1 Surprises in high dimensions 1 Surprises in high imensions Our intuition about space is base on two an three imensions an can often be misleaing in high imensions. It is instructive to analyze the shape an properties of some basic

More information

Data Mining Part 3. Associations Rules

Data Mining Part 3. Associations Rules Data Mining Part 3. Associations Rules 3.2 Efficient Frequent Itemset Mining Methods Fall 2009 Instructor: Dr. Masoud Yaghini Outline Apriori Algorithm Generating Association Rules from Frequent Itemsets

More information

Frequent Pattern Mining. Based on: Introduction to Data Mining by Tan, Steinbach, Kumar

Frequent Pattern Mining. Based on: Introduction to Data Mining by Tan, Steinbach, Kumar Frequent Pattern Mining Based on: Introduction to Data Mining by Tan, Steinbach, Kumar Item sets A New Type of Data Some notation: All possible items: Database: T is a bag of transactions Transaction transaction

More information

Association Pattern Mining. Lijun Zhang

Association Pattern Mining. Lijun Zhang Association Pattern Mining Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction The Frequent Pattern Mining Model Association Rule Generation Framework Frequent Itemset Mining Algorithms

More information

A General Technique for Non-blocking Trees

A General Technique for Non-blocking Trees A General Technique for Non-blocking Trees Trevor rown an Faith Ellen University of Toronto, Canaa Eric Ruppert York University, Canaa Abstract We escribe a general technique for obtaining provably correct,

More information

Inuence of Cross-Interferences on Blocked Loops: to know the precise gain brought by blocking. It is even dicult to determine for which problem

Inuence of Cross-Interferences on Blocked Loops: to know the precise gain brought by blocking. It is even dicult to determine for which problem Inuence of Cross-Interferences on Blocke Loops A Case Stuy with Matrix-Vector Multiply CHRISTINE FRICKER INRIA, France an OLIVIER TEMAM an WILLIAM JALBY University of Versailles, France State-of-the art

More information

All-to-all Broadcast for Vehicular Networks Based on Coded Slotted ALOHA

All-to-all Broadcast for Vehicular Networks Based on Coded Slotted ALOHA Preprint, August 5, 2018. 1 All-to-all Broacast for Vehicular Networks Base on Coe Slotte ALOHA Mikhail Ivanov, Frerik Brännström, Alexanre Graell i Amat, an Petar Popovski Department of Signals an Systems,

More information

THE APPLICATION OF ARTICLE k-th SHORTEST TIME PATH ALGORITHM

THE APPLICATION OF ARTICLE k-th SHORTEST TIME PATH ALGORITHM International Journal of Physics an Mathematical Sciences ISSN: 2277-2111 (Online) 2016 Vol. 6 (1) January-March, pp. 24-6/Mao an Shi. THE APPLICATION OF ARTICLE k-th SHORTEST TIME PATH ALGORITHM Hua Mao

More information

6.854J / J Advanced Algorithms Fall 2008

6.854J / J Advanced Algorithms Fall 2008 MIT OpenCourseWare http://ocw.mit.eu 6.854J / 18.415J Avance Algorithms Fall 2008 For inormation about citing these materials or our Terms o Use, visit: http://ocw.mit.eu/terms. 18.415/6.854 Avance Algorithms

More information

Distributed Decomposition Over Hyperspherical Domains

Distributed Decomposition Over Hyperspherical Domains Distribute Decomposition Over Hyperspherical Domains Aron Ahmaia 1, Davi Keyes 1, Davi Melville 2, Alan Rosenbluth 2, Kehan Tian 2 1 Department of Applie Physics an Applie Mathematics, Columbia University,

More information

APPLYING GENETIC ALGORITHM IN QUERY IMPROVEMENT PROBLEM. Abdelmgeid A. Aly

APPLYING GENETIC ALGORITHM IN QUERY IMPROVEMENT PROBLEM. Abdelmgeid A. Aly International Journal "Information Technologies an Knowlege" Vol. / 2007 309 [Project MINERVAEUROPE] Project MINERVAEUROPE: Ministerial Network for Valorising Activities in igitalisation -

More information

Nearest Neighbor Search using Additive Binary Tree

Nearest Neighbor Search using Additive Binary Tree Nearest Neighbor Search using Aitive Binary Tree Sung-Hyuk Cha an Sargur N. Srihari Center of Excellence for Document Analysis an Recognition State University of New York at Buffalo, U. S. A. E-mail: fscha,sriharig@cear.buffalo.eu

More information

Apriori Algorithm. 1 Bread, Milk 2 Bread, Diaper, Beer, Eggs 3 Milk, Diaper, Beer, Coke 4 Bread, Milk, Diaper, Beer 5 Bread, Milk, Diaper, Coke

Apriori Algorithm. 1 Bread, Milk 2 Bread, Diaper, Beer, Eggs 3 Milk, Diaper, Beer, Coke 4 Bread, Milk, Diaper, Beer 5 Bread, Milk, Diaper, Coke Apriori Algorithm For a given set of transactions, the main aim of Association Rule Mining is to find rules that will predict the occurrence of an item based on the occurrences of the other items in the

More information

CHAPTER 8. ITEMSET MINING 226

CHAPTER 8. ITEMSET MINING 226 CHAPTER 8. ITEMSET MINING 226 Chapter 8 Itemset Mining In many applications one is interested in how often two or more objectsofinterest co-occur. For example, consider a popular web site, which logs all

More information

Discrete Markov Image Modeling and Inference on the Quadtree

Discrete Markov Image Modeling and Inference on the Quadtree 390 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 9, NO. 3, MARCH 2000 Discrete Markov Image Moeling an Inference on the Quatree Jean-Marc Laferté, Patrick Pérez, an Fabrice Heitz Abstract Noncasual Markov

More information

THE BAYESIAN RECEIVER OPERATING CHARACTERISTIC CURVE AN EFFECTIVE APPROACH TO EVALUATE THE IDS PERFORMANCE

THE BAYESIAN RECEIVER OPERATING CHARACTERISTIC CURVE AN EFFECTIVE APPROACH TO EVALUATE THE IDS PERFORMANCE БСУ Международна конференция - 2 THE BAYESIAN RECEIVER OPERATING CHARACTERISTIC CURVE AN EFFECTIVE APPROACH TO EVALUATE THE IDS PERFORMANCE Evgeniya Nikolova, Veselina Jecheva Burgas Free University Abstract:

More information

Disjoint Multipath Routing in Dual Homing Networks using Colored Trees

Disjoint Multipath Routing in Dual Homing Networks using Colored Trees Disjoint Multipath Routing in Dual Homing Networks using Colore Trees Preetha Thulasiraman, Srinivasan Ramasubramanian, an Marwan Krunz Department of Electrical an Computer Engineering University of Arizona,

More information

Using Vector and Raster-Based Techniques in Categorical Map Generalization

Using Vector and Raster-Based Techniques in Categorical Map Generalization Thir ICA Workshop on Progress in Automate Map Generalization, Ottawa, 12-14 August 1999 1 Using Vector an Raster-Base Techniques in Categorical Map Generalization Beat Peter an Robert Weibel Department

More information

WLAN Indoor Positioning Based on Euclidean Distances and Fuzzy Logic

WLAN Indoor Positioning Based on Euclidean Distances and Fuzzy Logic WLAN Inoor Positioning Base on Eucliean Distances an Fuzzy Logic Anreas TEUBER, Bern EISSFELLER Institute of Geoesy an Navigation, University FAF, Munich, Germany, e-mail: (anreas.teuber, bern.eissfeller)@unibw.e

More information

Algebraic transformations of Gauss hypergeometric functions

Algebraic transformations of Gauss hypergeometric functions Algebraic transformations of Gauss hypergeometric functions Raimunas Viūnas Faculty of Mathematics, Kobe University Abstract This article gives a classification scheme of algebraic transformations of Gauss

More information

MORA: a Movement-Based Routing Algorithm for Vehicle Ad Hoc Networks

MORA: a Movement-Based Routing Algorithm for Vehicle Ad Hoc Networks : a Movement-Base Routing Algorithm for Vehicle A Hoc Networks Fabrizio Granelli, Senior Member, Giulia Boato, Member, an Dzmitry Kliazovich, Stuent Member Abstract Recent interest in car-to-car communications

More information

6 Gradient Descent. 6.1 Functions

6 Gradient Descent. 6.1 Functions 6 Graient Descent In this topic we will iscuss optimizing over general functions f. Typically the function is efine f : R! R; that is its omain is multi-imensional (in this case -imensional) an output

More information

Distributed and Paged Suffix Trees for Large Genetic Databases p.1/18

Distributed and Paged Suffix Trees for Large Genetic Databases p.1/18 istributed and Paged Suffix Trees for Large Genetic atabases Raphaël Clifford and Marek Sergot raphael@clifford.net, m.sergot@ic.ac.uk Imperial College London, UK istributed and Paged Suffix Trees for

More information

Multilevel Linear Dimensionality Reduction using Hypergraphs for Data Analysis

Multilevel Linear Dimensionality Reduction using Hypergraphs for Data Analysis Multilevel Linear Dimensionality Reuction using Hypergraphs for Data Analysis Haw-ren Fang Department of Computer Science an Engineering University of Minnesota; Minneapolis, MN 55455 hrfang@csumneu ABSTRACT

More information

Association Rule Discovery

Association Rule Discovery Association Rule Discovery Association Rules describe frequent co-occurences in sets an item set is a subset A of all possible items I Example Problems: Which products are frequently bought together by

More information

Backpressure-based Packet-by-Packet Adaptive Routing in Communication Networks

Backpressure-based Packet-by-Packet Adaptive Routing in Communication Networks 1 Backpressure-base Packet-by-Packet Aaptive Routing in Communication Networks Eleftheria Athanasopoulou, Loc Bui, Tianxiong Ji, R. Srikant, an Alexaner Stoylar arxiv:15.4984v1 [cs.ni] 27 May 21 Abstract

More information