TF²P-growth: An Efficient Algorithm for Mining Frequent Patterns without any Thresholds

Yu HIRATE, Eigo IWAHASHI, and Hayato YAMANA
Graduate School of Science and Engineering, Waseda University
{hirate, eigo, yamana}@yama.info.waseda.ac.jp

Abstract

Conventional frequent pattern mining algorithms require some user-specified minimum support, and then mine frequent patterns with support values that are higher than the minimum support. As it is difficult to predict how many frequent patterns will be mined with a specified minimum support, the Top-k mining concept has been proposed. The Top-k mining concept is based on an algorithm for mining frequent patterns without a minimum support, but with the number of most k frequent patterns ordered according to their support values. However, the Top-k mining concept still requires a threshold k. Therefore, users must decide the value of k before initiating mining. In this paper, we propose a new mining algorithm, called TF²P-growth, which does not require any thresholds. This algorithm mines patterns with the descending order of their support values without any thresholds and returns frequent patterns to users sequentially with short response time.

1. Introduction

Due to recent developments in network infrastructure and both price reduction and increases in capacity of storage devices, it has become commonplace to archive large amounts of data. It is important to analyze such large data sets because they may contain new knowledge. The discovery of such knowledge requires data mining technology.

Both the long response time when mining large amounts of data and usability with regard to specification of some threshold values are fundamental problems in data mining, especially in frequent pattern mining. Many algorithms have been proposed to resolve these problems.

Conventional frequent pattern mining algorithms are classified into two categories: the candidate-generation-and-test approach and the pattern-growth approach.
Candidate-generation-and-test approach algorithms, such as Apriori[1], suffer from both the generation of huge numbers of candidates and scanning of the dataset many times to count the frequency of generated patterns, resulting in long response times. Although most frequent pattern mining algorithms proposed prior to 2000 were based on the candidate-generation-and-test approach, another approach called pattern-growth has also been proposed. Pattern-growth approach algorithms, such as FP-growth[2], scan the dataset only a few times. Moreover, pattern-growth approach algorithms mine frequent patterns without generating any candidates. Thus, in most cases, algorithms based on the pattern-growth approach mine frequent patterns faster than those based on the candidate-generation-and-test approach. However, the user must still wait for a long time to mine large numbers of frequent patterns if users fail to specify a minimum support.

On the other hand, algorithms based on another concept have also been proposed. This is called concept mining, examples of which include maximal pattern mining[3][4], closed pattern mining[5][6], and Top-k mining[7][8]. Using maximal pattern mining or closed pattern mining concept algorithms, users can mine frequent patterns containing only their superset patterns excluding any subset patterns. Using Top-k mining concept algorithms, users can mine the most k-frequent patterns in descending order of support without specifying a minimum support.

With regard to usability, the Top-k mining concept is important because the user does not have to specify a minimum support, which is usually difficult to choose when mining a moderate number of frequent patterns. The Top-k mining concept algorithms Itemset-Loop/Itemset-iLoop and TFP-Mining were proposed by Fu et al. [7] and Han et al. [8] in 2000 and 2002, respectively. Itemset-Loop/Itemset-iLoop mines the most k-frequent patterns with lengths shorter than the user-specified value of m. In contrast, TFP-Mining mines the most k-closed frequent patterns with lengths longer than the user-specified value of m.
However, such Top-k mining concept algorithms still require a threshold k before the initiation of mining. Therefore, users must decide the value of k. When the value of k is too large, mining takes a long time. In contrast, using a value of k that is too small usually results in mining only useless patterns even though the mining time is short. Therefore, there are still some difficulties in specifying the value of k.

In this paper, we propose a new mining algorithm, called TF²P-growth, which does not require any threshold values. TF²P-growth, based on FP-growth, mines patterns in descending order of support values. Then, it returns frequent patterns to users sequentially with a short response time.

The remainder of this paper is organized as follows. The terms used are explained in section 2. Related works are described in section 3. Then, we introduce our proposed method in section 4. Section 5 reports the performance evaluation of our proposed method. In section 6, we summarize our work and discuss some future research directions.

2. Terms definition

Let I = {i_1, i_2, ..., i_n} be a set of items. An itemset X is a non-empty subset of I. An itemset with m items is called an m-itemset. A pair <tid, X> is called a transaction, where tid is a transaction identifier and X is an itemset. A transaction database TDB is a set of transactions. Given a transaction database TDB, the support of an itemset X, denoted as sup(X), is the number of transactions including the itemset X. A frequent pattern is defined as an itemset whose support is no lower than the minimum support min_sup. When itemsets are aligned by their support in descending order, the support of the k-th itemset is denoted as α. Then, the Top-k frequent patterns are defined as the itemsets whose support values are no lower than α.

3. Related Works

In this section, research on both basic frequent pattern mining and concept mining is described.

3.1. Basic frequent pattern mining algorithms

3.1.1. Apriori[1]. Apriori is a basic breadth-first algorithm. The theory of Apriori is based on the fact that an itemset X' containing itemset X is never frequent if itemset X is infrequent. Based on this theory, Apriori iteratively generates a set of candidate frequent patterns whose lengths are (k + 1) from the k-itemsets (for k ≥ 1). Then, their corresponding supports are checked.
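The definition of support in section 2 and the Apriori candidate step can be sketched in Python as follows. This is a minimal illustration, not the implementation of [1]: the names `sup`, `apriori_candidates`, and `apriori` are hypothetical, the threshold is assumed to be inclusive (support >= min_sup), and the transactions are the sample TDB that Table 1 lists later in this paper.

```python
from itertools import combinations

# Sample transactions (the TDB of Table 1), keyed by transaction identifier.
TDB = {
    100: {"f", "a", "c", "d", "g", "i", "m", "p"},
    200: {"a", "b", "c", "f", "l", "m", "o"},
    300: {"b", "f", "h", "j", "o"},
    400: {"b", "c", "k", "s", "p"},
    500: {"a", "f", "c", "e", "l", "p", "m", "n"},
}

def sup(itemset, tdb=TDB):
    """Number of transactions that include every item of `itemset`."""
    items = set(itemset)
    return sum(1 for t in tdb.values() if items <= t)

def apriori_candidates(frequent_k):
    """Join frequent k-itemsets into (k+1)-candidates, keeping a candidate
    only if all of its k-subsets are frequent (the Apriori property)."""
    frequent_k = {frozenset(x) for x in frequent_k}
    if not frequent_k:
        return set()
    k = len(next(iter(frequent_k)))
    candidates = set()
    for a in frequent_k:
        for b in frequent_k:
            union = a | b
            if len(union) == k + 1 and all(
                frozenset(s) in frequent_k for s in combinations(union, k)
            ):
                candidates.add(union)
    return candidates

def apriori(min_sup, tdb=TDB):
    """Return {itemset: support} for all itemsets with support >= min_sup
    (inclusive threshold: an assumption of this sketch)."""
    items = set().union(*tdb.values())
    level = {frozenset([i]) for i in items if sup([i], tdb) >= min_sup}
    result = {}
    while level:
        result.update({x: sup(x, tdb) for x in level})
        level = {c for c in apriori_candidates(level) if sup(c, tdb) >= min_sup}
    return result
```

Calling `apriori(3)` on this data yields 18 frequent itemsets, the longest being {f, c, a, m}. Every level rescans the whole TDB for each candidate, which is the cost that the pattern-growth approach avoids.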
There are many variants that have improved on Apriori by further reducing the number of candidates generated[9], or by reducing the number of TDB scans[10].

3.1.2. FP-growth[2]. In 2000, Han et al. proposed the FP-growth algorithm, the first pattern-growth concept algorithm. FP-growth constructs an FP-tree structure and mines frequent patterns by traversing the constructed FP-tree. The FP-tree structure is an extended prefix-tree structure involving crucial condensed information of frequent patterns.

a) FP-tree structure

The FP-tree structure has sufficient information to mine complete frequent patterns. It consists of a prefix-tree of frequent 1-itemsets and a frequent-item header table. Each node in the prefix-tree has three fields: item-name, count, and node-link. item-name is the name of the item. count is the number of transactions that consist of the frequent 1-items on the path from the root to this node. node-link is the link to the next node with the same item-name in the FP-tree. Each entry in the frequent-item header table has two fields: item-name and head of node-link. item-name is the name of the item. head of node-link is the link to the first node with the same item-name in the prefix-tree.

b) Construction of FP-tree

FP-growth has to scan the TDB twice to construct an FP-tree. The first scan of the TDB retrieves a set of frequent items from the TDB. Then, the retrieved frequent items are ordered by descending order of their supports. The ordered list is called an F-list. In the second scan, a tree T whose root node R is labeled with null is created. Then, the following steps are applied to every transaction in the TDB. Here, let a transaction be represented as [p | P], where p is the first item of the transaction and P is the remaining items. In each transaction, infrequent items are discarded. Then, only the frequent items are sorted in the same order as the F-list. Call insert_tree([p | P], R) to construct an FP-tree. The function insert_tree([p | P], R) appends a transaction [p | P] to the root node R of the tree T. Pseudo code of the function insert_tree([p | P], R) is shown in Figure 1.
An example of an FP-tree is shown in Figure 2. This FP-tree is constructed from the TDB shown in Table 1 with min_sup = 3. In Figure 2, every node is represented by (item-name : count). Links to the next node with the same item-name are represented by dotted arrows.
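The two-scan construction described above can be sketched in Python. This is an illustrative sketch rather than the code of [2]: `Node`, `build_fp_tree`, and the `tie_order` parameter are hypothetical names, the threshold is assumed to be inclusive (count >= min_sup), and `tie_order` exists only so that items with equal support can be placed in the same order as the F-list used in this paper's example (f, c, a, b, m, p).

```python
from collections import Counter

class Node:
    """One FP-tree node: item-name, count, node-link, plus parent/children."""
    def __init__(self, item, parent):
        self.item, self.count, self.parent = item, 0, parent
        self.children = {}      # item-name -> child Node
        self.node_link = None   # next node carrying the same item-name

def build_fp_tree(transactions, min_sup, tie_order=()):
    # First scan: count 1-itemsets and build the F-list (descending support).
    counts = Counter(item for t in transactions for item in set(t))
    order = list(tie_order)     # assumed tie-break among equal supports
    rank = lambda i: order.index(i) if i in order else len(order)
    f_list = sorted((i for i in counts if counts[i] >= min_sup),
                    key=lambda i: (-counts[i], rank(i), i))
    position = {item: n for n, item in enumerate(f_list)}

    # Second scan: insert each transaction's frequent items in F-list order.
    root = Node(None, None)
    header = {}                 # item-name -> head of node-link
    tails = {}                  # item-name -> last node of the node-link chain
    for t in transactions:
        node = root
        frequent = sorted((i for i in set(t) if i in position), key=position.get)
        for item in frequent:
            child = node.children.get(item)
            if child is None:   # the "else" branch of insert_tree
                child = Node(item, node)
                node.children[item] = child
                if item in tails:
                    tails[item].node_link = child
                else:
                    header[item] = child
                tails[item] = child
            child.count += 1
            node = child
    return root, header, f_list
```

For the sample TDB, `build_fp_tree(["facdgimp", "abcflmo", "bfhjo", "bcksp", "afcelpmn"], 3, tie_order="fcabmp")` produces a root with two children, f with count 4 and c with count 1, and the node-link chain for p runs from the node with count 2 to the node with count 1.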

Table 1. Sample TDB

TID   Items                      Frequent Items
100   f, a, c, d, g, i, m, p     f, c, a, m, p
200   a, b, c, f, l, m, o        f, c, a, b, m
300   b, f, h, j, o              f, b
400   b, c, k, s, p              c, b, p
500   a, f, c, e, l, p, m, n     f, c, a, m, p

function insert_tree([p | P], R)
{
    let N be a direct child node of R, such that N's item-name = p's item-name.
    if (R has a direct child node N) {
        increment N's count by 1.
    } else {
        create a new node N linked under R.
        set N's item-name equal to p.
        set N's count equal to 1.
    }
    if (P is non-empty) call insert_tree(P, N).
}

Figure 1. Pseudo code of insert_tree([p | P], R)

c) FP-growth

FP-growth mines frequent patterns from an FP-tree. To generate complete frequent patterns, FP-growth traverses all the node-links from the heads of node-links in the FP-tree's header table. For any frequent item a, all possible frequent patterns including a can be mined by following a's node-link starting from a's head in the FP-tree header table. In detail, a's prefix path from a's node to the root node is extracted first. Then, the prefix path is transformed into a's conditional pattern base, which is a list of items that occur before a with the support values of all the items along the list. Then, FP-growth constructs a's conditional FP-tree containing only the paths in a's conditional pattern base. It then mines all the frequent patterns including item a from a's conditional FP-tree.

For example, we describe how to mine all the frequent patterns including item p from the FP-tree shown in Figure 2. For node p, FP-growth mines a frequent pattern (p:3) by traversing p's node-links through node (p:2) to node (p:1). Then, it extracts p's prefix paths: <f:4,c:3,a:3,m:2> and <c:1,b:1>. To study which items appear together with p, the transformed path <f:2,c:2,a:2,m:2> is extracted from <f:4,c:3,a:3,m:2> because the support value of p on this path is 2. Similarly, we have <c:1,b:1>.

[Figure 2. Example of an FP-tree, with a header table listing the items f, c, a, b, m, p and their heads of node-links]

[Figure 3. p's conditional FP-tree, built from p's conditional pattern base]
The set of these paths {(f:2,c:2,a:2,m:2),(c:1,b:1)} is called p's conditional pattern base. FP-growth then constructs p's conditional FP-tree containing only the paths in p's conditional pattern base, as shown in Figure 3. As c is the only item in p's conditional pattern base whose support reaches min_sup, p's conditional FP-tree has only one branch (c:3). Hence, only one frequent pattern (cp:3) is mined. The final frequent patterns including item p are (p:3) and (cp:3).

3.2. Concept mining algorithms

3.2.1. Maximal Pattern Mining concept. Basic frequent pattern mining often mines a huge number of frequent patterns. However, it is difficult to find new knowledge from such huge numbers of frequent patterns. To resolve this problem, maximal pattern mining algorithms, such as Max-Miner[3] and FPmax[4], which mine only the maximal frequent patterns, were proposed.

Definition 1 (Maximal frequent pattern) A pattern X is defined as a maximal frequent pattern iff the following two conditions are satisfied simultaneously:
(1) The support value of X is no lower than min_sup.
(2) There exists no pattern X' whose support value is no lower than min_sup, where X' is any proper superset of X.

3.2.2. Closed Pattern Mining concept. Similar to the maximal pattern mining concept, the closed pattern mining concept was proposed to reduce the number of patterns generated. Closed pattern mining algorithms, such as CLOSET[6] and FP-close[5], mine only closed frequent patterns.

Definition 2 (Closed frequent pattern) A pattern X is defined as a closed frequent pattern iff the following two conditions are satisfied simultaneously:
(1) The support value of X is no lower than min_sup.
(2) There exists no pattern X' whose support value is no lower than min_sup, where X' is a proper superset of X and is included in all the transactions that include X.

3.2.3. Top-k Mining concept. Generally, it is difficult to predict how many frequent patterns will be mined from a user-specified min_sup. If min_sup is low, a huge number of frequent patterns are mined. On the other hand, if min_sup is large, only a small number of frequent patterns are mined. Thus, it is difficult for users to decide the min_sup value. To avoid such difficulties, the Top-k mining concept was proposed to mine the most k frequent patterns in descending order of support without specifying min_sup[7][8].

4. Proposed Algorithm

4.1. Problems with Conventional Algorithms

The Top-k mining concept is important to enhance usability for real applications of data mining. However, the Top-k mining concept still requires a threshold k, and users must decide the value of k before initiating mining.

4.2. Overview of the proposed algorithm

Our proposed algorithm, TF²P-growth, mines patterns in descending order of support values without specifying any thresholds. Then, it returns frequent patterns to users sequentially with short response times. First, in section 4.3, we propose Top-k FP-growth, which is a Top-k mining concept algorithm extended from FP-growth.
Second, in section 4.4, we propose TF²P-growth based on Top-k FP-growth.

4.3. Top-k FP-growth

Our proposed Top-k FP-growth algorithm is the base algorithm of TF²P-growth. This algorithm generates Top-k patterns without a min_sup threshold but with a threshold value k.

4.3.1. Extension from FP-growth. To reduce additional computation, we extended FP-growth for Top-k FP-growth in three points: a) setting the internal threshold Border_sup, b) reducing the number of patterns generated from the FP-tree, and c) outputting frequent patterns.

a) Setting of Border_sup

Definition 3 (Border_sup) Border_sup is defined as the support value of the k-th frequent 1-itemset. This means that there are at least k 1-itemsets with support values no lower than Border_sup.

Top-k FP-growth constructs an FP-tree using Border_sup as a threshold. Border_sup is an internal threshold and its value is defined automatically. Thus, users do not have to be concerned with Border_sup. In concrete terms, frequent items, which are the primitives of FP-tree construction, are 1-itemsets whose support values are no lower than Border_sup.

[Lemma 4.1] If the support value of a 1-itemset is lower than Border_sup, it cannot be used to generate the most k frequent patterns.

[Rationale] Let α be any 1-itemset whose support is lower than Border_sup. Let β be any itemset. Then, the following expression is satisfied.

sup({α, β}) ≤ sup({α}) < Border_sup

The above expression shows that the support value of any itemset including a 1-itemset whose support value is lower than Border_sup is itself lower than Border_sup. In addition, it is clear that the number of itemsets whose support values are no lower than Border_sup is at least k, based on the definition of Border_sup. Thus, we are able to limit the frequent 1-itemsets that are primitives of the FP-tree to the 1-itemsets whose support values are no lower than Border_sup.

For example, given the TDB shown in Table 1 with k = 6, Border_sup is 3 because the support value of the 6-th frequent 1-itemset is 3.
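The Border_sup definition and the k = 6 example can be checked with a short Python sketch. The function names `border_sup` and `fp_tree_primitives` are hypothetical, and the fallback used when fewer than k distinct items exist is an assumption of this sketch, since the paper does not discuss that case.

```python
from collections import Counter

def border_sup(transactions, k):
    """Support of the k-th 1-itemset when 1-itemsets are sorted by
    descending support. Assumption: if fewer than k distinct items
    exist, the smallest support is returned."""
    counts = Counter(item for t in transactions for item in set(t))
    supports = sorted(counts.values(), reverse=True)
    return supports[min(k, len(supports)) - 1]

def fp_tree_primitives(transactions, k):
    """Items kept for FP-tree construction: support >= Border_sup."""
    counts = Counter(item for t in transactions for item in set(t))
    border = border_sup(transactions, k)
    return {item: c for item, c in counts.items() if c >= border}

# Sample TDB of Table 1, one string of single-letter items per transaction.
TABLE_1 = ["facdgimp", "abcflmo", "bfhjo", "bcksp", "afcelpmn"]
```

With `TABLE_1` and k = 6, `border_sup` returns 3 and `fp_tree_primitives` keeps exactly the six items f, c, a, b, m, and p. Note that ties can keep more than k items: with k = 3 the third-ranked support is also 3, so the same six items survive.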
Thus, the primitives of the FP-tree become f, c, a, b, m, and p.

b) Reducing the number of patterns generated from the FP-tree

Pattern generation from the constructed FP-tree by traversing the node-links of all items derives more than k patterns.

To reduce both the number of patterns generated and the number of node-links traversed, a Reduction Array is adopted in Top-k FP-growth, as shown in Figure 4. Top-k FP-growth stores both the patterns generated from the FP-tree and their support values sequentially in a Reduction Array. The entries of the Reduction Array are sorted by descending order of their support values after every traversal of one node-link.

Definition 4 (Boundary_sup) Boundary_sup is the support value of the k-th pattern stored in the Reduction Array. Initially, Boundary_sup is set to 0, but its value increases after the generation of k patterns from an FP-tree.

After traversing the node-link of each item α in an FP-tree, but before traversing the node-link of the next item β in the FP-tree, Top-k FP-growth compares the support value of item β (= sup(β)) with Boundary_sup. If the expression sup(β) < Boundary_sup is satisfied, it terminates pattern generation from the FP-tree. On the other hand, if the expression sup(β) < Boundary_sup is not satisfied, it continues pattern generation from the FP-tree.

The reason why traversal of node-links is terminated if the support value of the next item is lower than Boundary_sup is described below. None of the patterns generated by traversing the node-link of item β have support values higher than sup(β). Thus, if the expression sup(β) < Boundary_sup is satisfied, no patterns including item β have support values as high as Boundary_sup. Moreover, every item γ located under the item β in the FP-tree's header table, and all of the patterns generated by traversing the node-link of item γ, have support values lower than Boundary_sup. Thus, Top-k FP-growth can still generate the Top-k patterns correctly even if the generation of patterns from the FP-tree terminates when the expression sup(β) < Boundary_sup is satisfied.

An example of the Reduction Array is shown in Figure 4. Figure 4 shows a Reduction Array after traversal of the node-link of an item a, given the TDB shown in Table 1 with k = 6.
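The interplay of the Reduction Array and Boundary_sup can be sketched in Python. This is a brute-force stand-in, not the authors' implementation: instead of traversing node-links of an FP-tree it enumerates, for each item β, every pattern made of β plus earlier F-list items and counts its support by rescanning the transactions. The name `top_k_patterns`, the alphabetical tie-break in the F-list, and the rule that only patterns with support >= Border_sup are stored are all assumptions of this sketch.

```python
from collections import Counter
from itertools import combinations

def top_k_patterns(transactions, k):
    """Return (top-k (pattern, support) pairs, items actually traversed)."""
    tdb = [frozenset(t) for t in transactions]
    counts = Counter(item for t in tdb for item in t)
    ranked = sorted(counts, key=lambda i: (-counts[i], i))
    border = counts[ranked[min(k, len(ranked)) - 1]]      # Border_sup
    f_list = [i for i in ranked if counts[i] >= border]

    reduction_array = []        # (pattern, support), sorted by support
    boundary = 0                # Boundary_sup: 0 until k patterns are stored
    traversed = []
    for pos, beta in enumerate(f_list):
        if counts[beta] < boundary:
            break               # sup(beta) < Boundary_sup: stop generating
        traversed.append(beta)
        # Patterns ending in beta: beta plus any subset of earlier items.
        for r in range(pos + 1):
            for prefix in combinations(f_list[:pos], r):
                pattern = frozenset(prefix) | {beta}
                support = sum(1 for t in tdb if pattern <= t)
                if support >= border:
                    reduction_array.append((pattern, support))
        reduction_array.sort(key=lambda ps: -ps[1])       # stable sort
        if len(reduction_array) >= k:
            boundary = reduction_array[k - 1][1]
    return reduction_array[:k], traversed
```

On the sample TDB with k = 6, Boundary_sup becomes 3 once item a has been processed, and because every remaining item also has support 3, all six items are traversed. Early termination shows up on data such as `["ab", "ab", "abc", "abc", "abc"]` with k = 3: after a and b are processed the array holds three patterns of support 5, so item c (support 3) is never traversed.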
In Figure 4, as the support value of the 6-th pattern {c, a} is 3, Boundary_sup is defined as 3. Before traversing the next node-link of an item {b}, Top-k FP-growth compares sup(b) with Boundary_sup. In this case, as sup(b) equals Boundary_sup, Top-k FP-growth continues traversing the node-link of item {b}.

[Figure 4. Example of a Reduction Array (k = 6)]

c) Output of frequent patterns

Even if we adopt both extension a) and extension b), the number of frequent patterns generated from the FP-tree is still greater than k. Thus, Top-k FP-growth outputs the most k frequent patterns, referring to the Reduction Array, by descending order of their support values.

4.3.2. Top-k FP-growth Algorithm. The Top-k FP-growth algorithm is shown below.

INPUT
    TDB
    k (number of frequent patterns)
OUTPUT
    most k frequent patterns (descending order)
METHOD
    1. Scan TDB, count support of all 1-itemsets.
    2. Set Border_sup to the support value of the k-th 1-itemset (descending order). Then, generate an F-list.
    3. Construct an FP-tree according to the F-list.
    4. Generate frequent patterns from the FP-tree. During generation, at every traversal of a node-link, re-calculate Boundary_sup for reduction of the number of patterns generated.
    5. Output the most k frequent patterns among the patterns generated from the FP-tree, referring to the Reduction Array.

4.4. TF²P-growth

Like other Top-k mining concept algorithms, users still have to specify the value k before execution of the Top-k FP-growth proposed in 4.3. To resolve this problem, in this section we propose TF²P-growth based on Top-k FP-growth. TF²P-growth mines frequent patterns in descending order of support values without specifying any thresholds. The mined frequent patterns are output to users every n_c patterns, where n_c is the chunk size. By default, n_c is set to 1000¹. The process of the TF²P-growth algorithm is described below. Users initiate TF²P-growth without specifying any thresholds.
Then, it sequentially returns the Top 1000 patterns, Top 1001 to 2000 patterns, Top 2001 to 3000 patterns, etc. Once users have received the Top 1000 patterns, they can initiate interpretation of these patterns. When users are satisfied with the mined frequent patterns, they can terminate mining, or they may terminate mining whenever they want. As the initial results are the Top-n_c patterns, and n_c can be set to a small number, it is possible to shorten the response time from initiation of the mining until generation of the first part of the results.

¹ Users may change the chunk size n_c to any number.

4.4.1. TF²P-growth algorithm. The TF²P-growth algorithm is shown below.

INPUT
    TDB
OUTPUT
    frequent patterns (descending order of support)
METHOD
    1. Scan TDB to count support of all 1-itemsets.
    2. Set i to 1.
    3. Set n to n_c × i, where n_c is 1000 by default.
    4. Set Border_sup to the support value of the n-th 1-itemset (descending order). Then, generate an F-list.
    5. Construct an FP-tree according to the F-list.
    6. Generate frequent patterns from the FP-tree. During generation, at every node-link traversal, re-calculate Boundary_sup for reduction of the number of patterns generated.
    7. Output the (n_c × (i - 1) + 1)th to (n_c × i)th frequent patterns among the patterns generated from the FP-tree.
    8. Increment i, then go to 3.

5. Performance Evaluation

In this section, we present performance evaluations of TF²P-growth. We evaluated TF²P-growth with regard to (1) comparison of the execution time of FP-growth, Top-k FP-growth, and TF²P-growth, and (2) scalability of TF²P-growth. We used T10I4D1000k, generated by the IBM Quest synthetic data generation code[11], as a dataset. All of the experiments were performed on a 2.4 GHz Pentium 4 PC with 1 GB of main memory, running RedHat Linux 9.0, kernel version 2.4.20. All of the programs were written in C++ and compiled with gcc 3.2.2.

5.1. Comparison of Execution time

We compared the performance of TF²P-growth with both FP-growth and Top-k FP-growth. In real data mining applications, users have difficulty in setting min_sup or the value k for the first time.
Therefore, users must execute frequent pattern mining repeatedly, changing min_sup or the value of k.

[Figure 5. Execution Time vs. Mined Frequent Patterns. Horizontal axis: Time (sec); vertical axis: Number of Freq. Patterns (patterns). Series: TF²P-growth; Top-k FP-growth (executed repeatedly with intervals of 1000 in the value k); FP-growth (executed repeatedly with 0.01% intervals of min_sup); FP-growth (executed repeatedly with 0.05% intervals of min_sup)]

In this evaluation, we measured the execution time of FP-growth and Top-k FP-growth in the following manner. In the case of FP-growth, whose threshold is min_sup, we measured the execution time in two patterns, executing FP-growth repeatedly in the range 0.3% ≥ min_sup ≥ 0.1% with an interval equal to (1) 0.01% and (2) 0.05%. In the case of Top-k FP-growth, whose threshold is the value k, we measured the execution time while executing Top-k FP-growth repeatedly, changing the value k from 1,000 to 10,000 with an interval of 1,000.

The experimental results are shown in Figure 5. First, as it is difficult to set an appropriate min_sup when users execute FP-growth, min_sup must initially be set high. Then, users can decrease the value of min_sup slowly, e.g., with an interval of 0.01%. However, this results in slow generation of frequent patterns. On the other hand, using TF²P-growth, users can obtain larger numbers of frequent patterns in the same time in comparison with using FP-growth².

Second, using TF²P-growth, users can obtain larger numbers of frequent patterns in the equivalent execution time in comparison with using Top-k FP-growth repeatedly. This is because TF²P-growth reduces the number of TDB scans by reusing the F-list generated in the first cycle of Top-k FP-growth.

These experimental results indicate that use of TF²P-growth has the advantage of requiring no threshold to be set. Moreover, users can obtain more frequent patterns in the equivalent execution time in comparison with using FP-growth or Top-k FP-growth.
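The chunked output whose timing is measured here follows the loop of section 4.4.1, and its behaviour can be sketched as a Python generator. This is a brute-force stand-in under stated assumptions, not the authors' C++ implementation: FP-tree construction and mining (steps 5 and 6) are replaced by exhaustive enumeration, which is only practical for tiny inputs; the name `tf2p_chunks` is hypothetical; and two details the paper leaves open are assumed, namely that Border_sup falls back to 1 once n exceeds the number of distinct items, and that ties in support are broken by a fixed order so that successive chunks never overlap.

```python
from collections import Counter
from itertools import combinations

def tf2p_chunks(transactions, n_c=1000):
    """Yield lists of (pattern, support) in descending order of support,
    n_c patterns at a time, without any user-specified threshold."""
    tdb = [frozenset(t) for t in transactions]
    counts = Counter(item for t in tdb for item in t)          # step 1
    ranked = sorted(counts, key=lambda x: (-counts[x], x))
    i = 1                                                      # step 2
    while True:
        n = n_c * i                                            # step 3
        # Step 4: Border_sup is the support of the n-th 1-itemset
        # (assumed to fall back to 1 when fewer than n items exist).
        border = counts[ranked[n - 1]] if n <= len(ranked) else 1
        items = [x for x in ranked if counts[x] >= border]
        patterns = []                                          # steps 5-6
        for r in range(1, len(items) + 1):
            for combo in combinations(items, r):
                support = sum(1 for t in tdb if set(combo) <= t)
                if support >= border:
                    patterns.append((combo, support))
        # Assumed fixed tie-break: support, then length, then item order.
        patterns.sort(key=lambda ps: (-ps[1], len(ps[0]), ps[0]))
        chunk = patterns[n_c * (i - 1): n_c * i]               # step 7
        if not chunk:
            return
        yield chunk
        i += 1                                                 # step 8
```

A caller can consume chunks one at a time and stop whenever the patterns received so far are sufficient, for example `first = next(tf2p_chunks(tdb, n_c=6))`. Unlike this sketch, which recounts supports in every cycle, the paper's algorithm reuses the F-list from the first cycle to avoid additional TDB scans.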
² If users knew the relation between min_sup and the number of mined frequent patterns before the initiation of mining, FP-growth would derive many more patterns than TF²P-growth, as in the result for the 0.05% interval in Figure 5. However, users generally do not know this relation.

[Figure 6. Execution time with different numbers of transactions. Horizontal axis: Time (sec); vertical axis: Number of Freq. Patterns (patterns). Series: T10I4D1000k, T10I4D5000k, T10I4D10000k]

5.2. Scalability of TF²P-growth

We also evaluated the scalability of TF²P-growth. In this evaluation, we prepared the same dataset with different numbers of transactions, T10I4D5000k and T10I4D10000k. The experimental results are shown in Figure 6.

As shown in Figure 6, the execution time of TF²P-growth increases linearly as the dataset size increases. This means that even if users apply TF²P-growth to a very large dataset, they can obtain the Top 1000 patterns and can avoid the situation where they obtain only a small number of frequent patterns after a long computation time.

6. Conclusions

In this paper, we proposed a new frequent pattern mining algorithm, called TF²P-growth, which mines frequent patterns in descending order of support without specifying any threshold values. By applying the proposed algorithm to the dataset T10I4D1000k, we confirmed the following two advantages. First, users can execute the mining process without specifying any threshold values. Second, users can mine more frequent patterns in comparison with those mined by executing FP-growth repeatedly with changing min_sup. For example, TF²P-growth mines the 9,000 most frequent patterns twice as fast as FP-growth executed repeatedly by changing min_sup from 0.3% to 0.1% with 0.01% intervals.

Future work will involve adopting the maximal or closed pattern mining concept into the proposed algorithm, and parallelizing the proposed algorithm.

Acknowledgments

This research was funded in part by both e-Society: the Comprehensive Development Foundation Software of MEXT (Ministry of Education, Culture, Sports, Science, and Technology) and 21-century COE Programs: ICT Productive Academia of MEXT.

References

[1] R. Agrawal and R. Srikant, Fast Algorithms for Mining Association Rules, In Proc. of VLDB '94, pp. 487-499, Santiago, Chile, Sept. 1994.
[2] J. Han, J. Pei and Y. Yin, Mining Frequent Patterns without Candidate Generation, In Proc. of the ACM SIGMOD Conference on Management of Data, pp. 1-12, 2000.
[3] R. J. Bayardo, Efficiently Mining Long Patterns from Databases, In Proc. of the ACM SIGMOD Conference on Management of Data, pp. 85-93, 1998.
[4] G. Grahne and J. Zhu, High Performance Mining of Maximal Frequent Itemsets, In Proc. of SIAM '03 Workshop on High Performance Data Mining, 2003.
[5] G. Grahne and J. Zhu, Efficiently Using Prefix-trees in Mining Frequent Itemsets, In Proc. of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations, 2003.
[6] J. Pei, J. Han and R. Mao, CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets, In Proc. of DMKD '00, 2000.
[7] A. W.-C. Fu, R. W.-W. Kwong and J. Tang, Mining N-most Interesting Itemsets, In Proc. of ISMIS '00, 2000.
[8] J. Han, J. Wang, Y. Lu and P. Tzvetkov, Mining Top-k Frequent Closed Patterns without Minimum Support, In Proc. of IEEE ICDM Conference on Data Mining, 2002.
[9] J. S. Park, M. Chen and P. S. Yu, An Effective Hash-Based Algorithm for Mining Association Rules, In Proc. of the ACM SIGMOD Conference on Management of Data, pp. 175-186, 1995.
[10] A. Savasere, E. Omiecinski, and S. Navathe, An Efficient Algorithm for Mining Association Rules in Large Databases, In Proc. of VLDB '95, pp. 432-444, 1995.
[11] IBM Quest Data Mining Project. Quest synthetic data generation code. http://www.almaden.ibm.com/software/quest/resources/datasets/syndata.html