PLWAP Sequential Mining: Open Source Code

Size: px

Start display at page:

Download "PLWAP Sequential Mining: Open Source Code"

Lesley Whitehead
6 years ago
Views:

1 PL Sequentil Mining: Open Source Code C.I. Ezeife School of Computer Science University of Windsor Windsor, Ontrio N9B 3P4 Yi Lu Deprtment of Computer Science Wyne Stte University Detroit, Michign Yi Liu School of Computer Science University of Windsor Windsor, Ontrio N9B 3P4 ABSTRACT PL lgorithm uses preorder linked, position coded version of tree nd elimintes the need to recursively re-construct intermedite trees during sequentil mining s done y tree technique. PL produces significnt reduction in response time chieved y the lgorithm nd provides position code mechnism for rememering the stored dtse, thus, eliminting the need to re-scn the originl dtse s would e necessry for pplictions like those incrementlly mintining mined frequent ptterns, performing strem or dynmic mining. This pper presents open source code for oth the PL nd lgorithms descriing our implementtions nd experimentl performnce nlysis of these two lgorithms on synthetic dt generted with IBM quest dt genertor. An implementtion of the Apriori-like GSP sequentil mining lgorithm is lso discussed nd sumitted. A we log pre-processor for producing rel input to the lgorithms is mde ville too. Keywords sequentil ptterns, we usge mining, tree, pre-order linkge 1. INTRODUCTION Bsic ssocition rule mining with the Apriori lgorithm [1] finds dtse items (ttriutes) tht occur most often together in dtse trnsctions. Thus, given set of trnsctions (similr to dtse records), where ech trnsction is set of items (ttriutes), n ssocition rule is of the form X Y, where X nd Y re sets of items nd X Y =. Assocition rule mining lgorithms generlly first find ll frequent ptterns (itemsets) s ll comintions of items or ttriutes with support (percentge (A full version of this pper is ville in the Journl of Dt Mining nd Knowledge Discovery 1, 5-38, 25.) Permission to mke digitl or hrd copies of ll or prt of this work for personl or clssroom use is grnted without fee provided tht copies re not mde or distriuted for profit or commercil dvntge nd tht copies er this notice nd the full cittion on the first pge. To copy otherwise, to repulish, to post on servers or to redistriute to lists, requires prior specific permission nd/or fee. OSDM 5, August 21, 25, Chicgo, Illinois, USA. Copyright 25 ACM /5/8...$5... occurrence in the entire dtse), greter or equl to predefined minimum support. Then, in the second stge of mining, ssocition rules re generted from ech frequent pttern y defining ll possile comintions of rule ntecedent (hed) nd consequent (til) from items composing the frequent ptterns such tht ntecedent consequent = nd ntecedent consequent = frequentpttern. Then, only rules with confidence (numer of trnsctions tht contin the rule divided y the numer of trnsctions contining the ntecedent) greter thn or equl to pre-defined minimum confidence re retined s vlule, while the rest re pruned. Sequentil mining is n extension of sic ssocition rule mining tht ccommodtes ordered set of items or ttriutes, where the sme item my e repeted in sequence. While sic frequent pttern hs set of non-ordered items tht hve occurred together up to minimum support threshold, frequent sequentil pttern hs sequence of ordered items tht hve occurred frequently in dtse trnsctions t lest s often s the minimum support threshold. Thus, the mesures of support nd confidence used in ssocition rule mining for deciding frequent itemsets re used in sequentil mining for deciding frequent sequences. Just s n i-itemset contins i items, n n-sequence contins n ordered items (events). One ppliction of sequentil mining is we usge mining for finding the reltionship mong different we users ccesses from we ccess logs [5], [11], [4] nd [19]. Anlysis of these ccess dt cn help for server performnce enhncement nd direct mrketing in e-commerce s well s we personliztion. Before pplying sequentil mining techniques to we log dt, the we log trnsctions re pre-processed to group them into set of ccess sequences for ech user identifier nd to crete we ccess sequences in the form of trnsction dtse (e.g., dc, eecc, fec, fec). Access sequence S = e 1e 2...e l is clled susequence of n ccess sequence, S = e 1e 2...e n,ndsis super-sequence of S, denoted s S S, if nd only if for every event e j in S, there is n equl event e k in S, while the order tht events occurred insisthesmestheorder of events in S. For exmple, with S =, S=cd, S is susequence of S, nd c is susequence of S, lthough there is occurring etween nd c in S. In the sequence cd, while cd is suffix susequence of, is prefix susequence of cd. Techniques for mining sequentil ptterns from we logs fll into Apriori or non-apriori. This pper presents discussions of the implementtions of three key sequentil mining lgorithms PL, nd GSP used in [7]. 26

2 1.1 Relted Work Work on mining sequentil ptterns in we log include the GSP [3], the PSP [13], the G sequence [18] nd the grph trversl [15] lgorithms. Agrwl nd Sriknt proposed three lgorithms (Apriori, AprioriAll, AprioriSome) for sequentil mining in [2]. The GSP (Generlized Sequentil Ptterns) [3] lgorithm is 2 times fster thn the Apriori lgorithm. The GSP Algorithm mkes multiple psses over dt. The first pss determines the frequent 1-item ptterns (L 1). Ech susequent pss strts with seed set: the frequent sequences found in the previous pss (L k 1 ). The seed set is used to generte new cndidte sequences (C k ) y performing n Apriori gen join of L k 1 with (L k 1 ). This join requires tht every sequence s in the first L k 1 joins with other sequences s in the second L k 1 if the lst k-2 elements of s re the sme s the first k-2 elements of s. For exmple, if frequent 3-sequence set L 3 hs the following 6 sequences: {((1,2)(3)), ((1,2)(4)), ((1)(3, 4)), ((1,3)(5)), ((2)(3,4)), ((2)(3)(5))}, tootinfrequent4- sequences, every frequent 3-sequence should join with the other 3-sequences tht hve the sme first two elements s its lst two elements. Sequence s=((1,2)(3)) joins with s =((2)(3,4)) to generte cndidte 4-sequence ((1,2)(3,4)) since the lst 2 elements of s, (2)(3), mtch with the first 2 elements of s. Similrly, ((1,2) (3)) joins with ((2)(3)(5)) to form ((1,2)(3)(5)). There re no more mtching sequences to join in L 3. The join phse is followed with the pruning phse, when the cndidte sequences with ny of their contiguous (k-1)-susequences hving support count less thn the minimum support, re dropped. The dtse is scnned for supports of the remining cndidte k-sequences to find frequent k-sequences(l k ), which ecome the seed for the next pss, cndidte (k+1)-sequences. The lgorithm termintes when there re no frequent sequences t the end of pss, or when there re no cndidte sequences generted. The GSP lgorithm uses hsh tree to reduce the numer of cndidtes tht re checked for support in the dtse. The PSP (Prefix Tree For Sequentil Ptterns) [13] pproch is much similr to the GSP lgorithm [3], ut stores the dtse on more concise prefix tree with the lef nodes crrying the supports of the sequences. At ech step k, the dtse is rowsed for counting the support of current cndidtes. Then, the frequent sequence set, L k is uilt. The Grph Trversl mining [14], [15], uses simple unweighted grph to store we sequences nd grph trversl lgorithm similr to Apriori lgorithm to trverse the grph in order to compute the k-cndidte set from the (k- 1)-cndidte sequences without performing the Apriori-gen join. From the grph, if cndidte node is lrge, the djcency list of the node is retrieved. The dtse still hs to e scnned severl times to compute the support of ech cndidte sequence lthough the numer of computed cndidte sequences is drsticlly reduced from tht of the GSP lgorithm. Other tree sed pproches include [18] clled G sequence mining. This lgorithm uses wildcrds, templtes nd construction of Aggregte tree for mining. The FP-tree structure [9] first reorders nd stores the frequent non-sequentil dtse trnsction items on prefix tree, in descending order of their supports such tht dtse trnsctions shre common frequent prefix pths on the tree. Then, mining the tree is ccomplished y recursive construction of conditionl pttern ses for ech Tle 1: Smple We Access Sequence Dtse for -tree TID We ccess sequence Frequent susequence 1 dc c 2 eecc cc 3 fec c 4 fcfc cc frequent 1-item (in ordered list clled f-list), strting with the lowest in the tree. Conditionl FP-tree is constructed for ech frequent conditionl pttern hving more thn one pth, while mximl mined frequent ptterns consist of conctention of items on ech single pth with their suffix f-list item. FreeSpn [8] like the FP-tree method, lists the f- list in descending order of support, ut it is developed for sequentil pttern mining. PrefixSpn [16] is pttern-growth method like FreeSpn, which reduces the serch spce for extending lredy discovered prefix pttern p y projecting portion of the originl dtse tht contins ll necessry dt for mining sequentil ptterns grown from p. We ccess pttern tree (), is non-apriori lgorithm, proposed y Pei et l. [17]. The -tree stores the we log dt in prefix tree formt similr to the frequent pttern tree [9] (FP-tree). lgorithm first scns the we log to compute ll frequent individul events, then it constructs -tree over the set of frequent individul events of ech trnsction efore it recursively mines the constructed tree y uilding conditionl tree for ech conditionl suffix frequent pttern found. The process of recursive mining of conditionl suffix tree ends when it hs only one rnch or is empty. An exmple ppliction of the -tree lgorithm for finding ll frequent events in the we log (constructing the -tree nd mining the ccess ptterns from the tree) is shown with the dtse in Tle 1. Suppose the minimum support threshold is set t 75%, which mens n ccess sequence, s should hve count of 3 out of 4 records in our exmple, to e considered frequent. Constructing tree, entils first scnning dtse once, to otin events tht re frequent,,, c. When constructing the -tree, the non-frequent (like d, e, f) prt of every sequence is discrded. Only the frequent su-sequences shown in column three of Tle 1 re used s input. With the frequent sequence in ech trnsction, the -tree lgorithm first stores the frequent items s heder nodes for linking ll nodes of their type in the -tree in the order the nodes re inserted. When constructing the -tree, virtul root (Root) is first inserted. Then, ech frequent sequence in the trnsction is used to construct rnch from the Root to lef node of the tree. Ech event in sequence is inserted s node with count 1 from Root if tht node type does not yet exist, ut the count of the node is incresed y 1 if the node type lredy exists. Also, the hed link for the inserted event is connected (in roken lines) to the newly inserted node from the lst node of its type tht ws inserted or from the heder node of its type if it is the very first node of tht event type inserted. Once the frequent sequentil dt re stored on the complete -tree (Figure 1), the tree is mined for frequent ptterns strting with the lowest frequent event in the heder list, in our exmple, 27

3 H e d e r T l e : 3 H e d e r T l e : 4 C o n d i t i o n l W A P - t r e e c ( ) C o n d i t i o n l W A P - t r e e c ( ) Figure 1: Construction of the Originl Tree H e d e r T l e strting from frequent event c s the following discussion shows. From the -tree of Figure 1, it first computes prefix sequence of the se c or the conditionl sequence se of c s: : 2; ; c; : -1; ; ; : -1 The conditionl sequence list of suffix event is otined y following the heder link of the event nd reding the pth from the root to ech node (excluding the node). After discrding the non-frequent prt c in the ove sequences, the conditionl sequences sed on c re listed elow: : 2; ; ; : -1; ; ; : -1. Using these conditionl sequences, conditionl tree, -tree c, is uilt using the sme method s shown in Figure 1. The re-construction of trees tht progressed s suffix sequences c, c discovered frequent ptterns found long this line c, c nd c. The recursion continues with the suffix pth c, c. The lgorithm keeps running, finding the conditionl sequence ses of c,,. Figure2shows the trees for mining conditionl pttern se c. After mining the whole tree, discovered frequent pttern set is: {c, c, c, c, c, c, c,,,,,, }. Although -tree lgorithm scns the originl dtse only twice nd voids the prolem of generting explosive cndidte sets s in Apriori-like lgorithm, its min drwck is recursively re-constructing lrge numers of intermedite -trees nd ptterns during mining tking up computing resources. Pre-Order linked tree lgorithm (PL) [7], [1], is version of the tree lgorithm tht ssigns unique inry position code to ech tree node nd performs the heder node linkges pre-order fshion (root, left, right). Both the pre-order linkge nd inry position codes enle the PL to directly mine the sequentil ptterns from the one initil tree strting with prefix sequence, without re-constructing the intermedite trees. To ssign position codes to PL node, the root hs null code, nd the leftmost child of ny prent node hs code tht ppends 1 to the position code of its prent, while the position code of ny other node hs ppended to the position code of its nerest left siling. The PL technique presents much etter performnce thn tht chieved y the -tree technique s shown in extensive performnce nlysis. : 3 C o n d i t i o n l W A P - t r e e c ( c ) H e d e r T l e : 4 C o n d i t i o n l W A P - t r e e c ( d ) C o n d i t i o n l W A P - t r e e c ( e ) Figure 2: Reconstruction of Trees for Mining Conditionl Pttern Bse c 1.2 Motivtions nd Contriutions PL lgorithm [7], recently proposed sequentil mining tool in the Journl of Dt Mining nd Knowledge Discovery hs mny ttrctive fetures tht mkes it suitle s uilding lock for mny other sophisticted sequentil dt mining pproches like incrementl mining [6], we clssifiction nd personliztion. This pper supplements the detiled nd forml presenttions of the PL lgorithm, its properties nd theorems given in [7] y focusing on detils of the code implementtions of PL, nd GSP sequentil miners s well s mking ville rel we log dt pre-processor nd providing further indepth nlysis. This pper thus, contriutes y discussing our code implementtions of the PL nd two other key sequentil mining lgorithms ( nd GSP) used in performnce studies of work in [7]. These lgorithms hve een tested thoroughly on pulicly ville dt sets ( index.shtml) nd with rel dt. Through this pper, rel we log pre-processor for prepring rel input dt for the miners is lso mde ville (sk uthor if needed). A more indepth performnce nlysis tht confirms the stility of the PL lgorithm is nother contriution of pper. Such code vilility motivtes doption of lgorithm s uilding lock for more sophisticted dt mining processes like strem nd dynmic mining, distriuted, incrementl, drill-down nd roll-up mining, semntic mining nd sensor network mining. 1.3 Outline of the Pper Section 2 discusses exmple mining with the Pre-Order Linked -Tree (PL) lgorithm. Section 3 discusses 28

4 the C++ implementtions of the PL, nd the GSP lgorithms for sequentil mining. Section 4 discusses experimentl performnce nlysis, while section 5 presents conclusions nd future work. 2. AN EXAMPLE SEQUENTIAL MINING WITH PL ALGORITHM Unlike the conditionl serch in -tree mining, which is sed on finding common suffix sequence first, the PL technique finds the common prefix sequences first. The min ide is to find frequent pttern y progressively finding its common frequent susequences strting with the first frequent event in frequent pttern. For exmple, if cd is frequent pttern to e discovered, the lgorithm progressively finds suffix sequences d, cd, cd nd cd. The PL tree, on the other hnd, would find the prefix event first, then, using the suffix trees of node, it will find the next prefix susequence nd continuing with the suffix tree of, it will find the next prefix susequence c nd finlly, cd. Thus, the ide of PL is to use the suffix trees of the lst frequent event in n m-prefix sequence to recursively extend the susequence to m+1 sequence y dding frequent event tht occurred in the suffix trees. Using the position codes of the nodes, the PL is le to know the descendnt nd siling nodes of oldest prent nodes on the suffix root set of frequent heder element eing checked for ppending to prefix susequence if it is frequent in the suffix root set under considertion. An element is frequent if the sum of the supports of the oldest prent nodes on ll its suffix root sets is greter thn or equl to the minimum support. Assume we wnt to mine the we ccess dtse (WASD) of Tle 1 for frequent sequences given minimum support of 75% or 3 trnsctions. Constructing nd mining the PL tree goes through the following steps. (1) Scn WASD (column 2 of Tle 1 once to find ll frequent individul events, L s {:4, :4, c:4}. The events d:1, e:2, f:2 hve supports less thn the minimum support of 3 nd (2) Scn WASD gin, construct PL-tree over the set of individul frequent events (column 3 of Tle 1), y inserting ech sequence from root to lef, leling ech node s (node event: count: position code). Then, fter ll events re inserted, trverse the tree pre-order wy to connect the heder link nodes. Figure 3 shows the completely constructed PL tree with the pre-order linkges. (3) Recursively mine the PL-tree using common prefix pttern serch: The lgorithm strts to find the frequent sequence with the frequent 1-sequence in the set of frequent events(fe) {,, c}. For every frequent event in FE nd the suffix trees of current conditionl PL-tree eing mined, it follows the linkge of this event to find the first occurrence of this frequent event in every current suffix tree eing mined, nd dds the support count of ll first occurrences of this frequent event in ll its current suffix trees. If the count is greter thn the minimum support threshold, then this event is ppended (conctented) to the lst susequence in the list of frequent sequences, F. The suffix trees of these first occurrence events in the previously mined conditionl suffix PL-trees re now in turn, used for mining the next event. To otin this conditionl PL-tree, we only need to rememer the roots of the current suffix trees, which re stored for next round mining. For exmple, the Figure 3: Construction of PL-Tree Using Pre- Order Trversl lgorithm strts y mining the tree in Figure 4() for the first element in the heder linkge list, following the link to find the first occurrencies of nodes in :3:1 nd :1:11 of the suffix trees of the Root since this is the first time the whole tree is pssed for mining frequent 1-sequence. Now, the list of mined frequent ptterns F is {} since the count of event in this current suffix trees is 4 (sum of :3:1 nd :1:11 counts), nd more thn the minimum support of 3. The mining of frequent 2-sequences tht strt with event would continue with the next suffix trees of rooted t {:3:11, :1:111} shown in Figure 4() s unshdowed nodes. The ojective here is to find if 2-sequences, nd c re frequent using these suffix trees. In order to confirm frequent, we need to confirm event frequent in the current suffix tree set, nd similrly, to confirm frequent, we should gin follow the link to confirm event frequent using this suffix tree set, sme for c. The process continues to otin sme frequent sequence set {,, c,,, c,c,c,,,c,c,c} s the -tree lgorithm. 3. C++ IMPLEMENTATIONS OF THREE SE- QUENTIAL MINING ALGORITHMS Although, we ly more emphsis on the PL lgorithm developed in our l, we lso provide the source codes of the nd GSP lgorithms used for performnce nlysis. The source codes re discussed under seven hedings of (1) development nd running environment, (2) input dt formt nd files, (3) minimum support formt, (4) output dt formt nd files, (5) functions used in the progrm, (6) dt structures used in the progrm nd (7) dditionl informtion. All of the codes re documented with informtion on this section nd more for code redility, mintinility nd extendility. Ech progrm is stored in.cpp file nd compiled with g++ filenme.cpp. It is worth noting tht ll three lgorithms were implemented in the yer 22, when no other versions of the nd GSP lgorithms were pulicly ville. While we cnnot still find pulicly ville version of the GSP progrm, there re some sutle differences etween our implementtion of the lgorithm nd tht now ville t tszhu/softwre.html. One difference we hve noticed is with the input formts of the two implementtions: while Alert version ccepts the input in one sequence, we ccept list of sequences elonging to different users s in we log mining. It lso seems like Alert 29

5 c : c : : :{ 1, 111, 111 1, 11, :{ 11, 1, 11 1 } c :{ 111 1, , 111, ( ) c : c : : } , } : c : : s u f i x t r e { 1, 1 1 } ( ) s u f i x t r e c s u f i x t r e { 11 1, 11 11, } { } ( c ) ( d ) c c : c : : Figure 4: Mining PL-Tree to Find Frequent Sequence Strting with version is memory-only s intermedite pttern input is not sved on disk, while it is sved on disk in our implementtion. The impliction is tht running the memory-only version on n environment with less memory thn dt size would result in operting system swpping of dt onto disk. 3.1 C++ Implementtion of the PL Sequentil Mining Algorithm This is the PL lgorithm progrm, which is sed on the description in [7], C.I. Ezeife nd Y. Lu. Mining We Log sequentil Ptterns with Position Coded Pre-Order Linked -tree in DMKD DEVELOPMENT ENVIRONMENT: Although initil version is developed under the hrdwre/softwre environment specified elow, the progrm runs on more powerful nd fster multiprocessor UNIX environments s well. Initil environment is: (i)hrdwre: Intel Celeron 4 PC, 64M Memory; (ii)operting system: Windows 98; (iii)development tool: Inprise(Borlnd) C++ Builder 6.. The lgorithm is developed under C++ Builder 6., ut compiles nd runs under ny stndrd C++ development tool. 2. INPUT FILES AND FORMAT: Input file is test.dt: For simplifying input process of the progrm, we ssume tht ll input dt hve een preprocessed such tht ll events elonging to the sme user id hve een gthered together, nd formed s sequence nd sved in text file, clled, test.dt. The test.dt file my e composed of hundreds of thousnds of lines of sequences where ech line represents we ccess sequence for ech user. Every line of the input dt file ( test.dt ) includes UserID, length of sequence nd the sequence which re seprted y t spces. An exmple input line is: Here, 1 represents UserID, the length of sequence is 5, the sequence is 1,2,4,1,3. 3. MINIMUM SUPPORT FORMAT: The progrm lso needs to ccept vlue etween nd 1 s minimum support. The minimum support input is entered interctively y the user during the execution of the progrm when prompted. For minimum support of 5%, user should type.5, nd for minsupport of 5%, user should type.5, nd so on. 4. OUTPUT FILES AND FORMAT: result PL.dt: Once the progrm termintes, we cn find the result frequent ptterns in file nmed result PL.dt. It my contin lines of ptterns, ech representing frequent pttern. 5. FUNCTIONS USED IN THE CODE: (i)buildtree: uilds the PL tree, (ii)uildlinkge: Builds the pre-order linkge for PL tree, (iii)mkecode: mkes the position code for node, (1v)checkPosition: checks the position etween ny two nodes in the PL tree, (v)miningprocess: mines sequentil frequent ptterns from the PL tree using position codes. 6. DATA STRUCTURE: Three struct re used in this progrm: (i) the node struct indictes PL node which contins the following informtion:.the event nme,.the numer of occurrence of the event, c. link to the position code, d. length of position code, e. the linkge to next node with sme event nme in PL tree, f. pointer to its left son, g. pointer to its right siling, h. pointer to its prent, i. the numer of its sons. (ii) position code struct implemented s linked list of unsigned integer to mke it possile to hndle dt of ny size without exceeding the mximum integer size. (iii) linkge struct. 7. ADDITIONAL INFORMATION: The run time is displyed on the screen with strt time, end time nd totl seconds for running the progrm. 3.2 C++ Implementtion of the Sequentil Mining Algorithm This is the lgorithm progrm sed on the descriptionin[17]: JinPei,JiweiHn,BehzdMortzvisl, nd Hu Zhu, Mining Access Ptterns Efficiently from We Logs, PAKDD 2. 1.DEVELOPMENT ENVIRONMENT: The sme s descried for PL nd cn run on ny UNIX system s well. 2. INPUT FILES AND FORMAT: i) Input file test.dt: Pre-processed sequentil input records re red from the file test.dt. The test.dt file, which is the sme formt s descried for PL ove. ii) Input file middle.dt: used to sve the conditionl middle ptterns. During the tree mining process, following the linkges, once the sum of support for n event is found greter thn minimum support, ll its prefix conditionl ptterns re sved in the middle.dt file for next round mining. The formt of middle.dt is s follows: Ech line includes the length of the sequence, the occurrence of the sequence, nd the events in the sequence. For exmple,givenlineinmiddle.dt: ,the length of sequence is 5, 4 indictes the sequence occurred 4 times in the previous conditionl tree nd the sequence is 1,2,4,1,3. 3. MINIMUM SUPPORT: 3

6 The progrm lso needs to ccept vlue etween nd 1 for minimum support or frequency. The user is prompted for minimum support when the progrm strts. 4. OUTPUT FILES AND FORMAT: result.dt Once the progrm termintes, the result ptterns re in file nmed result.dt, which my contin lines of ptterns. 5. FUNCTIONS USED IN THE CODE: (i)buildtree: Builds the tree/conditionl tree (ii)miningprocess: produces sequentil pttern/conditionl prefix su-pttern from tree/conditionl tree. 6. DATA STRUCTURE: three struct re used in this progrm: (i) the node struct indictes node which contins the following informtion:.the event nme,.the numer of occurrence of the event, c. the linkge to next node sme event nme in tree, d. pointer to its left son, e. pointer to its rights siling, f. pointer to its prent. (ii) linkge struct descried in the progrm. 7. ADDITIONAL INFORMATION: The run time is displyed on the screen with strt time, end time nd totl seconds for running the progrm. 3.3 C++ Implementtion of the GSP Sequentil Mining Algorithm This is GSP lgorithm implementtion, which demonstrtes the result descried in [3]: R. Sriknt nd R. Agrwl. Mining sequentil ptterns: Generliztions nd performnce improvements, DEVELOPMENT ENVIRONMENT: The sme s descried for PL nd runs in ny UNIX system s well. 2. INPUT FILES AND FORMAT: Input file test.dt: The min input file test.dt hd the sme formt s for PL descried ove. 3. MINIMUM SUPPORT: The progrm lso needs to ccept vlue etween nd 1 s minimum support. The user is prompted for minimum support when the progrm strts. 4. OUTPUT FILES AND FORMAT: result GSP.dt At the termintion of the progrm, the result ptterns re in file nmed result GSP.dt, which my contin lines of frequent ptterns. 5. FUNCTIONS USED IN THE CODE: (i)gsp: reds the file nd mines levelwise ccording to the lgorithm. 6. DATA STRUCTURES: There re struct for i) cndidte sequence list nd its count nd ii) sequence. 7. ADDITIONAL INFORMATION: The run time is displyed on the screen with strt time, end time nd totl seconds for running the progrm. 4. PERFORMANCE AND EXPERIMENTAL ANALYSIS OF THREE ALGORITHMS The PL lgorithm elimintes the need to store numerous intermedite trees during mining, thus, drsticlly cutting off huge memory ccess costs. The PL nnottes ech tree node with inry position code for quicker mining of the tree. This section compres the experimentl performnce nlysis of PL with the tree mining nd the Apriori-like GSP lgorithms. The three lgorithms were initilly implemented with C++ lnguge running under Inprise C++ Builder environment. All initil experiments were performed on 4MHz Celeron PC mchine with 64 megytes memory running Windows 98 (for work in [7]). Current experiments re conducted with the sme implementtions of the progrms nd still on synthetic dtsets generted using the resource dt genertor code from Resources/index.shtml. This synthetic dtset hs een used y most pttern mining studies [3, 12, 17]. Experiments were lso run on rel dtsets generted from we log dt of University of Windsor Computer Science server nd pre-processed with our we log clener code. The correctness of the implementtions were confirmed y checking tht the frequent ptterns generted for the sme dtset y ll lgorithms re the sme. A high speed UNIX SUN microsystem with totl of M memory nd 8 x 12 MHz processor speed is used for these experiments. The synthetic dtsets consist of sequences of events, where ech event represents n ccessed we pge. The prmeters shown elow re used to generte the dt sets. D Numer of sequences in the dtse C : Averge length of the sequences S : Averge length of mximl potentilly frequent sequence N : numerofevents For exmple, C1.S5.N2.D6K mens tht C = 1, S =5, N = 2, nd D = 6K. It represents group of dt with verge length of the sequences s 1, the verge length of mximl potentilly frequent sequence is 5, the numer of individul events in the dtse is 2, nd the totl numer of sequences in dtse is 6 thousnd. The dtsets with different prmeters test different spects of the lgorithms. Generlly, if the numer of these four prmeters ecomes lrger, the execution time ecomes longer. Experiments re conducted to test the ehvior of the lgorithms with respect to the four prmeters, minimum support threshold, dtse sizes, numer of dtse tle ttriutes nd the verge length of sequences. For ech experiment, while one of the prmeters chnges, others re pegged t some constnt vlues. Oservtions re mde t three levels of smll, medium nd lrge (e.g., smll dtse size my consist of tle with records less thn 4 thousnds, medium size tle hs etween 5 nd 2 thousnds records, while lrge hs over 3 thousnd). Experiment 1: Execution time trend with different minimum supports (smll size dtse, 4K records): This experiment uses fixed size dtse nd different minimum support to compre the performnce of PL, nd GSP lgorithms. The dtsets re descried s C2.5.S5.N5.D4K, nd lgorithms re tested with minimum supports etween.5% nd 2% ginst the 4 thousnd (4K) dtse with 5 ttriutes nd verge sequence length of 2.5. The results of this experiment re shownintle2ndfigure5withthenumeroffrequent ptterns found reported s Fp. From this result, it cn e seen tht t minimum support of.5%, the numer of frequent ptterns found is highest nd is 2729, the PL lgorithm rn lmost 3 times fster thn the lgorithm. As the minimum support increses, the numer of frequent ptterns found decreses nd the gin in performnce y the PL lgorithm over the lgorithm decreses. It cn e seen tht the more the numer of frequent ptterns found in dtset, the higher the performnce gin chieved y the PL lgorithm over the lgorithm. This is 31

7 Tle 2: Execution Times for Dtset t Different Minimum Supports (smll dtse) Alg Runtime (in secs) t Different Supports(%) Fp= Fp= Fp = Fp= Fp= Fp= Fp GSP PL CPU execution time (in secs) PL ecuse the two lgorithms spend out the sme mount of time scnning the dtse nd constructing the tree, ut the PL lgorithm sves on storing nd reding intermedite re-constructed tree, which re s mny s the numer of frequent ptterns found. The execution times of the two lgorithms re the sme when there re nerly no frequent ptterns. CPU execution time (in secs) Minimum Support (in %) PL Figure 5: Execution Times Trend with Different Minimum Supports (smll dtse) Experiment 2: Execution times trend with different minimum supports (Medium size dtse, 2K records): This experiment uses fixed size dtse nd different minimum support to compre the performnce of PL, nd GSP lgorithms. The lgorithms re tested with minimum supports etween.5% nd 15% ginst the 2 thousnd (2K) dtse, 5 ttriutes nd verge sequence length of 2.5. The results of this experiment re shown in Tle 3 nd Figure 6 with the numer of frequent ptterns found reported s Fp. It cn e seen tht the trend in performnce is the sme s with smll dtse. When the minimum support reches 15%, there re no frequent ptterns found nd the running times of the two lgorithms ecome the sme. Experiment 3: Execution Times for Dtset t Different Minimum Supports (lrge dtse): This experiment uses fixed size dtse nd different minimum support to compre the performnce of PL, nd GSP lgorithms. The lgorithms re tested with mini Minimum Support (in %) Figure 6: Execution Times Trend with Different Minimum Supports (medium dtse) Tle 3: Execution Times for Dtset t Different Minimum Supports (medium dtse) Alg Runtime (in secs) t Different Supports(%) Fp= Fp= Fp = Fp= Fp= Fp= Fp GSP PL mum supports etween.5% nd 15% ginst one million (1M) record dtse. The results of this experiment re shown in Tle 4 nd Figure 7. Experiment 3: Execution Times for Dtset t Different Minimum Supports (lrge dtse ut very low minimum supports): This experiment uses fixed size dtse nd different minimum support to compre the performnce of PL nd lgorithms (GSP not included ecuse running times get too ig or process is killed). The lgorithms re tested with minimum supports etween.1% nd.2% ginst one million (1M) record dtse. The results of this experiment re shown in Tle 5 nd Figure 8. Tle 4: Execution Times for Dtset t Different Minimum Supports (lrge dtse) Alg Runtime (in secs) t Different Supports(%) Fp= Fp= Fp = Fp= Fp= Fp= Fp GSP PL

8 CPU execution time (in secs) PL Tle 5: Execution Times for Dtset t Different Minimum Supports (lrge dtse nd low minsupports) Alg Runtime (in secs) t Different Supports(%) Fp= Fp= Fp = Fp= Fp= Fp= Fp 222K 161K PL Minimum Support (in %) Figure 7: Execution Times for Dtset t Different Minimum Supports (lrge dtse) CPU execution time (in secs) 14, 12, 1, 8, 6, 4, 2, Minimum Support (in %) PL Figure 8: Execution Times for Dtset t Different Minimum Supports (lrge dtse nd low minsupports) Experiment 4: Execution times trend with different dtse sizes (smll size dtse, 2K to 14K): This experiment uses fixed minimum support nd different size dtse to compre the performnce of PL, nd GSP lgorithms. The lgorithms re tested on dtses of sizes 2K to 14K t minimum support of 1%. The gin in performnce y the PL lgorithm is constnt cross different sizes ecuse the numer of frequent ptterns in different sized dtsets generted y the dt genertor seem to e out the sme t prticulr minimum support. The results of this experiment re shown in Tle 6 nd Figure 9. Experiment 5: Execution times trend with different dtse sizes (medium size dtse, 2K to 2K): This experiment uses fixed minimum support nd different size dtse to compre the performnce of PL nd lgorithms. The lgorithms re tested with dtse sizes etween 2 nd 2 thousnds t minimum support of 1%. The results of this experiment re shown in Tle 7 nd CPU execution time (in secs) K 4K 6K 8K 1K 12K 14K Dtse Sizes PL Figure 9: Execution Times Trend with Different Minimum Supports (smll dtse) Figure 1. Experiment 6: Execution times trend with different dtse sizes (lrge size dtse, 3K to 9K): This experiment uses fixed minimum support nd different size dtse to compre the performnce of PL, nd GSP lgorithms. The lgorithms re tested with dtse sizes etween 3K nd 9K records t minimum support of 1%. The results of this experiment re shown in Tle 8 nd Figure 11. Since the CPU execution time difference etween the nd PL from this experiment, seems to diminish s the dtse size increses, further experiment ws run on lrger dtses of etween one million nd 2 million records to check if nd when would run fster thn the PL. Result of this experiment Tle 6: Execution Times for Different Dtse Sizes t Minsupport (smll dtse) Alg Runtime (in secs) t Different Supports(%) 2K 4K 6K 8K 1K 12K 14K GSP PL

9 CPU execution time (in secs) PL CPU execution time (in secs) PL 2K 4K 6K 8K 1K 2K 3K 4K 5K 6K 7K 8K 9K Dtse Sizes Dtse Sizes Figure 1: Execution Times Trend with Different Dtse sizes (medium dtse) Figure 11: Execution Times Trend with Different Dtse sizes (lrge dtse) Tle 7: Execution Times for Different Dtse Sizes t Minsupport of 1%(medium dtse) Alg Runtime (in secs) t Different D sizes(%) 2K 4K 6K 8K 1K 2K GSP PL Tle 9: Execution Times for Different Dtse Sizes t Minsupport of 1% (lrger dtse) Alg Runtime (in secs) t Different DB sizes(%) 1M 2M 4M 8M 1M 2M PL on lrger dtses is given in Tle 9. From this experiment, we found tht t the sme minimum support of 1%, when the dtse size increses much, there re few or no frequent ptterns found ecuse n item needs to pper in out 2 thousnd sequentil records in the 2 million dtse to e frequent nd tht is not very possile in the synthetic dt leding to out the sme execution times for the two lgorithms. However, run on the sme 2M records t minimum support of.1% found 35 ptterns nd took 56 seconds of CPU time for PL, while the tree lgorithm could not complete successfully. Experiments were lso run to check the ehvior of the lgorithms with vrying sequence lengths nd numer of dtse ttriutes, nd the PL lwys runs fster thn the lgorithm when the verge sequence length is less thn 2 nd there re some found frequent ptterns. How- Tle 8: Execution Times for Different Dtse Sizes t Minsupport of 1% (lrge dtse) Alg Runtime (in secs) t Different DB sizes(%) 3K 4K 5K 6K 7K 8K 9K GSP K 13K 15K 17K PL ever, the PL performnce strts to degrde when the verge sequence length of the dtse is more thn 2 ecuse with extremely long sequences, there re nodes with position codes tht re more thn mxint, in our current implementtion, we use numer of vriles to store node s position code tht re linked together. Thus, mnging nd retrieving the position code for excessively long sequences could turn out to e time consuming. Our experiment on rel dt of size 1K hving out 25 ttriutes with only few very long sequences of up to 166 items while most of other records re less thn 7 items long, revels PL fster thn y fctor of over 11 times t very low minimum support of.5% with their execution times s 449, 5157, t.6%, they re 234, 1187 nd t.7%, 144, 171 respectively. The execution times of the two lgorithms strt eing the sme from minimum support 1% when there re no found frequent ptterns. A test on memory usge of the nd PL lgorithms revels out the sme mount of memory llocted to oth progrms. 5. CONCLUSIONS This pper discusses the source code implementtion of the PL lgorithm presented in [7] s well s our implementtions of two other sequentil mining lgorithms [17] nd GSP [3] tht PL ws compred with. Extensive experimentl studies were conducted on the three implemented lgorithms using IBM quest synthetic dtsets. From the experiments, it ws discovered tht the PL lgorithm, which mines pre-order linked, position coded version of the tree, lwys outperforms the other two 34

10 lgorithms nd is much more efficient. PL improves on the performnce of the efficient tree-sed tree lgorithm y mining with position codes nd their suffix tree root sets, rther thn storing intermedite trees recursively. Thus, it sves on processing time nd more so when the numer of frequent ptterns increses nd the minimum support threshold is low. PL s performnce seems to degrde some with very long sequences hving sequence length more thn 2 ecuse of the increse in the size of position code tht it needs to uild nd process for very deep PL tree. For mining sequentil ptterns from we logs or dtses, the following spects my e considered for future work. The PL lgorithm could e extended to hndle sequentil pttern mining in lrge trditionl dtses to hndle concurrency of events. The position code fetures of the PL tree provides mechnism for concisely storing smll items not represented in the tree for future incrementl refreshment of mined ptterns. It lso enles esy multilevel mining of frequent ptterns t detiled (e.g., item level like city level or word level) to more generlized clss level (e.g., country level or ook level) through rollup mining with the sme tree y providing levelwise pre-order heder linkges. This my mke it useful in severl sequence oriented pplictions like frequent word usges in collections of documents (ooks), cell phone cll sequence record dt, intrusion detection nd sensor network mining s well s iologicl sequence dt. A more efficient implementtion of the position code mngement for long sequences would mke the PL more sclle. 6. ACKNOWLEDGMENTS This reserch ws supported y the Nturl Science nd Engineering Reserch Council (NSERC) of Cnd under n operting grnt (OGP ) nd University of Windsor grnt. 7. REFERENCES [1] R. Agrwl nd R. Sriknt. Fst lgorithms for mining ssocition rules in lrge dtses. In Proceedings of the 2th Interntionl Conference on very Lrge Dtses Sntigo, Chile, pges , [2] R. Agrwl nd R. Sriknt. Mining sequentil ptterns. In Proceedings of the 11th Interntionl Conference on Dt Engineering (ICDE 95) Tipei Tiwn, pges 3 14, [3] R. S. R. Agrwl. Mining sequentil ptterns: Generliztions nd performnce improvements. In Proceedings of the Fifth Interntionl Conference On Extending Dtse Technology (EDBT 96) Avignon Frnce, pges 3 17, [4] J. Borges nd M. Levene. Dt mining of user nvigtion ptterns. In Proceedings of the KDD Workshop on We Mining Sn Diego Cliforni, pges 31 36, [5] O. Etzioni. The world wide we: Qugmire or gold mine. Communictions of the ACM, 39(1):65 68, [6] C. Ezeife nd M. Chen. Mining we sequentil ptterns incrementlly with revised plwp tree. In Proceedings of the fifth Interntionl Conference on We-Age Informtion Mngement (WAIM 24), pges Springer Verlg, July 23. [7] C. Ezeife nd Y. Lu. Mining we log sequentil ptterns with position coded pre-order linked wp-tree. Interntionl Journl of Dt Mining nd Knowledge Discovery (DMKD) Kluwer Pulishers, 1(1):5 38, 25. [8] J.Hn,J.Pei,B.Mortzvi-Asl,Q.Chen,U.Dyl, nd M. Hsu. Freespn: Frequent pttern-projected sequentil pttern mining. In Proceedings of the 2 Int. Conference on Knowledge Discovery nd Dt Mining (KDD ), pges , 2. [9] J.Hn,J.Pei,Y.Yin,ndR.Mo.Miningfrequent ptterns without cndidte genertion: A frequent-pttern tree pproch. Interntionl Journl of Dt Mining nd Knowledge Discovery, 8(1):53 87, Jn 24. [1] Y. Lu nd C. Ezeife. Position coded pre-order linked wp-tree for we log sequentil pttern mining. In Proceedings of the 7th Pcific-Asi Conference on Knowledge Discovery nd Dt Mining (PAKDD 23), pges Springer, My 23. [11] S. Mdri, S. Bhowmick, W. Ng, nd E. Lim. Smpling lrge dtses for finding ssocition rules. In Proceedings of the first Interntionl Dt Wrehousing nd Knowledge Discovery DWk99, [12] F. Mssegli, F. Cthl, nd P. Poncelet. Psp : Prefix tree for sequentil ptterns. In Proceedings of the 2nd Europen Symposium on Principles of Dt Mining nd Knowledge Discovery (PKDD 98) Nntes Frnce LNAI, pges , [13] F. Mssegli, P. Poncelet, nd R. Cicchetti. An efficient lgorithm for we usge mining. Networking nd Informtion Systems Journl (NIS), 2(5-6):571 63, [14] A. Nnopoulos nd Y. Mnolopoulos. Finding generlized pth ptterns for we log dt mining. Dt nd Knowledge Engineering, 37(3): , 2. [15] A. Nnopoulos nd Y. Mnolopoulos. Mining ptterns from grph trversls. Dt nd Knowledge Engineering, 37(3): , 21. [16] J. Pei, J. Hn, B. Mortzvi-Asl, nd H. Pinto. Prefixspn: Mining sequentil ptterns efficiently y prefix-projected pttern growth. In Proceedings of the 1 Interntionl Conference on Dt Engineering (ICDE 1), pges , 21. [17] J. Pei, J. Hn, B. Mortzvi-Asl, nd H. Zhu. Mining ccess ptterns efficiently from we logs. In Proceedings of the Pcific-Asi Conference on Knowledge Discovery nd Dt Mining (PAKDD ) Kyoto Jpn, 2. [18] M. Spiliopoulou. The lorious wy from dt mining to we mining. Journl of Computer Systems Science nd Engineering Specil Issue on Semntics of the We, 14: , [19] J. Srivstv, R. Cooley, M. Deshpnde, nd P. Tn. We usge mining: Discovery nd pplictions of usge ptterns from we dt. SIGKDD Explortions, 1, 2. 35

COMP 423 lecture 11 Jan. 28, 2008

COMP 423 lecture 11 Jan. 28, 2008 COMP 423 lecture 11 Jn. 28, 2008 Up to now, we hve looked t how some symols in n lphet occur more frequently thn others nd how we cn sve its y using code such tht the codewords for more frequently occuring