ApproxMGMSP: A Scalable Method of Mining Approximate Multidimensional Sequential Patterns on Distributed System

Changhai Zhang, Kongfa Hu, Zhuxi Chen, Ling Chen
Department of Computer Science and Engineering, Yangzhou University, Yangzhou 225009, China

Yisheng Dong
Department of Computer Science and Engineering, Southeast University, Nanjing 210096, China

Abstract

We present a scalable and effective algorithm called ApproxMGMSP (Approximate Mining of Global Multidimensional Sequential Patterns) for mining multidimensional sequential patterns from large databases in a distributed environment. Our method differs from previous work on mining multidimensional patterns on distributed systems mainly in that it is the first to apply an approximate mining method to a large multidimensional sequence database. First, the multidimensional information is embedded into the corresponding sequences, which converts the mining of multidimensional sequential patterns into the mining of sequential patterns. The sequences are then clustered, summarized, and analyzed at the distributed sites, and the local patterns are obtained by an effective approximate sequential pattern mining method. Finally, after all the local patterns have been collected at one site, the global multidimensional sequential patterns are mined quickly with the high vote sequential pattern model. Both theory and experiments indicate that this method simplifies the problem of mining multidimensional sequential patterns and avoids mining redundant information. Because the communication cost is reduced, the global sequential patterns can be obtained effectively by this scalable method.

1. Introduction

Sequential pattern mining has become an essential data mining task with broad applications, including web log analysis, market and customer analysis, pattern discovery in protein sequences, and mining XML query access patterns for caching. However, mining multidimensional sequential patterns can extract more useful information than mining sequential patterns alone.
At present, databases and data warehouses hold such huge amounts of data that data mining on a single PC is not very effective; in particular, a PC cannot meet the functional and performance requirements of the data processing involved. In actual applications, most large information systems are distributed, such as the data access systems of large interregional shopping markets. Distributed multidimensional pattern mining is therefore proposed to deal with this problem.

Much research related to multidimensional sequential pattern mining has been published, including the well-known algorithms UniSeq, PSFP and HYBRID[1]. However, the overall performance of these algorithms is not high when mining global multidimensional patterns from large amounts of data scattered across a distributed environment, so the issue can only be solved by distributed or parallel data mining technology. In 2003, S.C. Zhang proposed the technique of distributed mining of multi-databases[2] to resolve the problem, and methods for global association rule mining[3] and exceptional sequential pattern mining[4] in different data sources were proposed afterwards. Recently H.C. Kum has also proposed a method of mining global sequential patterns[5] in multi-databases.

Traditional methods of mining sequential patterns, such as the well-known algorithms GSP[6], PrefixSpan[7] and SPADE[8], find all the patterns that satisfy a user-specified minimum support threshold. However, these support-based sequential pattern mining algorithms have some inherent limitations. We therefore propose a novel method of mining approximate multidimensional sequential patterns on distributed systems. Our experiments indicate that the method simplifies the process of mining multidimensional sequential patterns and solves the problem of high dimension effectively. The global multidimensional sequential patterns can be obtained effectively because redundant information is reduced.

2. Problem formulation

Assume that there are n sites $S_1, S_2, \ldots, S_n$ in the distributed environment and that the multidimensional sequence database MSDB is partitioned over the n sites into $\{MSDB_1, MSDB_2, \ldots, MSDB_n\}$, respectively. The independent computers at the sites can communicate with each other. A multidimensional sequence database has the schema MSDB $(TID, A_1, \ldots, A_m, S)$, where TID is a primary key, $A_1, \ldots, A_m$ is the multidimensional information, and S holds the sequences. Let * be any value belonging to any domain of $A_1, \ldots, A_m$. A multidimensional sequence takes the form $(a_1, \ldots, a_m, s)$, where $a_i \in (A_i \cup \{*\})$ for $1 \le i \le m$ and s is a sequence.

Definition 1. Given a local sequence database $DB_x$, let $dist(seq_i, seq_j)$ be the distance measure for $seq_i$ and $seq_j$ ($0 \le dist(seq_i, seq_j) \le 1$). $DB_x$ can be partitioned into similarity clusters $G_{x1}, \ldots, G_{xn}$ such that $\sum_{i \ne j} dist(seq_{ia}, seq_{jb})$ is maximized and $\sum_{i = j} dist(seq_{ia}, seq_{jb})$ is minimized, where $seq_{ia} \in G_{xi}$ and $seq_{jb} \in G_{xj}$.

Definition 2. Let $G_{x1}, \ldots, G_{xn}$ be the similarity clusters for a local database $DB_x$. An approximate sequential pattern for group $G_{xi}$, denoted $lpat_{xi}$, is a sequence that minimizes $dist(lpat_{xi}, seq_{ia})$ for all $seq_{ia}$ in similarity group $G_{xi}$.

Definition 3. Let the set M be the approximate sequential patterns on all sites. Its subset HS is a homogeneous set of range $\gamma$ when the similarity between any two patterns $p_i$ and $p_j$ in HS is not less than $\gamma$: $p_i \in HS \wedge p_j \in HS \wedge sim(p_i, p_j) \ge \gamma$, where $sim(p_i, p_j) = 1 - dist(p_i, p_j)$.

Definition 4. The vote of a homogeneous set HS is defined as the size of the homogeneous set: $Vote(HS, \gamma) = |HS(\gamma)|$.

Definition 5. Let $\gamma$ and $\Theta$ be the desired similarity level and threshold, respectively. A high vote homogeneous set is a homogeneous set HS such that $Vote(HS, \gamma) \ge \Theta$. Given a high vote homogeneous set, the high vote sequential pattern is the longest common subsequence of all local patterns in the set.

Definition 6. Given a schema $WS = \langle X_1 : v_1, \ldots, X_l : v_l \rangle : n$, WS is a weighted sequence when it carries the following information: the current alignment has n sequences; $v_i$ sequences have a non-empty itemset aligned in the i-th itemset $X_i$, where $1 \le i \le l$; and an itemset in the alignment has the form $X_i = (x_{j1} : w_{j1}, \ldots, x_{jm} : w_{jm})$, which means that, in the current alignment, $w_{jk}$ sequences have item $x_{jk}$ in the i-th position of the alignment, where $1 \le i \le l$ and $1 \le k \le m$. Given a minimum degree specified by the user, if $w_{jk}/n$ is not less than the minimum degree, then $x_{jk}$ is collected for obtaining approximate sequential patterns.

3. Multidimensional sequential patterns mining on distributed system

3.1 Embedding multidimensional information into sequences

Inspired by UniSeq, for a tuple in the multidimensional sequence database, the multidimensional information can be embedded into the corresponding sequence by introducing a special element. The mining problem is thereby simplified: instead of mining both the dimension information and the sequences, only sequences are mined. For example, given a tuple q = (10, Business, Chicago, Middle, <(bd)cb(ac)>), the multidimensional information (Business, Chicago, Middle) can be embedded into the corresponding sequence <(bd)cb(ac)> as the first element. That is, the sequence x = <(bd)cb(ac)> in q is extended to y = <(Business Chicago Middle)(bd)cb(ac)>. This converts the mining of sequences in the multidimensional sequence database into the mining of extended sequences in the extended sequence database. In the same way, the multidimensional information could be embedded into the corresponding sequence as the last element. We now verify that approximate multidimensional sequential pattern mining can be carried out on the extended database.

Theorem 1. Given a multidimensional sequence database MSDB and its extended database ESDB, a multidimensional sequence $t = (a_1, \ldots, a_n, s)$ is an approximate sequential pattern in MSDB if and only if the sequence $t_1 = \langle (a_1, \ldots, a_n), s \rangle$ is an approximate sequential pattern in ESDB.

Proof.
If a multidimensional sequence $t = (a_1, \ldots, a_n, s)$ is an approximate sequential pattern in MSDB, then the Levenshtein distance $dist(t, seq)$ is minimum for all seq in the similarity group G. So $dist(t_1, seq)$, calculated as the Levenshtein distance (Algorithm 1), is also minimum; that is, the sequence $t_1 = \langle (a_1, \ldots, a_n), s \rangle$ is an approximate sequential pattern in ESDB. In the same way, if $t_1$ is an approximate sequential pattern in ESDB, we can deduce that the multidimensional sequence $t = (a_1, \ldots, a_n, s)$ is an approximate sequential pattern in MSDB.

3.2 Multidimensional sequence mining

The goal of multidimensional sequential pattern mining in the distributed environment is to reduce the communication cost in the network. Although the traditional pattern mining methods perform well when the dimension is low, their efficiency is very low when the dimension is high, because long sequential patterns must be mined. So we adopt the approximate sequence mining method for the extended database at every site.

First we introduce the Levenshtein distance, which is commonly used as a distance measure for sequences. It is the minimum cost of the insertions, deletions, and replacements needed to convert one sequence S into another sequence T. Given $S = \langle s_1, \ldots, s_n \rangle$ and $T = \langle t_1, \ldots, t_m \rangle$, the Levenshtein distance is obtained by dynamic programming with the following loop.

Algorithm 1. Calculating Levenshtein distance
Input: Two sequences $S = \langle s_1, \ldots, s_n \rangle$, $T = \langle t_1, \ldots, t_m \rangle$.
Output: Levenshtein distance between S and T, $D(S, T)$.
1) If n = 0, return m and exit. If m = 0, return n and exit. Construct a matrix dist with rows 0..n and columns 0..m.
2) Initialize the first row to 0..m. Initialize the first column to 0..n.
3) Examine each element of S (i from 1 to n). Examine each element of T (j from 1 to m).
4) If S[i] equals T[j], the cost is 0. If S[i] doesn't equal T[j], the cost is 1.
5) Set cell dist[i,j] of the matrix equal to the minimum of:
   a. The cell immediately above plus 1: dist[i-1,j] + 1.
   b. The cell immediately to the left plus 1: dist[i,j-1] + 1.
   c. The cell diagonally above and to the left plus the cost: dist[i-1,j-1] + cost.
6) After the iteration steps (3, 4, 5) are complete, the distance is found in cell dist[n,m].

The normalized Levenshtein distance is given by Formula 1.

Formula 1. $dist(S, T) = \dfrac{D(S, T)}{\max\{|S|, |T|\}}$

So that the distance suits sequences of sets, the normalized set difference is used as the replacement cost, as in Formula 2.

Formula 2. $Repl(s, t) = \dfrac{|(s - t) \cup (t - s)|}{|s| + |t|} = 1 - \dfrac{2\,|s \cap t|}{|s - t| + |t - s| + 2\,|s \cap t|}$

We adopt a density-based clustering algorithm to cluster sequences. For each sequence $s_i$ in the database S, let $d_1, \ldots, d_k$ be the k smallest non-zero values of $D(s_i, s_j)$, where $s_j \in S$ and $s_i \ne s_j$. Then $Den(s_i) = n/d$, where $d = \max\{d_1, \ldots, d_k\}$ and $n = |\{s_j \in S \mid D(s_i, s_j) \le d\}|$.

Algorithm 2. Uniform kernel k-nn clustering
Input: A set of sequences $\{s_i\}$, the number of neighbor sequences k.
Output: A set of clusters $\{C_j\}$.
1) Generate initial cluster.
Set every sequence as a cluster, and $Den(C_{s_i}) = Den(s_i)$.
2) Expand initial cluster based on the density of sequences. Let $s_1, \ldots, s_n$ be the nearest neighbors of $s_i$. For each $s_j \in \{s_1, \ldots, s_n\}$, merge the cluster $C_{s_i}$ containing $s_i$ with the cluster $C_{s_j}$ containing $s_j$ if $Den(s_i) < Den(s_j)$ and there exists no $s_p$ having $D(s_i, s_p) < D(s_i, s_j)$ and $Den(s_i) < Den(s_p)$. Set $Den(\text{new cluster}) = \max\{Den(C_{s_i}), Den(C_{s_j})\}$.
3) Merge based on the density of new clusters. Find sequences $s_i$ such that $Den(s_i) = Den(s_j)$, and merge the two clusters $C_{s_i}$ and $C_{s_j}$ containing each sequence if $Den(C_{s_i}) > Den(C_{s_j})$.

The sequences in every database are partitioned into several groups by clustering. Within a group, all sequences are sorted in descending order of density, and the first two sequences are compressed into the weighted sequence $ws_1$. A weighted replace cost, given in Formula 3, is then adopted to ensure that the distance between the next sequence and the weighted sequence $ws_1$ is minimum. Let $ws = (x_1 : w_1, \ldots, x_m : w_m) : v$ be an itemset in a weighted sequence, let $t = (y_1, \ldots, y_l)$ be an itemset in a sequence in the database, and let n be the global weight of the weighted sequence. The weighted sequence $ws_{n-1}$ is obtained by compressing the sequences one after another into the corresponding weighted sequence, and the approximate sequential patterns are then collected according to $ws_{n-1}$.

Formula 3. $REPL(ws, t) = \dfrac{R \cdot v + n - v}{n}$, where $R = \dfrac{\sum_{i=1}^{m} w_i + |t| \cdot v - 2 \sum_{x_i \in t} w_i}{\sum_{i=1}^{m} w_i + |t| \cdot v}$

The global multidimensional sequences are obtained by the high vote sequential pattern model.

Algorithm 3. Global multidimensional sequence mining
Input: All local patterns $L_1, \ldots, L_n$ for sites $1, \ldots, n$.
Output: Global patterns G.
1) Collect all local patterns $L_1, \ldots, L_n$ at one site, and generate homogeneous sets.
2) Collect the high vote homogeneous sets M from the results of step 1, and then generate the global patterns G, that is, the longest common subsequences in M.
3) Broadcast G to each site.

4. Experimental evaluations

4.1 Effectiveness analysis of ApproxMGMSP
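Steps 1 and 2 of Algorithm 3 in Section 3.2 can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on several assumptions. Each local pattern is a tuple whose elements (the embedded dimension element and the itemsets) are compared for plain equality with the 0/1 cost of Algorithm 1, not with the set-based cost of Formula 2. Homogeneous sets are grown greedily from a seed pattern, a strategy the paper does not specify. The longest common subsequence of a set is approximated by folding a pairwise LCS over its members. The function names and all sample patterns except the tuple q of Section 3.1 are hypothetical, and the broadcast of step 3 is omitted.

```python
from functools import reduce


def levenshtein(s, t):
    """Algorithm 1: edit distance between two sequences of comparable elements."""
    n, m = len(s), len(t)
    if n == 0:
        return m
    if m == 0:
        return n
    # (n + 1) x (m + 1) table; row 0 and column 0 hold the prefix lengths.
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # match or replacement
    return d[n][m]


def dist(s, t):
    """Formula 1: Levenshtein distance normalized by the longer length."""
    longest = max(len(s), len(t))
    return levenshtein(s, t) / longest if longest else 0.0


def lcs(s, t):
    """Longest common subsequence of two sequences (standard DP)."""
    n, m = len(s), len(t)
    table = [[()] * (m + 1) for _ in range(n + 1)]
    for i in range(n):
        for j in range(m):
            if s[i] == t[j]:
                table[i + 1][j + 1] = table[i][j] + (s[i],)
            else:
                table[i + 1][j + 1] = max(table[i][j + 1], table[i + 1][j], key=len)
    return table[n][m]


def global_patterns(local_patterns, gamma, theta):
    """Assumed greedy version of steps 1-2 of Algorithm 3."""
    remaining = list(local_patterns)
    result = []
    while remaining:
        hs = [remaining.pop(0)]  # seed of a new homogeneous set
        for p in remaining[:]:
            # Definition 3: p must be similar enough to every current member.
            if all(1 - dist(p, q) >= gamma for q in hs):
                hs.append(p)
                remaining.remove(p)
        if len(hs) >= theta:  # Definitions 4-5: vote is the size of the set
            result.append(reduce(lcs, hs))  # pairwise-folded LCS (heuristic)
    return result


# Hypothetical local patterns from four sites; the first extends tuple q.
dims = ("Business", "Chicago", "Middle")
local = [
    (dims, "bd", "c", "b", "ac"),
    (dims, "bd", "c", "b"),
    (dims, "bd", "b", "ac"),
    (("Student", "Boston", "Young"), "e", "f"),
]
result = global_patterns(local, gamma=0.4, theta=2)
print(result)
```

With these made-up inputs, the three Business patterns form one homogeneous set with vote 3, and their common subsequence `(dims, "bd", "b")` is returned as the only global pattern; the fourth pattern stays alone and falls below the threshold. Folding the pairwise LCS keeps the sketch short but is not guaranteed to return the longest subsequence common to all members of a set.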

For the effectiveness analysis of ApproxMGMSP, we adopt a general evaluation method that assesses the accuracy of the approximation in terms of how well it finds the real underlying patterns in the data and whether or not it generates any spurious patterns. The datasets were generated by the well-known IBM data generator[9]. Base patterns were generated randomly according to the user's specification. These base patterns were then corrupted and merged to generate the sequences in the database. Dimensional information was generated and merged randomly so that values were distributed evenly in every dimension. The evaluation criteria are: recoverability R; precision P; N_redun, the number of redundant patterns; N_spur, the number of spurious patterns; N_max, the number of base patterns; and L, the average length of sequence. Table 1 and Table 2 show how 7 of the 10 most frequent base patterns were uncovered from 1000 sequences using ApproxMGMSP.

Table1. Base patterns
ID   Base pattern                                                                L
B0   <(B, X, D, Y)(20)(63 24)(2)(5)(2 74)(95)(96)>                               13
B1   <(F, A, Z, F)(66 62 50)(16)(16 30 22)(58 66)>                               13
B2   <(W, A, D, F)(6)(24 65 93)(2 24 16 63)(58)(22)>                             14
B3   <(W, L, D, Y)(62)(66)(76 31)(2 74)(58 99)(15)(16 66)>                       15
B4   <(G, H, C, Y)(63 99)(16)(22 58)(51)(66)(96)(50)(45 36)(94)(96 29)(18)>      19
B5   <(B, L, I, Y)(40 62)(15)(40)(29 40)(24 63)(2 74 88)>                        15
B6   <(G, H, I, J)(23 96)(50)(2 22)(16)(58)(10 74)(51 63)>                       15
B7   <(W, X, D, O)(22)(58)(96)(88)(58 78)>                                       10
B8   <(B, A, I, O)(22 41)(2 74)(31 76)(2 74)(22)(58 66)>                         15
B9   <(W, H, C, F)(2 22)(24)(22 50 66)(50)(16)>                                  12

Table2. Local patterns
ID   Local pattern (approximate sequential pattern)                              L
A0   <(B, X, D, Y)(20)(63 24)(2)(5)(2 74)(95)>                                   12
A1   <(F, A, Z, F)(66 62 50)(16)(16 30 22)>                                      11
A2   <(W, A, D, F)(6)(24 65 93)(2 24 16 63)(58)>                                 13
A3   <(W, L, D, Y)(62)(66)(76 31)(2 74)(58 99)(15)>                              13
A4   <(G, H, C, Y)(63 99)(16)(22 58)(51)(66)(96)(50)(45 36)(94)(96 29)>          18
A5   <(G, H, C, Y)(63 99)(16)(22 58)(51)(66)(96)(50)(45 36)>                     15
A6   <(G, H, I, J)(23 96)(50)(2 22)(16)(58)(10 74)(51 63)>                       13
A7   <(W, X, C, O)(22)(58 66)(96)(88)(58 78)>                                    11

Clearly, 8 local patterns are generated from the 1000 sequences, and they recover the major parts of the base patterns that have a high expected frequency in the database; each of the 8 approximate patterns matches a base pattern well. The recoverability is excellent at 90.66%. The precision is quite good at P = 1 - 2/87 = 97.7%. In all the approximate patterns, only 2 items (in the itemsets (W, X, C, O) and (58 66) of A7) do not appear at the corresponding position in the base pattern. There were no spurious patterns and only one redundant pattern, A5. This is because B4 is long: the sequences generated from the long base pattern B4 can be partitioned into multiple clusters by ApproxMGMSP. To sum up, ApproxMGMSP is an effective method of mining multidimensional sequential patterns.

4.2 Scalability analysis of ApproxMGMSP

The following experiments were carried out to test the scalability of ApproxMGMSP.

Group 1: the recoverability as the number of sequences varies, with the average length of sequence L = 20, the average item length I = 2.5, the number of items 10000, the number of base patterns N_seq = 1000, the average length of base pattern L_seq = 14, the average item length of base pattern I_seq = 2, the number of neighbor sequences k = 4, and the minimum degree 50%. The results are shown in Figure 1.

Group 2: the recoverability as the average length of sequence varies, with N = 100000, I = 2.5, the number of items 10000, N_seq = 1000, I_seq = 2, k = 4, and the minimum degree 50%. The results are shown in Figure 2.
Group 3: the execution time of ApproxMGMSP as the number of dimensions varies, with N = 100000, L = 20, I = 2.5, the number of items 10000, N_seq = 1000, I_seq = 2, k = 4, and the minimum degree 50%. The results are shown in Figure 3.

Figure1. Recoverability vs. N
Figure2. Recoverability vs. L
Figure3. Running Time vs. Dimension

From Figure 1 we observe that ApproxMGMSP is scalable with respect to database size: the more sequences in the database, the better the recoverability. For a base pattern with the same probability in the sequences, the larger the database, the more approximate sequential patterns there are, so more sequences are similar to the base patterns and the recoverability increases. From Figure 2 we find that ApproxMGMSP is scalable with respect to the average length of sequence: the larger the average length of sequence, the more repeated items there are, so the recoverability increases. Figure 3 shows that the execution time decreases as the number of dimensions increases. With more dimensions, a growing share of the mining process is spent on dimensional information, and mining dimensional information does not require finding the minimum distance between sequences by sequence comparison. So the running time decreases gradually as the dimension increases.

5. Conclusion and future work

A scalable method is proposed in this paper to mine multidimensional sequential patterns effectively. In this method, the multidimensional information is embedded into the corresponding sequences to convert complex mining on multidimensional sequences into mining on sequences. If the dimension is low, we could adopt the support-based mining method at every site and obtain the global multidimensional sequential patterns by collecting the local patterns. But when the dimension is high, the traditional approach produces many redundant and short patterns and has difficulty finding long patterns. So approximate sequence mining is adopted to mine the local patterns, and the global patterns are finally obtained by the high vote sequential pattern model. The experiments show that this scalable method not only simplifies the problem of mining multidimensional patterns but also resolves the issue of high dimension.
Although this approach is very efficient for mining multidimensional sequential patterns in large databases in the distributed environment, it brings a high degree of complexity. So reducing the complexity of ApproxMGMSP and evaluating global sequential pattern mining are our future research.

Acknowledgements: The research in the paper is supported by the National Natural Science Foundation of China under Grant No. 60673060; the National Facilities and Information Infrastructure for Science and Technology of China under Grant No. 2004DKA20310; the Natural Science Foundation of Jiangsu Province under Grant No. BK2005047; and the Qing Lan Project Foundation of Jiangsu Province of China.

References
[1] H. Pinto, J. Han and J. Pei, Multi-dimensional Sequential Pattern Mining, In Proc. of the 10th Int. Conf. on Information and Knowledge Management (CIKM), ACM, Atlanta, Georgia, pp. 81-88, November 2001.
[2] S. Zhang, X. Wu, and C. Zhang, Multi-Database Mining, IEEE Computational Intelligence Bulletin, Vol.2, No.1, pp. 5-13, June 2003.
[3] X. Wu and S. Zhang, Synthesizing High-Frequency Rules from Different Data Sources, IEEE Transactions on Knowledge & Data Engineering, Vol.15, No.1, pp. 353-367, January 2003.
[4] C. Zhang, M. Liu, W. Nie, and S. Zhang, Identifying Global Exceptional Patterns in Multi-database Mining, IEEE Computational Intelligence Bulletin, Vol.3, No.1, pp. 19-24, Feb 2004.
[5] H.C. Kum, J.H. Chang, W. Wang, Sequential Pattern Mining in Multi-Databases via Multiple Alignment, Data Mining & Knowledge Discovery, Vol.12, No.1, pp. 151-180, January 2006.
[6] R. Srikant and R. Agrawal, Mining Sequential Patterns: Generalizations And Performance Improvements, In Proc. of the 5th Int. Conf. on Extending Database Technology (EDBT), Springer, Avignon, France, pp. 3-17, March 1996.
[7] J. Pei, J. Han, H. Pinto, Q. Chen and U. Dayal, PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth, IEEE Transactions on Knowledge & Data Engineering, Vol.16, No.1, pp. 1424-1440, January 2004.
[8] M. Zaki, SPADE: An Efficient Algorithm for Mining Frequent Sequences, Machine Learning, Vol.42, No. 1/2, pp. 31-60, January 2001.
[9] R. Agrawal and R. Srikant, Mining Sequential Patterns, In Proc. of the 11th Int. Conf. on Data Engineering (ICDE), IEEE Computer Society, Taipei, Taiwan, pp. 3-14, March 1995.