Mining Sequential Patterns with Periodic Wildcard Gaps


Mining Sequential Patterns with Periodic Wildcard Gaps

Youxi Wu, Lingling Wang, Jiadong Ren, Wei Ding, Xindong Wu

Abstract Mining frequent patterns with periodic wildcard gaps is a critical data mining problem for dealing with complex real-world problems. The problem can be described as follows: given a subject sequence, a pre-specified threshold, and a variable gap-length with wildcards between each two consecutive letters, the task is to find all frequent patterns with periodic wildcard gaps. State-of-the-art mining algorithms, which use a matrix or other linear data structures to solve the problem, not only consume a large amount of memory but also run slowly. In this study, we use an Incomplete Nettree structure (the last layer of a Nettree, which is an extension of a tree) of a sub-pattern P to efficiently create the Incomplete Nettrees of all its super-patterns with prefix pattern P and compute the numbers of their supports in a one-way scan. We propose two new algorithms, MAPB (Mining sequential Pattern using incomplete Nettree with Breadth first search) and MAPD (Mining sequential Pattern using incomplete Nettree with Depth first search), to solve the problem effectively with low memory requirements. Furthermore, we design a heuristic algorithm MAPBOK (MAPB for top-K), based on MAPB, to deal with the Top-K frequent patterns for each length. Experimental results on real-world biological data demonstrate the superiority of the proposed algorithms in running time and space consumption, and also show that the pattern matching approach can be employed to mine special frequent patterns effectively.

Keywords Sequential pattern mining, Periodic wildcard gap, Pattern matching, Heuristic algorithm, Nettree

Y. Wu (corresponding author), L. Wang: School of Computer Science and Engineering, Hebei University of Technology, Tianjin 300130, China. e-mail: wuc567@163.com; wangling927927@126.com
J. Ren: School of Information Science and Engineering, Yanshan University, Qinhuangdao 066004, China. e-mail: jdren@ysu.edu.cn
W. Ding: Department of Computer Science, University of Massachusetts Boston, Boston, MA 02125, USA. e-mail: wei.ding@umb.edu
X. Wu: Department of Computer Science, University of Vermont, Burlington, VT 05405, USA. e-mail: xwu@cs.uvm.edu

1. Introduction

Pattern mining plays an essential role in many critical data mining tasks, such as graph mining [1], spatiotemporal data mining [2], and univariate uncertain data streams [3]. Sequential pattern mining, first introduced by Agrawal and Srikant [4], is a very important data mining problem [5] with broad applications, including analyzing moving object data [6], detecting intrusion [7], and discovering changes in customer behaviour [8]. To address specific applications, researchers have also proposed special approaches. For example, Liao and Chen [9] proposed a Depth-First SPelling algorithm for mining sequential patterns in long biological sequences. To avoid producing a large number of uninteresting and meaningless patterns, Hu et al. [10] proposed an algorithm which considers compactness, repetition, recency and frequency jointly to select sequential patterns, and which is efficient in the business-to-business environment. Shie et al. [11] proposed an efficient algorithm named IM-Span to mine user behaviour patterns in mobile commerce environments. Yin et al. [12] proposed an efficient algorithm USpan for mining high-utility sequential patterns. Zhu et al. [13] proposed an efficient algorithm Spider-Mine to mine the Top-K largest frequent patterns from a single massive network with a user-specified probability. Wu et al. [14] proposed a novel framework for mining the Top-K high-utility itemsets. Classical sequential pattern mining algorithms include GSP [4] and PrefixSpan [15]. Researchers have recently focused on periodic detection [16] and mining frequent patterns with periodic wildcard gaps [17], since this kind of pattern can flexibly reflect sequential behaviours and is often exhibited in many real-world fields.
For example, in business, retail companies may want to know what products customers usually purchase at regular time intervals rather than in continuous time according to time gaps, so that managers can adjust marketing methods or strategies and improve sales profit [18, 19]. In biology, periodic patterns are regarded as having significant biological and medical value. Since genes and proteins both contain many repetitive fragments, many periodic patterns of different lengths and types can be found, and some patterns with excessively high frequency have been identified as the suspected cause of certain diseases [20, 21]. In data mining, researchers have found that some frequent periodic patterns can be used in feature selection for classification and clustering [22, 23]. So an important task is how to mine all these frequent patterns, and the Top-K patterns for each length, from a mass of data efficiently in each application field.

In this study, we focus on mining frequent patterns with periodic wildcard gaps. The problem can be described as follows: given a subject sequence, a pre-specified threshold, and a variable gap-length with wildcards between each two consecutive letters, our goal is to find all frequent patterns with periodic wildcard gaps and the Top-K frequent patterns for each length. A pattern with periodic wildcard gaps is one in which all the wildcard gap constraints are the same. For example, pattern P1 = p_0[min_0, max_0]p_1[min_1, max_1]p_2 = A[0,2]T[0,2]G is a pattern with periodic wildcard gaps, since min_0 = min_1 = 0 and max_0 = max_1 = 2. We can easily see that neither P2 = p_0[min_0, max_0]p_1[min_1, max_1]p_2 = A[0,2]T[1,2]G nor P3 = p_0[min_0, max_0]p_1[min_1, max_1]p_2 = A[0,3]T[0,2]G is a pattern with periodic wildcard gaps, since their wildcard gaps between letters are not completely the same.

In existing work, a series of algorithms has been proposed for pattern mining with flexible wildcards, as detailed in Sect. 2. But all these state-of-the-art mining algorithms use a matrix or other linear data structures to solve the problem, consume a large amount of memory, and run slowly [21, 24, 25]. It is also very difficult for them to mine frequent patterns in long sequences. Therefore, we propose two more efficient mining algorithms which use an incomplete Nettree structure and can find all frequent patterns in a shorter time with less memory. In addition, we design a heuristic algorithm to find the Top-K frequent patterns for each length effectively. The contributions of this paper are as follows. We propose an incomplete Nettree structure (see Sect. 4.2) which can be used to compute the numbers of supports of super-patterns in a one-way scan. We design two algorithms named MAPB (Mining sequential Pattern using incomplete Nettree with Breadth first search) and MAPD (Mining sequential Pattern using incomplete Nettree with Depth first search) to solve the problem effectively with low memory requirements. Experiments on real-world biological data show that MAPD is much faster than MAPB and can be employed to mine frequent patterns in long sequences. Furthermore, a heuristic algorithm named MAPBOK (MAPB for top-K), designed on top of MAPB, is presented to solve the Top-K mining problem, although MAPB is slower than MAPD. This algorithm can efficiently find the Top-K frequent patterns for each length without having to sort all frequent patterns or check all possible candidate patterns. The paper is organized as follows.
Section 2 reviews related work. Section 3 presents the problem definition and preliminaries. Section 4 introduces the Nettree and incomplete Nettree structures and then proposes the MAPB and MAPD algorithms; the analysis of the time and space complexities of the two algorithms and an illustrative example are also given in Section 4. Section 5 proposes a heuristic algorithm to find the Top-K frequent patterns for each length and the longest frequent patterns. Section 6 reports comparative studies, and Section 7 concludes the paper.

2. Related work

We discuss the existing work in terms of the following features:

(1) Pattern P has many wildcard gap constraints. In some studies the gap constraints can be different [26], but in this paper all the wildcard gap constraints are the same, and such a pattern is called a pattern with periodic wildcard gaps.

(2) In some studies, only one or two positions of the given sequence are considered [23], whereas we use a group of positions to denote a support of a pattern in the given sequence, which means that each position p_j should be considered. For example, for a given sequence S = s_0 s_1 s_2 s_3 s_4 s_5 = AATTGG and a pattern P = p_0[min_0, max_0]p_1[min_1, max_1]p_2 = A[0,2]T[0,2]G, <0, 2, 4> is a support of P in S, due to the fact that s_0 = p_0 = A, s_2 = p_1 = T, and s_4 = p_2 = G. It is easy to see that there are 2*2*2 = 8 supports in this instance.

(3) Each position can be used many times in this study. In the literature [26-28], each position in the sequence can be used at most once, which is called the one-off condition, first introduced by Chen et al. [29]. Under the one-off condition there are only two supports of P = A[0,1]T[0,1]G in S = s_0 s_1 s_2 s_3 s_4 s_5 = AATTGG, namely <0, 2, 4> and <1, 3, 5>. Another kind of study is under the non-overlapping condition [30]. For instance, given S = s_0 s_1 s_2 s_3 s_4 = ATATA and P = A[0,1]T[0,1]A, it is easy to see that <0, 1, 2> and <2, 3, 4> are two supports of P in S. We notice that position 2 is used twice in <0, 1, 2> and <2, 3, 4>, but it does not appear at the same index of the two supports. Hence <0, 1, 2> and <2, 3, 4> are two supports of P in S under the non-overlapping condition, but not under the one-off condition. In this study, we do not have these kinds of constraints.

(4) In the literature [26, 27, 30], it is shown that the Apriori property can be used to prune candidate patterns when sequential pattern mining is under the one-off or non-overlapping condition, since the number of supports of any super-pattern is then less than that of its sub-patterns; so if a pattern is not frequent, none of its super-patterns is frequent. However, the Apriori property does not hold in our mining problem, since the number of supports (and even the support ratio) of a super-pattern can be greater than that of its sub-patterns. To reduce redundant candidate patterns, the Apriori-like property was proposed [21]. This property states that all super-patterns of a pattern are not frequent if the support ratio of the pattern is less than a certain value which is itself less than the given threshold.

(5) Generally, pattern mining algorithms find all frequent patterns whose numbers of supports are not less than the given threshold. However, it is not easy for users to work with the frequent patterns when they face thousands of patterns or more. In some cases, parts of the duplicate frequent patterns need to be removed to prune the uninteresting patterns, because these patterns and their super-patterns have the same number of supports; this is called closed frequent pattern mining. Moreover, some work focuses on the Top-K mining problem [13, 14, 31], since the Top-K frequent patterns are more useful than the other frequent patterns.

Mining frequent patterns with periodic wildcard gaps essentially relies on a counting mechanism that calculates the number of supports (or occurrences) of a pattern and then determines whether the pattern is frequent or not.
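The counting condition used in this paper (no one-off or non-overlapping restriction: positions may be reused freely) can be made concrete with a short brute-force sketch. This is a minimal illustration in Python; the function name and structure are ours, not from the paper.

```python
def supports(S, P, M, N):
    """Enumerate all supports of pattern P (a string of letters) in sequence S
    with the same wildcard gap [M, N] between every two consecutive letters.
    Positions may be reused across supports (no one-off/non-overlap condition)."""
    results = []

    def extend(occ):
        j = len(occ)
        if j == len(P):
            results.append(tuple(occ))
            return
        # the first letter may start anywhere; later letters must respect the gap
        lo = 0 if j == 0 else occ[-1] + M + 1
        hi = len(S) - 1 if j == 0 else occ[-1] + N + 1
        for i in range(lo, min(hi, len(S) - 1) + 1):
            if S[i] == P[j]:
                extend(occ + [i])

    extend([])
    return results

# Example from feature (2): P = A[0,2]T[0,2]G in S = AATTGG has 2*2*2 = 8 supports
print(len(supports("AATTGG", "ATG", 0, 2)))  # -> 8
```

Exhaustive enumeration like this is exponential in the pattern length, which is exactly why the paper introduces an efficient counting structure.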
However, on the one hand, the number of supports generally grows exponentially with the length of the checked patterns. On the other hand, neither the number of supports nor the support ratio satisfies monotonicity, so the Apriori property cannot be used to reduce the size of the set of candidate patterns. Thus it is infeasible for traditional algorithms to solve this problem. Researchers have explored several major approaches to untangle this hard problem: (1) The set of length-m candidate patterns is used to generate a set of length-(m+1) candidate patterns; frequent patterns are then selected from the candidate set, and the process is iterated until it is done [20, 22, 23]. (2) The Apriori-like property is employed to reduce the size of the set of candidate patterns [21]. (3) Min et al. [24] redefined the problem of [21] so that the support ratio is monotonous under the new definitions, and the Apriori property can therefore be used. (4) Pattern matching techniques are used to calculate the numbers of supports of patterns in the given sequence in order to find frequent patterns [24, 25]. Table 1 shows a comparison of related work.

Table 1. Comparison of related work

Algorithms        | Periodic wildcard gaps | Apriori property | Occurrences | Output patterns
He et al. [26]    | No                     | Apriori          | One-off     | All frequent patterns
Ding et al. [30]  | No                     | Apriori          | Non-overlap | All/closed frequent patterns
Xie et al. [27]   | Yes                    | Apriori          | One-off     | All frequent patterns
Li et al. [23]    | Yes                    | Apriori          | First-last  | Closed/long frequent patterns
Ji et al. [20]    | Yes                    | Apriori-like     | All         | Minimal distinguishing subsequence patterns
Min et al. [24]   | Yes                    | Apriori          | All         | All frequent patterns
Zhang et al. [21] | Yes                    | Apriori-like     | All         | All frequent patterns
Zhu and Wu [25]   | Yes                    | Apriori-like     | All         | All frequent patterns
This paper        | Yes                    | Apriori-like     | All         | All/Top-K frequent patterns for each length

In Table 1, Zhang et al. [21] studied the problem of mining frequent patterns with periodic wildcard gaps in a genome sequence and proposed the MPP algorithm, based on the Apriori-like property, which employs an effective pruning mechanism to reduce the size of the set of candidate patterns. Min et al. [24] redefined the issue of [21] so that the Apriori property becomes available; their mining algorithm uses a pattern matching approach to calculate the number of supports (or occurrences) of a pattern in the given sequence. Zhu and Wu [25] explored the MCPaS algorithm to discover frequent patterns with periodic wildcard gaps from multi-sequences; the proposed Gap Constrained Search (GCS) algorithm uses a sparse array to calculate the number of supports of a pattern. Ji et al. [20] proposed an algorithm to mine all minimal distinguishing subsequences satisfying the minimum and maximum gap constraints and a maximum length constraint. Li et al. [23] proposed Gap-BIDE to discover the complete set of closed sequential patterns with gap constraints, and Gap-Connect to mine the approximate set of long patterns; they further explored several feature selection methods over the set of gap-constrained patterns for classification and clustering. A triple (bp, ep, count) was used to represent a set of supports of a pattern which share the same beginning and ending positions, where bp, ep, and count are the beginning position of the supports in the sequence, the ending position of the supports in the sequence, and the total number of supports, respectively. Xie et al. [27] and He et al. [26] both studied the problem of mining frequent patterns under the one-off condition, which can greatly improve the time performance. In [27], the MAIL algorithm, employing left-most and right-most pruning methods, is designed to find frequent patterns with periodic wildcard gaps. In [26], two heuristic algorithms, namely one-way scan and two-way scan, are proposed to mine all frequent patterns. Ding et al. [30] studied the repetitive gapped subsequence mining problem and focused on mining closed frequent patterns with non-overlapping supports.

According to Table 1, studies [21, 24, 25] are the most closely related to ours. In this kind of problem, the support ratio is used to determine whether a pattern is frequent or not, and a pattern and its super-patterns generally have different support ratios. So we pay more attention to mining all frequent patterns rather than closed frequent patterns, whereas the algorithm in [21] is less effective at mining all frequent patterns, especially in long sequences. Moreover, traditional Top-K mining studies, such as Zhu et al. [13] and Wu et al. [14], focus on mining the Top-K frequent patterns among all frequent patterns. But those strategies are invalid when users are interested in the Top-K frequent patterns with length t (1 <= t <= m, where m is the length of the longest frequent patterns). For example, suppose the length of the longest frequent patterns is 10 and the numbers of frequent patterns for the various lengths are 4, 16, 64, 256, 1024, 4096, 3438, 5767, 64, and 1, respectively. Users may be interested in the Top-10 frequent patterns with length 8 and the Top-10 frequent patterns with length 9.
But a traditional Top-K mining algorithm can only find the Top-10 frequent patterns among all frequent patterns. Therefore, more effective mining algorithms, especially for long sequences and for mining the Top-K frequent patterns for each length, are worth exploring.

3. Problem definition and preliminaries

We first give some definitions related to the proposed methods.

Definition 1. S = s_0 ... s_i ... s_{n-1} is called a subject sequence, where s_i ∈ Σ is a symbol and n is the length of S. Σ can be a different symbol set in different applications, and the size of Σ is denoted by |Σ|. In a DNA sequence, Σ = {A, T, C, G} and |Σ| = 4.

Definition 2. P = p_0[min_0, max_0]p_1 ... [min_{j-1}, max_{j-1}]p_j ... [min_{m-2}, max_{m-2}]p_{m-1} is a pattern, where p_j ∈ Σ, m is the length of P, min_{j-1} and max_{j-1} are integers, and min_{j-1} <= max_{j-1}. min_{j-1} and max_{j-1} are gap constraints giving the minimum and maximum numbers of wildcards between the two consecutive letters p_{j-1} and p_j, respectively. If min_0 = min_1 = ... = min_{m-2} = M and max_0 = max_1 = ... = max_{m-2} = N, pattern P is called a pattern with periodic wildcard gaps. For example, A[0,3]C[0,3]C is a pattern with periodic wildcard gaps [0,3].

Definition 3. If a sequence of indices D = <d_0, d_1, ..., d_{m-1}> satisfies M <= d_j - d_{j-1} - 1 <= N (1 <= j <= m-1), D is an offset sequence of P with periodic wildcard gaps [M, N] in S. The number of offset sequences of P in S is denoted by ofs(P, S).

Definition 4. If an offset sequence I = <i_0, i_1, ..., i_{m-1}> satisfies s_{i_j} = p_j (0 <= j <= m-1 and 0 <= i_j <= n-1), I is a support (or occurrence) of P in S. The number of supports of P in S is denoted by sup(P, S).

Definition 5. r(P, S) = sup(P, S) / ofs(P, S) is called the support ratio. If r(P, S) is not less than a pre-specified threshold ρ, pattern P is a frequent pattern; otherwise P is an infrequent pattern. Apparently, sup(P, S) is not greater than ofs(P, S), since every support is an offset sequence. So 0 <= r(P, S) <= 1.

Zhang et al. [21] gave a method to calculate ofs(P, S) with gap constraints [M, N]. Suppose m and n are the lengths of pattern P and the given sequence, respectively.
Let W = N - M + 1, l_1 = (n + M) / (M + 1), and l_2 = (n + N) / (N + 1). ofs(P, S) is calculated as follows:

For m > l_1, ofs(P, S) = 0. (1)
For m <= l_2, ofs(P, S) = (n - (m-1)((M+N)/2 + 1)) W^(m-1). (2)
For l_2 < m <= l_1, ofs(P, S) can be calculated by a recursive formula. (3)

Example 1. Suppose patterns P1 = A[1,3]C and P2 = G[1,3]C, a sequence S = s_0 s_1 s_2 s_3 s_4 s_5 = AGCCCT, and ρ = 0.25. We can easily enumerate all 9 offset sequences of P1 and P2 in S, which are <0,2>, <0,3>, <0,4>, <1,3>, <1,4>, <1,5>, <2,4>, <2,5>, and <3,5>. Hence ofs(P1, S) = ofs(P2, S) = 9. The supports of P1 in S are <0,2>, <0,3>, and <0,4>, so sup(P1, S) = 3. P1 is a frequent pattern, since its support ratio r(P1, S) = 3/9 = 0.333 >= ρ. We can see that sup(P2, S) = 2, the supports being <1,3> and <1,4>. So r(P2, S) = 2/9 = 0.222 < ρ, and P2 is an infrequent pattern.

Definition 6. Given patterns P and Q, if Q is a sub-string of P, then P and Q are called the super-pattern and sub-pattern, respectively. If sub-pattern Q consists of the first |P| - 1 characters of P, Q is the prefix pattern of P, denoted by Prefix(P). Similarly, if sub-pattern Q consists of the last |P| - 1 characters of P, Q is the suffix pattern of P, denoted by Suffix(P).

Example 2. Suppose patterns P3 = A[0,3]C[0,3]T, Q1 = A, Q2 = C, and Q3 = A[0,3]C. Q1, Q2, and Q3 are sub-patterns of P3, and P3 is a super-pattern of Q1, Q2, and Q3. The prefix pattern and suffix pattern of P3 are Q3 and C[0,3]T, respectively.

One of the most important tasks of the mining problem is calculating the number of supports of a pattern in order to determine whether the pattern is frequent or not. Zhu and Wu [25] proposed the GCS algorithm to calculate the number of supports of pattern P in sequence S. An illustrative example shows the principle of GCS.

Example 3. Suppose a pattern P = T[0,3]C[0,3]G and a sequence S = s_0 s_1 s_2 s_3 s_4 s_5 s_6 s_7 s_8 s_9 = TTCCTCCGCG. Figure 1 shows the principle of the GCS algorithm, which creates a two-dimensional array A. If p_i ≠ s_j (0 <= i <= m-1), then A(i, j) = 0. If i = 0 and p_i = s_j, then A(i, j) = 1; otherwise A(i, j) = Σ_{k=M..N} A(i-1, j-k-1) (0 < i <= m-1). The sum of the elements in the last row is the number of supports of the pattern in the sequence. Hence the number of supports of P in S is 0+0+0+0+0+0+0+5+0+4 = 9.

  j    0  1  2  3  4  5  6  7  8  9
  s_j  T  T  C  C  T  C  C  G  C  G
  T    1  1  0  0  1  0  0  0  0  0
  C    0  0  2  2  0  2  1  0  1  0
  G    0  0  0  0  0  0  0  5  0  4

Figure 1. Gap constrained pattern search

The array in Figure 1 is a sparse array, so GCS must deal with a great deal of useless zero data, which makes it inefficient at computing the number of supports of a pattern. We employ an incomplete Nettree data structure to overcome these shortcomings and improve the effectiveness.
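The GCS recurrence illustrated in Example 3 can be sketched in a few lines of Python. This is a minimal illustration of the published recurrence, not the authors' implementation; the variable names are ours.

```python
def gcs_support_count(S, P, M, N):
    """Count supports of P in S with periodic gap [M, N], following the GCS
    principle: A[i][j] is the number of supports of the first i+1 letters of P
    whose last letter matches at position j of S."""
    m, n = len(P), len(S)
    A = [[0] * n for _ in range(m)]
    for j in range(n):
        if S[j] == P[0]:
            A[0][j] = 1  # base row: each match of p_0 starts one support
    for i in range(1, m):
        for j in range(n):
            if S[j] == P[i]:
                # predecessors sit at j-k-1 for each allowed gap size k in [M, N]
                A[i][j] = sum(A[i-1][j-k-1] for k in range(M, N+1) if j-k-1 >= 0)
    return sum(A[m-1])  # sum of the last row

# Example 3: P = T[0,3]C[0,3]G, S = TTCCTCCGCG -> 9 supports
print(gcs_support_count("TTCCTCCGCG", "TCG", 0, 3))  # -> 9
```

Note that the full m-by-n array is materialized even though most entries are zero, which is exactly the sparsity drawback discussed above.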
It is necessary to design a good data structure to speed up the calculation of the number of supports of a pattern, in terms of mining efficiency. Moreover, a good pruning strategy also plays an important role. As noted above, the Apriori property does not hold in our mining problem; Example 4 shows that neither the number of supports nor the support ratio of a pattern satisfies monotonicity.

Example 4. Suppose a sequence S = TCGGG, a pattern Q = T[0,3]C, and its super-pattern P = T[0,3]C[0,3]G. We have sup(Q, S) = 1 and sup(P, S) = 3, so the number of supports of a pattern can exceed that of its sub-pattern. The offset sequences of Q in S are <0,1>, <0,2>, <0,3>, <0,4>, <1,2>, <1,3>, <1,4>, <2,3>, <2,4>, and <3,4>. The offset sequences of P in S are <0,1,2>, <0,1,3>, <0,1,4>, <0,2,3>, <0,2,4>, <0,3,4>, <1,2,3>, <1,2,4>, <1,3,4>, and <2,3,4>. So both ofs(Q, S) and ofs(P, S) are 10, and hence r(Q, S) is less than r(P, S). This example therefore shows that not only can sup(Q, S) be less than sup(P, S), but r(Q, S) can also be less than r(P, S). Luckily, Zhang et al. [21] proposed the Apriori-like property, which prunes candidate patterns effectively while still finding all frequent patterns. This property means that if the support ratio of a pattern is less than a certain value, all its super-patterns are not frequent.

Lemma 1. For any length-m pattern Q and its length-(m+1) super-pattern P, sup(P) <= sup(Q) * W, where W = N - M + 1 and M and N are the minimum and maximum gaps, respectively.

Proof: Let I = <i_0, i_1, ..., i_{m-1}> be a support of pattern Q. P has at most W supports with prefix-support I in sequence S, namely I_1 = <i_0, i_1, ..., i_{m-1}, i_{m-1}+M+1>, I_2 = <i_0, i_1, ..., i_{m-1}, i_{m-1}+M+2>, ..., I_W = <i_0, i_1, ..., i_{m-1}, i_{m-1}+N+1>. So sup(P) <= sup(Q) * W. Likewise, when Q is the suffix pattern of P, we get the same formula.

Theorem 1. If the support ratio of a length-m pattern Q is less than (n - (L-1)(w+1)) / (n - (m-1)(w+1)) * ρ, all super-patterns which contain Q can be regarded as infrequent, where w = (M+N)/2 and L (> m) is the length of the longest frequent patterns.

Proof: We know that sup(Q, S)/ofs(Q, S) < (n - (L-1)(w+1)) / (n - (m-1)(w+1)) * ρ. Suppose P is a super-pattern of pattern Q with length k (m < k <= L). Then ofs(Q, S) = (n - (m-1)(w+1)) W^(m-1) and ofs(P, S) = (n - (k-1)(w+1)) W^(k-1) according to the offset equation (2). So ofs(P, S) = (n - (k-1)(w+1)) / (n - (m-1)(w+1)) * W^(k-m) * ofs(Q, S). By Lemma 1, sup(P, S) <= sup(Q, S) * W^(k-m). Hence sup(P, S)/ofs(P, S) <= sup(Q, S)/ofs(Q, S) * (n - (m-1)(w+1)) / (n - (k-1)(w+1)) <= sup(Q, S)/ofs(Q, S) * (n - (m-1)(w+1)) / (n - (L-1)(w+1)) < ρ. Therefore Theorem 1 is proved.

Definition 7. All frequent patterns under the Apriori or Apriori-like property can be described by a tree which is called a frequent pattern tree. The root of the tree is null, and a path from the root to a node in the tree is a frequent pattern.

Example 5. Suppose all the frequent patterns of a DNA sequence are {A, T, C, G, AA, AC, AG, TA, TT, TC, CA, CC, CG, GA, GT, GC, GG, AGA, AGT, AGC}. They can be expressed as the frequent pattern tree shown in Figure 2: the root null has children A, T, C, G; the children of A are A, C, G; the children of T are A, T, C; the children of C are A, C, G; the children of G are A, T, C, G; and the node G on path A-G has children A, T, C.

Figure 2. A frequent pattern tree

4. Nettree and the algorithms

4.1. Nettree

Definition 8. A Nettree data structure is a collection of nodes. The collection can be empty; otherwise a Nettree consists of some distinguished nodes r_1, ..., r_m (roots) and 0 or more non-empty subnettrees T_1, T_2, ..., T_n, each of whose roots is connected by at least one edge from a root r_i, where m >= 1, n >= 0, and 1 <= i <= m. A generic Nettree is shown in Figure 3.

Figure 3. A generic Nettree

A Nettree is an extension of a tree because it has similar concepts to a tree, such as the root, leaf, level, parent, child, and so on. In order to show the differences between the tree data structure and the Nettree data structure, Property 1 is given as follows.

Property 1. A Nettree [32,33] has the following four properties. (1) A Nettree may have more than one root. (2) Some nodes except roots in a Nettree may have more than one parent. (3) There may be more than one path from a node to a root. (4) Some nodes may occur more than once.

Definition 9. Given a Nettree, if each node occurs only once, we use the node name to denote it. Otherwise n_i^d denotes node i on the d-th level if a node occurs more than once on different levels in the Nettree [32, 33].

Example 6. Figure 4 shows a Nettree. Node 3 occurs 3 times in the Nettree. n_3^1, n_3^2, and n_3^3 denote node 3 on the 1st, 2nd, and 3rd levels, respectively. Node n_4^2 has two parents.

Figure 4. A Nettree

Definition 10. A path from a root to node n_i^d is called a root path. R(n_i^d) denotes the Number of Root Paths (NRP) of node n_i^d. The NRP of a root is 1, and the NRP of a d-th level node n_i^d (d>1) is the sum of the NRPs of its parents.

A pattern matching problem with gaps can be used to construct a Nettree according to the pattern and the sequence [32, 33]. On the one hand, a Nettree can clearly express all supports of the pattern in the sequence [32]. On the other hand, it is easy to calculate the number of supports and to find the optimal supports under the one-off condition [33]. An illustrative example is shown as follows.

Example 7. Suppose pattern P = T[0,3]C and sequence S = s_0s_1s_2s_3s_4s_5s_6s_7s_8s_9 = TTCCTCCGCG. Figure 5 shows a Nettree for this pattern matching problem. The numbers in the white and grey circles are the name of a node and its NRP, respectively. A path from a root to a 2nd level node in the Nettree is a support of pattern P in sequence S. For example, path (1,3), which goes from node 1 in the first level to node 3 in the second level, is a support <1,3> of P in S. So a Nettree can show all supports of P in S effectively. Hence, the number of supports is 2+2+2+1+1=8. This is the solution of literature [32]. However, if each position can be used no more than once, called the one-off condition [29], supports <1,3> and <4,5> are not the optimal supports under the one-off condition. This instance has many optimal solutions and one of them is <0,3>, <1,5> and <4,8>. Literature [33] designed an algorithm to find the optimal supports using Nettree.

Figure 5. A Nettree for P in S

Therefore, literatures [32] and [33], which deal with two pattern matching problems, used Nettree to calculate the number of supports and to find the optimal supports under the one-off condition when pattern P and sequence S are given, respectively. Literature [32] can be seen as calculating the searching space of literature [33]. In contrast, this study focuses on a pattern mining problem, that is, to find all frequent patterns when sequence S is given.

4.2. Algorithms

It is very easy to get a path from a node in the first level to a node in the last level of a Nettree if the Nettree has parent-child and child-parent relationships. Such a path is a support (or occurrence) of pattern P in sequence S for the pattern matching problem. So the Nettrees in [32, 33] store parent-child and child-parent relationships. However, we only want to know the number of supports in the pattern mining problem. Hence it is not necessary to store the parent-child and child-parent relationships of the Nettree in this study. The Nettree is usually stored in a node array and each element in the node array stores the node name and its NRP.

Lemma 2. The sum of the NRPs of the d-th (d <= |P|) level nodes is the number of supports of sub-pattern Q which is the first d characters of pattern P.

Proof: Let Q be the first d characters of pattern P.
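The level-by-level NRP computation of Definition 10 and Example 7 can be sketched as follows. This is a minimal illustration, not the paper's implementation; `nettree_levels` is a hypothetical helper that stores each level as a position-to-NRP map:

```python
# A minimal sketch of building the Nettree levels for a pattern with a
# uniform gap [M, N] and counting supports by summing the Number of Root
# Paths (NRPs), as in Definition 10 and Example 7.

def nettree_levels(seq, pattern, M, N):
    """Return one {position: NRP} dict per pattern character (per level)."""
    levels = []
    # Roots: every occurrence of the first pattern character has NRP 1.
    levels.append({i: 1 for i, c in enumerate(seq) if c == pattern[0]})
    for ch in pattern[1:]:
        level = {}
        for pos, nrp in levels[-1].items():
            # Children lie at distance M+1 .. N+1 (a gap of M..N symbols).
            for j in range(pos + M + 1, min(pos + N + 2, len(seq))):
                if seq[j] == ch:
                    level[j] = level.get(j, 0) + nrp  # NRP = sum over parents
        levels.append(level)
    return levels

levels = nettree_levels("TTCCTCCGCG", "TC", M=0, N=3)
print(sum(levels[0].values()))   # 3 roots: supports of the prefix (Lemma 2)
print(sum(levels[-1].values()))  # 8 supports of T[0,3]C, as in Example 7
```

Because a node's NRP is just the sum of its parents' NRPs, the count of all supports falls out of one forward pass; no explicit path enumeration is needed.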
R(n_i^d) is the number of supports of Q at position i. So the sum of the NRPs of the d-th (d <= |P|) level nodes is the number of supports of Q in S.

By Lemma 2, we know that the sum of the NRPs of the first level nodes, 1+1+1=3, is the number of supports of Prefix(P) in S in Example 7. The sum of the NRPs of the last level nodes (2+2+2+1+1=8) is the number of supports of pattern P. It is easily seen that the top m-1 levels of nodes of the Nettree of a length-m pattern P are redundant in calculating the number of supports of its super-pattern. So we only store the last level nodes of the Nettree of a pattern, which is called an incomplete Nettree. The reasons for using an incomplete Nettree to solve the problem are as follows.

(1) The incomplete Nettree only stores useful information and avoids calculating useless data, which makes the algorithm more effective and consumes less memory. But in [24, 25], a two-dimensional array is used to calculate the number of supports. So the two algorithms in [24, 25] have to deal with useless information, which reduces the efficiency of mining since the array is sparse.

(2) We can use the incomplete Nettree of pattern P to construct a new incomplete Nettree and calculate the number of supports of its super-pattern. This takes full advantage of the previous calculating results. Furthermore, if we can simultaneously calculate the numbers of supports of super-patterns with the same prefix pattern in a one-way scan, the pattern mining algorithm will further

enhance its efficiency. For example, suppose that a pattern P=T[0,3]C is frequent in a DNA sequence and we need to determine whether the super-patterns P1=T[0,3]C[0,3]A, P2=T[0,3]C[0,3]T, P3=T[0,3]C[0,3]C, and P4=T[0,3]C[0,3]G are frequent or not. It will be less effective to calculate the numbers of supports of these four super-patterns in a four-way scan than in a one-way scan. The principle of calculating the numbers of supports of all super-patterns P[M,N]a (a ∈ Σ) in a one-way scan is as follows. According to the gap constraints [M,N], we create all child nodes of node q, which is a node in the incomplete Nettree of P. If s_j = Σ_k (q+M+1 <= j <= q+N+1, 0 <= k < |Σ|), we check whether node j is a node in the incomplete Nettree of super-pattern P_k or not. If it is not in the incomplete Nettree of P_k, we create node j, store it in the incomplete Nettree, and set the NRP of node j to the NRP of node q; otherwise we only update the NRP of node j by adding the NRP of node q. So the number of supports of super-pattern P_k is the sum of the NRPs of all nodes of the incomplete Nettree of P_k. Hence, we can calculate the numbers of supports of these super-patterns in a one-way scan. The algorithm is named INSupport.

Here we employ a structure named IINettree (Information of the Incomplete Nettree) to describe the mined pattern string, the number of its supports, and its incomplete Nettree in INSupport. For the sake of simplicity, we use the terms pattern, sup, and INtree for short. So IINettree can be expressed as {pattern, sup, INtree}. Since the representation of the incomplete Nettree is relatively straightforward, an incomplete Nettree is composed of a size and an array which contains the names of all nodes and their corresponding NRPs. Hence, IINettree can again be written as {pattern, sup, {size, (name_0, NRP_0), ..., (name_{size-1}, NRP_{size-1})}}. The inputs of INSupport are P, S, and INtree, which is the incomplete Nettree of P. The output of INSupport is superps, which is an array of IINettree. The size of superps is |Σ|.
For example, given the sequence S=TTCCTCCGCG and the prefix pattern P=T[0,3]C (shown in Example 8), superps_2 can be expressed as {T[0,3]C[0,3]C, 15, {4, (3,2),(5,4),(6,6),(8,3)}} since C is the third letter in a DNA sequence. Algorithm INSupport is shown as follows.

Algorithm 1: INSupport(P, S, INtree)
Input: P, S, INtree
Output: superps
1: superps_j.sup=0; // for each 0<=j<|Σ|
2: superps_j.pattern=P[M,N]Σ_j;
3: for (i=0; i<|INtree|; i++)
4:   oldnode=INtree_i;
5:   for (j=oldnode.name+M+1; j<=oldnode.name+N+1; j++)
6:     if (s_j==Σ_k) then
7:       superps_k.sup += oldnode.NRP;
8:       position=Search(superps_k.INtree, j);
9:       // The result will be -1 if j is not in superps_k.INtree.
10:      if (position==-1) then
11:        newnode.name=j;
12:        newnode.NRP=oldnode.NRP;
13:        superps_k.INtree_{size++}=newnode;
14:      else
15:        superps_k.INtree_position.NRP += oldnode.NRP;
16:      end if
17:    end if
18:  end for
19: end for
20: return superps;

The principle of MAPB is given as follows. Firstly, each character in Σ is seen as a length-1 pattern P. We create |Σ| incomplete Nettrees, one for each pattern, and calculate the numbers of their supports. Then we need to determine whether the support ratio of each pattern is not less than β = ρ*(n-(l1-1)*(w+1))/(n-(m-1)*(w+1)), where w=(M+N)/2, and l1 and m are the length of the longest frequent patterns and the length of pattern P, respectively. If it is not less than this value, pattern P and its incomplete Nettree will be stored in a queue. After this, a prefix pattern P and its incomplete Nettree are dequeued and we need to check whether pattern P is a frequent pattern or not. Algorithm 1 is used to calculate the numbers of supports of all super-patterns with prefix P and to create the |Σ| incomplete Nettrees of the super-patterns. Finally, we check whether the support ratio of each super-pattern is not less than β or not. If yes, we store it and its incomplete Nettree in the queue, and then iterate this process till the queue is empty. Apparently, this method uses a kind of Apriori-like property to prune the number of candidate patterns and constructs the frequent pattern tree based on BFS.
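A rough Python rendering of Algorithm 1's one-way scan may help. It is a sketch, not the authors' code: dictionaries stand in for the (name, NRP) node arrays, and the alphabet order A, T, C, G is assumed from the paper's examples:

```python
# A rough rendering of Algorithm 1 (INSupport): one scan over the incomplete
# Nettree of prefix P extends it to all |Sigma| super-patterns P[M,N]a at
# once. Dicts replace the (name, NRP) node arrays of the paper.

SIGMA = "ATCG"  # alphabet order assumed from the paper's examples

def in_support(seq, intree, M, N):
    """intree: {position: NRP} for the last level of P's Nettree.
    Returns {a: (support, new_intree)} for every super-pattern P[M,N]a."""
    superps = {a: [0, {}] for a in SIGMA}
    for pos, nrp in intree.items():              # each old node
        for j in range(pos + M + 1, min(pos + N + 2, len(seq))):
            sup_tree = superps[seq[j]]
            sup_tree[0] += nrp                   # support grows by old NRP
            # create the child node, or just accumulate its NRP
            sup_tree[1][j] = sup_tree[1].get(j, 0) + nrp
    return {a: (s, t) for a, (s, t) in superps.items()}

# Incomplete Nettree of P = T[0,3]C in S = TTCCTCCGCG (see Example 7):
result = in_support("TTCCTCCGCG", {2: 2, 3: 2, 5: 2, 6: 1, 8: 1}, 0, 3)
print(result["C"])  # (15, {3: 2, 5: 4, 6: 6, 8: 3}), the superps_2 example
```

Note that the sketch reproduces the superps_2 instance quoted above: the C-extension has support 15 with nodes (3,2),(5,4),(6,6),(8,3).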
The algorithm named MAPB is given as follows.

Algorithm 2. MAPB
Input: S=s_0s_1...s_{n-1}, M, N, ρ, and l1, where l1 is the length of the longest frequent patterns
Output: All patterns with frequency not less than ρ
1: patterns_j.pattern=Σ_j;
2: for (i=0; i<|S|; i++)
3:   if (s_i==Σ_j) then
4:     node.name=i;
5:     node.NRP=1;
6:     patterns_j.INtree_{size++}=node;
7:   end if
8: end for
9: for (j=0; j<|Σ|; j++)
10:  patterns_j.sup=patterns_j.INtree.size;
11:  if (patterns_j.sup/|S| >= ρ*(n-(l1-1)*(w+1))/n) then meta.enqueue(patterns_j);
12: end for
13: while (!meta.empty())
14:  subP=meta.dequeue();
15:  P=subP.pattern;
16:  INtree=subP.INtree;
17:  length=|P|;
18:  calculate r(P,S);
19:  if (r(P,S)>=ρ) then C_length = C_length ∪ {P};
20:  superps=INSupport(P,S,INtree);
21:  for (j=0; j<|Σ|; j++)
22:    Q=superps_j.pattern;
23:    calculate r(Q,S);
24:    if (length+1<=l1) then
25:      if (r(Q,S) >= ρ*(n-(l1-1)*(w+1))/(n-length*(w+1))) then meta.enqueue(superps_j);
26:    else
27:      if (r(Q,S) >= ρ) then meta.enqueue(superps_j);
28:    end if
29:  end for
30: end while
31: return C = ∪ C_i

Although MAPB employs a pattern matching strategy (INSupport) to calculate the numbers of supports of candidate patterns, INSupport and the algorithm in [32] are significantly different. The reasons are given as follows. Firstly, the algorithm in [32] is used to calculate the number of supports for

only one pattern, while INSupport can simultaneously calculate the numbers of supports for many patterns with a common prefix pattern. Secondly, the algorithm in [32] does not need any previous result to solve the problem, while INSupport calculates the numbers of supports depending on previous results.

MAPB stores the frequent patterns in a queue. A new algorithm named MAPD stores the frequent patterns in a stack, and the frequent pattern tree is constructed based on DFS. Lines 11, 14, 25, and 27 of MAPB are then modified as:

11: if (patterns_j.sup/|S| >= ρ*(n-(l1-1)*(w+1))/n) then meta.push(patterns_j);
14: subP=meta.pop();
25: if (r(Q,S) >= ρ*(n-(l1-1)*(w+1))/(n-length*(w+1))) then meta.push(superps_j);
27: if (r(Q,S) >= ρ) then meta.push(superps_j);

4.3. A running example

An illustrative example is used to show how Algorithm 1 works.

Example 8. Suppose sequence S=s_0s_1s_2s_3s_4s_5s_6s_7s_8s_9=TTCCTCCGCG, patterns P0=T[0,3]C[0,3]A, P1=T[0,3]C[0,3]T, P2=T[0,3]C[0,3]C, P3=T[0,3]C[0,3]G, and the incomplete Nettree of P=T[0,3]C which is shown in the black frame in Figure 5. R(node) is used to express the NRP of a node according to Definition 10. We create the child nodes of n_2^2 from s_3 to s_6 since the gap constraints are [0,3]. Because s_3=C, P2=T[0,3]C[0,3]C, and node n_3^3 is not in the incomplete Nettree of P2, we create node n_3^3 in the incomplete Nettree of P2 and R(n_3^3)=R(n_2^2)=2. We create node n_4^3 in the incomplete Nettree of P1 and R(n_4^3)=R(n_2^2)=2. Similarly, we create nodes n_5^3 and n_6^3 in the incomplete Nettree of P2, R(n_5^3)=R(n_6^3)=R(n_2^2)=2. Then we create the child nodes of n_3^2 from s_4 to s_7. Because s_4=T and node n_4^3 is already in the incomplete Nettree of P1, we update the value of R(n_4^3)=R(n_4^3)+R(n_3^2)=4. Similarly, we know that R(n_5^3)=R(n_5^3)+R(n_3^2)=4, R(n_6^3)=R(n_6^3)+R(n_3^2)=4, and R(n_7^3)=R(n_3^2)=2. Finally, we create all child nodes of nodes n_5^2, n_6^2, and n_8^2.
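The BFS mining loop of MAPB, combined with the INSupport-style extension used in the running example above, can be sketched as follows. For brevity this sketch prunes on a raw support count instead of the paper's support ratio r(P,S) with the relaxed threshold; `mapb_sketch` and `min_sup` are illustrative names:

```python
from collections import deque

# A simplified BFS miner in the spirit of Algorithm 2 (MAPB). It keeps one
# {position: NRP} incomplete Nettree per pattern and extends it one letter
# at a time, exactly as INSupport does, but thresholds on a raw count.

SIGMA = "ATCG"

def mapb_sketch(seq, M, N, min_sup, max_len):
    freq = {}
    queue = deque()
    for a in SIGMA:                       # length-1 patterns and their trees
        tree = {i: 1 for i, c in enumerate(seq) if c == a}
        if sum(tree.values()) >= min_sup:
            queue.append((a, tree))
    while queue:
        pat, tree = queue.popleft()       # MAPD would pop from a stack (DFS)
        freq[pat] = sum(tree.values())
        if len(pat) == max_len:
            continue
        for a in SIGMA:                   # one-way scan as in INSupport
            child = {}
            for pos, nrp in tree.items():
                for j in range(pos + M + 1, min(pos + N + 2, len(seq))):
                    if seq[j] == a:
                        child[j] = child.get(j, 0) + nrp
            if sum(child.values()) >= min_sup:
                queue.append((pat + a, child))
    return freq

print(mapb_sketch("TTCCTCCGCG", 0, 3, min_sup=3, max_len=2))
```

On the running example's sequence this reports, among others, 8 supports for TC, matching Example 7.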
Figure 6 shows the incomplete Nettrees of P, P0, P1, P2, and P3. It is straightforward to see that the numbers of supports of P0, P1, P2, and P3 in S are 0, 4, 2+4+6+3=15, and 5+4=9, respectively.

Figure 6. Incomplete Nettrees of P, P0, P1, P2, and P3

4.4. Correctness and completeness

The main difference between MAPB and MAPD is that they use different strategies to create the frequent pattern tree. So we only prove the correctness and completeness of MAPB. The same proofs apply to the correctness and completeness of MAPD.

Theorem 2. (Correctness of MAPB) The output of MAPB is the frequent patterns.

Proof: We know that Algorithm 1 is correct according to Lemma 2. So MAPB can calculate the numbers of supports of super-patterns with prefix P correctly. Zhang et al. [2] gave the method to calculate ofs(P,S) and proved the correctness of the method. Therefore Theorem 2 is proved.

Theorem 3. (Completeness of MAPB) MAPB can find all frequent patterns whose lengths are less than l2.

Proof: We know that l1 is less than or equal to l2, and if the support ratio of pattern P is less than (n-(l1-1)*(w+1))/(n-(m-1)*(w+1))*ρ, all its super-patterns which contain it can be regarded as infrequent according to Theorem 1, where w=(M+N)/2, l2=(n+N)/(N+1), and n, m, l1, M, and N are the length of the sequence, the length of the pattern, the length of the longest frequent patterns, the minimum gap, and the maximum gap, respectively. So MAPB enqueues pattern P if its support ratio is not less than (n-(l1-1)*(w+1))/(n-(m-1)*(w+1))*ρ and checks whether its super-patterns are frequent or not. Therefore all frequent patterns can be found by MAPB. Hence Theorem 3 is proved.

Generally l1 is far less than l2 in our mining problem, so the algorithms can usually find all frequent patterns in a sequence.

4.5. Complexity analysis

Both MAPB and MAPD create the same frequent pattern tree. So the time complexities of these two algorithms are the same.
Algorithm 1 creates the child nodes of the incomplete Nettree of a sub-pattern. It is easy to see that the average size of the incomplete Nettrees is n/|Σ|, where n is the length of the sequence. Each node has W=N-M+1 children. So the time complexity of Algorithm 1 is O(W*n/|Σ|). Algorithm 1 runs Σ_{d=1}^{l1} len_d times, where len_d and l1 are the number of length-d frequent patterns and the length of the longest frequent pattern, respectively. Hence the time complexities of MAPB and MAPD are both O(Σ_{d=1}^{l1} len_d*W*n/|Σ|). However, the mining algorithm using GCS (MGCS) creates one row for pattern P and |Σ| rows for all super-patterns with prefix P to calculate the numbers of supports of these patterns. Each row has n elements and each element is calculated W times. So the time complexity of GCS is O((1+|Σ|)*n*W). Thus, the time complexity of MGCS is O(Σ_{d=1}^{l1} len_d*(1+|Σ|)*n*W).

The average size of an incomplete Nettree is n/|Σ| and each node stores its name and NRP. So the space complexity of an incomplete Nettree is O(n/|Σ|). MAPB uses a queue and the max size of the queue is O(|Σ|^x), where x is the position of max len_d. So the space complexity of MAPB is O(|Σ|^(x-1)*n). MAPD uses a stack and the max size of the stack is O(l1*|Σ|). So the space complexity of MAPD is O(l1*n). We can see that the space complexity of MGCS is O((1+|Σ|)*n). Table 2 gives a comparison of the time and space complexities among MGCS, MAPB, and MAPD.

Table 2. Comparison of the time and space complexities
Algorithm | Time complexity | Space complexity
MGCS | O(Σ_{d=1}^{l1} len_d*(1+|Σ|)*n*W) | O((1+|Σ|)*n)
MAPB | O(Σ_{d=1}^{l1} len_d*W*n/|Σ|) | O(|Σ|^(x-1)*n)
MAPD | O(Σ_{d=1}^{l1} len_d*W*n/|Σ|) | O(l1*n)

5. Top-K mining

Both MAPB and MAPD are mining algorithms based on the pattern matching approach to discover all possible frequent patterns. This kind of pattern mining approach can also be employed to find special frequent patterns effectively. For example, when we discover thousands of frequent patterns or more, it is difficult to use all these patterns. The most valuable patterns are the Top-K frequent patterns of various lengths and the longest frequent patterns, where K is a specified parameter, and this is called the Top-K mining problem.

Like Apriori, algorithm MPP-best [2] acts iteratively, generating length-(m+1) candidate patterns using length-m frequent patterns and verifying their supports, till there is no candidate pattern. So it is difficult to employ this kind of approach to construct an algorithm to solve this problem. An easy method is to propose an algorithm which finds all frequent patterns, sorts these patterns, and then outputs the Top-K frequent patterns to satisfy users' needs. Another similar method is to introduce an algorithm that checks all possible candidate patterns and selects the Top-K frequent patterns. But these methods take a long time to solve the Top-K mining problem in long sequences using MAPD, because many useless frequent patterns are found or useless possible candidate patterns are checked. To discover these patterns effectively, a heuristic algorithm named MAPBOK is proposed.

The principle of MAPBOK is as follows. If sub-pattern P is a Top-K length-m frequent pattern, its super-pattern Q is a Top-K length-(m+1) frequent pattern with high probability. Firstly, we mine the Top-(e*K) length-m frequent patterns and output the Top-K frequent patterns, where e is not less than 1. Then the Top-(e*K) frequent patterns are used to mine the length-(m+1) frequent super-patterns. We select the Top-(e*K) frequent super-patterns and iterate this process till no new frequent patterns can be found. Apparently, this method constructs the frequent pattern tree based on BFS.
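The beam-style principle of MAPBOK can be sketched as follows. This is an illustrative simplification that ranks patterns by raw support rather than the paper's support ratio; `mapbok_sketch` is a hypothetical name:

```python
# A minimal beam-search sketch of the MAPBOK idea: keep only the e*K best
# patterns per length (here ranked by support count) and extend just those.

SIGMA = "ATCG"

def mapbok_sketch(seq, M, N, K, e, max_len):
    def extend(tree, a):
        # One INSupport-style extension of an incomplete Nettree by letter a.
        child = {}
        for pos, nrp in tree.items():
            for j in range(pos + M + 1, min(pos + N + 2, len(seq))):
                if seq[j] == a:
                    child[j] = child.get(j, 0) + nrp
        return child

    level = {a: {i: 1 for i, c in enumerate(seq) if c == a} for a in SIGMA}
    top_k = {}
    for length in range(1, max_len + 1):
        ranked = sorted(level.items(), key=lambda kv: -sum(kv[1].values()))
        ranked = [(p, t) for p, t in ranked if t][: e * K]   # beam of e*K
        if not ranked:
            break
        top_k[length] = [p for p, _ in ranked[:K]]           # report Top-K
        level = {p + a: extend(t, a) for p, t in ranked for a in SIGMA}
    return top_k
```

Keeping e*K candidates while reporting only K is the heuristic safety margin: a pattern ranked just below the Top-K at length m may still produce a Top-K super-pattern at length m+1.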
The algorithm of MAPBOK is not given, since the change from MAPB is straightforward. It is easy to see that the time complexity of MAPBOK is O(e*K*l1*W*n/|Σ|), since MAPBOK runs Algorithm 1 O(e*K*l1) times. The space complexity of MAPBOK is O(e*K*n/|Σ|), since the max size of the queue is O(e*K).

6. Performance evaluation

From Table 1 in Section 2, we know that this study focuses on the same pattern mining problem as literatures [2] and [25]. Besides, the most related issue is literature [24], in which the problem is redefined and the Apriori property is used. Here we call the algorithm in [24] AMIN. Therefore we present experimental results by comparing MPP-best, MGCS, AMIN, MAPB, and MAPD. In order to show that our algorithms are superior to the state-of-the-art algorithms, we evaluate their performance based on the running time and memory requirements. For the Top-K mining problem for each length, we pay more attention to the mining accuracy of longer frequent patterns. Weighted accuracy can be calculated according to the following equation:

accuracy = (Σ_{i=3}^{l1} i*a_i) / (Σ_{i=3}^{l1} i*b_i)    (4)

where a_i, b_i, and l1 are the number of correct Top-K length-i frequent patterns, the number of Top-K length-i frequent patterns, and the length of the longest frequent patterns, respectively. Generally, b_i is K. But when c is less than K, b_i becomes c, where c is the number of length-i frequent patterns.

6.1. Experimental environment and data

The data used in this paper are DNA sequences provided by the National Center for Biotechnology Information website. Homo Sapiens AX82974, AL587, and AB3849 are chosen as our test data and can be downloaded from http://www.ncbi.nlm.nih.gov/nuccore/ax82974, http://www.ncbi.nlm.nih.gov/nuccore/al587, and http://www.ncbi.nlm.nih.gov/nuccore/AB3849, respectively. The source codes of MPP-best, MGCS, MAPB, MAPD, and MAPBOK can be obtained from http://wuc.scse.hebut.edu.cn/msppwg/index.html. All experiments were run on a laptop with a Pentium(R) Dual-Core T45 @ 2.3GHz CPU and 2 GB of RAM, running Windows 7.
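The weighted accuracy of Eq. (4) is straightforward to compute; the sketch below uses made-up counts purely for illustration:

```python
# Weighted accuracy of Eq. (4) as a small helper. a[i] and b[i] hold the
# counts for length-i patterns, i = 3 .. l1 (shorter lengths are ignored,
# and longer patterns get proportionally larger weight i).

def weighted_accuracy(a, b, l1):
    """a[i]: correct Top-K length-i patterns found; b[i]: Top-K length-i
    patterns (K, or the number of length-i frequent patterns if smaller)."""
    num = sum(i * a[i] for i in range(3, l1 + 1))
    den = sum(i * b[i] for i in range(3, l1 + 1))
    return num / den

# Illustrative (made-up) counts for l1 = 5 with K = 20:
a = {3: 20, 4: 18, 5: 15}
b = {3: 20, 4: 20, 5: 20}
print(weighted_accuracy(a, b, 5))
```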
Java Development Kit (JDK) 1.6.0 is used to develop all algorithms. In this study, the greatest length of frequent patterns is considered to be 13, the minimum and maximum gap constraints are 9 and 12, respectively, and the threshold ρ is 3*10^-5, because all these parameters were used in [2, 24, 25]. What is more, considering that the default stack memory of the JVM is too small, we assign 1.5GB of memory space to every algorithm, which is the maximal memory space of the Java virtual machine on the laptop. Table 3 shows all the sequences used in this paper.

Table 3. Bio-data sequences
Sequence | From | Length
S1 | Homo Sapiens AX82974 |
S2 | Homo Sapiens AX82974 | 2
S3 | Homo Sapiens AX82974 | 4
S4 | Homo Sapiens AX82974 | 8
S5 | Homo Sapiens AX82974 |
S6 | Homo Sapiens AL587 | 2
S7 | Homo Sapiens AL587 | 4
S8 | Homo Sapiens AL587 | 8
S9 | Homo Sapiens AL587 | 675
S10 | Homo Sapiens AB3849 | 5
S11 | Homo Sapiens AB3849 | 3
S12 | Homo Sapiens AB3849 | 6
S13 | Homo Sapiens AB3849 | 3892

With the above presented test environment and data, Table 4 and Table 5 show the mining results and the comparison of the max sizes of MAPB and MAPD, respectively.

Table 4. Mining results
Sequence | Length of the longest frequent patterns; numbers of frequent patterns for various lengths; total frequent patterns
S1 | 13, {4,16,64,256,1024,4096,3374,5678,54,623,242,55,2}, 26958
S2 | 12, {4,16,64,256,1024,4096,525,3436,35,85,8,3}, 24547
S3 | 10, {4,16,64,256,1024,4096,5965,937,59,3}, 23424
S4 | 10, {4,16,64,256,1024,4096,497,4283,24,}, 24955
S5 | 10, {4,16,64,256,1024,4096,4422,48,299,}, 24993
S6 | 10, {4,16,64,256,1024,4096,269,768,64,8}, 25769
S7 | 10, {4,16,64,256,1024,4096,2388,696,749,}, 25568
S8 | 10, {4,16,64,256,1024,4096,2947,633,666,}, 25387
S9 | 10, {4,16,64,256,1024,4096,3438,5767,64,}, 2527
S10 | 10, {4,16,64,256,1024,4096,257,797,563,2}, 25729
S11 | 11, {4,16,64,256,1024,4096,26,7634,64,54,}, 25439
S12 | 10, {4,16,64,256,1024,4096,2799,644,699,}, 25373
S13 | 10, {4,16,64,256,1024,4096,293,6558,672,}, 2564

Table 5. Comparison of max size
Sequence | Max size of MAPB | Max size of MAPD
S1 | 492 | 24
S2 | 542 | 25
S3 | 67 | 25
S4 | 59 | 26
S5 | 4582 | 26
S6 | 2878 | 27
S7 | 2754 | 26
S8 | / | 27
S9 | / | 26
S10 | 2768 | 28
S11 | 666 | 26
S12 | 2949 | 28
S13 | / | 28
Note: Empty (/) means an overflow error.

6.2. Running time evaluation

Here we compare the running times on several real DNA fragments of different lengths (shown in Figures 7~9). It is worth noting that the running time of MPP-best refers to the time data on the left axis and the other algorithms refer to the right axis, because MPP-best needs much more time to finish the mining task. We can analyze the running results from the following aspects:

Figure 7. Comparison of the running time on Homo Sapiens AX82974
Figure 8. Comparison of the running time on Homo Sapiens AL587
Figure 9. Comparison of the running time on Homo Sapiens AB3849
Note: Empty means an overflow error.

(1) From Figures 7~9, we can see that MPP-best is not only the slowest, but also cannot be used to find frequent patterns in long sequences. Only MPP-best uses the left axis to show its running time; the other algorithms use the right axis. The values on the left axis are far greater than those on the right axis, so we can easily see that MPP-best is the slowest. For example, MPP-best costs nearly 10 hours to find the frequent patterns in the sequence with length 4 (shown in Figure 8), while MAPD costs no more than a minute.
Furthermore, we observe that the running time of MPP-best grows faster than the length of a sequence. For example, the data table in Figure 7 shows that the running time of MPP-best grows about 29 times when the length of a sequence grows about 10 times from S1 to S5. Moreover, according to the running times in the tables of Figures 7~9, we can see that the running times of MGCS, AMIN, MAPB, and MAPD grow linearly with the length of a sequence. For example, when the length of a sequence grows about 10 times from S1 to S5, their running times also grow about 10 times. However, if we mine frequent patterns in long sequences using MPP-best, we also face the risk of running out of memory when the length of a sequence is longer than 6. The reason is that MPP-best employs a PIL (Partial Index List) structure which consumes a lot of memory to calculate the number of supports.

(2) It is clear that MGCS and AMIN run faster than MPP-best according to Figures 7~9. In [25], MCPaS is about 4 times faster than MPP-best in S1, but MGCS is about 6.84 times faster than MPP-best in our experiment. The reason is that MCPaS employs not only GCS but also an effective pruning strategy. If both MPP-best and MCPaS employed the same pruning strategy, MCPaS could not be about 4 times faster than MPP-best. AMIN is a little faster than MGCS according to the figures. The reason is that AMIN redefines the issue and can use the Apriori property, which is a more effective pruning strategy than the Apriori-like property. AMIN is a kind of approximate algorithm, and [24] detailed the difference between the approximate solutions and the accurate solutions. Moreover, both MGCS and AMIN can be used to mine in long sequences. The reason is that MGCS and AMIN use pattern matching approaches to calculate the number of supports for each candidate pattern, and this kind of approach consumes less memory than MPP-best.

(3) Both MAPB and MAPD run faster than MPP-best and MGCS although they employ the same pruning strategy.
MAPD runs about 45 and 28 times faster than MPP-best in S1 and S5, respectively. And MAPD runs about 5 times faster than MGCS in all sequences. So does MAPB. Hence these two