Boundary-Based Time Series Sorting

JOURNAL OF ELECTRONIC SCIENCE AND TECHNOLOGY OF CHINA, VOL. 6, NO. 3, SEPTEMBER 2008

Jun-Kui Li, Yuan-Zhen Wang, and Hai-Bo Li

Abstract
In many applications, it is desirable to sort the data. Most previous work on sorting is key-based; however, there are no apparent keys for time-series data, and therefore the classic sorting algorithms may fail in sorting time-series data. We propose a novel technique, called TS-Sort, to sort time-series sequences in a massive set. The proposed method first extracts the maximum and minimum boundaries of the set, then calculates the distance from each sequence to the boundaries, and finally sorts these values to determine the relative order of the sequences in the set. As an improvement, we propose a partition-based version of the algorithm, which puts the sequences into small groups and sorts the groups to get the final sorted set. Extensive experiments on both synthetic and real datasets show that our approach can put a time series set in order, and that the improved version of the method is up to 26.3% faster.

Index Terms
Boundary comparison, sorting, time series.

Manuscript received October 5, 2007; revised March 18. This work was supported by the Project of Secure and Intelligent Data Integration Platform of China under Grant No. The authors are with the College of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China (e-mail: 163.com).

1. Introduction

With the growing popularity of time-series data, research on processing time-series data has received significant attention in recent years [1]. In this work, we consider an important yet interesting problem: given a set of time series, how do we sort the sequences in the set? Applications of sorting time series include but are not limited to the following aspects:

Support for top-k search. The most notable example is the support for top-k search in the similarity search of time series, i.e., searching for the k sequences in the time series database that are most similar to the given query sequence. If we have found a set of candidate sequences, the k sequences with the minimal distance to the query sequence will be picked out. The usual method for this situation is to sequentially scan the set, calculate the distances, and find the k sequences with minimal distance. In this sense, if we have the candidate sequences sorted, the k sequences can be easily found. Moreover, we only need to sort once, and can then reuse the sorted set across multiple searches.

Help to build a user-friendly search engine. The requirements of users may vary widely; therefore, search algorithms generally require the setting of many parameters, and a great number of resulting sequences will be returned to the user if the parameters are inappropriately specified. Such cases occur very frequently, since the user may not have a full prior understanding of the underlying data. When many sequences are returned, it may still be difficult for the user to find the preferred ones. The search engine will be more user-friendly if an order-by utility is provided. Instead of looking through a great number of unsorted sequences by paging down and up, the user can retrieve the desired sequences by sorting the results. As another example, the user may wish to display the results in a more human-readable form, and time series sorting can be adopted to satisfy this requirement.

Sorting has been studied extensively in the area of information retrieval. A variety of classic sorting methods have been proposed [2], such as QuickSort, HeapSort, ShellSort, BubbleSort, BucketSort, etc. However, it should be noted that sorting time series differs from these approaches, since the traditional sorting methods are key-based, where the comparisons of entries are performed on a few attributes of the records (usually only one key). A time series, in contrast, is a long sequence of real values with no explicit keys inherent in it, so the key-based methods may not be applicable to time series sorting.

To the best of our knowledge, the problem of sorting time series sequences has not been well addressed. Inspired by the similarity search of time series [3]-[5], where the distance between time series is used to measure the dissimilarity (difference) of the sequences, we propose to adopt the distance between time series for sorting. However, since the distance between two sequences says nothing about their relative order in the set, how do we sort the sequences by distance? Noting that the set of sequences is enclosed by the set boundaries, we can use the distance to the boundary as the basis of sorting, and we call this boundary comparison.
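The idea of boundary comparison can be illustrated with a minimal Python sketch. It assumes equal-length sequences, the Euclidean distance, and the pointwise minimum of the set as the boundary (as in Definition 2); the function names and the toy values are hypothetical and are not taken from the paper's implementation.

```python
import math


def boundaries(series_set):
    """Return the pointwise (Max, Min) boundary sequences of equal-length series."""
    columns = list(zip(*series_set))
    return [max(col) for col in columns], [min(col) for col in columns]


def euclidean(a, b):
    """Point-by-point Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def sort_by_min_boundary(series_set):
    """Order sequences by ascending distance to the Min boundary."""
    _, min_boundary = boundaries(series_set)
    return sorted(series_set, key=lambda t: euclidean(t, min_boundary))


# Three toy sequences (illustrative values, not from the paper's datasets).
toy_set = [[3.0, 4.0, 5.0], [1.0, 1.0, 2.0], [2.0, 3.0, 2.0]]
max_b, min_b = boundaries(toy_set)
ordered = sort_by_min_boundary(toy_set)
print(ordered)
```

Each sequence is reduced to a single number, its distance to a common reference sequence, so an ordinary key-based sort becomes applicable even though the sequences themselves carry no key.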

In this work, we propose a partition-based method to sort time series. We call our method TS-Sort (time-series sort). Overall, our contributions in this work can be summarized as follows:
1) We introduce the boundaries of a time series set, and compare the distance of each sequence with the boundary.
2) We propose a method, named TS-Sort, to quickly sort a set of time series.
3) We experimentally evaluate the TS-Sort method on both synthetic and real datasets. The results of the experiments validate the utility of TS-Sort.

The rest of the paper is organized as follows. Section 2 provides background for our work. In Section 3 we present TS-Sort in detail, and show how a set of sequences can be efficiently sorted. The experimental results and their analysis are given in Section 4. Finally, we offer our conclusions and discuss future work in Section 5.

2. Related Work and Definitions

2.1 Related Works

Traditional sorting methods are key-based, and as mentioned above, key-based sorting methods may fail in sorting time series. Many works on dimensionality reduction of time series map a whole long time series to a point in an N-dimensional vector space, and treat several dimensions of the transformed space as the keys of the data, which provides a way to represent the time series with a feature vector. Among them are the discrete Fourier transform (DFT) [6], discrete wavelet transform (DWT) [7], singular value decomposition (SVD) [8], piecewise aggregate approximation (PAA) [9], and adaptive piecewise constant approximation (APCA) [10]. These methods have different features in time series processing; however, none of them produces exact results.

For time series comparison, different distance measures have been proposed [1], and the Euclidean distance (more generally, the Lp norms) has been the most commonly adopted [4]. The distance is calculated in a point-by-point manner and works well when the time series have the same unit of scale. However, when the lengths of two sequences are different, or when it is required to match sequences that are locally out of phase, dynamic time warping (DTW) is usually used instead [3],[5],[11],[12].

In the field of boundaries of time series, Keogh et al. proposed the envelope for time series sequences [4], where the sequence is enclosed in an envelope of two sequences U and L, depicting the upper and lower boundaries of the sequence. The work in [12] introduced a notion named skyline bounding region (SBR), which uses a region to approximate and represent a group of time series according to their collective shape.

2.2 Problem Definitions

We are now in a position to give a formal description of the problem under consideration, time series sorting. Intuitively, sorting can be formally depicted with a tuple $\langle S, \prec \rangle$, where $S$ is the set containing all the elements and $\prec$ is the partial order relation on the elements in $S$, which defines the relative order of the elements in the sorted set. For any two entries $O$ and $O'$ in $S$, if $O \prec O'$, the order of $O$ is smaller than that of $O'$.

Definition 1. Sorting time series.
Given a set of time series $S$, find a partial order relation $\langle S, \prec \rangle$ of the sequences in $S$, such that for all $T, T' \in S$, $T \prec T'$ or $T' \prec T$.

In this work, we propose to calculate the distance of the sequence to the set boundary.

Definition 2. Boundary of time series set.
Given a set of time series $S = \{T_1, T_2, \ldots, T_c\}$, where $T_i = (t_{i1}, t_{i2}, \ldots, t_{in})$, the minimal boundary $Min$ satisfies $Min_k \le t_{ik}$ and the maximum boundary $Max$ satisfies $Max_k \ge t_{ik}$ ($i = 1, 2, \ldots, c$; $k = 1, 2, \ldots, n$).

The Max boundary consists of the data points that are maximum in the set, i.e.,

$$Max_k = \max\{t_{1k}, t_{2k}, \ldots, t_{ck}\}. \tag{1}$$

Similarly, the Min boundary is formed by the data points that are minimum in the set, i.e.,

$$Min_k = \min\{t_{1k}, t_{2k}, \ldots, t_{ck}\}. \tag{2}$$

With the introduction of the boundary of a time series set, the definition of sorting time series changes to Definition 3.

Definition 3. Sorting time series with boundary comparison.
Given a set of time series $S$, the boundary sequence $B$, and the distance measure $d$, find a partial order relation $\langle S, \prec \rangle$ of the sequences in $S$, such that for all $T, T' \in S$,

$$d(T, B) < d(T', B) \Rightarrow T \prec T'. \tag{3}$$

3. Proposed Methods

3.1 Direct Approach

Since there are two boundaries for the time series set (i.e., Max and Min), the boundary comparison can proceed in two ways:

A. Comparing with the Max Boundary
The distance between the sequence and the Max boundary of the set is calculated, i.e.,

$$d(T, Max) > d(T', Max) \Rightarrow T \prec T'. \tag{4}$$

Note that in (4), the larger the distance from a sequence to the Max boundary is, the smaller the sequence order is.

B. Comparing with the Min Boundary
The distance between the sequence and the Min boundary of the set is calculated, i.e.,

$$d(T, Min) < d(T', Min) \Rightarrow T \prec T'. \tag{5}$$

From (5), the larger the distance from a sequence to the Min boundary is, the greater the sequence order is.

Table 1 outlines the idea of time series sorting. After extracting the set boundaries, each sequence in the set is compared with the boundary, and the sequences are sorted by distance.

Table 1: Outline of time series sorting
Algorithm: TS-Sort (Direct)
Input: $S$: set of time series
Output: The sorted set of time series
Process:
1. $Sorted\_S \leftarrow \emptyset$; // initialize
2. Calculate the set boundaries with (1) or (2);
3. for $T \in S$ do
4.   Calculate the distance of $T$ to Max or Min;
5.   Sort and get order, $o_T$, of $T$ with (4) or (5);
6.   Append($Sorted\_S$, $o_T$, $T$);
7. end for;
8. return $Sorted\_S$.

3.2 Improved Approach

Problems with the Direct Approach. Two problems with the direct approach arise:
1) Only one boundary is considered. Each time, only one set boundary, Max or Min, is used in the calculation.
2) The distance values to the boundary are sorted within the whole set. Note that in line 5 of the algorithm in Table 1, the values are sorted within the whole set. However, the number of distance comparisons can be dramatically reduced if the sequences are partitioned into small groups.

The Method of Partition. As an improvement, we propose using a partition method to reduce the number of distance comparisons. After calculating the distance from a sequence to the boundary, we use (6) to place the sequence into its target group:

$$g(T) = \left\lfloor \frac{K \, d(T, Min)}{d(Max, Min)} \right\rfloor \tag{6}$$

where $K+1$ is the number of groups; we discuss the selection of an appropriate value for $K$ later. Similarly, we can also use $d(T, Max)$ in (6), which corresponds to the case of comparing with the Max boundary in the direct approach.

Proposition 1. For $K > 0$ and all $T \in S$,

$$0 \le g(T) \le K. \tag{7}$$

Proof. With (1) and (2), we have $0 \le d(T, Min) \le d(Max, Min)$, thus

$$0 \le \frac{d(T, Min)}{d(Max, Min)} \le 1,$$

and for $K > 0$,

$$0 \le \frac{K \, d(T, Min)}{d(Max, Min)} \le K.$$

Proposition 2. For all $T, T' \in S$, $g(T) < g(T') \Rightarrow T \prec T'$.
Proof.
$$g(T) < g(T') \Rightarrow \frac{K \, d(T, Min)}{d(Max, Min)} < \frac{K \, d(T', Min)}{d(Max, Min)} \Rightarrow d(T, Min) < d(T', Min) \Rightarrow T \prec T'.$$

Lemma 1. Equation (6) divides the distance space $D$ into $K+1$ groups $(D_0, D_1, \ldots, D_K)$, where $D_i = \{ d(T, Min) \mid g(T) = i \}$ and $D_0 < D_1 < \cdots < D_K$.
Proof. With (7), we get the $K+1$ groups. We need to prove that for all $i, j$ ($0 \le i < j \le K$), $D_i < D_j$. Suppose $d(T, Min) \in D_i$ and $d(T', Min) \in D_j$. As $i < j$, i.e., $g(T) < g(T')$, Proposition 2 gives $T \prec T'$, hence $D_i < D_j$.

The implication of Lemma 1 is exceedingly important. It provides the theoretical basis for partition-based sorting of time series. Once a sequence is put into a group, say the $i$-th ($0 \le i \le K$) group, Lemma 1 ensures that the order of the sequence in the final set will be greater than the order of those in groups $0, 1, \ldots, i-1$, and smaller than the order of those in groups $i+1, \ldots, K$. With this in mind, we present the improved version of TS-Sort in Table 2.

Table 2: Improved time series sorting
Algorithm: TS-Sort (improved)
Input: $S$: set of time series
       $K$: auxiliary parameter for set partition
Output: The sorted set of time series
Process:
1. $Sorted\_S_k \leftarrow \emptyset$ ($k = 0, 1, \ldots, K$); // initialize
2. Calculate the set boundaries with (1) and (2);
3. $d_{bound} \leftarrow d(Max, Min)$;
4. for $T \in S$ do
5.   $k \leftarrow \lfloor K \, d(T, Min) / d_{bound} \rfloor$;
6.   Sort and get order, $o_T$, of $T$ with (5);
7.   Append($Sorted\_S_k$, $o_T$, $T$);
8. end for;
9. $Sorted\_S \leftarrow \bigcup_k \{Sorted\_S_k\}$;
10. return $Sorted\_S$.

In the improved algorithm, after extracting the set boundaries Max and Min, we calculate the distance between them, $d_{bound}$. Each sequence is distributed to the appropriate group with (6), and then the distance comparison, instead of being performed over the whole set, is confined to the sequences in the same group.

Time Complexity. We now analyze the time complexity of the TS-Sort algorithm in Table 2. Let $c$ denote the number of sequences in the dataset and $n$ the dimensionality of the sequences. First, it takes $O(cn)$ to compute the set boundaries. The complexity of the distance calculation (in lines 3 and 5) depends on the distance measure adopted by the algorithm. For the Euclidean distance (or the more general Lp distance), the complexity is $O(n)$, while for the DTW distance, the complexity is $O(n^2)$. During the step of sorting distances, we can use a priority queue to keep track of the sorted distance values in the group, so it takes $O(\log_2 c)$ to place a sequence into the group. The final sorted result set is a sequential combination of all the group-wise sets, which can be finished in $O(1)$. Therefore, the total time complexity is:

$$\begin{cases} O(cn + c \log_2 c), & \text{Euclidean } (L_p) \\ O(cn^2 + c \log_2 c), & \text{DTW} \end{cases} \tag{8}$$

4. Experiment Study

In this section, we examine the methods presented in this work with a comprehensive set of experiments.

4.1 Experiment Study

We conducted experiments on synthetic and real-life datasets. The datasets used are as follows:

Artificial dataset. The dataset is created using a random time series generator that produces time series as $t_j = t_{j-1} + (-1)^j Z_j$, where $Z_j$ ($j = 1, 2, \ldots$) are independent, identically distributed random variables taken in the range [0, 1]. Each sequence has a length of 100, and the base value $T_{10}$ is 30. To ensure that there are fluctuations in the set, the base value of each subsequent sequence, $T_{i0}$ ($i = 2, 3, \ldots, c$), increases by 0.05 as the sequence number grows.

Synthetic Control Chart. This dataset contains 600 examples of synthetically generated control charts, which can be divided into six classes: normal, cyclic, increasing trend, decreasing trend, upward shift and downward shift.

For these experiments, we used a personal computer; the system configuration is listed in Table 3.

Table 3: Experiment configuration
Configuration item      Item value
Processor               Intel Pentium
Operating System        GNU/Linux (core: )
RAM                     256 megabytes
Hard disk               Seagate, 40 gigabytes
Programming Language    ANSI C
Compiler                GNU gcc

To demonstrate the results of TS-Sort on the datasets, we adopted a distance graph to describe the distance between the neighboring sequences in the sorted set. The position in the distance graph represents the distance between the sequence $T$ and the sequence $Bound$, i.e., $d(T, Bound)$, where $Bound$ is the sequence Min or Max. Our experiments were conducted on the datasets with both the Euclidean and the DTW distance.

4.2 Results of Time Series Sorting

The Euclidean distance graphs of the artificial dataset before and after sorting are shown in Fig. 1. The Euclidean distance graph before sorting is chaotic, and it is difficult to extract the pattern in the sequences directly. However, after we sort the dataset with TS-Sort, the sequences in the sorted set are displayed in an orderly manner. When the sequences are sorted by their distance to the Min sequence in ascending order, the distances to the Max sequence are approximately sorted in descending order, and vice versa.

Fig. 1. Euclidean distance graph of sorting artificial dataset: before sorting and after sorting (compared with Max).

Fig. 2 presents the results of sorting with the DTW distance on the artificial dataset. A similar trend is observed. Compared with the unsorted data, the sorted sets are displayed in a more orderly manner.
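As a rough, self-contained illustration of the sorting runs discussed above, the following Python sketch applies both the direct approach (Table 1) and the partition-based approach (Table 2) to a small artificial set and checks that they yield the same order. Several points are assumptions rather than details given in the paper: only the Euclidean distance is used, the generator `make_artificial_set` is one reading of the artificial dataset description, the set size of 200 is arbitrary, and each group is sorted with Python's built-in sort instead of a priority queue. The authors' own implementation was written in ANSI C and is not reproduced here.

```python
import math
import random


def boundaries(series_set):
    """Pointwise (Max, Min) boundaries, following (1) and (2)."""
    columns = list(zip(*series_set))
    return [max(col) for col in columns], [min(col) for col in columns]


def euclidean(a, b):
    """Point-by-point Euclidean distance between equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def make_artificial_set(count, length=100, base=30.0, step=0.05, seed=0):
    """Hypothetical generator: t_j = t_{j-1} + (-1)^j * Z_j with Z_j in [0, 1)."""
    rng = random.Random(seed)
    series_set = []
    for i in range(count):
        value = base + step * i  # base value grows with the sequence number
        series = []
        for j in range(1, length + 1):
            value += (-1) ** j * rng.random()
            series.append(value)
        series_set.append(series)
    return series_set


def ts_sort_direct(series_set):
    """Direct approach: one sort over the distances to the Min boundary."""
    _, min_b = boundaries(series_set)
    return sorted(series_set, key=lambda t: euclidean(t, min_b))


def ts_sort_improved(series_set, k):
    """Partition-based approach: K+1 groups by (6), each sorted separately."""
    max_b, min_b = boundaries(series_set)
    d_bound = euclidean(max_b, min_b)
    groups = [[] for _ in range(k + 1)]
    for t in series_set:
        d = euclidean(t, min_b)
        # Group index as in (6); guarded against d_bound == 0 and rounding.
        g = min(k, int(k * d / d_bound)) if d_bound > 0 else 0
        groups[g].append((d, t))
    result = []
    for group in groups:  # concatenating groups in index order (Lemma 1)
        group.sort(key=lambda pair: pair[0])
        result.extend(t for _, t in group)
    return result


data = make_artificial_set(count=200)
direct = ts_sort_direct(data)
improved = ts_sort_improved(data, k=8)  # K = 8 as used for the artificial dataset

_, min_b = boundaries(data)
dists = [euclidean(t, min_b) for t in improved]
print("same order as direct approach:", improved == direct)
print("distances to Min ascending:", all(a <= b for a, b in zip(dists, dists[1:])))
```

Because each group only contains sequences whose scaled distance falls in the same integer interval, concatenating the individually sorted groups gives the same result as one global sort, while each sort operates on fewer elements.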

Fig. 2. DTW distance graph of sorting artificial dataset: before sorting and after sorting (compared with Max).

The results of sorting the synthetic control chart dataset with the Euclidean distance and the DTW distance are shown in Fig. 3 and Fig. 4, respectively. As introduced before, the sequences in the synthetic control chart dataset are traditionally divided into six groups by their shapes, and the six sections in the distance graphs before sorting reflect this. However, it can be seen in the distance graphs after sorting that the distance spaces of the sequences can be roughly categorized into three classes. Sequences with order ranging from 1 to 200 are in class 1, those ranging from 201 to 400 are in class 2, and the remaining sequences are in class 3. This indicates that TS-Sort can be used to discover patterns inherent in the time series data that may not be known in advance.

Fig. 3. Euclidean distance graph of sorting synthetic control chart dataset: before sorting and after sorting (compared with Max).

Fig. 4. DTW distance graph of sorting synthetic control chart dataset: before sorting and after sorting.

4.3 The Elapsed Time for Sorting

In this part, we present the results on the time spent during sorting. As we repeated each experiment 50 times, the results reported here are the average elapsed time over experiments with the same parameter configuration.

Fig. 5 compares the elapsed time of the Direct approach and the Improved approach on both the Artificial dataset and the Synthetic Control Chart dataset. When sorting the artificial dataset, the group parameter K was set to 8, and K was 6 when sorting the synthetic control chart dataset. In each case, we performed four separate experiments: experiments 1 and 2 used the Euclidean distance as the distance measure, and experiments 3 and 4 used the DTW distance. Experiments 1 and 3 report the results of sorting with the Direct approach, and experiments 2 and 4 report the results of sorting with the Improved approach. For each experiment, we present three types of time, i.e., the time for the boundary calculation (t_bound), the time for sorting (t_sort) and the time for group division (t_group). Note that there is no group division in the Direct approach, so t_group is 0.0 in experiments 1 and 3.

Fig. 5. Elapsed time results.

In Fig. 5, we can see that the Improved approach outperforms the Direct approach on both distance measures, generally by above 26.3%. This is because sorting with the Improved approach is performed within relatively smaller groups, which reduces the overall number of comparisons.

5. Conclusions

In this work, we study the problem of sorting time series. The big challenge of the problem is that a time-series sequence has no external keys, which are required by the traditional key-based methods. We calculate the distance to the set boundaries, and sort the values to put the set in order, which is the main idea of our direct approach. As an improvement, we also propose the partition-based TS-Sort method. Extensive experiments show that TS-Sort can be adopted as a useful tool for sorting a time series set, and the performance gain is above 26.3% when using the improved version of the method. For future work, we intend to couple the method with other mining methods and explore the possibility of sorting time series in knowledge discovery of time-series data.

References

[1] E. Keogh and S. Kasetty, "On the need for time series data mining benchmarks: a survey and empirical demonstration," in Proc. 8th ACM SIGKDD International Conf. on Knowledge Discovery and Data Mining, Edmonton, Canada, 2002.
[2] H. C. Thomas, E. L. Charles, and L. R. Ronald, Introduction to Algorithms, 2nd edition, MIT Press: Massachusetts.
[3] D. J. Berndt and J. Clifford, "Finding patterns in time series: a dynamic programming approach," in Proc. of Advances in Knowledge Discovery and Data Mining, AAAI/MIT, Oregon, Portland, 1996.
[4] E. Keogh, T. Palpanas, and V. B. Zordan, "Indexing large human-motion databases," in Proc. of 30th International Conf. on Very Large Data Bases, Toronto, Canada, 2004.
[5] Y. Sakurai, M. Yoshikawa, and F. Christos, "FTW: fast similarity search under the time warping distance," in Proc. of 24th ACM SIGMOD International Conf. on Principles of Database Systems, Maryland, USA, 2005.
[6] R. Agrawal, C. Faloutsos, and A. Swami, "Efficient similarity search in sequence databases," in Proc. of 4th International Conf. of Foundations of Data Organization and Algorithms, Chicago, Illinois, USA, 1993.
[7] C. Kin-Pong and W. F. Ada, "Efficient time series matching by wavelets," in Proc. of 15th IEEE International Conf. on Data Engineering, Sydney, Australia, 1999.
[8] F. Korn, H. Jagadish, and C. Faloutsos, "Efficiently supporting ad hoc queries in large datasets of time sequences," in Proc. of ACM SIGMOD International Conf. on Management of Data, Tucson, AZ, USA, 1997.
[9] E. Keogh and M. Pazzani, "A simple dimensionality reduction technique for fast similarity search in large time series databases," in Proc. of 4th Pacific-Asia Conf. on Knowledge Discovery and Data Mining, Kyoto, Japan, 2000.
[10] E. Keogh, K. Chakrabarti, and M. Pazzani, "Locally adaptive dimensionality reduction for indexing large time series databases," in Proc. ACM SIGMOD Conference on Management of Data, Santa Barbara, USA, 2001.
[11] E. Keogh, "Exact indexing of dynamic time warping," in Proc. of 28th International Conf. on Very Large Data Bases (VLDB), Hong Kong, China, 2002.
[12] L. Quanzhong, V. L. Fernando, and B. Moon, "Skyline index for time series data," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 6.

Jun-Kui Li was born in Jingdezhen, Jiangxi, China. He received the B.S. degree from Wuhan University of Technology in 2003, and is now a Ph.D. candidate with the College of Computer Science and Technology, Huazhong University of Science and Technology (HUST). His research interests include data mining and machine learning.

Yuan-Zhen Wang was born in Wuhan, Hubei, China. She is now a professor with the College of Computer Science and Technology, HUST. Her research interests include modern database technology and data mining.

Hai-Bo Li was born in Wuhan, Hubei, China. He is now a Ph.D. candidate in the College of Computer Science and Technology, HUST. His research interests include data mining and database theory.