A Similarity Measure Method for Symbolization Time Series

Research Journal of Applied Sciences, Engineering and Technology 5(5): 1726-1730, 2013
ISSN: 2040-7459; e-ISSN: 2040-7467
© Maxwell Scientific Organization, 2013
Submitted: July 27, 2012 Accepted: September 03, 2012 Published: February 11, 2013

Qiang Niu and Zhigang Li
Department of Computer Science and Technology, China University of Mining and Technology, Xuzhou 221116, China

Abstract: Similarity measurement is the basic task underlying time series data mining. The LCSS measure has obvious limitations when comparing two time series of different lengths and in its selection of a linear function. The ELCS measure is proposed, which normalizes the sequences and introduces a scale factor to limit the search path in the similarity matrix. Experiments with a hierarchical clustering algorithm show that the improved measure makes up for the shortcomings of LCSS, improves the efficiency and accuracy of clustering and improves the time complexity.

Keywords: Hierarchical cluster, LCSS, similarity measure, time series

INTRODUCTION

Time series have always been an important and interesting research field because they appear so frequently in different applications. The time series similarity measure proposed by Agrawal et al. (1993) has become a hot research topic because of its wide use in data mining tasks such as time series classification, clustering and anomaly detection. Many methods have been developed for searching time series in large data sets, and the similarity measure of time series is a very important task in the data mining process.

Several similarity measures for time series exist. Faloutsos et al. (1994) proposed a fast subsequence matching method based on the Euclidean distance metric, in which two time series are treated as two points of the same dimension and a threshold decides whether they are similar. Euclidean distance requires two sequences of equal length and ignores the temporal characteristics of time series, which limits its application to time series similarity measurement. Chung et al. (2004) use a weighting method within the Euclidean distance and eliminate the transform offset, but some parameters must be set manually. Berndt and Clifford (1994) introduced the Dynamic Time Warping distance (DTW) to time series similarity measurement. DTW performs well when comparing the local characteristics of two sequences of unequal length, but its time consumption is too high. In addition, DTW cannot establish the correspondence between feature points of two time series, such as peaks, troughs and inflection points, so its accuracy is low. Some researchers (Yi et al., 1998; Kim et al., 2001) improved DTW by introducing index technology, which reduces its time complexity. The index-based approach for similarity search supporting time warping in large sequence databases (Kim et al., 2001) proposed the Segment-wise Time Warping distance (STW), which greatly decreases the time complexity of DTW but also reduces the accuracy of the similarity measure. Latecki et al. (2005) put forward a minimum variance matching method to obtain flexible similarity matching. In 1994, the Longest Common Subsequence (LCS) (Paterson and Dancik, 1994) was introduced to time series similarity measurement. Bollobas et al. (1997) put forward LCSS on the basis of LCS, which gives a better similarity measure for time series with amplitude translation, timeline stretching and bending deformation. Other researchers have proposed slope-based, model-based and event-based similarity measures.

This research studies the similarity measure problem for symbolic time series. First, the study introduces the definitions and the classical similarity measures. Then, we propose a new similarity measure algorithm based on the LCSS algorithm: unlike LCSS, the new algorithm avoids the selection of a linear function, improves the accuracy of measurement and greatly improves time efficiency compared with the DTW measure. Finally, experiments verify the proposed algorithm.
Corresponding Author: Qiang Niu, Department of Computer Science and Technology, China University of Mining and Technology, Xuzhou 221116, China

LCS AND LCSS SIMILARITY MEASURE

LCS measure: Let $X, Y \in A$ be time series samples with vector forms $X = \{x_1, x_2, \ldots, x_n\}$ and $Y = \{y_1, y_2, \ldots, y_n\}$. Their longest common subsequences are $X' = \{x_{i_1}, x_{i_2}, \ldots, x_{i_l}\}$ and $Y' = \{y_{j_1}, y_{j_2}, \ldots, y_{j_l}\}$, where $l$ is the length of the common subsequence, and they satisfy the following conditions:

For $1 \le k < l$: $i_k < i_{k+1}$ and $j_k < j_{k+1}$
For $1 \le k \le l$: $x_{i_k} = y_{j_k}$

The similarity between time series $X$ and $Y$ is defined as $Sim(X, Y) = l/n$.

LCSS measure: The LCS measure avoids the similarity problems caused by short-term mutations or interruptions in a time series. However, it does not give good similarity results for time series with amplitude translation, timeline stretching and bending deformation. The LCSS measure is designed to improve on these problems. Let $\delta > 0$ be an integer constant, $0 < \varepsilon < 1$ a real constant and $f \in L$, where $L$ is a set of linear functions. Given two sequences $X = \{x_1, x_2, \ldots, x_n\}$ and $Y = \{y_1, y_2, \ldots, y_n\}$, let $X' = \{x_{i_1}, x_{i_2}, \ldots, x_{i_l}\}$ and $Y' = \{y_{j_1}, y_{j_2}, \ldots, y_{j_l}\}$ be the longest subsequences in $X$ and $Y$ respectively such that:

For $1 \le k < l$: $i_k < i_{k+1}$ and $j_k < j_{k+1}$
For $1 \le k \le l$: $|i_k - j_k| \le \delta$
For $1 \le k \le l$: $y_{j_k}/(1+\varepsilon) \le f(x_{i_k}) \le y_{j_k}(1+\varepsilon)$

Let $S_{f,\varepsilon,\delta}(X, Y) = l/n$. Then the similarity between the time series is defined as formula (1):

$$Sim_{\varepsilon,\delta}(X, Y) = \max_{f \in L} S_{f,\varepsilon,\delta}(X, Y) \quad (1)$$

EXTENDED LONGEST COMMON SUBSEQUENCE MEASURE (ELCS)

Although the LCSS measure has some advantages, the following issues remain.

The LCSS measure is derived from a solution set: for different time series data sets, the selected linear function $f$ will differ. In other words, the sequence similarity can only be measured more accurately if the corresponding linear function has been obtained in advance from a training data set. The training set and the test set always differ, so the result is less than ideal.

LCSS can be applied to compare two sequences of different lengths, but because of $\delta$ the length difference of time series $X = \{x_1, x_2, \ldots, x_m\}$ and $Y = \{y_1, y_2, \ldots, y_n\}$ is restricted, that is, $|m - n| \le \delta$. Otherwise, candidate series go undetected (Keogh and Pazzani, 2001). Thus, the support of the LCSS algorithm for timeline stretching is very limited.

To address these problems of the LCSS measure, this study presents an Extended Longest Common Subsequence (ELCS) measure. Let $\delta \ge 1$ and $\theta \ge 0$ be real constants. Given two sequences $X = \{x_1, x_2, \ldots, x_m\}$ and $Y = \{y_1, y_2, \ldots, y_n\}$, normalization places all sequence values in $[0, 1]$ and yields $X_{normal} = \{x'_1, x'_2, \ldots, x'_m\}$ and $Y_{normal} = \{y'_1, y'_2, \ldots, y'_n\}$. Let $X'_{normal} = \{x'_{i_1}, x'_{i_2}, \ldots, x'_{i_l}\}$ and $Y'_{normal} = \{y'_{j_1}, y'_{j_2}, \ldots, y'_{j_l}\}$ be the longest subsequences in $X_{normal}$ and $Y_{normal}$ respectively such that:

For $1 \le k < l$: $i_k < i_{k+1}$ and $j_k < j_{k+1}$
For $1 \le k \le l$: the positions $i_k$ and $j_k$ satisfy two bounds determined by $\delta$, $m$ and $n$, which confine the search path to a diamond-shaped region of the similarity matrix
For $1 \le k \le l$: $|x'_{i_k} - y'_{j_k}| \le \theta$

Let $S_{\delta,\theta}(X_{normal}, Y_{normal}) = \frac{2l}{m + n}$. Then the similarity between the time series is defined as formula (2):

$$Sim_{\delta,\theta}(X, Y) = \max\{S_{\delta,\theta}(X_{normal}, Y_{normal})\} \quad (2)$$

As defined above, the parameter $\delta$ concentrates the search path of the similarity matrix in a diamond-shaped area, which both prevents over-matching of the sequences and reduces the time complexity. Because the selected search area is closely tied to the length of each sequence, sequences no longer go undetected, and the measure adapts well to timeline stretching and deformation. The parameter $\theta$ gives the similarity measurement, after normalization, further flexibility in the matching space. The sequences are normalized as in formula (3):

$$X_{normal} = \frac{X - \min_{i \in [1,m]} x_i}{\max_{i \in [1,m]} x_i - \min_{i \in [1,m]} x_i} \quad (3)$$

This avoids the difficulty of selecting the linear function $f$ while retaining the numerical trend information of the sequence.
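The ELCS definition can be illustrated with a small dynamic-programming sketch in Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names `min_max_normalize`, `in_window` and `elcs_similarity` are hypothetical, the default `theta` is purely illustrative, and the exact shape of the δ-controlled search region is assumed here to be a diamond bounded by index ratios measured from both ends of the sequences, since the text only states that the region is diamond-shaped and depends on δ and the sequence lengths.

```python
from typing import List, Sequence


def min_max_normalize(series: Sequence[float]) -> List[float]:
    """Scale a series into [0, 1] as in formula (3)."""
    lo, hi = min(series), max(series)
    if hi == lo:
        # Constant series: formula (3) would divide by zero, so map to 0.0 (assumption).
        return [0.0] * len(series)
    return [(v - lo) / (hi - lo) for v in series]


def in_window(i: int, j: int, m: int, n: int, delta: float) -> bool:
    """ASSUMED search region: 1-based positions i and j may differ by at most
    a factor delta, measured both from the start and from the end."""
    from_start = i <= delta * j and j <= delta * i
    from_end = (m - i) <= delta * (n - j) and (n - j) <= delta * (m - i)
    return from_start and from_end


def elcs_similarity(x: Sequence[float], y: Sequence[float],
                    delta: float = 2.2, theta: float = 0.1) -> float:
    """Return 2l / (m + n), where l is the longest common subsequence length
    of the normalized series under the window and tolerance constraints."""
    m, n = len(x), len(y)
    if m == 0 or n == 0:
        return 0.0
    xn, yn = min_max_normalize(x), min_max_normalize(y)
    # dp[i][j] = best match count using the first i points of xn and first j of yn.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best = max(dp[i - 1][j], dp[i][j - 1])
            close = abs(xn[i - 1] - yn[j - 1]) <= theta
            if close and in_window(i, j, m, n, delta):
                best = max(best, dp[i - 1][j - 1] + 1)
            dp[i][j] = best
    return 2 * dp[m][n] / (m + n)


if __name__ == "__main__":
    print(elcs_similarity([1, 2, 3, 4], [10, 20, 20, 30, 40], theta=1e-9))
```

For clarity the sketch fills the whole similarity matrix and only uses the window to reject matches. An implementation aiming for the time savings described in the text would iterate only over cells inside the window.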

EXPERIMENT

Similarity measurement is the foundation of other data mining processes, and its accuracy directly affects their results. Conversely, clustering results can be used to estimate the accuracy of different similarity measures.

Experimental environment and the data: The experimental environment is a 2.20 GHz E4500 CPU with 1024 MB of memory running Windows XP Professional. The experimental data sets are the Synthetic Control Chart Time Series (SCC) from the UCI KDD Archive and the CBF dataset. SCC contains 600 time series, each of length 60, divided into six categories. The CBF dataset contains the Cylinder (C), Bell (B) and Funnel (F) classes and is a typical synthetic data set.

Fig. 1: Successful classification rate

Experiment process: In cluster analysis, time series in the same group resemble each other, while time series in different groups are dissimilar. This study uses bottom-up hierarchical clustering. Let the initial data be $C_1, C_2, \ldots, C_n$; the algorithm steps are:

Step 1: Treat each time series as a class $C_i$
Step 2: Calculate the similarity between every pair of classes to obtain a similarity matrix
Step 3: Merge the two most similar classes, then return to Step 2 and repeat until the number of classes equals the predetermined number of clusters

The distance between clusters is computed with the ELCS similarity measure. Let the standard clustering be $C = C_1, C_2, \ldots, C_k$ and the clustering result of each measure be $C' = C'_1, C'_2, \ldots, C'_k$. The clustering accuracy is computed by formulas (4) and (5):

$$Sim(C_i, C'_j) = \frac{2|C_i \cap C'_j|}{|C_i| + |C'_j|} \quad (4)$$

$$Sim(C, C') = \frac{\sum_i \max_j Sim(C_i, C'_j)}{k} \quad (5)$$

$Sim(C', C)$ is calculated in the same way as $Sim(C, C')$. Because $Sim(C, C')$ is asymmetric, $\frac{Sim(C, C') + Sim(C', C)}{2}$ is used as the final evaluation criterion.

Fig. 2: Average internal class distance

EXPERIMENTAL RESULTS AND ANALYSIS

Parameter determination: The experiment on the SCC dataset analyses the influence of the parameter $\delta$ on the algorithm.
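The clustering accuracy reported in these experiments follows formulas (4) and (5). A minimal Python sketch of that evaluation is given here, assuming each cluster is represented as a set of item identifiers; the function names `pair_similarity`, `directed_similarity` and `clustering_accuracy` are hypothetical and chosen only for illustration.

```python
from typing import Hashable, Sequence, Set

Cluster = Set[Hashable]


def pair_similarity(a: Cluster, b: Cluster) -> float:
    """Formula (4): overlap of two clusters relative to their total size."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def directed_similarity(source: Sequence[Cluster], target: Sequence[Cluster]) -> float:
    """Formula (5): average, over the source clusters, of the best match in target."""
    if not source or not target:
        raise ValueError("both clusterings must contain at least one cluster")
    best_matches = (max(pair_similarity(c, d) for d in target) for c in source)
    return sum(best_matches) / len(source)


def clustering_accuracy(standard: Sequence[Cluster], result: Sequence[Cluster]) -> float:
    """Symmetric criterion: mean of the two directed similarities."""
    forward = directed_similarity(standard, result)
    backward = directed_similarity(result, standard)
    return (forward + backward) / 2


if __name__ == "__main__":
    standard = [{1, 2, 3}, {4, 5, 6}]
    result = [{1, 2}, {3, 4, 5, 6}]
    print(clustering_accuracy(standard, result))
```

Averaging the two directions removes the dependence on which clustering is treated as the reference, which is the reason the text gives for the final criterion.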
The ELCS measure contains the parameters $\delta$ and $\theta$, of which $\delta$ has a very significant effect on the performance of the algorithm. The clustering accuracy rate as $\delta$ changes is shown in Fig. 1, and the average internal class distance and average among class distance are shown in Fig. 2 and 3. As $\delta$ increases, the clustering accuracy rate changes from low to high. When $\delta = 2.2$, the clustering accuracy rate is highest, the average internal class distance is smallest and the average among class distance is largest. This result means that each matched pair in the ELCS measure satisfies the length constraint. When $\delta$ is too large, the positional correspondence of the test sequences is poorly constrained and meaningless similar sequence segments are obtained. When $\delta$ is too small, the search range of the similarity matrix is very limited and much of the data is discarded. As $\delta$ decreases, the classification accuracy drops sharply.

Fig. 3: Average among class distance

Three kinds of measure-based clustering comparison: To compare the time consumption of the DTW, LCSS and ELCS distances, each is used as the similarity metric of hierarchical clustering. Appropriate parameters are selected for the LCSS and ELCS algorithms so that their final classification accuracy is at its best. The results are shown in Fig. 4. The DTW algorithm consumes significantly more time than the others. Because the position condition of the ELCS measure is more complex than that of the LCSS measure, ELCS takes more time than LCSS.

Fig. 4: Time consumption comparison of the three distances

Figure 5 compares the clustering accuracy rates of the DTW, LCSS and ELCS measures. Every measure gives good results on the SCC dataset, because its data have obvious characteristics and little noise. The CBF dataset is randomly generated, and each of its time series contains many glitches that increase the difficulty of clustering. On both datasets, however, ELCS shows good results: its clustering accuracy is the highest.

Fig. 5: Clustering accuracy rate for the three distances

The differences between the datasets mentioned above can be seen easily in Fig. 6 and 7. For the CBF dataset, the average internal class distance of the clustering results is greater than for the SCC dataset, while the average among class distance is smaller. Because LCSS and ELCS are based on the LCS algorithm, they do not suffer from the DTW problem of one point corresponding to multiple points, and local noise can be ignored.

Fig. 6: Average internal class distance for the three distances

Fig. 7: Average among class distance for the three distances

CONCLUSION

Based on the LCS measure, and by introducing parameters that standardize the search path in the similarity matrix, this study improves the accuracy of the similarity measure and overcomes the inability of the traditional similarity measure based on Euclidean distance to deal with noise interference. In experiments on two different types of data sets, the ELCS measure achieves higher clustering correctness than the existing similarity measures, although its time cost is higher than that of LCSS. In short, the measure can be applied effectively to a variety of time series similarity measurement tasks.

ACKNOWLEDGMENT

This study was supported by the Doctoral Program Foundation of the Ministry of Education of China (20100095110003) and the Fundamental Research Funds for the Central Universities (2011QNB23).

REFERENCES

Agrawal, R., C. Faloutsos and A. Swami, 1993. Efficient similarity search in sequence databases. Proceedings of the 4th International Conference on Foundations of Data Organization and Algorithms. Springer, Berlin, pp: 69-84.
Berndt, D. and J. Clifford, 1994. Using Dynamic Time Warping to Find Patterns in Time Series. AAAI-94 Workshop on Knowledge Discovery in Databases, AAAI Press, Seattle, Washington.
Bollobas, B., G. Das, D. Gunopulos and H. Mannila, 1997. Time-series similarity problems and well-separated geometric sets. Proceedings of the 13th Annual Symposium on Computational Geometry. ACM Press, New York, pp: 454-456.
Chung, L., T.C. Fu and R. Luk, 2004. An evolutionary approach to pattern-based time series segmentation. IEEE T. Evolut. Comput., 8(5): 471-489.
Faloutsos, C., M. Ranganathan and Y. Manolopoulos, 1994. Fast subsequence matching in time-series databases. SIGMOD Rec., 23(2): 417-429.
Keogh, E.J. and M.J. Pazzani, 2001. Derivative Dynamic Time Warping.
Retrieved from: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.23.3383&rep=rep1&type=pdf.
Kim, S.W., S. Park and W. Chu, 2001. An index-based approach for similarity search supporting time warping in large sequence databases. Proceedings of the International Conference on Data Engineering. IEEE Computer Society, Heidelberg, pp: 207-614.
Latecki, L.J., V. Megalooikonomou, Q. Wang, R. Lakaemper, C.A. Ratanamahatana, et al., 2005. Partial elastic matching of time series. 5th IEEE International Conference on Data Mining. Nov. 27-30, Philadelphia.
Paterson, M. and V. Dancik, 1994. Longest common subsequences. Lect. Notes Comput. Sci., 841: 127-142.
Yi, B.K., H.V. Jagadish and C. Faloutsos, 1998. Efficient retrieval of similar time sequences under time warping. Proceedings of the International Conference on Data Engineering, IEEE Computer Society, Orlando, pp: 201-208.