A Multiresolution Symbolic Representation of Time Series


Vasileios Megalooikonomou (1), Qiang Wang (1), Guo Li (1), Christos Faloutsos (2)
(1) Department of Computer & Information Sciences, Temple University, Philadelphia, PA, USA
(2) Department of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
{vasilis,qwang,gli}@temple.edu, christos@cs.cmu.edu

Abstract

Efficiently and accurately searching for similarities among time series and discovering interesting patterns is an important and non-trivial problem. In this paper, we introduce a new representation of time series, the Multiresolution Vector Quantized (MVQ) approximation, along with a new distance function. The novelty of MVQ is that it keeps both local and global information about the original time series in a hierarchical mechanism, processing the original time series at multiple resolutions. Moreover, the proposed representation is symbolic, employing key subsequences, and potentially allows the application of text-based retrieval techniques to the similarity analysis of time series. The proposed method is fast and scales linearly with the size of the database and the dimensionality. Contrary to the vast majority of the literature, which uses the Euclidean distance, MVQ uses a multi-resolution/hierarchical distance function. We performed experiments with real and synthetic data. The proposed distance function consistently outperforms all the major competitors (Euclidean, Dynamic Time Warping, Piecewise Aggregate Approximation), achieving up to 20% better precision/recall and clustering accuracy on the tested datasets.

1. Introduction

The problem of efficient retrieval of similar time series has received a lot of attention due to its many applications in different domains. Briefly, this problem can be stated as follows: Given a query sequence q, a database S of N sequences, $S_1, S_2, \ldots, S_N$, a distance measure D and a tolerance threshold ε, find the set of sequences R in S that are within distance ε from q. More precisely, find:

$$R = \{ S_i \in S \mid D(q, S_i) \le \varepsilon \}.$$

To compare two given time series, a suitable measure of similarity should be given.
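The range query stated above can be sketched as a plain linear scan. This is only an illustration of the problem definition, not the method proposed in the paper; the function names `euclidean` and `range_query` and the toy database are assumptions made for the example.

```python
import math
from typing import Callable, List, Sequence

Series = Sequence[float]


def euclidean(a: Series, b: Series) -> float:
    """Euclidean (L2) distance between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def range_query(q: Series, database: Sequence[Series],
                dist: Callable[[Series, Series], float],
                eps: float) -> List[int]:
    """Return the indices i of all sequences S_i with dist(q, S_i) <= eps."""
    # Linear scan: one distance computation per stored sequence.
    return [i for i, s in enumerate(database) if dist(q, s) <= eps]


# Hypothetical toy database of three sequences of length 3.
database = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [0.1, 0.0, -0.1]]
matches = range_query([0.0, 0.0, 0.0], database, euclidean, eps=0.5)
print(matches)  # [0, 2]
```

Any distance measure D can be passed as `dist`; the Euclidean distance is used here only because it is the baseline the paper compares against.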
Naive approaches for comparing time sequences generally take polynomial time in the length of the sequences, typically linear or quadratic time. These approaches are not useful for large time series databases. Promising techniques include those that are based on the reduction of dimensionality of the original sequences. In this case, the sequences can be represented as multidimensional vectors and similar sequences can be retrieved in sublinear time. There may be several different criteria to evaluate a method, but generally speaking, a good one should be fast, scalable, and accurate (according to some ground truth).

In this paper, we introduce a new method that satisfies these requirements. Our method is called Multiresolution Vector Quantized (MVQ) approximation and has the following characteristics: 1) It uses time-tested vector quantization methods to discover a vocabulary of subsequences; 2) It takes multiple resolutions into account, which brings improved accuracy; 3) It provides a new distance function utilizing text-based techniques from Information Retrieval to weigh down uninteresting matches, thus improving the accuracy.

As Agrawal et al. [2] proposed, compared to the Euclidean distance, a more intuitive idea is that two series should be considered similar if they have enough non-overlapping time-ordered pairs of subsequences that are similar. In this paper, instead of calculating the Euclidean distance, we first extract key subsequences utilizing the Vector Quantization (VQ) [12] technique and encode each time series based on the frequency of appearance of each key subsequence. We then calculate similarities between different time series in terms of key subsequence matches. This method can be very meaningful in many domains. For example, when comparing two stocks during a long period, we may want to find out during how many months the stocks have similar movements, though the same trend may appear in different months for different stocks. This application is similar to mining motifs in massive time series databases [22].
While the histogram metric can record the local information very well, it may lose much global information of the time series, since it does not keep track of the order of appearance of different key subsequences. To deal with this problem, we propose to apply a hierarchical mechanism: the original time series are processed at several different resolutions, and similarity analysis is performed using a weighted distance function combining all the resolution levels. For example, when considering a time series representing a stock price movement, we know that subsequences of different length have different real meanings. If the length is 5, the subsequence stands for a weekly trend of the stock. Similarly, for length 20 we have the monthly trend.

As we demonstrate in the experiments, MVQ outperforms previous state-of-the-art methods in clustering and similarity searches. Intuitively, the excellent performance of the proposed method can be justified by the following facts: 1) it exploits prior knowledge about the data using a learning approach; 2) it takes multiple resolutions into account; and 3) unlike wavelets (which also take multiple resolutions into account), it partially ignores the ordering of the codewords within the time sequence due to the histogram model that is used to calculate similarity. Moreover, the proposed representation is symbolic, employing key subsequences, and allows the application of text-based retrieval techniques to the similarity analysis of time series.

2. Background

2.1 Related Work

Many approaches and techniques have been proposed in the past decade [1, 2, 4, 9, 10, 13, 14, 16, 18, 19, 21, 27, 31, 32] that address the problem of similarity in time series. To deal with dimensionality reduction, the solution of extracting a signature from each sequence and indexing the signature space was originally proposed by Faloutsos et al. [9, 10]. To guarantee completeness (i.e., no false dismissals), the admissibility criterion that the distance function used in the signature space must underestimate the true distance measure (bounding lemma) was also proposed [10]. Obeying the admissibility criterion, many methods have been suggested and proved useful in different fields, such as the F-index introduced by Agrawal et al. [1] or the ST-index proposed by Faloutsos et al. [10].

Other approaches for efficient similarity searches on time sequences are based on piecewise constant approximation (PCA) or piecewise aggregate approximation (PAA). Yi and Faloutsos [32] and Keogh et al.
[19, 21] proposed to divide each sequence into k segments of equal length and to use the average value of each segment as a coordinate of a k-dimensional feature vector. The advantages of this transform are that it is very fast and easy to implement, the signature can be used with arbitrary $L_p$ norms, and the index can be built in linear time. In addition, the representation can be used with a weighted Euclidean distance where each segment of the sequence has a different weight. Keogh et al. [18] have also proposed an Adaptive Piecewise Constant Approximation (APCA) where the segments can be of variable length, offering a more effective compression than PCA. In [26] the authors propose a piecewise vector quantized approximation (PVQA) of time series. In [7] a technique was proposed for compressing multiple streams of data in sensor networks; it employs an approximate representation using a base signal extracted from historic information. The algorithm constructs a dictionary of candidate base signals in the process of building a base signal. The use of multi-scale histograms and a weighted Euclidean distance for measuring the similarity of time series at several precision levels has been investigated in [6]. In addition, general dimensionality reduction techniques such as Singular Value Decomposition (SVD) have been used on time series data [19].

For these methods, in which the distance metric lower bounds the Euclidean distance, one of the most significant characteristics is the avoidance of false dismissals, though there may be a lot of false alarms. However, in some cases, the existence of too many false alarms may decrease the efficiency of retrieval. At the same time, as many researchers have mentioned in their work [15, 29], the Euclidean distance is not always the optimal distance measure. For example, in some time series, different parts have different levels of significance in their meaning. Also, the Euclidean distance does not allow shifting along the time axis, which is not unusual in real life applications. In order to extract high-level features out of time series, Koudas et al.
[28] formalized problems of identifying various representative trends in time series data. Since the Euclidean is not the best distance one can use (as shown later in our paper and in papers we referenced earlier), we propose a new distance function here. We do not deal with the problem of lower bounding the Euclidean on the original vectors since this is not so meaningful anymore.

2.2 Preliminaries

To make the presentation of the proposed work clear, we now give descriptions of various concepts and definitions used in the paper. We start with the definition of a time sequence and its subsequences.

Definition 1. Time Sequence: A sequence (ordered collection) of real values, $X = x_1, x_2, \ldots, x_n$, where n can be very large.

Definition 2. Subsequence: Given a time sequence $X = x_1, x_2, \ldots, x_n$ of length n, a subsequence S of X is a sequence of length m consisting of contiguous positions from X, i.e., $S = x_k, x_{k+1}, \ldots, x_{k+m-1}$; $1 \le k \le n-m+1$.

In similarity analysis, we need to define a metric for the similarity, that is, a measure of the distance between two time series. Given two time series, $X = x_1, x_2, \ldots, x_n$ and $Y = y_1, y_2, \ldots, y_n$, their distance, D, is defined, in general, as an $L_p$ norm, where for p = 2 the distance is the Euclidean, the most popular among the metrics. An intuitive notion of exact and approximate similarity was also formalized by Goldin and Kanellakis [8].

Obviously, the simplest way of calculating the similarity (or distance) among time series is to compute the Euclidean distance directly, i.e., on the original series. For a small dataset this may be feasible; however, for large datasets efficiency is a problem, since the time complexity is O(N*n), where n is the number of features that need to be represented for each time series and N is the number of time series in the dataset. In order to compute efficiently without significantly affecting the accuracy, many techniques of dimensionality reduction (as introduced in Section 2.1) have been suggested.

In addition to the computational complexity associated with the Euclidean distance calculation on the original time series, we cannot always be sure that the nearest neighbors in Euclidean space are indeed the most similar ones. This is because the point-based information model (computing similarity based on every point) contains only low-level features of the time series and is vulnerable to different kinds of shape transformations, such as shifting and scaling. Under such circumstances, it would be better if we could find some high-level features and apply a more robust information retrieval model for time series analysis.

Based on this idea, we introduce a new framework that uses key subsequences to represent time series and facilitate similarity retrieval. This framework consists of the following main components: 1) Codebook generation from a set of training samples; 2) Time series encoding using the codebook; 3) Time series feature representation and retrieval. This framework is similar to the key block framework suggested by Zhang et al. [33] for content-based image retrieval. In the time sequences domain the idea was introduced in [26]. However, in order to keep both local and global information and improve the accuracy, we introduce the use of multiple codebooks with different resolutions.
For each resolution, Vector Quantization [12, 24] is applied to discover the vocabulary of subsequences in a time series database. In VQ a codeword (or codevector) is used to represent a number of similar vectors. More precisely, a vector quantizer Q of dimension n and size s is a mapping $Q: \mathbb{R}^n \to C$ from a vector or a point in n-dimensional Euclidean space, $\mathbb{R}^n$, to a finite set $C = \{c_1, c_2, \ldots, c_s\}$, the codebook, containing s output or reproduction points $c_i \in \mathbb{R}^n$, called codewords. Associated with every s-point VQ is a partition of $\mathbb{R}^n$ into s regions or cells $R_i$ for $i \in J = \{1, 2, \ldots, s\}$, where $R_i = \{x \in \mathbb{R}^n : Q(x) = c_i\}$. For a given distortion function (see footnote 1) $d(x, c_i)$, such as the mean squared error (MSE), between an input vector x and a codeword $c_i$, an optimal mapping should satisfy two conditions:

(a) Nearest Neighbor Condition (NNC): For a given codebook, the optimal partition $\mathcal{R} = \{R_i : i = 1, 2, \ldots, s\}$ satisfies $R_i = \{x : d(x, c_i) \le d(x, c_j),\ \forall j\}$, where $c_i$ is the codeword representing partition $R_i$. Given a point x in the dataset, the encoding function for x is $\mathrm{Encoding}(x) = c_i$ only if $d(x, c_i) \le d(x, c_j)\ \forall j$.

(b) Centroid Condition (CC): For a given partition region $\{R_i : i = 1, \ldots, s\}$ the optimal reconstruction vector (codeword) satisfies $c_i = \mathrm{centroid}(R_i)$, where the centroid of a set $R = \{x_i : i = 1, \ldots, |R|\}$ is defined as:

$$\mathrm{centroid}(R) = \frac{1}{|R|} \sum_{i=1}^{|R|} x_i$$

Figure 1. The Generalized Lloyd Algorithm (GLA).

The Generalized Lloyd Algorithm (GLA) [24, 25] is an iterative procedure that produces a locally optimal codebook from a training set based on these two conditions (which form the Lloyd iteration). This is done during a training phase. The main structure of GLA is given in the flowchart (see Figure 1). Starting with an initial codebook, the GLA algorithm repeats the Lloyd iteration until the fractional drop of the distortion becomes less than a given threshold. This process is guaranteed to converge since, from the necessary conditions for optimality, each application of the Lloyd iteration must reduce or leave unchanged the average distortion [12].
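A minimal sketch of the Lloyd iteration described above, under several assumptions: the distortion is the MSE, the initial codebook is supplied by the caller, an empty cell keeps its previous codeword, and the default stopping threshold of 0.001 is a placeholder rather than a value taken from the paper. The names `gla`, `nearest` and `centroid` are illustrative.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def mse(x: Vector, c: Vector) -> float:
    """Mean squared error between a vector and a codeword."""
    return sum((a - b) ** 2 for a, b in zip(x, c)) / len(x)


def nearest(x: Vector, codebook: Sequence[Vector]) -> int:
    """Nearest Neighbor Condition: index of the closest codeword."""
    return min(range(len(codebook)), key=lambda i: mse(x, codebook[i]))


def centroid(vectors: Sequence[Vector]) -> List[float]:
    """Centroid Condition: component-wise mean of a non-empty cell."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def gla(training: Sequence[Vector], codebook: Sequence[Vector],
        threshold: float = 1e-3, max_iter: int = 100
        ) -> Tuple[List[List[float]], float]:
    """Repeat the Lloyd iteration until the fractional drop in average
    distortion falls below `threshold` (assumed default, see lead-in)."""
    book = [list(c) for c in codebook]
    prev = math.inf
    distortion = math.inf
    for _ in range(max_iter):
        # Partition the training set into cells (NNC).
        cells: List[List[Vector]] = [[] for _ in book]
        for x in training:
            cells[nearest(x, book)].append(x)
        # Replace each codeword by the centroid of its cell (CC);
        # an empty cell keeps its old codeword (assumption).
        book = [centroid(cell) if cell else c for cell, c in zip(cells, book)]
        # Average distortion of the training set under the new codebook.
        distortion = sum(mse(x, book[nearest(x, book)])
                         for x in training) / len(training)
        if distortion == 0.0:
            break
        if prev != math.inf and (prev - distortion) / prev < threshold:
            break
        prev = distortion
    return book, distortion


training = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
book, distortion = gla(training, [[0.0, 0.0], [10.0, 10.0]])
```

On this toy training set the two codewords settle at the means of the two obvious groups after the first iteration, and the second iteration stops the loop because the distortion no longer drops.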
To quantitatively measure the similarity between different time series encoded with a VQ codebook, we employ the Histogram Model (HM) that has been successfully applied in image retrieval [33]. We present this model in the context of time series analysis:

$$S_{HM}(q, t) = \frac{1}{1 + \mathit{dis}(q, t)} \qquad (1)$$

where

$$\mathit{dis}(q, t) = \sum_{i=1}^{s} \frac{|f_{i,t} - f_{i,q}|}{1 + f_{i,t} + f_{i,q}}.$$

Footnote 1: The distortion is a measure of overall quality degradation due to approximation of a vector by its closest representative from a codebook.
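As a hedged illustration of the Histogram Model of Eq. (1), the sketch below assumes that each time series is already given as a vector of codeword frequencies and that the dissimilarity sums absolute frequency differences, each divided by one plus the two frequencies. The function names are illustrative only.

```python
from typing import Sequence


def hm_dissimilarity(f_q: Sequence[float], f_t: Sequence[float]) -> float:
    """dis(q, t): sum over codewords of |f_t - f_q| / (1 + f_t + f_q)."""
    if len(f_q) != len(f_t):
        raise ValueError("frequency vectors must share one codebook size")
    return sum(abs(a - b) / (1 + a + b) for a, b in zip(f_q, f_t))


def hm_similarity(f_q: Sequence[float], f_t: Sequence[float]) -> float:
    """S_HM(q, t) = 1 / (1 + dis(q, t)); equals 1 for identical histograms."""
    return 1.0 / (1.0 + hm_dissimilarity(f_q, f_t))


same = hm_similarity([2, 1], [2, 1])       # identical histograms
different = hm_similarity([2, 1], [0, 3])  # dis = 2/3 + 2/5 = 16/15
```

The similarity lies in (0, 1] and decreases as the two frequency vectors diverge; the denominator damps differences on codewords that are frequent in both series.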

In Eq. (1), $f_{i,t}$ and $f_{i,q}$ refer to the appearance frequency of codeword $c_i$ in time series t and q, respectively. Although this model focuses on the appearance of individual key subsequences in time series, correlation between key subsequences can also be addressed [33]. Information about some alternative models can be found in Appendix A.

3. Proposed Method: MVQ

We propose a new method to represent time series data, the Multiresolution Vector Quantized (MVQ) approximation, along with a new distance function. The method partitions each time series into equi-length segments and represents each segment with the most similar key subsequence from a codebook. The codebook is generated earlier during a training phase using VQ. By counting the appearance frequency of each codeword in each time series a new representation is obtained. The piecewise approximation with VQ encoding is applied at several resolutions. Table 1 gives a brief description of the notation we use in the rest of the paper. In the following subsections, we introduce the components of our method.

Table 1. Symbol Table

  Symbol  Description
  X       Original time series, $X = x_1, x_2, \ldots, x_n$, of length n
  X'      Encoded form of the original time series, $X' = f_1, f_2, \ldots, f_s$
  N       Number of time series in the dataset
  n       Length of original time series
  C       Codebook: a set of codewords $\{c_1, \ldots, c_k, \ldots, c_s\}$
  c       Number of resolution levels
  s       Size of codebook
  l       Length of codeword

3.1 Codebook Generation

For a given dataset, a codebook with s codewords $C = \{c_1, c_2, \ldots, c_s\}$ is first generated using a clustering algorithm (such as the GLA introduced in Section 2). We apply this algorithm to generate the codebook based on the dataset T of time series. The dataset is preprocessed before the generation of the codebook: each time series in T is partitioned into a number of segments, each of length l, and each segment forms a sample of the training set that is used to generate the codebook. Each codeword in the codebook corresponds to a key subsequence; it is an approximation for a certain group of subsequences of length l.
All the time series in the database are then encoded using the codebook (see Section 3.2). The version of GLA we use requires a partition split mechanism to solve the initial codebook generation problem. The algorithm starts with a codebook containing only one codeword, the centroid of the whole training set. In each repetition, and before the application of the Lloyd iteration, it doubles the number of codewords (and cells) from the previous iteration by splitting the most populous cells. Table 2 shows some of the codewords (at two different levels) used by MVQ to represent the Control Chart dataset (SYNDATA) [30].

Table 2. Codewords of a 2-level codebook that are used to represent SYNDATA in MVQ approximation.

3.2 Time Series Encoding

After a codebook is generated, we can form a new representation for each time series in the dataset. In the process of encoding, every series is decomposed into segments (i.e., subsequences) of length l (which is equal to the length of each codeword). For each segment, the closest (based on a distance metric) codeword in the codebook is then found and the corresponding index is used to represent this segment. After finding the corresponding codeword index for each segment, the appearance frequency of each codeword is counted. The new representation of a time series is a vector $X' = f_1, f_2, \ldots, f_s$ showing the appearance frequency of every codeword. By applying this new encoding form, we can easily deal with time series with an arbitrarily large number of points, since we can always reduce their dimensionality to a rather small number given by the size, s, of the codebook.

3.3 Time Series Summarization

Besides achieving dimensionality reduction, this encoding process also provides a very nice summarization of the time series, which is useful in many applications. Table 2 shows different codewords we obtain using this method; these codewords stand for the most representative subsequences (of a given length) for the entire time series dataset. Instead of the whole time series, we may be more interested in the usage of representative key subsequences.
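The encoding step of Section 3.2 and the frequency summary discussed here can be sketched as follows. The sketch assumes a squared-error distance, non-overlapping segments, and that a trailing partial segment is dropped; none of these details is fixed by the text, and the two-codeword codebook is a made-up example.

```python
from typing import List, Sequence


def squared_error(x: Sequence[float], c: Sequence[float]) -> float:
    """Squared Euclidean distance between a segment and a codeword."""
    return sum((a - b) ** 2 for a, b in zip(x, c))


def encode(series: Sequence[float],
           codebook: Sequence[Sequence[float]]) -> List[int]:
    """Return the frequency vector f_1..f_s of a series under a codebook."""
    l = len(codebook[0])                # codeword length = segment length
    freq = [0] * len(codebook)
    # Non-overlapping segments; a trailing partial segment is ignored.
    for start in range(0, len(series) - l + 1, l):
        segment = series[start:start + l]
        index = min(range(len(codebook)),
                    key=lambda i: squared_error(segment, codebook[i]))
        freq[index] += 1
    return freq


# Hypothetical codebook: a "low" and a "high" key subsequence of length 2.
codebook = [[0.0, 0.0], [1.0, 1.0]]
rising = encode([0.0, 0.1, 0.1, 0.0, 0.9, 1.1], codebook)
falling = encode([1.0, 0.9, 0.0, 0.1, 0.1, -0.1], codebook)
most_used = max(range(len(rising)), key=lambda i: rising[i])
```

Both toy series map to the same frequency vector although their segments appear in a different order, which is exactly the loss of ordering information that a single-resolution histogram cannot avoid.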
This focus on key subsequences is very useful in the discovery of motifs or approximately repeated subsequences in time series [22]. In this case, we can just check the appearance frequencies of these codewords and get an overview of the time series. For example, in Figure 2, we show a time series representation using a number of codewords. Two of these codewords are used twice, revealing a pattern that would remain undetected using previous techniques. Results on time series summarization are presented in Section [...].

Figure 2. A time series (bottom) is represented as a sequence of representative subsequences, i.e., codewords (top). Two codewords (#3 and #5) are used twice in this representation.

3.4 Distance measure and a multiresolution representation

Based on the frequency of appearance of key sequences within time series, features of time series are extracted, forming a new representation of rather small dimensionality on which similarity retrieval can be efficiently performed. We still need a distance measure appropriate for this new representation. We choose the Histogram Model as the distance measure, and all the experimental results presented in Section 4 are based on it.

By applying the histogram model, it is not difficult to identify the time series that are similar to a given query (i.e., that have similar frequent patterns). However, using only one codebook (analysis at a single resolution) introduces some problems that cannot be ignored. First, although the local information of a time series is kept after the encoding process, the new representation of a time series does not record the order among the indices of different codewords. Some important global information of the time series is lost in this representation. In Figure 3, we see two different time series whose encoded representations are the same (2, 1). This problem in the key subsequence representation correspondingly increases the number of false alarms, reducing the performance of the single resolution (i.e., single codebook) method. On the other hand, in real applications, it is not always easy to find a suitable resolution (correspondingly, a suitable codeword length). Moreover, an inappropriate codeword length may reduce the efficiency. In order to solve these potential problems caused by the use of a single resolution, we introduce a hierarchical mechanism, which involves several different resolutions for encoding.
While the encoding form of higher resolution pays more attention to the detail of local information, that of lower resolution represents more global information. The piecewise approximation with VQ encoding is applied at several resolutions. For each resolution this is done by grouping a different number of consecutive segments together, i.e., the length of the segment at a given resolution is a multiple (usually double) of the length of the segment at the next higher resolution. Thus, we call this representation the Multiresolution Vector Quantized (MVQ) approximation. Figure 4 shows a time series and its reconstruction using different resolutions. (For the different resolution levels, the sizes of the codebooks are the same, 32, and the lengths of the codewords are 128, 64, 32, and 16, respectively.)

Figure 3. Necessity of multiresolution representation: different series with the same encoded representation.

By assigning reasonable weights to different resolutions, we define a new weighted similarity measure, the Hierarchical Histogram Model:

$$S_{HHM}(q, d_j) = \sum_{i=1}^{c} w_i \cdot S_{HM}^{(i)}(q, d_j) \qquad (2)$$

where c is the number of resolution levels, $w_i$ is the weight of level i, and $S_{HM}^{(i)}$ is the Histogram Model similarity computed at level i.

Figure 4. Reconstruction of time series using different resolutions.

3.5 Parameters of MVQ

Here we discuss in more detail the parameters of MVQ and how to choose their values. For the number c of resolution levels an intuitive choice is $c = \log n$, with the length of a codeword at the i-th level being $2^{i-1}$ ($1 \le i \le \log n$). However, when the codeword is too short (e.g., of length 1 or 2), this becomes meaningless. Thus, we need to set a minimum value of codeword length, $l_{min}$, and set the number of hierarchical levels as $c = \log(n / l_{min}) + 1$. The codeword length (l) for each level is chosen as follows: at the first level, each time series is treated as a whole (l = n); at the second level, each time series is partitioned into two parts (l = n/2); and at the i-th level, $l = n / 2^{i-1}$. In cases where n is not a power of two we satisfy this constraint approximately.
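The level structure and the weighted similarity of Eq. (2) can be sketched as below. The sketch assumes a base-2 logarithm rounded down (halving by integer division), equal weights that sum to one, and per-level codeword frequency vectors as input; these choices, like the function names, are assumptions for illustration.

```python
from typing import List, Sequence


def codeword_lengths(n: int, l_min: int) -> List[int]:
    """Codeword length per level: n, n/2, n/4, ... while length >= l_min.
    The number of levels equals floor(log2(n / l_min)) + 1."""
    if not 1 <= l_min <= n:
        raise ValueError("need 1 <= l_min <= n")
    lengths = [n]
    while lengths[-1] // 2 >= l_min:
        lengths.append(lengths[-1] // 2)
    return lengths


def hm_similarity(f_q: Sequence[float], f_t: Sequence[float]) -> float:
    """Histogram Model similarity at one resolution level (Eq. 1)."""
    dis = sum(abs(a - b) / (1 + a + b) for a, b in zip(f_q, f_t))
    return 1.0 / (1.0 + dis)


def hhm_similarity(q_levels: Sequence[Sequence[float]],
                   t_levels: Sequence[Sequence[float]],
                   weights: Sequence[float]) -> float:
    """Weighted sum of per-level Histogram Model similarities (Eq. 2)."""
    if not len(q_levels) == len(t_levels) == len(weights):
        raise ValueError("one frequency vector and one weight per level")
    return sum(w * hm_similarity(fq, ft)
               for w, fq, ft in zip(weights, q_levels, t_levels))


levels = codeword_lengths(128, 16)          # [128, 64, 32, 16]
equal_weights = [0.5, 0.5]                  # two-level toy example
score = hhm_similarity([[1], [1, 1]], [[1], [2, 0]], equal_weights)
```

In the toy example the two series agree at the coarse level (similarity 1) but differ at the finer level (similarity 4/7), so the combined score is 11/14.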
The size of the codebook at each resolution level is data dependent: the more subsequences used during the training process and the higher their variability, the larger the codebook needed. In fact, the higher the number of partitions and the number of codewords, the better the approximation, but also the more computation and space are needed. So, there is a tradeoff between efficiency and accuracy of approximation. In practice (as also shown in our experiments in Section 4), use of a rather small codebook can achieve very good results. In addition to the number of codewords, the Lloyd algorithm uses a threshold to stop the iterations when the fractional drop of the distortion between consecutive iterations reaches a certain point. A common value for this threshold is [...].

Our experiments show that a multiresolution representation achieves much higher accuracy than a single resolution one. The price for this improvement is slightly more computation, since we have to calculate the similarity at each resolution level before we can finally compute $S_{HHM}$. In our experiments we studied the behavior of the multiresolution approach with different weights assigned to each resolution level. Lacking any information or prior knowledge about the domain (i.e., the most realistic case), the straightforward solution is to use equal weights for all resolutions. This choice provides the best results in almost all of the experiments we performed. The proposed method also provides the ability to include prior domain knowledge in the selection of the weights.

4. Experiments

In time series similarity analysis, best matches retrieval and clustering are two of the most common and important applications. We performed experiments to evaluate the effectiveness and efficiency of our method in these two applications. We address the following issues: (a) how accurate the method is, (b) how it compares to alternatives, and (c) how fast and scalable it is. We start with a description of the datasets we used in our experiments.

4.1 Datasets

In the experiments presented in this section, one synthetic and two real datasets are involved. We used the Control Chart synthetic dataset (SYNDATA), which is downloadable from the UCI KDD archive [30].
This dataset contains 600 examples of control charts (each has 60 points) synthetically generated by the process in Alcock and Manolopoulos [3]. The time series belong to six different classes of control charts: Normal, Cyclic, Increasing trend, Decreasing trend, Upward shift, and Downward shift, with each class having 100 time series.

The first real dataset, CAMMOUSE, is a spatiotemporal dataset of 5 words obtained using the Camera Mouse Program [5]. The 2D time series obtained represent the X and Y position of a human tracking feature (e.g., tip of finger). In conjunction with a spelling program the user can write various words, and the transitions of the tracking feature or the word image's profiles are recorded. We used 3 recordings of 5 words. The 5 words were: Athens, Berlin, Boston, London, and Paris. For simplicity, only the x-values are considered. The average length of sequences in this dataset is 1100 points. The shortest one is 834 points and the longest one is 1719 points. Since the length of sequences varies for different instances, we stretched all sequences to the same length of 1600 points.

The second real dataset, RTT, consists of RTT (packet round trip time) measurements from UCR to CMU with a sending rate of 50 msec for a day (Feb 10, 2002, starting at 8:20pm). The total number of RTT values is 1,728,000. The dataset was partitioned into 24 time series of length 72,000, each standing for an hour of RTT measurements. These measurements vary between 70 and 150. For clustering experiments we separated the time series into the following three classes based on the ratio of time where the RTT value is greater than 100: (a) heavy traffic hours: ratio > 0.5 (6 series), (b) medium traffic hours: 0.5 > ratio > 0.1 (7 series), and (c) light traffic hours: ratio < 0.1 (11 series).

In order to avoid the effects of scaling and shifting in the analysis, before we actually perform any experiment, we preprocess the datasets with zero-mean normalization.
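A minimal sketch of this preprocessing step, assuming the population standard deviation (the text does not say whether the population or sample estimate is used) and a hypothetical `z_normalize` helper; a constant series, whose standard deviation is zero, is rejected rather than guessed at.

```python
import math
from typing import List, Sequence


def z_normalize(x: Sequence[float]) -> List[float]:
    """Subtract the mean and divide by the standard deviation."""
    n = len(x)
    mean = sum(x) / n
    # Population standard deviation (assumption, see lead-in).
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / n)
    if std == 0.0:
        raise ValueError("constant series cannot be normalized")
    return [(v - mean) / std for v in x]


z = z_normalize([1.0, 2.0, 3.0])
```

After normalization every series has zero mean and unit variance, so two series that differ only by an offset or a positive scale factor become identical.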
Formally, each time series X is replaced by $(X - \bar{X}) / \sigma(X)$, where $\bar{X}$ is the mean value of X and $\sigma(X)$ is its standard deviation. For the RTT dataset we take logarithms before we apply the normalization.

4.2 Best Match Searching

Experiment design

The best match searching is defined as follows: given a query sequence, find the best k matches in the database (i.e., those having the lowest dissimilarity with the query) or find all the time series whose dissimilarity with the query is below some predefined threshold. In order to evaluate the performance of different approaches in best match searching, we need an evaluation metric.

Definition 3. For a given query, the set of time series which are actually within the same class as the query (given our prior knowledge) is taken as the standard set (std_set(q)), and the results found by different approaches (knn(q)) are compared with this set. The matching accuracy is defined as:

$$\mathrm{Accuracy} = \frac{|\mathrm{knn}(q) \cap \mathrm{std\_set}(q)|}{k} \times 100\% \qquad (3)$$

In the definition above, knn(q) is the set of k nearest neighbors for the query found by a certain method. In our experiments, every time series in the dataset is treated as a query, and the best k matches (k nearest neighbors) are sought within the whole dataset. The average accuracy of a certain method is then calculated based on the matching results taking each time series as a query. The actual value of k we use depends on the number of time series within the same class. In our experiments, the value of k can vary, but for the purpose of demonstration, we just show the results when k is set to the number of time series within the same class.

Experiments on SYNDATA

In this section, we show the results of the experiments performed on the SYNDATA dataset. The experimental parameters for different resolution levels are given in Table 3. With the increase of resolution, the codeword length decreases and the size of the codebook increases (since there are more training samples available for that resolution).

Table 3. Experiment parameters for SYNDATA (MVQ parameters; columns: Level, l, s)

The experimental results on SYNDATA are shown in Table 4. The first element in the weight vector represents the weight assigned to the first level, the second element the weight assigned to the second level, and so on (e.g., with a weight vector [ ], only the first level is involved in distance calculations). Accuracy is defined based on Eq. (3). The experimental results clearly demonstrate the effect of using a multiresolution approach: the combination of multiple resolutions dramatically improves the matching accuracy over the single resolution approach.

Table 4. Matching accuracy on SYNDATA

  Method           Weight Vector  Accuracy
  Single level VQ  [ ]            0.55
  Single level VQ  [ ]            0.70
  Single level VQ  [ ]            0.65
  Single level VQ  [ ]            0.48
  Single level VQ  [ ]            0.46
  MVQ              [ ]            0.83
  Euclidean        -              0.51

To show the effectiveness of the proposed representation and distance metric, we applied the plain Euclidean distance (naive method) on the same dataset, which directly computes the Euclidean distance to measure the similarity between time series. From Table 4 we can conclude that for this dataset, the naive method does worse than most of the single level VQ approximations, while MVQ provides a much better matching accuracy.

Table 5. Experiment parameters for CAMMOUSE data (MVQ parameters; columns: Level, l, s)

Experiments on CAMMOUSE

We performed experiments similar to those on the SYNDATA dataset.
The experiment parameters and results are shown in Tables 5 and 6, respectively.

Table 6. Matching accuracy on CAMMOUSE data

  Method           Weight Vector  Accuracy
  Single level VQ  [ ]            0.56
  Single level VQ  [ ]            0.60
  Single level VQ  [ ]            0.44
  Single level VQ  [ ]            0.56
  Single level VQ  [ ]            0.60
  MVQ              [ ]            0.83
  Euclidean        -              0.58

From Table 6, it is clear that for the CAMMOUSE dataset, the hierarchical mechanism also helps to improve the accuracy obtained with a single resolution level. Compared with the average matching accuracy of the plain Euclidean method, the retrieval accuracy of MVQ is much better (25 percentage points higher, 0.83 versus 0.58).

Comparison with other methods

In order to compare the efficiency and accuracy of MVQ in similarity searches we considered alternative methods including the Discrete Fourier Transform (DFT), plain Euclidean, Dynamic Time Warping (DTW) and Symbolic Aggregate approXimation (SAX) [23]. For evaluation and comparison, every time series in the dataset is taken as a query, and the precision and recall pairs corresponding to the top 1, 2, 3, ..., k retrieved time series are calculated. Then the average value of precision and recall is computed for the whole dataset. The actual value for k is different for different methods. For DFT, SAX and MVQ, some parameters need to be set up for the experiments. For DFT, we take the first 16 non-zero coefficients; for SAX the number of segments is set to 15 (SYNDATA) or 16 (CAMMOUSE) and the codebook size is set to 16. For MVQ we take the same codebook sizes as in the previous subsection for the 5 resolution levels and use [ ] as the weight vector.

Figure 5 shows the precision-recall performance on SYNDATA and CAMMOUSE. Notice that for a fixed recall ratio, the fewer time series are retrieved the better, and consequently the higher the precision is. For both datasets the precision decreases quickly with plain Euclidean, DFT, SAX and DTW, while the precision with MVQ stays at a high level. MVQ achieves the best performance on these datasets. When the time series are short (as in the case of SYNDATA), MVQ's need for more space due to the multiple codebooks is noticeable. However, MVQ is the best distance function and provides the best accuracy. An interesting observation is that in most cases, even with only one layer, our distance measure can provide comparable or even better results than the other methods. Later in the experiments, we restrict the space requirements of MVQ so that they are comparable to those of the other methods.

Figure 5. Precision-recall for different methods (a) on SYNDATA (b) on CAMMOUSE.

Besides accuracy, other considerations for a good method should include speed and scalability. Figure 6 shows the processing time of different methods on datasets of various sizes. The experimental settings for the different methods are the same as before. DFT shows the best processing efficiency with the shortest time, but considering the poor accuracy shown in Figure 5, it should not be taken as a good candidate. In comparison to the other methods we considered here, although the encoding of the query consumes some time, MVQ outperforms them all in speed when the database size is not too small. Notice that the time reported here for MVQ does not include the preprocessing needed during the training phase to obtain the codebook(s) for a dataset. A brief discussion about the preprocessing cost can be found in Appendix B.

Figure 6. Processing time and scalability.

4.3 Clustering experiments

Table 7. Clustering accuracy of MVQ on SYNDATA

  Method           Weight Vector  Accuracy
  Single level VQ  [ ]            0.69
  Single level VQ  [ ]            0.71
  Single level VQ  [ ]            0.63
  Single level VQ  [ ]            0.51
  Single level VQ  [ ]            0.49
  MVQ              [ ]            0.82
  DFT              -              0.67
  SAX              -              0.65
  DTW              -              0.80
  Euclidean        -              0.55

Experiment design. For time series clustering, we conducted experiments on both synthetic and real datasets. The PAM (Partitioning Around Medoids) clustering algorithm was used to cluster the original time series in every dataset.
However, the different approaches applied for distance calculation resulted in different distance matrices for the time series, and subsequently in different clustering results. In order to evaluate the clustering accuracy and quality of our approach, a cluster similarity metric was used. Given two clusterings, G = G_1, G_2, ..., G_k (the true clusters) and A = A_1, A_2, ..., A_k (the clustering result of a certain method), the clustering accuracy is evaluated with the cluster similarity defined as:

$$\mathrm{Sim}(G, A) = \frac{\sum_{i} \max_{j} \mathrm{Sim}(G_i, A_j)}{k} \qquad (4)$$

where

$$\mathrm{Sim}(G_i, A_j) = \frac{2\,|G_i \cap A_j|}{|G_i| + |A_j|}.$$

This metric was introduced in [11] to evaluate clustering results and was also used in [17]. The metric value ranges between 0 and 1, and it takes its maximal value, i.e., 1, when the clustering result is perfect. For each dataset, we used the same experiment parameters as in Section 4.1. Considering the stochastic nature of the PAM algorithm, given a set of parameters, each experiment was repeated 10 times, and the average result is reported here. For the purpose of comparison, clustering results with other methods are also provided.

Experiments on SYNDATA dataset. Taking the same parameters as shown in Table 3, clustering experiments were performed on the SYNDATA dataset. The experimental results are listed in Table 7. The clustering performance of the other methods is also reported. It is clear that for this dataset we cannot achieve satisfying performance using the Euclidean distance as the distance metric (0.55), while the suggested method is very promising (0.82). The performance achieved by several single resolution levels of the VQ approximation is better than that of the naïve method (Euclidean on the original time series) and comparable to or better than that of the other methods. By combining different resolution levels, the clustering result is further improved.

For completeness we compared a multiresolution implementation of SAX to MVQ. We used 5 resolution levels with the number of segments set to 2, 3, 6, 30 and 60, respectively. The accuracies of SAX with different resolutions vary between 0.54 and [...]. However, when we tried to combine the distance measurement in all resolution levels, the accuracy was [...]. Since SAX already encodes the order of segments in the original time series, the use of multiple resolution levels does not improve the accuracy of the representation and its performance.

Experiments on CAMMOUSE dataset. The experimental parameters for the CAMMOUSE dataset are the same as in Table 5. Table 8 displays the results for MVQ with different weight vectors and the results of the other methods. Again, the performance of the plain Euclidean distance is poor, while MVQ provides much better clustering results. Its performance is also superior to that of the other methods we tested. Observe again that even with only one layer, our distance measure can provide comparable or even better results than the others (in this case MVQ has similar space requirements as the other methods).

Table 8. Clustering accuracy of MVQ on CAMMOUSE

Method            Weight Vector   Accuracy
Single level VQ   [...]           0.61
                  [...]           0.60
                  [...]           0.59
                  [...]           0.63
                  [...]           0.62
MVQ               [...]           0.79
DFT                               0.62
SAX                               0.58
DTW                               0.69
Euclidean                         0.61

Experiments on the RTT dataset. For MVQ we used 5 different layers (1-5) with 3, 8, 8, 16, and 16 codewords, respectively. This is a total of 51 codewords. We used the same number of parameters for DFT and SAX. Table 9 compares the clustering accuracy of MVQ with that of the other methods. An important observation here is that we do not need to take all layers into consideration to get the best performance. The reason is that when the different resolution levels do not carry uniformly rich information, the involvement of less informative levels will reduce the overall accuracy.
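The accuracy values reported in Tables 7 to 9 come from the cluster similarity metric of Eq. (4). A minimal sketch of that metric is given here for concreteness. It assumes each cluster is represented as a set of item identifiers; the function names `pair_similarity` and `cluster_similarity` are illustrative and do not come from the paper.

```python
from typing import Hashable, Sequence, Set


def pair_similarity(g: Set[Hashable], a: Set[Hashable]) -> float:
    """Sim(G_i, A_j) = 2 |G_i intersect A_j| / (|G_i| + |A_j|)."""
    if not g and not a:
        # Two empty clusters: treated as 0 here (an assumption, not from the paper).
        return 0.0
    return 2.0 * len(g & a) / (len(g) + len(a))


def cluster_similarity(
    true_clusters: Sequence[Set[Hashable]],
    found_clusters: Sequence[Set[Hashable]],
) -> float:
    """Eq. (4): average, over the true clusters, of the best pairwise match."""
    k = len(true_clusters)
    if k == 0 or len(found_clusters) == 0:
        raise ValueError("both clusterings must contain at least one cluster")
    best_matches = (
        max(pair_similarity(g, a) for a in found_clusters) for g in true_clusters
    )
    return sum(best_matches) / k


if __name__ == "__main__":
    truth = [{0, 1, 2}, {3, 4, 5}]
    result = [{0, 1}, {2, 3, 4, 5}]
    print(cluster_similarity(truth, result))
```

Note that the metric is not symmetric: it averages over the true clusters G, so swapping the two arguments can change the value.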
Furthermore, the study at different single resolution levels can help us identify the importance of different layers in discriminating among classes.

Table 9. Clustering results on RTT (with the same space requirements for MVQ as for the other methods)

Method            Weight Vector   Accuracy
Single level VQ   [1 0 0 0 0]     0.55
                  [0 1 0 0 0]     0.52
                  [0 0 1 0 0]     0.57
                  [0 0 0 1 0]     0.80
                  [0 0 0 0 1]     0.79
MVQ               [0 0 0 1 1]     0.81
                  [1 1 1 1 1]     0.60
DFT                               0.54
SAX                               0.54
DTW                               0.62
Euclidean                         [...]

Summarizing time series

Here, we present results from applying MVQ to summarize time series. We consider the SYNDATA dataset. To help in evaluating the summarization capabilities of the proposed approach, in Figure 7 we present a few typical time series that we manually extracted from each of the six classes.

Figure 7. Representative time series extracted manually from the SYNDATA dataset.

Table 10 shows how the codewords of the first codebook are used to represent each class at the first level (of resolution). The actual codewords are displayed in Figure 8. The first number in each cell of Table 10 shows how the usage of a codeword (row) is distributed across classes (we show percentages). These numbers add up to 100 for each row (codeword). The second number in each cell shows the usage (in percentages) of all codewords for a certain class (column). They add up to 100 for each column.

Table 10. The codewords (c: 1-6) used to represent each one of the 6 classes of SYNDATA at level 1.

c   class1       class2    class3    class4    class5    class6
1   0, 0         100, 31   0, 0      0, 0      0, 0      0, [...]
2   [...], 100   4, 4      0, 0      0, 0      0, 0      0, 0
3   0, 0         0, 0      0, 0      50, 100   0, 0      50, [...]
4   [...], 0     0, 0      50, 100   0, 0      50, 100   0, 0
5   0, 0         100, 40   0, 0      0, 0      0, 0      0, 0
6   0, 0         100, 25   0, 0      0, 0      0, 0      0, 0

One can make the following observations about the representation of classes at this level (a coarser approximation). For all time series in class 1 (normal) only the 2nd codeword is used, and only class 2 (cyclic) time series use the same codeword (rarely though). The 2nd codeword is indeed very representative of the time series in class 1. Time series in class 2 make comparable use of codewords 1, 5, and 6 (31%, 40% and 25%), while they rarely use codeword 2. Since class 2 is the cyclic one, this makes a lot of sense. One could have a concise representation by just looking at the codewords and the frequency of their use in different classes. Classes 3 (increasing trend) and 5 (upward shift) make equal use of the 4th codeword, and no other class uses this codeword. They both have an increasing trend, so this summarizes them very well. Similarly, for classes 4 (decreasing trend) and 6 (downward shift) the 3rd codeword is used, and no other class uses this codeword. At this first level we cannot discriminate between classes 3 and 5 or between classes 4 and 6.

Figure 8. The codewords used to represent time series of the SYNDATA dataset at the first level.

The second level, though, provides more detail in the summarization, enabling the discrimination between classes 3 and 5 and between classes 4 and 6. Table 11 shows how the codewords of the second codebook are used to represent each class at the second level. The actual codewords are displayed in Figure 9. Please note that codeword numbers correspond to different codewords (not the same codewords as for level 1). Time series in class 5 make heavy use of codeword 12, which indeed represents the upward shift. This is not the case for class 3, which instead heavily uses codewords 1 and 9. Similarly, class 6 makes heavy use of codeword 15, which indeed represents the downward shift. Class 4 uses codeword 15 very rarely. The tables for the other levels are not shown here due to space limitations. They are also not very useful for summarization of this particular dataset, since most of the useful summarization information is extracted from the first two levels.

Figure 9. The codewords used to represent time series of the SYNDATA dataset at the second level.

Table 11. The codewords (c: 1-16) used to represent each one of the 6 classes of SYNDATA at level 2.

c    class1      class2    class3    class4    class5    class6
1    2, 1        0, 0      94, 19    0, 0      2, 1      2, [...]
2    [...], 51   10, 6     0, 0      0, 0      0, 0      0, 0
3    1, 1        0, 0      0, 0      3, 1      56, 21    40, [...]
4    [...], 0    0, 0      96, 48    0, 0      3, 1      1, 1
5    0, 0        0, 0      37, 5     0, 0      63, 9     0, [...]
6    [...], 5    89, 39    0, 0      0, 0      0, 0      0, 0
7    0, 0        0, 0      0, 0      100, 45   0, 0      0, 0
8    0, 0        0, 0      1, 1      3, 2      46, 28    5, [...]
9    [...], 0    0, 0      98, 26    0, 0      2, 1      0, [...]
10   [...], 39   22, 11    0, 0      0, 0      0, 0      0, [...]
11   [...], 0    0, 0      0, 0      12, 3     25, 7     63, [...]
12   [...], 0    0, 0      7, 1      0, 0      93, 21    0, [...]
13   [...], 0    0, 0      0, 0      0, 0      100, 11   0, [...]
14   [...], 2    96, 44    0, 0      0, 0      0, 0      0, [...]
15   [...], 0    0, 0      0, 0      3, 1      0, 0      97, [...]
16   [...], 1    0, 0      0, 0      70, 48    0, 0      29, 20

These results demonstrate the ability of MVQ to provide a summarization of time series datasets. This is possible due to the symbolic and multiresolution nature of the representation.

5. Discussion

The MVQ approach that we proposed for representing time series data in order to make their analysis more efficient is a natural extension of the piecewise constant approximation schemes proposed earlier. By applying Vector Quantization to extract high-level features of the data and by involving a multiresolution approach, we were able to identify a vocabulary of subsequences of various lengths and improve performance and efficiency in time series similarity retrieval. We were especially successful in domains where we could not achieve good results using the Euclidean distance as the similarity metric. In addition, the new representation is very useful in summarizing time series by providing typical patterns observed at different resolutions.

We presented the main idea of an approach to represent time series along with a new distance function that outperformed previous distance functions in our experiments and is fast to compute. Obviously, there are many variations of this approach, including the use of sliding windows, non-rigid borders for subsequences, the use of different rules for assigning weights to different resolutions, etc. These are directions in which this work can be extended.
Another interesting problem is related to the size of the codebook. When we generate the codebooks for different resolutions, the size of each codebook affects the performance of encoding. The more codewords at a given resolution, the better the approximation, but the efficiency of the method decreases. Future studies include looking into these tradeoffs in more detail.

6. Conclusions

In this paper we introduced a new symbolic representation of time series, MVQ, along with a new distance function that outperformed the major competitors in our experiments. By partitioning a sequence into equal-length segments and using vector quantization to represent each sequence by the appearance frequencies of key subsequences, MVQ provides a more meaningful similarity metric for many domains, in addition to the improvement in efficiency that comes from the dimensionality reduction, especially in the case of long sequences. Moreover, using a multiresolution approach, MVQ can record both local and global information of time series, which further improves the robustness in calculating similarity while requiring little more calculation than a single resolution approach.

The experimental evaluation of the proposed method showed that it outperforms current state-of-the-art methods in clustering and similarity searches. This is due to the following: (a) it exploits prior knowledge about the data, (b) it takes multiple resolutions into account and (c) it partially ignores the ordering of the codewords within the time sequence due to the histogram model that it uses. The proposed representation is symbolic, potentially allowing the application of text-based retrieval techniques to the similarity analysis of time series. Moreover, due to the symbolic and multiresolution representation, the proposed approach is excellent in summarizing time series by providing typical patterns observed at different resolutions. The proposed transformation makes long time series fast to process, since the length of the new representation depends only on the size of the codebook. The parameters of our method are easy to determine. In particular, a general conclusion from our experiments is that, in the absence of prior knowledge, assigning equal weights to all resolution levels works well most of the time.
While the experimental results presented here mainly focus on similarity analysis, clustering, and summarization, our approach can also be easily adjusted to other applications, such as frequent pattern retrieval (i.e., motif discovery), association rule mining, and other data mining applications.

Acknowledgements

The authors are grateful to the anonymous referees and to Eamonn Keogh for providing helpful comments. This work was supported in part by NSF under Grant No. IIS [...], by NIH under Grant No. R01MH [...] A1 (funded by NIMH, NINDS, and NIA) and by the Pennsylvania Department of Health.

References

[1] Agrawal, R., Faloutsos, C. and Swami, A., Efficient similarity search in sequence databases, Proceedings of the 4th Int'l Conference on Foundations of Data Organization and Algorithms, Chicago, IL, Oct 13-15.
[2] Agrawal, R., Lin, K. I., Sawhney, H. S. and Shim, K., Fast similarity search in the presence of noise, scaling, and translation in time-series databases, Proceedings of the 21st Int'l Conference on Very Large Databases, Zurich, Switzerland, Sept. 1995.
[3] Alcock, R.J. and Manolopoulos, Y., Time-Series Similarity Queries Employing a Feature-Based Approach, Proceedings of the 7th Hellenic Conference on Informatics, Ioannina, Greece, Aug. 1999, pp. III.1-9.
[4] Baeza-Yates, R.A. and Gonnet, G.H., A fast algorithm on average for all-against-all sequence matching, Proceedings of the String Processing and Information Retrieval Symposium, 1999.
[5] Betke, M., Gips, J. and Fleming, P., The Camera Mouse: Visual Tracking of Body Features to Provide Computer Access For People with Severe Disabilities, IEEE Transactions on Neural Systems and Rehabilitation Engineering, 10:1, March 2002.
[6] Chen, L. and Ozsu, M.T., Multi-scale histograms for answering queries over time series data, Proceedings of the 20th International Conference on Data Engineering, Boston, MA, 2004.
[7] Deligiannakis, A., Kotidis, Y. and Roussopoulos, N., Compressing historical information in sensor networks, Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, Paris, France, June 2004.
[8] Goldin, D.Q. and Kanellakis, P.C., On similarity queries for time-series data: Constraint specification and implementation, Proceedings of Constraint Programming, Marseilles, France.
[9] Faloutsos, C., Jagadish, H., Mendelzon, A. and Milo, T., A signature technique for similarity-based queries, Proceedings of the Int'l Conference on Compression and Complexity of Sequences, Positano-Salerno, Italy, Jun 11-13.
[10] Faloutsos, C., Ranganathan, M. and Manolopoulos, Y., Fast subsequence matching in time-series databases, Proceedings of the ACM SIGMOD Int'l Conference on Management of Data, Minneapolis, MN, May 25-27, 1994.
[11] Gavrilov, M., Anguelov, D., Indyk, P. and Motwani, R., Mining the stock market: Which measure is best?, Proceedings of the International Conference on Data Mining and Knowledge Discovery, 2000.
[12] Gersho, A. and Gray, R. M., Vector Quantization and Signal Compression, Kluwer Academic Publishers.
[13] Gusfield, D., Algorithms on Strings, Trees and Sequences, Cambridge University Press.
[14] Hetland, M. L., A survey of recent methods for efficient retrieval of similar time sequences, in Mark Last, Abraham Kandel, and Horst Bunke, editors, Data Mining in Time Series Databases, World Scientific.
[15] Höppner, F., Discovery of temporal patterns learning rules about the qualitative behavior of time series, Proceedings of the 5th European Conference on Principles and Practice of Knowledge Discovery in Databases, Freiburg, Germany, 2001.
[16] Huhtala, Y., Kärkkäinen, J. and Toivonen, H., Mining for similarities in aligned time series using wavelets, Data Mining and Knowledge Discovery: Theory, Tools, and Technology, SPIE Proceedings Series, Vol. [...], Orlando, FL, Apr. 1999.
[17] Kalpakis, K., Gara, D. and Puttagunta, V., Distance Measures for Effective Clustering of ARIMA Time-Series, Proceedings of the 2001 IEEE International Conference on Data Mining, San Jose, CA, Nov 29-Dec 2, 2001.
[18] Keogh, E., Chakrabarti, K., Pazzani, M. and Mehrotra, S., Locally adaptive dimensionality reduction for indexing large time series databases, Proceedings of the ACM SIGMOD Conference on Management of Data, Santa Barbara, CA, May 21-24, 2001.
[19] Keogh, E., Chakrabarti, K., Pazzani, M. and Mehrotra, S., Dimensionality Reduction for Fast Similarity Search in Large Time Series Databases, Journal of Knowledge and Information Systems, 2001.
[20] Keogh, E. and Folias, T., The UCR Time Series Data Mining Archive, Riverside, CA, University of California, Computer Science & Engineering Department.
[21] Keogh, E. and Pazzani, M., A simple dimensionality reduction technique for fast similarity search in large time series databases, Proceedings of the Fourth Pacific-Asia Conference on Knowledge Discovery and Data Mining, Kyoto, Japan.
[22] Lin, J., Keogh, E., Patel, P. and Lonardi, S., Finding motifs in time series, The 2nd Workshop on Temporal Data Mining, at the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Alberta, Canada, July 23-26.
[23] Lin, J., Keogh, E., Lonardi, S. and Chiu, B., A Symbolic Representation of Time Series, with Implications for Streaming Algorithms, Proceedings of the 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, San Diego, CA, June 13.
[24] Linde, S., Buzo, A. and Gray, A., An algorithm for vector quantizer design, IEEE Transactions on Communications, vol. 28, 1980.
[25] Lloyd, S. P., Least squares quantization in PCM, IEEE Transactions on Information Theory, IT(28), 1982.
[26] Megalooikonomou, V., Li, G. and Wang, Q., A Dimensionality Reduction Technique for Efficient Similarity Analysis of Time Series Databases, Proceedings of the 13th ACM CIKM International Conference on Information and Knowledge Management, Washington, DC, Nov. 8-13, 2004.
[27] Park, S., Chu, W.W., Yoon, J. and Hsu, C., Efficient search for similar subsequences of different lengths in sequence databases, Proceedings of the ICDE, 2000.
[28] Indyk, P., Koudas, N. and Muthukrishnan, S., Identifying Representative Trends in Massive Time Series Data Sets Using Sketches, Proceedings of VLDB, 2000.
[29] Rafiei, D., On similarity-based queries for time series data, Proceedings of the 15th International Conference on Data Engineering (ICDE), Sydney, Australia, 1999.
[30] UCI KDD Archive.
[31] Wu, Y., Agrawal, D. and El Abbadi, A., A comparison of DFT and DWT based similarity search in time-series databases, Proceedings of the 9th ACM CIKM Int'l Conference on Information and Knowledge Management, McLean, VA, Nov 6-11, 2000.
[32] Yi, B-K and Faloutsos, C., Fast Time Sequence Indexing for Arbitrary Lp Norms, Proceedings of the VLDB, Cairo, Egypt, Sept.
[33] Zhu, L., Rao, A. and Zhang, A., Theory of Keyblock-based Image Retrieval, ACM Transactions on Information Systems, 20(2), 2002.

Appendix

A. Time series codeword representation: Other models of similarity

In VQ-based image retrieval [33], two other models that have been proposed are the Boolean Model (BM) and the Vector Model (VM). The Histogram Model we adopted in our methodology can be considered as a special case of VM.
For completeness we present these models in the context of time series analysis below.

Boolean Model (BM): computes the similarity of the Boolean models of the codeword representations of two time series using the following formula:

$$S_{BM}(q, t) = n_{11}\, w_{11} + n_{00}\, w_{00}$$

where $n_{11}$ is the number of codeword indices that appear in both representations and $n_{00}$ is the number of codeword indices that appear in neither of the representations, while $w_{11}$ and $w_{00}$ are the weights assigned to these frequencies.

Vector Model (VM): computes the similarity between the frequency-based representations of two time series using the following formula:

$$S_{VM}(q, t) = \frac{\sum_{i=1}^{s} f_{i,t}\, f_{i,q}}{\sqrt{\sum_{i=1}^{s} f_{i,t}^{2}}\; \sqrt{\sum_{i=1}^{s} f_{i,q}^{2}}}$$

In the above formula, $f_{i,t}$ denotes the frequency of codeword $i$ in time series $t$, and $s$ is the size of the codebook.

B. Preprocessing cost: Codebook generation

In MVQ a codebook needs to be generated for each one of the multiresolution levels using training data before the encoding can be performed. Let $i$ be the number of iterations in the training process, where $i$ depends on the predefined threshold of the fractional drop of the distortion. During each iteration, every training vector is compared to every codeword. Since the size of the codebook is $s$, there are in total $N \cdot w$ training vectors ($N$ is the number of time series in the training set and $w$ is the number of fragments at the highest resolution of a time series), and $c$ is the number of resolution levels, the time complexity of preprocessing is:

$$T(\text{training}) = O(c \cdot N \cdot w \cdot s \cdot i).$$

This time complexity is not prohibitive, since training is done once during preprocessing and, as we showed earlier, the size of the codebook need not be large to achieve a very good approximation using MVQ. In the case that the data is modified over time, there is no additional overhead if the distributions remain the same. In the case of a decreased codebook quality, an incremental update of the codewords needs to be considered.
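The two similarity models of Appendix A can be sketched as follows. This is only an illustration under stated assumptions: each time series is taken to be already encoded as a vector of codeword frequencies over the same codebook of size s, the Vector Model is taken to be the normalized inner product of the two frequency vectors, and the function names and the default weights `w11=1.0`, `w00=0.0` are hypothetical choices that do not come from the paper.

```python
import math
from typing import Sequence


def boolean_model_similarity(
    q_freq: Sequence[float],
    t_freq: Sequence[float],
    w11: float = 1.0,  # weight for codewords present in both (assumed default)
    w00: float = 0.0,  # weight for codewords absent from both (assumed default)
) -> float:
    """S_BM(q, t) = n11 * w11 + n00 * w00."""
    if len(q_freq) != len(t_freq):
        raise ValueError("both series must be encoded with the same codebook")
    n11 = sum(1 for fq, ft in zip(q_freq, t_freq) if fq > 0 and ft > 0)
    n00 = sum(1 for fq, ft in zip(q_freq, t_freq) if fq == 0 and ft == 0)
    return n11 * w11 + n00 * w00


def vector_model_similarity(
    q_freq: Sequence[float], t_freq: Sequence[float]
) -> float:
    """S_VM(q, t): inner product of the frequency vectors over their norms."""
    if len(q_freq) != len(t_freq):
        raise ValueError("both series must be encoded with the same codebook")
    dot = sum(fq * ft for fq, ft in zip(q_freq, t_freq))
    norm_q = math.sqrt(sum(fq * fq for fq in q_freq))
    norm_t = math.sqrt(sum(ft * ft for ft in t_freq))
    if norm_q == 0.0 or norm_t == 0.0:
        # No codeword used by one of the series: similarity defined as 0 here.
        return 0.0
    return dot / (norm_q * norm_t)


if __name__ == "__main__":
    q = [2, 0, 1, 0]  # codeword frequencies of a query series (made-up values)
    t = [1, 0, 0, 3]  # codeword frequencies of a database series (made-up values)
    print(boolean_model_similarity(q, t, w11=1.0, w00=0.5))
    print(vector_model_similarity(q, t))
```

The Boolean Model only looks at which codewords occur, whereas the Vector Model also takes into account how often each codeword occurs, which is closer in spirit to the histogram-based measure used by MVQ.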
