MARKOV MODEL BASED TIME SERIES SIMILARITY MEASURING


Proceedings of the Second International Conference on Machine Learning and Cybernetics, Xi'an, 2-5 November 2003

YUN-TAO QIAN, SEN JIA, WEN-WU SI

College of Computer Science and Technology, Zhejiang University, Hangzhou, China

Abstract: Similarity or distance measures between two time series play an important role in the analysis and retrieval of time series databases, which is a fundamental problem in time series data mining. Mathematical models are widely used to represent time series, but few papers discuss them in the context of similarity measures. In this paper, we propose a Markov model based technique for similarity/distance measures of variable-length time sequences. The state space of the Markov model is partitioned by a hierarchical clustering method, and the state-transition information is used to represent a time series. The similarity/distance measures of time sequences can be defined as various functions of the difference between their state-transition information, and some widely used distance measures can be considered special cases of ours. In addition, in the modeling procedure, the vector sequence in the reconstructed phase space is used instead of the original time sequence, which more effectively reflects the dynamical properties of the time series. Experimental results show that the method works well in a strongly noisy environment, and that its flexible definition makes it versatile for various applications.

Keywords: Similarity measure; Markov model; Hierarchical clustering; Phase space reconstruction; Time series data mining

1 Introduction

Similarity or distance measuring of time series is a fundamental problem in time series data mining, and is widely applied in speech recognition, retrieval of time series databases, trajectory analysis, rule extraction from time series, and clustering, classification and prediction of time series [1,2,3].
Many similarity measures have been developed for various applications [4]. Most similarity measures are derived directly from the original time series, and few are based on a model of the time series [5,6]. However, model-based time series analysis has a solid mathematical foundation and has proven effective in many applications. A model of a time series provides inherent information about its structure and parameters, which can alleviate the effects of noise, outliers and other external factors. But estimating the model structure and parameters is complicated and time-consuming, which impedes its use in time series similarity measuring, because time series databases are generally very large.

In this paper, a novel Markov model based time series similarity measuring method is proposed, in which hierarchical clustering is used to partition the state space, adaptively simplifying the model by a coarse-to-fine scheme, and phase space reconstruction is used to deal effectively with nonlinear time series. Moreover, various time-dependent and time-independent state-transitions are defined to build different similarity measures suited to the corresponding applications. Many popular time series similarity measures can be considered special cases of ours from a certain point of view. Experimental results show the power and efficiency of our approach.

The rest of the paper is organized as follows. Section 2 surveys related work and background on time series similarity measuring. Section 3 summarizes our contributions to time series similarity and gives details of the model-based similarity method. An experimental evaluation of our similarity measures is given in Section 4. Finally, the proposed algorithm is summarized and conclusions are given in Section 5.

2 Summary of relevant research

Defining the similarity between two time series is at the heart of most time series data mining tasks.
The true meaning of similarity is open to doubt, and time series may have different sampling rates and noisy or uncertain values; similarity is therefore hard to define for time series, and all existing definitions are pragmatic. In the following we give a brief review of popular similarity measures such as Euclidean metrics and dynamic time warping (DTW). Let Q and C be two time series with lengths n and m, where

$$Q = \{q_1, q_2, \ldots, q_i, \ldots, q_n\}$$
$$C = \{c_1, c_2, \ldots, c_j, \ldots, c_m\}$$

In this paper, we only discuss similarity based on whole matching, because similarity based on subsequence matching can be derived from whole matching by a sliding "window" technique [7].

Definition 1 (Minkowski distance): for two sequences of equal length n,

$$D(Q, C) = \left( \sum_{i=1}^{n} |q_i - c_i|^p \right)^{1/p} \quad (1)$$

If p = 1, it is the Manhattan distance; if p = 2, it is the Euclidean distance; and if p = ∞, it is the maximum distance. To eliminate some distortions in the data, the Euclidean distance measure needs preprocessing procedures including offset translation, amplitude scaling, linear trend removal, and noise removal. However, the Euclidean distance cannot deal with time series that have different sampling rates. The DTW method was proposed to solve this problem [8].

Definition 2 (Dynamic Time Warping): A warping path $W = \{w_1, w_2, \ldots, w_k, \ldots, w_K\}$ is a contiguous set of elements of the matrix $D_{n \times m}$ that defines a mapping between Q and C, and it must satisfy the following requirements:

1) Boundary conditions: $w_1 = (1, 1)$, $w_K = (n, m)$.
2) Continuity conditions: if $w_k = (a, b)$ and $w_{k-1} = (a', b')$, then $a - a' \le 1$ and $b - b' \le 1$.
3) Monotonicity conditions: $a - a' \ge 0$ and $b - b' \ge 0$.

The time and space complexities of DTW are very high. Even though some fast algorithms have been developed, DTW is difficult to use in large time series databases. Besides the above two similarity measures, many others have been proposed, each according to its own understanding of similarity. In addition, to speed up similarity measuring and indexing, dimensionality reduction is widely used, in which similarity is measured in a reduced space instead of the original space. Popular dimensionality reduction methods such as time-frequency transformation, singular value decomposition, piecewise linear approximation, and symbolic approximation have been studied in depth for time series [9,10].
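The two classical measures reviewed above can be sketched in a few lines of Python. This is only an illustrative sketch, not code from the paper: the function names `minkowski` and `dtw` are hypothetical, and the DTW local cost |q_i - c_j| and the unnormalized accumulated cost are assumptions, since the text does not fix them.

```python
import math


def minkowski(q, c, p=2.0):
    """Minkowski distance between two equal-length sequences."""
    if len(q) != len(c):
        raise ValueError("Minkowski distance needs equal-length sequences")
    diffs = [abs(a - b) for a, b in zip(q, c)]
    if math.isinf(p):
        return max(diffs)  # maximum distance (p = infinity)
    return sum(d ** p for d in diffs) ** (1.0 / p)


def dtw(q, c):
    """Accumulated cost of the cheapest warping path between q and c.

    Assumption: local cost is |q_i - c_j|. The three allowed predecessor
    cells enforce the continuity and monotonicity conditions, and the
    corner cells (1, 1) and (n, m) enforce the boundary conditions.
    """
    n, m = len(q), len(c)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(q[i - 1] - c[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # step in q only
                cost[i][j - 1],      # step in c only
                cost[i - 1][j - 1],  # step in both
            )
    return cost[n][m]
```

Unlike `minkowski`, `dtw` accepts sequences of different lengths, which is the property the text attributes to DTW; the full cost table also makes its quadratic time and space cost visible.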
The evaluation of a similarity measure depends mostly on the user's opinion; therefore, machine learning based weighted similarity measures have been proposed to improve the quality of similarity with feedback information.

Model-based methods play an important role in time series data mining, and various models such as linear and nonlinear sequence models have been studied in depth; among them, the Markov model or hidden Markov model (HMM) is a good choice in many cases. Assume a set of states $\{s_1, s_2, \ldots, s_m\}$ and an output chain $\{x_n, n = 1, 2, \ldots, N\}$. This random sequence is a Markov sequence if it has the following property:

$$P(x_{n+1} = s_j \mid x_n = s_i, x_{n-1} = s_k, \ldots, x_1 = s_l) = P(x_{n+1} = s_j \mid x_n = s_i) \quad (3)$$

A Markov model is characterized by an initial distribution $\pi$ and a state-transition probability matrix A with $a_{ij} = P(x_{n+1} = s_j \mid x_n = s_i)$. In practice, a Markov sequence is always corrupted by noise in the observation process, so the HMM was proposed. Assume there is a Markov sequence $\{x_n, n = 1, 2, \ldots, N\}$ and its observation $\{y_n, n = 1, 2, \ldots, N\}$. If an observation is generated by adding Gaussian white noise to a Markov sequence, its density function is

$$P(y_n \mid x_n = s_j) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(y_n - \mu_j)^2}{2\sigma^2} \right) \quad (4)$$

Therefore, an HMM is characterized by $\pi$, A, $\mu$, and $\sigma$. Since similarity measuring is always used on large time series databases, the algorithm must be simple and easy to carry out. But parameter estimation for a Markov model or HMM is very complicated; therefore, the Markov model or HMM has so far seldom been used in this field [5,6].

3 Model-based time series similarity

In this paper, we propose a novel model-based method for time series similarity measuring, whose main features are phase space reconstruction and hierarchical state space partition.

The theory of dynamical systems is becoming more and more important in time series analysis, especially for nonlinear series. Based on dynamical theory, the time evolution of a sequence is defined in some phase space, i.e. the dynamics of a time sequence can be obtained by studying the dynamics of the corresponding phase space points. In practice, a scalar sequence of measurements is the only information that we can observe. We therefore have to convert the observations into state vectors in phase space. According to Takens' theorem, phase space reconstruction is technically solved by the method of delays [11]. Let $X = (x_1, x_2, \ldots, x_n, \ldots, x_N)$ be a time sequence, in which $x_n = x(n\Delta t)$. Its delay reconstruction in m dimensions is formed by the vectors

$$Y_n = (x_{n-(m-1)\tau}, x_{n-(m-2)\tau}, \ldots, x_{n-\tau}, x_n) \quad (5)$$

where $\tau$ is the lag or delay time, m is the embedding dimension, and $m\tau$ is the embedding time length. Finding a good embedding is a very difficult theoretical problem, and so far there is no clear solution. However, some semi-theoretical, semi-empirical methods have been presented to compute m and $\tau$.

A Markov model is used for a nondeterministic system, in which the future state is selected randomly according to the state-transition probabilities. Moreover, a deterministic system can also be regarded as a limiting case of a Markov model. Since a Markov model in phase space is a general way to represent various time series correctly, it can be used for computing the similarity of time series. The complexity of a discrete Markov model of a time sequence depends mainly on the number of states. Uniform partition of the state space to generate discrete states is frequently used in practice, and also frequently criticized because it does not consider the distribution of points in the state space. Therefore, a hierarchical clustering method is used to adaptively partition the state space from coarse to fine.
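The method of delays in Eq. (5) is easy to express in code. The sketch below is illustrative only: the function name `delay_embed` is hypothetical, indices are zero-based, and vectors are produced only for positions that have a full history of (m-1)τ earlier samples, which is an assumption about how the boundary is handled.

```python
def delay_embed(x, m, tau):
    """Delay vectors Y_n = (x[n-(m-1)*tau], ..., x[n-tau], x[n]).

    m is the embedding dimension and tau the delay time (in samples).
    Only positions n with a full history are embedded, so the result
    has len(x) - (m - 1) * tau vectors (or none if x is too short).
    """
    if m < 1 or tau < 1:
        raise ValueError("m and tau must be positive integers")
    start = (m - 1) * tau  # first index with a complete delay vector
    return [
        tuple(x[n - (m - 1 - k) * tau] for k in range(m))
        for n in range(start, len(x))
    ]
```

For example, `delay_embed([0, 1, 2, 3, 4, 5], m=3, tau=2)` yields the two vectors `(0, 2, 4)` and `(1, 3, 5)`.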
A hierarchical algorithm yields a dendrogram representing the nested clusters by an agglomerative or divisive scheme. In an agglomerative hierarchical clustering algorithm, two clusters are merged into a new cluster if they have the minimal distance (or maximal similarity) among all pairs of clusters. Therefore, the definition of the distance between two clusters is the core of an agglomerative algorithm. Most hierarchical clustering algorithms are variants of minimum, maximum, mean, and average distance based algorithms, and these four distances are defined as

$$d_{min}(C_i, C_j) = \min_{x \in C_i,\, y \in C_j} |x - y| \quad (6)$$
$$d_{max}(C_i, C_j) = \max_{x \in C_i,\, y \in C_j} |x - y| \quad (7)$$
$$d_{mean}(C_i, C_j) = |m_i - m_j| \quad (8)$$
$$d_{avg}(C_i, C_j) = \frac{1}{n_i n_j} \sum_{x \in C_i} \sum_{y \in C_j} |x - y| \quad (9)$$

where $m_i$ is the mean of cluster $C_i$ and $n_i$ is the number of points in cluster $C_i$.

Here mean distance based hierarchical clustering is used, and its procedure can be summarized as follows.

1. Each state point in the state space is defined as a sub-cluster, and all these sub-clusters form the initial clustering result.
2. Find the two closest sub-clusters, i.e. those with minimal mean distance, and merge them into a new sub-cluster.
3. Repeat step 2 until the required number of sub-clusters is reached or the mean distance between every pair of sub-clusters is larger than a preset threshold.
4. Examine all sub-clusters and eliminate small sub-clusters whose number of state points is less than a threshold.

After the hierarchical clustering procedure, the phase space is partitioned into non-overlapping subspaces, and each vector point in phase space has a label that marks which subspace it is in. If the required precision of similarity is high, the number of subspaces (sub-clusters) is set to a large value; if the required precision is lower, it is set to a small value. Obviously, high precision means sensitivity to noise. Therefore, the number of subspaces is a compromise between precision and robustness to noise.
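The four-step procedure above can be sketched as a naive agglomerative loop using the mean distance of Eq. (8). This is a minimal sketch under stated assumptions, not the authors' implementation: the names `mean_distance_clustering` and `partition_labels` and the parameters `n_clusters`, `max_dist` and `min_size` are hypothetical, points are tuples of floats, and points from eliminated sub-clusters are given one extra "rest" label.

```python
import math


def centroid(points):
    """Component-wise mean m_i of a cluster of equal-length tuples."""
    dim = len(points[0])
    return tuple(sum(p[k] for p in points) / len(points) for k in range(dim))


def mean_distance_clustering(points, n_clusters,
                             max_dist=float("inf"), min_size=1):
    clusters = [[p] for p in points]             # step 1: one point each
    while len(clusters) > n_clusters:            # step 3: stop on count ...
        best = None
        for i in range(len(clusters)):           # step 2: closest pair
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best is None or best[0] > max_dist:   # ... or on the threshold
            break
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    # step 4: drop sub-clusters with too few state points
    return [c for c in clusters if len(c) >= min_size]


def partition_labels(points, clusters):
    """Label each point with its cluster index; leftovers get the rest label."""
    lookup = {p: k for k, c in enumerate(clusters) for p in c}
    rest = len(clusters)
    return [lookup.get(p, rest) for p in points]
```

The pairwise search recomputes centroids on every pass, so this version is only suitable for small examples; a practical implementation would cache centroids or use a library routine.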
An original sequence in time space is transformed into a vector sequence in phase space, and a discrete state-transition (vector-transition) sequence is constructed by the hierarchical clustering based phase space partition. From the state-transition sequence, a state-transition probability matrix and a frequency matrix, which records the number of occurrences of every specific state-transition in the sequence, are derived.

Now we discuss the similarity of two time series Q and C. Their corresponding vector sequences in the reconstructed phase space are $Q_{phase}$ and $C_{phase}$. The phase space H is partitioned into l+1 subspaces $\{H_1, H_2, \ldots, H_l, H_{l+1}\}$, in which $H_1, H_2, \ldots, H_l$ are formed by the hierarchical clustering procedure, and the rest of the space forms $H_{l+1} = H - H_1 - H_2 - \cdots - H_l$. The discrete state-transition sequences are $Q_{transit}$ and $C_{transit}$.
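Deriving the frequency matrix and the state-transition probability matrix from a label sequence can be sketched as follows. The names `transition_matrices` and `frobenius_distance` are hypothetical. The Frobenius norm of the matrix difference is shown only as one possible "function of the difference between state-transition information"; it is an assumption for illustration and not a definition taken from the paper.

```python
import math


def transition_matrices(labels, n_states):
    """Count transitions between consecutive labels and row-normalize.

    labels: sequence of subspace indices in range(n_states).
    Returns (freq, prob): freq[s][t] is the number of s -> t transitions,
    prob[s][t] the estimated probability of moving from s to t.
    Rows with no outgoing transitions are left as all zeros.
    """
    freq = [[0] * n_states for _ in range(n_states)]
    for s, t in zip(labels, labels[1:]):
        freq[s][t] += 1
    prob = []
    for row in freq:
        total = sum(row)
        prob.append([count / total if total else 0.0 for count in row])
    return freq, prob


def frobenius_distance(a, b):
    """Hypothetical distance: Frobenius norm of the element-wise difference."""
    return math.sqrt(sum(
        (x - y) ** 2
        for row_a, row_b in zip(a, b)
        for x, y in zip(row_a, row_b)
    ))
```

Comparing two frequency matrices this way ignores the order in which transitions occur, while comparing two probability matrices additionally ignores how often each state is visited.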

The state-transition probability matrices of $Q_{transit}$ and $C_{transit}$ are $A_Q$ and $A_C$, and their frequency matrices are $F_Q$ and $F_C$, each element of which is the number of occurrences of a specific state-transition. From these model parameters of the two time series, the following model-based similarity measures (MSM), $MSM_1$ to $MSM_4$, can be defined.

Obviously, $MSM_1$ is similar to the Euclidean distance metric, but it uses a vector sequence in phase space instead of the original sequence in time space. From the viewpoint of dynamical system theory, $MSM_1$ represents the inherent information more precisely than the Euclidean distance. $MSM_2$ can be considered a specific $MSM_1$ with the hybrid dimensionality reduction technique of piecewise aggregate approximation (PAA) and symbolic approximation, which is more robust to noise than $MSM_1$. Both $MSM_1$ and $MSM_2$ are sensitive to the order of the sequence, i.e. they are not order-invariant. Unlike $MSM_2$, $MSM_3$ uses only the frequency information about state-transitions, and is therefore not related to the order of these state-transitions. Under a strict definition of model-based similarity, the similarity of two time series should consider only their models, and has no relationship with order and frequency [12]. $MSM_4$ can be regarded as a strict model-based similarity measure, because it uses only the state-transition probability matrices of the Markov model in phase space. Since the criterion of similarity is not unique, these four model-based similarity measures suit different applications. One advantage of our method is its flexibility: one model can produce several different similarity measures.

It should be noted that there is no model-based similarity measure corresponding to DTW. DTW takes the order of the sequence into account, but the sampling rate of the sequence is adaptively modified in DTW to reach a minimal distance. Adaptive sampling rate modification can be achieved by choosing suitable delay times in phase space reconstruction for the two time series. As the problem of determining the delay time is very complicated, we will study it in another paper.

4 Experiments

A range of experiments has been done to verify our novel model-based similarity measures between two time series, but due to limited space we only report an experiment on the Funnel-Bell-Cylinder dataset, which is commonly used as a benchmark. The three groups of time series in the Funnel-Bell-Cylinder dataset are generated by the following formula:

$$c(t) = (6 + \eta)\,\chi_{[a,b]}(t) + \varepsilon(t)$$
$$b(t) = (6 + \eta)\,\chi_{[a,b]}(t)\,\frac{t - a}{b - a} + \varepsilon(t)$$
$$f(t) = (6 + \eta)\,\chi_{[a,b]}(t)\,\frac{b - t}{b - a} + \varepsilon(t)$$

where $\chi_{[a,b]}(t)$ is 1 for $a \le t \le b$ and 0 otherwise, $\eta$ and $\varepsilon(t)$ are drawn from a standard normal distribution, a is an integer drawn uniformly from the range [16,32], and (b-a) is an integer drawn from the range [32,96]. Fig. 1 gives some examples of the dataset used in the experiment. Fig. 2 shows three sequences transformed into two-dimensional phase space. Our time series dataset contains 120 sequences of length 1000, with 40 examples in each group. We use leave-one-out evaluation and the nearest neighbor algorithm in our classification experiment. The error rates of $MSM_1$, $MSM_2$, $MSM_3$, and $MSM_4$ are 25.4%, 20.3%, 13.7%, and 12.9% respectively when the number of sub-clusters is 50, the embedding dimension is 3, and the time delay is 2. This result is better than that of the Euclidean distance, 26.2%.
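The evaluation protocol described above can be sketched as follows. This is a hedged illustration, not the authors' experimental code: the generator follows the cylinder-bell-funnel definition commonly used in the literature, which is assumed here to match the paper's dataset, and the names `cbf_series`, `euclidean` and `loo_1nn_error` are hypothetical. The `distance` argument can be any function of two series, so a model-based measure could be plugged in instead of the Euclidean baseline.

```python
import math
import random


def cbf_series(kind, length, rng=random):
    """One noisy series of kind 'cylinder', 'bell' or 'funnel'.

    Assumption: a ~ U{16..32}, (b - a) ~ U{32..96}, eta and the per-sample
    noise eps(t) ~ N(0, 1); the event of height 6 + eta lasts from a to b.
    """
    a = rng.randint(16, 32)
    b = a + rng.randint(32, 96)
    eta = rng.gauss(0.0, 1.0)
    series = []
    for t in range(1, length + 1):
        inside = 1.0 if a <= t <= b else 0.0
        if kind == "cylinder":
            shape = 1.0
        elif kind == "bell":
            shape = (t - a) / (b - a)   # ramps up, then drops
        elif kind == "funnel":
            shape = (b - t) / (b - a)   # jumps up, then ramps down
        else:
            raise ValueError("unknown kind: " + kind)
        series.append((6.0 + eta) * inside * shape + rng.gauss(0.0, 1.0))
    return series


def euclidean(x, y):
    """Baseline distance between two equal-length series."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(x, y)))


def loo_1nn_error(dataset, distance):
    """Leave-one-out error rate of a 1-nearest-neighbor classifier.

    dataset: list of (series, label) pairs.
    """
    errors = 0
    for i, (x, label) in enumerate(dataset):
        _, nearest_label = min(
            (distance(x, y), lab)
            for j, (y, lab) in enumerate(dataset) if j != i
        )
        errors += nearest_label != label
    return errors / len(dataset)
```

A dataset in the style of the experiment could be built with, for instance, `[(cbf_series(k, 1000), k) for k in ("cylinder", "bell", "funnel") for _ in range(40)]` and passed to `loo_1nn_error` together with `euclidean`; the resulting error rate depends on the random draw and is not expected to reproduce the figures reported in the text.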

Figure 1. Examples of the Funnel-Bell-Cylinder dataset.

Figure 2. Original time series and their corresponding vector sequences in two-dimensional phase space.

5 Conclusions

In this paper, we propose a novel model-based time series similarity measuring method, motivated by the shortcomings of existing similarity measures and the great potential of the Markov model. To uncover the inherent dynamical features, phase space reconstruction is used to transform an original sequence in time space into a vector sequence in phase space. We also use a hierarchical clustering method to partition the phase space and reduce the number of states, which significantly decreases the complexity of the model. From the Markov model in phase space, several similarity measures are derived for different applications. Many popular similarity measures of time series have corresponding model-based forms. Our model-based method may become a general framework for time series similarity measuring, which is our next research topic. In addition, the indexing problem based on our similarity measures is also future work.

Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grant

References

[1] K. Kalpakis, D. Gada and V. Puttagunta, Distance measures for effective clustering of ARIMA time-series. In Proceedings of the IEEE Int'l Conference on Data Mining, San Jose, CA, Nov 29-Dec 2, 2001.
[2] M.K. Ng and Z. Huang, Data-mining massive time series astronomical data: challenges, problems and solutions, Information and Software Technology, 41.
[3] R. Agrawal, K. Lin, H.S. Sawhney and K. Shim, Fast similarity search in the presence of noise, scaling, and translation in time-series databases. In Proceedings of the 21st Int'l Conference on Very Large Databases, Zurich, Switzerland, Sept. 1995.
[4] E. Keogh and S. Kasetty, On the Need for Time Series Data Mining Benchmarks: A Survey and Empirical Demonstration. In the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, July 23-26, Edmonton, Alberta, Canada.
[5] M.H. Law and J.T. Kwok, Rival penalized competitive learning for model-based sequence clustering. In Proceedings of the 15th Int'l Conf. on Pattern Recognition, Barcelona, Spain, September 2000.
[6] X. Ge and P. Smyth, Deformable Markov model templates for time-series pattern matching. In Proceedings of the 6th ACM SIGKDD Int'l Conference on Knowledge Discovery and Data Mining, Boston, MA, Aug 20-23, 2000.
[7] S. Park, W.W. Chu, J. Yoon and C. Hsu, Efficient searches for similar subsequences of different lengths in sequence databases. In Proceedings of the 16th Int'l Conference on Data Engineering, San Diego, CA, Feb 28-Mar 3, 2000.
[8] E. Keogh and M. Pazzani, Scaling up dynamic time warping to massive datasets. In Proceedings of the 3rd European Conference on Principles and Practice of Knowledge Discovery in Databases.
[9] E. Keogh and M. Pazzani, A simple dimensionality reduction technique for fast similarity search in large time series databases. In Proceedings of the Pacific-Asia Conf. on Knowledge Discovery and Data Mining, 2000.
[10] K. Chan and A.W. Fu, Efficient time series matching by wavelets. In Proceedings of the 15th IEEE Int'l Conference on Data Engineering, Sydney, Australia, Mar 23-26, 1999.
[11] H. Kantz and T. Schreiber, Nonlinear Time Series Analysis, Cambridge University Press.
[12] T. Kahveci, A. Singh and A. Gurel, An efficient index structure for shift and scale invariant search of multi-attribute time sequences. In Proceedings of the 18th Int'l Conference on Data Engineering, San Jose, CA, Feb 26-Mar 1.
