Clustering Sequence Data using Hidden Markov Model Representation

Cen Li and Gautam Biswas
Box 1679 Station B, Department of Computer Science, Vanderbilt University, Nashville, TN 37235 USA

ABSTRACT
This paper proposes a clustering methodology for sequence data using a hidden Markov model (HMM) representation. The proposed methodology improves upon existing HMM-based clustering methods in two ways: (i) it enables HMMs to dynamically change their model structures to obtain better-fitting models for the data during the clustering process, and (ii) it provides an objective criterion function for selecting the optimal clustering partition. The algorithm is presented in terms of four nested levels of search: (i) the search for the optimal number of clusters in a partition, (ii) the search for the optimal structure for a given partition, (iii) the search for the optimal HMM structure for each cluster, and (iv) the search for the optimal HMM parameters for each HMM. Preliminary results are given to support the proposed methodology.

Keywords: clustering, hidden Markov model, model selection, Bayesian Information Criterion (BIC), mutual information

1. INTRODUCTION
Clustering assumes data is not labeled with class information. The goal is to create structure for the data by objectively partitioning it into homogeneous groups such that within-group object similarity and between-group object dissimilarity are optimized. The technique has been used extensively and successfully by data mining researchers in discovering structure from databases where domain knowledge is unavailable or incomplete. 1,2 In the past, the focus of cluster analysis has been on data described with static features, 1,2,3 i.e., features whose values do not change during the observation period. Examples of static features include an employee's educational level and salary, or a patient's age, gender, and weight.
In the real world, most systems are dynamic and can often be best described by temporal features, whose values change significantly during the observation period. Examples of temporal features include the monthly ATM transactions and account balances of bank customers, and the blood pressure, temperature, and respiratory rate of patients under intensive care. This paper addresses the problem of clustering data described by temporal features. Clustering temporal data is inherently more complex than clustering static data because (i) the dimensionality of the data is significantly larger in the dynamic case, and (ii) the complexity of cluster definition (modeling) and interpretation increases by orders of magnitude with dynamic data. 5

We choose the hidden Markov model representation for our temporal data clustering problem. The HMM representation has a number of advantages for our problem:
- There are direct links between the HMM states and real-world situations for the problem under consideration. The hidden states of an HMM can be used to effectively model the set of potentially valid states of a dynamic process. While the exact sequence of stages a dynamic system goes through may not be observed, it can be estimated from the observable behavior of the system.
- HMMs represent a well-defined probabilistic model. The parameters of an HMM can be determined in a precise, well-defined manner, using methods such as maximum likelihood estimation or the maximum mutual information criterion.
- HMMs are graphical models of the underlying dynamic processes that govern system behavior. Graphical models may aid the interpretation task.
Clustering using HMMs was first mentioned by Rabiner et al. 6 for speech recognition problems. The idea has been further explored by other researchers, including Lee, 7 Dermatas and Kokkinakis, 8 Kosaka et al., 9 and Smyth. 10 Two main problems identified in these works are: (i) no objective criterion measure is used for determining the optimal size of the clustering partition, and (ii) a uniform, pre-specified HMM structure is assumed for the different clusters of each partition. This paper describes an HMM clustering methodology that tries to remedy these two problems by developing an objective partition criterion measure based on model mutual information, and by developing an explicit HMM model refinement procedure that dynamically modifies HMM structures during the clustering process.

2. PROPOSED HMM CLUSTERING METHODOLOGY
The proposed HMM clustering method can be summarized in terms of four levels of nested searches. From the outermost to the innermost level, the four searches are the search for:
1. the optimal number of clusters in a partition,
2. the optimal structure for a given partition,
3. the optimal HMM structure for each cluster, and
4. the optimal HMM parameters for each cluster.
Starting from the innermost level, each of these four search steps is described in more detail next.

2.1. Search Level 4: HMM Parameter Reestimation
This step tries to find the maximum likelihood parameters for an HMM of a fixed size. The well-known Baum-Welch parameter reestimation procedure 11 is used for this purpose. The Baum-Welch procedure is a variation of the more general EM algorithm, 12 which iterates between two steps: (i) the expectation step (E-step), and (ii) the maximization step (M-step). The E-step assumes the current parameters of the model and computes the expected values of the necessary statistics. The M-step uses these statistics to update the model parameters so as to maximize the expected likelihood of the parameters.
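A minimal sketch of one such E/M iteration is given below for a discrete-observation HMM (the experiments in this paper use continuous Gaussian emissions; the discrete case is shown for compactness, and all variable names are illustrative):

```python
import numpy as np

def forward_backward(A, B, pi, obs):
    """E-step recursions with per-step scaling.
    A: (S,S) transitions, B: (S,V) emissions, pi: (S,) initial distribution,
    obs: (T,) observation symbol indices.
    Returns scaled alpha, beta, and log P(obs | model)."""
    T, S = len(obs), len(pi)
    alpha, beta = np.zeros((T, S)), np.ones((T, S))
    scale = np.zeros(T)
    alpha[0] = pi * B[:, obs[0]]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        scale[t] = alpha[t].sum()
        alpha[t] /= scale[t]
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (B[:, obs[t + 1]] * beta[t + 1])) / scale[t + 1]
    return alpha, beta, np.log(scale).sum()

def baum_welch_step(A, B, pi, obs):
    """One EM iteration: accumulate expected counts (E) and renormalize (M)."""
    T, S = len(obs), len(pi)
    alpha, beta, loglik = forward_backward(A, B, pi, obs)
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)   # per-time state posteriors
    xi = np.zeros((S, S))                       # expected transition counts
    for t in range(T - 1):
        x = np.outer(alpha[t], B[:, obs[t + 1]] * beta[t + 1]) * A
        xi += x / x.sum()
    new_A = xi / xi.sum(axis=1, keepdims=True)
    new_B = np.zeros_like(B)
    for t in range(T):
        new_B[:, obs[t]] += gamma[t]
    new_B /= new_B.sum(axis=1, keepdims=True)
    return new_A, new_B, gamma[0], loglik
```

Iterating `baum_welch_step` until the returned log-likelihood stops improving gives the fixed-structure parameter estimate used at this search level.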
The procedure is implemented using the forward-backward computations. 13

2.2. Search Level 3: the optimal HMM structure
This step attempts to replace the existing model for a group of objects by a more accurate and refined HMM. Stolcke and Omohundro 14 described a technique for inducing the structure of HMMs from data based on a general "model merging" strategy. Takami and Sagayama 16 proposed the Successive State Splitting (SSS) algorithm to model context-dependent phonetic variations. Ostendorf and Singer 17 further expanded the basic SSS algorithm by choosing the node and the candidate split at the same time based on the likelihood gains. Casacuberta et al. 18 proposed to derive the structure of an HMM through error-correcting grammatical inference techniques. Our HMM refinement procedure combines ideas from these past works. We start with an initial model configuration and incrementally grow or shrink the model through HMM state splitting and merging operations to choose the right size model. The goal is to obtain a model that better accounts for the data, i.e., one having a higher model posterior probability. For both the merge and split operations, we assume the Viterbi path does not change after each operation; that is, for the split operation, the observations that were in state s will reside in one of the two new states, q0 or q1. The same is true for the merge operation. This assumption greatly simplifies the parameter estimation process for the new states. The choice of state(s) to apply the split (merge) operation to depends on the state emission probabilities. For the split operation, the state that has the highest variance is split. For the merge operation, the two states that have the closest mean vectors are considered for merging. Next we describe the two criterion measures we propose to use for HMM model selection: (i) the Posterior Probability Measure (PPM), and (ii) the Bayesian Information Criterion (BIC).
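The split/merge candidate selection described above can be sketched as follows, assuming each state's emission is summarized by a mean vector and a diagonal variance vector (function names are illustrative):

```python
import numpy as np

def choose_split_state(variances):
    """Pick the state with the largest total emission variance to split."""
    return int(np.argmax([v.sum() for v in variances]))

def choose_merge_pair(means):
    """Pick the two states whose emission mean vectors are closest
    (Euclidean distance) as the merge candidates."""
    S = len(means)
    best, best_d = None, np.inf
    for i in range(S):
        for j in range(i + 1, S):
            d = np.linalg.norm(means[i] - means[j])
            if d < best_d:
                best, best_d = (i, j), d
    return best
```

In the refinement loop, the candidate split (or merge) would be accepted only if it improves the model selection criterion (PPM or BIC) described next.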
2.2.1. Posterior probabilities for HMMs
The computation of the posterior probability of an HMM (PPM) is based on the computation of the Bayesian model merging criterion in Ref. 14. The Bayesian model merging criterion trades the model likelihood against a bias towards simpler models. Assume the a priori probability of a fully parameterized model λ = (λ_G, θ_λ), with λ_G the model structure and θ_λ the model parameters, is uniformly distributed. Given some data X, using Bayes' rule, the posterior probability of the model, P(λ|X), can be expressed as

    P(λ|X) = P(λ)P(X|λ) / P(X) ∝ P(λ)P(X|λ),

where P(X|λ) is the likelihood function. We propose to extend Stolcke and Omohundro's P(λ|X) computation for Discrete Density HMMs (DDHMMs) to our Continuous Density HMM (CDHMM) model. We decompose the model into three independent components: its global structure, λ_G; the transitions from each state q, θ^(q)_trans; and the emissions within each state, (μ^(q), Σ^(q)). Assuming the parameters associated with one state are independent of those in other states, the model prior can be written as

    P(λ) = P(λ_G) ∏_{q∈Q} P(θ^(q)_trans | λ_G) ∏_{q∈Q} P((μ^(q), Σ^(q)) | λ_G).

The structure of the model is modeled with an exponential distribution that explicitly biases towards smaller models: P(λ_G) ∝ C^{-N}, where C is a constant and C > 1. Since the transitions represent discrete, finite probabilistic choices of the next state, a Dirichlet distribution is used for calculating the probability of the transitions from each state 14:

    P(θ^(q)_trans | λ_G) = (1 / B(α_1, ..., α_t)) ∏_{i=1}^{t} θ_{qi}^{α_i − 1},

where the θ_{qi} are the transition probabilities at state q, with i ranging over the states that can follow q, and the α_i are prior weights that can be chosen to introduce more or less bias towards a uniform assignment of the parameters. This prior has the desirable characteristic that it favors state configurations with fewer but more significant outgoing transitions.
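As a concrete illustration, the log of the Dirichlet transition prior above can be evaluated directly from its density (a sketch; the choice of prior weights is an assumption):

```python
from math import lgamma, log

def log_dirichlet_prior(theta, alpha):
    """log P(theta | alpha) for a Dirichlet prior on one state's outgoing
    transition probabilities theta (theta must sum to 1).
    B(alpha) is the multivariate beta normalizer."""
    log_B = sum(lgamma(a) for a in alpha) - lgamma(sum(alpha))
    return sum((a - 1.0) * log(t) for a, t in zip(alpha, theta)) - log_B
```

With uniform weights α_i = 1 the density is constant over the simplex, so the log-prior is the same for every valid transition row; larger α_i bias towards uniform transitions, while α_i < 1 rewards sparse, significant transitions.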
For our single-component CDHMM case, we propose to use Jeffreys' prior for the location-scale parameters, i.e., the mean vector and variance matrix associated with each state 19:

    P((μ^(q), Σ^(q)) | λ_G) ∝ (Σ^(q))^{-1}.

This location-scale prior reflects the fact that data having a smaller Σ lead to a more accurate determination of the parameter μ. In the case of a CDHMM state configuration, this prior rewards CDHMMs with clearly defined states, i.e., states whose associated variances, Σ, are small.

2.2.2. Bayesian Information Criterion for HMMs
One problem with the PPM criterion is that it depends heavily on the base value, C, of the exponential distribution for the global model structure probability. Currently, we do not have a strategy for selecting the exponential base value for different problems, and the model selection performance deteriorates if the right base value is not used. An alternative scheme is the Bayesian model selection approach. A criterion often used in Bayesian model selection is the relative model posterior probability, P(λ, X), given by P(λ, X) = P(λ)P(X|λ). By assuming a uniform prior probability for the different models, P(λ, X) ∝ P(X|λ), where P(X|λ) is the marginal likelihood. The goal of this approach is to select the model that gives the highest marginal likelihood. Computing the marginal likelihood for complex models 21,22,23 has been an active research area. Established approaches include Monte Carlo methods, e.g., Gibbs sampling, 23,24 and various approximation methods, e.g., the Laplace approximation and the approximation based on the Bayesian Information Criterion. 21 It has been well documented that the Monte Carlo methods are very accurate but computationally inefficient, especially for large databases. It has also been shown that, under certain regularity conditions, the Laplace approximation can be quite accurate, but its computation can be expensive, especially the computation of its Hessian matrix component.
A widely used and very efficient approximation method for the marginal likelihood is the Bayesian Information Criterion, where, in log form, the marginal likelihood of a model given the data is computed as

    log P(λ|X) ≈ log P(X|θ̂) − (d/2) log N,

where θ̂ is the Maximum Likelihood (ML) configuration of the model parameters, d is the dimensionality of the model parameter space, and N is the number of cases in the data. We choose BIC as our alternative HMM model selection criterion.

2.3. Search Level 2: the optimal partition structure
The two distance measures most commonly used in the context of the HMM representation are the sequence-to-model likelihood measure and the symmetrized distance measure between pairwise models. 26 We choose the sequence-to-model likelihood distance measure for our HMM clustering algorithm. The sequence-to-HMM likelihood, P(O|λ), measures the probability that a sequence, O, is generated by a given model, λ. When the sequence-to-HMM likelihood distance measure is used for object-to-cluster assignments, it automatically enforces the criterion of maximizing within-group similarity. A K-means style clustering control structure and a depth-first binary divisive clustering control structure are proposed to generate partitions having different numbers of clusters. For each partition, the initial object-to-cluster memberships are determined by the sequence-to-HMM likelihood distance measure. The objects are subsequently redistributed after HMM parameter reestimation and HMM model refinement have been applied to the intermediate clusters. For the K-means style algorithm, the redistribution is global across all clusters. For binary hierarchical clustering, the redistribution is carried out between the child clusters of the current cluster; thus the algorithm is not guaranteed to produce the maximally probable partition of the data set. If the goal is to have a single partition of the data, the K-means style control structure may be used.
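The BIC approximation of Section 2.2.2 reduces to a one-line score. The sketch below applies it to comparing candidate HMM sizes; the parameter-counting convention for a diagonal-covariance Gaussian HMM is an illustrative assumption, not taken from the paper:

```python
from math import log

def bic_score(loglik, n_params, n_cases):
    """BIC in log form: ML log-likelihood penalized by (d/2) log N.
    Higher is better; the penalty grows with model size."""
    return loglik - 0.5 * n_params * log(n_cases)

def gaussian_hmm_n_params(n_states, n_features):
    """Free parameters of a diagonal-covariance Gaussian HMM (one assumed
    counting convention): initial distribution (S-1) + transition rows
    S*(S-1) + means and variances 2*S*F."""
    return (n_states - 1) + n_states * (n_states - 1) + 2 * n_states * n_features
```

Model selection then keeps the candidate structure with the largest `bic_score`, so an increase in likelihood from adding states must outweigh the extra parameters it introduces.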
If one wants to look at partitions at various levels of detail, binary divisive clustering may be suitable.

2.4. Search Level 1: the optimal number of clusters in the partition
The quality of a clustering is measured in terms of its within-cluster similarity and between-cluster dissimilarity. A common criterion measure used by a number of HMM clustering schemes is the overall likelihood of the data given the models of the set of clusters. Since our distance measure does well in maximizing the homogeneity of objects within each cluster, we want a criterion measure that is good at comparing partitions in terms of their between-cluster distances. We use the Partition Mutual Information (PMI) measure 27 for this task. From Bayes' rule, the posterior probability of a model, λ_i, trained on data, O_i, is given by

    P(λ_i|O_i) = P(O_i|λ_i)P(λ_i) / P(O_i) = P(O_i|λ_i)P(λ_i) / Σ_{j=1}^{J} P(O_i|λ_j)P(λ_j),

where P(λ_i) is the prior probability of a datum coming from cluster i before the feature values are inspected, and P(O_i|λ_i) is the conditional probability of displaying the features O_i given that it comes from cluster i. Let MI_i represent the average mutual information between the observation sequence O_i and the complete set of models Λ = (λ_1, ..., λ_J):

    MI_i = log P(λ_i|O_i) = log(P(O_i|λ_i)P(λ_i)) − log Σ_{j=1}^{J} P(O_i|λ_j)P(λ_j).

Maximizing this value is equivalent to separating the correct model λ_i from all the other models on the training sequence O_i. The overall information of the partition with J models is then computed by summing the mutual information over all training sequences:

    PMI = Σ_{j=1}^{J} Σ_{i=1}^{n_j} MI_i,

where n_j is the number of objects in cluster j, and J is the total number of clusters in the partition. PMI is maximized when the J models are the most separated set of models, without fragmentation.
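The PMI computation above can be evaluated directly from per-object log-likelihoods under each cluster model; the log-domain arithmetic below is an implementation choice (a sketch), not part of the paper:

```python
import numpy as np

def partition_mutual_information(loglik, log_prior, labels):
    """PMI = sum_i [ log(P(O_i|lam_i)P(lam_i)) - log sum_j P(O_i|lam_j)P(lam_j) ].

    loglik   : (n_objects, J) array of log P(O_i | lambda_j)
    log_prior: (J,) array of log P(lambda_j)
    labels   : (n_objects,) cluster index of each object
    """
    joint = loglik + log_prior                         # log P(O_i|lam_j)P(lam_j)
    m = joint.max(axis=1, keepdims=True)               # stabilized log-sum-exp
    log_norm = m[:, 0] + np.log(np.exp(joint - m).sum(axis=1))
    mi = joint[np.arange(len(labels)), labels] - log_norm  # MI_i = log P(lam_i|O_i)
    return mi.sum()
```

Since each MI_i is a log-posterior, PMI is at most 0, and it approaches 0 exactly when every object is assigned to a model that dominates all the others on that object.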
[Figure 1. Objects generated from different HMMs: four example data objects, each plotted as two temporal feature sequences.]

3. PRELIMINARY RESULTS
We have conducted preliminary experiments with HMM clustering on artificially generated data. Since we have not finished implementing the HMM refinement procedure, in the following experiments we assume the correct model structure is known and fixed throughout the clustering process. Therefore, uniform prior distributions are assumed for all HMMs in the computation. For these experiments, the objective of the HMM clustering is to derive a good partition with the optimal number of clusters and object-cluster memberships. To generate data with K clusters, we first manually create K HMMs. From each of these K HMMs, we generate N_k objects, each described by M temporal sequences. The length of each temporal sequence is L, so the total number of data points in such a data set is K × N_k × M × L. In these experiments, we choose K = 4 and M = 2. The HMM for each cluster has 5 states. Figure 1 shows four example data objects from these models. We observe that, from the feature values alone, it is quite difficult to differentiate which objects were generated from the same model. In fact, objects 1 and 3 are generated from the same model, and objects 2 and 4 are generated from a different model.

3.1. Experiment One
In this experiment, we illustrate the binary HMM clustering process and the effects of the PMI criterion measure. In the first part of the experiment, the PMI criterion measure was not incorporated in the binary clustering tree building process. A branch of the tree is terminated either because there are too few objects in the node, or because the object redistribution process in a node ends with a one-cluster partition.
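The data-generation scheme described above can be sketched as sampling from hand-built Gaussian-emission HMMs (the parameter values below are illustrative, not those used in the paper's experiments):

```python
import numpy as np

def sample_hmm_sequence(pi, A, means, stds, length, rng):
    """Sample one temporal sequence of the given length from a
    Gaussian-emission HMM (pi: initial dist., A: transitions)."""
    S = len(pi)
    seq = np.empty(length)
    state = rng.choice(S, p=pi)
    for t in range(length):
        seq[t] = rng.normal(means[state], stds[state])
        state = rng.choice(S, p=A[state])
    return seq

def generate_dataset(hmms, n_objects, n_features, length, rng):
    """K clusters x n_objects objects; each object is n_features
    temporal sequences of the given length, plus true cluster labels."""
    data, labels = [], []
    for k, (pi, A, means, stds) in enumerate(hmms):
        for _ in range(n_objects):
            obj = np.stack([sample_hmm_sequence(pi, A, means, stds, length, rng)
                            for _ in range(n_features)])
            data.append(obj)
            labels.append(k)
    return np.array(data), np.array(labels)
```

The labels are discarded before clustering and used only to count misclassifications afterwards, as in the experiments below.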
The full binary clustering tree, as well as the PMI scores for the intermediate and final partitions, are shown in Figure 2(a). The PMI score to the right of each level of the tree indicates the quality of the corresponding partition, which consists of all nodes at the frontier of the tree at that level; for example, one score applies to the two-cluster partition {C_4, C_123}, and a lower score to the partition obtained after further splitting. The result of this clustering process is a 7-cluster partition with six fragmented clusters: clusters C_1, C_2, and C_3 are each fragmented into two subclusters, while cluster C_4 remains intact. Figure 2(b) shows the binary HMM clustering tree where the PMI criterion measure is used to determine branch terminations. The dotted lines cut off the branches of the search tree where the split of the parent cluster results in a decrease in the PMI score. This clustering process rediscovers the correct 4-cluster partition.
[Figure 2. The binary HMM clustering tree: (a) the full tree with the PMI score of each intermediate and final partition, (b) the tree with PMI-based branch termination.]

[Figure 3. HMM clustering results: (a) misclassification vs. data having different levels of noise (K-means and binary divisive), (b) misclassification vs. clustering starting with different size HMMs.]
3.2. Experiment Two
In this experiment, we study the performance of the HMM clustering system when the data are corrupted by different levels of noise. White Gaussian noise was added to the data, computed at different signal-to-noise ratios. 28 More noise is successively added to the original 4-cluster data, i.e., the signal-to-noise ratio is successively decreased from 35 to 1. Figure 3(a) shows the clustering results in terms of the misclassification counts vs. the signal-to-noise ratio. We observe that the noise does not have much effect on the clustering results until it becomes very large, i.e., S/N < 5 dB. Beyond that point, the clustering process fails to separate out the objects from three of the HMMs.

3.3. Experiment Three
In this experiment, we study the effect of different initial HMM structures on clustering performance. The four original HMMs all have 5 states; the initial HMMs in this experiment have numbers of states ranging from 2 to 8. Figure 3(b) shows the results in terms of the misclassification counts versus the number of states in the initial HMMs. The clustering results remain the same for initial HMMs having 3, 4, and 5 states. For initial HMMs having 2 states, the misclassification is high; the algorithm fails to separate objects from HMM models 2 and 3. For initial HMMs having 6, 7, and 8 states, the clustering partition generated is close to optimal. This result agrees with the intuition that initial HMMs having too few states will result in worse clustering partitions than initial HMMs having too many states. The reason is that when a model is too small, multiple state definitions have to be squeezed into one state, which makes the state definitions less specific and the model less accurate. On the other hand, when there are extra states in the model, the model can be made more accurate by dividing a single state definition into multiple state definitions.
At the very least, the clustering can retain the original model by setting the transitions into the extra states very small, effectively ignoring those states.

REFERENCES
1. P. Cheeseman and J. Stutz, "Bayesian classification (AutoClass): Theory and results," in Advances in Knowledge Discovery and Data Mining, U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, eds., ch. 6, pp. 153-180, AAAI/MIT Press, 1996.
2. G. Biswas, J. Weinberg, and C. Li, "ITERATE: A conceptual clustering method for knowledge discovery in databases," in Artificial Intelligence in the Petroleum Industry: Symbolic and Computational Applications, B. Braunschweig and R. Day, eds., Editions Technip, 1995.
3. D. Fisher, "Knowledge acquisition via incremental conceptual clustering," Machine Learning 2, pp. 139-172, 1987.
4. C. S. Wallace and D. L. Dowe, "Intrinsic classification by MML - the Snob program," in Proceedings of the Seventh Australian Joint Conference on Artificial Intelligence, pp. 37-44, World Scientific, 1994.
5. C. Li, "Unsupervised classification on temporal data," survey paper, Department of Computer Science, Vanderbilt University, Apr. 1998.
6. L. R. Rabiner, C. H. Lee, B. H. Juang, and J. G. Wilpon, "HMM clustering for connected word recognition," in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, 1989.
7. K. F. Lee, "Context-dependent phonetic hidden Markov models for speaker-independent continuous speech recognition," IEEE Transactions on Acoustics, Speech, and Signal Processing 38(4), pp. 599-609, 1990.
8. E. Dermatas and G. Kokkinakis, "Algorithm for clustering continuous density HMM by recognition error," IEEE Transactions on Speech and Audio Processing, pp. 231-234, May 1996.
9. T. Kosaka, S. Masunaga, and M. Kuraoka, "Speaker-independent phone modeling based on speaker-dependent HMM's composition and clustering," in Proceedings of ICASSP '95, 1995.
10. P. Smyth, "Clustering sequences with hidden Markov models," in Advances in Neural Information Processing Systems, 1997.
11. L. E. Baum, T. Petrie, G. Soules, and N. Weiss, "A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains," The Annals of Mathematical Statistics 41(1), pp. 164-171, 1970.
12. A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, Series B (Methodological) 39, pp. 1-38, 1977.
13. Z. Ghahramani and M. I. Jordan, "Factorial hidden Markov models," Tech. Rep. 9502, MIT Computational Cognitive Science, Aug. 1995.
14. A. Stolcke and S. M. Omohundro, "Best-first model merging for hidden Markov model induction," Tech. Rep. TR-94-003, International Computer Science Institute, 1947 Center St., Suite 600, Berkeley, CA 94704-1198, Jan. 1994.
15. S. M. Omohundro, "Best-first model merging for dynamic learning and recognition," in Advances in Neural Information Processing Systems, pp. 958-965, 1992.
16. J. Takami and S. Sagayama, "A successive state splitting algorithm for efficient allophone modeling," in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, vol. 1, pp. 573-576, 1992.
17. M. Ostendorf and H. Singer, "HMM topology design using maximum likelihood successive state splitting," Computer Speech and Language 11, pp. 17-41, 1997.
18. F. Casacuberta, E. Vidal, and B. Mas, "Learning the structure of HMM's through grammatical inference techniques," in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, pp. 717-720, 1990.
19. G. Box and G. C. Tiao, Bayesian Inference in Statistical Analysis, Addison-Wesley, 1973.
20. R. E. Kass and A. E. Raftery, "Bayes factors," Journal of the American Statistical Association 90, pp. 773-795, June 1995.
21. D. Heckerman, "A tutorial on learning with Bayesian networks," Tech. Rep. MSR-TR-95-06, Microsoft Research, Advanced Technology Division, One Microsoft Way, Redmond, WA 98052, 1996.
22. G. F. Cooper and E. Herskovits, "A Bayesian method for the induction of probabilistic networks from data," Machine Learning 9, pp. 309-347, 1992.
23. S. Chib, "Marginal likelihood from the Gibbs output," Journal of the American Statistical Association, pp. 1313-1321, Dec. 1995.
24. G. Casella and E. I. George, "Explaining the Gibbs sampler," The American Statistician 46, pp. 167-174, Aug. 1992.
25. L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE 77, pp. 257-286, Feb. 1989.
26. B. H. Juang and L. R. Rabiner, "A probabilistic distance measure for hidden Markov models," AT&T Technical Journal 64, pp. 391-408, Feb. 1985.
27. L. R. Bahl, P. F. Brown, P. V. de Souza, and R. L. Mercer, "Maximum mutual information estimation of hidden Markov model parameters," in Proceedings of the IEEE-IECEJ-ASJ International Conference on Acoustics, Speech, and Signal Processing, vol. 1, pp. 49-52, 1986.
28. D. J. Mashao, Computations and Evaluations of an Optimal Feature-set for an HMM-based Recognizer, PhD thesis, Brown University, May 1996.