EM Algorithm with Split and Merge in Trajectory Clustering for Automatic Speech Recognition


Yan Han and Lou Boves
Department of Language and Speech, Radboud University Nijmegen, The Netherlands
{Y.Han,

Abstract. In this paper, we introduce two reformulated versions of the standard EM algorithm, namely Successive Split EM and Split and Merge EM, to relax the problem of initialization dependence in data-driven Speech Trajectory Clustering. These two algorithms prevent the EM procedure in Trajectory Clustering from ending in a local maximum of the likelihood surface, and therefore generate more coherent trajectory clusters. We applied both methods to develop multiple parallel HMMs for a continuous digit recognition task, and compared their performance with that of knowledge-based context-dependent Head-Body-Tail models. The results show that both data-driven approaches significantly outperform the knowledge-based approach. In addition, in most cases the model based on Split and Merge EM is better than the model based on Successive Split EM.

1 Introduction

Over the past decades, it has been repeatedly shown that modeling pronunciation variation with multiple parallel HMM paths can significantly improve the performance of automatic speech recognition. The idea underlying multiple-HMMs acoustic modeling is to use HMM topologies with multiple parallel paths that account for the structure of the acoustic variability, thus alleviating the so-called trajectory folding problem [1]. Well-known examples are gender-dependent models and context-dependent models. The common feature of these examples is that the training tokens of an acoustic unit (e.g. phoneme, syllable, word) are first clustered into separate subgroups with respect to a priori phonetic and linguistic knowledge, and these subgroups are then used to train separate HMM paths.
The Head-Body-Tail (HBT) model [2] for digit recognition is an example of context-dependent modeling in which phonetic knowledge about the immediate left and/or right neighboring acoustic unit is used as the criterion to split training tokens. However, this top-down method is not necessarily suitable for all sources of variation. First of all, it is very hard to decide what is the most important

source of variation in a certain speech database. Inter-speaker variation, for example, may well be more important than linguistic context variation for a small vocabulary recognition task. Secondly, even within one speech database, the most important variation for different acoustic units may be due to different factors, such as speaking style, speed, or the regional background of the speakers. The use of a single criterion to derive pronunciation variation clusters for all acoustic units might not be appropriate. Finally, some important sources of speech variation may not be amenable to top-down modeling. Speaking style, for instance, is important for many speech recognition tasks, but it is very hard to label utterances in a database for relevant styles. These limitations of the knowledge-based methodology limit the power of conventional multiple-HMM acoustic modeling.

The limitations of the knowledge-based approach might deteriorate the performance of a Chinese speech recognizer even more seriously. Chinese is a syllable-based language. The most natural acoustic units for a Chinese recognizer are syllables, which can model long-term coarticulation in speech well. However, part of the pronunciation variation is still due to factors such as neighboring syllables, speaking rate, dialect, etc. Applying prior knowledge, for instance context, to syllables rather than phonemes may leave too little training data in the subgroups to accurately train the separate HMM paths. To overcome the limitations of the knowledge-based approach, several data-driven approaches have been proposed [3] [4]. Contrary to a knowledge-based approach, a data-driven approach automatically derives the most salient pronunciation variation classes by clustering the training tokens of individual acoustic units. In this way, the most important variants can be uncovered directly from the acoustic data.
However, given the fact that speech tokens are time-series data of different lengths, it is not straightforward to define a distance measure, which is a necessary prerequisite for clustering speech tokens in a data-driven manner. Previous methods to measure the distance include dynamic time warping of tokens [3], and modeling each individual token as an HMM [4]. However, the first method loses the information that successive frames in a speech token are not statistically independent, and the second method loses information about the details of the temporal evolution of the speech patterns. In our previous work, we developed a novel data-driven method to cluster training tokens, namely Trajectory Clustering (TC), and evaluated the method on different types of recognition tasks [5] [6]. In this approach, the training tokens are represented as continuous trajectories over time in the acoustic parameter space. The speech trajectories are then clustered into a number of classes using a Mixture of Polynomial Regressions [7]. In this way, the dependency between neighboring frames and the evolutionary pattern of a training token are preserved. One common but serious problem for TC is that it is highly sensitive to the initial values of its model parameters, because the EM algorithm adopted by TC for parameter estimation can only find a locally optimal solution. The major contribution of this paper is to introduce two clustering strategies, namely Successive Split EM (SSEM) and Split and Merge EM (SMEM), to partly solve the initialization problem. Experiments were carried out on a connected Dutch digit recognition task to evaluate the performance of these approaches, using conventional context-dependent HBT models as a reference. The proposed TC model can also be directly applied to Chinese speech recognition.

This paper is organized as follows: Section 2 introduces the mathematics underlying the TC model, together with the overall clustering strategy based on SSEM and SMEM. Section 3 describes the design and the results of the experiments. Finally, in Section 4, our main conclusions are drawn.

2 Methodology

2.1 Speech Trajectory Clustering

In TC, speech tokens are assumed to be drawn from the components of a Gaussian mixture in which the mean of each component density is a polynomial function of time. For speech token j with a length of N_j frames, the regression equation for component k in a D-dimensional acoustic feature space can be written in matrix form as

Y_j = X_j \beta_k + E_k \qquad (1)

or, written out for each dimension d = 1, \dots, D:

\begin{pmatrix} y_j^{(d)}(0) \\ y_j^{(d)}(1) \\ \vdots \\ y_j^{(d)}(N_j-1) \end{pmatrix} = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 1 & \frac{1}{N_j-1} & \cdots & \left(\frac{1}{N_j-1}\right)^p \\ \vdots & \vdots & & \vdots \\ 1 & 1 & \cdots & 1 \end{pmatrix} \begin{pmatrix} \beta_{k,0}^{(d)} \\ \beta_{k,1}^{(d)} \\ \vdots \\ \beta_{k,p}^{(d)} \end{pmatrix} + \begin{pmatrix} e_k^{(d)}(0) \\ e_k^{(d)}(1) \\ \vdots \\ e_k^{(d)}(N_j-1) \end{pmatrix}

Here Y_j is the N_j × D feature vector matrix; X_j is an N_j × (p + 1) design matrix whose second column contains the (normalized) frame numbers corresponding to the feature vectors in Y_j, and p is the highest order of the regression model, in our case p = 3; β_k is a (p + 1) × D matrix of regression coefficients; and E_k is the N_j × D residual error matrix, which is assumed to be zero-mean multivariate Gaussian with covariance matrix Σ_k. Since the speech trajectories that we deal with have different durations, we normalize the trajectories to unit length by dividing the frame numbers in the second column of X_j by N_j − 1. Note that this normalization does not change the number of frames in a speech token; it only changes the representation of time.
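As an illustration of this normalized polynomial design matrix, here is a minimal NumPy sketch (the helper name and the use of `np.vander` are our own choices, not from the paper):

```python
import numpy as np

def design_matrix(n_frames, p=3):
    """Build the N_j x (p+1) polynomial design matrix X_j, with frame
    indices normalized to [0, 1] by dividing by N_j - 1 so that tokens
    of different durations share the same time representation."""
    t = np.arange(n_frames) / (n_frames - 1)      # normalized time
    return np.vander(t, p + 1, increasing=True)   # columns: 1, t, t^2, ..., t^p

X = design_matrix(5, p=3)
print(X.shape)    # (5, 4)
print(X[:, 1])    # normalized times: [0.  0.25 0.5  0.75 1. ]
```

Each row of the matrix corresponds to one frame; multiplying it by a coefficient matrix β_k evaluates the polynomial mean trajectory at every (normalized) frame time.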
In [8], we found that this method of handling different durations yields the most coherent clusters. Assume that speech trajectories are modeled by a Gaussian mixture with K components, each of which is a regression model with polynomial mean and

Gaussian residual. Then the probability that a speech trajectory Y_j is generated by the mixture model is a linear combination of the component regression models, which can be written as

P(Y_j \mid X_j, \theta) = \sum_{k=1}^{K} \omega_k \prod_{i=0}^{N_j-1} f_k(y_j(i) \mid x_j(i), \theta_k) \qquad (2)

where the ω_k are the weights of the components, f_k(y_j(i) | x_j(i), θ_k) is the observation density given that Y_j belongs to component k, and θ_k = {β_k, Σ_k} are the model parameters of the k-th regression component. The log-likelihood of the parameters θ given the set S of M speech trajectories is defined as

L(\theta \mid S) = \sum_{j=1}^{M} \log \sum_{k=1}^{K} \omega_k \prod_{i=0}^{N_j-1} f_k(y_j(i) \mid x_j(i), \theta_k) \qquad (3)

To find the maximum likelihood estimates of the parameters of a mixture model, EM is the most general algorithm. The EM algorithm consists of the following two steps. The E-step calculates the membership probability

h_{jk} = \frac{\omega_k \prod_{i=0}^{N_j-1} f_k(y_j(i) \mid x_j(i), \theta_k)}{\sum_{k'=1}^{K} \omega_{k'} \prod_{i=0}^{N_j-1} f_{k'}(y_j(i) \mid x_j(i), \theta_{k'})} \qquad (4)

which is the posterior probability that trajectory Y_j is generated by component k. All the acoustic vectors in Y_j share the same probability. The M-step calculates the new model parameters:

\hat{\beta}_k = (X^T H_k X)^{-1} X^T H_k Y \qquad (5)

\hat{\Sigma}_k = \frac{(Y - X\hat{\beta}_k)^T H_k (Y - X\hat{\beta}_k)}{\sum_{j=1}^{M} N_j h_{jk}} \qquad (6)

\hat{\omega}_k = \frac{1}{M} \sum_{j=1}^{M} h_{jk} \qquad (7)

where X and Y stack the X_j and Y_j of all M trajectories, and H_k is a diagonal matrix with [h_{1k} h_{2k} ... h_{Mk}] on the diagonal, in which h_{jk} stands for a row vector containing N_j copies of the membership probability h_{jk}. By default, the EM algorithm starts with randomly initialized model parameters. The E-step and M-step are then performed iteratively until the log-likelihood of Eq. (3) converges. Finally, each speech trajectory is assigned to the cluster with the highest membership probability h_{jk}.
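The E- and M-steps of Eqs. (2)-(7) can be sketched as follows. This is a minimal illustration under our own implementation choices (function name, random-responsibility initialization instead of random parameters, and a small covariance regularizer for numerical safety), not the authors' code:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_trajectory_clustering(Ys, Xs, K, n_iter=30, seed=0):
    """EM for a K-component mixture of polynomial regressions (Eqs. (2)-(7)).

    Ys: list of (N_j, D) feature matrices, one per speech token.
    Xs: list of (N_j, p+1) normalized-time design matrices.
    Returns membership probabilities h (M, K) and the model parameters.
    """
    rng = np.random.default_rng(seed)
    M, D = len(Ys), Ys[0].shape[1]
    Xall, Yall = np.vstack(Xs), np.vstack(Ys)
    # start from random responsibilities rather than random parameters,
    # so the first M-step always yields well-defined components
    h = rng.dirichlet(np.ones(K), size=M)
    betas, Sigmas = [None] * K, [None] * K
    w = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # M-step (Eqs. (5)-(7)): weighted least squares over all stacked frames
        for k in range(K):
            hw = np.concatenate([np.full(len(Ys[j]), h[j, k]) for j in range(M)])
            XtH = Xall.T * hw
            betas[k] = np.linalg.solve(XtH @ Xall, XtH @ Yall)
            R = Yall - Xall @ betas[k]
            Sigmas[k] = (R.T * hw) @ R / hw.sum() + 1e-8 * np.eye(D)
            w[k] = h[:, k].mean()
        # E-step (Eq. (4)): every frame of a token shares the token's membership
        logp = np.empty((M, K))
        for j in range(M):
            for k in range(K):
                resid = Ys[j] - Xs[j] @ betas[k]
                logp[j, k] = np.log(w[k]) + multivariate_normal.logpdf(
                    resid, mean=np.zeros(D), cov=Sigmas[k]).sum()
        logp -= logp.max(axis=1, keepdims=True)   # numerical stabilization
        h = np.exp(logp)
        h /= h.sum(axis=1, keepdims=True)
    return h, betas, Sigmas, w
```

After convergence, each trajectory is assigned to the component with the largest h_{jk}, exactly as described above.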

Table 1. Successive Split EM for Speech Trajectory Clustering

1. Fit one polynomial to the complete data set, compute the model parameters β and Σ, set K = 1 and ω_1 = 1;
2. Select the component k (k ≤ K) with the largest ω_k to split, initialize the parameters of the new components according to Eqs. (8)-(10), and increase K by 1;
3. Run EM on all mixture components until the log-likelihood (Eq. (3)) converges;
4. Loop to step 2; stop when K has reached the desired number.

2.2 Successive Split EM

One of the issues with the EM algorithm is that it often converges to a local maximum of the likelihood surface. As a consequence, the EM procedure for TC is highly sensitive to the initial parameter assignments: different initial values of the model parameters lead to different clusters after EM estimation. One way to tackle this problem is to apply the Linde-Buzo-Gray (LBG) algorithm [9] to TC. We start from the complete set of speech trajectories, then successively split one cluster, selected according to a split criterion, until K clusters are obtained. Assume that in a certain split iteration, component k is selected to be split into k' and j'. The parameters of k' and j' after the split are initialized as follows:

\beta_{k',0} = \beta_{k,0} + \varepsilon \quad \text{and} \quad \beta_{j',0} = \beta_{k,0} - \varepsilon \qquad (8)

where ε is a small noise term sampled from N(0, Σ);

\Sigma_{k'} = \Sigma_{j'} = \det(\Sigma_k)^{1/D} I_D \qquad (9)

where det(Σ) denotes the determinant of the matrix Σ and I_D is the D-dimensional identity matrix; and

\omega_{k'} = \omega_{j'} = \frac{\omega_k}{2} \qquad (10)

The split criterion adopted in this work is to always split the component with the largest weight ω. In all the TC experiments we have conducted so far, the component with the largest ω has always corresponded to the cluster with the largest number of trajectories. Splitting the largest cluster is therefore reasonable, because it makes the resulting K clusters approximately equal in size, so that the speech tokens in all clusters are equally sufficient to train separate HMM paths.
The SSEM algorithm for TC is summarized in Table 1.
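The split initialization of Eqs. (8)-(10) can be sketched as follows. This is a minimal illustration; the function name is ours, and we draw ε from N(0, Σ_k) of the component being split, which is one reading of the Σ in Eq. (8):

```python
import numpy as np

def split_component(beta, Sigma, w, rng):
    """Split one mixture component into two, following Eqs. (8)-(10):
    perturb the constant-term coefficients by +/- eps, reset both new
    covariances to det(Sigma)^(1/D) * I_D, and halve the weight."""
    D = Sigma.shape[0]
    eps = rng.multivariate_normal(np.zeros(D), Sigma)  # small noise term
    beta_a, beta_b = beta.copy(), beta.copy()
    beta_a[0] = beta[0] + eps   # beta_{k',0} = beta_{k,0} + eps
    beta_b[0] = beta[0] - eps   # beta_{j',0} = beta_{k,0} - eps
    Sigma_new = np.linalg.det(Sigma) ** (1.0 / D) * np.eye(D)
    return (beta_a, Sigma_new, w / 2.0), (beta_b, Sigma_new, w / 2.0)
```

The SSEM loop of Table 1 then alternates this split with a full EM pass, always picking the component with the largest weight ω_k as the split candidate.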

Table 2. Split and Merge EM for Speech Trajectory Clustering

1. Randomly initialize the parameters of the K mixture components;
2. Run EM on all mixture components until convergence;
3. Collect a list of merge candidates according to Eq. (14) and a list of split candidates according to ω_k; sort these lists;
4. Select the most promising split and merge triplet {i, j, k} from the sorted candidate list;
5. Perform the split and merge operations, initializing the parameters of the new model according to Eqs. (8)-(13);
6. Run EM on all mixture components until convergence;
7. If the log-likelihood improved after the split and merge, save the newly estimated parameters, discard the other candidates, and go back to step 3; otherwise, reject the candidate;
8. Loop to step 4 until no candidate remains in the list.

2.3 Split and Merge EM

In recent research [10], the idea of performing split and merge operations has been successfully applied to EM for Gaussian mixture models. In the case of mixture models, the local maxima found by EM often involve having too many components of the mixture in one part of the space and too few in another. It is therefore possible to avoid local maxima by introducing a merge operation, which merges components in regions that contain too many highly similar clusters, and a split operation, which splits components in regions where dissimilar tokens are combined in one cluster. This Split and Merge EM algorithm can also be applied to TC. SMEM starts from randomly initialized parameters of a TC model with K components, and the model parameters are then estimated by a standard EM procedure. According to split and merge criteria, a number of candidate triplets {i, j, k} are selected. Here {i, j} denotes the pair of components to be merged, and k is the component to be split.
The most promising candidate is then selected, and the split and merge operations on this candidate are performed simultaneously, so that the total number of components K is unchanged. After the split of component k, the parameters of the new components {k', j'} are initialized by Eqs. (8)-(10). After the merge of components {i, j}, the new component i' is initialized as a linear combination of the original ones before the merge:

\beta_{i',0} = \frac{\omega_i \beta_{i,0} + \omega_j \beta_{j,0}}{\omega_i + \omega_j} \qquad (11)

\Sigma_{i'} = \frac{\omega_i \Sigma_i + \omega_j \Sigma_j}{\omega_i + \omega_j} \qquad (12)

\omega_{i'} = \omega_i + \omega_j \qquad (13)

The newly generated model after the split and merge operations is then subjected to the EM procedure. If the likelihood is better than before the split and merge, the estimated parameters are saved and the procedure returns to candidate selection. Otherwise, the new model is rejected and another candidate is selected for splitting and merging. This procedure is performed iteratively, until no candidate produces a better result than the previous model. Note that in theory, the total number of available split and merge candidates is K(K-1)(K-2)/2. However, experiments have shown that it is only necessary to test about five promising candidates at each iteration.

The merge criterion adopted in this work is defined as follows:

J_{merge}(i, j) = h_i^T h_j \qquad (14)

where h_i = [h_{1i}, h_{2i}, \dots, h_{Mi}]^T is the vector containing the membership probabilities (cf. Eq. (4)) of all trajectories for component i. The idea underlying this merge criterion is that if many trajectories have almost equal membership probabilities for two components, it is reasonable to assume that these two components can be merged. The SMEM algorithm for TC is summarized in Table 2.

2.4 Path Mixture Multiple-HMMs Model

With the results of TC, multiple HMM paths for a speech unit can be trained, based on the training tokens in the different trajectory clusters. We refer to this model topology as the separate path model. An example model topology with two HMM paths is illustrated in Figure 1(a).

Fig. 1. Model topologies for the (a) Separate Path Model and (b) Path Mixture Model.

The prior probabilities of the separate HMM
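The merge side of SMEM (Eqs. (11)-(14)) can be sketched as follows. This is a minimal illustration; the function names are ours, and we apply the weight-proportional combination of Eq. (11) to the full coefficient matrices, which is our reading of that equation:

```python
import numpy as np

def merge_candidates(h):
    """Rank component pairs by the merge criterion J_merge(i, j) = h_i^T h_j
    (Eq. (14)), where h[:, i] holds the membership probabilities of all M
    trajectories for component i. Pairs with the most overlap come first."""
    K = h.shape[1]
    scores = [((i, j), float(h[:, i] @ h[:, j]))
              for i in range(K) for j in range(i + 1, K)]
    return sorted(scores, key=lambda s: -s[1])

def merge_components(beta_i, Sigma_i, w_i, beta_j, Sigma_j, w_j):
    """Initialize the merged component as the weight-proportional
    combination of the originals (Eqs. (11)-(13))."""
    w = w_i + w_j
    beta = (w_i * beta_i + w_j * beta_j) / w
    Sigma = (w_i * Sigma_i + w_j * Sigma_j) / w
    return beta, Sigma, w
```

A full SMEM iteration would pick the top-ranked pair {i, j}, the largest-weight remaining component k as the split candidate, apply both operations, and keep the result only if the log-likelihood after re-running EM improves.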

paths are all equal to one. By introducing two non-emitting states, the separate HMM paths are combined into a single entity with weighted HMM paths (cf. Figure 1(b)). We refer to this model topology as the path mixture model. The difference between the path mixture model and the separate path model lies not only in the additional weights on the parallel HMM paths, but also in the way they are trained. For the separate path models, the HMM paths are trained using separate sets of tokens corresponding to the trajectory clusters, whereas all tokens are used to train the path mixture models by means of the Baum-Welch algorithm. Training the path mixture model is thus equivalent to clustering the tokens again, as in the Mixture of Hidden Markov Models approach, but now with the parameters initialized from the TC-based separate path models. In our previous work [11], it was shown that path mixture models outperform separate path models; we therefore used only path mixture models in this work. In decoding, the Viterbi algorithm can be applied directly to path mixture models. It should be noted that in decoding, when a search path starts in a state of one HMM path, it will also end in that HMM path, thus alleviating the trajectory folding problem.

3 Experiments

3.1 Speech Material

The performance of the proposed TC-based models was evaluated on a connected Dutch digit recognition task. The speech material for our experiments was taken from the Dutch POLYPHONE [12], SESP [13] and CASIMIR [14] corpora. For each of the corpora, speech was recorded over the public switched telephone network in the Netherlands. Among other things, the speakers were asked to read several connected digit strings. The number of digits in a string varied from 1 to 14. For training we used a set of 9,753 strings containing 61,592 digits. All models were evaluated with an independent set of 10,000 test utterances comprising 80,016 digits.
None of the original utterances used for training or testing had a high background noise level. We computed 12 Mel-frequency log-energy coefficients using a 25 ms Hamming window shifted in 10 ms steps, with pre-emphasis. Based on a Fast Fourier Transform, 12 filter band energy values were calculated, with the filter bands triangularly shaped and uniformly distributed on a Mel-frequency scale. Mel-frequency cepstra were computed from the raw Mel-frequency log-energy coefficients using the DCT. Channel normalization was done by means of cepstrum mean subtraction over the entire utterance. Finally, we computed the first and second order time derivatives. Together with log-energy and the first and second order delta log-energy, this yielded 39-dimensional feature vectors.

3.2 Experimental Design

In our experiments we used Head-Body-Tail (HBT) models [2] as the baseline system. HBT models account for pronunciation variation in a knowledge-based
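The channel normalization and derivative steps can be sketched as follows. This is a simplified NumPy illustration of utterance-level cepstrum mean subtraction plus HTK-style regression deltas; the function name, the edge handling, and the window width of 2 are our assumptions, not stated in the paper:

```python
import numpy as np

def cms_and_deltas(cep, width=2):
    """Cepstrum mean subtraction over the whole utterance, followed by
    first- and second-order regression deltas.
    cep: (T, 13) static features (12 cepstra + log-energy) -> (T, 39)."""
    c = cep - cep.mean(axis=0, keepdims=True)    # channel normalization

    def shift(x, t):
        # shift frames by t, replicating the edge frames
        idx = np.clip(np.arange(len(x)) + t, 0, len(x) - 1)
        return x[idx]

    def delta(x):
        num = sum(t * (shift(x, t) - shift(x, -t)) for t in range(1, width + 1))
        den = 2.0 * sum(t * t for t in range(1, width + 1))
        return num / den

    d1 = delta(c)
    d2 = delta(d1)
    return np.hstack([c, d1, d2])                # static + delta + delta-delta
```

With 13 static coefficients this yields the 39-dimensional vectors used for HMM training; for the TC clustering itself only the 12 static MFCCs are kept, as described below.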

manner. Because pronunciation variation at the boundaries of a digit is much larger than in the middle, each digit is split into three parts. The middle part of a digit (the Body) is assumed to be context-independent. The first part (the Head) and the last part (the Tail) depend on the previous and the subsequent digit (or silence), respectively. Thus, each digit is modeled as one context-independent body HMM and 11 context-dependent head and tail HMMs, which can be conflated into models with 11 parallel paths. In all our experiments the head and tail HMMs consisted of three states, whereas the number of states in the body models was based on the mean duration of the digit as observed in the training corpus [14]. In addition to the digit models, one silence and one noise model, both consisting of three states, were built. All the HMM paths have the standard left-to-right no-skip topology.

In addition to the knowledge-based multiple-HMMs, we also built SSEM-TC and SMEM-TC based models. To that end, we used the baseline HBT models to segment the training data by means of forced alignment. This allowed us to cluster the Head and Tail parts of the training tokens of the ten digits. The segmented tokens of each Head or Tail part were then clustered into 11 subgroups with both the SSEM and the SMEM strategies. Because the dependence between frames is explicitly modeled in TC, we used only the 12 MFCCs as the acoustic feature vector for clustering. Based on the clustering results obtained with SSEM-TC and SMEM-TC, the two types of TC-based multiple-HMMs models with separate paths were trained, and then subjected to four passes of Baum-Welch re-estimation to train the path mixture models. The new models had the same three-state left-to-right no-skip topologies as the baseline models. In training the multiple-HMMs models, we made use of the 39-dimensional acoustic feature vectors. All the models in these experiments were trained and evaluated with HTK [15].
In order to study the improvements due to changes in acoustic modeling only, without the risk that the language model could mask the effects, we used a language model that only specifies that all digits have equal prior probability, and that each digit (or silence) can follow each other digit with equal prior probability.

3.3 Results and Discussion

We applied the proposed algorithms to cluster the Head and Tail parts of all 10 digits into 11 subgroups. Table 3 shows the summary statistics (mean, standard deviation (std), maximum, and minimum) of the log-likelihood values obtained by the standard EM, SSEM, and SMEM algorithms over 20 different simulations for the Head and Tail (H1 and T1) of the digit /een/ (one), together with the equivalent log-likelihood values obtained by clustering based on context (Context). For the SSEM method only one result can be obtained with the successive splitting procedure used in this study. Table 3 shows that the log-likelihoods for the TC model based on any of EM, SSEM and SMEM are much larger than those given by the clustering based on context, which strongly suggests that the clusters yielded by TC are

Table 3. Log-likelihoods found after the clustering (×10^5). For the units H1 and T1: mean, std, max, and min over 20 runs of EM, SSEM, and SMEM, and the single value for Context (the numeric entries, and the n.a. markers for the single-solution methods, are not recoverable from this copy).

more coherent than those produced by the linguistic context criteria. As shown in Table 3, the log-likelihoods achieved by the SSEM and SMEM algorithms have lower variance than those achieved by the EM algorithm. Even the worst solution found by the SMEM algorithm was better than the best solution found by the SSEM algorithm. These results indicate that the proposed algorithms work very well in avoiding local maxima of the likelihood surface, and that the SMEM algorithm works even better than SSEM.

Fig. 2. Results of connected Dutch digit recognition.

Fig. 2 illustrates the recognition performance of the baseline context-dependent HBT models, the SSEM-TC based models and the SMEM-TC based models. The results in this figure correspond to models with 1, 2, 4, 8, 16, and 32 Gaussians in each HMM state. The error bars represent the 95% confidence intervals of the measurements. From Fig. 2, it can be seen that the recognition accuracies for

both the SSEM-TC and SMEM-TC based models always significantly outperform the HBT models. At lower model complexity, the advantages of the TC-based models are more obvious. These recognition results again demonstrate the effectiveness of the proposed methods for defining multiple-HMMs acoustic models. Comparing the SMEM-TC and SSEM-TC based models, the recognition performance of the former is significantly better than that of the latter for models with 1, 2, 4, and 8 Gaussians per HMM state. For models with 16 and 32 Gaussians per state, the SSEM-TC models are competitive with the SMEM-TC models. For the 32-Gaussian systems, the performance of SSEM-TC is even slightly better than that of the SMEM-TC models. This is because the variance of the number of tokens in the clusters produced by SMEM-TC is larger than in the case of SSEM-TC. As a consequence, the number of training tokens in some of the clusters yielded by SMEM-TC may have been too small to accurately train the separate HMM paths. However, one should be aware that the level of pronunciation variation in a connected digit recognition task is not very high. For recognition tasks with a high level of variation, SMEM-TC might outperform SSEM-TC even at high model complexity.

4 Conclusion

In this paper, we investigated the effectiveness of the SSEM and SMEM algorithms in solving the problem of initialization dependence of standard EM in Speech Trajectory Clustering. SSEM and SMEM are reformulated versions of the standard EM algorithm that can partly avoid local maxima of the likelihood surface, by incrementally increasing the number of mixture components and by heuristically reallocating the mixture components in the data space, respectively. The clustering results showed that SMEM gave more coherent clusters than the SSEM and knowledge-based methods.
To evaluate the performance of the SSEM-TC and SMEM-TC based multiple-HMM acoustic models, a number of experiments were carried out to compare their performance with context-dependent HBT models in a connected digit recognition task. The results show that both the SSEM-TC and the SMEM-TC based models always significantly outperformed the conventional HBT models. When the model complexity is low, the recognition accuracy of the SMEM-TC based models is significantly better than that of the SSEM-TC based models. For models with high complexity, the SMEM-TC based models are competitive with the SSEM-TC based models. Given the experimental results of the TC-based models in Dutch digit recognition, we believe that this novel data-driven method for multiple-HMMs acoustic modeling can improve the performance of Chinese speech recognition as well. In our future work, we will consider applying the proposed method to Chinese speech. Furthermore, we will investigate the relation among local maxima, the amount of training data, and model complexity. Moreover, a method for automatically deriving the optimal number of parallel HMM paths with respect to some statistical criterion is also a very promising direction for improving TC-based multiple-HMM acoustic modeling.

Acknowledgements

This research is part of the Interactive Multimodal Information eXtraction (IMIX) program, which is funded by the Netherlands Organization for Scientific Research (NWO).

References

1. I. Illina and Y. Gong, "Elimination of trajectory folding phenomenon: HMM, Trajectory Mixture HMM and Mixture Stochastic Trajectory model," in Proceedings of ICASSP-97, vol. 2.
2. W. Chou, C. Lee, and B. Juang, "Minimum error rate training of inter-word context-dependent acoustic model units in speech recognition," in Proceedings of ICSLP-94.
3. J. Picone, "Duration in context clustering for speech recognition," Speech Communication, vol. 9.
4. F. Korkmazskiy, "Generalized mixture of HMMs for continuous speech recognition," in Proceedings of ICASSP-97.
5. Y. Han, J. de Veth, and L. Boves, "Speech trajectory clustering for improved speech recognition," in Proceedings of INTERSPEECH-2005, September 2005.
6. Y. Han, A. Hamalainen, and L. Boves, "Trajectory clustering of syllable-length acoustic models for continuous speech recognition," in Proceedings of ICASSP-2006, April 2006.
7. S. Gaffney and P. Smyth, "Trajectory clustering with mixtures of regression models," in Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
8. Y. Han, J. de Veth, and L. Boves, "Trajectory clustering for automatic speech recognition," in Proceedings of EUSIPCO-2005, September 2005.
9. Y. Linde, A. Buzo, and R. M. Gray, "An algorithm for vector quantizer design," IEEE Trans. Commun., vol. COM-28.
10. N. Ueda and R. Nakano, "EM algorithm with split and merge operations for mixture models," Systems and Computers in Japan, vol. 32, pp. 1-11.
11. Y. Han and L. Boves, "Syllable-length path mixture Hidden Markov Models with trajectory clustering for continuous speech recognition," in Proceedings of INTERSPEECH-2006, September 2006.
12. E. den Os, T. Boogaart, L. Boves, and E. Klabbers, "The Dutch Polyphone corpus," in Proceedings of EuroSpeech-95.
13. F. Bimbot, "An overview of the CAVE project research activities in speaker verification," Speech Communication, vol. 31.
14. J. Sturm and E. Sanders, "Modelling phonetic context using Head-Body-Tail acoustic models for connected digit recognition," in Proceedings of ICSLP-2000, vol. 1.
15. S. Young, G. Evermann, and T. Hain, The HTK Book (for HTK version 3.2.1). Cambridge University Engineering Department, 1997.


More information

Speaker Diarization System Based on GMM and BIC

Speaker Diarization System Based on GMM and BIC Speaer Diarization System Based on GMM and BIC Tantan Liu 1, Xiaoxing Liu 1, Yonghong Yan 1 1 ThinIT Speech Lab, Institute of Acoustics, Chinese Academy of Sciences Beijing 100080 {tliu, xliu,yyan}@hccl.ioa.ac.cn

More information

2-2-2, Hikaridai, Seika-cho, Soraku-gun, Kyoto , Japan 2 Graduate School of Information Science, Nara Institute of Science and Technology

2-2-2, Hikaridai, Seika-cho, Soraku-gun, Kyoto , Japan 2 Graduate School of Information Science, Nara Institute of Science and Technology ISCA Archive STREAM WEIGHT OPTIMIZATION OF SPEECH AND LIP IMAGE SEQUENCE FOR AUDIO-VISUAL SPEECH RECOGNITION Satoshi Nakamura 1 Hidetoshi Ito 2 Kiyohiro Shikano 2 1 ATR Spoken Language Translation Research

More information

Note Set 4: Finite Mixture Models and the EM Algorithm

Note Set 4: Finite Mixture Models and the EM Algorithm Note Set 4: Finite Mixture Models and the EM Algorithm Padhraic Smyth, Department of Computer Science University of California, Irvine Finite Mixture Models A finite mixture model with K components, for

More information

Observational Learning with Modular Networks

Observational Learning with Modular Networks Observational Learning with Modular Networks Hyunjung Shin, Hyoungjoo Lee and Sungzoon Cho {hjshin72, impatton, zoon}@snu.ac.kr Department of Industrial Engineering, Seoul National University, San56-1,

More information

application of learning vector quantization algorithms. In Proceedings of the International Joint Conference on

application of learning vector quantization algorithms. In Proceedings of the International Joint Conference on [5] Teuvo Kohonen. The Self-Organizing Map. In Proceedings of the IEEE, pages 1464{1480, 1990. [6] Teuvo Kohonen, Jari Kangas, Jorma Laaksonen, and Kari Torkkola. LVQPAK: A program package for the correct

More information

PARALLEL TRAINING ALGORITHMS FOR CONTINUOUS SPEECH RECOGNITION, IMPLEMENTED IN A MESSAGE PASSING FRAMEWORK

PARALLEL TRAINING ALGORITHMS FOR CONTINUOUS SPEECH RECOGNITION, IMPLEMENTED IN A MESSAGE PASSING FRAMEWORK PARALLEL TRAINING ALGORITHMS FOR CONTINUOUS SPEECH RECOGNITION, IMPLEMENTED IN A MESSAGE PASSING FRAMEWORK Vladimir Popescu 1, 2, Corneliu Burileanu 1, Monica Rafaila 1, Ramona Calimanescu 1 1 Faculty

More information

Optimization of Observation Membership Function By Particle Swarm Method for Enhancing Performances of Speaker Identification

Optimization of Observation Membership Function By Particle Swarm Method for Enhancing Performances of Speaker Identification Proceedings of the 6th WSEAS International Conference on SIGNAL PROCESSING, Dallas, Texas, USA, March 22-24, 2007 52 Optimization of Observation Membership Function By Particle Swarm Method for Enhancing

More information

Discriminative training and Feature combination

Discriminative training and Feature combination Discriminative training and Feature combination Steve Renals Automatic Speech Recognition ASR Lecture 13 16 March 2009 Steve Renals Discriminative training and Feature combination 1 Overview Hot topics

More information

Client Dependent GMM-SVM Models for Speaker Verification

Client Dependent GMM-SVM Models for Speaker Verification Client Dependent GMM-SVM Models for Speaker Verification Quan Le, Samy Bengio IDIAP, P.O. Box 592, CH-1920 Martigny, Switzerland {quan,bengio}@idiap.ch Abstract. Generative Gaussian Mixture Models (GMMs)

More information

HIDDEN Markov model (HMM)-based statistical parametric

HIDDEN Markov model (HMM)-based statistical parametric 1492 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 20, NO. 5, JULY 2012 Minimum Kullback Leibler Divergence Parameter Generation for HMM-Based Speech Synthesis Zhen-Hua Ling, Member,

More information

A GENERIC FACE REPRESENTATION APPROACH FOR LOCAL APPEARANCE BASED FACE VERIFICATION

A GENERIC FACE REPRESENTATION APPROACH FOR LOCAL APPEARANCE BASED FACE VERIFICATION A GENERIC FACE REPRESENTATION APPROACH FOR LOCAL APPEARANCE BASED FACE VERIFICATION Hazim Kemal Ekenel, Rainer Stiefelhagen Interactive Systems Labs, Universität Karlsruhe (TH) 76131 Karlsruhe, Germany

More information

Optimization of HMM by the Tabu Search Algorithm

Optimization of HMM by the Tabu Search Algorithm JOURNAL OF INFORMATION SCIENCE AND ENGINEERING 20, 949-957 (2004) Optimization of HMM by the Tabu Search Algorithm TSONG-YI CHEN, XIAO-DAN MEI *, JENG-SHYANG PAN AND SHENG-HE SUN * Department of Electronic

More information

Constraints in Particle Swarm Optimization of Hidden Markov Models

Constraints in Particle Swarm Optimization of Hidden Markov Models Constraints in Particle Swarm Optimization of Hidden Markov Models Martin Macaš, Daniel Novák, and Lenka Lhotská Czech Technical University, Faculty of Electrical Engineering, Dep. of Cybernetics, Prague,

More information

Machine Learning A W 1sst KU. b) [1 P] Give an example for a probability distributions P (A, B, C) that disproves

Machine Learning A W 1sst KU. b) [1 P] Give an example for a probability distributions P (A, B, C) that disproves Machine Learning A 708.064 11W 1sst KU Exercises Problems marked with * are optional. 1 Conditional Independence I [2 P] a) [1 P] Give an example for a probability distribution P (A, B, C) that disproves

More information

Blur Space Iterative De-blurring

Blur Space Iterative De-blurring Blur Space Iterative De-blurring RADU CIPRIAN BILCU 1, MEJDI TRIMECHE 2, SAKARI ALENIUS 3, MARKKU VEHVILAINEN 4 1,2,3,4 Multimedia Technologies Laboratory, Nokia Research Center Visiokatu 1, FIN-33720,

More information

Mixture Models and the EM Algorithm

Mixture Models and the EM Algorithm Mixture Models and the EM Algorithm Padhraic Smyth, Department of Computer Science University of California, Irvine c 2017 1 Finite Mixture Models Say we have a data set D = {x 1,..., x N } where x i is

More information

Learning The Lexicon!

Learning The Lexicon! Learning The Lexicon! A Pronunciation Mixture Model! Ian McGraw! (imcgraw@mit.edu)! Ibrahim Badr Jim Glass! Computer Science and Artificial Intelligence Lab! Massachusetts Institute of Technology! Cambridge,

More information

Machine Learning and Data Mining. Clustering (1): Basics. Kalev Kask

Machine Learning and Data Mining. Clustering (1): Basics. Kalev Kask Machine Learning and Data Mining Clustering (1): Basics Kalev Kask Unsupervised learning Supervised learning Predict target value ( y ) given features ( x ) Unsupervised learning Understand patterns of

More information

Machine Learning. B. Unsupervised Learning B.1 Cluster Analysis. Lars Schmidt-Thieme, Nicolas Schilling

Machine Learning. B. Unsupervised Learning B.1 Cluster Analysis. Lars Schmidt-Thieme, Nicolas Schilling Machine Learning B. Unsupervised Learning B.1 Cluster Analysis Lars Schmidt-Thieme, Nicolas Schilling Information Systems and Machine Learning Lab (ISMLL) Institute for Computer Science University of Hildesheim,

More information

ModelStructureSelection&TrainingAlgorithmsfor an HMMGesture Recognition System

ModelStructureSelection&TrainingAlgorithmsfor an HMMGesture Recognition System ModelStructureSelection&TrainingAlgorithmsfor an HMMGesture Recognition System Nianjun Liu, Brian C. Lovell, Peter J. Kootsookos, and Richard I.A. Davis Intelligent Real-Time Imaging and Sensing (IRIS)

More information

CS839: Probabilistic Graphical Models. Lecture 10: Learning with Partially Observed Data. Theo Rekatsinas

CS839: Probabilistic Graphical Models. Lecture 10: Learning with Partially Observed Data. Theo Rekatsinas CS839: Probabilistic Graphical Models Lecture 10: Learning with Partially Observed Data Theo Rekatsinas 1 Partially Observed GMs Speech recognition 2 Partially Observed GMs Evolution 3 Partially Observed

More information

Combining Audio and Video for Detection of Spontaneous Emotions

Combining Audio and Video for Detection of Spontaneous Emotions Combining Audio and Video for Detection of Spontaneous Emotions Rok Gajšek, Vitomir Štruc, Simon Dobrišek, Janez Žibert, France Mihelič, and Nikola Pavešić Faculty of Electrical Engineering, University

More information

Probabilistic Facial Feature Extraction Using Joint Distribution of Location and Texture Information

Probabilistic Facial Feature Extraction Using Joint Distribution of Location and Texture Information Probabilistic Facial Feature Extraction Using Joint Distribution of Location and Texture Information Mustafa Berkay Yilmaz, Hakan Erdogan, Mustafa Unel Sabanci University, Faculty of Engineering and Natural

More information

Linear Discriminant Analysis in Ottoman Alphabet Character Recognition

Linear Discriminant Analysis in Ottoman Alphabet Character Recognition Linear Discriminant Analysis in Ottoman Alphabet Character Recognition ZEYNEB KURT, H. IREM TURKMEN, M. ELIF KARSLIGIL Department of Computer Engineering, Yildiz Technical University, 34349 Besiktas /

More information

Toward Part-based Document Image Decoding

Toward Part-based Document Image Decoding 2012 10th IAPR International Workshop on Document Analysis Systems Toward Part-based Document Image Decoding Wang Song, Seiichi Uchida Kyushu University, Fukuoka, Japan wangsong@human.ait.kyushu-u.ac.jp,

More information

Multi-Modal Human Verification Using Face and Speech

Multi-Modal Human Verification Using Face and Speech 22 Multi-Modal Human Verification Using Face and Speech Changhan Park 1 and Joonki Paik 2 1 Advanced Technology R&D Center, Samsung Thales Co., Ltd., 2 Graduate School of Advanced Imaging Science, Multimedia,

More information

Audio-visual interaction in sparse representation features for noise robust audio-visual speech recognition

Audio-visual interaction in sparse representation features for noise robust audio-visual speech recognition ISCA Archive http://www.isca-speech.org/archive Auditory-Visual Speech Processing (AVSP) 2013 Annecy, France August 29 - September 1, 2013 Audio-visual interaction in sparse representation features for

More information

Query-by-example spoken term detection based on phonetic posteriorgram Query-by-example spoken term detection based on phonetic posteriorgram

Query-by-example spoken term detection based on phonetic posteriorgram Query-by-example spoken term detection based on phonetic posteriorgram International Conference on Education, Management and Computing Technology (ICEMCT 2015) Query-by-example spoken term detection based on phonetic posteriorgram Query-by-example spoken term detection based

More information

Gender-dependent acoustic models fusion developed for automatic subtitling of Parliament meetings broadcasted by the Czech TV

Gender-dependent acoustic models fusion developed for automatic subtitling of Parliament meetings broadcasted by the Czech TV Gender-dependent acoustic models fusion developed for automatic subtitling of Parliament meetings broadcasted by the Czech TV Jan Vaněk and Josef V. Psutka Department of Cybernetics, West Bohemia University,

More information

Speech User Interface for Information Retrieval

Speech User Interface for Information Retrieval Speech User Interface for Information Retrieval Urmila Shrawankar Dept. of Information Technology Govt. Polytechnic Institute, Nagpur Sadar, Nagpur 440001 (INDIA) urmilas@rediffmail.com Cell : +919422803996

More information

Radial Basis Function Neural Network Classifier

Radial Basis Function Neural Network Classifier Recognition of Unconstrained Handwritten Numerals by a Radial Basis Function Neural Network Classifier Hwang, Young-Sup and Bang, Sung-Yang Department of Computer Science & Engineering Pohang University

More information

A Novel Template Matching Approach To Speaker-Independent Arabic Spoken Digit Recognition

A Novel Template Matching Approach To Speaker-Independent Arabic Spoken Digit Recognition Special Session: Intelligent Knowledge Management A Novel Template Matching Approach To Speaker-Independent Arabic Spoken Digit Recognition Jiping Sun 1, Jeremy Sun 1, Kacem Abida 2, and Fakhri Karray

More information

Dynamic Clustering of Data with Modified K-Means Algorithm

Dynamic Clustering of Data with Modified K-Means Algorithm 2012 International Conference on Information and Computer Networks (ICICN 2012) IPCSIT vol. 27 (2012) (2012) IACSIT Press, Singapore Dynamic Clustering of Data with Modified K-Means Algorithm Ahamed Shafeeq

More information

CS 543: Final Project Report Texture Classification using 2-D Noncausal HMMs

CS 543: Final Project Report Texture Classification using 2-D Noncausal HMMs CS 543: Final Project Report Texture Classification using 2-D Noncausal HMMs Felix Wang fywang2 John Wieting wieting2 Introduction We implement a texture classification algorithm using 2-D Noncausal Hidden

More information

IMPROVED SIDE MATCHING FOR MATCHED-TEXTURE CODING

IMPROVED SIDE MATCHING FOR MATCHED-TEXTURE CODING IMPROVED SIDE MATCHING FOR MATCHED-TEXTURE CODING Guoxin Jin 1, Thrasyvoulos N. Pappas 1 and David L. Neuhoff 2 1 EECS Department, Northwestern University, Evanston, IL 60208 2 EECS Department, University

More information

Introduction to Mobile Robotics

Introduction to Mobile Robotics Introduction to Mobile Robotics Clustering Wolfram Burgard Cyrill Stachniss Giorgio Grisetti Maren Bennewitz Christian Plagemann Clustering (1) Common technique for statistical data analysis (machine learning,

More information

WHO WANTS TO BE A MILLIONAIRE?

WHO WANTS TO BE A MILLIONAIRE? IDIAP COMMUNICATION REPORT WHO WANTS TO BE A MILLIONAIRE? Huseyn Gasimov a Aleksei Triastcyn Hervé Bourlard Idiap-Com-03-2012 JULY 2012 a EPFL Centre du Parc, Rue Marconi 19, PO Box 592, CH - 1920 Martigny

More information

APPLICATION OF MULTIPLE RANDOM CENTROID (MRC) BASED K-MEANS CLUSTERING ALGORITHM IN INSURANCE A REVIEW ARTICLE

APPLICATION OF MULTIPLE RANDOM CENTROID (MRC) BASED K-MEANS CLUSTERING ALGORITHM IN INSURANCE A REVIEW ARTICLE APPLICATION OF MULTIPLE RANDOM CENTROID (MRC) BASED K-MEANS CLUSTERING ALGORITHM IN INSURANCE A REVIEW ARTICLE Sundari NallamReddy, Samarandra Behera, Sanjeev Karadagi, Dr. Anantha Desik ABSTRACT: Tata

More information

Hidden Markov Models. Slides adapted from Joyce Ho, David Sontag, Geoffrey Hinton, Eric Xing, and Nicholas Ruozzi

Hidden Markov Models. Slides adapted from Joyce Ho, David Sontag, Geoffrey Hinton, Eric Xing, and Nicholas Ruozzi Hidden Markov Models Slides adapted from Joyce Ho, David Sontag, Geoffrey Hinton, Eric Xing, and Nicholas Ruozzi Sequential Data Time-series: Stock market, weather, speech, video Ordered: Text, genes Sequential

More information

AN ITERATIVE APPROACH TO DECISION TREE TRAINING FOR CONTEXT DEPENDENT SPEECH SYNTHESIS. Xiayu Chen, Yang Zhang, Mark Hasegawa-Johnson

AN ITERATIVE APPROACH TO DECISION TREE TRAINING FOR CONTEXT DEPENDENT SPEECH SYNTHESIS. Xiayu Chen, Yang Zhang, Mark Hasegawa-Johnson AN ITERATIVE APPROACH TO DECISION TREE TRAINING FOR CONTEXT DEPENDENT SPEECH SYNTHESIS Xiayu Chen, Yang Zhang, Mark Hasegawa-Johnson Department of Electrical and Computer Engineering, University of Illinois

More information

Clustering & Dimensionality Reduction. 273A Intro Machine Learning

Clustering & Dimensionality Reduction. 273A Intro Machine Learning Clustering & Dimensionality Reduction 273A Intro Machine Learning What is Unsupervised Learning? In supervised learning we were given attributes & targets (e.g. class labels). In unsupervised learning

More information

Factorization with Missing and Noisy Data

Factorization with Missing and Noisy Data Factorization with Missing and Noisy Data Carme Julià, Angel Sappa, Felipe Lumbreras, Joan Serrat, and Antonio López Computer Vision Center and Computer Science Department, Universitat Autònoma de Barcelona,

More information

A Multiple-Line Fitting Algorithm Without Initialization Yan Guo

A Multiple-Line Fitting Algorithm Without Initialization Yan Guo A Multiple-Line Fitting Algorithm Without Initialization Yan Guo Abstract: The commonest way to fit multiple lines is to use methods incorporate the EM algorithm. However, the EM algorithm dose not guarantee

More information

A study of large vocabulary speech recognition decoding using finite-state graphs 1

A study of large vocabulary speech recognition decoding using finite-state graphs 1 A study of large vocabulary speech recognition decoding using finite-state graphs 1 Zhijian OU, Ji XIAO Department of Electronic Engineering, Tsinghua University, Beijing Corresponding email: ozj@tsinghua.edu.cn

More information

A Distance-Based Classifier Using Dissimilarity Based on Class Conditional Probability and Within-Class Variation. Kwanyong Lee 1 and Hyeyoung Park 2

A Distance-Based Classifier Using Dissimilarity Based on Class Conditional Probability and Within-Class Variation. Kwanyong Lee 1 and Hyeyoung Park 2 A Distance-Based Classifier Using Dissimilarity Based on Class Conditional Probability and Within-Class Variation Kwanyong Lee 1 and Hyeyoung Park 2 1. Department of Computer Science, Korea National Open

More information

DATA MINING LECTURE 7. Hierarchical Clustering, DBSCAN The EM Algorithm

DATA MINING LECTURE 7. Hierarchical Clustering, DBSCAN The EM Algorithm DATA MINING LECTURE 7 Hierarchical Clustering, DBSCAN The EM Algorithm CLUSTERING What is a Clustering? In general a grouping of objects such that the objects in a group (cluster) are similar (or related)

More information

Confidence Measures: how much we can trust our speech recognizers

Confidence Measures: how much we can trust our speech recognizers Confidence Measures: how much we can trust our speech recognizers Prof. Hui Jiang Department of Computer Science York University, Toronto, Ontario, Canada Email: hj@cs.yorku.ca Outline Speech recognition

More information

Robust color segmentation algorithms in illumination variation conditions

Robust color segmentation algorithms in illumination variation conditions 286 CHINESE OPTICS LETTERS / Vol. 8, No. / March 10, 2010 Robust color segmentation algorithms in illumination variation conditions Jinhui Lan ( ) and Kai Shen ( Department of Measurement and Control Technologies,

More information

A Gaussian Mixture Model Spectral Representation for Speech Recognition

A Gaussian Mixture Model Spectral Representation for Speech Recognition A Gaussian Mixture Model Spectral Representation for Speech Recognition Matthew Nicholas Stuttle Hughes Hall and Cambridge University Engineering Department PSfrag replacements July 2003 Dissertation submitted

More information

A Miniature-Based Image Retrieval System

A Miniature-Based Image Retrieval System A Miniature-Based Image Retrieval System Md. Saiful Islam 1 and Md. Haider Ali 2 Institute of Information Technology 1, Dept. of Computer Science and Engineering 2, University of Dhaka 1, 2, Dhaka-1000,

More information

Bandwidth Selection for Kernel Density Estimation Using Total Variation with Fourier Domain Constraints

Bandwidth Selection for Kernel Density Estimation Using Total Variation with Fourier Domain Constraints IEEE SIGNAL PROCESSING LETTERS 1 Bandwidth Selection for Kernel Density Estimation Using Total Variation with Fourier Domain Constraints Alexander Suhre, Orhan Arikan, Member, IEEE, and A. Enis Cetin,

More information

Image denoising in the wavelet domain using Improved Neigh-shrink

Image denoising in the wavelet domain using Improved Neigh-shrink Image denoising in the wavelet domain using Improved Neigh-shrink Rahim Kamran 1, Mehdi Nasri, Hossein Nezamabadi-pour 3, Saeid Saryazdi 4 1 Rahimkamran008@gmail.com nasri_me@yahoo.com 3 nezam@uk.ac.ir

More information

NOVEL HYBRID GENETIC ALGORITHM WITH HMM BASED IRIS RECOGNITION

NOVEL HYBRID GENETIC ALGORITHM WITH HMM BASED IRIS RECOGNITION NOVEL HYBRID GENETIC ALGORITHM WITH HMM BASED IRIS RECOGNITION * Prof. Dr. Ban Ahmed Mitras ** Ammar Saad Abdul-Jabbar * Dept. of Operation Research & Intelligent Techniques ** Dept. of Mathematics. College

More information

A ROBUST SPEAKER CLUSTERING ALGORITHM

A ROBUST SPEAKER CLUSTERING ALGORITHM A ROBUST SPEAKER CLUSTERING ALGORITHM J. Ajmera IDIAP P.O. Box 592 CH-1920 Martigny, Switzerland jitendra@idiap.ch C. Wooters ICSI 1947 Center St., Suite 600 Berkeley, CA 94704, USA wooters@icsi.berkeley.edu

More information

Clustering CS 550: Machine Learning

Clustering CS 550: Machine Learning Clustering CS 550: Machine Learning This slide set mainly uses the slides given in the following links: http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf http://www-users.cs.umn.edu/~kumar/dmbook/dmslides/chap8_basic_cluster_analysis.pdf

More information

Joint Optimisation of Tandem Systems using Gaussian Mixture Density Neural Network Discriminative Sequence Training

Joint Optimisation of Tandem Systems using Gaussian Mixture Density Neural Network Discriminative Sequence Training Joint Optimisation of Tandem Systems using Gaussian Mixture Density Neural Network Discriminative Sequence Training Chao Zhang and Phil Woodland March 8, 07 Cambridge University Engineering Department

More information

Using Gradient Descent Optimization for Acoustics Training from Heterogeneous Data

Using Gradient Descent Optimization for Acoustics Training from Heterogeneous Data Using Gradient Descent Optimization for Acoustics Training from Heterogeneous Data Martin Karafiát Λ, Igor Szöke, and Jan Černocký Brno University of Technology, Faculty of Information Technology Department

More information

SMEM Algorithm for Mixture Models

SMEM Algorithm for Mixture Models SMEM Algorithm for Mixture Models N aonori U eda Ryohei Nakano {ueda, nakano }@cslab.kecl.ntt.co.jp NTT Communication Science Laboratories Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-0237 Japan Zoubin

More information

Random projection for non-gaussian mixture models

Random projection for non-gaussian mixture models Random projection for non-gaussian mixture models Győző Gidófalvi Department of Computer Science and Engineering University of California, San Diego La Jolla, CA 92037 gyozo@cs.ucsd.edu Abstract Recently,

More information

Some questions of consensus building using co-association

Some questions of consensus building using co-association Some questions of consensus building using co-association VITALIY TAYANOV Polish-Japanese High School of Computer Technics Aleja Legionow, 4190, Bytom POLAND vtayanov@yahoo.com Abstract: In this paper

More information

Cluster quality 15. Running time 0.7. Distance between estimated and true means Running time [s]

Cluster quality 15. Running time 0.7. Distance between estimated and true means Running time [s] Fast, single-pass K-means algorithms Fredrik Farnstrom Computer Science and Engineering Lund Institute of Technology, Sweden arnstrom@ucsd.edu James Lewis Computer Science and Engineering University of

More information

Evaluation of Model-Based Condition Monitoring Systems in Industrial Application Cases

Evaluation of Model-Based Condition Monitoring Systems in Industrial Application Cases Evaluation of Model-Based Condition Monitoring Systems in Industrial Application Cases S. Windmann 1, J. Eickmeyer 1, F. Jungbluth 1, J. Badinger 2, and O. Niggemann 1,2 1 Fraunhofer Application Center

More information

An Efficient Model Selection for Gaussian Mixture Model in a Bayesian Framework

An Efficient Model Selection for Gaussian Mixture Model in a Bayesian Framework IEEE SIGNAL PROCESSING LETTERS, VOL. XX, NO. XX, XXX 23 An Efficient Model Selection for Gaussian Mixture Model in a Bayesian Framework Ji Won Yoon arxiv:37.99v [cs.lg] 3 Jul 23 Abstract In order to cluster

More information

GMM-FREE DNN TRAINING. Andrew Senior, Georg Heigold, Michiel Bacchiani, Hank Liao

GMM-FREE DNN TRAINING. Andrew Senior, Georg Heigold, Michiel Bacchiani, Hank Liao GMM-FREE DNN TRAINING Andrew Senior, Georg Heigold, Michiel Bacchiani, Hank Liao Google Inc., New York {andrewsenior,heigold,michiel,hankliao}@google.com ABSTRACT While deep neural networks (DNNs) have

More information

COMP5318 Knowledge Management & Data Mining Assignment 1

COMP5318 Knowledge Management & Data Mining Assignment 1 COMP538 Knowledge Management & Data Mining Assignment Enoch Lau SID 20045765 7 May 2007 Abstract 5.5 Scalability............... 5 Clustering is a fundamental task in data mining that aims to place similar

More information

Efficient Non-domination Level Update Approach for Steady-State Evolutionary Multiobjective Optimization

Efficient Non-domination Level Update Approach for Steady-State Evolutionary Multiobjective Optimization Efficient Non-domination Level Update Approach for Steady-State Evolutionary Multiobjective Optimization Ke Li 1, Kalyanmoy Deb 1, Qingfu Zhang 2, and Sam Kwong 2 1 Department of Electrical and Computer

More information

HARD, SOFT AND FUZZY C-MEANS CLUSTERING TECHNIQUES FOR TEXT CLASSIFICATION

HARD, SOFT AND FUZZY C-MEANS CLUSTERING TECHNIQUES FOR TEXT CLASSIFICATION HARD, SOFT AND FUZZY C-MEANS CLUSTERING TECHNIQUES FOR TEXT CLASSIFICATION 1 M.S.Rekha, 2 S.G.Nawaz 1 PG SCALOR, CSE, SRI KRISHNADEVARAYA ENGINEERING COLLEGE, GOOTY 2 ASSOCIATE PROFESSOR, SRI KRISHNADEVARAYA

More information

Opinion Mining by Transformation-Based Domain Adaptation

Opinion Mining by Transformation-Based Domain Adaptation Opinion Mining by Transformation-Based Domain Adaptation Róbert Ormándi, István Hegedűs, and Richárd Farkas University of Szeged, Hungary {ormandi,ihegedus,rfarkas}@inf.u-szeged.hu Abstract. Here we propose

More information

Simultaneous Design of Feature Extractor and Pattern Classifer Using the Minimum Classification Error Training Algorithm

Simultaneous Design of Feature Extractor and Pattern Classifer Using the Minimum Classification Error Training Algorithm Griffith Research Online https://research-repository.griffith.edu.au Simultaneous Design of Feature Extractor and Pattern Classifer Using the Minimum Classification Error Training Algorithm Author Paliwal,

More information

Data Mining Chapter 9: Descriptive Modeling Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University

Data Mining Chapter 9: Descriptive Modeling Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Data Mining Chapter 9: Descriptive Modeling Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Descriptive model A descriptive model presents the main features of the data

More information

OPTIMIZING A VIDEO PREPROCESSOR FOR OCR. MR IBM Systems Dev Rochester, elopment Division Minnesota

OPTIMIZING A VIDEO PREPROCESSOR FOR OCR. MR IBM Systems Dev Rochester, elopment Division Minnesota OPTIMIZING A VIDEO PREPROCESSOR FOR OCR MR IBM Systems Dev Rochester, elopment Division Minnesota Summary This paper describes how optimal video preprocessor performance can be achieved using a software

More information

SEMI-BLIND IMAGE RESTORATION USING A LOCAL NEURAL APPROACH

SEMI-BLIND IMAGE RESTORATION USING A LOCAL NEURAL APPROACH SEMI-BLIND IMAGE RESTORATION USING A LOCAL NEURAL APPROACH Ignazio Gallo, Elisabetta Binaghi and Mario Raspanti Universitá degli Studi dell Insubria Varese, Italy email: ignazio.gallo@uninsubria.it ABSTRACT

More information

Part I. Hierarchical clustering. Hierarchical Clustering. Hierarchical clustering. Produces a set of nested clusters organized as a

Part I. Hierarchical clustering. Hierarchical Clustering. Hierarchical clustering. Produces a set of nested clusters organized as a Week 9 Based in part on slides from textbook, slides of Susan Holmes Part I December 2, 2012 Hierarchical Clustering 1 / 1 Produces a set of nested clusters organized as a Hierarchical hierarchical clustering

More information

Hierarchical Mixture Models for Nested Data Structures
