A Discriminative Training Algorithm for Hidden Markov Models
Assaf Ben-Yishai and David Burshtein, Senior Member, IEEE

Abstract: We introduce a discriminative training algorithm for the estimation of hidden Markov model (HMM) parameters. This algorithm is based on an approximation of the maximum mutual information (MMI) objective function and its maximization by a technique similar to the expectation-maximization (EM) algorithm. The algorithm is implemented by a simple modification of the standard Baum-Welch algorithm, and can be applied to speech recognition as well as to word-spotting systems. Three tasks were tested: isolated digit recognition in a noisy environment, connected digit recognition in a noisy environment, and word-spotting. In all tasks a significant improvement over maximum likelihood (ML) estimation was observed. We also compared the new algorithm to the commonly used extended Baum-Welch MMI algorithm. In our tests the algorithm showed advantages in terms of both performance and computational complexity.

Index Terms: Discriminative training, hidden Markov model (HMM), maximum mutual information (MMI) criterion.

(Manuscript received August 9, 2001; revised October 7. This research was supported by the KITE consortium of the Israeli Ministry of Industry and Trade. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Andreas Stolcke. The authors are with the Department of Electrical Engineering Systems, Tel-Aviv University, Tel-Aviv 69978, Israel (assaf@eng.tau.ac.il; burstyn@eng.tau.ac.il).)

I. INTRODUCTION

The most popular training method for hidden Markov model (HMM)-based speech recognition systems is maximum likelihood (ML) estimation. The objective of ML estimation is to find the parameter set that maximizes the likelihood of the training utterances given their corresponding transcription. ML estimation stems from the assumption that the speech signal is distributed according to the model, and is well justified in the theory of parameter estimation. Another advantage of ML estimation of HMMs is its simplicity of implementation using the Baum-Welch algorithm. Discriminative training methods such as maximum mutual information (MMI) [1], [13], corrective training [2], and minimum classification error (MCE) [7] attempt to minimize the error rate more effectively by utilizing both the correct category and the competing categories, and incorporating that information into the training phase. Note that the MMI and MCE criteria were shown to be closely related [9]. It was shown in [11] that if the true distribution of the samples to be classified can be accurately described by the assumed statistical model, and the size of the training set tends to infinity, then ML estimation outperforms MMI estimation in the sense that it yields a smaller variance of the parameter estimates. Unfortunately, the true distribution of the speech signal cannot be modeled by an HMM, and in realistic speech recognition tasks the training data is always sparse. Consequently, the minimization of the recognition error rate is a more suitable objective than the minimization of the error of the parameter estimates. MMI estimation aims to maximize the posterior probability of the words in the training set given their corresponding utterances. Unlike in the ML case, there is no simple optimization method for this problem. First experiments in MMI were reported by Bahl et al.
[1], who used the gradient descent algorithm for the optimization of the objective function. Gradient descent algorithms are sensitive to the size of the update step: a large update step can cause unstable behavior, whereas a small update step might result in a prohibitively slow convergence rate. Gopalakrishnan et al. [5] proposed a method for maximizing the MMI objective function which is based on a generalization of the Baum-Eagon inequality. This method was proposed for discrete HMMs. Normandin [13] proposed a useful approximate generalization of this method to HMMs with Gaussian output densities, known as the extended Baum-Welch (EBW) algorithm. Further use of this algorithm was reported in [8], [17], and [18]. A different approach to the optimization of the MMI criterion was reported in [19]. The EBW algorithm is a popular and elegant algorithm that was found useful in various tasks. However, it suffers from the following shortcomings that motivate the development of other algorithms such as the one proposed in this paper:
1) The exact relation between the MMI objective function and the recognition error rate is unknown. This motivates the search for other objective functions.
2) The EBW optimization algorithm is not a simple extension of the standard Baum-Welch algorithm. It requires many iterations and is thus computationally expensive.
3) The EBW algorithm is not easy to generalize to other tasks such as word spotting.
The proposed algorithm addresses the above shortcomings: it is easily implemented by a simple modification of the standard Baum-Welch algorithm; it converges after one or two iterations and is computationally efficient; and it is easily generalized to other tasks such as word spotting. The algorithm we propose is based on an approximation of the MMI objective function, and on its maximization by a technique similar to the expectation-maximization (EM) algorithm [4]. Like the EM algorithm, the algorithm proposed in this paper has the desirable property that in practice it monotonically increases the objective function, and is therefore stable. Although the focus of this paper is on the application of the new algorithm to speech recognition, it can nevertheless be applied to a general statistical pattern recognition problem.
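To make the contrast between the two criteria concrete, the following minimal sketch evaluates both objectives on fabricated per-utterance log-likelihoods for a toy three-word vocabulary with uniform priors. The array values, the vocabulary size, and the priors are illustrative assumptions, not data from the paper.

```python
import numpy as np

# Per-utterance log-likelihoods log p(O_n | w_j): row n is utterance n,
# column j is word j (fabricated numbers for illustration only).
loglik = np.array([[-10.0, -12.0, -15.0],
                   [-11.0, -10.5, -14.0],
                   [ -9.0, -13.0,  -9.5]])
labels = np.array([0, 0, 2])        # index of the correct word of each utterance
priors = np.full(3, 1.0 / 3.0)

# ML objective: only the likelihood of the correct word matters.
f_ml = loglik[np.arange(len(labels)), labels].sum()

# MMI objective: log posterior of the correct word, so the competing words
# (through the denominator) also influence the criterion.
joint = loglik + np.log(priors)
log_denom = np.logaddexp.reduce(joint, axis=1)
f_mmi = (joint[np.arange(len(labels)), labels] - log_denom).sum()

print(f"F_ML = {f_ml:.2f}   F_MMI = {f_mmi:.2f}")
```

Improving the ML objective can be done word by word, whereas the MMI objective couples all models through the denominator; this is the property the algorithm below exploits by approximating that denominator.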

In Section II we give general background on HMM-based speech recognition and explain the standard ML estimation procedure. In Section III we give a general formulation of the approximated MMI algorithm. In Section IV we explain the application of the algorithm to HMMs. In Section V we assess the performance of the algorithm on several tasks (isolated and connected digit recognition in a noisy environment and word-spotting) and make a comparison to the EBW algorithm and the H-criterion, in which our algorithm is found superior in terms of both performance and computational complexity. Finally, in Section VI we conclude the study, summarize the results, and provide some points for further research.

II. BACKGROUND

A. HMM-Based Speech Recognition

In order to simplify the description of the algorithm we assume an isolated word recognition task. Nevertheless, the algorithm can easily be generalized to the recognition of continuous speech; the extensions are given in Section V-C. Our vocabulary comprises the words that form a given set, and each word in the vocabulary has a prior probability of occurrence. The speech signal is divided into frames, and from each frame a feature vector is extracted. We consider the feature vector of each frame and the entire sequence of feature vectors that comprise the utterance. We assume that each word is characterized by a conditional probability density function, and we perform recognition using the MAP criterion, namely, by selecting the word with the largest posterior probability.

B. ML Training

The training of the models is performed according to a given training set, which consists of the utterances and their corresponding transcriptions. ML training is basically the maximization of the ML objective function defined in (1). We assume that parameters are not tied across words. Hence, by (1), it is clear that training can be performed on each word separately. That is, the parameters of each word are estimated according to its correspondingly labeled utterances, whose indices form a set associated with that word. Observing this property, it is clear that ML estimation cannot take into account confusions between words and recognition errors, and in that sense it differs from discriminative training methods. The optimization of the ML objective function is implemented iteratively using the Baum-Welch algorithm [3], which was shown to be a special case of the EM algorithm [4]. The re-estimation formulas are given in (2). We choose to model the words by a Gaussian mixture HMM; the probability density function of the utterance is given in (3) and (4), where the transition probabilities are defined over state sequences whose initial and final states are constrained to be non-emitting, and the summation is over all possible state sequences. The output distributions are given in (5), with the weight of each mixture in each state as in (6), and each mixture component is a Gaussian distribution with a mean vector and a diagonal covariance matrix. Thus, the parameter set of each word comprises the following elements: the transition probabilities between states, the weight of each mixture of each state, the mean vector of each mixture of each state, and the diagonal covariance matrix of each mixture of each state. We denote the entire parameter set of all the words in the vocabulary by a single parameter vector. The occupancy terms appearing in the re-estimation formulas, accumulated over the frames of each utterance, can be calculated efficiently using the well-known forward-backward algorithm, as explained in [14]. For the general case of an HMM parameter, the re-estimation formula takes the form shown in (7) and (8), where the numerator and denominator terms are usually referred to as accumulators and are calculated using the utterances labeled with the corresponding word.
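A minimal sketch of this accumulator-based ML re-estimation for a single state with Gaussian mixture output densities is given below. It assumes the state/mixture posteriors have already been produced by the forward-backward algorithm; here they are simply fabricated so the example runs, and the function and variable names are ours, not the paper's.

```python
import numpy as np

def ml_reestimate(utterances, gammas):
    """Accumulator-based ML re-estimation for one HMM state with M Gaussian
    mixture components and diagonal covariances.

    utterances : list of (T, d) arrays of feature vectors for this word
    gammas     : list of (T, M) arrays; gammas[r][t, m] is the posterior of
                 occupying mixture m of this state at frame t, as produced
                 by the forward-backward algorithm (fabricated in the demo)
    """
    d = utterances[0].shape[1]
    M = gammas[0].shape[1]
    occ = np.zeros(M)              # denominator accumulators: sum of posteriors
    first = np.zeros((M, d))       # numerator accumulators:   sum gamma * o
    second = np.zeros((M, d))      #                           sum gamma * o**2
    for o, g in zip(utterances, gammas):
        occ += g.sum(axis=0)
        first += g.T @ o
        second += g.T @ (o ** 2)
    weights = occ / occ.sum()
    means = first / occ[:, None]
    variances = second / occ[:, None] - means ** 2
    return weights, means, variances

# Demo with fabricated data and posteriors (two mixtures, 3-D features).
rng = np.random.default_rng(1)
utts = [rng.normal(size=(40, 3)) for _ in range(5)]
gams = [rng.dirichlet(np.ones(2), size=40) for _ in range(5)]
print(ml_reestimate(utts, gams))
```

The discriminative procedure introduced next reuses exactly this accumulator machinery, which is what allows it to be implemented as a small modification of the Baum-Welch pass.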

III. THE APPROXIMATED MMI ALGORITHM

A. The Approximated MMI Criterion

The MMI objective function is given by (9). We apply the approximation (10) to the right-hand sum of (9) and obtain (11). Note that this approximation is a special case of the N-best approach with N = 1. However, the choice of the single retained hypothesis is not arbitrary and is crucial for the implementation of the optimization. Now, recall the definition of the set of training utterances labeled with each word, and let the corresponding set of utterances recognized as that word be defined as in (12). If we apply the MAP criterion for recognition, we can say that this set contains the indices of the training utterances that were recognized as the given word. Using these definitions we can rewrite (11) as in (13). Motivated by (13), we introduce the objective function (14), which we call the approximated MMI criterion; it contains a prescribed parameter that controls the discrimination rate. Note that at one extreme value of this parameter we obtain the ML objective function, and at the other extreme we obtain the MMI objective function under the approximation in (10). Note also that in the proposed criterion this parameter plays a role analogous to the weighting parameter of the H-criterion [5]. Observing (14), it is clear that the prior probabilities of the words do not affect the maximization. Unless parameters are tied across the models, maximizing (14) is equivalent to maximizing the per-word objective functions (15), for all words in the vocabulary. As in the ML case, the parameter set of each word can therefore be estimated separately. Note that this is different from the MMI case (9), in which the parameters of the entire vocabulary should be optimized jointly. It is now possible to formulate the two steps of the algorithm:
1. Perform recognition on the training set and obtain the recognized sets and the objective functions.
2. Maximize each objective function with respect to the parameters of the corresponding word, and obtain new estimates of the parameters.
The ML estimates of the parameters are taken as the initial condition. The two steps can be iterated repeatedly. In our experiments, however, best results were obtained after the first iteration.

B. Example

As explained in the introduction, if the observations are distributed according to the assumed statistical model, the optimal training technique is ML estimation. The theoretical justification for using MMI estimation stems from its superiority in examples where the assumed statistical model is incorrect [12]. In this section, we give a simple example in which the assumed statistical model is incorrect, and show that the approximated MMI criterion leads to a better decision rule than the ML criterion, in the sense that it yields a smaller probability of error. Consider a classification problem with two classes that have equal prior probabilities. Both class-conditional densities are Gaussian, with different means and different variances. The MAP classification rule reads as in (16). This rule is optimal in the sense that it reaches the minimal classification error probability. The decision regions can be obtained by an explicit solution of (16); when the two variances differ, the decision rule becomes (17), where the two thresholds are the solutions of (16) with equality imposed. Now suppose that the class densities are not known in advance, and that they are assumed to be Gaussian. The parameters are estimated from a training set consisting of i.i.d. samples from each of the two classes. Let us now make the incorrect assumption that both classes have the same variance, and calculate the estimates of the two means. Under the equal-variance assumption (and equal priors), the decision rule becomes a comparison with a single threshold, the midpoint between the two estimated means.
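Before turning to the ML and approximated-MMI solutions below, here is a small numerical sketch of this setup. The specific means, variances, and discrimination-parameter value are arbitrary choices (the paper's numerical values are not preserved in this transcription), and the approximated-MMI update used here is a plausible reading of (18)-(19): each mean is re-estimated from the accumulator of its labeled samples minus the parameter times the accumulator of the samples recognized as that class, in line with the generic accumulator-difference rule (28) given later in Section IV.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters: unequal true variances, while the assumed model
# uses a common variance (these numbers are ours, not the paper's).
m1, m2 = 0.0, 4.0
s1, s2 = 1.0, 3.0
n = 100_000
x1 = rng.normal(m1, s1, n)          # samples labeled class 1
x2 = rng.normal(m2, s2, n)          # samples labeled class 2

# ML solution under the (incorrect) equal-variance assumption:
mu1_ml, mu2_ml = x1.mean(), x2.mean()
thr_ml = 0.5 * (mu1_ml + mu2_ml)

# One approximated-MMI step: classify with the ML threshold to obtain the
# recognized sets, then update each mean with the accumulator-difference form.
eps = 0.3
x = np.concatenate([x1, x2])
in_s1 = x < thr_ml                   # samples recognized as class 1

def ammi_mean(own, in_si):
    return (own.sum() - eps * x[in_si].sum()) / (own.size - eps * in_si.sum())

thr_ammi = 0.5 * (ammi_mean(x1, in_s1) + ammi_mean(x2, ~in_s1))

def error_rate(thr):
    return 0.5 * ((x1 > thr).mean() + (x2 < thr).mean())

print(f"ML   threshold {thr_ml:.3f}  error {error_rate(thr_ml):.4f}")
print(f"aMMI threshold {thr_ammi:.3f}  error {error_rate(thr_ammi):.4f}")
```

For these illustrative numbers and a small parameter value, the shifted threshold gives a slightly lower error than the ML midpoint, which is the qualitative behavior the example is meant to demonstrate.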
The ML Solution: The ML estimates of the two means are in this case simply the sample averages of the corresponding classes, which, by the law of large numbers, converge to the true class means (the convergence is in the mean-square sense).

The Approximated MMI Solution: We take the threshold obtained by ML estimation as the algorithm's initial estimate. The two recognition sets are defined as in (12). Using straightforward differentiation to maximize the objective functions defined in (15) yields (18) and (19). By the law of large numbers, the various terms in (18) and (19) can be replaced by their expectations, which enables us to calculate the resulting threshold. As an example, specific values were chosen for the class means and variances; the resulting thresholds were computed, along with the probability of error of the decision rule (17). The minimal probability of error obtained by the approximated MMI algorithm is attained at a particular value of the discrimination parameter, and its relative distance from the optimal probability of error (using the correct model and parameter values) is small. The MMI estimates were also calculated using a Monte Carlo experiment, in which samples were drawn for each class and the MMI estimates were obtained by a direct maximization of the MMI objective function. The resulting MMI probability of error is larger than the one obtained by the approximated MMI algorithm. Fig. 1 shows the behavior of the probability of error as a function of the discrimination parameter. It can be seen that for sufficiently small values of the parameter, the probability of error is smaller than the one obtained by ML estimation. We note that at a certain parameter value the estimates diverge, since the denominators of (18) and (19) are zeroed; below that value, however, the probability of error is smaller than the one obtained by the ML estimate. Consecutive iterations were also tried, by taking the current threshold for the calculation of the recognition sets and then calculating a new threshold using (18) and (19). More than one iteration, however, did not yield a consistent improvement in the error rate over the range of parameter values tested. (Fig. 1: probability of error versus the discrimination parameter.)

C. Maximization Process for Models With Incomplete Data

As shown in Section III-A, the approximated MMI estimates are obtained by maximizing the per-word objective functions. However, in our case, due to the nature of the pdfs of the HMMs, these objective functions cannot be maximized in closed form. For this reason we propose an iterative solution for models that include complete and incomplete data, which is similar to the EM algorithm.

Algorithm Formulation: Our training set comprises the observed elements with their pdf. We assume the existence of complete data corresponding to the observations, with its own pdf, where the observed data is obtained from the complete data by a transformation that is, in general, noninvertible (many-to-one). We are interested in maximizing the function in (20). As in the EM algorithm, we can rewrite this function in terms of the complete data; applying the conditional expectation given the observed data then yields an auxiliary function, and (20) becomes the decomposition analyzed next.

Now, similar to the EM algorithm, we want to see whether an increase of the auxiliary function yields an increase of the objective function; that is, we want to check whether the condition in (21) holds. Note that each term can be expressed through the Kullback-Leibler distance between the corresponding densities, which is always nonnegative. All the summation terms on the right-hand side of (21) are Kullback-Leibler distances, hence they are nonnegative. However, the last term in (21) is multiplied by a negative factor. Our assumption is that this does not change the sign of the entire sum, since the discrimination parameter is chosen sufficiently small and the number of recognition errors is small; hence, this term contains a small number of elements. The assumption that a maximization of the auxiliary function actually increases the objective function was found to be true in all our experiments for the range of parameter values used. In light of that, like in the case of the EM algorithm, our algorithm has the desirable property that it monotonically increases the objective function and is therefore stable. Note that the EBW algorithm is also guaranteed to increase the objective function only in a certain range of its free parameter; however, in practice the parameter of that algorithm is chosen empirically, and is typically outside this range [5]. We thus have the following two-step solution: E-step: compute the auxiliary function (22). M-step: maximize it, as in (23). Finally, it should be noted that the algorithm we proposed can be applied to a general parameter estimation problem and is not restricted to the context of speech recognition using HMMs.

IV. APPLICATION TO HMMS

Using the two-step solution given in (22) and (23) it is possible to derive the explicit re-estimation formulas for the HMM case. The explicit derivation is detailed in the Appendix. The resulting re-estimation formulas are (24)-(27), where the occupancy terms are those defined in (6) and (7). Comparing (24)-(27) to the formulas for the ML estimates (2)-(5), it is possible to describe the new re-estimation procedure for each parameter in the way summarized by (28).

In (28), the accumulators in the first terms are computed according to the original transcription of the training set. Similarly, the discriminative accumulators are computed according to the transcription obtained by recognition. As seen so far, the new algorithm has two major steps. Approximation: performing recognition on the training set in order to obtain the recognized sets; using these sets, the approximated MMI objective function can be calculated. Maximization: maximizing the objective function using the re-estimation formulas (24)-(27). The algorithm has the following degrees of freedom: the choice of the discrimination parameter (one constant for all words, or a different one for each word); the recognition method in the Approximation step (when the training set consists of continuous utterances of words, recognition can be performed in several ways: using the boundaries of the words in the transcription, or ignoring them and using Viterbi recognition); the number of iterations (one or more); and the order of the steps in the iterations (applying Approximation and Maximization successively, or applying Approximation once and then several iterations of Maximization).

V. EXPERIMENTAL RESULTS

Experiments were conducted on two tasks. All experiments were done using the HTK3 [6] toolkit.

A. Isolated Digit Recognition in a Noisy Environment

The utterances were taken from the adult speakers of the TIDIGITS corpus [10], which is a multi-speaker small-vocabulary database. The corpus vocabulary comprises 11 words (the digits 1 to 9 plus "oh" and "zero") spoken by 326 speakers, in both an isolated and a continuous manner. We used only the adult speakers of the corpus according to the following division: the training set comprised 113 speakers (55 men, 58 women), and the test set comprised 115 speakers (57 men, 58 women). Each speaker utters each digit twice. In order to lower the baseline recognition rate we added white Gaussian noise to the speech signal so as to obtain an SNR level of 0 dB. The feature vector comprised 12 MFCCs, log energy, and the corresponding delta and acceleration coefficients. The frame rate was 10 ms and the window size was 25 ms. Mean normalization was also applied. Each digit, including the silence segments surrounding it, was modeled by an HMM with 10 emitting states and with diagonal covariance matrices. In the first experiments, single-mixture Gaussian output distributions were chosen. The HMM topology was left-to-right with no skips. The baseline (ML) system was obtained by applying three segmental K-means iterations and seven Baum-Welch iterations. The baseline recognition rate was 88.58%. In the first experiments, one iteration of Approximation and one of Maximization were applied. It was found that re-estimating the means, variances, and transition probabilities yielded better results than re-estimating only the means, or only the means and variances. Various values of the discrimination parameter were tested; each value was used for the re-estimation of all the models. For large parameter values, variances and transition probabilities tended to become negative; in these cases they were replaced by their ML values. However, when such an event occurred, the recognition rate deteriorated drastically, so in further experiments we restricted the parameter to sufficiently small values. Fig. 2 shows the recognition rate versus the discrimination parameter on both the training and test sets.
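A minimal sketch of the combined re-estimation described above for a single diagonal-covariance Gaussian is given below, including the fallback to the ML values when a variance (or an occupancy difference) becomes non-positive. The accumulator layout, function name, and toy numbers are our own illustrative assumptions; the formula follows the accumulator-difference form summarized in (28).

```python
import numpy as np

def discriminative_reestimate(num, den, eps, ml_fallback):
    """Approximated-MMI re-estimation of one diagonal Gaussian (cf. (28)).

    num : accumulators gathered with the original (reference) transcription
    den : discriminative accumulators gathered with the recognized
          transcription; both are dicts with keys
          'occ' (scalar), 'x' (d,), 'xx' (d,)
    eps : the discrimination parameter
    ml_fallback : (mean, var) ML estimates used when the update is invalid
                  (negative variances are replaced by their ML values)
    """
    occ = num['occ'] - eps * den['occ']
    if occ <= 0:
        return ml_fallback
    mean = (num['x'] - eps * den['x']) / occ
    var = (num['xx'] - eps * den['xx']) / occ - mean ** 2
    if np.any(var <= 0):
        return ml_fallback
    return mean, var

# Toy usage with fabricated accumulators for a 2-dimensional Gaussian.
num = {'occ': 120.0, 'x': np.array([60.0, -24.0]), 'xx': np.array([160.0, 90.0])}
den = {'occ': 30.0, 'x': np.array([18.0, -3.0]), 'xx': np.array([40.0, 20.0])}
ml_mean = num['x'] / num['occ']
ml_var = num['xx'] / num['occ'] - ml_mean ** 2
print(discriminative_reestimate(num, den, eps=0.2, ml_fallback=(ml_mean, ml_var)))
```

In an actual system the reference statistics would come from a Baum-Welch pass over the correctly labeled utterances and the discriminative statistics from a pass over the recognized transcriptions.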
Best results were obtained for the same parameter value on both the training and the test set, so in this case the parameter can be set using cross-validation on the training set. A reduction of 57% and 28% in the error rate was observed on the training set and the test set, respectively. The behavior of the algorithm over several iterations was also investigated. All iterations were applied with the same parameter value, and the following criteria were calculated.

The criteria were the recognition rate on the training set, the MMI objective function under the approximation in (13), and the objective function of the approximated MMI algorithm. Iterations were applied using two schedules: 1) each iteration comprises a single Approximation step followed by a single Maximization step; 2) one Approximation step is followed by several Maximization iterations. Fig. 3 shows the evolution of the above criteria along four iterations of the algorithm, where the iterations were implemented according to the first schedule. Each iteration in the graph represents one step of Approximation followed by one step of Maximization; the zeroth iteration represents the values of the criteria before applying the first iteration. It is possible to see that after the first iteration no improvement was obtained, other than a consistent growth in the algorithm's objective function. Fig. 4 shows the corresponding evolution where the iterations were implemented according to the second schedule; in this case a growth in all the objective functions was obtained. Recalling Section III-A, the approximated MMI criterion was obtained by an approximation of the MMI criterion; in the experiment reported here the relative approximation error was only 0.1%. In Section III we assumed that a maximization of the auxiliary function (the Maximization step) increases the value of the approximated MMI criterion. Indeed, in the experiment where several iterations of Maximization were applied, each iteration yielded a monotonic growth in the approximated MMI criterion. In light of that, like in the case of the EM algorithm, our algorithm has the desirable property that it monotonically increases the objective function and is therefore stable. The best recognition rate obtained on the test set was 92.16%, which reflects a reduction of 31% in the error rate in comparison to the ML baseline. This result was obtained by applying two iterations of Maximization (second schedule). Table I summarizes the results obtained by the algorithm on the TIDIGITS database; the iteration columns in the table represent Maximization iterations. The performance of the discriminative training algorithm was also examined while increasing the number of Gaussian mixtures in the output distributions. In all cases we implemented one Approximation step followed by one Maximization step.

The results of this experiment are summarized in Table II. It is possible to see that in all cases the algorithm yielded an improvement over ML estimation; however, the improvement decreased as the number of mixtures was increased.

B. Comparison to Other Discriminative Training Algorithms

In this subsection we compare the performance of the algorithm to that obtained by two other related algorithms, namely, the EBW algorithm as in [13] and the algorithm proposed in [19]. The EBW algorithm introduced in [13] aims to maximize the MMI objective function (34), and is an extension of the optimization procedure proposed in [5] to the case of continuous-output HMMs. We implemented the EBW algorithm in the task of isolated noisy digits modeled by single-mixture HMMs. The EBW re-estimation formulas [13], [18] are given in (29).

The quantities appearing in (29) are defined in (30). The smoothing constant in these formulas is set to the maximum between twice the value that ensures positive variances and a second prescribed term; as in [18], a different value of the constant was set for each state. The evolution of the recognition rate on both the training set and the test set is depicted in Fig. 5, and the results are summarized in Table III. It appears that the EBW algorithm yields a better improvement on the training set (75% versus 57% improvement). However, the opposite occurs on the test set: a 15% improvement by the EBW algorithm versus 31% by the proposed algorithm. We note that the recognition rate on the test set is the one of actual importance. We now compare the computational complexity of the two algorithms, measured in units of one Baum-Welch re-estimation pass over all the models. The new algorithm requires a recognition pass over the training set for the Approximation step (the determination of the recognized sets, assuming an equal number of operations for the Viterbi and Baum-Welch algorithms) and one re-estimation pass for each Maximization step (24)-(27); in total, assuming one Approximation step and two Maximization steps (see Section V-A), only a few such passes are needed. The EBW algorithm requires a comparable amount of computation for each iteration, so its total cost grows with the number of iterations needed for convergence. In terms of the MMI objective function, both algorithms yielded a growth, but the growth obtained by the new algorithm was faster: after four iterations the new algorithm had reached a higher value of the objective than the EBW algorithm. In Fig. 6 we show the evolution of the MMI objective function along the EBW iterations. We note that the EBW algorithm has many free parameters, whose tuning is the result of considerable previous research. In our implementation, following [18], there is one parameter per state for each model, resulting in an overall number of 110 parameters. We also tried to use only one tunable parameter, common to all models, but this resulted in an extremely slow convergence rate. Estimating the transition probabilities did not improve the performance either. In the implementation of our algorithm we had only one tunable parameter, the discrimination parameter. Adding tunable parameters to the algorithm (e.g., a different value for each model or for each state) may improve the performance when the training database is sufficiently large (to avoid overfitting).
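For concreteness, here is a small sketch of the standard EBW update for a single diagonal-covariance Gaussian, in the Normandin/Woodland-Povey form that (29)-(30) and the constant-selection rule above describe. The exact second term of that rule is not preserved in this transcription; the sketch assumes, as in [18], a multiple of the denominator occupancy, and all statistics and names are fabricated for illustration.

```python
import numpy as np

def ebw_update(mu, var, num, den, E=2.0):
    """One EBW update for a single diagonal-covariance Gaussian.

    num / den are numerator (correct-transcription) and denominator
    (recognition) statistics, each a dict with keys:
      'occ' : scalar occupancy        sum_t gamma(t)
      'x'   : first-order statistics  sum_t gamma(t) * o_t      (vector)
      'xx'  : second-order statistics sum_t gamma(t) * o_t**2   (vector)
    """
    def new_params(D):
        denom = num['occ'] - den['occ'] + D
        new_mu = (num['x'] - den['x'] + D * mu) / denom
        new_var = (num['xx'] - den['xx'] + D * (var + mu ** 2)) / denom - new_mu ** 2
        return new_mu, new_var

    # Approximate the smallest D that keeps all variances positive by doubling,
    # then apply the rule: max of twice that value and E times the denominator
    # occupancy (the assumed second term).
    D = 1.0
    while np.any(new_params(D)[1] <= 0.0):
        D *= 2.0
    D = max(2.0 * D, E * den['occ'])
    while np.any(new_params(D)[1] <= 0.0):   # safety guard for the sketch
        D *= 2.0
    return new_params(D)

# Toy usage with fabricated statistics for a 2-dimensional Gaussian.
rng = np.random.default_rng(0)
mu, var = np.zeros(2), np.ones(2)
num = {'occ': 50.0, 'x': rng.normal(size=2) * 50, 'xx': 60 + rng.random(2) * 40}
den = {'occ': 20.0, 'x': rng.normal(size=2) * 20, 'xx': 25 + rng.random(2) * 10}
print(ebw_update(mu, var, num, den))
```

The contrast with the sketch after Section V-A is the point made in the text: EBW interpolates toward the old parameters through the per-state constant and needs many such iterations, whereas the proposed update subtracts scaled discriminative accumulators directly and converges in one or two steps.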

As noted in Section III, our objective function resembles the H-criterion [5], given in (31). The authors of [5] used their optimization algorithm in order to maximize the H-criterion objective function; however, their implementation was for the case of discrete-output HMMs. To the best of our knowledge, the extension of that optimization algorithm to continuous HMMs under the H-criterion is not as straightforward as in the MMI case. However, Zheng et al. [19] have proposed a gradient-descent-based optimization procedure that yields the re-estimation formulas (32) and (33). One quantity in these formulas is a tunable constant which is tuned as in the EBW case; the other, h, is a tunable parameter which resembles the discrimination parameter in our algorithm: one extreme value of h yields the ML objective function and the other yields the MMI objective. Note that for a particular value of h, (32) coincides with (29). We tuned h in the following way: we first implemented one iteration for various values of h. One value yielded the best improvement (12.26%) on the test set and another value yielded the best improvement (18%) on the training set. Then, we implemented 30 iterations of the algorithm for these values and for one additional value. The evolution of the recognition rate along the H-criterion iterations for different values of h is depicted in Fig. 7, and the best improvements are summarized in Table IV (the improvements are given in parentheses). It appears from the results that our algorithm outperforms the H-criterion on both the training and test sets. The H-criterion outperforms the EBW algorithm on the test set but not on the training set.

C. Connected Digit Recognition in a Noisy Environment

The utterances were taken from the adult speakers of the TIDIGITS corpus after adding white Gaussian noise to the speech signal so as to obtain an SNR level of 0 dB. Each speaker's contribution to the corpus consisted of two repetitions of each digit in isolation and 55 digit strings of lengths 2, 3, 4, 5, and 7. The feature vector was the same as in the isolated digit recognition task. Each digit was modeled by an HMM with 8 emitting states, with a single-mixture Gaussian output distribution and with diagonal covariance matrices. The entry and exit silences were modeled by an HMM with 3 emitting states, with a single-mixture Gaussian output distribution and with diagonal covariance matrices. The HMM topology was left-to-right with no skips. The baseline (ML) system was obtained in three steps. First, the models were initialized by applying three segmental K-means iterations using the isolated digit utterances. Second, seven Baum-Welch iterations were applied using the connected digit utterances, which were previously labeled using alignment with models trained on clean speech. Lastly, the models were refined using 19 iterations of embedded training [6]. Embedded training does not require the utterances to be time-segmented. Instead, during training, each continuous utterance is modeled by a composite HMM which is a concatenation of the uttered digits, the accumulators are calculated using this composite HMM, and the parameters are then calculated from the accumulators in the usual way (2)-(5). In order to generalize our algorithm to connected speech, we propose the following. The MMI objective function can be written as (34), in terms of all the training set utterances and the complete training-set transcription. We can also write it as in (35), where the sum is over all possible transcriptions. Following the same steps as in Section III-A, we arrive at the objective function (36), in which the competing transcription corresponds to the largest term in the sum of (35). In order to find this transcription we apply unconstrained Viterbi recognition on the training set.
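The composite-HMM construction used by embedded training can be sketched as follows: per-digit left-to-right models are chained so that the probability mass leaving one word enters the first state of the next word in the transcription. The data layout, function names, and toy parameters below are our own illustration, not HTK's internal representation.

```python
import numpy as np

def compose(word_models, transcription):
    """Concatenate per-word left-to-right HMMs into one composite HMM.

    word_models : dict mapping a word to (A, exit_p, means), where A is the
        (S, S) transition matrix between the word's emitting states, exit_p
        is the (S,) probability of leaving the word from each state, and
        means holds the (S, d) state means (a stand-in for the full Gaussian
        parameters).
    transcription : list of words, e.g. ['one', 'eight', 'oh'].
    """
    total = sum(word_models[w][0].shape[0] for w in transcription)
    A = np.zeros((total, total))
    mean_rows = []
    pos = 0
    for k, w in enumerate(transcription):
        Aw, exit_p, means = word_models[w]
        S = Aw.shape[0]
        A[pos:pos + S, pos:pos + S] = Aw
        if k + 1 < len(transcription):
            # Mass that leaves word k enters the first state of word k + 1.
            A[pos:pos + S, pos + S] += exit_p
        mean_rows.append(means)
        pos += S
    return A, np.vstack(mean_rows)

# Toy 2-state, 1-D word models, just to show the composition.
def toy_word(mu):
    A = np.array([[0.6, 0.4],
                  [0.0, 0.7]])
    exit_p = np.array([0.0, 0.3])    # each row of A plus exit_p sums to one
    return A, exit_p, np.array([[mu], [mu + 1.0]])

models = {'one': toy_word(0.0), 'two': toy_word(5.0)}
A, means = compose(models, ['one', 'two', 'one'])
print(A.shape, means.ravel())
```

Running the forward-backward algorithm over such a composite model with the reference transcription gives the usual accumulators, and running it over the recognized transcription gives the discriminative ones, as described next.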
The accumulators were calculated using an embedded-training pass that does not require time segmentation of the known transcription. The discriminative accumulators were calculated by an embedded-training pass using the transcription obtained by Viterbi recognition. Lastly, the parameters were calculated using (24)-(27). Recognition was implemented using the Viterbi algorithm, where the beginning and the end of each utterance were constrained to be silence. A word insertion penalty [6] was also used in order to reduce insertions.
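The connected-digit results on the next page are reported with the two usual measures (percent correct and accuracy), defined from the number of correctly recognized words and the number of insertions. A minimal scoring sketch is given below; it assumes standard HTK-style definitions based on a minimum-edit-distance alignment, which is our reading of the expressions in the transcription.

```python
def align_counts(ref, hyp):
    """Minimum-edit-distance alignment of reference and hypothesis word lists,
    returning (hits, substitutions, deletions, insertions)."""
    n, m = len(ref), len(hyp)
    # dp[i][j] = (cost, hits, subs, dels, ins) for ref[:i] versus hyp[:j]
    dp = [[None] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = (0, 0, 0, 0, 0)
    for i in range(1, n + 1):
        c = dp[i - 1][0]
        dp[i][0] = (c[0] + 1, c[1], c[2], c[3] + 1, c[4])            # deletion
    for j in range(1, m + 1):
        c = dp[0][j - 1]
        dp[0][j] = (c[0] + 1, c[1], c[2], c[3], c[4] + 1)            # insertion
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = ref[i - 1] == hyp[j - 1]
            a = dp[i - 1][j - 1]
            diag = (a[0] + (0 if match else 1),
                    a[1] + (1 if match else 0),
                    a[2] + (0 if match else 1), a[3], a[4])          # hit/sub
            b = dp[i - 1][j]
            up = (b[0] + 1, b[1], b[2], b[3] + 1, b[4])              # deletion
            c = dp[i][j - 1]
            left = (c[0] + 1, c[1], c[2], c[3], c[4] + 1)            # insertion
            dp[i][j] = min(diag, up, left)
    _, hits, subs, dels, ins = dp[n][m]
    return hits, subs, dels, ins

ref = "one eight oh three".split()
hyp = "one oh oh three five".split()
h, s, d, i = align_counts(ref, hyp)
n = len(ref)
print(f"%Correct = {100.0 * h / n:.1f}   Accuracy = {100.0 * (h - i) / n:.1f}")
```

Percent correct ignores insertions, which is why the word insertion penalty mentioned above matters mainly for the accuracy figure.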

11 214 IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 12, NO. 3, MAY 2004 TABLE V SUMMARY OF THE RESULTS IN THE CONNECTED DIGIT RECOGNITION TASK ON THE TRAINING SET TABLE VII RESULTS IN THE WORD-SPOTTING TASK TABLE VI SUMMARY OF THE RESULTS IN THE CONNECTED DIGIT RECOGNITION TASK ON THE TEST SET Performance was evaluated using the following two expressions: Where is the total number of words in the transcription files, is the number of words correctly recognized, and is the number of insertions. As in the isolated digit recognition task, the best improvement was obtained for both the training and test sets with the same value ( ). The results of this experiment are summarized in Tables V and VI. D. Word-Spotting The task involved the spotting of 20 keywords (KWs) on the conversational part of the Stonehenge corpus taken from the Road Rally database [15]. This corpus contains 20 identified KWs, spoken by 80 speakers (28 females, 52 males) that were recorded in laboratory conditions. The speech was then filtered to simulate a telephone line frequency response. The database transcription contains the KW locations. Non-KW speech is not transcribed. The training and test sets were chosen so as to give a good representation of confusable utterances. Sentences sf01-sf02,sf11-sf12,sf42,sf44- sf48,sm03-sm16,sm49-sm59 comprised the training set (85 minutes of speech, containing 1313 KW utterances). Sentences sf58,sf60-sf64,sm33-sm41,sm43 comprised the test set (39 minutes of speech, containing 617 KW utterances). We used the baseline word-spotter proposed in [16], in which likelihood ratio scoring was implemented. Each KW was modeled by a HMM with 18 emitting states and single mixture Gaussian output distributions. Only one filler model was used, modeled by a stationary HMM with 50 mixtures. The feature vector was where represents an MFCC coefficient. Mean normalization was also applied. Fig. 8. Receiver operating curves. The KW models were re-estimated using the new proposed algorithm. The filler model was trained using standard ML estimation. The Approximation step was implemented by performing recognition on the training set. It was experimentally shown better to use all the false alarms for the calculation of the discriminative accumulators, and not reduce them using scoring. One iteration of Maximization was used, and a different value of was used for each KW. Two estimation procedures were examined for : According to the first, is determined for each KW according to its figure of merit (FOM) on the test set. According to the second, is determined by an empirical rule. The first procedure involves the test set, and therefore is not feasible in a realistic situation. According to the second procedure is chosen for each word as some fraction of the value in which variances become negative. The fraction values that we tried were 0.5, 0.7, and 0.9. Results are shown in Table VII. The Improvement column represents the error rate reduction. Fig. 8 presents the receiver operating curves of the word-spotting system with and without discriminative training. VI. CONCLUSIONS This paper has described a new algorithm for discriminative training. We started by introducing a new estimation criterion referred to as the approximated MMI criterion. We then introduced an optimization technique similar to the EM algorithm. Unlike existing discriminative training algorithms, the training procedure can be implemented by a simple modification of the Baum Welch algorithm.

12 BEN-YISHAI AND BURSHTEIN: DISCRIMINATIVE TRAINING ALGORITHM FOR HIDDEN MARKOV MODELS 215 The algorithm has two major steps: Approximation, which is the derivation of the algorithm s criterion, and Maximization, which is similar to the EM algorithm. It was seen in experiments that the approximation yields a small relative error. The maximization process yielded a monotonic growth in the objective function along the iterations. This is a desirable property that can be proved for the EM algorithm. In the case of the new proposed algorithm, this property was shown to hold under certain conditions that were validated in the experiments. Three tasks were tested: isolated and connected digit recognition in a noisy environment and word-spotting. In the isolated digit recognition task, a reduction of 31% in the error rate was observed. In the connected digit recognition task, a reduction of 13% in the error rate was observed. In the word-spotting task, the best improvement was a reduction of 17% in the error rate. We also compared our algorithm to the EBW algorithm on an isolated digit recognition task. Our algorithm was shown to be superior in terms of its performance on the test database, and in terms of its computational complexity. The generalization to context dependent phonetic systems (e.g., triphone based) is conceptually simple, and is the same as in the generalization to connected speech, (34) (36). Tying is accounted for in the usual way by tying together the appropriate counters. where represents the complete underlying sequence of states and mixtures that correspond to the utterance. In the case of the HMMs defined in Section II.A, the auxiliary function is the first equation at the bottom of the page. In the M-step we maximize with respect to all the elements of the parameter vector. We start with the transition probabilities. In order to satisfy the constraints, we shall use the set of Lagrange multipliers, and maximize the Lagrangian. Hence, Therefore, APPENDIX DERIVATION OF EXPLICIT FORMULAS FOR HMMS We apply the estimation algorithm (22), (23) to a Gaussian mixture HMM, as defined in Section II. Let be the auxiliary function corresponding to Summing over we obtain where. Therefore, see the second equation at the bottom of the page. Deriving the formulas for

the mixture weights, we shall again use Lagrange multipliers in order to satisfy the sum-to-one constraints. Differentiating the Lagrangian, summing over the mixtures to eliminate the multiplier, and solving gives the re-estimation formula for the weights. We then derive the re-estimation formulas for the elements of the mean vectors, and finally for the elements of the diagonal covariance matrices, in the same manner.

ACKNOWLEDGMENT

The authors would like to thank the Cambridge University Engineering Department and, in particular, G. Evermann for providing and supporting the HTK3 toolkit.

REFERENCES

[1] L. R. Bahl, P. F. Brown, P. V. de Souza, and R. L. Mercer, "Maximum mutual information estimation of hidden Markov model parameters for speech recognition," in Proc. ICASSP '86, Apr. 1986.
[2] L. R. Bahl, P. F. Brown, P. V. de Souza, and R. L. Mercer, "A new algorithm for the estimation of hidden Markov model parameters," in Proc. ICASSP '88, 1988.
[3] L. E. Baum, T. Petrie, G. Soules, and N. Weiss, "A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains," Ann. Math. Statist., vol. 41, no. 1.
[4] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," J. R. Statist. Soc., vol. 39, pp. 1-38, 1977.
[5] P. S. Gopalakrishnan, D. Kanevsky, A. Nádas, and D. Nahamoo, "An inequality for rational functions with applications to some statistical estimation problems," IEEE Trans. Inform. Theory, vol. 37, Jan.
[6] HTK Hidden Markov Model Toolkit [Online]. Available: eng.cam.ac.uk
[7] B.-H. Juang, W. Chou, and C.-H. Lee, "Minimum classification error methods for speech recognition," IEEE Trans. Speech Audio Processing, vol. 5, no. 3.
[8] S. Kapadia, V. Valtchev, and S. J. Young, "MMI training for continuous phoneme recognition on the TIMIT database," in Proc. ICASSP 1993, vol. 2.
[9] S. Katagiri, B.-H. Juang, and C.-H. Lee, "Pattern recognition using a family of design algorithms based upon the generalized probabilistic descent method," Proc. IEEE, vol. 86, no. 11.
[10] R. G. Leonard, "A database for speaker-independent digit recognition," in Proc. ICASSP '84.
[11] A. Nádas, "A decision theoretic formulation of a training problem in speech recognition and a comparison of training by unconditional versus conditional maximum likelihood," IEEE Trans. Acoust., Speech, Signal Processing, vol. 31, no. 4, 1983.

[12] A. Nádas, D. Nahamoo, and M. A. Picheny, "On a model-robust training method for speech recognition," IEEE Trans. Acoust., Speech, Signal Processing, vol. 39, no. 9.
[13] Y. Normandin, R. Cardin, and R. De Mori, "High-performance connected digit recognition using maximum mutual information estimation," IEEE Trans. Speech Audio Processing, vol. 2.
[14] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proc. IEEE, vol. 77.
[15] The Road Rally Word-Spotting Corpora (RDRALLY1), NIST, NIST Speech Disc6 1.1.
[16] R. C. Rose and D. B. Paul, "A hidden Markov model based keyword recognition system," in Proc. ICASSP '90, vol. 2.24, Apr. 1990.
[17] V. Valtchev, J. J. Odell, P. C. Woodland, and S. J. Young, "MMIE training of large vocabulary speech recognition systems," Speech Commun., vol. 22.
[18] P. C. Woodland and D. Povey, "Large scale MMIE training for conversational telephone speech recognition," in Proc. Speech Transcription Workshop.
[19] J. Zheng, J. Butzberger, H. Franco, and A. Stolcke, "Improved maximum mutual information estimation training of continuous density HMMs," in Proc. 7th Eur. Conf. Speech Communication and Technology, Aalborg, Denmark, Sept.

Assaf Ben-Yishai was born in Israel. He received the B.Sc. and M.Sc. degrees in electrical engineering in 1999 and 2001, respectively, both from Tel-Aviv University. He is currently pursuing the Ph.D. degree in electrical engineering at Tel-Aviv University. His research interests include speech recognition and information theory.

David Burshtein (M'92, SM'99) received the B.Sc. and Ph.D. degrees in electrical engineering in 1982 and 1987, respectively, from Tel-Aviv University. He was a Research Staff Member in the Speech Recognition Group of the IBM T. J. Watson Research Center, Yorktown Heights, NY. In 1989, he joined the Department of Electrical Engineering Systems, Tel-Aviv University, where he is currently an Associate Professor. His research interests include information theory, speech, and signal processing.


More information

Random projection for non-gaussian mixture models

Random projection for non-gaussian mixture models Random projection for non-gaussian mixture models Győző Gidófalvi Department of Computer Science and Engineering University of California, San Diego La Jolla, CA 92037 gyozo@cs.ucsd.edu Abstract Recently,

More information

Application of Principal Components Analysis and Gaussian Mixture Models to Printer Identification

Application of Principal Components Analysis and Gaussian Mixture Models to Printer Identification Application of Principal Components Analysis and Gaussian Mixture Models to Printer Identification Gazi. Ali, Pei-Ju Chiang Aravind K. Mikkilineni, George T. Chiu Edward J. Delp, and Jan P. Allebach School

More information

Mixture Models and the EM Algorithm

Mixture Models and the EM Algorithm Mixture Models and the EM Algorithm Padhraic Smyth, Department of Computer Science University of California, Irvine c 2017 1 Finite Mixture Models Say we have a data set D = {x 1,..., x N } where x i is

More information

A Gaussian Mixture Model Spectral Representation for Speech Recognition

A Gaussian Mixture Model Spectral Representation for Speech Recognition A Gaussian Mixture Model Spectral Representation for Speech Recognition Matthew Nicholas Stuttle Hughes Hall and Cambridge University Engineering Department PSfrag replacements July 2003 Dissertation submitted

More information

Gene regulation. DNA is merely the blueprint Shared spatially (among all tissues) and temporally But cells manage to differentiate

Gene regulation. DNA is merely the blueprint Shared spatially (among all tissues) and temporally But cells manage to differentiate Gene regulation DNA is merely the blueprint Shared spatially (among all tissues) and temporally But cells manage to differentiate Especially but not only during developmental stage And cells respond to

More information

Audio-visual interaction in sparse representation features for noise robust audio-visual speech recognition

Audio-visual interaction in sparse representation features for noise robust audio-visual speech recognition ISCA Archive http://www.isca-speech.org/archive Auditory-Visual Speech Processing (AVSP) 2013 Annecy, France August 29 - September 1, 2013 Audio-visual interaction in sparse representation features for

More information

Comparative Evaluation of Feature Normalization Techniques for Speaker Verification

Comparative Evaluation of Feature Normalization Techniques for Speaker Verification Comparative Evaluation of Feature Normalization Techniques for Speaker Verification Md Jahangir Alam 1,2, Pierre Ouellet 1, Patrick Kenny 1, Douglas O Shaughnessy 2, 1 CRIM, Montreal, Canada {Janagir.Alam,

More information

Conditional Random Fields : Theory and Application

Conditional Random Fields : Theory and Application Conditional Random Fields : Theory and Application Matt Seigel (mss46@cam.ac.uk) 3 June 2010 Cambridge University Engineering Department Outline The Sequence Classification Problem Linear Chain CRFs CRF

More information

A General Greedy Approximation Algorithm with Applications

A General Greedy Approximation Algorithm with Applications A General Greedy Approximation Algorithm with Applications Tong Zhang IBM T.J. Watson Research Center Yorktown Heights, NY 10598 tzhang@watson.ibm.com Abstract Greedy approximation algorithms have been

More information

CS839: Probabilistic Graphical Models. Lecture 10: Learning with Partially Observed Data. Theo Rekatsinas

CS839: Probabilistic Graphical Models. Lecture 10: Learning with Partially Observed Data. Theo Rekatsinas CS839: Probabilistic Graphical Models Lecture 10: Learning with Partially Observed Data Theo Rekatsinas 1 Partially Observed GMs Speech recognition 2 Partially Observed GMs Evolution 3 Partially Observed

More information

Constrained Discriminative Training of N-gram Language Models

Constrained Discriminative Training of N-gram Language Models Constrained Discriminative Training of N-gram Language Models Ariya Rastrow #1, Abhinav Sethy 2, Bhuvana Ramabhadran 3 # Human Language Technology Center of Excellence, and Center for Language and Speech

More information

10-701/15-781, Fall 2006, Final

10-701/15-781, Fall 2006, Final -7/-78, Fall 6, Final Dec, :pm-8:pm There are 9 questions in this exam ( pages including this cover sheet). If you need more room to work out your answer to a question, use the back of the page and clearly

More information

HMM-Based Handwritten Amharic Word Recognition with Feature Concatenation

HMM-Based Handwritten Amharic Word Recognition with Feature Concatenation 009 10th International Conference on Document Analysis and Recognition HMM-Based Handwritten Amharic Word Recognition with Feature Concatenation Yaregal Assabie and Josef Bigun School of Information Science,

More information

Shared Kernel Models for Class Conditional Density Estimation

Shared Kernel Models for Class Conditional Density Estimation IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 12, NO. 5, SEPTEMBER 2001 987 Shared Kernel Models for Class Conditional Density Estimation Michalis K. Titsias and Aristidis C. Likas, Member, IEEE Abstract

More information

IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 10, NO. 6, NOVEMBER Inverting Feedforward Neural Networks Using Linear and Nonlinear Programming

IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 10, NO. 6, NOVEMBER Inverting Feedforward Neural Networks Using Linear and Nonlinear Programming IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 10, NO. 6, NOVEMBER 1999 1271 Inverting Feedforward Neural Networks Using Linear and Nonlinear Programming Bao-Liang Lu, Member, IEEE, Hajime Kita, and Yoshikazu

More information

CS 229 Midterm Review

CS 229 Midterm Review CS 229 Midterm Review Course Staff Fall 2018 11/2/2018 Outline Today: SVMs Kernels Tree Ensembles EM Algorithm / Mixture Models [ Focus on building intuition, less so on solving specific problems. Ask

More information

I How does the formulation (5) serve the purpose of the composite parameterization

I How does the formulation (5) serve the purpose of the composite parameterization Supplemental Material to Identifying Alzheimer s Disease-Related Brain Regions from Multi-Modality Neuroimaging Data using Sparse Composite Linear Discrimination Analysis I How does the formulation (5)

More information

Discriminative Training of Decoding Graphs for Large Vocabulary Continuous Speech Recognition

Discriminative Training of Decoding Graphs for Large Vocabulary Continuous Speech Recognition Discriminative Training of Decoding Graphs for Large Vocabulary Continuous Speech Recognition by Hong-Kwang Jeff Kuo, Brian Kingsbury (IBM Research) and Geoffry Zweig (Microsoft Research) ICASSP 2007 Presented

More information

Transductive Phoneme Classification Using Local Scaling And Confidence

Transductive Phoneme Classification Using Local Scaling And Confidence 202 IEEE 27-th Convention of Electrical and Electronics Engineers in Israel Transductive Phoneme Classification Using Local Scaling And Confidence Matan Orbach Dept. of Electrical Engineering Technion

More information

Speaker Diarization System Based on GMM and BIC

Speaker Diarization System Based on GMM and BIC Speaer Diarization System Based on GMM and BIC Tantan Liu 1, Xiaoxing Liu 1, Yonghong Yan 1 1 ThinIT Speech Lab, Institute of Acoustics, Chinese Academy of Sciences Beijing 100080 {tliu, xliu,yyan}@hccl.ioa.ac.cn

More information

LOW-DENSITY PARITY-CHECK (LDPC) codes [1] can

LOW-DENSITY PARITY-CHECK (LDPC) codes [1] can 208 IEEE TRANSACTIONS ON MAGNETICS, VOL 42, NO 2, FEBRUARY 2006 Structured LDPC Codes for High-Density Recording: Large Girth and Low Error Floor J Lu and J M F Moura Department of Electrical and Computer

More information

A long, deep and wide artificial neural net for robust speech recognition in unknown noise

A long, deep and wide artificial neural net for robust speech recognition in unknown noise A long, deep and wide artificial neural net for robust speech recognition in unknown noise Feipeng Li, Phani S. Nidadavolu, and Hynek Hermansky Center for Language and Speech Processing Johns Hopkins University,

More information

ARELAY network consists of a pair of source and destination

ARELAY network consists of a pair of source and destination 158 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL 55, NO 1, JANUARY 2009 Parity Forwarding for Multiple-Relay Networks Peyman Razaghi, Student Member, IEEE, Wei Yu, Senior Member, IEEE Abstract This paper

More information

An Improved Measurement Placement Algorithm for Network Observability

An Improved Measurement Placement Algorithm for Network Observability IEEE TRANSACTIONS ON POWER SYSTEMS, VOL. 16, NO. 4, NOVEMBER 2001 819 An Improved Measurement Placement Algorithm for Network Observability Bei Gou and Ali Abur, Senior Member, IEEE Abstract This paper

More information

Theoretical Concepts of Machine Learning

Theoretical Concepts of Machine Learning Theoretical Concepts of Machine Learning Part 2 Institute of Bioinformatics Johannes Kepler University, Linz, Austria Outline 1 Introduction 2 Generalization Error 3 Maximum Likelihood 4 Noise Models 5

More information

SPEECH FEATURE EXTRACTION USING WEIGHTED HIGHER-ORDER LOCAL AUTO-CORRELATION

SPEECH FEATURE EXTRACTION USING WEIGHTED HIGHER-ORDER LOCAL AUTO-CORRELATION Far East Journal of Electronics and Communications Volume 3, Number 2, 2009, Pages 125-140 Published Online: September 14, 2009 This paper is available online at http://www.pphmj.com 2009 Pushpa Publishing

More information

Analysis of Functional MRI Timeseries Data Using Signal Processing Techniques

Analysis of Functional MRI Timeseries Data Using Signal Processing Techniques Analysis of Functional MRI Timeseries Data Using Signal Processing Techniques Sea Chen Department of Biomedical Engineering Advisors: Dr. Charles A. Bouman and Dr. Mark J. Lowe S. Chen Final Exam October

More information

Introduction to The HTK Toolkit

Introduction to The HTK Toolkit Introduction to The HTK Toolkit Hsin-min Wang Reference: - The HTK Book Outline An Overview of HTK HTK Processing Stages Data Preparation Tools Training Tools Testing Tools Analysis Tools A Tutorial Example

More information

Efficient Tuning of SVM Hyperparameters Using Radius/Margin Bound and Iterative Algorithms

Efficient Tuning of SVM Hyperparameters Using Radius/Margin Bound and Iterative Algorithms IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 13, NO. 5, SEPTEMBER 2002 1225 Efficient Tuning of SVM Hyperparameters Using Radius/Margin Bound and Iterative Algorithms S. Sathiya Keerthi Abstract This paper

More information

Chapter 3. Speech segmentation. 3.1 Preprocessing

Chapter 3. Speech segmentation. 3.1 Preprocessing , as done in this dissertation, refers to the process of determining the boundaries between phonemes in the speech signal. No higher-level lexical information is used to accomplish this. This chapter presents

More information

SVD-based Universal DNN Modeling for Multiple Scenarios

SVD-based Universal DNN Modeling for Multiple Scenarios SVD-based Universal DNN Modeling for Multiple Scenarios Changliang Liu 1, Jinyu Li 2, Yifan Gong 2 1 Microsoft Search echnology Center Asia, Beijing, China 2 Microsoft Corporation, One Microsoft Way, Redmond,

More information

HIERARCHICAL LARGE-MARGIN GAUSSIAN MIXTURE MODELS FOR PHONETIC CLASSIFICATION. Hung-An Chang and James R. Glass

HIERARCHICAL LARGE-MARGIN GAUSSIAN MIXTURE MODELS FOR PHONETIC CLASSIFICATION. Hung-An Chang and James R. Glass HIERARCHICAL LARGE-MARGIN GAUSSIAN MIXTURE MODELS FOR PHONETIC CLASSIFICATION Hung-An Chang and James R. Glass MIT Computer Science and Artificial Intelligence Laboratory Cambridge, Massachusetts, 02139,

More information

The Method of User s Identification Using the Fusion of Wavelet Transform and Hidden Markov Models

The Method of User s Identification Using the Fusion of Wavelet Transform and Hidden Markov Models The Method of User s Identification Using the Fusion of Wavelet Transform and Hidden Markov Models Janusz Bobulski Czȩstochowa University of Technology, Institute of Computer and Information Sciences,

More information

Package HMMCont. February 19, 2015

Package HMMCont. February 19, 2015 Type Package Package HMMCont February 19, 2015 Title Hidden Markov Model for Continuous Observations Processes Version 1.0 Date 2014-02-11 Author Maintainer The package includes

More information

A Model Selection Criterion for Classification: Application to HMM Topology Optimization

A Model Selection Criterion for Classification: Application to HMM Topology Optimization A Model Selection Criterion for Classification Application to HMM Topology Optimization Alain Biem IBM T. J. Watson Research Center P.O Box 218, Yorktown Heights, NY 10549, USA biem@us.ibm.com Abstract

More information

Robust Shape Retrieval Using Maximum Likelihood Theory

Robust Shape Retrieval Using Maximum Likelihood Theory Robust Shape Retrieval Using Maximum Likelihood Theory Naif Alajlan 1, Paul Fieguth 2, and Mohamed Kamel 1 1 PAMI Lab, E & CE Dept., UW, Waterloo, ON, N2L 3G1, Canada. naif, mkamel@pami.uwaterloo.ca 2

More information

Intelligent Hands Free Speech based SMS System on Android

Intelligent Hands Free Speech based SMS System on Android Intelligent Hands Free Speech based SMS System on Android Gulbakshee Dharmale 1, Dr. Vilas Thakare 3, Dr. Dipti D. Patil 2 1,3 Computer Science Dept., SGB Amravati University, Amravati, INDIA. 2 Computer

More information

Image Denoising AGAIN!?

Image Denoising AGAIN!? 1 Image Denoising AGAIN!? 2 A Typical Imaging Pipeline 2 Sources of Noise (1) Shot Noise - Result of random photon arrival - Poisson distributed - Serious in low-light condition - Not so bad under good

More information

Short-time Viterbi for online HMM decoding : evaluation on a real-time phone recognition task

Short-time Viterbi for online HMM decoding : evaluation on a real-time phone recognition task Short-time Viterbi for online HMM decoding : evaluation on a real-time phone recognition task Julien Bloit, Xavier Rodet To cite this version: Julien Bloit, Xavier Rodet. Short-time Viterbi for online

More information

JOINT INTENT DETECTION AND SLOT FILLING USING CONVOLUTIONAL NEURAL NETWORKS. Puyang Xu, Ruhi Sarikaya. Microsoft Corporation

JOINT INTENT DETECTION AND SLOT FILLING USING CONVOLUTIONAL NEURAL NETWORKS. Puyang Xu, Ruhi Sarikaya. Microsoft Corporation JOINT INTENT DETECTION AND SLOT FILLING USING CONVOLUTIONAL NEURAL NETWORKS Puyang Xu, Ruhi Sarikaya Microsoft Corporation ABSTRACT We describe a joint model for intent detection and slot filling based

More information

Complexity-Optimized Low-Density Parity-Check Codes

Complexity-Optimized Low-Density Parity-Check Codes Complexity-Optimized Low-Density Parity-Check Codes Masoud Ardakani Department of Electrical & Computer Engineering University of Alberta, ardakani@ece.ualberta.ca Benjamin Smith, Wei Yu, Frank R. Kschischang

More information

Flexible-Hybrid Sequential Floating Search in Statistical Feature Selection

Flexible-Hybrid Sequential Floating Search in Statistical Feature Selection Flexible-Hybrid Sequential Floating Search in Statistical Feature Selection Petr Somol 1,2, Jana Novovičová 1,2, and Pavel Pudil 2,1 1 Dept. of Pattern Recognition, Institute of Information Theory and

More information

ModelStructureSelection&TrainingAlgorithmsfor an HMMGesture Recognition System

ModelStructureSelection&TrainingAlgorithmsfor an HMMGesture Recognition System ModelStructureSelection&TrainingAlgorithmsfor an HMMGesture Recognition System Nianjun Liu, Brian C. Lovell, Peter J. Kootsookos, and Richard I.A. Davis Intelligent Real-Time Imaging and Sensing (IRIS)

More information

Linear Methods for Regression and Shrinkage Methods

Linear Methods for Regression and Shrinkage Methods Linear Methods for Regression and Shrinkage Methods Reference: The Elements of Statistical Learning, by T. Hastie, R. Tibshirani, J. Friedman, Springer 1 Linear Regression Models Least Squares Input vectors

More information

Speaker Verification with Adaptive Spectral Subband Centroids

Speaker Verification with Adaptive Spectral Subband Centroids Speaker Verification with Adaptive Spectral Subband Centroids Tomi Kinnunen 1, Bingjun Zhang 2, Jia Zhu 2, and Ye Wang 2 1 Speech and Dialogue Processing Lab Institution for Infocomm Research (I 2 R) 21

More information

Machine Learning. Topic 5: Linear Discriminants. Bryan Pardo, EECS 349 Machine Learning, 2013

Machine Learning. Topic 5: Linear Discriminants. Bryan Pardo, EECS 349 Machine Learning, 2013 Machine Learning Topic 5: Linear Discriminants Bryan Pardo, EECS 349 Machine Learning, 2013 Thanks to Mark Cartwright for his extensive contributions to these slides Thanks to Alpaydin, Bishop, and Duda/Hart/Stork

More information

International Association of Scientific Innovation and Research (IASIR) (An Association Unifying the Sciences, Engineering, and Applied Research)

International Association of Scientific Innovation and Research (IASIR) (An Association Unifying the Sciences, Engineering, and Applied Research) International Association of Scientific Innovation and Research (IASIR) (An Association Unifying the Sciences, Engineering, and Applied Research) International Journal of Emerging Technologies in Computational

More information

Software/Hardware Co-Design of HMM Based Isolated Digit Recognition System

Software/Hardware Co-Design of HMM Based Isolated Digit Recognition System 154 JOURNAL OF COMPUTERS, VOL. 4, NO. 2, FEBRUARY 2009 Software/Hardware Co-Design of HMM Based Isolated Digit Recognition System V. Amudha, B.Venkataramani, R. Vinoth kumar and S. Ravishankar Department

More information

SOUND EVENT DETECTION AND CONTEXT RECOGNITION 1 INTRODUCTION. Toni Heittola 1, Annamaria Mesaros 1, Tuomas Virtanen 1, Antti Eronen 2

SOUND EVENT DETECTION AND CONTEXT RECOGNITION 1 INTRODUCTION. Toni Heittola 1, Annamaria Mesaros 1, Tuomas Virtanen 1, Antti Eronen 2 Toni Heittola 1, Annamaria Mesaros 1, Tuomas Virtanen 1, Antti Eronen 2 1 Department of Signal Processing, Tampere University of Technology Korkeakoulunkatu 1, 33720, Tampere, Finland toni.heittola@tut.fi,

More information

Adaptive Filtering using Steepest Descent and LMS Algorithm

Adaptive Filtering using Steepest Descent and LMS Algorithm IJSTE - International Journal of Science Technology & Engineering Volume 2 Issue 4 October 2015 ISSN (online): 2349-784X Adaptive Filtering using Steepest Descent and LMS Algorithm Akash Sawant Mukesh

More information

Treba: Efficient Numerically Stable EM for PFA

Treba: Efficient Numerically Stable EM for PFA JMLR: Workshop and Conference Proceedings 21:249 253, 2012 The 11th ICGI Treba: Efficient Numerically Stable EM for PFA Mans Hulden Ikerbasque (Basque Science Foundation) mhulden@email.arizona.edu Abstract

More information

The HTK Book. Steve Young Gunnar Evermann Dan Kershaw Gareth Moore Julian Odell Dave Ollason Dan Povey Valtcho Valtchev Phil Woodland

The HTK Book. Steve Young Gunnar Evermann Dan Kershaw Gareth Moore Julian Odell Dave Ollason Dan Povey Valtcho Valtchev Phil Woodland The HTK Book Steve Young Gunnar Evermann Dan Kershaw Gareth Moore Julian Odell Dave Ollason Dan Povey Valtcho Valtchev Phil Woodland The HTK Book (for HTK Version 3.2) c COPYRIGHT 1995-1999 Microsoft Corporation.

More information

Semi-blind Block Channel Estimation and Signal Detection Using Hidden Markov Models

Semi-blind Block Channel Estimation and Signal Detection Using Hidden Markov Models Semi-blind Block Channel Estimation and Signal Detection Using Hidden Markov Models Pei Chen and Hisashi Kobayashi Department of Electrical Engineering Princeton University Princeton, New Jersey 08544,

More information

Probabilistic Facial Feature Extraction Using Joint Distribution of Location and Texture Information

Probabilistic Facial Feature Extraction Using Joint Distribution of Location and Texture Information Probabilistic Facial Feature Extraction Using Joint Distribution of Location and Texture Information Mustafa Berkay Yilmaz, Hakan Erdogan, Mustafa Unel Sabanci University, Faculty of Engineering and Natural

More information

Speech User Interface for Information Retrieval

Speech User Interface for Information Retrieval Speech User Interface for Information Retrieval Urmila Shrawankar Dept. of Information Technology Govt. Polytechnic Institute, Nagpur Sadar, Nagpur 440001 (INDIA) urmilas@rediffmail.com Cell : +919422803996

More information