Modeling Inter-cluster and Intra-cluster Discrimination Among Triphones


Modeling Inter-cluster and Intra-cluster Discrimination Among Triphones

Tom Ko, Brian Mak and Dongpeng Chen
Department of Computer Science and Engineering
The Hong Kong University of Science and Technology
Clear Water Bay, Hong Kong
{tomko, mak, dpchen}@cse.ust.hk

Abstract

Discriminative training is a major contribution to the success of automatic speech recognition (ASR) in the last decade. However, since most ASR systems employ state tying, which ties similar states in a cluster, discriminative training may only improve inter-cluster discrimination; states belonging to the same cluster obviously cannot be distinguished. Recently, the concept of distinct acoustic modeling was investigated by a new acoustic modeling method called eigentriphone modeling. In the new method, states are grouped, but not tied, into separate clusters, and the difference vectors between the mean vectors of the member states and their cluster center vector are modeled by a basis approach using a set of eigenvectors which are also called eigentriphones. This paper investigates whether the inter-cluster discrimination achieved by discriminative training and the intra-cluster discrimination obtained by eigentriphone modeling are additive. In a simple procedure that is applied to each state cluster, the discriminatively trained cluster center vector is integrated with the difference vectors trained by eigentriphone modeling to construct the final mean vectors of the distinct states in the cluster. Experimental evaluation on the WSJ0 5K task shows that the two techniques are indeed additive.

Index Terms: eigentriphone, discriminative training, adaptation, regularization

1. Introduction

In context-dependent phone-based acoustic modeling, infrequent triphones need to be handled properly, otherwise they will greatly affect system performance due to the classification nature of speech recognition. Existing solutions to robust modeling of infrequent triphones can be roughly classified into three major categories: triphone-by-composition [1], parameter tying [2] and basis approach [3].
In triphone-by-composition methods, the parameters of infrequent triphones are estimated through a composition of models of different orders of context dependency. Model interpolation [4] and quasi-triphones [5] are typical examples of triphone-by-composition. Parameter tying methods mainly differ in their choice of acoustic units for tying. Example tying units are generalized triphones [6], tied states [7], shared distributions or senones [8], and tied subspace Gaussian distributions [9]. Among the various parameter tying methods, phonetic decision tree-based tying [10] is the most popular approach due to its proven effectiveness in balancing the trainability and resolution of the acoustic models. The key is that infrequent triphones (and even unseen triphones) may share the same distribution with frequent triphones in the same state cluster, where the amount of training data is guaranteed. However, one potential problem is that the triphone states tied to the same cluster become identical to the recognizer, inducing quantization error in the state distributions and causing confusion between triphones during recognition. As an alternative to the above two methods, the basis approach tends to exploit the underlying relationships/factors among the context-dependent states. Examples of the basis approach such as the subspace Gaussian mixture model [11] or the semi-continuous HMM [12] can be summarized by a general framework called the canonical state model (CSM) [3]. It is assumed in CSM that every context-dependent state in a system can be transformed from some canonical states. These canonical states represent the underlying factors among the context-dependent states. In contrast to standard tying schemes, the model parameters are now related to each other; in other words, a soft tying scheme is being used. Recently, a new method for estimating the parameters of triphone models called eigentriphone modeling [13, 14, 15] was proposed.

(This work was supported by the Research Grants Council of the Hong Kong SAR under grant numbers HKUST616513 and HKUST16206714.)
In the most general form of eigentriphone modeling, models are first grouped into clusters, and then an orthogonal basis is constructed from a set of well-trained reference models in each cluster. Each model in the cluster is then constrained to lie in the space spanned by the constructed basis, and is modeled as a linear combination of the eigenvectors of the basis. These eigenvectors, which are called eigentriphones, capture the most important context-dependent characteristics among the triphones. Since the number of eigentriphones is relatively small, even the infrequent models can be robustly trained using the new approach. In [14, 15], the successful use of model-based eigentriphones and state-based eigentriphones was demonstrated. In both cases, an eigenbasis is derived for each monophone for modeling triphones in which no states are tied. Since the triphone models are distinct from each other, they are more discriminative as well. In the latest development of eigentriphone modeling [16], triphone states are grouped into clusters, from which eigentriphones are derived. The new method is called cluster-based eigentriphone modeling. From another perspective, eigentriphone modeling attempts to model intra-cluster discrimination, that is, discrimination among states belonging to the same state cluster, by modeling the difference vector between each (distinct) state in a cluster and its cluster center vector using a basis approach. In eigenvoice [17], the mean of the reference speaker supervectors is chosen as the cluster center.

In cluster-based eigentriphone modeling, it is empirically found that using the cluster mean supervector, which is estimated by maximum-likelihood (ML) training, is better. At the same time, discriminative training [18, 19, 20] has become commonplace in acoustic modeling. However, since most automatic speech recognition systems employ state tying, which ties similar states together in a cluster, discriminative training may only improve inter-cluster discrimination; states belonging to the same cluster obviously cannot be distinguished. In this paper, we would like to investigate whether the inter-cluster discrimination achieved by discriminative training and the intra-cluster discrimination obtained by eigentriphone modeling are additive or complementary to each other. In a simple procedure that is applied to each state cluster, the discriminatively trained cluster center vector is integrated with the difference vectors trained by eigentriphone modeling to construct the final mean vectors of the distinct states in the cluster. Experimental evaluation on the WSJ0 5K task shows that the two techniques are indeed additive.

This paper is organized as follows. In Section 2, we first review the cluster-based eigentriphone acoustic modeling approach, and then describe how the procedures are modified using discriminatively trained cluster centers. That is followed by experimental evaluation in Section 3, and conclusions in Section 4.

Figure 1: Overview of the cluster-based eigentriphone acoustic modeling method. (WPCA = weighted principal component analysis; PMLED = penalized maximum-likelihood eigen-decomposition)

2. Cluster-based Eigentriphone Modeling

Fig. 1 shows an overview of the cluster-based eigentriphone acoustic modeling method. All triphone states are first represented by some supervectors, and they are assumed to lie in a low-dimensional space spanned by a set of eigenvectors. In other words, each triphone state supervector is a linear combination of a small set of eigenvectors which are now called eigentriphones.

2.1. Derivation of Cluster-based Eigentriphones
Cluster-based eigentriphone modeling consists of three major steps: (a) state clustering via a phonetic decision tree, (b) derivation of the eigenbasis, and (c) estimation of the eigentriphone coefficients. The steps are summarized in further detail below.

2.1.1. Conventional Tied-state Triphone HMM Training

We follow the steps in [21] to train a conventional tied-state triphone acoustic model. A tied-state acoustic model Λ_ML is obtained through maximum-likelihood (ML) training. Each tied state is represented by a J-component Gaussian mixture model (GMM) with diagonal covariances. For each tied state i in Λ_ML, create a state supervector m_i by stacking up all the Gaussian mean vectors in the state as below:

    m_i = [ µ_i1, µ_i2, ..., µ_iJ ],    (1)

where µ_ij, j = 1, 2, ..., J, is the mean vector of the jth Gaussian component of the ith tied state. Now the tied states are treated as state clusters.

2.1.2. Derivation of Cluster-based Eigentriphones

The following procedure is repeated for each state cluster i using its N_i triphone states that appear in the training corpus.

STEP 1: Untie the Gaussian means of all the triphone states in cluster i except the unseen triphone states. The means of the cluster GMM are then cloned to initialize all the untied triphone states. Note that the Gaussian variances and mixture weights of the states in the cluster are still tied together.

STEP 2: Re-estimate only the Gaussian means of all triphone states after cloning; their Gaussian covariances and mixture weights remain unchanged as those of their cluster GMM.

STEP 3: Create a triphone state supervector v_ip for each triphone state p in cluster i by stacking up all its Gaussian mean vectors from its J-component GMM as in Eqn. (1).

STEP 4: Collect the state mean supervectors v_i1, v_i2, ..., v_iN_i, as well as the ML-trained cluster center supervector m_i of cluster i, and derive an eigenbasis from their correlation matrix using weighted principal component analysis (WPCA).
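The supervector construction of Eqn. (1) and STEP 3 can be sketched as follows. The shapes (J = 16 mixture components, 39-dimensional MFCC means) follow the paper's experimental setup; the random means below are hypothetical stand-ins for actual trained GMM parameters:

```python
import numpy as np

J, D = 16, 39          # mixture components per state, MFCC dimension (as in the paper)
rng = np.random.default_rng(0)

# Hypothetical per-component Gaussian mean vectors of one triphone state's GMM.
state_means = [rng.standard_normal(D) for _ in range(J)]

# Eqn. (1): stack the J mean vectors into a single state supervector.
v = np.concatenate(state_means)
assert v.shape == (J * D,)   # 16 * 39 = 624, the supervector dimension used in Section 3.2
```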
The correlation matrix is computed as follows:

    C_i = (1/F_i) Σ_p F_ip (v̂_ip − m̂_i)(v̂_ip − m̂_i)ᵀ ,    (2)

where v̂_ip is the standardized version of v_ip after it is normalized by its variances; F_ip is the frame count of triphone state p in cluster i, and F_i = Σ_p F_ip. Note that we empirically find that using m̂_i as the bias for the correlation computation gives a better result than using the arithmetic mean of the state supervectors {v̂_ip, p = 1, ..., N_i}.

STEP 5: Arrange the eigenvectors {e_ik, k = 1, 2, ..., N_i} in descending order of their eigenvalues λ_ik, and pick the top K_i (where K_i < N_i) eigenvectors to represent the eigenbasis of cluster i. These K_i eigenvectors are now called the eigentriphones of cluster i. Note that, in general, different clusters have different numbers of eigentriphones.

2.1.3. Estimation of the Eigentriphone Coefficients

After the derivation of the eigentriphones, the supervector v_ip of any triphone state p in cluster i is assumed to lie in the space spanned by the K_i eigentriphones. Thus, we have

    v_ip = m_i + E_i w_ip ,    (3)

in which m_i is the cluster center and E_i w_ip is the difference vector,
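Assuming the supervectors have already been standardized, STEPs 4 and 5 amount to an eigen-decomposition of the frame-count-weighted correlation matrix of Eqn. (2). A minimal numpy sketch, with toy sizes and synthetic data in place of trained state supervectors:

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 12, 30                      # N_i seen triphone states, supervector dimension (toy sizes)
V = rng.standard_normal((N, D))    # rows: standardized state supervectors v̂_ip
m = rng.standard_normal(D)         # standardized ML-trained cluster center m̂_i
F = rng.integers(1, 100, size=N)   # frame counts F_ip of each state

# Eqn. (2): frame-count-weighted correlation matrix about the cluster center.
diffs = V - m                      # v̂_ip − m̂_i, one row per state
C = (diffs.T * F) @ diffs / F.sum()

# STEP 5: sort eigenvectors by descending eigenvalue; keep the top K_i as eigentriphones.
eigvals, eigvecs = np.linalg.eigh(C)   # eigh returns ascending order for symmetric C
order = np.argsort(eigvals)[::-1]
K = 5
lam = eigvals[order][:K]               # eigenvalues λ_ik
E = eigvecs[:, order][:, :K]           # columns e_ik: the eigentriphones of this cluster

# The retained basis is orthonormal, as required for Eqn. (3).
assert np.allclose(E.T @ E, np.eye(K), atol=1e-8)
```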

where E_i = [e_i1, ..., e_iK_i] is the matrix of the eigentriphones that is used to model the intra-cluster discrimination among the member states of cluster i, and w_ip = [w_ip1, ..., w_ipK_i]ᵀ is the eigentriphone coefficient vector of triphone state p in the cluster. The second term E_i w_ip models the difference vector between the cluster center and each distinct state in the cluster. The eigentriphone coefficient vector w_ip is estimated by maximizing the objective function Q(w_ip) in the penalized maximum-likelihood eigen-decomposition (PMLED) [14] as follows:

    Q(w_ip) = L(w_ip) − β Σ_{k=1..K_i} (w_ipk)² / λ_ik ,    (4)

where L(w_ip) is the log-likelihood of the training data; β is the regularization parameter; and w_ipk is the coefficient for the kth eigentriphone.

Figure 2: An illustration of the inter-cluster and intra-cluster discriminations provided by discriminative training and eigentriphone modeling respectively. m_a and m_b are the centers of clusters a and b obtained through ML training; m_a^DT and m_b^DT are the centers of clusters a and b obtained through discriminative training.

2.2. Investigation Issue: Cluster-based Eigentriphones with Discriminatively Trained Bias

Compared with a conventional tied-state system, the discrimination among triphone states within the same state cluster, or the intra-cluster discrimination, is now modeled by the addition of the difference vectors E_i w_ip to the cluster centers in the acoustic space (Fig. 2). Meanwhile, the discrimination between state clusters is given by the ML-trained cluster centers. The inter-state-cluster discrimination can be readily enhanced by discriminative training (DT). Let us denote the corresponding DT-trained cluster centers by m_i^DT. Thus, we have the following two pieces of additional discrimination information:

- additional inter-cluster discrimination: m_i^DT − m_i
- intra-cluster discrimination: E_i w_ip

This paper investigates integrating these two pieces of complementary discrimination information, and models the supervector of the distinct triphone state p of cluster i as follows:

    v_ip = m_i + (m_i^DT − m_i) + E_i w_ip ,    (5)

where the middle term provides the additional inter-cluster discrimination and the last term the intra-cluster discrimination.

Table 1: Information of the various WSJ data sets.
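The interplay of Eqns. (4) and (5) can be illustrated under a deliberately simplified assumption: a single spherical Gaussian per state, so that the log-likelihood reduces to L(w) = −(F/2)·||u − E w||² for an observed mean offset u from the cluster center and frame count F. With an orthonormal basis E, maximizing Eqn. (4) then has a closed-form per-coefficient solution, w_k = F·(e_kᵀ u) / (F + 2β/λ_k); the actual PMLED objective over full GMM statistics is optimized iteratively instead. The sketch below (all data synthetic) shows this simplified solution and the integration of Eqn. (5):

```python
import numpy as np

rng = np.random.default_rng(2)
D, K = 30, 5
beta = 10.0                                        # regularization parameter β (value used in the paper)

E = np.linalg.qr(rng.standard_normal((D, K)))[0]   # orthonormal eigentriphone basis E_i (columns)
lam = np.sort(rng.uniform(0.5, 5.0, K))[::-1]      # eigenvalues λ_ik, descending
m_ml = rng.standard_normal(D)                      # ML-trained cluster center m_i
m_dt = m_ml + 0.1 * rng.standard_normal(D)         # DT-trained cluster center m_i^DT
u = rng.standard_normal(D)                         # observed mean offset of state p from m_i
F = 25.0                                           # frame count of state p

# Simplified closed-form maximizer of Eqn. (4): small frame counts or small eigenvalues
# shrink w toward 0, keeping infrequent states close to the cluster center.
w = F * (E.T @ u) / (F + 2.0 * beta / lam)

# Eqn. (5): integrate the DT center shift with the eigentriphone difference vector.
v = m_ml + (m_dt - m_ml) + E @ w                   # equals m_dt + E @ w
```

Note how the penalty makes the estimate robust exactly where data is sparse: as F → 0 the state collapses onto the (discriminatively trained) cluster center.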
Data Set  | #Speakers | #Utterances | Vocab Size
SI-84     |    83     |    7,138    |   8,911
SI-284    |   283     |   37,413    |  13,646
dev. set  |    10     |      410    |   1,591
Nov'92    |     8     |      330    |   1,270

3. Experimental Evaluation

3.1. Speech Corpora and Experimental Setup

Two sets of experiments were conducted on Wall Street Journal (WSJ) continuous speech recognition: one using the smaller SI-84 WSJ0 training set and another using the larger SI-284 WSJ0+1 training set. The SI-84 training set consists of 15 hours of 7,138 WSJ0 read utterances from 83 speakers. The SI-284 training set is a superset of the SI-84 training set, consisting of all the WSJ0 utterances plus an additional 30,275 WSJ1 utterances from 200 speakers, for a total of about 70 hours of read speech. All the training data were endpointed. The standard Nov'92 5K non-verbalized test set was used for evaluation, while the 1992 WSJ 5K development data set was used for tuning the system parameters. These data sets are summarized in Table 1. The language models in the experiments were the standard 5K-vocabulary bigram and trigram that come along with the WSJ corpus, which have perplexities of 147 and 57 respectively. There were altogether 15,061 cross-word triphones in the WSJ0 training set and 18,777 cross-word triphones in the WSJ0+1 training set, based on 39 base phonemes. Each triphone model was a strictly left-to-right 3-state continuous-density hidden Markov model (CDHMM), with a Gaussian mixture density of at most J = 16 components per state. In addition, there were a 1-state short pause model and a 3-state silence model. The traditional 39-dimensional MFCC vectors were extracted every 10 ms over a window of 25 ms. The HTK toolkit [21] was used for maximum-likelihood HMM estimation and discriminative training, as well as for speech decoding.

3.2. Acoustic Modeling

The performance (in terms of word accuracy) of the following four acoustic modeling methods is compared on the WSJ 5K recognition tasks:

- baseline1: conventional ML training of tied-state triphone HMMs.
- baseline2: minimum-phone-error (MPE) discriminative training of the tied-state triphone HMMs resulting from baseline1.
- cluster-based eigentriphone modeling of triphone HMMs, applied after baseline1 (no states are tied).
- cluster-based eigentriphone modeling of triphone HMMs, applied after baseline1, with discriminatively trained cluster centers extracted from the models of baseline2 (again, no states are tied).

The SI-84 tied-state baselines consist of 1,277 tied states and the SI-284 tied-state baselines consist of 7,374 tied states. For simplicity, the cluster-based eigentriphone modeling was conducted using the clusters defined by the tied states in the baseline systems. In general, the optimal choice of state clusters for eigentriphone modeling can be different from the tied states chosen by conventional tied-state HMM training, even though they come from the same state tying tree. The dimension of the triphone state supervectors is 16 (mixtures) × 39 (MFCC) = 624. For each state cluster, all seen triphone states were used to derive the eigentriphones; then the top 20% of the eigentriphones were used in PMLED. The regularization parameter β in PMLED was set to 10.

3.3. Results and Discussion

Word recognition results of the various systems are compared in Table 2.

Table 2: Recognition word accuracy (%) of various systems on the WSJ Nov'92 5K evaluation set using a bigram or trigram LM.

Train  | Model Description                                                                   | Bigram        | Trigram
SI-84  | Baseline1: ML-trained tied-state triphones                                          | 93.09         | 95.46
SI-84  | Baseline2: MPE-trained tied-state triphones                                         | 93.46 (+0.37) | 95.78 (+0.32)
SI-84  | Cluster-based eigentriphone modeling                                                | 93.89 (+0.80) | 95.74 (+0.28)
SI-84  | Cluster-based eigentriphone modeling with discriminatively trained cluster centers  | 93.98 (+0.89) | 95.95 (+0.49)
SI-284 | Baseline1: ML-trained tied-state triphones                                          | 94.25         | 96.32
SI-284 | Baseline2: MPE-trained tied-state triphones                                         | 94.28 (+0.03) | 96.54 (+0.22)
SI-284 | Cluster-based eigentriphone modeling                                                | 94.30 (+0.05) | 96.53 (+0.21)
SI-284 | Cluster-based eigentriphone modeling with discriminatively trained cluster centers  | 94.64 (+0.39) | 96.73 (+0.41)

First of all, we can see that the previously proposed cluster-based eigentriphone modeling performs at least as well as the discriminatively trained tied-state triphones. With a trigram LM, both of them give an absolute 0.2%-0.3% (or a relative 5.4%-6.6%) reduction in word error rate (WER) when compared with conventional ML training of tied-state triphones. This suggests that the exploitation of intra-cluster discrimination between the member states of a state cluster (obtained by eigentriphone modeling) may be as important as the additional inter-cluster discrimination obtained by discriminative training. Moreover, the performance gain from eigentriphone modeling alone is more prominent with the smaller SI-84 training set than with the larger SI-284 training set. This shows that eigentriphone modeling is particularly effective with sparse training data. The last row for each of the tasks shows the recognition performance of the models after integrating the two approaches, and it gives the best performance among the four modeling methods: an absolute reduction in WER of 0.89% on WSJ0 and 0.39% on WSJ0+1 when the bigram LM was used, and of 0.49% on WSJ0 and 0.41% on WSJ0+1 when the trigram LM was used. Thus, it seems that eigentriphone modeling and discriminative training are complementary to each other, and the improvements given by each of them are additive.

As the strength of the eigentriphone modeling method is its ability to construct distinct models robustly for the infrequent triphones, we hypothesize that the performance gain in a task will depend on how often those triphones that are infrequent in the training set appear in the test set. Thus, we count the relative amount of infrequent triphones in the two training sets that appear in the test set for different definitions of infrequency, and summarize the findings in Table 3.

Table 3: Relative amount (%) of triphones in the Nov'92 test set that are considered infrequent in the SI-84 or SI-284 training set, for different definitions of infrequency.

Sample Count Below | SI-84 | SI-284
        10         | 5.37  |  0.82
        20         | 11.2  |  1.75
        30         | 15.7  |  2.55
        40         | 19.5  |  3.55
        50         | 23.5  |  4.53

It can be seen that many more triphones in the Nov'92 test set appear infrequently in WSJ0 than in WSJ0+1. This is expected, as the WSJ0+1 training set is about 4 times bigger than the WSJ0 training set, and the latter is actually a subset of the former. Thus, the benefit of eigentriphone modeling is more pronounced in the WSJ0 task than in the WSJ0+1 task.

4. Conclusions and Relation to Prior Work

This paper successfully shows that cluster-based eigentriphone modeling [13, 14, 15, 16] can be further improved by replacing the ML-trained cluster centers with discriminatively trained centers. Standard discriminative training [18, 19, 20] of tied-state triphones aims at maximizing the inter-cluster discrimination among the tied states, whereas cluster-based eigentriphone modeling eliminates the quantization errors in tied states by untying the states belonging to the same tied state. Besides untying states, eigentriphone modeling further models each distinct member state of each state cluster (formerly a tied state) by a difference vector from the cluster center, thus effectively achieving additional discrimination among the member states. The two approaches are integrated in this paper so that both inter- and intra-cluster discriminations are modeled in the new cluster-based eigentriphone modeling algorithm that uses discriminatively trained cluster centers. Experimental evaluation on the WSJ 5K tasks shows that the new algorithm combines the gains achieved by discriminative training and eigentriphone modeling, and gives the best recognition performance.

5. References

[1] J. Ming, P. O'Boyle, M. Owens, and F. J. Smith, "A Bayesian approach for building triphone models for continuous speech recognition," IEEE Transactions on Speech and Audio Processing, vol. 7, pp. 678-684, 1999.
[2] S. Takahashi and S. Sagayama, "Four-level tied-structure for efficient representation of acoustic modeling," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1995.
[3] M. J. F. Gales and K. Yu, "Canonical state models for automatic speech recognition," in Proceedings of Interspeech, 2010.
[4] K. F. Lee, The Development of the SPHINX System, Kluwer Academic Publishers, 1989.
[5] A. Ljolje, "High accuracy phone recognition using context clustering and quasi-triphonic models," Computer Speech and Language, vol. 8, pp. 129-151, 1994.
[6] K. F. Lee, "Context-dependent phonetic hidden Markov models for speaker-independent continuous speech recognition," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 38, pp. 599-609, 1990.
[7] S. J. Young and P. C. Woodland, "The use of state tying in continuous speech recognition," in Proceedings of the European Conference on Speech Communication and Technology, 1993.
[8] M. Y. Hwang and X. D. Huang, "Shared-distribution hidden Markov model for speech recognition," IEEE Transactions on Speech and Audio Processing, vol. 1, pp. 414-420, 1993.
[9] E. Bocchieri and B. Mak, "Subspace distribution clustering hidden Markov model," IEEE Transactions on Speech and Audio Processing, vol. 9, pp. 264-275, 2001.
[10] S. J. Young, J. J. Odell, and P. C. Woodland, "Tree-based state tying for high accuracy acoustic modelling," in Proceedings of the Workshop on Human Language Technology, 1994.
[11] D. Povey et al., "Subspace Gaussian mixture models for speech recognition," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2010.
[12] X. D. Huang and M. A. Jack, "Semi-continuous hidden Markov models for speech signals," Computer Speech and Language, vol. 3, pp. 239-251, 1989.
[13] T. Ko and B. Mak, "Eigentriphones: A basis for context-dependent acoustic modeling," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2011.
[14] T. Ko and B. Mak, "A fully automated derivation of state-based eigentriphones for triphone modeling with no tied states using regularization," in Proceedings of Interspeech, 2011.
[15] T. Ko and B. Mak, "Derivation of eigentriphones by weighted principal component analysis," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2012.
[16] T. Ko and B. Mak, "Eigentriphones for context-dependent acoustic modeling," IEEE Transactions on Audio, Speech and Language Processing, submitted.
[17] R. Kuhn, J.-C. Junqua, P. Nguyen, and N. Niedzielski, "Rapid speaker adaptation in eigenvoice space," IEEE Transactions on Speech and Audio Processing, vol. 8, pp. 695-707, 2000.
[18] B. H. Juang and S. Katagiri, "Discriminative learning for minimum error classification," IEEE Transactions on Signal Processing, vol. 40, no. 12, pp. 3043-3054, Dec. 1992.
[19] P. C. Woodland and D. Povey, "Large scale discriminative training of hidden Markov models for speech recognition," Computer Speech and Language, vol. 16, no. 1, pp. 25-47, Jan. 2002.
[20] D. Povey and P. C. Woodland, "Minimum phone error and I-smoothing for improved discriminative training," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, May 2002, vol. 1, pp. I-105-I-108.
[21] S. Young et al., The HTK Book (Version 3.4), University of Cambridge, 2006.