Tree-based Dependence Models for Speech Recognition


Mari Ostendorf, Ashvin Kannan and Orith Ronen
Electrical and Computer Engineering Department, Boston University, 8 St. Mary's St., Boston, MA, USA (mo@bu.edu)
In K. Ponting (ed.), Computational Models of Speech Pattern Processing, Springer-Verlag, Berlin Heidelberg, 1999.

Summary. The independence assumptions typically used to make speech recognition practical ignore the fact that different sounds in speech are highly correlated. Tree-structured dependence models make it possible to represent cross-class acoustic dependence in recognition when used in conjunction with hidden Markov or other such models. These models have Markov-like assumptions on the branches of a tree, which lead to efficient recursive algorithms for state estimation. This paper will describe general approaches to topology design and parameter estimation of tree-based models and outline more specific solutions for two examples: discrete-state hidden dependence trees and continuous-state multiscale models, drawing analogies to results for time series models. Initial results for both cases will be described, followed by a discussion of questions raised by the experiments.

1. Introduction

In speech recognition, independence assumptions are typically made to reduce the complexity of automatic training and the recognition search. In particular, a standard assumption used in virtually all recognition systems is that each vector or segment is generated independently given an underlying state or phone sequence. In other words, in a speaker-independent system, there is no notion that an /aa/ and an /ah/ (or even another /aa/) in the same utterance or speaker session have something in common because they came from the same vocal tract. The assumption effectively allows two phones at different times in an utterance to come from different speakers. Vocal tract length (VTL) normalization (e.g. [1]) compensates for this problem to some extent, but it is clear that VTL normalization does not account for all speaker-dependent effects, because gains are additive when it is used in combination with acoustic model adaptation. In addition, sounds can be correlated for other reasons, such as the recording environment or dialect-related pronunciation patterns. Acoustic model adaptation is used to overcome this problem, but in large vocabulary recognition one often has very little data from the target speaker with which to adapt possibly millions of parameters. Therefore, most current adaptation techniques assume classes of models over which the adaptation transformation is tied (e.g. [2, 3]), or they may approximate the joint dependence of different speech sounds by defining regions of local dependence ([3, 4]). However, such approaches do not take full advantage of the predictive power that observations from one phone have for another.

Ignoring the speech recognition application for the moment, our problem is to estimate a probability distribution that represents the joint dependence of variables in a very high dimensional space and to use this distribution to make inferences about missing variables.

We refer to such a probability distribution as a dependence model. For such a model to be practical, Markov-like assumptions are required; some examples include Markov random fields and Bayesian networks. In this work, we focus on a particular class of Markov dependence models that are tree-based, because of the additional efficiency of estimation and prediction algorithms and because topology design is simpler and arguably more robust for trees.

The remainder of the chapter is organized as follows. In the next section, we introduce a general hidden tree framework that handles the type of variable-length observations encountered in speech applications. Then, we describe two specific examples of dependence models and how they can be used in speech processing applications. We conclude with a brief summary and discussion of open questions.

2. Hidden Tree Framework

The problem of acoustic modeling of speech sounds includes the important issue of characterizing variable-length observations. A speech sound may occur a different number of times (or not at all), and each instance has a randomly varying length. We handle this problem in the dependence model by defining a fixed-dimension hidden state X = [x_1, ..., x_N] that represents the joint dependence of speech sounds x_i, which are associated with observations Y = {Y_i; i = 1, ..., N}. For a tree-based model, x_i corresponds to a node in the tree, and N is equal to the number of nodes in the tree. Each observation Y_i = [y_{i,1}, ..., y_{i,L_i}] is a concatenation of the L_i different instances of sound i (ignoring time order), and can be thought of as a random process with characteristics depending on the hidden state. Figure 1 illustrates a tree with a variable number of observations associated with each node. Depending on the underlying model of speech, y_{i,j} may correspond to a frame of speech (i.e. a vector of cepstral coefficients, the "observation" associated with an HMM state) or a segment trajectory (i.e. the vector of coefficients describing a trajectory of features, the sufficient statistics for the observation associated with a polynomial segment model [19]). The complete set of observations Y may correspond to a single utterance or a collection of utterances.

The probabilistic model is specified by: a function that defines the tree topology, e.g. π(i) is the parent of node i; Markov state distributions associated with branches in the tree, P(x_i | x_π(i)); and observation distributions p(Y_i | x_i), assuming that observations are conditionally independent given the state. The conditional independence assumptions, like those in a hidden Markov model (HMM), make implementation practical. The differences with respect to an HMM are that dependence is between unordered sound classes rather than sequential states in time, and that the state sequence is a fixed-length vector rather than a random process.

As with an HMM, there are three problems to solve in applying the dependence model: optimal state estimation, efficient computation of the likelihood p(Y), and model design (i.e. topology design and distribution parameter estimation). These problems are solved using recursive algorithms that take advantage of assumptions of conditional independence of nodes and subtrees given the value of an intermediate node. The solutions are analogous to the corresponding HMM algorithms but differ in that the updates follow the tree structure rather than a linear time sequence.
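As a concrete illustration of how the grouped observations Y_i enter the model, the following sketch (Python, with hypothetical variable names; the chapter does not prescribe an implementation) groups feature vectors by sound class and evaluates log p(Y_i | x_i = j) for node-dependent Gaussian observation distributions, using the assumption that the y_{i,j} are conditionally i.i.d. given the state.

```python
import numpy as np

def group_observations(frames, labels, n_nodes):
    """Collect the variable-length observation set Y_i for each node i:
    all frames whose sound-class label is i, with time order ignored."""
    return {i: frames[labels == i] for i in range(1, n_nodes + 1)}

def gaussian_obs_loglik(Y_i, means, covs):
    """log p(Y_i | x_i = j) for each state value j, assuming the frames in
    Y_i are conditionally i.i.d. Gaussian given x_i; means[j], covs[j] are
    the node-dependent Gaussian parameters for state value j."""
    n_states = len(means)
    out = np.zeros(n_states)
    for j in range(n_states):
        d = Y_i.shape[1]
        diff = Y_i - means[j]
        inv = np.linalg.inv(covs[j])
        _, logdet = np.linalg.slogdet(covs[j])
        quad = np.einsum('nd,de,ne->n', diff, inv, diff)   # per-frame Mahalanobis terms
        out[j] = np.sum(-0.5 * (quad + logdet + d * np.log(2 * np.pi)))
    return out
```

The resulting per-node arrays of observation log-likelihoods are the quantities consumed by the leaves-to-root likelihood recursion sketched in the next section.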

Fig. 1. Illustration of a tree-structured model with a hidden state: open circles indicate the nodes of the tree that form the hidden state X; filled triangles denote observations associated with a node, Y.

Details of the algorithms depend on the particular state and observation distribution assumptions, but there are some issues that apply in general, as described below.

There are two main types of tree topologies that could describe a collection of sound classes, and the relationship between the number of nodes in the tree N and the number of sound classes M depends on the particular type of topology. At one extreme is the graph of connections between sound classes, where every node in the graph corresponds to one of a disjoint collection of classes (one x_i per class). In this case, N = M, and topology design involves finding the best graph that connects the classes. At the other extreme is a hierarchically organized tree, where the target sound classes comprise the leaves of the tree and sub-classes representing different levels of granularity are introduced at internal nodes. In this case, topology design involves clustering the M classes. If sub-classes are defined using a binary tree, then N = 2M - 1. Hybrid versions can be envisioned and probably will be the most effective solution, in part because the introduction of sub-classes is a useful tool for robust topology design when M is large.

Parameter estimation for dependence models with a hidden tree structure must address the problem of unobserved variables in estimation, which is generally solved using the Expectation-Maximization (EM) algorithm [5]. The algorithm involves two steps: 1) finding the expected joint likelihood of the hidden state and observations given the current parameter estimates, and 2) computing the maximum likelihood estimate of the parameters in terms of the statistics found in step (1). Making Markov assumptions on the tree, the first step can be implemented efficiently with an algorithm analogous to the forward-backward algorithm used in HMM parameter estimation, except that it runs upward and downward on the tree rather than forward and backward in time.

In the two sections that follow, we introduce two different examples of tree-based dependence models: the hidden dependence tree, which represents a discrete-valued hidden state using the disjoint-class topology; and the multiscale model, which represents a continuous-valued state using a hierarchical topology.

Each section first describes the mathematical framework of the model and its application to speech recognition, followed by the algorithms for topology design and parameter estimation, and finally presents some experimental results.

3. Hidden Dependence Trees

Hidden dependence trees are an extension of the discrete dependence trees introduced by Chow and Liu [6] to efficiently model the dependence among a set of random variables. The dependence tree represents a discrete underlying state, and the extension allows for variable-length and continuous-valued observations.

3.1 The Mathematical Framework

In the discrete-state case, the joint probability function P(X) is modeled using a dependence tree [6]. A component x_i is assigned to node i in the tree, and each edge in the tree is associated with the conditional probability function of the two variables connected by the edge, i.e. P(x_i | x_j) for the edge connecting the node of x_i to its parent x_j. The parent of node i, denoted by π(i), is the first node on the path connecting node i to the root. The root of the tree is associated with the component x_0, which is introduced for notational purposes and is not actually a component of X. The nodes x_i connected to the root have π(i) = 0 as their parent, and the edge distributions P(x_i | x_0) are defined to be P(x_i). In other words, the x_i with π(i) = 0 are independent, and hence so are the respective subtrees associated with those nodes. The dependence tree state distribution model is

    P(X) = ∏_{i=1}^N P(x_i | x_π(i)),    (1)

and the joint observation-state distribution is

    p(Y, X) = ∏_{i=1}^N [ ∏_{j=1}^{L_i} p(y_{i,j} | x_i) ] P(x_i | x_π(i)),    (2)

assuming that the {y_{i,j}} are conditionally independent and identically distributed given the state. (Note that upper case P is used to denote a probability mass function, and lower case p a density function.) As in an HMM, the probability of a set of observations is computed by summing over the possible state vectors:

    p(Y) = Σ_X ∏_{i=1}^N p(Y_i | x_i) P(x_i | x_π(i)).    (3)

The sum can be computed efficiently using a recursive algorithm that incorporates probabilities from the leaves upward to the root of the tree, analogous to the forward algorithm for HMMs.
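The leaves-to-root recursion can be sketched compactly. The following Python fragment (hypothetical interfaces, not the authors' implementation) computes log p(Y) from a parent map π, edge distributions, and the per-node observation log-likelihoods of the previous sketch, in the spirit of equation (3).

```python
import numpy as np
from scipy.special import logsumexp

def tree_log_likelihood(parent, edge_logprob, obs_loglik):
    """Leaves-to-root computation of log p(Y) for a hidden dependence tree.

    parent[i]       : pi(i) for i = 1..N, with 0 denoting the dummy root
    edge_logprob[i] : (K_pi(i) x K_i) array of log P(x_i = j | x_pi(i) = k);
                      for children of the root, a (1 x K_i) array of log P(x_i = j)
    obs_loglik[i]   : length-K_i array of log p(Y_i | x_i = j)
    """
    children = {0: []}
    for i in parent:
        children.setdefault(i, [])
        children.setdefault(parent[i], []).append(i)

    def upward(i):
        # up[j] = log p(all observations in the subtree rooted at i | x_i = j)
        if i == 0:
            up = np.zeros(1)                      # dummy root: single state, no observations
        else:
            up = np.asarray(obs_loglik[i], dtype=float)
        for c in children[i]:
            child_up = upward(c)
            # fold in child subtree: log sum_j P(x_c = j | x_i = k) p(subtree of c | x_c = j)
            up = up + logsumexp(edge_logprob[c] + child_up[None, :], axis=1)
        return up

    return upward(0)[0]
```

Treating the dummy root as a single-state node lets the same recursion handle the independent subtrees attached to the root.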

3.2 Application to Speech

The dependence tree state is hidden in the same sense that the mode of a Gaussian mixture distribution is hidden; observations are continuous-valued cepstral features described by Gaussian distributions conditioned on the hidden state. The difference is that the dependence tree state is vector-valued, unlike the scalar mode of a Gaussian mixture distribution. An HMM that uses Gaussian mixture observation distributions also has a multi-dimensional state, but there are important differences with respect to dependence trees. The HMM state sequence is variable-length and time-ordered. In the hidden dependence tree, on the other hand, the state dimension and order are fixed, and there is no notion of time. The state probability distributions in an HMM (a_{ij} = P(s_t = i | s_{t-1} = j)) describe sequence length and ordering, while the state probability distributions in a dependence tree (a_{i,jk} = P(x_i = j | x_π(i) = k)) describe the relationship between the values of states in a fixed order.

The analogy of the hidden dependence tree to a Gaussian mixture and the differences with respect to an HMM suggest an application of the dependence tree in acoustic modeling. Consider an HMM that uses Gaussian mixture distributions. Let x_i = j in the dependence tree indicate that the mixture mode of HMM state i is j. Then the hidden dependence tree provides a model for correlation of the mixture modes across sound classes. With this interpretation, one can envision different applications of the dependence model used in conjunction with an HMM. Assume that an HMM is first used to provide a "transcription" and segmentation of an utterance in terms of the N sound classes in the dependence tree. The transcription is used to group the observations into subsets Y_i, and the hidden dependence tree model is then used to compute p(Y). This probability can be used in a likelihood ratio test of whether two segments of speech came from one vs. two speakers (or in other text-independent speaker/language identification problems), or as an additional "consistency score" in N-best rescoring of hypotheses for word recognition. Alternatively, the observations can be used to re-estimate mixture weights, i.e. w_{ij} = P(x_i = j | Y) for the j-th mixture weight associated with state i, for use in a subsequent decoding pass. This probability is computed using the upward-downward algorithm used for state estimation in model design, described next.

3.3 Topology Design and Parameter Estimation

In this discussion, we will assume a non-hierarchical topology for the dependence tree structure; that is, the classes represented by the tree are disjoint. For the case where X is discrete-valued and fully observable, Chow and Liu [6] describe an algorithm for estimating both the dependence tree structure and its parameters. In our case, where both the tree and subsets of the observations Y_i are unobserved, we divide topology design and parameter estimation into two steps. However, we build on the Chow-Liu algorithm by defining an intermediate, partially observable discrete state, as described below.

Class Definition and Topology. Topology design requires finding π(i) for all nodes i = 1, ..., N. The Chow and Liu algorithm finds the tree topology that minimizes the difference between the information contained in the true probability function and that contained in its approximation by a dependence tree.

This minimization criterion is equivalent to maximizing the total weight on the edges of the tree, where the weight of the edge connecting nodes x_i and x_j is the mutual information I(X_i; X_j) based on relative frequency estimates of their joint probability distribution. Given all possible I(X_i; X_j), topology design is a maximum-weight spanning tree search problem.

The Chow and Liu algorithm works well when the samples of the vector X are complete, meaning that all the components of the samples are observed, and when the number of samples of the vector X is large relative to the number of values an x_i can take on. When there are a small number of samples for a pair of variables, the mutual information estimate is biased above the true value, so (x_i, x_j) pairs that are infrequently observed may be incorrectly assigned links in the tree. In order to use the Chow-Liu algorithm, we estimate a discrete state vector X for each training sample by setting x_i equal to the index of the vector quantization (VQ) codeword that minimizes the total distortion of the observations y_j ∈ Y_i. In order to keep the number of values for x_i small and still have a reasonable sampling of the vector observation space, node-dependent codebooks are designed. Assuming the tree nodes correspond to phonetic or sub-phonetic units, there will be some missing elements of the estimated state vectors, because of the wide variation in frequencies of occurrence of different phonemes. This imbalance of phone-pair occurrence rates can lead to bad estimates of the mutual information and poor tree topologies. To obtain robust trees, we modified the Chow-Liu algorithm to include a threshold on the number of co-occurrences of every phone pair for allowing a link between the pair, and a limit on the number of connections in the tree. In addition, we used random sampling to obtain speaker-level state vectors to reduce the number of missing elements relative to utterance-level vectors.

Parameter Estimation. Two sets of distribution parameters are needed to characterize the hidden dependence tree: the tree edge distributions P(x_i | x_π(i)) and the observation distributions p(y | x_i = j) ~ N(μ_{ij}, Σ_{ij}), where N(μ, Σ) denotes a Gaussian with mean μ and covariance Σ. The above topology design gives an initial estimate for the edge distributions. Initial estimates for μ_{ij} and Σ_{ij} are given by the VQ codewords and associated error covariances. Given an initial estimate, the parameters can be refined based on the actual observations using the iterative EM algorithm. As mentioned earlier, the expectation step involves a recursive upward-downward algorithm that is analogous to the forward-backward algorithm used in HMM parameter estimation. If the VQ-estimated observations of X are available, then the tree edge distributions can be estimated using the upward-downward algorithm for discrete dependence trees [7], which is an extension of Pearl's algorithm for belief propagation in causal trees [8] and a special case of the more general algorithm for Bayesian networks described by Lucke [9]. To estimate both the edge distributions and the observation distributions, the upward-downward algorithm is extended to use the observations Y_i in the upward step during the update of node i, and parameter re-estimation for the node-dependent Gaussians is added to the maximization step [10]. The complete parameter estimation algorithm is similar to that used for HMMs with Gaussian mixture distributions, with the difference being added complexity due to tree-structured rather than time-based dependence.
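The modified topology design described above can be sketched as follows (Python; the co-occurrence threshold and degree limit correspond to the constraints in the text, but the parameter names and the Kruskal-style construction are illustrative assumptions, not the authors' code). It estimates pairwise mutual information from partially observed VQ state vectors and then greedily builds a maximum-weight spanning forest, attaching each resulting subtree to the dummy root node 0.

```python
import numpy as np

def chow_liu_topology(samples, n_values, min_cooccur=50, max_degree=8):
    """Modified Chow-Liu topology design from partially observed state vectors.

    samples : (S, N) int array of VQ indices per speaker-level sample,
              with -1 marking a missing component (phone not observed)
    n_values: number of VQ codewords per node-dependent codebook
    Returns a dict parent[i] for 1-indexed nodes i = 1..N, with 0 the dummy root.
    """
    S, N = samples.shape

    # 1) mutual information weights for sufficiently observed phone pairs
    weights = []
    for a in range(N):
        for b in range(a + 1, N):
            both = (samples[:, a] >= 0) & (samples[:, b] >= 0)
            if both.sum() < min_cooccur:           # too few co-occurrences for a reliable estimate
                continue
            joint = np.zeros((n_values, n_values))
            for xa, xb in zip(samples[both, a], samples[both, b]):
                joint[xa, xb] += 1.0
            joint /= joint.sum()
            pa = joint.sum(axis=1, keepdims=True)
            pb = joint.sum(axis=0, keepdims=True)
            nz = joint > 0
            mi = np.sum(joint[nz] * np.log(joint[nz] / (pa @ pb)[nz]))
            weights.append((mi, a, b))

    # 2) maximum-weight spanning forest (Kruskal) with a limit on node degree
    comp = list(range(N))
    def find(i):
        while comp[i] != i:
            comp[i] = comp[comp[i]]
            i = comp[i]
        return i
    degree = [0] * N
    edges = []
    for mi, a, b in sorted(weights, reverse=True):
        ra, rb = find(a), find(b)
        if ra != rb and degree[a] < max_degree and degree[b] < max_degree:
            comp[ra] = rb
            degree[a] += 1
            degree[b] += 1
            edges.append((a, b))

    # 3) orient each component into a rooted tree hanging off the dummy root 0
    adj = {i: [] for i in range(N)}
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    parent, visited = {}, set()
    for r in range(N):
        if r in visited:
            continue
        parent[r + 1] = 0
        visited.add(r)
        stack = [r]
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if v not in visited:
                    visited.add(v)
                    parent[v + 1] = u + 1
                    stack.append(v)
    return parent
```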

3.4 Experiments

Experiments assessing various methods for topology training and the usefulness of the hidden dependence model were conducted on two large vocabulary continuous speech recognition tasks using the Wall Street Journal [11] and the Switchboard [12] corpora, in both cases training on roughly 120 hours of speech. The WSJ corpus is based on read business news, and the Switchboard corpus comprises telephone-quality conversational speech on a variety of topics. The feature vectors included 14 mel-warped cepstra (no derivatives), computed at a 10 ms frame rate using cepstral mean subtraction. In the experiments on Switchboard, the features were also normalized to compensate for vocal tract length [1]. Each node of the X vector was associated with a phone (i.e. context-independent models), and the dependence tree models used only the frames in the center of the phone segment to minimize coarticulation influences.

In development of the topology design approach, we evaluated the performance of the dependence tree models by computing the likelihood of an independent test set. The results showed that the dependence tree performed better than an independent-phone model, and that constraints on the topology of the tree generally improved performance. In addition, the automatically designed dependence tree outperformed a tree that had been specified by hand according to manner of articulation and other differences in articulatory features. As an example, the tree topology designed on the Switchboard corpus is given in Figure 2, illustrating learned dependence that reflects articulation manner (e.g. among fricatives and nasals) but also some connections that were probably dominated by co-articulation effects (e.g. /aa/-/er/, since /aa/ is often followed by /r/).

We evaluated the performance of the hidden dependence tree model by using the likelihood p(Y) as an additional score in N-best rescoring experiments. In these experiments an HMM-based recognition system from BBN [13] provided an N-best list of hypotheses (N = 100) for all the utterances in the test set, along with an HMM acoustic score and a trigram language model score for each hypothesis. These hypotheses were rescored by the hidden dependence tree model. A linear combination of these scores plus word and phone counts (insertion penalties) was used for re-ranking the list of hypotheses and producing the final recognized output. The weights of the different scores were optimized on a development test set. The dependence tree model used in this experiment was a gender-dependent model with 10 node-dependent codewords per phone and a constrained tree. Table 1 shows the results of these experiments. There is a slight improvement when combining the likelihood score obtained from the dependence tree model with the HMM score. Further gains should be obtained by using more detailed sound classes, such as triphone states, but the resulting dependence tree would be large and would likely require a hybrid hierarchical and Chow-Liu topology design strategy.
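For reference, the re-ranking step described above amounts to a weighted log-linear combination of knowledge sources. The sketch below (Python; the field names are hypothetical, and the weight values are placeholders for the ones tuned on the development set) shows how an N-best list could be re-ranked once per-hypothesis scores are available.

```python
def rerank_nbest(hypotheses, w_hmm=1.0, w_dt=0.2, w_lm=0.8, w_word=-0.5, w_phone=-0.1):
    """Re-rank an N-best list by a linear combination of log-domain scores.

    Each hypothesis is assumed to be a dict with fields 'hmm', 'dt', 'lm'
    (log scores) and 'n_words', 'n_phones' (insertion-penalty counts).
    """
    def total(h):
        return (w_hmm * h['hmm'] + w_dt * h['dt'] + w_lm * h['lm']
                + w_word * h['n_words'] + w_phone * h['n_phones'])
    return max(hypotheses, key=total)
```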

Fig. 2. Discrete dependence tree designed on the Switchboard corpus, where subtrees connected to the root node (indicated by "@") are independent.

Table 1. N-best rescoring results (word error rates) on the 1993 WSJ evaluation test and the 1996 SWBD evaluation test. The knowledge sources are the HMM acoustic score, the dependence tree score (DT), and a trigram language model (LM).

    Knowledge Sources    WSJ Eval93    SWBD Eval96
    HMM, LM
    HMM, DT, LM

4. Multiscale Tree Processes

Multiscale stochastic processes represent an important class of models, of which a particularly useful subclass is based on scale-recursive dynamics on trees [14, 15]. These models allow efficient algorithms for both estimation and likelihood calculation, resulting in a variety of applications. In this section, we describe the general framework and its application to acoustic model adaptation in speech recognition.

4.1 The Mathematical Framework

Denoting a node in a tree by t, with parent¹ t̄, a state-space model for the evolution on the tree of the Gaussian process X and its noisy observation Y is given by

    x_t = A_t x_{t̄} + w_t,    (4)
    y_{t,i} = C_t x_t + v_{t,i},    (5)

where x_t is the vector state of the process at node t. The root node state x_0 has distribution N(0, Σ_0). The process noise w_t is white, independent of x_0, and has distribution N(0, Q_t). The state x_t is observed via a noisy measurement y_{t,i}, where the measurement noise v_{t,i} is white, independent of x_0 and w_t, and has distribution N(0, R_t). Thus, θ_x = (Σ_0, {A_t, Q_t}) are the parameters of p(X), and θ_{y|x} = ({C_t, R_t}) are the parameters of p(Y|X). The zero-mean assumptions on the root node x_0 and the noise terms are not a requirement of the model, but result in simpler estimation equations.

¹ The notation t̄ represents the same information as π(i) for the hidden dependence tree; the two notations are used to be consistent with other literature in the respective areas.

A degenerate tree with only one leaf node (parent nodes have only one child) can be interpreted as a standard linear dynamical system, i.e. having a time-like index. As an acoustic model for speech recognition, the standard dynamical system is a continuous-state alternative to the discrete-state HMM, where the likelihood is computed using Kalman filtering recursions to obtain innovations and associated distribution parameters [16]. A similar approach can be used for the multiscale model by extending the Kalman recursions to the tree [17].

For the adaptation application, state estimation is more important than likelihood computation. Given Y, the set of all available observations, the smoothed estimate² of the state, x̂_t = E{x_t | Y}, and the associated error covariance P_{t|Y} = E{[x_t - x̂_t][x_t - x̂_t]^T} are computed using a generalization of the Rauch-Tung-Striebel (RTS) algorithm [14]. Smoothing is done in two sweeps: an upward sweep from the leaves to the root, followed by a downward one from the root to the leaves. The complexity of the tree RTS smoother is O(d³N), where d is the dimensionality of the state (the d³ term is due to matrix inversions) and N is the number of nodes in the tree.

² The term "smoothed estimate" refers to the linear least squares (or, for Gaussians, the minimum mean squares) estimate. It also corresponds to the maximum a posteriori estimate.

4.2 Application to Speech

The multiscale model can be used for the adaptation of the means of acoustic models to a new speaker or new environmental conditions. For example, let each leaf τ of the multiscale tree be associated with a set of Gaussians G_τ, and adapt the means of all Gaussians in class τ by a common shared shift x_τ:

    μ_i' = μ_i + x_τ,  i ∈ G_τ,    (6)

where μ_i' denotes the mean μ_i after adaptation. Such a shared shift approach has been used for Gaussians in hidden Markov models (HMMs) [3] and the stochastic segment model (SSM) [18], and for polynomial segment models (PSMs) [19]. The observations y_{τ,i} ∈ Y_τ associated with node τ are differences between the speaker-independent means μ_i and the average of feature vectors observed for sound i ∈ G_τ in an utterance.
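The generative structure defined by equations (4) and (5) can be made concrete with a small simulation (a Python sketch under illustrative assumptions: shared A, Q, C, R across nodes and a hypothetical parent map; the chapter allows these to be node-dependent).

```python
import numpy as np

def simulate_tree_process(parent, A, Q, C, R, sigma0, n_obs, seed=0):
    """Draw one sample of the multiscale process of Eqs. (4)-(5) on a tree.

    parent[t] : parent node of t (node 0 is the root); parents are assumed
                to have smaller indices than their children
    A, Q      : shared state transition matrix and process-noise covariance
    C, R      : shared observation matrix and measurement-noise covariance
    sigma0    : covariance of the root state x_0
    n_obs[t]  : number of noisy observations attached to node t
    """
    rng = np.random.default_rng(seed)
    d = A.shape[0]
    x = {0: rng.multivariate_normal(np.zeros(d), sigma0)}      # x_0 ~ N(0, Sigma_0)
    y = {}
    for t in sorted(parent):                                   # visit parents before children
        w = rng.multivariate_normal(np.zeros(d), Q)            # process noise w_t
        x[t] = A @ x[parent[t]] + w                            # Eq. (4)
        y[t] = [C @ x[t] + rng.multivariate_normal(np.zeros(R.shape[0]), R)
                for _ in range(n_obs.get(t, 0))]               # Eq. (5)
    return x, y
```

In the adaptation setting above, the leaf states x_τ play the role of the shared mean shifts, and the y_{τ,i} are the per-Gaussian observed shifts.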

Initial independent estimates for the shift x_τ and the associated error covariance P_τ can be obtained from adaptation data for each class τ by averaging the observed shifts for that class (y_{τ,i} ∈ Y_τ) and computing the equivalent covariance of the averaged variable [19]. Let us define a Gaussian tree-based shift process (Equation 4) with M leaves, and associate the leaf node states with the shifts of the M classes we wish to model dependence between. Given these initial shift estimates and covariances at a subset of leaves, adaptation involves estimating the hidden states x_τ for all leaves using the tree RTS smoother and then shifting the models within the respective classes. Due to the Bayesian nature of the estimate, the smoothed shift approaches the unsmoothed shift and converges to the standard ML speaker-dependent estimate as the amount of adaptation data increases.

A similar tree-based model for adaptation is described in [20], but the upward-downward propagation of mean shifts is based on a heuristic that does not account for the degree of correlation or variance differences. Another Markovian model used in adaptation is based on Markov random fields [21]. Multiscale models offer a number of advantages over Markov random fields, including a constant per-node complexity, the availability of an error covariance associated with smoothing, and the fact that state estimation algorithms are efficient, non-iterative, recursive and parallelizable.

4.3 Topology Design and Parameter Estimation

To use the multiscale model of dependence in adaptation, we need to define the adaptation classes and tree topology, as well as estimate parameters for the process.

Class Definition and Topology. In continuous speech recognition, context-dependent models are frequently clustered in the form of a tree for each region (or state) of a phone using ML clustering of Gaussians [22, 23]. Figure 3(a) illustrates the tree for one region. Each node of the tree represents an equivalence class of triphones. Nodes at a certain "cut" through the tree, the boxes in Figure 3(a), define terminal adaptation classes that share shifts (Equation 6). One popular, but ad hoc, option for adaptation is the "back-off" strategy, where the shift is computed at the most detailed node which has more than some threshold of adaptation frames and copied to all child terminal adaptation classes, as shown in Figure 3(b). The topology of the clustering tree can also be used for multiscale smoothing. Class-dependent shifts x_t are computed at the terminal adaptation nodes and then smoothed using the multiscale model (Figure 3(c)) to get shift estimates x̂_t at all nodes.

Parameter Estimation. Maximum-likelihood estimates of the parameters of the tree process (Σ_0, A_t, Q_t, C_t, R_t) can be obtained by applying the RTS and EM algorithms to multiple independent sample vectors Y [24], where each conversation side contributes one sample. The general approach follows that described in [16] for a time-ordered dynamical system, which involves iteratively finding expected sufficient statistics of the hidden state (E-step), and then using multivariate regression to compute new process parameters (M-step). The main difference is in the recursions used in the E-step, which build on the tree algorithms developed in [14, 25]. Here, we assume C_t = I, and R_t is effectively given by the sample variance of the observations, so there is no need to estimate C_t and R_t. In the experiments described next, the A and Q parameters are shared among all nodes of a phone; i.e. for K phones, there are K trees, each with an (A, Q) pair.
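The construction of the class-level ML shift observations used both as input to the smoother and, below, to initialize the EM iterations can be sketched as follows (Python; the per-Gaussian statistics and the equal-weight averaging are illustrative assumptions consistent with the description above, not the exact estimator of [19]).

```python
import numpy as np

def initial_class_shifts(adapt_means, si_means, classes, R):
    """Initial ML shift and covariance per terminal adaptation class.

    adapt_means[i] : average adaptation-data feature vector for Gaussian i
                     (None if Gaussian i was not observed)
    si_means[i]    : speaker-independent mean of Gaussian i
    classes[tau]   : list of Gaussian indices i belonging to class tau
    R              : covariance assumed for a single observed shift
    Returns dicts x_hat[tau], P[tau] for classes with at least one observation.
    """
    x_hat, P = {}, {}
    for tau, members in classes.items():
        shifts = [adapt_means[i] - si_means[i] for i in members
                  if adapt_means[i] is not None]          # observed shifts y_{tau,i}
        if shifts:
            x_hat[tau] = np.mean(shifts, axis=0)          # averaged shift for the class
            P[tau] = R / len(shifts)                      # covariance of the averaged variable
    return x_hat, P
```

These per-class estimates at observed leaves are the quantities the tree RTS smoother takes as input; leaves with no adaptation data receive smoothed shifts through the tree.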

Fig. 3. Trees used for adaptation: (a) shows the clustering tree with squares indicating terminal adaptation classes, (b) illustrates the "back-off" method of adaptation with dashed squares indicating back-off classes, and (c) shows the corresponding multiscale smoothing approach. In both (b) and (c), triangles indicate observations.

To start the EM iterations, we need initial estimates of Σ_0, the A's and the Q's. For each speaker in the training set, we compute covariances of ML (unsmoothed) shifts at each terminal shift node. A frequency-weighted average of these covariances across all speakers is used for initializing Σ_0 and all Q_t, and the initial A_t = I for all t.

4.4 Experiments

Experiments were conducted on the Switchboard corpus. The feature vectors were the same as for the hidden dependence tree experiments, except that energy and feature derivatives were used. N-best rescoring is also used here, with the segment-model acoustic score substituted for the HMM score. Most experiments use sixty hours of speech for training the acoustic and multiscale (MS) models; 123 hours are used in the guided adaptation experiments. The PSM systems used a 2-region model, with each region modeled by a linear trajectory Gaussian process with a single full covariance. The SSM systems used a 5-region model, with each region represented by a full covariance Gaussian. Both cases used gender-dependent models and ML clustered triphones. The PSM and SSM adaptation systems had 300 and 150 terminal adaptation classes per region, respectively. For a fair comparison of MS smoothing vs. back-off approaches, the same topology is used for both types of adaptation.

In batch-mode adaptation, the first half of each conversation is used as adaptation data and the second half for testing. The results in Table 2 for supervised adaptation indicate that MS smoothing is better than the back-off approach. However, performance for both algorithms degrades relative to the baseline in unsupervised adaptation, indicating a sensitivity to the high error rate in the Switchboard tasks. Consequently, we use guided adaptation in further unsupervised experiments.

In unsupervised transcription-mode adaptation, two passes are made over the speech: the first to collect statistics for adaptation, and the second to perform recognition with the adapted models. Adaptation is guided in the sense that we adapt only with data from the subset of words in the top first-pass hypothesis with confidence over a specified threshold.

Table 2. Supervised batch recognition with 2-region PSMs. Error rates are on the second half of conversations in the Dev96 test set.

    SI baseline    44.5%

This serves to lower the error rate for the speech used in adaptation, which benefits both ML and MS adaptation. Guided adaptation also tests the generalization capability of the dependence model for unseen classes (in the "incorrect" parts of the speech). In experiments on the Dev97 test set with a 5-region SSM, we found that the ML back-off approach improved a 40.9% WER baseline³ to 40.4%, and that further improvement to 40.0% was obtained with MS smoothing. Table 3 shows that MS adaptation gives about 1% absolute improvement in performance. A much greater gain is expected from using lattice decoding rather than N-best rescoring, based on BBN adaptation experiments [26].

³ The lower baseline error rate is due to differences in the test set, language model, signal processing parameters, and training on 123 hours of speech.

Table 3. Guided unsupervised transcription-mode adaptation with a 5-region SSM system.

                Dev97    Eval97
    Baseline
    MS adapt

5. Discussion

In summary, tree-based models of dependence provide an efficient framework for representing correlation across phones in speech (or sub-phonetic units represented by HMM states), for use in adaptation as well as other applications. Dependence models are a supplement to, and not a replacement for, existing techniques such as HMMs, in that they model correlation across classes but not time. Markov-like assumptions combined with a tree structure make for efficient algorithms for computing the expected state given a set of observations. Dependence model design involves first finding the tree topology, which can be a direct connection of classes or a hierarchy of sub-classes, and then EM parameter estimation using an upward-downward algorithm for handling the hidden state. Two important examples are described: the hidden dependence tree, which has a discrete hidden state and can be thought of as a mechanism for loosely coupling Gaussian mixture modes of different models; and the multiscale model, which has a continuous hidden state and relates models via a hierarchy with different levels of granularity. The two approaches differ primarily in the discrete vs. continuous hidden state, but they also illustrate two extremes of topology design.

It is an open question as to which of the two models is more useful: the hidden dependence tree is better at capturing non-linear dependence between classes, but the mixture mode dependence may be too weak a coupling. Initial results for both models are promising, but much work remains to explore their full potential. For example, the speaker adaptation experiments did not take advantage of several variations known to improve results, such as speaker-adaptive training [27] and iterative adaptation and decoding.

Several questions are raised by the initial experimental results, particularly related to topology design and parameter tying. How can we best integrate the mutual information clustering technique, which is more general but not very robust, with hierarchical clustering techniques? What is the right number of classes to represent with the tree? For adaptation, theoretically it is better to use a large number of classes, but in practice we do not find this to be the case, probably because of inaccuracies in the model exacerbated by parameter tying assumptions. In the multiscale model experiments, we assumed that all branches of the tree for a particular phone shared the same transition matrix and process noise covariance. Is it possible to learn finer-grained parameter sharing automatically? Of course, there is also the question of whether better results can be obtained by relaxing the tree-structure assumption and using less restrictive models such as Bayesian networks. However, it is likely that the computational efficiency of the tree structure will make the tree-based dependence models more attractive in the near term.

Acknowledgement. This work was supported by the United States DoD, grant ONR-N00014-92-J.

References

[1] E. Eide and H. Gish, "A parametric approach to vocal tract length normalization," Proc. Inter. Conf. on Acoust., Speech and Signal Proc., vol. 1, May.
[2] C. J. Leggetter and P. C. Woodland, "Flexible speaker adaptation using maximum likelihood linear regression," Proc. ARPA Workshop on Spoken Language Technology, January.
[3] G. Zavaliagkos, R. Schwartz, J. McDonough, and J. Makhoul, "Adaptation algorithms for large scale HMM recognizers," Proc. European Conference on Speech Comm. and Tech., vol. 2, pp. 1131-1134, September.
[4] Q. Huo and C.-H. Lee, "On-line adaptive learning of the correlated continuous density hidden Markov models for speech recognition," Proc. Inter. Conf. on Acoust., Speech and Signal Proc., vol. 2, May.
[5] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society (B), vol. 39, no. 1, pp. 1-38, 1977.
[6] C. K. Chow and C. N. Liu, "Approximating discrete probability distributions with dependence trees," IEEE Trans. Information Theory, vol. IT-14, no. 3, May 1968.
[7] O. Ronen, J. R. Rohlicek, and M. Ostendorf, "Parameter estimation of dependence tree models using the EM algorithm," IEEE Signal Processing Letters, vol. 2, no. 8, 1995.

[8] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann, San Mateo, CA, 1988.
[9] H. Lucke, "Which stochastic models allow Baum-Welch training?" IEEE Trans. Signal Proc., vol. 44, no. 11.
[10] O. Ronen, Dependence Tree Models of Intra-Utterance Phone Dependence, Boston University Ph.D. Thesis.
[11] F. Kubala et al., "The hub and spoke paradigm for CSR evaluation," Proc. of the ARPA Human Language Technology Workshop, March.
[12] J. J. Godfrey, E. C. Holliman, and J. McDaniel, "SWITCHBOARD: Telephone speech corpus for research and development," Proc. Inter. Conf. Acoust., Speech, and Signal Proc., vol. 1, March.
[13] L. Nguyen et al., "The 1994 BBN/BYBLOS speech recognition system," Proc. of the ARPA Spoken Language Systems Technology Workshop, January.
[14] K. C. Chou, A. S. Willsky, and A. Benveniste, "Multiscale recursive estimation, data fusion, and regularization," IEEE Trans. Automatic Control, vol. 39, no. 3.
[15] M. R. Luettgen, W. C. Karl, A. S. Willsky, and R. R. Tenney, "Multiscale representations of Markov random fields," IEEE Trans. Signal Proc., vol. 41, no. 12.
[16] V. Digalakis, J. R. Rohlicek, and M. Ostendorf, "ML estimation of a stochastic linear system with the EM algorithm and its application to speech recognition," IEEE Trans. Speech and Audio Proc., vol. 1, no. 4.
[17] M. R. Luettgen and A. S. Willsky, "Likelihood calculation for a class of multiscale stochastic models, with application to texture discrimination," IEEE Trans. Image Proc., vol. 4, no. 2.
[18] A. Kannan and M. Ostendorf, "Modeling dependency in adaptation of acoustic models using multiscale tree processes," Proc. Eurospeech, vol. 4.
[19] A. Kannan and M. Ostendorf, "Adaptation of polynomial trajectory segment models for large vocabulary speech recognition," Proc. Inter. Conf. Acoust., Speech and Signal Proc., vol. 2, April.
[20] D. Paul, "Extensions to phone-state decision-tree clustering: single tree and tagged clustering," Proc. Inter. Conf. Acoust., Speech and Signal Proc., vol. 2, April.
[21] B. M. Shahshahani, "A Markov random field approach to Bayesian speaker adaptation," IEEE Trans. Speech and Audio Proc., vol. 5, no. 2.
[22] A. Kannan, M. Ostendorf and J. R. Rohlicek, "Maximum likelihood clustering of Gaussians for speech recognition," IEEE Trans. Speech and Audio Proc., vol. 2, no. 3.
[23] S. J. Young, J. J. Odell and P. C. Woodland, "Tree-based state tying for high accuracy acoustic modeling," Proc. ARPA Workshop on Human Language Technology, March.
[24] A. Kannan, M. Ostendorf, D. A. Castanon, and W. C. Karl, "ML parameter estimation of a multiscale tree process using the EM algorithm," Technical Report ECE, Boston University, November. Available from ftp://raven.bu.edu/pub/reports.
[25] M. R. Luettgen and A. S. Willsky, "Multiscale smoothing error models," IEEE Trans. Automatic Control, vol. 40, no. 1.
[26] G. Zavaliagkos, personal communication.
[27] T. Anastasakos, J. McDonough, R. Schwartz, and J. Makhoul, "A compact model for speaker-adaptive training," Proc. of the Inter. Conf. on Spoken Language Processing, vol. 2, October 1996.


More information

Mono-font Cursive Arabic Text Recognition Using Speech Recognition System

Mono-font Cursive Arabic Text Recognition Using Speech Recognition System Mono-font Cursive Arabic Text Recognition Using Speech Recognition System M.S. Khorsheed Computer & Electronics Research Institute, King AbdulAziz City for Science and Technology (KACST) PO Box 6086, Riyadh

More information

Clustering. CS294 Practical Machine Learning Junming Yin 10/09/06

Clustering. CS294 Practical Machine Learning Junming Yin 10/09/06 Clustering CS294 Practical Machine Learning Junming Yin 10/09/06 Outline Introduction Unsupervised learning What is clustering? Application Dissimilarity (similarity) of objects Clustering algorithm K-means,

More information

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 8: GRAPHICAL MODELS

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 8: GRAPHICAL MODELS PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 8: GRAPHICAL MODELS Bayesian Networks Directed Acyclic Graph (DAG) Bayesian Networks General Factorization Bayesian Curve Fitting (1) Polynomial Bayesian

More information

High throughput Data Analysis 2. Cluster Analysis

High throughput Data Analysis 2. Cluster Analysis High throughput Data Analysis 2 Cluster Analysis Overview Why clustering? Hierarchical clustering K means clustering Issues with above two Other methods Quality of clustering results Introduction WHY DO

More information

Automatic basis selection for RBF networks using Stein s unbiased risk estimator

Automatic basis selection for RBF networks using Stein s unbiased risk estimator Automatic basis selection for RBF networks using Stein s unbiased risk estimator Ali Ghodsi School of omputer Science University of Waterloo University Avenue West NL G anada Email: aghodsib@cs.uwaterloo.ca

More information

Optimization of HMM by the Tabu Search Algorithm

Optimization of HMM by the Tabu Search Algorithm JOURNAL OF INFORMATION SCIENCE AND ENGINEERING 20, 949-957 (2004) Optimization of HMM by the Tabu Search Algorithm TSONG-YI CHEN, XIAO-DAN MEI *, JENG-SHYANG PAN AND SHENG-HE SUN * Department of Electronic

More information

Workshop report 1. Daniels report is on website 2. Don t expect to write it based on listening to one project (we had 6 only 2 was sufficient

Workshop report 1. Daniels report is on website 2. Don t expect to write it based on listening to one project (we had 6 only 2 was sufficient Workshop report 1. Daniels report is on website 2. Don t expect to write it based on listening to one project (we had 6 only 2 was sufficient quality) 3. I suggest writing it on one presentation. 4. Include

More information

Challenges motivating deep learning. Sargur N. Srihari

Challenges motivating deep learning. Sargur N. Srihari Challenges motivating deep learning Sargur N. srihari@cedar.buffalo.edu 1 Topics In Machine Learning Basics 1. Learning Algorithms 2. Capacity, Overfitting and Underfitting 3. Hyperparameters and Validation

More information

2-2-2, Hikaridai, Seika-cho, Soraku-gun, Kyoto , Japan 2 Graduate School of Information Science, Nara Institute of Science and Technology

2-2-2, Hikaridai, Seika-cho, Soraku-gun, Kyoto , Japan 2 Graduate School of Information Science, Nara Institute of Science and Technology ISCA Archive STREAM WEIGHT OPTIMIZATION OF SPEECH AND LIP IMAGE SEQUENCE FOR AUDIO-VISUAL SPEECH RECOGNITION Satoshi Nakamura 1 Hidetoshi Ito 2 Kiyohiro Shikano 2 1 ATR Spoken Language Translation Research

More information

10701 Machine Learning. Clustering

10701 Machine Learning. Clustering 171 Machine Learning Clustering What is Clustering? Organizing data into clusters such that there is high intra-cluster similarity low inter-cluster similarity Informally, finding natural groupings among

More information

Optimization of Observation Membership Function By Particle Swarm Method for Enhancing Performances of Speaker Identification

Optimization of Observation Membership Function By Particle Swarm Method for Enhancing Performances of Speaker Identification Proceedings of the 6th WSEAS International Conference on SIGNAL PROCESSING, Dallas, Texas, USA, March 22-24, 2007 52 Optimization of Observation Membership Function By Particle Swarm Method for Enhancing

More information

Regularization and model selection

Regularization and model selection CS229 Lecture notes Andrew Ng Part VI Regularization and model selection Suppose we are trying select among several different models for a learning problem. For instance, we might be using a polynomial

More information

Expectation Maximization (EM) and Gaussian Mixture Models

Expectation Maximization (EM) and Gaussian Mixture Models Expectation Maximization (EM) and Gaussian Mixture Models Reference: The Elements of Statistical Learning, by T. Hastie, R. Tibshirani, J. Friedman, Springer 1 2 3 4 5 6 7 8 Unsupervised Learning Motivation

More information

A ROBUST SPEAKER CLUSTERING ALGORITHM

A ROBUST SPEAKER CLUSTERING ALGORITHM A ROBUST SPEAKER CLUSTERING ALGORITHM J. Ajmera IDIAP P.O. Box 592 CH-1920 Martigny, Switzerland jitendra@idiap.ch C. Wooters ICSI 1947 Center St., Suite 600 Berkeley, CA 94704, USA wooters@icsi.berkeley.edu

More information

Clustering Lecture 5: Mixture Model

Clustering Lecture 5: Mixture Model Clustering Lecture 5: Mixture Model Jing Gao SUNY Buffalo 1 Outline Basics Motivation, definition, evaluation Methods Partitional Hierarchical Density-based Mixture model Spectral methods Advanced topics

More information

Context based optimal shape coding

Context based optimal shape coding IEEE Signal Processing Society 1999 Workshop on Multimedia Signal Processing September 13-15, 1999, Copenhagen, Denmark Electronic Proceedings 1999 IEEE Context based optimal shape coding Gerry Melnikov,

More information

Regularization and Markov Random Fields (MRF) CS 664 Spring 2008

Regularization and Markov Random Fields (MRF) CS 664 Spring 2008 Regularization and Markov Random Fields (MRF) CS 664 Spring 2008 Regularization in Low Level Vision Low level vision problems concerned with estimating some quantity at each pixel Visual motion (u(x,y),v(x,y))

More information

Skill. Robot/ Controller

Skill. Robot/ Controller Skill Acquisition from Human Demonstration Using a Hidden Markov Model G. E. Hovland, P. Sikka and B. J. McCarragher Department of Engineering Faculty of Engineering and Information Technology The Australian

More information

Clustering: Classic Methods and Modern Views

Clustering: Classic Methods and Modern Views Clustering: Classic Methods and Modern Views Marina Meilă University of Washington mmp@stat.washington.edu June 22, 2015 Lorentz Center Workshop on Clusters, Games and Axioms Outline Paradigms for clustering

More information

ECE521: Week 11, Lecture March 2017: HMM learning/inference. With thanks to Russ Salakhutdinov

ECE521: Week 11, Lecture March 2017: HMM learning/inference. With thanks to Russ Salakhutdinov ECE521: Week 11, Lecture 20 27 March 2017: HMM learning/inference With thanks to Russ Salakhutdinov Examples of other perspectives Murphy 17.4 End of Russell & Norvig 15.2 (Artificial Intelligence: A Modern

More information

A Model Selection Criterion for Classification: Application to HMM Topology Optimization

A Model Selection Criterion for Classification: Application to HMM Topology Optimization A Model Selection Criterion for Classification Application to HMM Topology Optimization Alain Biem IBM T. J. Watson Research Center P.O Box 218, Yorktown Heights, NY 10549, USA biem@us.ibm.com Abstract

More information

Estimating Human Pose in Images. Navraj Singh December 11, 2009

Estimating Human Pose in Images. Navraj Singh December 11, 2009 Estimating Human Pose in Images Navraj Singh December 11, 2009 Introduction This project attempts to improve the performance of an existing method of estimating the pose of humans in still images. Tasks

More information

Constraints in Particle Swarm Optimization of Hidden Markov Models

Constraints in Particle Swarm Optimization of Hidden Markov Models Constraints in Particle Swarm Optimization of Hidden Markov Models Martin Macaš, Daniel Novák, and Lenka Lhotská Czech Technical University, Faculty of Electrical Engineering, Dep. of Cybernetics, Prague,

More information

Multiple Constraint Satisfaction by Belief Propagation: An Example Using Sudoku

Multiple Constraint Satisfaction by Belief Propagation: An Example Using Sudoku Multiple Constraint Satisfaction by Belief Propagation: An Example Using Sudoku Todd K. Moon and Jacob H. Gunther Utah State University Abstract The popular Sudoku puzzle bears structural resemblance to

More information

Discriminative Training and Adaptation of Large Vocabulary ASR Systems

Discriminative Training and Adaptation of Large Vocabulary ASR Systems Discriminative Training and Adaptation of Large Vocabulary ASR Systems Phil Woodland March 30th 2004 ICSI Seminar: March 30th 2004 Overview Why use discriminative training for LVCSR? MMIE/CMLE criterion

More information

Confidence Measures: how much we can trust our speech recognizers

Confidence Measures: how much we can trust our speech recognizers Confidence Measures: how much we can trust our speech recognizers Prof. Hui Jiang Department of Computer Science York University, Toronto, Ontario, Canada Email: hj@cs.yorku.ca Outline Speech recognition

More information

VIDEO OBJECT SEGMENTATION BY EXTENDED RECURSIVE-SHORTEST-SPANNING-TREE METHOD. Ertem Tuncel and Levent Onural

VIDEO OBJECT SEGMENTATION BY EXTENDED RECURSIVE-SHORTEST-SPANNING-TREE METHOD. Ertem Tuncel and Levent Onural VIDEO OBJECT SEGMENTATION BY EXTENDED RECURSIVE-SHORTEST-SPANNING-TREE METHOD Ertem Tuncel and Levent Onural Electrical and Electronics Engineering Department, Bilkent University, TR-06533, Ankara, Turkey

More information

Epitomic Analysis of Human Motion

Epitomic Analysis of Human Motion Epitomic Analysis of Human Motion Wooyoung Kim James M. Rehg Department of Computer Science Georgia Institute of Technology Atlanta, GA 30332 {wooyoung, rehg}@cc.gatech.edu Abstract Epitomic analysis is

More information

Machine Learning. Sourangshu Bhattacharya

Machine Learning. Sourangshu Bhattacharya Machine Learning Sourangshu Bhattacharya Bayesian Networks Directed Acyclic Graph (DAG) Bayesian Networks General Factorization Curve Fitting Re-visited Maximum Likelihood Determine by minimizing sum-of-squares

More information

Machine Learning A W 1sst KU. b) [1 P] Give an example for a probability distributions P (A, B, C) that disproves

Machine Learning A W 1sst KU. b) [1 P] Give an example for a probability distributions P (A, B, C) that disproves Machine Learning A 708.064 11W 1sst KU Exercises Problems marked with * are optional. 1 Conditional Independence I [2 P] a) [1 P] Give an example for a probability distribution P (A, B, C) that disproves

More information

CS Introduction to Data Mining Instructor: Abdullah Mueen

CS Introduction to Data Mining Instructor: Abdullah Mueen CS 591.03 Introduction to Data Mining Instructor: Abdullah Mueen LECTURE 8: ADVANCED CLUSTERING (FUZZY AND CO -CLUSTERING) Review: Basic Cluster Analysis Methods (Chap. 10) Cluster Analysis: Basic Concepts

More information

Audio-visual interaction in sparse representation features for noise robust audio-visual speech recognition

Audio-visual interaction in sparse representation features for noise robust audio-visual speech recognition ISCA Archive http://www.isca-speech.org/archive Auditory-Visual Speech Processing (AVSP) 2013 Annecy, France August 29 - September 1, 2013 Audio-visual interaction in sparse representation features for

More information

A Graph Theoretic Approach to Image Database Retrieval

A Graph Theoretic Approach to Image Database Retrieval A Graph Theoretic Approach to Image Database Retrieval Selim Aksoy and Robert M. Haralick Intelligent Systems Laboratory Department of Electrical Engineering University of Washington, Seattle, WA 98195-2500

More information

Pattern Clustering with Similarity Measures

Pattern Clustering with Similarity Measures Pattern Clustering with Similarity Measures Akula Ratna Babu 1, Miriyala Markandeyulu 2, Bussa V R R Nagarjuna 3 1 Pursuing M.Tech(CSE), Vignan s Lara Institute of Technology and Science, Vadlamudi, Guntur,

More information

Understanding Clustering Supervising the unsupervised

Understanding Clustering Supervising the unsupervised Understanding Clustering Supervising the unsupervised Janu Verma IBM T.J. Watson Research Center, New York http://jverma.github.io/ jverma@us.ibm.com @januverma Clustering Grouping together similar data

More information

Repeating Segment Detection in Songs using Audio Fingerprint Matching

Repeating Segment Detection in Songs using Audio Fingerprint Matching Repeating Segment Detection in Songs using Audio Fingerprint Matching Regunathan Radhakrishnan and Wenyu Jiang Dolby Laboratories Inc, San Francisco, USA E-mail: regu.r@dolby.com Institute for Infocomm

More information

CS6375: Machine Learning Gautam Kunapuli. Mid-Term Review

CS6375: Machine Learning Gautam Kunapuli. Mid-Term Review Gautam Kunapuli Machine Learning Data is identically and independently distributed Goal is to learn a function that maps to Data is generated using an unknown function Learn a hypothesis that minimizes

More information

On the Parameter Estimation of the Generalized Exponential Distribution Under Progressive Type-I Interval Censoring Scheme

On the Parameter Estimation of the Generalized Exponential Distribution Under Progressive Type-I Interval Censoring Scheme arxiv:1811.06857v1 [math.st] 16 Nov 2018 On the Parameter Estimation of the Generalized Exponential Distribution Under Progressive Type-I Interval Censoring Scheme Mahdi Teimouri Email: teimouri@aut.ac.ir

More information

Robustness of Non-Exact Multi-Channel Equalization in Reverberant Environments

Robustness of Non-Exact Multi-Channel Equalization in Reverberant Environments Robustness of Non-Exact Multi-Channel Equalization in Reverberant Environments Fotios Talantzis and Lazaros C. Polymenakos Athens Information Technology, 19.5 Km Markopoulo Ave., Peania/Athens 19002, Greece

More information

Performance Characterization in Computer Vision

Performance Characterization in Computer Vision Performance Characterization in Computer Vision Robert M. Haralick University of Washington Seattle WA 98195 Abstract Computer vision algorithms axe composed of different sub-algorithms often applied in

More information

Approximate Discrete Probability Distribution Representation using a Multi-Resolution Binary Tree

Approximate Discrete Probability Distribution Representation using a Multi-Resolution Binary Tree Approximate Discrete Probability Distribution Representation using a Multi-Resolution Binary Tree David Bellot and Pierre Bessière GravirIMAG CNRS and INRIA Rhône-Alpes Zirst - 6 avenue de l Europe - Montbonnot

More information

Clustering. Shishir K. Shah

Clustering. Shishir K. Shah Clustering Shishir K. Shah Acknowledgement: Notes by Profs. M. Pollefeys, R. Jin, B. Liu, Y. Ukrainitz, B. Sarel, D. Forsyth, M. Shah, K. Grauman, and S. K. Shah Clustering l Clustering is a technique

More information

Conditional Random Fields and beyond D A N I E L K H A S H A B I C S U I U C,

Conditional Random Fields and beyond D A N I E L K H A S H A B I C S U I U C, Conditional Random Fields and beyond D A N I E L K H A S H A B I C S 5 4 6 U I U C, 2 0 1 3 Outline Modeling Inference Training Applications Outline Modeling Problem definition Discriminative vs. Generative

More information

Pitch Prediction from Mel-frequency Cepstral Coefficients Using Sparse Spectrum Recovery

Pitch Prediction from Mel-frequency Cepstral Coefficients Using Sparse Spectrum Recovery Pitch Prediction from Mel-frequency Cepstral Coefficients Using Sparse Spectrum Recovery Achuth Rao MV, Prasanta Kumar Ghosh SPIRE LAB Electrical Engineering, Indian Institute of Science (IISc), Bangalore,

More information

HMM-Based Handwritten Amharic Word Recognition with Feature Concatenation

HMM-Based Handwritten Amharic Word Recognition with Feature Concatenation 009 10th International Conference on Document Analysis and Recognition HMM-Based Handwritten Amharic Word Recognition with Feature Concatenation Yaregal Assabie and Josef Bigun School of Information Science,

More information

Artificial Intelligence. Programming Styles

Artificial Intelligence. Programming Styles Artificial Intelligence Intro to Machine Learning Programming Styles Standard CS: Explicitly program computer to do something Early AI: Derive a problem description (state) and use general algorithms to

More information