Tree-based Dependence Models for Speech Recognition


Mari Ostendorf, Ashvin Kannan and Orith Ronen
Electrical and Computer Engineering Department, Boston University, 8 St. Mary's St., Boston, MA, USA (mo@bu.edu)
In K. Ponting (ed.), Computational Models of Speech Pattern Processing, Springer-Verlag, Berlin Heidelberg, 1999.

Summary. The independence assumptions typically used to make speech recognition practical ignore the fact that different sounds in speech are highly correlated. Tree-structured dependence models make it possible to represent cross-class acoustic dependence in recognition when used in conjunction with hidden Markov or other such models. These models have Markov-like assumptions on the branches of a tree, which lead to efficient recursive algorithms for state estimation. This paper will describe general approaches to topology design and parameter estimation of tree-based models and outline more specific solutions for two examples: discrete-state hidden dependence trees and continuous-state multiscale models, drawing analogies to results for time series models. Initial results for both cases will be described, followed by a discussion of questions raised by the experiments.

1. Introduction

In speech recognition, independence assumptions are typically made to reduce the complexity of automatic training and the recognition search. In particular, a standard assumption used in virtually all recognition systems is that each vector or segment is generated independently given an underlying state or phone sequence. In other words, in a speaker-independent system, there is no notion that an /aa/ and an /ah/ (or even another /aa/) in the same utterance or speaker session have something in common because they came from the same vocal tract. The assumption effectively allows two phones at different times in an utterance to come from different speakers. Vocal tract length (VTL) normalization (e.g. [1]) compensates for this problem to some extent, but it is clear that VTL normalization does not account for all speaker-dependent effects, because gains are additive when it is used in combination with acoustic model adaptation. In addition, sounds can be correlated for other reasons, such as the recording environment or dialect-related pronunciation patterns. Acoustic model adaptation is used to overcome this problem, but in large vocabulary recognition one often has very little data from the target speaker with which to adapt possibly millions of parameters. Therefore, most current adaptation techniques assume classes of models over which the adaptation transformation is tied (e.g. [2, 3]), or they may approximate the joint dependence of different speech sounds by defining regions of local dependence ([3, 4]). However, such approaches do not take full advantage of the predictive power that observations from one phone have for another.

Ignoring the speech recognition application for the moment, our problem is to estimate a probability distribution that represents the joint dependence of variables in a very high dimensional space and to use this distribution to make inferences about missing variables.

We refer to such a probability distribution as a dependence model. For such a model to be practical, Markov-like assumptions are required; some examples include Markov random fields and Bayesian networks. In this work, we focus on a particular class of Markov dependence models that are tree-based, because of the additional efficiency of estimation and prediction algorithms and because topology design is simpler and arguably more robust for trees.

The remainder of the chapter is organized as follows. In the next section, we introduce a general hidden tree framework that handles the type of variable-length observations encountered in speech applications. Then, we describe two specific examples of dependence models and how they can be used in speech processing applications. We conclude with a brief summary and discussion of open questions.

2. Hidden Tree Framework

The problem of acoustic modeling of speech sounds includes the important issue of characterizing variable-length observations. A speech sound may occur a different number of times (or not at all), and each instance has a randomly varying length. We handle this problem in the dependence model by defining a fixed-dimension hidden state X = [x_1, ..., x_N] that represents the joint dependence of speech sounds x_i, which are associated with observations Y = {Y_i; i = 1, ..., N}. For a tree-based model, x_i corresponds to a node in the tree, and N is equal to the number of nodes in the tree. Each observation Y_i = [y_{i,1}, ..., y_{i,L_i}] is a concatenation of the L_i different instances of sound i (ignoring time order), and can be thought of as a random process with characteristics depending on the hidden state. Figure 1 illustrates a tree with a variable number of observations associated with each node. Depending on the underlying model of speech, y_{i,j} may correspond to a frame of speech (i.e. a vector of cepstral coefficients, the "observation" associated with an HMM state) or a segment trajectory (i.e. the vector of coefficients describing a trajectory of features, the sufficient statistics for the observation associated with a polynomial segment model [19]). The complete set of observations Y may correspond to a single utterance or a collection of utterances.

The probabilistic model is specified by: a function that defines the tree topology, e.g. π(i) is the parent of node i; Markov state distributions associated with branches in the tree, P(x_i | x_π(i)); and observation distributions p(Y_i | x_i), assuming that observations are conditionally independent given the state. The conditional independence assumptions, like those in a hidden Markov model (HMM), make implementation practical. The differences with respect to an HMM are that dependence is between unordered sound classes rather than sequential states in time, and that the state sequence is a fixed-length vector rather than a random process.

As with an HMM, there are three problems to solve in applying the dependence model: optimal state estimation, efficient computation of the likelihood p(Y), and model design (i.e. topology design and distribution parameter estimation). These problems are solved using recursive algorithms that take advantage of assumptions of conditional independence of nodes and subtrees given the value of an intermediate node. The solutions are analogous to the corresponding HMM algorithms but differ in that the updates follow the tree structure rather than a linear time sequence.
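As a concrete illustration of how the grouped observations Y_i enter the model, the following sketch (Python, with hypothetical variable names; the chapter does not prescribe an implementation) groups feature vectors by sound class and evaluates log p(Y_i | x_i = j) for node-dependent Gaussian observation distributions, using the assumption that the y_{i,j} are conditionally i.i.d. given the state.

```python
import numpy as np

def group_observations(frames, labels, n_nodes):
    """Collect the variable-length observation set Y_i for each node i:
    all frames whose sound-class label is i, with time order ignored."""
    return {i: frames[labels == i] for i in range(1, n_nodes + 1)}

def gaussian_obs_loglik(Y_i, means, covs):
    """log p(Y_i | x_i = j) for each state value j, assuming the frames in
    Y_i are conditionally i.i.d. Gaussian given x_i; means[j], covs[j] are
    the node-dependent Gaussian parameters for state value j."""
    n_states = len(means)
    out = np.zeros(n_states)
    for j in range(n_states):
        d = Y_i.shape[1]
        diff = Y_i - means[j]
        inv = np.linalg.inv(covs[j])
        _, logdet = np.linalg.slogdet(covs[j])
        quad = np.einsum('nd,de,ne->n', diff, inv, diff)   # per-frame Mahalanobis terms
        out[j] = np.sum(-0.5 * (quad + logdet + d * np.log(2 * np.pi)))
    return out
```

The resulting per-node arrays of observation log-likelihoods are the quantities consumed by the leaves-to-root likelihood recursion sketched in the next section.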

Fig. 1. Illustration of a tree-structured model with a hidden state: open circles indicate the nodes of the tree that form the hidden state X; filled triangles denote observations associated with a node, Y.

Details of the algorithms depend on the particular state and observation distribution assumptions, but there are some issues that apply in general, as described below.

There are two main types of tree topologies that could describe a collection of sound classes, and the relationship between the number of nodes in the tree N and the number of sound classes M depends on the particular type of topology. At one extreme is the graph of connections between sound classes, where every node in the graph corresponds to one of a disjoint collection of classes (one x_i per class). In this case, N = M, and topology design involves finding the best graph that connects the classes. At the other extreme is a hierarchically organized tree, where the target sound classes comprise the leaves of the tree and sub-classes representing different levels of granularity are introduced at internal nodes. In this case, topology design involves clustering the M classes. If sub-classes are defined using a binary tree, then N = 2M - 1. Hybrid versions can be envisioned and probably will be the most effective solution, in part because the introduction of sub-classes is a useful tool for robust topology design when M is large.

Parameter estimation for dependence models with a hidden tree structure must address the problem of unobserved variables in estimation, which is generally solved using the Expectation-Maximization (EM) algorithm [5]. The algorithm involves two steps: 1) finding the expected joint likelihood of the hidden state and observations given the current parameter estimates, and 2) computing the maximum likelihood estimate of the parameters in terms of the statistics found in step (1). Making Markov assumptions on the tree, the first step can be implemented efficiently with an algorithm analogous to the forward-backward algorithm used in HMM parameter estimation, except that it runs upward and downward on the tree rather than forward and backward in time.

In the two sections that follow, we introduce two different examples of tree-based dependence models: the hidden dependence tree, which represents a discrete-valued hidden state using the disjoint-class topology; and the multiscale model, which represents a continuous-valued state using a hierarchical topology.

Each section first describes the mathematical framework of the model and its application to speech recognition, followed by the algorithms for topology design and parameter estimation, and finally presents some experimental results.

3. Hidden Dependence Trees

Hidden dependence trees are an extension of the discrete dependence trees introduced by Chow and Liu [6] to efficiently model the dependence among a set of random variables. The dependence tree represents a discrete underlying state, and the extension allows for variable-length and continuous-valued observations.

3.1 The Mathematical Framework

In the discrete-state case, the joint probability function P(X) is modeled using a dependence tree [6]. A component x_i is assigned to node i in the tree, and each edge in the tree is associated with the conditional probability function of the two variables connected by the edge, i.e. P(x_i | x_j) for the edge connecting the node of x_i to its parent x_j. The parent of node i, denoted by π(i), is the first node on the path connecting node i to the root. The root of the tree is associated with the component x_0, which is introduced for notational purposes and is not actually a component of X. The nodes x_i connected to the root have π(i) = 0 as their parent, and the edge distributions P(x_i | x_0) are defined to be P(x_i). In other words, the x_i with π(i) = 0 are independent, and hence so are the respective subtrees associated with those nodes. The dependence tree state distribution model is

    P(X) = ∏_{i=1}^N P(x_i | x_π(i)),    (1)

and the joint observation-state distribution is

    p(Y, X) = ∏_{i=1}^N [ ∏_{j=1}^{L_i} p(y_{i,j} | x_i) ] P(x_i | x_π(i)),    (2)

assuming that the {y_{i,j}} are conditionally independent and identically distributed given the state. (Note that upper case P is used to denote a probability mass function, and lower case p a density function.) As in an HMM, the probability of a set of observations is computed by summing over the possible state vectors:

    p(Y) = Σ_X ∏_{i=1}^N p(Y_i | x_i) P(x_i | x_π(i)).    (3)

The sum can be computed efficiently using a recursive algorithm that incorporates probabilities from the leaves upward to the root of the tree, analogous to the forward algorithm for HMMs.
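The leaves-to-root recursion can be sketched compactly. The following Python fragment (hypothetical interfaces, not the authors' implementation) computes log p(Y) from a parent map π, edge distributions, and the per-node observation log-likelihoods of the previous sketch, in the spirit of equation (3).

```python
import numpy as np
from scipy.special import logsumexp

def tree_log_likelihood(parent, edge_logprob, obs_loglik):
    """Leaves-to-root computation of log p(Y) for a hidden dependence tree.

    parent[i]       : pi(i) for i = 1..N, with 0 denoting the dummy root
    edge_logprob[i] : (K_pi(i) x K_i) array of log P(x_i = j | x_pi(i) = k);
                      for children of the root, a (1 x K_i) array of log P(x_i = j)
    obs_loglik[i]   : length-K_i array of log p(Y_i | x_i = j)
    """
    children = {0: []}
    for i in parent:
        children.setdefault(i, [])
        children.setdefault(parent[i], []).append(i)

    def upward(i):
        # up[j] = log p(all observations in the subtree rooted at i | x_i = j)
        if i == 0:
            up = np.zeros(1)                      # dummy root: single state, no observations
        else:
            up = np.asarray(obs_loglik[i], dtype=float)
        for c in children[i]:
            child_up = upward(c)
            # fold in child subtree: log sum_j P(x_c = j | x_i = k) p(subtree of c | x_c = j)
            up = up + logsumexp(edge_logprob[c] + child_up[None, :], axis=1)
        return up

    return upward(0)[0]
```

Treating the dummy root as a single-state node lets the same recursion handle the independent subtrees attached to the root.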

3.2 Application to Speech

The dependence tree state is hidden in the same sense that the mode of a Gaussian mixture distribution is hidden; observations are continuous-valued cepstral features described by Gaussian distributions conditioned on the hidden state. The difference is that the dependence tree state is vector-valued, unlike the scalar mode of a Gaussian mixture distribution. An HMM that uses Gaussian mixture observation distributions also has a multi-dimensional state, but there are important differences with respect to dependence trees. The HMM state sequence is variable-length and time-ordered. In the hidden dependence tree, on the other hand, the state dimension and order are fixed, and there is no notion of time. The state probability distributions in an HMM (a_{ij} = P(s_t = i | s_{t-1} = j)) describe sequence length and ordering, while the state probability distributions in a dependence tree (a_{i,jk} = P(x_i = j | x_π(i) = k)) describe the relationship between the values of states in a fixed order.

The analogy of the hidden dependence tree to a Gaussian mixture and the differences with respect to an HMM suggest an application of the dependence tree in acoustic modeling. Consider an HMM that uses Gaussian mixture distributions. Let x_i = j in the dependence tree indicate that the mixture mode of HMM state i is j. Then the hidden dependence tree provides a model for correlation of the mixture modes across sound classes. With this interpretation, one can envision different applications of the dependence model used in conjunction with an HMM. Assume that an HMM is first used to provide a "transcription" and segmentation of an utterance in terms of the N sound classes in the dependence tree. The transcription is used to group the observations into subsets Y_i, and the hidden dependence tree model is then used to compute p(Y). This probability can be used in a likelihood ratio test of whether two segments of speech came from one vs. two speakers (or in other text-independent speaker/language identification problems), or as an additional "consistency score" in N-best rescoring of hypotheses for word recognition. Alternatively, the observations can be used to re-estimate mixture weights, i.e. w_{ij} = P(x_i = j | Y) for the j-th mixture weight associated with state i, for use in a subsequent decoding pass. This probability is computed using the upward-downward algorithm used for state estimation in model design, described next.

3.3 Topology Design and Parameter Estimation

In this discussion, we will assume a non-hierarchical topology for the dependence tree structure; that is, the classes represented by the tree are disjoint. For the case where X is discrete-valued and fully observable, Chow and Liu [6] describe an algorithm for estimating both the dependence tree structure and its parameters. In our case, where both the tree and subsets of the observations Y_i are unobserved, we divide topology design and parameter estimation into two steps. However, we build on the Chow-Liu algorithm by defining an intermediate, partially observable discrete state, as described below.

Class Definition and Topology. Topology design requires finding π(i) for all nodes i = 1, ..., N. The Chow and Liu algorithm finds the tree topology that minimizes the difference between the information contained in the true probability function and that contained in its approximation by a dependence tree.

This minimization criterion is equivalent to maximizing the total weight on the edges of the tree, where the weight of the edge connecting nodes x_i and x_j is the mutual information I(X_i; X_j) based on relative frequency estimates of their joint probability distribution. Given all possible I(X_i; X_j), topology design is a maximum-weight spanning tree search problem.

The Chow and Liu algorithm works well when the samples of the vector X are complete, meaning that all the components of the samples are observed, and when the number of samples of the vector X is large relative to the number of values an x_i can take on. When there are a small number of samples for a pair of variables, the mutual information estimate is biased above the true value, so (x_i, x_j) pairs that are infrequently observed may be incorrectly assigned links in the tree. In order to use the Chow-Liu algorithm, we estimate a discrete state vector X for each training sample by setting x_i equal to the index of the vector quantization (VQ) codeword that minimizes the total distortion of the observations y_j ∈ Y_i. In order to keep the number of values for x_i small and still have a reasonable sampling of the vector observation space, node-dependent codebooks are designed. Assuming the tree nodes correspond to phonetic or sub-phonetic units, there will be some missing elements of the estimated state vectors, because of the wide variation in frequencies of occurrence of different phonemes. This imbalance of phone-pair occurrence rates can lead to bad estimates of the mutual information and poor tree topologies. To obtain robust trees, we modified the Chow-Liu algorithm to include a threshold on the number of co-occurrences of every phone pair for allowing a link between the pair, and a limit on the number of connections in the tree. In addition, we used random sampling to obtain speaker-level state vectors to reduce the number of missing elements relative to utterance-level vectors.

Parameter Estimation. Two sets of distribution parameters are needed to characterize the hidden dependence tree: the tree edge distributions P(x_i | x_π(i)) and the observation distributions p(y | x_i = j) ~ N(μ_{ij}, Σ_{ij}), where N(μ, Σ) denotes a Gaussian with mean μ and covariance Σ. The above topology design gives an initial estimate for the edge distributions. Initial estimates for μ_{ij} and Σ_{ij} are given by the VQ codewords and associated error covariances. Given an initial estimate, the parameters can be refined based on the actual observations using the iterative EM algorithm. As mentioned earlier, the expectation step involves a recursive upward-downward algorithm that is analogous to the forward-backward algorithm used in HMM parameter estimation. If the VQ-estimated observations of X are available, then the tree edge distributions can be estimated using the upward-downward algorithm for discrete dependence trees [7], which is an extension of Pearl's algorithm for belief propagation in causal trees [8] and a special case of the more general algorithm for Bayesian networks described by Lucke [9]. To estimate both the edge distributions and the observation distributions, the upward-downward algorithm is extended to use the observations Y_i in the upward step during the update of node i, and parameter re-estimation for the node-dependent Gaussians is added to the maximization step [10]. The complete parameter estimation algorithm is similar to that used for HMMs with Gaussian mixture distributions, with the difference being added complexity due to tree-structured rather than time-based dependence.
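The modified topology design described above can be sketched as follows (Python; the co-occurrence threshold and degree limit correspond to the constraints in the text, but the parameter names and the Kruskal-style construction are illustrative assumptions, not the authors' code). It estimates pairwise mutual information from partially observed VQ state vectors and then greedily builds a maximum-weight spanning forest, attaching each resulting subtree to the dummy root node 0.

```python
import numpy as np

def chow_liu_topology(samples, n_values, min_cooccur=50, max_degree=8):
    """Modified Chow-Liu topology design from partially observed state vectors.

    samples : (S, N) int array of VQ indices per speaker-level sample,
              with -1 marking a missing component (phone not observed)
    n_values: number of VQ codewords per node-dependent codebook
    Returns a dict parent[i] for 1-indexed nodes i = 1..N, with 0 the dummy root.
    """
    S, N = samples.shape

    # 1) mutual information weights for sufficiently observed phone pairs
    weights = []
    for a in range(N):
        for b in range(a + 1, N):
            both = (samples[:, a] >= 0) & (samples[:, b] >= 0)
            if both.sum() < min_cooccur:           # too few co-occurrences for a reliable estimate
                continue
            joint = np.zeros((n_values, n_values))
            for xa, xb in zip(samples[both, a], samples[both, b]):
                joint[xa, xb] += 1.0
            joint /= joint.sum()
            pa = joint.sum(axis=1, keepdims=True)
            pb = joint.sum(axis=0, keepdims=True)
            nz = joint > 0
            mi = np.sum(joint[nz] * np.log(joint[nz] / (pa @ pb)[nz]))
            weights.append((mi, a, b))

    # 2) maximum-weight spanning forest (Kruskal) with a limit on node degree
    comp = list(range(N))
    def find(i):
        while comp[i] != i:
            comp[i] = comp[comp[i]]
            i = comp[i]
        return i
    degree = [0] * N
    edges = []
    for mi, a, b in sorted(weights, reverse=True):
        ra, rb = find(a), find(b)
        if ra != rb and degree[a] < max_degree and degree[b] < max_degree:
            comp[ra] = rb
            degree[a] += 1
            degree[b] += 1
            edges.append((a, b))

    # 3) orient each component into a rooted tree hanging off the dummy root 0
    adj = {i: [] for i in range(N)}
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    parent, visited = {}, set()
    for r in range(N):
        if r in visited:
            continue
        parent[r + 1] = 0
        visited.add(r)
        stack = [r]
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if v not in visited:
                    visited.add(v)
                    parent[v + 1] = u + 1
                    stack.append(v)
    return parent
```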

3.4 Experiments

Experiments assessing various methods for topology training and the usefulness of the hidden dependence model were conducted on two large vocabulary continuous speech recognition tasks using the Wall Street Journal [11] and the Switchboard [12] corpora, in both cases training on roughly 120 hours of speech. The WSJ corpus is based on read business news, and the Switchboard corpus comprises telephone-quality conversational speech on a variety of topics. The feature vectors included 14 mel-warped cepstra (no derivatives), computed at a 10 ms frame rate using cepstral mean subtraction. In the experiments on Switchboard, the features were also normalized to compensate for vocal tract length [1]. Each node of the X vector was associated with a phone (i.e. context-independent models), and the dependence tree models used only the frames in the center of the phone segment to minimize coarticulation influences.

In development of the topology design approach, we evaluated the performance of the dependence tree models by computing the likelihood of an independent test set. The results showed that the dependence tree performed better than an independent-phone model, and that constraints on the topology of the tree generally improved performance. In addition, the automatically designed dependence tree outperformed a tree that had been specified by hand according to manner of articulation and other differences in articulatory features. As an example, the tree topology designed on the Switchboard corpus is given in Figure 2, illustrating learned dependence that reflects articulation manner (e.g. among fricatives and nasals) but also some connections that were probably dominated by co-articulation effects (e.g. /aa/-/er/, since /aa/ is often followed by /r/).

We evaluated the performance of the hidden dependence tree model by using the likelihood p(Y) as an additional score in N-best rescoring experiments. In these experiments an HMM-based recognition system from BBN [13] provided an N-best list of hypotheses (N = 100) for all the utterances in the test set, along with an HMM acoustic score and a trigram language model score for each hypothesis. These hypotheses were rescored by the hidden dependence tree model. A linear combination of these scores plus word and phone counts (insertion penalties) was used for re-ranking the list of hypotheses and producing the final recognized output. The weights of the different scores were optimized on a development test set. The dependence tree model used in this experiment was a gender-dependent model with 10 node-dependent codewords per phone and a constrained tree. Table 1 shows the results of these experiments. There is a slight improvement when combining the likelihood score obtained from the dependence tree model with the HMM score. Further gains should be obtained by using more detailed sound classes, such as triphone states, but the resulting dependence tree would be large and would likely require a hybrid hierarchical and Chow-Liu topology design strategy.
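For reference, the re-ranking step described above amounts to a weighted log-linear combination of knowledge sources. The sketch below (Python; the field names are hypothetical, and the weight values are placeholders for the ones tuned on the development set) shows how an N-best list could be re-ranked once per-hypothesis scores are available.

```python
def rerank_nbest(hypotheses, w_hmm=1.0, w_dt=0.2, w_lm=0.8, w_word=-0.5, w_phone=-0.1):
    """Re-rank an N-best list by a linear combination of log-domain scores.

    Each hypothesis is assumed to be a dict with fields 'hmm', 'dt', 'lm'
    (log scores) and 'n_words', 'n_phones' (insertion-penalty counts).
    """
    def total(h):
        return (w_hmm * h['hmm'] + w_dt * h['dt'] + w_lm * h['lm']
                + w_word * h['n_words'] + w_phone * h['n_phones'])
    return max(hypotheses, key=total)
```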

Fig. 2. Discrete dependence tree designed on the Switchboard corpus, where subtrees connected to the root node (indicated by "@") are independent.

Table 1. N-best rescoring results (word error rates) on the 1993 WSJ evaluation test and the 1996 SWBD evaluation test. The knowledge sources are the HMM acoustic score, the dependence tree score (DT), and a trigram language model (LM).

    Knowledge Sources    WSJ Eval93    SWBD Eval96
    HMM, LM
    HMM, DT, LM

4. Multiscale Tree Processes

Multiscale stochastic processes represent an important class of models, of which a particularly useful subclass is based on scale-recursive dynamics on trees [14, 15]. These models allow efficient algorithms for both estimation and likelihood calculation, resulting in a variety of applications. In this section, we describe the general framework and its application to acoustic model adaptation in speech recognition.

4.1 The Mathematical Framework

Denoting a node in a tree by t, with parent¹ t̄, a state-space model for the evolution on the tree of the Gaussian process X and its noisy observation Y is given by

    x_t = A_t x_{t̄} + w_t,    (4)
    y_{t,i} = C_t x_t + v_{t,i},    (5)

where x_t is the vector state of the process at node t. The root node state x_0 has distribution N(0, Σ_0). The process noise w_t is white, independent of x_0, and has distribution N(0, Q_t). The state x_t is observed via a noisy measurement y_{t,i}, where the measurement noise v_{t,i} is white, independent of x_0 and w_t, and has distribution N(0, R_t). Thus, θ_x = (Σ_0, {A_t, Q_t}) are the parameters of p(X), and θ_{y|x} = ({C_t, R_t}) are the parameters of p(Y|X). The zero-mean assumptions on the root node x_0 and the noise terms are not a requirement of the model, but result in simpler estimation equations.

¹ The notation t̄ represents the same information as π(i) for the hidden dependence tree; the two notations are used to be consistent with other literature in the respective areas.

A degenerate tree with only one leaf node (parent nodes have only one child) can be interpreted as a standard linear dynamical system, i.e. having a time-like index. As an acoustic model for speech recognition, the standard dynamical system is a continuous-state alternative to the discrete-state HMM, where the likelihood is computed using Kalman filtering recursions to obtain innovations and associated distribution parameters [16]. A similar approach can be used for the multiscale model by extending the Kalman recursions to the tree [17].

For the adaptation application, state estimation is more important than likelihood computation. Given Y, the set of all available observations, the smoothed estimate² of the state, x̂_t = E{x_t | Y}, and the associated error covariance P_{t|Y} = E{[x_t - x̂_t][x_t - x̂_t]^T} are computed using a generalization of the Rauch-Tung-Striebel (RTS) algorithm [14]. Smoothing is done in two sweeps: an upward sweep from the leaves to the root, followed by a downward one from the root to the leaves. The complexity of the tree RTS smoother is O(d³N), where d is the dimensionality of the state (the d³ term is due to matrix inversions) and N is the number of nodes in the tree.

² The term "smoothed estimate" refers to the linear least squares (or, for Gaussians, the minimum mean squares) estimate. It also corresponds to the maximum a posteriori estimate.

4.2 Application to Speech

The multiscale model can be used for the adaptation of the means of acoustic models to a new speaker or new environmental conditions. For example, let each leaf τ of the multiscale tree be associated with a set of Gaussians G_τ, and adapt the means of all Gaussians in class τ by a common shared shift x_τ:

    μ_i' = μ_i + x_τ,  i ∈ G_τ,    (6)

where μ_i' denotes the mean μ_i after adaptation. Such a shared shift approach has been used for Gaussians in hidden Markov models (HMMs) [3] and the stochastic segment model (SSM) [18], and for polynomial segment models (PSMs) [19]. The observations y_{τ,i} ∈ Y_τ associated with node τ are differences between the speaker-independent means μ_i and the average of feature vectors observed for sound i ∈ G_τ in an utterance.
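The generative structure defined by equations (4) and (5) can be made concrete with a small simulation (a Python sketch under illustrative assumptions: shared A, Q, C, R across nodes and a hypothetical parent map; the chapter allows these to be node-dependent).

```python
import numpy as np

def simulate_tree_process(parent, A, Q, C, R, sigma0, n_obs, seed=0):
    """Draw one sample of the multiscale process of Eqs. (4)-(5) on a tree.

    parent[t] : parent node of t (node 0 is the root); parents are assumed
                to have smaller indices than their children
    A, Q      : shared state transition matrix and process-noise covariance
    C, R      : shared observation matrix and measurement-noise covariance
    sigma0    : covariance of the root state x_0
    n_obs[t]  : number of noisy observations attached to node t
    """
    rng = np.random.default_rng(seed)
    d = A.shape[0]
    x = {0: rng.multivariate_normal(np.zeros(d), sigma0)}      # x_0 ~ N(0, Sigma_0)
    y = {}
    for t in sorted(parent):                                   # visit parents before children
        w = rng.multivariate_normal(np.zeros(d), Q)            # process noise w_t
        x[t] = A @ x[parent[t]] + w                            # Eq. (4)
        y[t] = [C @ x[t] + rng.multivariate_normal(np.zeros(R.shape[0]), R)
                for _ in range(n_obs.get(t, 0))]               # Eq. (5)
    return x, y
```

In the adaptation setting above, the leaf states x_τ play the role of the shared mean shifts, and the y_{τ,i} are the per-Gaussian observed shifts.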

Initial independent estimates for the shift x_τ and the associated error covariance P_τ can be obtained from adaptation data for each class τ by averaging the observed shifts for that class (y_{τ,i} ∈ Y_τ) and computing the equivalent covariance of the averaged variable [19]. Let us define a Gaussian tree-based shift process (Equation 4) with M leaves, and associate the leaf node states with the shifts of the M classes we wish to model dependence between. Given these initial shift estimates and covariances at a subset of leaves, adaptation involves estimating the hidden states x_τ for all leaves using the tree RTS smoother and then shifting the models within the respective classes. Due to the Bayesian nature of the estimate, the smoothed shift approaches the unsmoothed shift and converges to the standard ML speaker-dependent estimate as the amount of adaptation data increases.

A similar tree-based model for adaptation is described in [20], but the upward-downward propagation of mean shifts is based on a heuristic that does not account for the degree of correlation or variance differences. Another Markovian model used in adaptation is based on Markov random fields [21]. Multiscale models offer a number of advantages over Markov random fields, including a constant per-node complexity, the availability of an error covariance associated with smoothing, and the fact that state estimation algorithms are efficient, non-iterative, recursive and parallelizable.

4.3 Topology Design and Parameter Estimation

To use the multiscale model of dependence in adaptation, we need to define the adaptation classes and tree topology, as well as estimate parameters for the process.

Class Definition and Topology. In continuous speech recognition, context-dependent models are frequently clustered in the form of a tree for each region (or state) of a phone using ML clustering of Gaussians [22, 23]. Figure 3(a) illustrates the tree for one region. Each node of the tree represents an equivalence class of triphones. Nodes at a certain "cut" through the tree, the boxes in Figure 3(a), define terminal adaptation classes that share shifts (Equation 6). One popular, but ad hoc, option for adaptation is the "back-off" strategy, where the shift is computed at the most detailed node which has more than some threshold of adaptation frames and copied to all child terminal adaptation classes, as shown in Figure 3(b). The topology of the clustering tree can also be used for multiscale smoothing. Class-dependent shifts x_t are computed at the terminal adaptation nodes and then smoothed using the multiscale model (Figure 3(c)) to get shift estimates x̂_t at all nodes.

Parameter Estimation. Maximum-likelihood estimates of the parameters of the tree process (Σ_0, A_t, Q_t, C_t, R_t) can be obtained by applying the RTS and EM algorithms to multiple independent sample vectors Y [24], where each conversation side contributes one sample. The general approach follows that described in [16] for a time-ordered dynamical system, which involves iteratively finding expected sufficient statistics of the hidden state (E-step), and then using multivariate regression to compute new process parameters (M-step). The main difference is in the recursions used in the E-step, which build on the tree algorithms developed in [14, 25]. Here, we assume C_t = I, and R_t is effectively given by the sample variance of the observations, so there is no need to estimate C_t and R_t. In the experiments described next, the A and Q parameters are shared among all nodes of a phone; i.e. for K phones, there are K trees, each with an (A, Q) pair.
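The construction of the class-level ML shift observations used both as input to the smoother and, below, to initialize the EM iterations can be sketched as follows (Python; the per-Gaussian statistics and the equal-weight averaging are illustrative assumptions consistent with the description above, not the exact estimator of [19]).

```python
import numpy as np

def initial_class_shifts(adapt_means, si_means, classes, R):
    """Initial ML shift and covariance per terminal adaptation class.

    adapt_means[i] : average adaptation-data feature vector for Gaussian i
                     (None if Gaussian i was not observed)
    si_means[i]    : speaker-independent mean of Gaussian i
    classes[tau]   : list of Gaussian indices i belonging to class tau
    R              : covariance assumed for a single observed shift
    Returns dicts x_hat[tau], P[tau] for classes with at least one observation.
    """
    x_hat, P = {}, {}
    for tau, members in classes.items():
        shifts = [adapt_means[i] - si_means[i] for i in members
                  if adapt_means[i] is not None]          # observed shifts y_{tau,i}
        if shifts:
            x_hat[tau] = np.mean(shifts, axis=0)          # averaged shift for the class
            P[tau] = R / len(shifts)                      # covariance of the averaged variable
    return x_hat, P
```

These per-class estimates at observed leaves are the quantities the tree RTS smoother takes as input; leaves with no adaptation data receive smoothed shifts through the tree.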

Fig. 3. Trees used for adaptation: (a) shows the clustering tree with squares indicating terminal adaptation classes, (b) illustrates the "back-off" method of adaptation with dashed squares indicating back-off classes, and (c) shows the corresponding multiscale smoothing approach. In both (b) and (c), triangles indicate observations.

To start the EM iterations, we need initial estimates of Σ_0, the A's and the Q's. For each speaker in the training set, we compute covariances of ML (unsmoothed) shifts at each terminal shift node. A frequency-weighted average of these covariances across all speakers is used for initializing Σ_0 and all Q_t, and the initial A_t = I for all t.

4.4 Experiments

Experiments were conducted on the Switchboard corpus. The feature vectors were the same as for the hidden dependence tree experiments, except that energy and feature derivatives were used. N-best rescoring is also used here, with the segment-model acoustic score substituted for the HMM score. Most experiments use sixty hours of speech for training the acoustic and multiscale (MS) models; 123 hours are used in the guided adaptation experiments. The PSM systems used a 2-region model, with each region modeled by a linear trajectory Gaussian process with a single full covariance. The SSM systems used a 5-region model, with each region represented by a full covariance Gaussian. Both cases used gender-dependent models and ML clustered triphones. The PSM and SSM adaptation systems had 300 and 150 terminal adaptation classes per region, respectively. For a fair comparison of MS smoothing vs. back-off approaches, the same topology is used for both types of adaptation.

In batch-mode adaptation, the first half of each conversation is used as adaptation data and the second half for testing. The results in Table 2 for supervised adaptation indicate that MS smoothing is better than the back-off approach. However, performance for both algorithms degrades relative to the baseline in unsupervised adaptation, indicating a sensitivity to the high error rate in the Switchboard tasks. Consequently, we use guided adaptation in further unsupervised experiments.

In unsupervised transcription-mode adaptation, two passes are made over the speech: the first to collect statistics for adaptation, and the second to perform recognition with the adapted models. Adaptation is guided in the sense that we adapt only with data from the subset of words in the top first-pass hypothesis with confidence over a specified threshold.

Table 2. Supervised batch recognition with 2-region PSMs. Error rates are on the second half of conversations in the Dev96 test set.

    SI baseline    44.5%

This serves to lower the error rate for the speech used in adaptation, which benefits both ML and MS adaptation. Guided adaptation also tests the generalization capability of the dependence model for unseen classes (in the "incorrect" parts of the speech). In experiments on the Dev97 test set with a 5-region SSM, we found that the ML back-off approach improved a 40.9% WER baseline³ to 40.4%, and that further improvement to 40.0% was obtained with MS smoothing. Table 3 shows that MS adaptation gives about 1% absolute improvement in performance. A much greater gain is expected from using lattice decoding rather than N-best rescoring, based on BBN adaptation experiments [26].

³ The lower baseline error rate is due to differences in the test set, language model, signal processing parameters, and training on 123 hours of speech.

Table 3. Guided unsupervised transcription-mode adaptation with a 5-region SSM system.

                Dev97    Eval97
    Baseline
    MS adapt

5. Discussion

In summary, tree-based models of dependence provide an efficient framework for representing correlation across phones in speech (or sub-phonetic units represented by HMM states), for use in adaptation as well as other applications. Dependence models are a supplement to, and not a replacement for, existing techniques such as HMMs, in that they model correlation across classes but not time. Markov-like assumptions combined with a tree structure make for efficient algorithms for computing the expected state given a set of observations. Dependence model design involves first finding the tree topology, which can be a direct connection of classes or a hierarchy of sub-classes, and then EM parameter estimation using an upward-downward algorithm for handling the hidden state. Two important examples are described: the hidden dependence tree, which has a discrete hidden state and can be thought of as a mechanism for loosely coupling Gaussian mixture modes of different models; and the multiscale model, which has a continuous hidden state and relates models via a hierarchy with different levels of granularity. The two approaches differ primarily in the discrete vs. continuous hidden state, but they also illustrate two extremes of topology design.

It is an open question as to which of the two models is more useful: the hidden dependence tree is better at capturing non-linear dependence between classes, but the mixture mode dependence may be too weak a coupling. Initial results for both models are promising, but much work remains to explore their full potential. For example, the speaker adaptation experiments did not take advantage of several variations known to improve results, such as speaker-adaptive training [27] and iterative adaptation and decoding.

Several questions are raised by the initial experimental results, particularly related to topology design and parameter tying. How can we best integrate the mutual information clustering technique, which is more general but not very robust, with hierarchical clustering techniques? What is the right number of classes to represent with the tree? For adaptation, theoretically it is better to use a large number of classes, but in practice we do not find this to be the case, probably because of inaccuracies in the model exacerbated by parameter tying assumptions. In the multiscale model experiments, we assumed that all branches of the tree for a particular phone shared the same transition matrix and process noise covariance. Is it possible to learn finer-grained parameter sharing automatically? Of course, there is also the question of whether better results can be obtained by relaxing the tree-structure assumption and using less restrictive models such as Bayesian networks. However, it is likely that the computational efficiency of the tree structure will make the tree-based dependence models more attractive in the near term.

Acknowledgement. This work was supported by the United States DoD, grant ONR-N00014-92-J.

References

[1] E. Eide and H. Gish, "A parametric approach to vocal tract length normalization," Proc. Inter. Conf. on Acoust., Speech and Signal Proc., vol. 1, May.
[2] C. J. Leggetter and P. C. Woodland, "Flexible speaker adaptation using maximum likelihood linear regression," Proc. ARPA Workshop on Spoken Language Technology, January.
[3] G. Zavaliagkos, R. Schwartz, J. McDonough, and J. Makhoul, "Adaptation algorithms for large scale HMM recognizers," Proc. European Conference on Speech Comm. and Tech., vol. 2, pp. 1131-1134, September.
[4] Q. Huo and C.-H. Lee, "On-line adaptive learning of the correlated continuous density hidden Markov models for speech recognition," Proc. Inter. Conf. on Acoust., Speech and Signal Proc., vol. 2, May.
[5] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society (B), vol. 39, no. 1, pp. 1-38, 1977.
[6] C. K. Chow and C. N. Liu, "Approximating discrete probability distributions with dependence trees," IEEE Trans. Information Theory, vol. IT-14, no. 3, May 1968.
[7] O. Ronen, J. R. Rohlicek, and M. Ostendorf, "Parameter estimation of dependence tree models using the EM algorithm," IEEE Signal Processing Letters, vol. 2, no. 8, 1995.

[8] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann, San Mateo, CA, 1988.
[9] H. Lucke, "Which stochastic models allow Baum-Welch training?" IEEE Trans. Signal Proc., vol. 44, no. 11.
[10] O. Ronen, Dependence Tree Models of Intra-Utterance Phone Dependence, Boston University Ph.D. Thesis.
[11] F. Kubala et al., "The hub and spoke paradigm for CSR evaluation," Proc. of the ARPA Human Language Technology Workshop, March.
[12] J. J. Godfrey, E. C. Holliman, and J. McDaniel, "SWITCHBOARD: Telephone speech corpus for research and development," Proc. Inter. Conf. Acoust., Speech, and Signal Proc., vol. 1, March.
[13] L. Nguyen et al., "The 1994 BBN/BYBLOS speech recognition system," Proc. of the ARPA Spoken Language Systems Technology Workshop, January.
[14] K. C. Chou, A. S. Willsky, and A. Benveniste, "Multiscale recursive estimation, data fusion, and regularization," IEEE Trans. Automatic Control, vol. 39, no. 3.
[15] M. R. Luettgen, W. C. Karl, A. S. Willsky, and R. R. Tenney, "Multiscale representations of Markov random fields," IEEE Trans. Signal Proc., vol. 41, no. 12.
[16] V. Digalakis, J. R. Rohlicek, and M. Ostendorf, "ML estimation of a stochastic linear system with the EM algorithm and its application to speech recognition," IEEE Trans. Speech and Audio Proc., vol. 1, no. 4.
[17] M. R. Luettgen and A. S. Willsky, "Likelihood calculation for a class of multiscale stochastic models, with application to texture discrimination," IEEE Trans. Image Proc., vol. 4, no. 2.
[18] A. Kannan and M. Ostendorf, "Modeling dependency in adaptation of acoustic models using multiscale tree processes," Proc. Eurospeech, vol. 4.
[19] A. Kannan and M. Ostendorf, "Adaptation of polynomial trajectory segment models for large vocabulary speech recognition," Proc. Inter. Conf. Acoust., Speech and Signal Proc., vol. 2, April.
[20] D. Paul, "Extensions to phone-state decision-tree clustering: single tree and tagged clustering," Proc. Inter. Conf. Acoust., Speech and Signal Proc., vol. 2, April.
[21] B. M. Shahshahani, "A Markov random field approach to Bayesian speaker adaptation," IEEE Trans. Speech and Audio Proc., vol. 5, no. 2.
[22] A. Kannan, M. Ostendorf and J. R. Rohlicek, "Maximum likelihood clustering of Gaussians for speech recognition," IEEE Trans. Speech and Audio Proc., vol. 2, no. 3.
[23] S. J. Young, J. J. Odell and P. C. Woodland, "Tree-based state tying for high accuracy acoustic modeling," Proc. ARPA Workshop on Human Language Technology, March.
[24] A. Kannan, M. Ostendorf, D. A. Castanon, and W. C. Karl, "ML parameter estimation of a multiscale tree process using the EM algorithm," Technical Report ECE, Boston University, November. Available from ftp://raven.bu.edu/pub/reports.
[25] M. R. Luettgen and A. S. Willsky, "Multiscale smoothing error models," IEEE Trans. Automatic Control, vol. 40, no. 1.
[26] G. Zavaliagkos, personal communication.
[27] T. Anastasakos, J. McDonough, R. Schwartz, and J. Makhoul, "A compact model for speaker-adaptive training," Proc. of the Inter. Conf. on Spoken Language Processing, vol. 2, October 1996.


More information

Mono-font Cursive Arabic Text Recognition Using Speech Recognition System

Mono-font Cursive Arabic Text Recognition Using Speech Recognition System Mono-font Cursive Arabic Text Recognition Using Speech Recognition System M.S. Khorsheed Computer & Electronics Research Institute, King AbdulAziz City for Science and Technology (KACST) PO Box 6086, Riyadh

More information

Clustering. CS294 Practical Machine Learning Junming Yin 10/09/06

Clustering. CS294 Practical Machine Learning Junming Yin 10/09/06 Clustering CS294 Practical Machine Learning Junming Yin 10/09/06 Outline Introduction Unsupervised learning What is clustering? Application Dissimilarity (similarity) of objects Clustering algorithm K-means,

More information

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 8: GRAPHICAL MODELS

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 8: GRAPHICAL MODELS PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 8: GRAPHICAL MODELS Bayesian Networks Directed Acyclic Graph (DAG) Bayesian Networks General Factorization Bayesian Curve Fitting (1) Polynomial Bayesian

More information

High throughput Data Analysis 2. Cluster Analysis

High throughput Data Analysis 2. Cluster Analysis High throughput Data Analysis 2 Cluster Analysis Overview Why clustering? Hierarchical clustering K means clustering Issues with above two Other methods Quality of clustering results Introduction WHY DO

More information

Automatic basis selection for RBF networks using Stein s unbiased risk estimator

Automatic basis selection for RBF networks using Stein s unbiased risk estimator Automatic basis selection for RBF networks using Stein s unbiased risk estimator Ali Ghodsi School of omputer Science University of Waterloo University Avenue West NL G anada Email: aghodsib@cs.uwaterloo.ca

More information

Optimization of HMM by the Tabu Search Algorithm

Optimization of HMM by the Tabu Search Algorithm JOURNAL OF INFORMATION SCIENCE AND ENGINEERING 20, 949-957 (2004) Optimization of HMM by the Tabu Search Algorithm TSONG-YI CHEN, XIAO-DAN MEI *, JENG-SHYANG PAN AND SHENG-HE SUN * Department of Electronic

More information

Workshop report 1. Daniels report is on website 2. Don t expect to write it based on listening to one project (we had 6 only 2 was sufficient

Workshop report 1. Daniels report is on website 2. Don t expect to write it based on listening to one project (we had 6 only 2 was sufficient Workshop report 1. Daniels report is on website 2. Don t expect to write it based on listening to one project (we had 6 only 2 was sufficient quality) 3. I suggest writing it on one presentation. 4. Include

More information

Challenges motivating deep learning. Sargur N. Srihari

Challenges motivating deep learning. Sargur N. Srihari Challenges motivating deep learning Sargur N. srihari@cedar.buffalo.edu 1 Topics In Machine Learning Basics 1. Learning Algorithms 2. Capacity, Overfitting and Underfitting 3. Hyperparameters and Validation

More information

2-2-2, Hikaridai, Seika-cho, Soraku-gun, Kyoto , Japan 2 Graduate School of Information Science, Nara Institute of Science and Technology

2-2-2, Hikaridai, Seika-cho, Soraku-gun, Kyoto , Japan 2 Graduate School of Information Science, Nara Institute of Science and Technology ISCA Archive STREAM WEIGHT OPTIMIZATION OF SPEECH AND LIP IMAGE SEQUENCE FOR AUDIO-VISUAL SPEECH RECOGNITION Satoshi Nakamura 1 Hidetoshi Ito 2 Kiyohiro Shikano 2 1 ATR Spoken Language Translation Research

More information

10701 Machine Learning. Clustering

10701 Machine Learning. Clustering 171 Machine Learning Clustering What is Clustering? Organizing data into clusters such that there is high intra-cluster similarity low inter-cluster similarity Informally, finding natural groupings among

More information

Optimization of Observation Membership Function By Particle Swarm Method for Enhancing Performances of Speaker Identification

Optimization of Observation Membership Function By Particle Swarm Method for Enhancing Performances of Speaker Identification Proceedings of the 6th WSEAS International Conference on SIGNAL PROCESSING, Dallas, Texas, USA, March 22-24, 2007 52 Optimization of Observation Membership Function By Particle Swarm Method for Enhancing

More information

Regularization and model selection

Regularization and model selection CS229 Lecture notes Andrew Ng Part VI Regularization and model selection Suppose we are trying select among several different models for a learning problem. For instance, we might be using a polynomial

More information

Expectation Maximization (EM) and Gaussian Mixture Models

Expectation Maximization (EM) and Gaussian Mixture Models Expectation Maximization (EM) and Gaussian Mixture Models Reference: The Elements of Statistical Learning, by T. Hastie, R. Tibshirani, J. Friedman, Springer 1 2 3 4 5 6 7 8 Unsupervised Learning Motivation

More information

A ROBUST SPEAKER CLUSTERING ALGORITHM

A ROBUST SPEAKER CLUSTERING ALGORITHM A ROBUST SPEAKER CLUSTERING ALGORITHM J. Ajmera IDIAP P.O. Box 592 CH-1920 Martigny, Switzerland jitendra@idiap.ch C. Wooters ICSI 1947 Center St., Suite 600 Berkeley, CA 94704, USA wooters@icsi.berkeley.edu

More information

Clustering Lecture 5: Mixture Model

Clustering Lecture 5: Mixture Model Clustering Lecture 5: Mixture Model Jing Gao SUNY Buffalo 1 Outline Basics Motivation, definition, evaluation Methods Partitional Hierarchical Density-based Mixture model Spectral methods Advanced topics

More information

Context based optimal shape coding

Context based optimal shape coding IEEE Signal Processing Society 1999 Workshop on Multimedia Signal Processing September 13-15, 1999, Copenhagen, Denmark Electronic Proceedings 1999 IEEE Context based optimal shape coding Gerry Melnikov,

More information

Regularization and Markov Random Fields (MRF) CS 664 Spring 2008

Regularization and Markov Random Fields (MRF) CS 664 Spring 2008 Regularization and Markov Random Fields (MRF) CS 664 Spring 2008 Regularization in Low Level Vision Low level vision problems concerned with estimating some quantity at each pixel Visual motion (u(x,y),v(x,y))

More information

Skill. Robot/ Controller

Skill. Robot/ Controller Skill Acquisition from Human Demonstration Using a Hidden Markov Model G. E. Hovland, P. Sikka and B. J. McCarragher Department of Engineering Faculty of Engineering and Information Technology The Australian

More information

Clustering: Classic Methods and Modern Views

Clustering: Classic Methods and Modern Views Clustering: Classic Methods and Modern Views Marina Meilă University of Washington mmp@stat.washington.edu June 22, 2015 Lorentz Center Workshop on Clusters, Games and Axioms Outline Paradigms for clustering

More information

ECE521: Week 11, Lecture March 2017: HMM learning/inference. With thanks to Russ Salakhutdinov

ECE521: Week 11, Lecture March 2017: HMM learning/inference. With thanks to Russ Salakhutdinov ECE521: Week 11, Lecture 20 27 March 2017: HMM learning/inference With thanks to Russ Salakhutdinov Examples of other perspectives Murphy 17.4 End of Russell & Norvig 15.2 (Artificial Intelligence: A Modern

More information

A Model Selection Criterion for Classification: Application to HMM Topology Optimization

A Model Selection Criterion for Classification: Application to HMM Topology Optimization A Model Selection Criterion for Classification Application to HMM Topology Optimization Alain Biem IBM T. J. Watson Research Center P.O Box 218, Yorktown Heights, NY 10549, USA biem@us.ibm.com Abstract

More information

Estimating Human Pose in Images. Navraj Singh December 11, 2009

Estimating Human Pose in Images. Navraj Singh December 11, 2009 Estimating Human Pose in Images Navraj Singh December 11, 2009 Introduction This project attempts to improve the performance of an existing method of estimating the pose of humans in still images. Tasks

More information

Constraints in Particle Swarm Optimization of Hidden Markov Models

Constraints in Particle Swarm Optimization of Hidden Markov Models Constraints in Particle Swarm Optimization of Hidden Markov Models Martin Macaš, Daniel Novák, and Lenka Lhotská Czech Technical University, Faculty of Electrical Engineering, Dep. of Cybernetics, Prague,

More information

Multiple Constraint Satisfaction by Belief Propagation: An Example Using Sudoku

Multiple Constraint Satisfaction by Belief Propagation: An Example Using Sudoku Multiple Constraint Satisfaction by Belief Propagation: An Example Using Sudoku Todd K. Moon and Jacob H. Gunther Utah State University Abstract The popular Sudoku puzzle bears structural resemblance to

More information

Discriminative Training and Adaptation of Large Vocabulary ASR Systems

Discriminative Training and Adaptation of Large Vocabulary ASR Systems Discriminative Training and Adaptation of Large Vocabulary ASR Systems Phil Woodland March 30th 2004 ICSI Seminar: March 30th 2004 Overview Why use discriminative training for LVCSR? MMIE/CMLE criterion

More information

Confidence Measures: how much we can trust our speech recognizers

Confidence Measures: how much we can trust our speech recognizers Confidence Measures: how much we can trust our speech recognizers Prof. Hui Jiang Department of Computer Science York University, Toronto, Ontario, Canada Email: hj@cs.yorku.ca Outline Speech recognition

More information

VIDEO OBJECT SEGMENTATION BY EXTENDED RECURSIVE-SHORTEST-SPANNING-TREE METHOD. Ertem Tuncel and Levent Onural

VIDEO OBJECT SEGMENTATION BY EXTENDED RECURSIVE-SHORTEST-SPANNING-TREE METHOD. Ertem Tuncel and Levent Onural VIDEO OBJECT SEGMENTATION BY EXTENDED RECURSIVE-SHORTEST-SPANNING-TREE METHOD Ertem Tuncel and Levent Onural Electrical and Electronics Engineering Department, Bilkent University, TR-06533, Ankara, Turkey

More information

Epitomic Analysis of Human Motion

Epitomic Analysis of Human Motion Epitomic Analysis of Human Motion Wooyoung Kim James M. Rehg Department of Computer Science Georgia Institute of Technology Atlanta, GA 30332 {wooyoung, rehg}@cc.gatech.edu Abstract Epitomic analysis is

More information

Machine Learning. Sourangshu Bhattacharya

Machine Learning. Sourangshu Bhattacharya Machine Learning Sourangshu Bhattacharya Bayesian Networks Directed Acyclic Graph (DAG) Bayesian Networks General Factorization Curve Fitting Re-visited Maximum Likelihood Determine by minimizing sum-of-squares

More information

Machine Learning A W 1sst KU. b) [1 P] Give an example for a probability distributions P (A, B, C) that disproves

Machine Learning A W 1sst KU. b) [1 P] Give an example for a probability distributions P (A, B, C) that disproves Machine Learning A 708.064 11W 1sst KU Exercises Problems marked with * are optional. 1 Conditional Independence I [2 P] a) [1 P] Give an example for a probability distribution P (A, B, C) that disproves

More information

CS Introduction to Data Mining Instructor: Abdullah Mueen

CS Introduction to Data Mining Instructor: Abdullah Mueen CS 591.03 Introduction to Data Mining Instructor: Abdullah Mueen LECTURE 8: ADVANCED CLUSTERING (FUZZY AND CO -CLUSTERING) Review: Basic Cluster Analysis Methods (Chap. 10) Cluster Analysis: Basic Concepts

More information

Audio-visual interaction in sparse representation features for noise robust audio-visual speech recognition

Audio-visual interaction in sparse representation features for noise robust audio-visual speech recognition ISCA Archive http://www.isca-speech.org/archive Auditory-Visual Speech Processing (AVSP) 2013 Annecy, France August 29 - September 1, 2013 Audio-visual interaction in sparse representation features for

More information

A Graph Theoretic Approach to Image Database Retrieval

A Graph Theoretic Approach to Image Database Retrieval A Graph Theoretic Approach to Image Database Retrieval Selim Aksoy and Robert M. Haralick Intelligent Systems Laboratory Department of Electrical Engineering University of Washington, Seattle, WA 98195-2500

More information

Pattern Clustering with Similarity Measures

Pattern Clustering with Similarity Measures Pattern Clustering with Similarity Measures Akula Ratna Babu 1, Miriyala Markandeyulu 2, Bussa V R R Nagarjuna 3 1 Pursuing M.Tech(CSE), Vignan s Lara Institute of Technology and Science, Vadlamudi, Guntur,

More information

Understanding Clustering Supervising the unsupervised

Understanding Clustering Supervising the unsupervised Understanding Clustering Supervising the unsupervised Janu Verma IBM T.J. Watson Research Center, New York http://jverma.github.io/ jverma@us.ibm.com @januverma Clustering Grouping together similar data

More information

Repeating Segment Detection in Songs using Audio Fingerprint Matching

Repeating Segment Detection in Songs using Audio Fingerprint Matching Repeating Segment Detection in Songs using Audio Fingerprint Matching Regunathan Radhakrishnan and Wenyu Jiang Dolby Laboratories Inc, San Francisco, USA E-mail: regu.r@dolby.com Institute for Infocomm

More information

CS6375: Machine Learning Gautam Kunapuli. Mid-Term Review

CS6375: Machine Learning Gautam Kunapuli. Mid-Term Review Gautam Kunapuli Machine Learning Data is identically and independently distributed Goal is to learn a function that maps to Data is generated using an unknown function Learn a hypothesis that minimizes

More information

On the Parameter Estimation of the Generalized Exponential Distribution Under Progressive Type-I Interval Censoring Scheme

On the Parameter Estimation of the Generalized Exponential Distribution Under Progressive Type-I Interval Censoring Scheme arxiv:1811.06857v1 [math.st] 16 Nov 2018 On the Parameter Estimation of the Generalized Exponential Distribution Under Progressive Type-I Interval Censoring Scheme Mahdi Teimouri Email: teimouri@aut.ac.ir

More information

Robustness of Non-Exact Multi-Channel Equalization in Reverberant Environments

Robustness of Non-Exact Multi-Channel Equalization in Reverberant Environments Robustness of Non-Exact Multi-Channel Equalization in Reverberant Environments Fotios Talantzis and Lazaros C. Polymenakos Athens Information Technology, 19.5 Km Markopoulo Ave., Peania/Athens 19002, Greece

More information

Performance Characterization in Computer Vision

Performance Characterization in Computer Vision Performance Characterization in Computer Vision Robert M. Haralick University of Washington Seattle WA 98195 Abstract Computer vision algorithms axe composed of different sub-algorithms often applied in

More information

Approximate Discrete Probability Distribution Representation using a Multi-Resolution Binary Tree

Approximate Discrete Probability Distribution Representation using a Multi-Resolution Binary Tree Approximate Discrete Probability Distribution Representation using a Multi-Resolution Binary Tree David Bellot and Pierre Bessière GravirIMAG CNRS and INRIA Rhône-Alpes Zirst - 6 avenue de l Europe - Montbonnot

More information

Clustering. Shishir K. Shah

Clustering. Shishir K. Shah Clustering Shishir K. Shah Acknowledgement: Notes by Profs. M. Pollefeys, R. Jin, B. Liu, Y. Ukrainitz, B. Sarel, D. Forsyth, M. Shah, K. Grauman, and S. K. Shah Clustering l Clustering is a technique

More information

Conditional Random Fields and beyond D A N I E L K H A S H A B I C S U I U C,

Conditional Random Fields and beyond D A N I E L K H A S H A B I C S U I U C, Conditional Random Fields and beyond D A N I E L K H A S H A B I C S 5 4 6 U I U C, 2 0 1 3 Outline Modeling Inference Training Applications Outline Modeling Problem definition Discriminative vs. Generative

More information

Pitch Prediction from Mel-frequency Cepstral Coefficients Using Sparse Spectrum Recovery

Pitch Prediction from Mel-frequency Cepstral Coefficients Using Sparse Spectrum Recovery Pitch Prediction from Mel-frequency Cepstral Coefficients Using Sparse Spectrum Recovery Achuth Rao MV, Prasanta Kumar Ghosh SPIRE LAB Electrical Engineering, Indian Institute of Science (IISc), Bangalore,

More information

HMM-Based Handwritten Amharic Word Recognition with Feature Concatenation

HMM-Based Handwritten Amharic Word Recognition with Feature Concatenation 009 10th International Conference on Document Analysis and Recognition HMM-Based Handwritten Amharic Word Recognition with Feature Concatenation Yaregal Assabie and Josef Bigun School of Information Science,

More information

Artificial Intelligence. Programming Styles

Artificial Intelligence. Programming Styles Artificial Intelligence Intro to Machine Learning Programming Styles Standard CS: Explicitly program computer to do something Early AI: Derive a problem description (state) and use general algorithms to

More information