Tree-based Dependence Models for Speech Recognition
Mari Ostendorf, Ashvin Kannan and Orith Ronen
Electrical and Computer Engineering Department, Boston University, 8 St. Mary's St., Boston, MA, USA
mo@bu.edu

K. Ponting (ed.), Computational Models of Speech Pattern Processing, Springer-Verlag Berlin Heidelberg, 1999.

Summary. The independence assumptions typically used to make speech recognition practical ignore the fact that different sounds in speech are highly correlated. Tree-structured dependence models make it possible to represent cross-class acoustic dependence in recognition when used in conjunction with hidden Markov or other such models. These models have Markov-like assumptions on the branches of a tree, which lead to efficient recursive algorithms for state estimation. This paper describes general approaches to topology design and parameter estimation of tree-based models and outlines more specific solutions for two examples: discrete-state hidden dependence trees and continuous-state multiscale models, drawing analogies to results for time series models. Initial results for both cases are described, followed by a discussion of questions raised by the experiments.

1. Introduction

In speech recognition, independence assumptions are typically made to reduce the complexity of automatic training and the recognition search. In particular, a standard assumption used in virtually all recognition systems is that each vector or segment is generated independently given an underlying state or phone sequence. In other words, in a speaker-independent system, there is no notion that an /aa/ and an /ah/ (or even another /aa/) in the same utterance or speaker session have something in common because they came from the same vocal tract. The assumption effectively allows two phones at different times in an utterance to come from different speakers. Vocal tract length (VTL) normalization (e.g. [1]) compensates for this problem to some extent, but it is clear that VTL normalization does not account for all speaker-dependent effects, because gains are additive when it is used in combination with acoustic model adaptation. In addition, sounds can be correlated for other reasons, such as the recording environment or dialect-related pronunciation patterns.

Acoustic model adaptation is used to overcome this problem, but in large vocabulary recognition one often has very little data from the target speaker with which to adapt possibly millions of parameters. Therefore, most current adaptation techniques assume classes of models over which the adaptation transformation is tied (e.g. [2, 3]), or they approximate the joint dependence of different speech sounds by defining regions of local dependence ([3, 4]). However, such approaches do not take full advantage of the predictive power that observations from one phone have for another.

Ignoring the speech recognition application for the moment, our problem is to estimate a probability distribution that represents the joint dependence of variables in a very high dimensional space and to use this distribution to make inferences about missing variables. We refer to such a probability distribution as a dependence model. For such a model to be practical, Markov-like assumptions are required; some examples include Markov random fields and Bayesian networks. In this work, we focus on a particular class of Markov dependence models that are tree-based, because of the additional efficiency of estimation and prediction algorithms and because topology design is simpler and arguably more robust for trees.

The remainder of the chapter is organized as follows. In the next section, we introduce a general hidden tree framework that handles the type of variable-length observations encountered in speech applications. Then, we describe two specific examples of dependence models and how they can be used in speech processing applications. We conclude with a brief summary and discussion of open questions.

2. Hidden Tree Framework

The problem of acoustic modeling of speech sounds includes the important issue of characterizing variable-length observations. A speech sound may occur a different number of times (or not at all), and each instance has a randomly varying length. We handle this problem in the dependence model by defining a fixed-dimension hidden state $x = [x_1, \ldots, x_N]$ that represents the joint dependence of speech sounds $x_i$, which are associated with observations $y = \{y_i;\ i = 1, \ldots, N\}$. For a tree-based model, $x_i$ corresponds to a node in the tree, and $N$ is equal to the number of nodes in the tree. Each observation $y_i = [y_{i,1}, \ldots, y_{i,L_i}]$ is a concatenation of the $L_i$ different instances of sound $i$ (ignoring time order), and can be thought of as a random process with characteristics depending on the hidden state. Figure 1 illustrates a tree with a variable number of observations associated with each node. Depending on the underlying model of speech, $y_{i,j}$ may correspond to a frame of speech (i.e.
a vector of cepstral coefficients, the "observation" associated with an HMM state) or a segment trajectory (i.e. the vector of coefficients describing a trajectory of features, the sufficient statistics for the observation associated with a polynomial segment model [19]). The complete set of observations $y$ may correspond to a single utterance or a collection of utterances.

The probabilistic model is specified by: a function that defines the tree topology, e.g. $\pi(i)$ is the parent of node $i$; Markov state distributions associated with branches in the tree, $P(x_i \mid x_{\pi(i)})$; and observation distributions $p(y_i \mid x_i)$, assuming that observations are conditionally independent given the state. The conditional independence assumptions, like those in a hidden Markov model (HMM), make implementation practical. The differences with respect to an HMM are that dependence is between un-ordered sound classes rather than sequential states in time, and that the state sequence is a fixed-length vector rather than a random process.

As with an HMM, there are three problems to solve in applying the dependence model: optimal state estimation, efficient computation of the likelihood $p(y)$, and model design (i.e. topology design and distribution parameter estimation). These problems are solved using recursive algorithms that take advantage of assumptions of conditional independence of nodes and subtrees given the value of an intermediate node. The solutions are analogous to the corresponding HMM algorithms but differ in that the updates follow the tree structure rather than a linear time sequence. Details of the algorithms depend on the particular state and observation distribution assumptions, but there are some issues that apply in general, as described below.

Fig. 1. Illustration of a tree-structured model with a hidden state: open circles indicate the nodes of the tree that form the hidden state $x$; filled triangles denote the observations $y$ associated with a node.

There are two main types of tree topologies that could describe a collection of sound classes, and the relationship between the number of nodes in the tree $N$ and the number of sound classes $M$ depends on the particular type of topology. At one extreme is the graph of connections between sound classes, where every node in the graph corresponds to one of a disjoint collection of classes (one $x_i$ per class). In this case, $N = M$, and topology design involves finding the best graph that connects the classes. At the other extreme is a hierarchically organized tree, where the target sound classes comprise the leaves of the tree and sub-classes representing different levels of granularity are introduced at internal nodes. In this case, topology design involves clustering the $M$ classes. If sub-classes are defined using a binary tree, then $N = 2M - 1$. Hybrid versions can be envisioned and will probably be the most effective solution, in part because the introduction of sub-classes is a useful tool for robust topology design when $M$ is large.

Parameter estimation for dependence models with a hidden tree structure must address the problem of unobserved variables in estimation, which is generally solved using the Expectation-Maximization (EM) algorithm [5].
The algorithm involves two steps: 1) finding the expected joint log-likelihood of the hidden state and observations given the current parameter estimates, and 2) computing the maximum likelihood estimate of the parameters in terms of the statistics found in step (1). Making Markov assumptions on the tree, the first step can be implemented efficiently with an algorithm analogous to the forward-backward algorithm used in HMM parameter estimation, except that it runs upward and downward on the tree rather than forward and backward in time.

In the two sections that follow, we introduce two different examples of tree-based dependence models: the hidden dependence tree, which represents a discrete-valued hidden state using the disjoint-class topology; and the multiscale model, which represents a continuous-valued state using a hierarchical topology. Each section first describes the mathematical framework of the model and its application to speech recognition, followed by the algorithms for topology design and parameter estimation, and finally presents some experimental results.

3. Hidden Dependence Trees

Hidden dependence trees are an extension of the discrete dependence trees introduced by Chow and Liu [6] to efficiently model the dependence among a set of random variables. The dependence tree represents a discrete underlying state, and the extension allows for variable-length and continuous-valued observations.

3.1 The Mathematical Framework

In the discrete-state case, the joint probability function $P(x)$ is modeled using a dependence tree [6]. A component $x_i$ is assigned to node $i$ in the tree, and each edge in the tree is associated with the conditional probability function of the two variables connected by the edge, i.e. $P(x_i \mid x_j)$ for the edge connecting the node of $x_i$ to its parent $x_j$. The parent of node $i$, denoted by $\pi(i)$, is the first node on the path connecting node $i$ to the root. The root of the tree is associated with the component $x_0$, which is introduced for notational purposes and is not actually a component of $x$. The nodes $x_i$ connected to the root have $\pi(i) = 0$ as their parent, and edges $P(x_i \mid x_0)$ are defined to be $P(x_i)$. In other words, the $x_i$ with $\pi(i) = 0$ are independent and hence so are the respective subtrees associated with those nodes. The dependence tree state distribution model is

$$P(x) = \prod_{i=1}^{N} P(x_i \mid x_{\pi(i)}), \tag{1}$$

and the joint observation-state distribution is

$$p(y, x) = \prod_{i=1}^{N} p(y_i \mid x_i)\, P(x_i \mid x_{\pi(i)}), \qquad p(y_i \mid x_i) = \prod_{j=1}^{L_i} p(y_{i,j} \mid x_i), \tag{2}$$

assuming that $\{y_{i,j}\}$ are conditionally independent and identically distributed given the state. (Note that upper case $P$ is used to denote a probability mass function, and lower case $p$ a density function.)
As in an HMM, the probability of a set of observations is computed by summing over the possible state vectors:

$$p(y) = \sum_{x} \prod_{i=1}^{N} p(y_i \mid x_i)\, P(x_i \mid x_{\pi(i)}). \tag{3}$$

The sum can be computed efficiently using a recursive algorithm that incorporates probabilities from the leaves upward to the root of the tree, analogous to the forward algorithm for HMMs.
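The upward recursion for Equation 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the list-based data layout, and the toy probabilities are all hypothetical, and the per-node observation likelihoods $p(y_i \mid x_i = j)$ are assumed to be precomputed. A brute-force sum over all state vectors is included only to check the recursion on a small tree.

```python
from itertools import product

def tree_likelihood(parent, prior, trans, obs_lik):
    """Upward pass for p(y) = sum_x prod_i p(y_i|x_i) P(x_i|x_pi(i)).

    parent[i]  : parent index of node i, or None if i hangs off the dummy root x_0
    prior[i]   : P(x_i = j) for nodes whose parent is None (unused otherwise)
    trans[i]   : trans[i][k][j] = P(x_i = j | x_parent = k)
    obs_lik[i] : p(y_i | x_i = j) for each j (all 1.0 if node i has no observations)
    """
    n = len(parent)
    children = [[] for _ in range(n)]
    for i, p in enumerate(parent):
        if p is not None:
            children[p].append(i)

    def beta(i):
        # beta_i(j) = p(all observations in the subtree of i | x_i = j)
        b = list(obs_lik[i])
        for c in children[i]:
            bc = beta(c)
            for j in range(len(b)):
                b[j] *= sum(trans[c][j][m] * bc[m] for m in range(len(bc)))
        return b

    # Subtrees attached to the dummy root are independent, so their terms multiply.
    total = 1.0
    for i, p in enumerate(parent):
        if p is None:
            total *= sum(pj * bj for pj, bj in zip(prior[i], beta(i)))
    return total

def brute_force_likelihood(parent, prior, trans, obs_lik):
    """Direct evaluation of the sum over all state vectors (exponential cost)."""
    total = 0.0
    for x in product(*(range(len(o)) for o in obs_lik)):
        term = 1.0
        for i, p in enumerate(parent):
            term *= obs_lik[i][x[i]]
            term *= prior[i][x[i]] if p is None else trans[i][x[p]][x[i]]
        total += term
    return total

# Toy three-node tree (made-up numbers): node 0 is attached to the dummy root,
# nodes 1 and 2 are its children, and node 2 has no observations.
parent = [None, 0, 0]
prior = [[0.6, 0.4], None, None]
trans = [None, [[0.9, 0.1], [0.2, 0.8]], [[0.7, 0.3], [0.4, 0.6]]]
obs_lik = [[0.5, 0.1], [0.3, 0.2], [1.0, 1.0]]
likelihood = tree_likelihood(parent, prior, trans, obs_lik)
```

The recursion visits each edge once, so its cost grows linearly with the number of nodes, whereas the brute-force sum grows exponentially with it.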
3.2 Application to Speech

The dependence tree state is hidden in the same sense that the mode of a Gaussian mixture distribution is hidden; observations are continuous-valued cepstral features described by Gaussian distributions conditioned on the hidden state. The difference is that the dependence tree state is vector-valued, unlike the scalar mode of a Gaussian mixture distribution. An HMM that uses Gaussian mixture observation distributions also has a multi-dimensional state, but there are important differences with respect to dependence trees. The HMM state sequence is variable-length and time-ordered. In the hidden dependence tree, on the other hand, the state dimension and order are fixed, and there is no notion of time. The state probability distributions in an HMM, $a_{ij} = P(s_t = i \mid s_{t-1} = j)$, describe sequence length and ordering, while the state probability distributions in a dependence tree, $a_{i,jk} = P(x_i = j \mid x_{\pi(i)} = k)$, describe the relationship between the values of states in a fixed order.

The analogy of the hidden dependence tree to a Gaussian mixture and the differences with respect to an HMM suggest an application of the dependence tree in acoustic modeling. Consider an HMM that uses Gaussian mixture distributions. Let $x_i = j$ in the dependence tree indicate that the mixture mode of HMM state $i$ is $j$. Then the hidden dependence tree provides a model for correlation of the mixture modes across sound classes.

With this interpretation, one can envision different applications of the dependence model used in conjunction with an HMM. Assume that an HMM is first used to provide a "transcription" and segmentation of an utterance in terms of the $N$ sound classes in the dependence tree. The transcription is used to group the observations into subsets $y_i$, and the hidden dependence tree model is then used to compute $p(y)$. This probability can be used in a likelihood ratio test of whether two segments of speech came from one vs.
two speakers (or in other text-independent speaker/language identification problems), or as an additional "consistency score" in N-best rescoring of hypotheses for word recognition. Alternatively, the observations can be used to re-estimate mixture weights, i.e. to set the $j$-th mixture weight associated with state $i$ to $P(x_i = j \mid y)$, for use in a subsequent decoding pass. This probability is computed using the upward-downward algorithm used for state estimation in model design, described next.

3.3 Topology Design and Parameter Estimation

In this discussion, we will assume a non-hierarchical topology for the dependence tree structure; that is, the classes represented by the tree are disjoint. For the case where $x$ is discrete-valued and fully observable, Chow and Liu [6] describe an algorithm for estimating both the dependence tree structure and its parameters. In our case, where both the tree and subsets of the observations $y_i$ are unobserved, we divide topology design and parameter estimation into two steps. However, we build on the Chow-Liu algorithm by defining an intermediate, partially observable discrete state, as described below.

Class Definition and Topology. Topology design requires finding $\pi(i)$ for all nodes $i = 1, \ldots, N$. The Chow and Liu algorithm finds the tree topology that minimizes the difference of the information contained in the true probability function and
that contained in its approximation by a dependence tree. This minimization criterion is equivalent to maximizing the total weight on the edges of the tree, where the weight of the edge connecting nodes $x_i$ and $x_j$ is the mutual information $I(x_i; x_j)$ based on relative frequency estimates of their joint probability distribution. Given all possible $I(x_i; x_j)$, topology design is a maximum-weight spanning tree search problem. The Chow and Liu algorithm works well when the samples of the vector $x$ are complete, meaning that all the components of samples are observed, and when the number of samples of the vector $x$ is large relative to the number of values an $x_i$ can take on. When there are a small number of samples for a pair of variables, the mutual information estimate is biased above the true value, so $(x_i, x_j)$ pairs that are infrequently observed may be incorrectly assigned links in the tree.

In order to use the Chow-Liu algorithm, we estimate a discrete state vector $x$ for each training sample by setting $x_i$ equal to the index of the vector quantization (VQ) codeword that minimizes the total distortion of the observations $y_{i,j} \in y_i$. In order to keep the number of values for $x_i$ small and still have a reasonable sampling of the vector observation space, node-dependent codebooks are designed. Assuming the tree nodes correspond to phonetic or sub-phonetic units, there will be some missing elements of the estimated state vectors, because of the wide variation in frequencies of occurrence of different phonemes. This imbalance of phone-pair occurrence rates can lead to bad estimates of the mutual information and poor tree topologies. To obtain robust trees, we modified the Chow-Liu algorithm to include a threshold on the number of co-occurrences of every phone pair for allowing a link between the pair and a limit on the number of connections in the tree.
In addition, we used random sampling to obtain speaker-level state vectors to reduce the number of missing elements relative to utterance-level vectors.

Parameter Estimation. Two sets of distribution parameters are needed to characterize the hidden dependence tree: the tree edge distributions $P(x_i \mid x_{\pi(i)})$ and the observation distributions $p(y_i \mid x_i = j) \sim N(\mu_{ij}, \Sigma_{ij})$, where $N(\mu, \Sigma)$ denotes a Gaussian with mean $\mu$ and covariance $\Sigma$. The above topology design gives an initial estimate for the edge distributions. Initial estimates for $\mu_{ij}$ and $\Sigma_{ij}$ are given by the VQ codewords and associated error covariances.

Given an initial estimate, the parameters can be refined based on the actual observations using the iterative EM algorithm. As mentioned earlier, the expectation step involves a recursive upward-downward algorithm that is analogous to the forward-backward algorithm used in HMM parameter estimation. If the VQ-estimated observations of $x$ are available, then the tree edge distributions can be estimated using the upward-downward algorithm for discrete dependence trees [7], which is an extension of Pearl's algorithm for belief propagation in causal trees [8] and a special case of the more general algorithm for Bayesian networks described by Lucke [9]. To estimate both the edge distributions and the observation distributions, the upward-downward algorithm is extended to use the observations $y_i$ in the upward step during the update of node $i$, and parameter re-estimation for the node-dependent Gaussians is added to the maximization step [10]. The complete parameter estimation algorithm is similar to that used for HMMs with Gaussian mixture distributions, with the difference being added complexity due to tree-structured rather than time-based dependence.
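The topology step of Section 3.3 can be sketched as follows: estimate pairwise mutual information from the VQ-labeled state vectors, discard pairs with too few co-occurrences, and grow a maximum-weight spanning forest. This is an illustrative sketch under assumptions that are not taken from the paper: missing elements are encoded as `None`, the `min_count` parameter stands in for the co-occurrence threshold, Kruskal's algorithm is used for the spanning-tree search, pairs with zero estimated mutual information are left unlinked, and the limit on the number of connections is omitted.

```python
import math
from collections import Counter

def mutual_information(samples, a, b):
    """Relative-frequency estimate of I(x_a; x_b) over samples where both are observed.

    Returns (mutual information in nats, number of co-occurrences).
    """
    pairs = [(s[a], s[b]) for s in samples if s[a] is not None and s[b] is not None]
    n = len(pairs)
    if n == 0:
        return 0.0, 0
    joint = Counter(pairs)
    count_a = Counter(u for u, _ in pairs)
    count_b = Counter(v for _, v in pairs)
    mi = sum((c / n) * math.log(c * n / (count_a[u] * count_b[v]))
             for (u, v), c in joint.items())
    return mi, n

def chow_liu_edges(samples, n_nodes, min_count=1):
    """Maximum-weight spanning forest over mutual-information edge weights."""
    candidates = []
    for a in range(n_nodes):
        for b in range(a + 1, n_nodes):
            mi, n = mutual_information(samples, a, b)
            # Hypothetical robustness rule: require enough co-occurrences.
            if n >= min_count and mi > 0.0:
                candidates.append((mi, a, b))
    candidates.sort(reverse=True)           # heaviest edges first (Kruskal)

    root = list(range(n_nodes))             # union-find forest
    def find(i):
        while root[i] != i:
            root[i] = root[root[i]]
            i = root[i]
        return i

    edges = []
    for mi, a, b in candidates:
        ra, rb = find(a), find(b)
        if ra != rb:                         # adding the edge creates no cycle
            root[ra] = rb
            edges.append((a, b))
    return edges

# Toy data (made up): x_0 and x_1 always agree, x_2 is unrelated to both.
samples = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
edges = chow_liu_edges(samples, n_nodes=3)
```

Nodes left without an edge would be attached to the dummy root $x_0$, which makes them independent of the rest of the tree, consistent with the convention of Section 3.1.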
3.4 Experiments

Experiments assessing various methods for topology training and the usefulness of the hidden dependence model were conducted on two large vocabulary continuous speech recognition tasks using the Wall Street Journal (WSJ) [11] and the Switchboard [12] corpora, in both cases training on roughly 120 hours of speech. The WSJ corpus is based on read business news, and the Switchboard corpus comprises telephone-quality conversational speech on a variety of topics. The feature vectors included 14 mel-warped cepstra (no derivatives), computed at a 10 ms frame rate using cepstral mean subtraction. In the experiments on Switchboard, the features were also normalized to compensate for vocal tract length [1]. The state vector $x$ associated each node with a phone (i.e. context-independent models), so its dimension equaled the number of phones, and the dependence tree models therefore used only the frames in the center of each phone segment to minimize coarticulation influences.

In development of the topology design approach, we evaluated the performance of the dependence tree models by computing the likelihood of an independent test set. The results showed that the dependence tree performed better than an independent-phone model, and that constraints on the topology of the tree generally improved performance. In addition, the automatically designed dependence tree outperformed a tree that had been specified by hand according to manner of articulation and other differences in articulatory features. As an example, the tree topology designed on the Switchboard corpus is given in Figure 2, illustrating learned dependence that reflects articulation manner (e.g. among fricatives and nasals) but also some connections that were probably dominated by co-articulation effects (e.g. /aa/-/er/, since /aa/ is often followed by /r/).

We evaluated the performance of the hidden dependence tree model by using the likelihood $p(y)$ as an additional score in N-best rescoring experiments.
In these experiments, an HMM-based recognition system from BBN [13] provided an N-best list of hypotheses (N = 100) for all the utterances in the test set, along with an HMM acoustic score and a trigram language model score for each hypothesis. These hypotheses were rescored by the hidden dependence tree model. A linear combination of these scores plus word and phone counts (insertion penalties) was used for re-ranking the list of hypotheses and producing the final recognized output. The weights of the different scores were optimized on a development test set. The dependence tree model used in this experiment was a gender-dependent model with 10 node-dependent codewords per phone and a constrained tree.

Table 1 shows the results of these experiments. There is a slight improvement when combining the likelihood score obtained from the dependence tree model with the HMM score. Further gains should be obtained by using more detailed sound classes, such as triphone states, but the resulting dependence tree would be large and would likely require a hybrid hierarchical and Chow-Liu topology design strategy.
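The re-ranking step described above can be sketched as a weighted sum of per-hypothesis scores. The sketch below is only an illustration: the dictionary keys, the weights, and the toy scores are hypothetical values chosen for the example, not numbers from the experiments, and the weight optimization on a development set is not shown.

```python
def rerank(nbest, weights):
    """Sort hypotheses by a linear combination of their knowledge-source scores.

    nbest   : list of dicts, each holding one log-score or count per knowledge source
    weights : dict mapping a knowledge-source name to its combination weight
    """
    def combined(hyp):
        return sum(w * hyp[name] for name, w in weights.items())
    return sorted(nbest, key=combined, reverse=True)

# Two made-up hypotheses with HMM, language model (LM) and dependence tree (DT)
# log-scores, plus word and phone counts that act as insertion penalties.
nbest = [
    {"words": "a b", "hmm": -100.0, "lm": -10.0, "dt": -50.0, "n_words": 2, "n_phones": 5},
    {"words": "a c", "hmm": -101.0, "lm": -10.0, "dt": -40.0, "n_words": 2, "n_phones": 5},
]
without_dt = {"hmm": 1.0, "lm": 1.0, "n_words": 0.0, "n_phones": 0.0}
with_dt = dict(without_dt, dt=0.2)

best_without_dt = rerank(nbest, without_dt)[0]["words"]
best_with_dt = rerank(nbest, with_dt)[0]["words"]
```

In this toy case the dependence tree score is large enough to change which hypothesis is ranked first; in practice the effect depends on the weights found on the development set.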
Fig. 2. Discrete dependence tree designed on the Switchboard corpus, where subtrees connected to the root node (indicated by "@") are independent.

Table 1. N-best rescoring results (word error rates) on the 1993 WSJ evaluation test and the 1996 SWBD evaluation test. The knowledge sources are the HMM acoustic score, the dependence tree score (DT), and a trigram language model (LM). Rows (knowledge sources): HMM, LM; HMM, DT, LM. Columns: WSJ Eval93; SWBD Eval.

4. Multiscale Tree Processes

Multiscale stochastic processes represent an important class of models, of which a particularly useful subclass is based on scale-recursive dynamics on trees [14, 15]. These models allow efficient algorithms for both estimation and likelihood calculation, resulting in a variety of applications. In this section, we describe the general framework and its application to acoustic model adaptation in speech recognition.

4.1 The Mathematical Framework

Denoting a node in a tree by $t$ with parent $t\bar{\gamma}$ (the notation $t\bar{\gamma}$ represents the same information as $\pi(i)$ for the hidden dependence tree; the two notations are used to be consistent with other literature in the respective areas), a state-space model for the evolution in the tree of the Gaussian process $x$ and its noisy observation $y$ is given by
$$x_t = A_t x_{t\bar{\gamma}} + w_t \tag{4}$$

$$y_{t,i} = C_t x_t + v_{t,i} \tag{5}$$

where $x_t$ is the vector state of the process at node $t$. The root node state $x_0$ has distribution $N(0, \Sigma_0)$. The process noise $w_t$ is white, independent of $x_0$, and has distribution $N(0, Q_t)$. The state $x_t$ is observed via a noisy measurement $y_{t,i}$, where the measurement noise $v_{t,i}$ is white, independent of $x_0$ and $w_t$, and has distribution $N(0, R_t)$. Thus, $\theta_x = (\Sigma_0, \{A_t, Q_t\})$ are the parameters of $p(x)$, and $\theta_{y|x} = (\{C_t, R_t\})$ are the parameters of $p(y \mid x)$. The zero-mean assumptions of the root node $x_0$ and the noise terms are not a requirement of the model, but result in simpler estimation equations.

A degenerate tree with only one leaf node (parent nodes have only one child) can be interpreted as a standard linear dynamical system, i.e. having a time-like index. As an acoustic model for speech recognition, the standard dynamical system is a continuous-state alternative to the discrete-state HMM, where likelihood is computed using Kalman filtering recursions to obtain innovations and associated distribution parameters [16]. A similar approach can be used for the multiscale model, extending the Kalman recursions on the tree [17].

For the adaptation application, state estimation is more important than likelihood computation. Given $Y$, the set of all available observations, the smoothed estimate (see footnote 2) of the state $\hat{x}_t = E\{x_t \mid Y\}$ and the associated error covariance $P_{t|Y} = E\{[x_t - \hat{x}_t][x_t - \hat{x}_t]^T\}$ are computed using a generalization of the Rauch-Tung-Striebel (RTS) algorithm [14]. Smoothing is done in two sweeps: an upward sweep from the leaves to the root, followed by a downward one from the root to the leaves. The complexity of the tree RTS smoother is $O(d^3 N)$, where $d$ is the dimensionality of the state (the $d^3$ is due to matrix inversions) and $N$ is the number of nodes in the tree.
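The two-sweep structure can be illustrated with a scalar special case of Equations 4 and 5. The sketch below is not the tree RTS algorithm of [14]; it is a minimal stand-in that assumes a scalar state, $C_t = 1$, and parameters $a$, $q$, $r$ shared by all nodes, and it obtains the posterior means and variances by Gaussian elimination on the tree (leaves to root, then root to leaves). The function name, argument layout, and toy numbers are hypothetical.

```python
def tree_smoother(parent, obs, a, q, r, s0):
    """Posterior mean and variance of every node state in a scalar tree model.

    Model: x_root ~ N(0, s0); x_t = a * x_parent + w_t, w_t ~ N(0, q);
           y_{t,i} = x_t + v_{t,i}, v_{t,i} ~ N(0, r).
    parent[t] : index of the parent of node t (None for the root); parents must
                have smaller indices than their children.
    obs[t]    : list of scalar observations at node t (may be empty).
    """
    n = len(parent)
    # Precision (information) form of the joint Gaussian over all node states.
    J = [(1.0 / s0 if parent[t] is None else 1.0 / q) + len(obs[t]) / r
         for t in range(n)]
    h = [sum(obs[t]) / r for t in range(n)]
    for t in range(n):
        if parent[t] is not None:
            J[parent[t]] += a * a / q
    c = -a / q                      # off-diagonal precision between child and parent

    # Upward sweep: eliminate each node into its parent, leaves first.
    for t in reversed(range(n)):
        p = parent[t]
        if p is not None:
            J[p] -= c * c / J[t]
            h[p] -= c * h[t] / J[t]

    # Downward sweep: back-substitute from the root to the leaves.
    mean = [0.0] * n
    var = [0.0] * n
    for t in range(n):
        p = parent[t]
        if p is None:
            mean[t] = h[t] / J[t]
            var[t] = 1.0 / J[t]
        else:
            mean[t] = (h[t] - c * mean[p]) / J[t]
            var[t] = 1.0 / J[t] + (c / J[t]) ** 2 * var[p]
    return mean, var

# Toy tree: a root with two leaves; only leaf 1 has an observation (y = 3.0).
means, variances = tree_smoother([None, 0, 0], [[], [3.0], []],
                                 a=1.0, q=1.0, r=1.0, s0=1.0)
```

In this toy case the unobserved leaf receives a nonzero estimate purely through its correlation with the observed leaf, which is the behavior exploited for adaptation, and its larger posterior variance reflects the absence of direct data.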
Footnote 2: The term "smoothed estimate" refers to the linear least squares (or for Gaussians, the minimum mean squares) estimate. It also corresponds to the maximum a posteriori estimate.

4.2 Application to Speech

The multiscale model can be used for the adaptation of means of acoustic models to a new speaker or new environmental conditions. For example, let each leaf $\tau$ of the multiscale tree be associated with a set of Gaussians $Y_\tau$, and adapt the means of all Gaussians in class $\tau$ by a common shared shift $x_\tau$:

$$\hat{\mu}_i = \mu_i + x_\tau, \quad \forall i \in Y_\tau, \tag{6}$$

where $\hat{\mu}_i$ denotes the mean $\mu_i$ after adaptation. Such a shared shift approach has been used for Gaussians in hidden Markov models (HMMs) [3] and the stochastic segment model (SSM) [18], and for polynomial segment models (PSMs) [19]. The observations $y_{\tau,i} \in Y_\tau$ associated with node $\tau$ are differences between the speaker-independent means $\mu_i$ and the average of feature vectors observed for sound $i \in Y_\tau$ in an utterance.
Initial independent estimates for the shift $x_\tau$ and associated error covariance $P_\tau$ can be obtained from adaptation data for each class $\tau$ by averaging the observed shifts for that class ($y_{\tau,i} \in Y_\tau$) and computing the equivalent covariance of the averaged variable [19]. Let us define a Gaussian tree-based shift process (Equation 4) with $M$ leaves, and associate the leaf node states with the shifts of the $M$ classes we wish to model dependence between. Given $x_\tau$ and $P_\tau$ at a subset of leaves, adaptation involves estimating the hidden states $x_\tau$ for all leaves using the tree RTS smoother and then shifting the models within the respective classes. Due to the Bayesian nature of the estimate, the smoothed shift approaches the unsmoothed shift and converges to the standard ML speaker-dependent estimate as the amount of adaptation data increases.

A similar tree-based model for adaptation is described in [20], but the upward-downward propagation of mean shifts is based on a heuristic that does not account for degree of correlation or variance differences. Another Markovian model used in adaptation is based on Markov random fields [21]. Multiscale models offer a number of advantages over Markov random fields, including a constant per-node complexity, the availability of an error covariance associated with smoothing, and the fact that state estimation algorithms are efficient, non-iterative, recursive and parallelizable.

4.3 Topology Design and Parameter Estimation

To use the multiscale model of dependence in adaptation, we need to define the adaptation classes and tree topology, as well as estimate parameters for the process.

Class Definition and Topology. In continuous speech recognition, context-dependent models are frequently clustered in the form of a tree for each region (or state) of a phone using ML clustering of Gaussians [22, 23]. Figure 3(a) illustrates the tree for one region.
Each node of the tree represents an equivalence class of triphones. Nodes at a certain "cut" through the tree, the boxes in Figure 3(a), define terminal adaptation classes that share shifts (Equation 6). One popular, but ad hoc, option for adaptation is the "back-off" strategy, where the shift is computed at the most detailed node that has more than some threshold of adaptation frames and is copied to all child terminal adaptation classes, as shown in Figure 3(b). The topology of the clustering tree can also be used for multiscale smoothing. Class-dependent shifts $x_t$ are computed at the terminal adaptation nodes and then smoothed using the multiscale model (Figure 3(c)) to get shift estimates $\hat{x}_t$ at all nodes.

Parameter Estimation. Maximum-likelihood estimates of the parameters of the tree process $(\Sigma_0, A_t, Q_t, C_t, R_t)$ can be obtained by applying the RTS and EM algorithms to multiple independent sample vectors $Y$ [24], where each conversation side contributes one sample. The general approach follows that described in [16] for a time-ordered dynamical system, which involves iteratively finding expected sufficient statistics of the hidden state (E-step), and then using multivariate regression to compute new process parameters (M-step). The main difference is in the recursions used in the E-step, which build on the tree algorithms developed in [14, 25]. Here, we assume $C_t = I$ and $R_t$ is effectively given by the sample variance of the observations, so there is no need to estimate $C_t$ and $R_t$. In the experiments described
next, the $A$ and $Q$ parameters are shared among all nodes of a phone; i.e. for $K$ phones, there are $K$ trees, each with an $(A, Q)$ pair. To start the EM iterations, we need initial estimates of $\Sigma_0$, the $A$'s and the $Q$'s. For each speaker in the training set, we compute covariances of ML (unsmoothed) shifts at each terminal shift node. A frequency-weighted average of these covariances across all speakers is used for initializing $\Sigma_0$ and all $Q_t$, and initial $A_t = I$ for all $t$.

Fig. 3. Trees used for adaptation: (a) shows the clustering tree with squares indicating terminal adaptation classes, (b) illustrates the "back-off" method of adaptation with dashed squares indicating back-off classes, and (c) shows the corresponding multiscale smoothing approach. In both (b) and (c), triangles indicate observations.

4.4 Experiments

Experiments were conducted on the Switchboard corpus. The feature vectors were the same as for the hidden dependence tree experiments, except that energy and feature derivatives were also used. N-best rescoring was also used here, with the segment-model acoustic score substituted for the HMM score. Most experiments used sixty hours of speech for training the acoustic and multiscale (MS) models; 123 hours were used in the guided adaptation experiments. The PSM systems used a 2-region model, with each region modeled by a linear trajectory Gaussian process with a single full covariance. The SSM systems used a 5-region model, with each region represented by a full covariance Gaussian. Both cases used gender-dependent models and ML clustered triphones. The PSM and SSM adaptation systems had 300 and 150 terminal adaptation classes per region, respectively. For a fair comparison of MS smoothing vs. back-off approaches, the same topology was used for both types of adaptation.

In batch-mode adaptation, the first half of each conversation is used as adaptation data and the second half for testing.
The results in Table 2 for supervised adaptation indicate that MS smoothing is better than the back-off approach. However, performance for both algorithms degrades relative to the baseline in unsupervised adaptation, indicating a sensitivity to the high error rate in the Switchboard tasks. Consequently, we use guided adaptation in further unsupervised experiments.

In unsupervised transcription-mode adaptation, two passes are made over the speech: the first to collect statistics for adaptation, and the second to perform recognition with the adapted models. Adaptation is guided in the sense that we adapt only
12 Tree-based Dependence Models 51 Table 2. Supervised batch recognition with 2-region PSMs. Error rates are on the second half of conversations in the Dev96 test set. SI baseline 44.5% with data from the subset of words in the top first-pass hypothesis with confidence over a specified threshold. This serves to lower the error-rate for the speech used in adaptation, which benefits both ML and MS adaptation. Guided adaptation also tests the generalization capability of the dependence model for unseen classes (in the "incorrect" parts of the speech). In experiments on the Dev97 test set with a 5-region SSM, we found that the ML back-off approach improved a 40.9% WER baseline 3 to 40.4% and that further improvement to 40.0% was obtained with MS smoothing. Table 3 shows that MS adaptation gives about 1 % absolute improvement in performance. A much greater gain is expected from using lattice decoding rather than N-best rescoring, based on BBN adaptation experiments [26]. Table 3. Guided unsupervised transcription mode adaptation with a 5-region SSM system Dev97 Eval97 Baseline MS adapt Discussion In summary, tree-based models of dependence provide an efficient framework for representing correlation across phones in speech (or sub-phonetic units represented by HMM states), for use in adaptation as well as other applications. Dependence models are a supplement to and not a replacement for existing techniques, such as HMMs, in that they model correlation across classes but not time. Markov-like assumptions combined with a tree structure make for efficient algorithms for computing the expected state given a set of observations. Dependence model design involves first finding the tree topology, which can be an direct connection of classes or a hierarchy of sub-classes, and then EM parameter estimation using an upwarddownward algorithm for handling the hidden state. 
Two important examples are described: the hidden dependence tree, which has a discrete hidden state and can be thought of as a mechanism for loosely coupling Gaussian mixture modes of different models; and the multiscale model, which has a continuous hidden state and relates models via a hierarchy with different levels of granularity. The two approaches differ primarily in the discrete vs. continuous hidden state, but they also illustrate two extremes of topology design. It is an open question as to which of the two models is more useful: the hidden dependence tree is better at capturing non-linear dependence between classes, but the mixture mode dependence may be too weak a coupling.

Initial results for both models are promising, but much work remains to explore their full potential. For example, the speaker adaptation experiments did not take advantage of several variations known to improve results, such as speaker-adaptive training [27] and iterative adaptation and decoding.

Several questions are raised by the initial experimental results, particularly related to topology design and parameter tying. How can we best integrate the mutual information clustering technique, which is more general but not very robust, with hierarchical clustering techniques? What is the right number of classes to represent with the tree? For adaptation, it is theoretically better to use a large number of classes, but in practice we do not find this to be the case, probably because of inaccuracies in the model exacerbated by parameter tying assumptions. In the multiscale model experiments, we assumed that all branches of the tree for a particular phone shared the same transition matrix and process noise covariance. Is it possible to learn finer-grained parameter sharing automatically? Of course, there is also the question of whether better results can be obtained by relaxing the tree-structure assumption and using less restrictive models such as Bayesian networks. However, it is likely that the computational efficiency of the tree structure will make the tree-based dependence models more attractive in the near term.

Footnote 3. The lower baseline error rate is due to differences in the test set, language model, signal processing parameters, and training on 123 hours of speech.

Acknowledgement. This work was supported by the United States DoD, grant ONR-N00014-92-J

References

[1] E. Eide and H. Gish, "A parametric approach to vocal tract length normalization," Proc. Inter. Conf. on Acoust., Speech and Signal Proc., vol. 1, pp , May
[2] C. J. Leggetter and P. C. Woodland, "Flexible Speaker Adaptation Using Maximum Likelihood Linear Regression," Proc. ARPA Workshop on Spoken Language Technology, pp , January
[3] G. Zavaliagkos, R. Schwartz, J. McDonough, and J. Makhoul, "Adaptation algorithms for large scale HMM recognizers," Proc. European Conference on Speech Comm. and Tech., vol. 2, pp. 1131-1134, September
[4] Q. Huo and C.-H. Lee, "On-line adaptive learning of the correlated continuous density hidden Markov models for speech recognition," Proc. Inter. Conf. on Acoust., Speech and Signal Proc., vol. 2, pp , May
[5] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood estimation from incomplete data," Journal of the Royal Statistical Society (B), vol. 39, no. 1, pp. 1-38,
[6] C. K. Chow and C. N. Liu, "Approximating discrete probability distributions with dependence trees," IEEE Trans. Information Theory, vol. IT-14, no. 3, pp , May
[7] O. Ronen, J. R. Rohlicek, and M. Ostendorf, "Parameter estimation of dependence tree models using the EM algorithm," IEEE Signal Processing Letters, vol. 2, no. 8, pp , 1995.
[8] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann, San Mateo, CA,
[9] H. Lucke, "Which stochastic models allow Baum-Welch training?" IEEE Trans. Signal Proc., vol. 44, no. 11, pp ,
[10] O. Ronen, Dependence tree models of intra-utterance phone dependence, Boston University Ph.D. Thesis,
[11] F. Kubala et al., "The hub and spoke paradigm for CSR evaluation," Proc. of the ARPA Human Language Technology Workshop, pp , March
[12] J. J. Godfrey, E. C. Holliman, and J. McDaniel, "SWITCHBOARD: Telephone speech corpus for research and development," Proc. Inter. Conf. Acoust., Speech, and Signal Proc., vol. 1, pp , March
[13] L. Nguyen et al., "The 1994 BBN/BYBLOS speech recognition system," Proc. of the ARPA Spoken Language Systems Technology Workshop, pp , January
[14] K. C. Chou, A. S. Willsky, and A. Benveniste, "Multiscale recursive estimation, data fusion, and regularization," IEEE Trans. Automatic Control, vol. 39, no. 3, pp ,
[15] M. R. Luettgen, W. C. Karl, A. S. Willsky, and R. R. Tenney, "Multiscale representations of Markov random fields," IEEE Trans. Signal Proc., vol. 41, no. 12, pp ,
[16] V. Digalakis, J. R. Rohlicek, and M. Ostendorf, "ML estimation of a stochastic linear system with the EM algorithm and its application to speech recognition," IEEE Trans. Speech and Audio Proc., vol. 1, no. 4, pp ,
[17] M. R. Luettgen and A. S. Willsky, "Likelihood calculation for a class of multiscale stochastic models, with application to texture discrimination," IEEE Trans. Image Proc., vol. 4, no. 2, pp ,
[18] A. Kannan and M. Ostendorf, "Modeling Dependency in Adaptation of Acoustic Models using Multiscale Tree Processes," Proc. Eurospeech, vol. 4, pp ,
[19] A. Kannan and M. Ostendorf, "Adaptation of polynomial trajectory segment models for large vocabulary speech recognition," Proc. Inter. Conf. Acoust., Speech and Signal Proc., vol. 2, pp , April
[20] D. Paul, "Extensions to phone-state decision-tree clustering: single tree and tagged clustering," Proc. Inter. Conf. Acoust., Speech and Signal Proc., vol. 2, pp , April
[21] B. M. Shahshahani, "A Markov random field approach to Bayesian speaker adaptation," IEEE Trans. Speech and Audio Proc., vol. 5, no. 2, pp ,
[22] A. Kannan, M. Ostendorf, and J. R. Rohlicek, "Maximum likelihood clustering of Gaussians for speech recognition," IEEE Trans. Speech and Audio Proc., vol. 2, no. 3, pp ,
[23] S. J. Young, J. J. Odell, and P. C. Woodland, "Tree-based state tying for high accuracy acoustic modeling," Proc. ARPA Workshop on Human Language Technology, pp , March
[24] A. Kannan, M. Ostendorf, D. A. Castanon, and W. C. Karl, "ML parameter estimation of a multiscale tree process using the EM algorithm," Technical Report ECE , Boston University, November. Available from ftp://raven.bu.edu/pub/reports.
[25] M. R. Luettgen and A. S. Willsky, "Multiscale smoothing error models," IEEE Trans. Automatic Control, vol. 40, no. 1, pp ,
[26] G. Zavaliagkos, personal communication.
[27] T. Anastasakos, J. McDonough, R. Schwartz, and J. Makhoul, "A compact model for speaker-adaptive training," Proc. of the Inter. Conf. on Spoken Language Processing, vol. 2, pp , October 1996.
More informationPerformance Characterization in Computer Vision
Performance Characterization in Computer Vision Robert M. Haralick University of Washington Seattle WA 98195 Abstract Computer vision algorithms axe composed of different sub-algorithms often applied in
More informationApproximate Discrete Probability Distribution Representation using a Multi-Resolution Binary Tree
Approximate Discrete Probability Distribution Representation using a Multi-Resolution Binary Tree David Bellot and Pierre Bessière GravirIMAG CNRS and INRIA Rhône-Alpes Zirst - 6 avenue de l Europe - Montbonnot
More informationClustering. Shishir K. Shah
Clustering Shishir K. Shah Acknowledgement: Notes by Profs. M. Pollefeys, R. Jin, B. Liu, Y. Ukrainitz, B. Sarel, D. Forsyth, M. Shah, K. Grauman, and S. K. Shah Clustering l Clustering is a technique
More informationConditional Random Fields and beyond D A N I E L K H A S H A B I C S U I U C,
Conditional Random Fields and beyond D A N I E L K H A S H A B I C S 5 4 6 U I U C, 2 0 1 3 Outline Modeling Inference Training Applications Outline Modeling Problem definition Discriminative vs. Generative
More informationPitch Prediction from Mel-frequency Cepstral Coefficients Using Sparse Spectrum Recovery
Pitch Prediction from Mel-frequency Cepstral Coefficients Using Sparse Spectrum Recovery Achuth Rao MV, Prasanta Kumar Ghosh SPIRE LAB Electrical Engineering, Indian Institute of Science (IISc), Bangalore,
More informationHMM-Based Handwritten Amharic Word Recognition with Feature Concatenation
009 10th International Conference on Document Analysis and Recognition HMM-Based Handwritten Amharic Word Recognition with Feature Concatenation Yaregal Assabie and Josef Bigun School of Information Science,
More informationArtificial Intelligence. Programming Styles
Artificial Intelligence Intro to Machine Learning Programming Styles Standard CS: Explicitly program computer to do something Early AI: Derive a problem description (state) and use general algorithms to
More information