Speech Recognition Lecture 8: Acoustic Models. Eugene Weinstein Google, NYU Courant Institute eugenew@cs.nyu.edu Slide Credit: Mehryar Mohri

Speech Recognition Components Acoustic and pronunciation model: Pr(o | w) = Σ_{d,c,p} Pr(o | d) Pr(d | c) Pr(c | p) Pr(p | w). Acoustic model Pr(o | d): observation seq. given distribution seq. Pr(d | c): distribution seq. given CD phone seq. Pr(c | p): CD phone seq. given phoneme seq. Pr(p | w): phoneme seq. given word seq. Language model: Pr(w), distribution over word seq. 2

Acoustic Models Critical component of a speech recognition system. Different types: context-independent (CI) phones vs. context-dependent (CD) phones; speaker-independent vs. speaker-dependent. Complex design and training techniques in large-vocabulary speech recognition. 3

Context-Dependent Phones Idea: phoneme pronunciation depends on environment (allophones, co-articulation); model a phone in its context (Lee, 1990; Young et al., 1994). Context-dependent rules: ae/b _ d → ae_{b,d}. Context-dependent units: ae_{b,d}. Allophonic rules: t/V _ V → dx. Result: better accuracy. Complex contexts: regular expressions. 4

This Lecture Acoustic models Training algorithms 5

Continuous Speech Models Graph topology: 3-state HMM model for each CD phone (Rabiner and Juang, 1993). [Diagram: HMM for ae_{b,d} with states 0, 1, 2, 3; self-loops d0:ε, d1:ε, d2:ε on states 0, 1, 2 and transitions d0:ε (0→1), d1:ε (1→2), d2:ae_{b,d} (2→3).] Interpretation: beginning, middle, and end of the CD phone. Continuous case: transition weights based on distributions over feature vectors in R^N, typically with N = 39. 6
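As a concrete illustration of this topology (the transition probabilities below are placeholders, not values from the lecture), the left-to-right structure can be written as a small transition matrix:

```python
import numpy as np

# Sketch of a 3-state left-to-right HMM topology for one CD phone (e.g. ae_{b,d}).
# States 0, 1, 2 model the beginning, middle, and end of the phone; state 3 is the
# exit state. Each emitting state has a self-loop and a forward transition.
# The probabilities are placeholders, not values from the lecture.
num_states = 4
trans = np.zeros((num_states, num_states))
for s in range(3):
    trans[s, s] = 0.6      # self-loop: stay in the same sub-phone state
    trans[s, s + 1] = 0.4  # advance to the next sub-phone state
# In the continuous case, each emitting state also carries a distribution
# (e.g. a GMM) over feature vectors in R^N, typically N = 39.
print(trans)
```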

Distributions Simple cases: e.g., single speaker, single Gaussian distribution: N(x; μ, σ) = 1 / ((2π)^{N/2} |σ|^{1/2}) exp(-(1/2) (x - μ)^T σ^{-1} (x - μ)), with covariance matrix σ typically diagonal. General case: mixtures of Gaussians: Σ_{k=1}^{M} λ_k N(x; μ_k, σ_k), with λ_i ≥ 0 and Σ_{i=1}^{M} λ_i = 1. Typically, M = 16. 7
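A minimal sketch of these densities for the diagonal-covariance case (the helper names and the toy dimension N = 3 are illustrative assumptions, not from the lecture):

```python
import numpy as np

# Diagonal-covariance Gaussian and GMM densities, for illustration only.
def gaussian_density(x, mu, var):
    """N(x; mu, sigma) with a diagonal covariance given by the vector `var`."""
    d = x.shape[0]
    norm = (2.0 * np.pi) ** (d / 2.0) * np.sqrt(np.prod(var))
    return np.exp(-0.5 * np.sum((x - mu) ** 2 / var)) / norm

def gmm_density(x, weights, means, variances):
    """Mixture density sum_k lambda_k N(x; mu_k, sigma_k); weights sum to 1."""
    return sum(w * gaussian_density(x, m, v)
               for w, m, v in zip(weights, means, variances))

# Toy usage with N = 3 instead of the typical N = 39.
x = np.array([0.1, -0.2, 0.3])
weights = [0.5, 0.5]
means = [np.zeros(3), np.ones(3)]
variances = [np.ones(3), 2.0 * np.ones(3)]
print(gmm_density(x, weights, means, variances))
```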

GMMs - Illustration 8

Silence Model Motivation: accounting for pause between words and sentences. Model: optional pause symbol between words and at the beginning and end of utterances in language model. specific silence acoustic model, which can be context-dependent or not. 9

Parameter Reduction Problem: too many parameters (> 200M). A large number of GMMs provides better modeling flexibility, but requires much more training data. Solution: tying mixtures, i.e., equality constraints on distributions within the same HMM or on distributions in different HMMs (similar CD phone transitions). Semi-continuous: the same distributions shared across different mixtures. 10

Composite HMM model Composite model: obtained by taking the union and closure of all CD phone models. [Illustration: composite HMM with states 0-3 and transitions labeled d0, d1, d2; a tied variant shares distributions across models.] Tying can reduce the size. 11

This Lecture Acoustic models Training algorithms 12

Parameter Estimation Data: sample of m sequences of the form: feature vectors o_1, o_2, ..., o_T ∈ R^N; CD phones p_1, p_2, ..., p_l, e.g., ae_{c,t}. Parameters: mean and variance of Gaussians μ_j, σ_j; mixture coefficients λ_k. Problems: segmentation; model initialization. 13

Estimation Algorithm Baum-Welch algorithm: maximum likelihood principle. generalizes to continuous case with Gaussians and GMMs. Questions: segmentation, model initialization. 14

Univariate Gaussian - ML solution Problem: find the most likely Gaussian distribution, given a sequence of real-valued observations 3.18, 2.35, .95, 1.175, ... Normal distribution: p(x) = 1/√(2πσ²) exp(-(x - μ)²/(2σ²)). Log likelihood: l(p) = -(m/2) log(2πσ²) - Σ_{i=1}^{m} (x_i - μ)²/(2σ²). Solution: l(p) is differentiable and concave; ∂l(p)/∂μ = 0 gives μ = (1/m) Σ_{i=1}^{m} x_i; ∂l(p)/∂σ² = 0 gives σ² = (1/m) Σ_{i=1}^{m} x_i² - μ². 15
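A quick numerical check of these closed-form estimates, using the example observations from the slide:

```python
import numpy as np

# ML estimates for a univariate Gaussian, following the closed-form solution above.
x = np.array([3.18, 2.35, 0.95, 1.175])     # sample values from the slide
mu = x.mean()                               # mu = (1/m) sum_i x_i
sigma2 = (x ** 2).mean() - mu ** 2          # sigma^2 = (1/m) sum_i x_i^2 - mu^2
print(mu, sigma2)
```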

Gradients General identities: log of determinant: ∇_X log det(X) = (X^{-1})^T = X^{-1} for symmetric X. Bilinear form: ∇_X (x^T X x) = x x^T. 16

Multivariate Gaussian - ML solution For each x_i, Pr[x_i] = 1 / ((2π)^{N/2} |σ|^{1/2}) exp(-(1/2) (x_i - μ)^T σ^{-1} (x_i - μ)). Log likelihood over the sample x_1, ..., x_m: L = Σ_{i=1}^{m} log Pr[x_i] = Σ_{i=1}^{m} [ -(N/2) log(2π) + (1/2) log |σ^{-1}| - (1/2) (x_i - μ)^T σ^{-1} (x_i - μ) ]. ML solution: ∂L/∂μ = Σ_{i=1}^{m} σ^{-1}(x_i - μ) = 0 gives μ = (1/m) Σ_{i=1}^{m} x_i; ∂L/∂σ^{-1} = Σ_{i=1}^{m} (1/2) (σ - (x_i - μ)(x_i - μ)^T) = 0 gives σ = (1/m) Σ_{i=1}^{m} (x_i - μ)(x_i - μ)^T. 17
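A minimal NumPy sketch of these estimates (the function name and the random test data are illustrative only):

```python
import numpy as np

# ML estimates for a multivariate Gaussian: sample mean and the biased (1/m) covariance.
def gaussian_ml(X):
    """X has shape (m, N): m feature vectors of dimension N."""
    mu = X.mean(axis=0)
    centered = X - mu
    sigma = centered.T @ centered / X.shape[0]   # (1/m) sum_i (x_i - mu)(x_i - mu)^T
    return mu, sigma

X = np.random.randn(100, 5)   # placeholder data
mu, sigma = gaussian_ml(X)
```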

GMMs - EM Algorithm Mixture of Gaussians: p_θ[x] = Σ_{k=1}^{M} λ_k N(x; μ_k, σ_k); log likelihood: L = Σ_{i=1}^{m} log Σ_{k=1}^{M} λ_k N(x_i; μ_k, σ_k). EM algorithm: let p^t_{i,k} = N(x_i; μ^t_k, σ^t_k). E-step: q^{t+1}_{i,k} = p_{θ^t}[z = k | x_i] = λ^t_k p^t_{i,k} / Σ_{k=1}^{M} λ^t_k p^t_{i,k}. M-step: μ^{t+1}_k = Σ_{i=1}^{m} q^{t+1}_{i,k} x_i / Σ_{i=1}^{m} q^{t+1}_{i,k}; σ^{t+1}_k = Σ_{i=1}^{m} q^{t+1}_{i,k} (x_i - μ^{t+1}_k)(x_i - μ^{t+1}_k)^T / Σ_{i=1}^{m} q^{t+1}_{i,k}; λ^{t+1}_k = (1/m) Σ_{i=1}^{m} q^{t+1}_{i,k}. 18
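The sketch below implements these E- and M-step updates for full-covariance Gaussians; it is an illustrative implementation rather than the lecture's, and its initialization and fixed iteration count are placeholder choices:

```python
import numpy as np

def gauss_pdf(X, mu, sigma):
    """Full-covariance Gaussian density evaluated at every row of X (shape (m, N))."""
    N = X.shape[1]
    diff = X - mu
    inv = np.linalg.inv(sigma)
    quad = np.einsum('ij,jk,ik->i', diff, inv, diff)
    norm = np.sqrt(((2 * np.pi) ** N) * np.linalg.det(sigma))
    return np.exp(-0.5 * quad) / norm

def em_gmm(X, M, iters=20):
    """EM for an M-component GMM on data X of shape (m, N)."""
    m, N = X.shape
    lam = np.full(M, 1.0 / M)                        # mixture weights lambda_k
    mus = X[np.random.choice(m, M, replace=False)]   # crude initialization
    sigmas = np.array([np.cov(X.T) for _ in range(M)])
    for _ in range(iters):
        # E-step: posteriors q_{i,k} = lambda_k p_{i,k} / sum_k lambda_k p_{i,k}.
        p = np.stack([lam[k] * gauss_pdf(X, mus[k], sigmas[k]) for k in range(M)], axis=1)
        q = p / p.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and covariances.
        nk = q.sum(axis=0)
        lam = nk / m
        mus = (q.T @ X) / nk[:, None]
        for k in range(M):
            diff = X - mus[k]
            sigmas[k] = (q[:, k, None] * diff).T @ diff / nk[k]
    return lam, mus, sigmas
```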

Initialization Selection of model: HMM topology (states and transitions); number of mixtures (M). Flat start: all distributions with the same average values (mean and variance) computed over the entire training set. Existing hand-labeled segmentation: used as the basis for a partial segmentation. Uniform segmentation: equal number of HMM transitions per training example. 19
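As a rough illustration of a flat start (the helper name and array shapes are assumptions, not from the lecture), every state's Gaussian can be given the global mean and variance of the training frames:

```python
import numpy as np

# Flat start: every state gets the same mean and variance, computed over all frames.
def flat_start(features, num_states):
    """features: array of shape (T_total, N) stacking all training frames."""
    global_mean = features.mean(axis=0)
    global_var = features.var(axis=0)
    return [(global_mean.copy(), global_var.copy()) for _ in range(num_states)]

frames = np.random.randn(1000, 39)            # placeholder 39-dim feature frames
init_params = flat_start(frames, num_states=3)
```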

Forced Alignment Viterbi training: approximate but faster method to determine the HMM path. Segmental K-means: approximate but faster method to determine which Gaussian of a mixture each training instance was sampled from. 20

Viterbi Alignment Idea: faster alignment based on most likely path. Use the current acoustic model; align the sequence of feature vectors with the most likely path (best-path algorithm, e.g., Viterbi). [Illustration: observations o_1, ..., o_10 aligned to a transition sequence e_11 e_11 e_11 e_12 e_22 e_22 e_23 e_33 e_34 e_44.] 21
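A minimal sketch of such a best-path alignment over log scores (the function name and the start-in-state-0 convention are assumptions made for illustration):

```python
import numpy as np

# Viterbi best-path alignment of T frames to S HMM states, given log transition
# scores and per-frame log emission scores.
def viterbi_align(log_trans, log_emit):
    """log_trans: (S, S); log_emit: (T, S). Returns the most likely state path."""
    T, S = log_emit.shape
    score = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    score[0] = log_emit[0]
    score[0, 1:] = -np.inf                        # assume the path starts in state 0
    for t in range(1, T):
        cand = score[t - 1][:, None] + log_trans  # (S, S): previous state -> current state
        back[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0) + log_emit[t]
    # Backtrack from the best final state to recover the alignment.
    path = [int(score[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```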

Segmental K-Means Idea: use clustering algorithm to determine which Gaussian generated each observation. Solution: use K-means clustering algorithm to initialize distribution means. Initialization: select K centroids c_1, ..., c_K. Repeat until no centroid changes: for each point x_i, find the closest centroid c_j and assign x_i to cluster j; for each cluster j, redefine c_j as the center of mass of its points. 22
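A minimal NumPy sketch of this K-means loop (initialization by random sampling and the empty-cluster handling are placeholder choices):

```python
import numpy as np

# Plain K-means, as used here to initialize the Gaussian means.
def kmeans(X, K, iters=100):
    centroids = X[np.random.choice(len(X), K, replace=False)]
    for _ in range(iters):
        # Assign each point to the closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Redefine each centroid as the center of mass of its cluster.
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(K)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```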

Notes GMM EM algorithm: soft version of K-means. 23

Variable Number of Mixtures Problems: the number of mixture components required is not known in advance; possible over- or underfitting. Solution: start with a single Gaussian distribution; create a two-component mixture with slightly perturbed means μ ± ε and the same covariance matrix; model parameters are re-estimated, and the process is repeated until the desired complexity is reached. 24
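A rough sketch of one such splitting step (the helper name and the choice of perturbation scale are assumptions; in practice each split is followed by EM re-estimation before the next one):

```python
import numpy as np

# Grow a GMM by splitting each component: perturb the mean by +/- eps and keep
# the covariance, halving the mixture weight between the two children.
def split_components(weights, means, covs, eps=0.2):
    new_w, new_mu, new_cov = [], [], []
    for w, mu, cov in zip(weights, means, covs):
        perturb = eps * np.sqrt(np.diag(cov))     # perturbation scale per dimension
        for sign in (+1.0, -1.0):
            new_w.append(w / 2.0)                 # each child gets half the weight
            new_mu.append(mu + sign * perturb)    # mu + eps and mu - eps
            new_cov.append(cov.copy())            # same covariance matrix
    return new_w, new_mu, new_cov
```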

Acoustic Modeling In practice: complicated recipes or heuristics with a large number of ad hoc techniques. Key: skilled human supervision, e.g., choice of initial parameters to avoid local minima, segmentation, choice and number of parameters. Computationally very costly: may take many days on several processors in large-vocabulary speech recognition. 25

Improvements Adaptation (VTLN, MLLR). Ensemble methods (ROVER). Better features (e.g., LDA, MMI). Discriminative training. 26

References
L. E. Baum. An Inequality and Associated Maximization Technique in Statistical Estimation for Probabilistic Functions of Markov Processes. Inequalities, 3:1-8, 1972.
Jeff A. Bilmes. A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models, 1998.
Chung-an Chan. A Tutorial for the Baum-Welch Algorithm, 2013.
Kai-Fu Lee. Context-Dependent Phonetic Hidden Markov Models for Continuous Speech Recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 38(4):599-609, 1990.
Lawrence Rabiner. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, 77(2):257-286, 1989.
Lawrence Rabiner and Biing-Hwang Juang. Fundamentals of Speech Recognition. Prentice Hall, 1993.
Steve Young, J. Odell, and Phil Woodland. Tree-Based State-Tying for High Accuracy Acoustic Modelling. In Proceedings of the ARPA Human Language Technology Workshop, Morgan Kaufmann, San Francisco, 1994. 27