Modeling Phonetic Context with Non-random Forests for Speech Recognition


Hainan Xu
Center for Language and Speech Processing, Johns Hopkins University
September 4, 2015

Outline
In this presentation, we will:
- Give a very brief introduction to how decision trees are used in standard automatic speech recognition frameworks.
- Present our method, which generalizes from a single decision tree to multiple trees (the "non-random forest") and combines them with ensemble methods to improve recognition accuracy.

Speech Recognition 101: HMMs
To this day, the Hidden Markov Model (HMM) is still the most successful model for speech recognition.
The 3-state HMM assumes there are three stages in the acoustic realization of a phone.
Its parameters are transition probabilities p(s'|s) and emission probabilities p(o|s).

HMMs for Speech Recognition
We have a 3-state HMM for each phone. Phone models are concatenated into word models, and word models are connected according to the structure of a language model to form a decoding graph.
In decoding, given the acoustic observations, we search for the most likely path through the graph.
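As a concrete illustration of the "most likely path" search (not the actual decoder, which operates on a large WFST), here is a minimal Viterbi sketch over a toy 3-state left-to-right HMM; all numbers are made up for illustration:

```python
import math

# Toy 3-state left-to-right HMM for one phone: transition log-probs
# log p(s'|s) and per-frame emission log-likelihoods log p(o_t|s).
log_trans = {
    (0, 0): math.log(0.6), (0, 1): math.log(0.4),
    (1, 1): math.log(0.5), (1, 2): math.log(0.5),
    (2, 2): math.log(0.7),
}
log_emit = [  # log_emit[t][s] = log p(o_t | s) for 4 frames
    [-1.0, -3.0, -4.0],
    [-1.2, -1.1, -3.5],
    [-2.5, -0.9, -1.8],
    [-3.0, -2.0, -0.7],
]

def viterbi(log_trans, log_emit, n_states=3):
    """Most likely state sequence through a left-to-right HMM."""
    T = len(log_emit)
    NEG_INF = float("-inf")
    score = [[NEG_INF] * n_states for _ in range(T)]
    back = [[0] * n_states for _ in range(T)]
    score[0][0] = log_emit[0][0]          # must start in state 0
    for t in range(1, T):
        for s in range(n_states):
            best, arg = NEG_INF, 0
            for p in range(n_states):
                cand = score[t - 1][p] + log_trans.get((p, s), NEG_INF)
                if cand > best:
                    best, arg = cand, p
            score[t][s] = best + log_emit[t][s]
            back[t][s] = arg
    path = [n_states - 1]                 # backtrace from the final emitting state
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

print(viterbi(log_trans, log_emit))  # [0, 0, 1, 2]
```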

Phones as ASR Modeling Units
It seems simple: we have a model for each IPA symbol, such as t, b, h, ay, etc. TA-DA!
Not so easy:
- t in "better" and "return"
- t in "teacher" and "mountain"
- r in "car" and "ray"
Linguists then came up with allophones, but that helped only a little: the same allophone can still look very different in spectrograms.

Context Matters!
The realizations of w vary, but similar patterns occur in the same context.
Instead of modeling phones regardless of their contexts (monophones), we usually consider the phone to the left and the phone to the right. A phone in such a context is called a triphone.

Triphones
trees:  SIL-t+r  t-r+ee  r-ee+z  ee-z+SIL
better: SIL-b+eh  b-eh+t  eh-t+er  t-er+SIL
There are around 50 phones in English, and thus around 50^3 = 125,000 triphones, which means roughly 50^3 x 3 HMM states.
Some triphones are rarely, if at all, seen in the training data, but still require a model, which is a problem.
Solution: use a decision tree to create equivalence classes as units for parameter sharing.

Decision Trees
[Figure: example phonetic decision tree; picture from http://speechlab.sjtu.edu.cn/ kyu/sites/kyu/files/teaching/lecture02.pdf]

Key Factors in Decision Tree Building
- A set of questions.
- An objective function to maximize, F(partial tree), which lets us define the gain of a split or the cost of a merge.
- A stopping criterion.
- An algorithm for growing the tree. Finding the optimal decision tree is NP-hard, so a greedy algorithm is usually used: keep splitting the tree with the best question until the stopping criterion is met.

Phonetic Decision Trees
Questions are of the form "is the previous phone in the set {m, n, ng}?"
Usually we want the question set to be complete, meaning there is a singleton set for each phone that can be used in a question.
The objective function is usually the Gaussian likelihood of all the data, assuming the data mapped to each leaf is generated from a single Gaussian distribution (see the sketch below).
The stopping criterion is a predefined number of leaves in the tree.
In speech recognition, the most widely used tree-building algorithm is from Steve Young's paper "Tree-based state tying for high accuracy acoustic modelling".
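To make the objective concrete, here is a minimal sketch of the per-leaf Gaussian log-likelihood and the gain of a candidate split. The helper names are hypothetical, raw frames stand in for the accumulated sufficient statistics a real system would use, and `answers_yes` is assumed to be a boolean NumPy mask:

```python
import numpy as np

def leaf_loglike(frames):
    """Log-likelihood of the frames under a single diagonal-covariance
    Gaussian fitted to those same frames (the usual tree-building objective)."""
    n = len(frames)
    var = frames.var(axis=0) + 1e-4      # variance floor to avoid log(0)
    return -0.5 * n * np.sum(np.log(2 * np.pi * var) + 1.0)

def split_gain(frames, answers_yes):
    """Objective gain from splitting a leaf with a yes/no question;
    `answers_yes` is a boolean mask over the frames currently at the leaf."""
    if answers_yes.all() or not answers_yes.any():
        return float("-inf")             # a split must produce two non-empty leaves
    yes, no = frames[answers_yes], frames[~answers_yes]
    return leaf_loglike(yes) + leaf_loglike(no) - leaf_loglike(frames)
```

The best question for a leaf is then simply the one with the largest split_gain.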

Building a Single Phonetic Decision Tree - Algorithm
Single-tree building algorithm (Steve Young et al.):
 1: initialize a monophone state tree
 2: n = number of leaves we want
 3: while we have < n leaves do
 4:     s = the best split on the current tree
 5:     apply split s
 6: end while
 7: while true do
 8:     c = the smallest cost of a merge between any leaves in the current tree
 9:     if c < merge threshold then
10:         apply the merge corresponding to c
11:     else
12:         terminate and return the tree
13:     end if
14: end while
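A rough sketch of the grow phase of the algorithm above, reusing the split_gain() helper from the previous sketch; the interfaces (frames, contexts, question predicates) are hypothetical, and this is not the real Kaldi tree-building code:

```python
import numpy as np

def grow_greedily(frames, contexts, questions, target_leaves):
    """Greedy grow phase over a flat list of leaves.  `frames` is an
    (N, dim) array, `contexts[i]` is the phonetic context of frame i,
    and each question is a yes/no predicate on a context."""
    leaves = [np.arange(len(frames))]            # one leaf owning all the data
    while len(leaves) < target_leaves:
        best = None                              # (gain, leaf index, mask)
        for li, idx in enumerate(leaves):
            for q in questions:
                mask = np.array([q(contexts[i]) for i in idx])
                gain = split_gain(frames[idx], mask)
                if best is None or gain > best[0]:
                    best = (gain, li, mask)
        if best is None or best[0] == float("-inf"):
            break                                # nothing left worth splitting
        _, li, mask = best
        idx = leaves.pop(li)
        leaves += [idx[mask], idx[~mask]]        # replace the leaf by its two children
    return leaves
```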

Models on Top of the Tree
The tree provides a mapping from a triphone state to one of its leaves. Based on this mapping, we train an acoustic model that computes
    log p(observation | triphone state) = log p(observation | leaf in the tree).
Since the number of leaves is far smaller than the number of all possible triphone states, acoustic model training becomes feasible.
For decoding, we build a graph that incorporates the tree's leaf information.
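A toy illustration of the resulting parameter sharing, with a hypothetical tree mapping and one-dimensional Gaussian leaf models standing in for a real acoustic model:

```python
import math

# A toy "tree" as an explicit mapping from (left, center, right, hmm_state)
# to a leaf ID, plus per-leaf 1-D Gaussian models (mean, variance).
leaf_of = {("b", "eh", "t", 1): 137, ("p", "eh", "t", 1): 137}   # two states tied to one leaf
leaf_gauss = {137: (0.5, 2.0)}                                   # leaf -> (mean, variance)

def acoustic_log_like(obs, triphone_state):
    """log p(obs | triphone state) = log p(obs | leaf of that state)."""
    mean, var = leaf_gauss[leaf_of[triphone_state]]
    return -0.5 * (math.log(2 * math.pi * var) + (obs - mean) ** 2 / var)

print(acoustic_log_like(0.7, ("b", "eh", "t", 1)))
print(acoustic_log_like(0.7, ("p", "eh", "t", 1)))   # same score: shared parameters
```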

Extending to Multiple Trees
A single decision tree might be biased:
- It is not optimal, since we use a greedy algorithm.
- The objective function for growing the tree might not be the ground-truth goodness function (remember we assumed Gaussians).
- ...
We want to build different trees and (hopefully) compensate for the bias by combining them.
Questions and stopping criterion: no change required.
Objective function and algorithm: these are deterministic, so they would build exactly the same trees.
Our solution: include an entropy term in the objective function for tree building.

Entropy of a Decision Tree
A decision tree defines a distribution / partition on the data.
In the example above (three leaves holding 30%, 30% and 40% of the data), the entropy is
    -0.3 log 0.3 - 0.3 log 0.3 - 0.4 log 0.4
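For concreteness, a tiny sketch of this partition entropy, using the leaf counts from the example:

```python
import math

def tree_entropy(leaf_counts):
    """Entropy of the partition a tree induces on the training frames:
    -sum_j p_j log p_j, where p_j is the fraction of frames at leaf j."""
    total = sum(leaf_counts)
    return -sum(c / total * math.log(c / total) for c in leaf_counts if c > 0)

# The three-leaf example above (30% / 30% / 40% of the data):
print(tree_entropy([30, 30, 40]))   # = -(0.3 log 0.3 + 0.3 log 0.3 + 0.4 log 0.4) ~= 1.089
```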

(Joint) Entropy of Multiple Decision Trees
Multiple (n) decision trees split the data into an n-dimensional grid. Note that not all leaf combinations are possible, and not all possible combinations occur in the data.
In the example above (five occupied cells, each holding 20% of the data), the joint entropy is
    -0.2 log 0.2 - 0.2 log 0.2 - 0.2 log 0.2 - 0.2 log 0.2 - 0.2 log 0.2 = log 5
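The joint entropy is computed the same way, but over tuples of leaf IDs (one per tree); the cell labels below are hypothetical, chosen to reproduce the five-cell example:

```python
from collections import Counter
import math

def joint_entropy(leaf_tuples):
    """Joint entropy of several trees: each frame maps to a tuple of
    leaf IDs (one per tree); the entropy is over those tuples."""
    counts = Counter(leaf_tuples)
    total = sum(counts.values())
    return -sum(c / total * math.log(c / total) for c in counts.values())

# Five occupied cells, each with 20% of the data, as in the example above:
cells = [("a1", "b1")] * 20 + [("a1", "b2")] * 20 + [("a2", "b1")] * 20 \
      + [("a2", "b3")] * 20 + [("a3", "b2")] * 20
print(joint_entropy(cells))   # = -5 * 0.2 log 0.2 = log 5 ~= 1.609
```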

Multiple Trees - the New Objective Function
We have an objective function defined on a single tree, G(tree), and we have introduced the entropy of tree(s), denoted H(tree(s)).
For multiple trees, we define the new objective function as
    λ ( H(all trees) - (1/n) Σ_i H(i-th tree) ) + Σ_i G(i-th tree)
The first two terms push the joint entropy to grow larger while keeping the single-tree entropies small, ensuring that the trees are different.
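A minimal sketch of evaluating this objective, assuming frame-level leaf assignments and per-tree likelihood objectives are already available (hypothetical inputs; the real implementation works on accumulated statistics during tree building):

```python
import math
from collections import Counter

def partition_entropy(labels):
    """Entropy of a labelling of the frames (a leaf ID, or tuple of IDs, per frame)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum(c / total * math.log(c / total) for c in counts.values())

def multi_tree_objective(leaf_assignments, tree_logliks, lam):
    """Modified objective:  lam * ( H(joint) - mean_i H(tree_i) ) + sum_i G(tree_i).
    `leaf_assignments[i][t]` is the leaf that tree i assigns to frame t;
    `tree_logliks[i]` is G(tree_i), the usual likelihood objective."""
    n = len(leaf_assignments)
    mean_single = sum(partition_entropy(a) for a in leaf_assignments) / n
    joint = partition_entropy(list(zip(*leaf_assignments)))
    return lam * (joint - mean_single) + sum(tree_logliks)
```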

Building Multiple Trees - Algorithm
Multi-tree building algorithm:
 1: initialize a set of monophone state trees
 2: n = number of leaves per tree that we want
 3: while not all trees have n leaves do
 4:     s = the best split on trees having < n leaves
 5:     apply split s
 6: end while
 7: while true do
 8:     c = the smallest cost of a merge between any leaves in the same tree
 9:     if c < merge threshold then
10:         apply the merge corresponding to c
11:     else
12:         terminate and return the trees
13:     end if
14: end while

Multi-tree Model Training and Decoding
An acoustic model is trained independently on top of each tree. The independent training ensures that the multi-tree method works regardless of the type of acoustic model used (GMM, DNN or LSTM).
There are different ways to combine the independently trained acoustic models:
- Merge decoded lattices (Minimum Bayes Risk decoding).
- Our method: merge acoustic likelihoods.

Method to Combine Models
Our method combines p(observation | triphone state). For the same p(o|s), each model gives a different likelihood score; call the log-likelihoods {l_i}.
We use a weighted average that favors the larger likelihood scores (larger probabilities):
    l = Σ_i l_i exp(C l_i) / Σ_i exp(C l_i),   where C = acoustic scale = 0.1
The HMM transition probabilities matter much less, and we simply use the arithmetic mean of the transition probabilities from the individual models.
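A small, self-contained sketch of this combination rule; the max-shift is added only for numerical stability and does not change the weights:

```python
import math

def combine_logliks(logliks, acoustic_scale=0.1):
    """Weighted average of per-tree log-likelihoods that favors the
    larger scores:  l = sum_i l_i * exp(C*l_i) / sum_i exp(C*l_i)."""
    m = max(logliks)                                   # shift for numerical stability
    weights = [math.exp(acoustic_scale * (l - m)) for l in logliks]
    z = sum(weights)
    return sum(l * w for l, w in zip(logliks, weights)) / z

print(combine_logliks([-95.0, -100.0]))   # ~= -96.9, closer to -95 than the plain mean of -97.5
```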

Model Combination Implementation
A virtual tree is built such that each leaf in the virtual tree corresponds to a unique and valid combination of leaves from the individual trees. The two-tree example above, for instance, would have 5 leaves in its virtual tree.
The virtual tree's leaves are the de facto parameter-sharing units, and the virtual tree is used to build the decoding graphs.
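A sketch of how the virtual leaves could be enumerated from per-frame leaf assignments (a hypothetical frame-level view; the real implementation enumerates valid combinations from the tree definitions themselves):

```python
def build_virtual_leaves(leaf_assignments):
    """Each distinct tuple of per-tree leaves that actually occurs becomes
    one 'virtual' leaf -- the de facto parameter-sharing unit used when
    building the decoding graph.  `leaf_assignments[i][t]` is the leaf
    that tree i assigns to frame t."""
    virtual_id = {}
    for combo in zip(*leaf_assignments):       # one tuple of leaf IDs per frame
        if combo not in virtual_id:
            virtual_id[combo] = len(virtual_id)
    return virtual_id

# Two trees over six frames; only five leaf combinations occur,
# so the virtual tree has five leaves (cf. the example above).
print(build_virtual_leaves([[0, 0, 1, 1, 2, 2],
                            [5, 5, 5, 6, 6, 7]]))
```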

Impact of the Added Entropy Term
This table shows that the added entropy term creates much finer parameter-sharing units, i.e. a much larger number of virtual leaves. The stopping criterion is when each tree reaches 5000 leaves; the average number of leaves is smaller because of merging.

  # trees   λ      avg # leaves   # virtual leaves
  1         -      3973           3973
  2         0.1    4030           8173
  2         0.25   4115           12969
  2         0.5    4204           21138
  2         1      4237.5         36828
  3         1      4123           97999
  4         1      4078.5         164811

Table: Number of leaves in multi-trees (TED-LIUM)

Impact of the Added Entropy Term, cont'd
This table shows that the added entropy term does increase the joint entropy while not increasing the single-tree entropies too much.

  # trees   λ      avg entropy   joint entropy
  1         -      7.63          7.63
  2         0.1    7.67          7.85
  2         0.25   7.72          8.11
  2         0.5    7.76          8.41
  2         1      7.78          8.78
  3         1      7.74          9.00
  4         1      7.72          9.07

Table: Entropy of multi-trees (TED-LIUM)

Performance of Multi-tree Decoding
Here we compare the recognition results of decoding with each individual model, with MBR decoding, and with our joint-decoding method.

  System     dev clean   dev other   test clean   test other
  baseline   5.93        20.42       6.59         22.47
  tree 1     6.20        20.67       6.75         22.68
  tree 2     6.27        21.07       6.87         22.84
  MBR        6.00        20.87       6.59         22.84
  joint      5.82        19.86       6.46         21.62

Table: WER of individual and combined DNN models on Librispeech (λ = 1)

More Results

  # trees   WSJ eval92   WSJ dev93   SWBD swbd   SWBD eval2000   TED-LIUM dev   TED-LIUM test
  1         7.07         4.06        13.4        19.2            21.7           19.4
  2         6.55         4.08        13.0        18.8            21.2           18.6
  3         6.46         3.72        12.8        18.7            21.2           18.5

Table: WER of DNN models on WSJ, SWBD and TED-LIUM (λ = 1)

  # trees   dev clean   dev other   test clean   test other
  1         5.93        20.42       6.59         22.47
  2         5.82        19.86       6.46         21.62
  3         5.80        19.77       6.27         21.68

Table: WER of DNN models on Librispeech (λ = 1)

Conclusion
Multi-tree systems consistently give better results than single-tree systems.
They are especially helpful for speech recordings with noisy backgrounds.
The more trees we use, the more it helps, though the gains become much smaller as the number of trees grows.