Learning The Lexicon!


Learning The Lexicon! A Pronunciation Mixture Model
Ian McGraw (imcgraw@mit.edu), Ibrahim Badr, Jim Glass
Computer Science and Artificial Intelligence Lab, Massachusetts Institute of Technology, Cambridge, MA, USA

Automatic Speech Recognition: A Perspective on Resources
~98% of the world's languages have not been addressed by resource- and expert-intensive supervised speech recognition training methods.
[Figure: a spectrum of technical difficulty (more learning) against the resources required, from phonetic units, lexicon, parallel speech/text, and human expertise, down to parallel speech/text, then speech and text, then speech alone.]

Automatic Speech Recognition
The "fundamental equation" of ASR, with the Viterbi approximation over pronunciations B:

W* = argmax_W P(W | A) ≈ argmax_W max_B P(A | B) P(B | W) P(W)

Here A is the acoustic observation, B the pronunciation (baseform) sequence, and W = w_1 ... w_n the word sequence. P(A | B) is the acoustic model, P(B | W) the lexicon, and P(W) the language model; the front end extracts features from the speech, and the decoder searches for W*.

Stochastic Lexicon
P(B | W) = ∏_j P(b_j | w_j), e.g. P(hh ax w ay iy | hawaii)

Stochastic Lexicon: A Single Entry
P(b | w) = 1{b ∈ B_w} θ_{b|w}, where the indicator restricts b to the candidate set B_w for word w, and the weights θ_{b|w} are the learned parameters.

Pronunciation-Related ASR Research
* Letter-to-sound
* Out-of-vocabulary words
* Pronunciation modeling
* Pronunciation Mixture Model

Modeling Pronunciation Variation

Model | Domain | Impact on WER%
Rule learning from manual transcriptions [Riley et al. 1999] | Broadcast news | 12.7 → 10.0
(same) | Conversational | 44.7 → 43.8
Decision trees + dynamic lexicon [Fosler-Lussier 1999] | Broadcast news | 21.4 → 20.4
Knowledge-based rules + FST weight learning [Hazen et al. 2002] | Weather queries | 12.1 → 11.0

Phonological Rules
A set of generic rewrite rules converts a baseform from phonemes to its possible variants at the phone level, expanding the lexicon.
Example: "nearest tornado"
Phoneme: n ih r ax s td # t ao r n ey df ow
Phone:   n ih r ax s # tcl t er n ey dx ow
         n ih r ax s tcl t # tcl t ao r n ey dcl d ow
         ...
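The expansion idea can be sketched as a Cartesian product over per-phoneme rewrite options. The rules below are illustrative assumptions, not the actual SLS rule set, which is context-dependent and far richer:

```python
from itertools import product

# Toy rewrite rules: each phoneme maps to its possible phone realizations.
# These entries are made up for illustration; real rules are context-dependent.
RULES = {
    "t": ["tcl t", "dx"],   # released /t/ with closure, or a flap
    "d": ["dcl d", "dx"],
}

def expand(phonemes):
    """Expand a phoneme baseform into all phone-level variants."""
    options = [RULES.get(p, [p]) for p in phonemes]
    return [" ".join(parts) for parts in product(*options)]

variants = expand(["n", "ih", "r", "ax", "s", "t"])
# Two variants: "n ih r ax s tcl t" and "n ih r ax s dx"
```

Each rule that fires multiplies the number of variants, which is why the expanded lexicon grows from 1.2 to roughly 4 pronunciations per word.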

Pronunciation Mixture Model
Spoken examples A_1 ... A_M of the word w_i = "Gyllenhaal", and candidate pronunciations {b} with weights:
θ_1 = 0.76  jh ih l eh n hh aa l
θ_2 = 0.23  jh ih l ax n hh aa l
θ_M = 0.05  jh ih l eh n hh aa ao l
Recognition uses the weights θ_{b|w}; θ is then used in a subsequent iteration, or directly in a stochastic lexicon.

Pronunciation Mixture Model
E-step: M_θ[w, p] is the expected number of times that pronunciation p is used for word w.
M-step: θ_{p|w} is M_θ[w, p] normalized across all pronunciations for the given word w.
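These two steps can be sketched minimally, assuming per-utterance acoustic likelihoods P(A | b) are already available from a recognizer. The word, pronunciations, and scores below are illustrative stand-ins:

```python
from collections import defaultdict

def pmm_em_step(examples, theta):
    """One EM iteration of a Pronunciation Mixture Model (toy sketch).

    examples: list of (word, {pron: acoustic_likelihood}) pairs, one per
      spoken example; the likelihoods stand in for P(A | b).
    theta: current lexicon weights {word: {pron: theta_b_given_w}}.
    """
    counts = defaultdict(lambda: defaultdict(float))
    # E-step: the posterior over candidate pronunciations contributes
    # the expected count M_theta[w, p] for each (word, pronunciation).
    for word, acoustic in examples:
        post = {b: lik * theta[word][b] for b, lik in acoustic.items()}
        z = sum(post.values())
        for b, p in post.items():
            counts[word][b] += p / z
    # M-step: normalize M_theta[w, p] across the pronunciations of w.
    return {word: {b: c / sum(cs.values()) for b, c in cs.items()}
            for word, cs in counts.items()}

# One spoken example of "Gyllenhaal" with two candidate pronunciations.
theta = {"gyllenhaal": {"jh ih l eh n hh aa l": 0.5,
                        "jh ih l ax n hh aa l": 0.5}}
data = [("gyllenhaal", {"jh ih l eh n hh aa l": 0.9,
                        "jh ih l ax n hh aa l": 0.1})]
theta = pmm_em_step(data, theta)
```

Iterating this update raises the likelihood of the spoken examples under the stochastic lexicon; the resulting θ can seed the next iteration or be used directly at decode time.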

Continuous Speech PMM Example

# what # do # you # know #
0.5584 # w ah tcl # dcl d uw # y uw # n ow #
0.2250 # w ah tcl # dcl d # y uw # n ow #
0.0645 # w ah tcl # dcl d uw # y ax # n ow #
0.0434 # w ah tcl t # dcl d uw # y uw # n ow #
0.0376 # w ah tcl # dcl d ow # y uw # n ow #
0.0244 # w ah tcl # dcl d ax # y uw # n ow #
0.0219 # w ah tcl # dcl d iy # y uw # n ow #
0.0097 # w ah tcl # dcl d # y ax # n ow #
0.0083 # w ah tcl t # dcl d uw # y ax # n ow #
0.0063 # w ah tcl # dcl jh # y uw # n ow #

# thank # you #
0.4954 # th ae ng kcl k # y ax #   → θ_{y ax | you}
0.4891 # th ae ng kcl k # y uw #   → θ_{y uw | you}
0.0068 # th ae ng kcl k # y ah #   → θ_{y ah | you}
0.0035 # th ae ng kcl k # y axr #
0.0010 # th ae ng kcl # y ax #
0.0010 # th ae ng kcl # y uw #
0.0010 # th ae ng kcl k # y ow #
0.0007 # th ae ng kcl k # y el #
0.0004 # th ae ng kcl k # y aa uw #
0.0004 # th ae ng kcl k # y aw #
Normalize the per-pronunciation mass to obtain θ.
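The "normalize" step can be made concrete: sum the path mass for each hypothesized pronunciation of "you" and renormalize. A sketch using the "# thank # you #" path scores from this slide:

```python
from collections import defaultdict

# Path probabilities for "# thank # you #", keyed by the phone sequence
# hypothesized for "you" (two paths each reuse "y ax" and "y uw" with a
# different realization of "thank").
paths = [
    (0.4954, "y ax"), (0.4891, "y uw"), (0.0068, "y ah"),
    (0.0035, "y axr"), (0.0010, "y ax"), (0.0010, "y uw"),
    (0.0010, "y ow"), (0.0007, "y el"), (0.0004, "y aa uw"),
    (0.0004, "y aw"),
]

mass = defaultdict(float)
for p, pron in paths:          # sum path mass per candidate pronunciation
    mass[pron] += p
total = sum(mass.values())
theta_you = {pron: m / total for pron, m in mass.items()}  # M-step normalize
```

The result gives θ_{y ax | you} ≈ 0.497, narrowly ahead of θ_{y uw | you}, matching the slide's ordering.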

Pronunciation-Related ASR Research
* Letter-to-sound
* Out-of-vocabulary words
* Pronunciation modeling
* Pronunciation Mixture Model

Initializing a PMM: Choosing the Support
Expert pronunciations: the SLS dictionary is based on PronLex.
* Contains around 150,000 words
* Has an average of 1.2 pronunciations per word
* Supplementary phonological rules expand pronunciations
* The expanded lexicon has an average of 4.0 pronunciations per word
Letter-to-sound (L2S) system: joint-sequence models [Bisani and Ney, 2008].
* Graphonemes: trained on the expert lexicon
* Graphones: trained on the expanded expert lexicon

Joint-Sequence Models for L2S

b* = argmax_b P(w, b),  P(w, b) = Σ_{g ∈ S(w, b)} P(g) ≈ max_{g ∈ S(w, b)} P(g)

where S(w, b) is the set of graphone segmentations of the pair (w, b).
Example: w = c o u p l e,  b = k ah p ax l
g_1 = c/k  o/ah  u/-  p/p  -/ax  l/l  e/-
g_2 = c/k  o/-  u/ah  p/p  -/ax  l/l  e/-
EM is used to infer alignments and build an M-gram over multigrams.
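The segmentation set S(w, b) can be made concrete with a brute-force enumeration of singular graphones. This is a sketch for illustration only; joint-sequence training scores these alignments with EM over a lattice rather than enumerating them:

```python
def alignments(graphemes, phones):
    """Enumerate singular-graphone segmentations S(w, b): each unit pairs
    at most one grapheme with at most one phone (never both empty).
    Exponential enumeration, for illustration only."""
    if not graphemes and not phones:
        return [[]]
    segs = []
    if graphemes:                      # grapheme / epsilon, e.g. u/-
        segs += [[(graphemes[0], "-")] + rest
                 for rest in alignments(graphemes[1:], phones)]
    if phones:                         # epsilon / phone, e.g. -/ax
        segs += [[("-", phones[0])] + rest
                 for rest in alignments(graphemes, phones[1:])]
    if graphemes and phones:           # grapheme / phone, e.g. c/k
        segs += [[(graphemes[0], phones[0])] + rest
                 for rest in alignments(graphemes[1:], phones[1:])]
    return segs

S = alignments(list("couple"), ["k", "ah", "p", "ax", "l"])
# g_1 from the slide is one member of S(couple, k ah p ax l):
g1 = [("c", "k"), ("o", "ah"), ("u", "-"), ("p", "p"),
      ("-", "ax"), ("l", "l"), ("e", "-")]
```

Summing P(g) over S(w, b), or taking the max, then recovers the two forms of P(w, b) above once a graphone M-gram assigns each g a probability.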

Non-Singular Graphones
Parameters: L = the number of graphemes, R = the number of phonetic units, M = language model context size.
Example: w = c o u p l e,  b = kcl_k ah pcl_p ax l
g_1 = c/kcl_k  o/ah  u/-  p/pcl_p  -/ax  l/l  e/-
g_2 = c/kcl_k  o/-  u/ah  p/pcl_p  -/ax  l/l  e/-

Weighted Finite State Transducers
G: language model, L: stochastic lexicon, P: phonological rules, C: context-dependent labels; J_w encodes P(w, b) over candidate pronunciations b of word w.
Standard recognition: R = C ∘ P ∘ L ∘ G
PMM training: for utterance i with word sequence W_i, build
S_{W_i} = J_{w_1^i} # J_{w_2^i} # ... # J_{w_|W_i|^i}
R_i = C ∘ P ∘ S_{W_i}  (phoneme-based graphonemes)
R_i = C ∘ S_{W_i}      (phone-based graphones)
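As a rough, dictionary-based intuition for the cascade (not OpenFst code; C and P are omitted, and all words, pronunciations, and probabilities below are made up), composition in the tropical semiring simply adds negative log weights along each path:

```python
import math

# Toy stochastic lexicon L (word -> {pronunciation: theta_b_given_w})
# and toy unigram "grammar" G (word -> P(w)); all numbers are invented.
L = {"you": {"y uw": 0.50, "y ax": 0.49, "y ah": 0.01},
     "thank": {"th ae ng kcl k": 1.0}}
G = {"you": 0.6, "thank": 0.4}

# "Composing" L with G in the tropical semiring: a path's weight is the
# sum of -log probabilities along it, as in R = C o P o L o G.
R = {(pron, word): -math.log(th) - math.log(G[word])
     for word, prons in L.items() for pron, th in prons.items()}

best_pron, best_word = min(R, key=R.get)   # lowest-cost (most likely) path
```

A real decoder builds these machines with an FST toolkit and composes them lazily, but the weight arithmetic per path is exactly this sum of costs.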

Recognition Output During Training (effectively)

# what # do # you # know #
0.5584 # w ah tcl # dcl d uw # y uw # n ow #
0.2250 # w ah tcl # dcl d # y uw # n ow #
0.0645 # w ah tcl # dcl d uw # y ax # n ow #
0.0434 # w ah tcl t # dcl d uw # y uw # n ow #
0.0376 # w ah tcl # dcl d ow # y uw # n ow #
0.0244 # w ah tcl # dcl d ax # y uw # n ow #
0.0219 # w ah tcl # dcl d iy # y uw # n ow #
0.0097 # w ah tcl # dcl d # y ax # n ow #
0.0083 # w ah tcl t # dcl d uw # y ax # n ow #
0.0063 # w ah tcl # dcl jh # y uw # n ow #

# thank # you #
0.4954 # th ae ng kcl k # y ax #
0.4891 # th ae ng kcl k # y uw #
0.0068 # th ae ng kcl k # y axr #
0.0035 # th ae ng kcl k # y el #
0.0010 # th ae ng kcl # y ax #
0.0010 # th ae ng kcl # y uw #
0.0010 # th ae ng kcl k # y ow #
0.0007 # th ae ng kcl k # y ah #
0.0004 # th ae ng kcl k # y aa uw #
0.0004 # th ae ng kcl k # y aw #

SUMMIT Recognizer Setup
Landmark-based speech recognizer:
* MFCC averages are taken over varying durations around hypothesized boundaries
* 112-dimensional feature vectors are whitened with a PCA rotation; 50 principal components are kept
Context-dependent acoustic models:
* Up to 75 diagonal Gaussian mixture components each
* Maximum-likelihood back-off models are trained on a corpus of telephone speech
Lexicon:
* Expert dictionary based on PronLex
* ~150,000 words for training graphones/graphonemes
* All lexicons limited to training-set words for testing

Isolated Words: Experimental Setup
Phonebook corpus:
* Isolated words spoken over the telephone by Americans
* 2,000 words chosen randomly from the set that had at least 13 speakers
* For each word, 1 male and 1 female utterance were held out for testing
* The remaining 22,000 utterances were used for training
Lexicon:
* The 2,000 test words were removed from the lexicon
* A simple edit-distance criterion was used to prune similar words
* A 5-gram graphoneme language model was trained
Testing:
* We test lexicons trained using varying amounts of acoustic data
* We compare with two baselines: expert baseline WER 12.4%; graphoneme L2S WER 16.7%

Isolated Word: Experimental Results
[Figure: "Lexicon Adaptation using Acoustic Examples" — WER on the test set (%), from 10 to 17, versus the number of utterances (M) used to learn baseforms, from 0 to 10. Curves: unweighted top PMM pronunciation and weighted PMM pronunciations, compared against the graphoneme lexicon and expert lexicon baselines. Annotation: collect 10 pronunciations for each word; train a PMM just on these pronunciations.]

Analysis
83% of the top pronunciations are identical to the dictionary baseform. Differing phones are shown in uppercase.

Word | Dictionary Baseform | Top PMM Pronunciation
parishoners [sic] | p AE r ih sh ax n er z | p AX r ih sh ax n er z
traumatic | tr r AO m ae tf ax kd | tr r AX m ae tf ax kd
winnifred | w ih n ax f r AX dd | w ih n ax f r EH dd
crosby | k r ao Z b iy | k r aa S b iy
melrose | m eh l r ow z | m eh l r ow s
arenas | ER iy n ax z | AX R iy n ax z
billowy | b ih l ow iy | b ih l ax W iy
whitener | w ay tf AX n er | w ay td n er
airsickness | eh r SH ih kd n EH s | eh r S ih kd n AX s
Isabel | AX S AA b eh l | IH Z AX b eh l

Continuous Speech: Experimental Setup
Jupiter weather-query corpus:
* Short queries (average of ~6 words in length)
* We prune utterances with non-speech artifacts from the corpus
* Training set: 76.68 hours of speech; test set: 3.18 hours; dev set: 0.84 hours
* The acoustic models are well matched to the training set
Lexicon & LM:
* The lexicons in these experiments contain only training-set words
* A trigram was trained on the training-set transcripts
Testing:
* We provide a baseline based only on the letter-to-sound pronunciations
* We decode using the expert dictionary to give us a second baseline
* We weight the expert lexicon using PMM training
* We train 5-gram graphone- and graphoneme-based PMMs

Continuous Speech: Experimental Results

System | WER
Graphoneme L2S | 11.2
Expert | 9.5
Expert PMM | 9.2
Phoneme PMM | 8.3
Phone PMM | 8.2

Analysis

Word | Dictionary Baseform | Top PMM Pronunciation
already | ao L r eh df iy | aa r eh df iy
antarctica | ae nt aa r KD t ax k ax | ae nt aa r tf ax k ax
asked | ae s KD t | ae s td
barbara | b aa r b AX r ah | b aa r b r ax
bratislava | b r aa tf ax s l aa v ax | b r tf ax z l aa v ax
clothes | k l ow dh z | k l ow z

Lexicon Sizes

Weather Lexicon | Avg # Prons | # States | # Arcs | Size
Expert | 1.2 | 32K | 152K | 3.5 MB
Phoneme PMM | 3.15 | 51K | 350K | 7.5 MB
Phone PMM | 4.0 | 47K | 231K | 5.2 MB

Varying Initialization Parameters
Singular graphones and variable M

                            M=1    M=2    M=3    M=4    M=5
LM FST size                 28K    64K    624K   3.1M   9.4M
Graphones alone (T=.01)
  PPW                       3.9    14.2   11.4   7.5    5.9
  WER dev                   71.4   20.5   14.3   13.1   12.5
  WER test                  74.9   17.6   11.7   10.9   10.2
Graphone-based PMM (T=.01)
  PPW                       6.08   4.0    3.5    3.1    3.0
  WER dev                   17.7   10.6   10.6   10.5   10.7
  WER test                  15.6   8.4    8.2    8.1    8.2

Varying Initialization Parameters
Graphones with M = 5 and variable L-to-R

                            1-to-1  1-to-2  1-to-3  2-to-2  3-to-3
LM FST size                 9.4M    12M     12M     25M     21M
Graphones alone (T=.01)
  PPW                       5.9     5.7     5.7     5.4     5.5
  WER dev                   12.5    12.2    12.4    12.2    12.3
  WER test                  10.2    10.2    10.3    10.4    10.1
Graphone-based PMM (T=.01)
  PPW                       3.0     3.0     2.9     2.9     2.9
  WER dev                   10.7    10.6    10.5    10.4    10.6
  WER test                  8.2     8.2     8.2     8.2     8.2

Conclusions
PMM: maximum-likelihood lexicon learning
* Flexible initialization from experts or L2S
* Requires no additional resources
* Produces better-than-expert pronunciations

Future Directions
* Apply to other languages
* Train in tandem with acoustic models (no phonological rules)
* Learn the lexicon from scratch?