End-To-End Speech Recognition with Recurrent Neural Networks

1 RTTH Summer School on Speech Technology: A Deep Learning Perspective
End-To-End Speech Recognition with Recurrent Neural Networks
José A. R. Fonollosa
Universitat Politècnica de Catalunya, Barcelona
Barcelona, July 9, 2015

2 From speech processing to machine learning

3 Towards end-to-end RNN Speech Recognition?
Architectures
GMM-HMM: 30 years of feature engineering
DNN-GMM-HMM: Trained features
DNN-HMM: TDNN, LSTM, RNN, MS
DNN for language modeling (RNN)
End-to-end DNN?
Examples
Alex Graves (Google)
Deep Speech (Baidu)
Jose Fonollosa: Deep Learning Architectures for Speech Recognition, RTTH Summer School, Barcelona, 9 July

4 GMM-HMM
Perceptual Feature Extraction (MFCC, PLP, FF, VTLN, GammaTone, ...)
Feature Transformation (Derivative, LDA, MLLT, fMLLR, ...)
GMM (Training: ML, MMI, MPE, MWE, SAT, ...)
HMM
N-GRAM
Acoustic Model
Phonetic inventory
Pronunciation Lexicon
Language Model

5 DNN-GMM-HMM
Feature Extraction (MFCC, PLP, FF)
DNN (Tandem, Bottleneck, DNN-Derived)
Feature Transformation (LDA, MLLT, fMLLR)
GMM
HMM
N-GRAM
Acoustic Model
Phonetic inventory
Pronunciation Lexicon
Language Model

6 Tandem
MLP outputs as input to GMM
[Figure: network with input layer, hidden layers and output layer]

7 Bottleneck Features
Use one narrow hidden layer.
Supervised or unsupervised training (autoencoder)
[Figure: network with input layer, hidden layers and output layer]
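The idea of reading features off one narrow hidden layer can be sketched as below. This is only an illustration under stated assumptions: the layer sizes, the tanh nonlinearity, the random untrained weights and the helper names (`make_layer`, `dense`, `bottleneck_features`) are all hypothetical and not taken from the slides. In practice the network would first be trained (supervised, or as an autoencoder) and the bottleneck activations would then be passed on to the GMM-HMM system.

```python
import math
import random

def make_layer(n_in, n_out, rng):
    """Random weights and zero biases for one fully connected layer (untrained, illustrative)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def dense(layer, x):
    """Affine transform followed by a tanh nonlinearity."""
    weights, biases = layer
    return [math.tanh(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, biases)]

def bottleneck_features(layers, x):
    """Run x through the network and return the activations of the narrowest layer."""
    widths = [len(biases) for _, biases in layers]
    narrow = widths.index(min(widths))
    for layer in layers[:narrow + 1]:
        x = dense(layer, x)
    return x

rng = random.Random(0)
# Hypothetical sizes: input frame, hidden, bottleneck, hidden, output classes
sizes = [39, 64, 8, 64, 20]
layers = [make_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
frame = [rng.gauss(0.0, 1.0) for _ in range(sizes[0])]
features = bottleneck_features(layers, frame)
print(len(features))
```

The layers after the bottleneck are only needed during training; at feature-extraction time the forward pass stops at the narrow layer.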

8 DNN-Derived Features
Zhi-Jie Yan, Qiang Huo, Jian Xu: A scalable approach to using DNN-derived features in GMM-HMM based acoustic modeling for LVCSR. INTERSPEECH 2013
[Figure: network with input layer, hidden layers and output layer]

9 DNN-GMM-HMM
DNN (Tandem, Bottleneck, DNN-Derived)
Feature Transformation (LDA, MLLT, fMLLR, ...)
GMM
HMM
N-GRAM
Acoustic Model
Phonetic inventory
Pronunciation Lexicon
Language Model

10 DNN-HMM
DNN
HMM
N-GRAM
Acoustic Model
Phonetic inventory
Pronunciation Lexicon
Language Model

11 DNN-HMM + RNNLM
DNN
HMM
(N-GRAM +) RNN
Acoustic Model
Phonetic inventory
Pronunciation Lexicon
Language Model

12 RNN-RNNLM
RNN Acoustic Model
(N-GRAM +) RNN Language Model

13 End-to-End RNN
Alex Graves et al. (2006) Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. Proceedings of the International Conference on Machine Learning, ICML
Florian Eyben, Martin Wöllmer, Björn Schuller, Alex Graves (2009) From Speech to Letters - Using a Novel Neural Network Architecture for Grapheme Based ASR. ASRU
Alex Graves, Navdeep Jaitly (Jun 2014) Towards End-To-End Speech Recognition with Recurrent Neural Networks. International Conference on Machine Learning, ICML
Jan Chorowski et al. (Dec 2014) End-to-end Continuous Speech Recognition using Attention-based Recurrent NN: First Results. Deep Learning and Representation Learning Workshop, NIPS
Awni Hannun et al. (Dec 2014) Deep Speech: Scaling up end-to-end speech recognition. arXiv: [cs.CL]

14 End-to-End RNN
No perceptual features (MFCC). No feature transformation.
No phonetic inventory. No transcription dictionary. No HMM.
The outputs of the RNN are characters, including space and apostrophe (not CD phones).
Connectionist Temporal Classification (no fixed alignment between speech and characters)
Data augmentation: 5,000 hours (9,600 speakers) + noise = 100,000 hours.
Optimizations: data parallelism, model parallelism
Good results in noisy conditions
Adam Coates

15 Bidirectional Recurrent DNN
[Figure: unrolled RNN mapping a spectrogram to the output characters "THE DOG"]
Clipped ReLU
Accelerated gradient method
GPU friendly
$$h^{(f)}_t = g\left(W^{(4)} h^{(3)}_t + W^{(f)}_r h^{(f)}_{t-1} + b^{(4)}\right)$$
$$h^{(b)}_t = g\left(W^{(4)} h^{(3)}_t + W^{(b)}_r h^{(b)}_{t+1} + b^{(4)}\right)$$
h^(f) is computed sequentially from t = 1 to t = T, and h^(b) in reverse from t = T to t = 1.
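The forward and backward recurrences on this slide can be sketched in plain Python as follows. This is a minimal illustration, not the Deep Speech implementation: the function names, the zero initial states and the toy one-dimensional weights are assumptions, and the clipping value of 20 is the one reported in the Deep Speech paper rather than something stated on the slide.

```python
def clipped_relu(z, cap=20.0):
    """g(z) = min(max(0, z), cap)."""
    return min(max(0.0, z), cap)

def matvec(m, v):
    """Matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def bidirectional_layer(h3, w, w_f, w_b, b, cap=20.0):
    """Apply the forward and backward recurrences to a list of T input vectors h3.

    w is the shared input matrix W^(4); w_f and w_b are the recurrent matrices.
    Initial states are assumed to be zero. Returns (h_f, h_b).
    """
    T = len(h3)
    zeros = [0.0] * len(b)
    inputs = [matvec(w, x) for x in h3]  # W^(4) h^(3)_t, shared by both directions

    h_f = [None] * T
    prev = zeros
    for t in range(T):  # forward in time: t = 1 .. T
        rec = matvec(w_f, prev)
        h_f[t] = [clipped_relu(i + r + bias, cap) for i, r, bias in zip(inputs[t], rec, b)]
        prev = h_f[t]

    h_b = [None] * T
    nxt = zeros
    for t in reversed(range(T)):  # backward in time: t = T .. 1
        rec = matvec(w_b, nxt)
        h_b[t] = [clipped_relu(i + r + bias, cap) for i, r, bias in zip(inputs[t], rec, b)]
        nxt = h_b[t]

    return h_f, h_b

# Toy example with a single hidden unit (all values hypothetical)
h3 = [[1.0], [2.0], [3.0]]
h_f, h_b = bidirectional_layer(h3, w=[[1.0]], w_f=[[0.5]], w_b=[[0.5]], b=[0.0])
print(h_f, h_b)
```

The input projection is computed once and shared by both directions, while each recurrence has to be evaluated step by step, which is why the two loops cannot be parallelized over time.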

16 Connectionist Temporal Classification
How to connect speech data with transcription?
Transcription not labeled per millisecond
Use CTC, from [Graves 06]
Efficient dynamic programming of all possible alignments to compute error of {audio, transcription}
[Figure: output characters T H _ E D O G aligned against the audio]
Bryan Catanzaro
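The dynamic programming over all alignments can be sketched with the CTC forward recursion from Graves et al. (2006). The version below is a simplified illustration under several assumptions: the blank symbol has index 0, the network outputs are given as per-frame probability lists, and the computation stays in plain probability space (practical implementations work in log space or rescale to avoid underflow). The function name `ctc_probability` is hypothetical.

```python
import math

def ctc_probability(probs, labels, blank=0):
    """Total probability of `labels` summed over all CTC alignments.

    probs[t][k] is the network's probability of symbol k at frame t.
    """
    # Extended label sequence: blanks between labels and at both ends
    ext = [blank]
    for label in labels:
        ext += [label, blank]
    S, T = len(ext), len(probs)

    # alpha[s]: probability of all partial alignments ending in ext[s] at the current frame
    alpha = [0.0] * S
    alpha[0] = probs[0][ext[0]]
    if S > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, T):
        new = [0.0] * S
        for s in range(S):
            total = alpha[s]                      # stay on the same symbol
            if s >= 1:
                total += alpha[s - 1]             # advance by one position
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]             # skip the blank between two different labels
            new[s] = total * probs[t][ext[s]]
        alpha = new

    # Valid alignments end on the last label or on the trailing blank
    return alpha[-1] + (alpha[-2] if S > 1 else 0.0)

# Two frames, two symbols (0 = blank, 1 = "a"), uniform outputs
uniform = [[0.5, 0.5], [0.5, 0.5]]
p = ctc_probability(uniform, [1])
loss = -math.log(p)  # the training error is the negative log-likelihood
print(p, loss)
```

With two uniform frames, the label "a" is produced by the alignments (a, a), (a, blank) and (blank, a), so its probability is three times 0.25.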

17 Connectionist Temporal Classification
The framewise network receives an error for misalignment.
The CTC network predicts the sequence of phonemes / characters (as spikes separated by blanks).
No forced alignment (initial model) is required for training.
[Figure: framewise versus CTC label probabilities over the waveform of "the sound of" (phonemes dh ax s aw n dcl d ix v). Alex Graves, 2006]
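How a sequence of spikes and blanks turns into a transcription can be illustrated with best-path (greedy) decoding: pick the most probable symbol per frame, merge repeats, then drop blanks. This is a sketch only; best-path decoding is an approximation, and the alphabet, the `-` blank marker, the `_` space marker and the one-hot toy frames are assumptions made for the example.

```python
from itertools import groupby

def ctc_greedy_decode(frame_probs, alphabet, blank=0):
    """Best-path decoding: argmax per frame, collapse repeats, remove blanks."""
    best = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    collapsed = [k for k, _ in groupby(best)]
    return "".join(alphabet[k] for k in collapsed if k != blank)

def one_hot_frames(path, alphabet):
    """Toy helper: build per-frame distributions that put all mass on the given path."""
    return [[1.0 if sym == c else 0.0 for sym in alphabet] for c in path]

# Hypothetical alphabet: index 0 ("-") is the CTC blank, "_" stands for the space character
alphabet = list("-_DEGHLOT")
frames = one_hot_frames("TT-HEE_-DOOG", alphabet)
print(ctc_greedy_decode(frames, alphabet))
```

A blank between two identical symbols keeps them apart: the path `L-L` decodes to `LL`, while `LL` collapses to a single `L`.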

18 GMM-HMM / DNN-HMM / RNN
End-to-end DL may work better when we have big models and lots of data.
[Figure: accuracy versus data + model size for Traditional ASR, DL V1 for Speech and Deep Speech]
Bryan Catanzaro, Adam Coates

19 Data Augmentation
This approach needs bigger models and bigger datasets.
Synthesis by superposition: reverberations, echoes, a large number of short noise recordings from public video sources, jitter
Lombard effect
[Figure: bar chart of training hours for WSJ, Switchboard, Fisher and Deep Speech; surviving bar labels are 300 and 2000; the Deep Speech bar exceeds 100,000 hours with synthesized data]
Bryan Catanzaro, Adam Coates
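Synthesis by superposition can be sketched as adding a scaled noise clip to a clean utterance. The example below is a minimal illustration and not the Deep Speech pipeline: mixing at a target signal-to-noise ratio, looping the noise clip to cover the utterance, and the names `rms` and `add_noise` are all assumptions, since the slide does not say how the noise level was chosen.

```python
import math
import random

def rms(signal):
    """Root-mean-square amplitude of a list of samples."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))

def add_noise(speech, noise, snr_db):
    """Superimpose a noise clip on speech, scaled to the requested SNR in dB."""
    # Loop the (short) noise clip so it covers the whole utterance
    tiled = [noise[i % len(noise)] for i in range(len(speech))]
    # Choose the gain so that 20*log10(rms(speech) / rms(gain * tiled)) == snr_db
    gain = rms(speech) / (rms(tiled) * 10 ** (snr_db / 20.0))
    return [s + gain * n for s, n in zip(speech, tiled)]

rng = random.Random(0)
# Stand-ins for real recordings: a 440 Hz tone at 16 kHz and a short burst of white noise
speech = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(1600)]
noise = [rng.uniform(-1.0, 1.0) for _ in range(400)]
mixed = add_noise(speech, noise, snr_db=10.0)
print(len(mixed))
```

Applying several different noise clips and SNR values to each clean utterance is one way a few thousand recorded hours could be multiplied into a much larger synthetic training set.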

20 Results: 2000 HUB5 (LDC2002S09)
System; AM training data; SWB; CH
Vesely et al. (2013); SWB
Seide et al. (2014); SWB+Fisher+other; 13.1
Hannun et al. (2014); SWB+Fisher
Zhou et al. (2014); SWB; 14.2
Maas et al. (2014); SWB
Maas et al. (2014); SWB+Fisher
Soltau et al. (2014); SWB
Saon et al. (2015); SWB+Fisher+CH

21 Hybrid (IBM) versus DL (Baidu)
Features. IBM 2015: VTL-PLP, MVN, LDA, STC, fMLLR, i-Vector. Baidu 2014: 80 log filter banks.
Alignment. IBM 2015: GMM-HMM, 300K Gaussians. Baidu 2014: -
DNN. IBM 2015: DNN(5x2048) + CNN(128x9x9+5x2048) + RNN outputs. Baidu 2014: 4RNN (5 x 2304), 28 outputs.
DNN Training. IBM 2015: CE + MBR Discriminative Training (ST). Baidu 2014: CTC.
HMM. IBM 2015: 32K states (DNN outputs), pentaphone acoustic context. Baidu 2014: -
Language Model. IBM 2015: 37M 4-gram + model M (class based exponential model) + 2 NNLM. Baidu 2014: 4-gram (Transcripts).

22 Results II
Training: Baidu database + data augmentation
Test: new dataset of 200 recordings in both clean and noisy settings (no details)
Comparison against commercial systems in production
[Figure: word error rate (%) on the Clean, Noisy and Combined sets for Apple Dictation, Bing Speech, Google API, wit.ai and Deep Speech]
Bryan Catanzaro, Adam Coates
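The word error rate used in these comparisons is the word-level edit distance between the reference transcript and the recognizer output, divided by the number of reference words. A minimal sketch follows, assuming whitespace tokenization and a non-empty reference; the function name and the example sentences are illustrative and are not taken from the evaluation data.

```python
def word_error_rate(reference, hypothesis):
    """WER in percent: (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] holds the edit distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion of a reference word
                           cur[j - 1] + 1,           # insertion of a hypothesis word
                           prev[j - 1] + (r != h)))  # substitution, or match at no cost
        prev = cur
    return 100.0 * prev[-1] / len(ref)

print(word_error_rate("the sound of", "the sound of"))
print(word_error_rate("the dog barks", "the dog bark"))
```

Because insertions count as errors, the WER can exceed 100% when the hypothesis is much longer than the reference.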

23 Deep Speech Demo
http:// highlight/

24 References
Alex Graves et al. (2006) Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. Proceedings of the International Conference on Machine Learning, ICML
Florian Eyben, Martin Wöllmer, Björn Schuller, Alex Graves (2009) From Speech to Letters - Using a Novel Neural Network Architecture for Grapheme Based ASR. ASRU
Alex Graves, Navdeep Jaitly (Jun 2014) Towards End-To-End Speech Recognition with Recurrent Neural Networks. International Conference on Machine Learning, ICML
Jan Chorowski et al. (Dec 2014) End-to-end Continuous Speech Recognition using Attention-based Recurrent NN: First Results. Deep Learning and Representation Learning Workshop, NIPS
Awni Hannun et al. (Dec 2014) Deep Speech: Scaling up end-to-end speech recognition. arXiv: [cs.CL]
George Saon et al. The IBM 2015 English Conversational Telephone Speech Recognition System. arXiv: [cs.CL]

Deep Neural Networks in HMM-based and HMM-free Speech Recognition
Andrew Maas
Collaborators: Awni Hannun, Peng Qi, Chris Lengerich, Ziang Xie, and Anshul Samar
Advisors: Andrew Ng and Dan Jurafsky
Outline