The Hitachi/JHU CHiME-5 system: Advances in speech recognition for everyday home environments using multiple microphone arrays
Slide 1: CHiME 2018 Workshop
The Hitachi/JHU CHiME-5 system: Advances in speech recognition for everyday home environments using multiple microphone arrays
Naoyuki Kanda¹, Rintaro Ikeshita¹, Shota Horiguchi¹, Yusuke Fujita¹, Kenji Nagamatsu¹, Xiaofei Wang², Vimal Manohar², Nelson Enrique Yalta Soplin², Matthew Maciejewski², Szu-Jui Chen², Aswin Shanmugam Subramanian², Ruizhi Li², Zhiqi Wang², Jason Naradowsky², L. Paola Garcia-Perera², Gregory Sell²
¹Hitachi, ²Johns Hopkins University
Slide 2: Step-by-Step Improvements for Dev
[Figure: word error rate (%) on the Dev set after successive AM, front-end, decoding, and LM improvements, for the single-array and multiple-array tracks; y-axis roughly 50–85%]
Slide 3: Acoustic Model Training Pipeline
Step 1. GMM-AM: train mono → tri → LDA-MLLT → SAT models on the 1ch worn L, R, and L+R data.
Step 2. Alignment: align the 1ch worn L+R data, clean it up, then expand the cleaned alignments to the full set, yielding the cleaned full set with phone-state alignments.
Step 3. 1ch AM: on the cleaned full set, train the i-vector extractor, extract acoustic features and i-vectors, and run LF-MMI AM training → LF-MMI AM (1ch).
Step 4. 4ch AM: add a 4ch input branch to the 1ch LF-MMI AM; using Array 1–6 data (CH1 for acoustic features and weighted i-vector extraction, CH1–4 for the 4ch branch), run LF-MMI AM training and then LF-sMBR AM training [1] → LF-sMBR AM (4ch).
[1] Naoyuki Kanda, Yusuke Fujita, Kenji Nagamatsu, "Lattice-free state-level minimum Bayes risk training of acoustic models," Proc. Interspeech.
Slide 4: Acoustic Model Training Pipeline
[Same pipeline diagram, highlighting Step 2: data augmentation expands the training data from 40 h to 4,500 h]
Slide 5: Training Data for 1ch AM
Data sources: worn microphones (L, R, L+R), array CH1 (raw), and array audio after BeamFormIt.
[Figure: effect of data augmentation with the baseline AM — WER vs. training epoch on Worn-Dev and Ref-Array-Dev]
Data augmentation:
- Speed perturbation: factors {0.9, 1.0, 1.1}
- Volume perturbation
- Reverberation(*): impulse responses of simulated rooms generated by the image method, following the {small, medium}-size room settings of [1]
- Noise(*): non-speech regions of the array data added at SNRs of {20, 15, 10, 5, 0} dB
- Bandpass: a randomly selected frequency band is cut off (leaving at least a 1,000 Hz band below 4,000 Hz)
(*) Reverberation and noise perturbation were applied only to the worn-microphone data.
[1] T. Ko et al., "A study on data augmentation of reverberant speech for robust speech recognition," Proc. ICASSP.
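The noise step above mixes non-speech array audio into worn-microphone speech at a chosen SNR. A minimal numpy sketch of that mixing, with synthetic signals standing in for real audio (an illustration, not the authors' code):

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio is `snr_db`, then mix."""
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Gain g such that 10*log10(speech_power / (g^2 * noise_power)) == snr_db
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + gain * noise

rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 300 * np.arange(16000) / 16000)  # 1 s of a 300 Hz tone
noise = rng.standard_normal(16000)                            # stand-in for array non-speech

# The SNR grid used on the slide
augmented = [mix_at_snr(speech, noise, snr) for snr in (20, 15, 10, 5, 0)]
```

Speed and volume perturbation would be applied the same way, as independent passes over the corpus, multiplying the amount of training audio.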
Slide 6: Acoustic Model Training Pipeline
[Same pipeline diagram, highlighting Step 4: 4ch AM training]
Slide 7: 4ch CNN-TDNN-RBiLSTM
[Architecture diagram: a 1ch branch (MFCC with LDA and an i-vector; FBANK) and a 4ch branch (4ch log-spectrogram and phase → 3×3 Conv layers with 128 and 256 channels → 512-unit BiLSTMs with 256 nodes per direction) are concatenated and fed to three stacked 1024-unit Residual BiLSTM (RBiLSTM) blocks with 512 nodes per direction; an RBiLSTM block concatenates the forward and backward LSTM outputs, applies a ReLU, and adds a residual connection]
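An RBiLSTM block of the kind sketched in the diagram could look as follows in PyTorch. This is an illustrative reconstruction; the exact placement of the ReLU relative to the residual add is an assumption on my part, not taken from the paper:

```python
import torch
import torch.nn as nn

class ResidualBiLSTM(nn.Module):
    """One RBiLSTM block: a 1024-wide BiLSTM (512 nodes per direction),
    a ReLU, and a residual connection back to the block input."""

    def __init__(self, dim=1024):
        super().__init__()
        # bidirectional=True concatenates forward and backward outputs -> dim
        self.blstm = nn.LSTM(dim, dim // 2, bidirectional=True, batch_first=True)
        self.relu = nn.ReLU()

    def forward(self, x):          # x: (batch, time, dim)
        y, _ = self.blstm(x)
        return x + self.relu(y)    # residual add keeps the shape (batch, time, dim)

# Stacking three blocks as on the slide
stack = nn.Sequential(ResidualBiLSTM(), ResidualBiLSTM(), ResidualBiLSTM())
out = stack(torch.randn(2, 5, 1024))
```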
Slide 8: 4ch CNN-TDNN-RBiLSTM [same architecture diagram, annotating training stage (1): LF-MMI update]
Slide 9: 4ch CNN-TDNN-RBiLSTM [same architecture diagram, annotating training stage (2): LF-MMI update including the 4ch branch]
Slide 10: 4ch CNN-TDNN-RBiLSTM [same architecture diagram, annotating training stage (3): LF-sMBR update]
Slide 11: Step-by-Step Improvements for Dev [WER chart repeated as a section divider]
Slide 12: Complex Gaussian Mixture Model (CGMM) front-end
Challenges: the target speech may be overlapped with another speaker's speech, or corrupted by strong noise.
CGMM: a 3-class mixture over target, non-target, and noise:
  y_f(t) ~ α^tgt N_C(0, v_f^tgt(t) R_f^tgt) + α^nontgt N_C(0, v_f^nontgt(t) R_f^nontgt) + α^noise N_C(0, v_f^noise(t) R_f^noise)
The masks are estimated with the EM algorithm; the spatial correlation matrices R_f^tgt (target speaker + noise), R_f^nontgt (non-target speaker + noise), and R_f^noise (noise) yield the target-speech and noise masks that drive an MVDR beamformer.
Higuchi, Takuya, et al., "Online MVDR beamformer based on complex Gaussian mixture model with spatial prior for noise robust ASR," IEEE/ACM Transactions on Audio, Speech, and Language Processing 25.4 (2017).
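Given mask-derived spatial covariance matrices, the MVDR weights for one frequency bin can be computed as in this numpy sketch, taking the steering vector as the principal eigenvector of the target covariance. This is a generic illustration of the beamformer, not the system's implementation:

```python
import numpy as np

def mvdr_weights(R_target, R_noise):
    """MVDR weights for one frequency bin: w = R_n^{-1} d / (d^H R_n^{-1} d),
    with the steering vector d taken as the principal eigenvector of R_target."""
    _, eigvecs = np.linalg.eigh(R_target)        # Hermitian eigendecomposition
    d = eigvecs[:, -1]                           # largest-eigenvalue eigenvector
    Rn_inv_d = np.linalg.solve(R_noise, d)       # avoids an explicit matrix inverse
    return Rn_inv_d / (d.conj() @ Rn_inv_d)

# Toy 4-channel example: rank-1 target covariance plus a generic noise covariance
rng = np.random.default_rng(0)
steer = np.exp(1j * np.pi * 0.3 * np.arange(4))  # hypothetical steering vector
R_tgt = np.outer(steer, steer.conj())
A = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
R_nse = A @ A.conj().T / 4 + np.eye(4)           # Hermitian positive definite
w = mvdr_weights(R_tgt, R_nse)                   # distortionless toward d: w^H d = 1
```

The distortionless constraint w^H d = 1 holds by construction, which is what makes MVDR attractive for ASR front-ends: the target direction is passed untouched while other directions are minimized.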
Slide 13: Mask Estimation Neural Network
1. Train a mask estimation (ME) network [1][2] using mixtures of speech (non-speaker-overlapped regions of the worn microphones) and noise (non-speech regions of the array recordings) from the training set. The network maps |STFT| features (513 bins) through a BLSTM and FC+ReLU layers to two FC+Sigmoid heads producing a speech mask (X_mask) and a noise mask (N_mask), trained with binary cross-entropy (BCE) against reference masks.
2. Speaker adaptation(*): a gate on the output passes the estimated mask only for the target speaker's input, so the adapted network learns to suppress non-target speech:

    def gate(mask, speaker, target_speaker):
        # pass the estimated mask through only for the target speaker
        return mask if speaker == target_speaker else 0

(*) Only non-overlapped regions were used for adaptation.
[1] J. Heymann, L. Drude, and R. Haeb-Umbach, "Neural network based spectral mask estimation for acoustic beamforming," Proc. ICASSP, 2016.
[2] H. Erdogan, J. R. Hershey, S. Watanabe, M. I. Mandel, and J. Le Roux, "Improved MVDR beamforming using single-channel mask prediction networks," Proc. Interspeech, 2016.
Slide 14: Mask Estimation Neural Network
3. Mask inference: the target speaker's mask is selected only where the target speaker's output value is higher than every non-target speaker's value. Example with P01 (target) and P02 (non-target):

    X_mask_tgt = X_mask_P01 if X_mask_P01 > X_mask_P02 else 0
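The selection rule above, applied per time-frequency bin across all speakers' mask estimates, can be sketched with numpy; the speaker IDs and array shapes here are illustrative:

```python
import numpy as np

def select_target_mask(masks, target):
    """Keep the target speaker's mask only where it exceeds every
    non-target speaker's estimate; zero it elsewhere.

    masks: dict mapping speaker ID -> mask array of shape (frames, bins)
    """
    tgt = masks[target]
    others = np.stack([m for spk, m in masks.items() if spk != target])
    return np.where(tgt > others.max(axis=0), tgt, 0.0)

# Toy example with two speakers, as on the slide (P01 = target, P02 = non-target)
masks = {
    "P01": np.array([[0.9, 0.2, 0.6]]),
    "P02": np.array([[0.1, 0.8, 0.6]]),
}
out = select_target_mask(masks, "P01")   # -> [[0.9, 0.0, 0.0]]
```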
Slide 15: Step-by-Step Improvements for Dev [WER chart repeated as a section divider]
Slide 16: Language Modeling
Recurrent neural network based word LM:
- 2-layer LSTM with 512 nodes and 50% dropout
- 512-dimensional embeddings
- PyTorch implementation
- Interpolation weights: official LM : forward RNN-LM : backward RNN-LM = 0.5 : 0.25 : 0.25
[Table: WER (%) on the Dev set without / with the RNN-LM, and relative improvement, for the single-array and multiple-array tracks]
(*) Results with model combination and hypothesis deduplication.
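The 0.5 : 0.25 : 0.25 combination is a linear interpolation of the three models' word probabilities. A toy sketch (the distributions below are made up for illustration):

```python
def interpolate(dists, weights):
    """Linearly interpolate per-word probability distributions."""
    assert abs(sum(weights) - 1.0) < 1e-9
    vocab = dists[0].keys()
    return {w: sum(wt * d[w] for wt, d in zip(weights, dists)) for w in vocab}

# Hypothetical next-word distributions from the official LM and the
# forward and backward RNN-LMs
official = {"yeah": 0.6, "no": 0.4}
fwd_rnn  = {"yeah": 0.8, "no": 0.2}
bwd_rnn  = {"yeah": 0.4, "no": 0.6}

mix = interpolate([official, fwd_rnn, bwd_rnn], [0.5, 0.25, 0.25])
# mix["yeah"] = 0.5*0.6 + 0.25*0.8 + 0.25*0.4 = 0.6
```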
Slide 17: Step-by-Step Improvements for Dev [WER chart repeated as a section divider]
Slide 18: Decoding
Hypothesis combination by N-best ROVER:
- 6 AMs := CNN-TDNN-{LSTM, BiLSTM, RBiLSTM} × {3500, 7000} senones
- 2 front-ends := mask network, CGMM
- 6 arrays
We did not select an array; instead, we combined the hypotheses from every array: each (array, front-end, AM) triple produces one hypothesis (Hypothesis_1,1,1 … Hypothesis_6,2,6), and all of them are merged with N-best ROVER into the final result.
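Each (array, front-end, AM) triple yields one hypothesis, so ROVER merges 6 × 2 × 6 = 72 inputs per track. The enumeration is a plain Cartesian product; the component names below are illustrative:

```python
from itertools import product

arrays = [f"Array{i}" for i in range(1, 7)]
frontends = ["MaskNet", "CGMM"]
ams = [f"CNN-TDNN-{arch}-{senones}"
       for arch in ("LSTM", "BiLSTM", "RBiLSTM")
       for senones in (3500, 7000)]

# One hypothesis per (array, front-end, AM) combination, all fed to N-best ROVER
hypotheses = list(product(arrays, frontends, ams))
# 6 arrays x 2 front-ends x 6 AMs = 72 hypotheses
```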
Slide 19: Step-by-Step Improvements for Dev [WER chart repeated as a section divider]
Slide 20: Hypothesis Deduplication (HD)
The same words were sometimes recognized for overlapped utterances:
  P05: "um yeah"
  P08: "can i help with um yeah that looks good"
Duplicated words with lower confidence are excluded from the hypothesis. Because HD is applied after ROVER, precise time boundaries cannot be used; minimum-edit-distance word alignment is used instead to detect duplicated words.
[Table: WER (%) on the Dev set without / with HD, and relative improvement, for the single-array and multiple-array tracks]
(*) Results with the RNN-LM.
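The alignment-based deduplication can be sketched with Python's difflib: align two speakers' word sequences and drop the matched (duplicated) words from whichever side has lower confidence. This is a simplified illustration, not the system's exact procedure:

```python
from difflib import SequenceMatcher

def deduplicate(words_a, conf_a, words_b, conf_b):
    """Remove words of hypothesis A that also appear, aligned, in hypothesis B
    with higher confidence. Returns the surviving words of A."""
    drop = set()
    sm = SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    for block in sm.get_matching_blocks():      # longest matching word runs
        for k in range(block.size):
            i, j = block.a + k, block.b + k
            if conf_a[i] < conf_b[j]:           # A's copy is the less confident one
                drop.add(i)
    return [w for i, w in enumerate(words_a) if i not in drop]

# P05's short hypothesis duplicates two words inside P08's longer one
p05 = ["um", "yeah"]
p08 = ["can", "i", "help", "with", "um", "yeah", "that", "looks", "good"]
kept = deduplicate(p05, [0.3, 0.4], p08, [0.9] * len(p08))
# kept == []  (both duplicated words had lower confidence in P05)
```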
Slide 21: Final Results & Conclusion
Slide 22: Final Results
[Table: WER (%) without / with RNN-LM for the single-array and multiple-array tracks, per session (Dev and Eval) and per room (Kitchen, Dining, Living, Overall), with our best eval result highlighted]
Slide 23: Final Results
[Table repeated: WER (%) without / with RNN-LM per track, session, and room]
Array combination by ROVER worked well on the dev set but was not effective on the eval set. Why? Different types of rooms? Speaker-array distance? In any case, better array combination methods should be pursued.
Slide 24: Conclusion
Our contributions:
- Multiple data augmentation schemes
- 4ch AM with Residual BiLSTM
- Speaker-adaptive mask estimation network / CGMM-based beamformer
- Hypothesis deduplication
- Array combination by ROVER (found not effective for the evaluation set)
Our results: 48.2% WER on the evaluation set, ranked 2nd, only 2.1 points behind the best result.
Thank you for your attention!
Slide 25: Appendix
Slide 26: Comparison of AM Architectures
[Table: Worn-Dev and Ref-Array-Dev WER for the baseline (1ch, LF-MMI), CNN-TDNN-LSTM (1ch, LF-MMI), CNN-TDNN-BiLSTM (1ch, LF-MMI), CNN-TDNN-RBiLSTM (1ch, LF-MMI), CNN-TDNN-RBiLSTM (4ch, LF-MMI; Worn-Dev n/a), and CNN-TDNN-RBiLSTM (4ch, LF-sMBR [1]; Worn-Dev n/a)]
[1] Naoyuki Kanda, Yusuke Fujita, Kenji Nagamatsu, "Lattice-free state-level minimum Bayes risk training of acoustic models," Proc. Interspeech.
Slide 27: Comparison of Frontend Processing
[Table: Dev WER for front-end combinations (1ch input / 4ch input): BeamFormIt (= baseline) / raw, raw / raw, WPE / WPE, CGMM-MVDR / WPE, and speaker-adaptive-mask NN-MVDR / WPE]
Slide 28: Decoding (appendix)
Hypothesis combination by N-best ROVER:
- 6 AMs := CNN-TDNN-{LSTM, BiLSTM, RBiLSTM} × {3500, 7000} senones
- 2 front-ends := mask network, CGMM
- 6 arrays
We did not select an array; instead, we combined the hypotheses from every array with N-best ROVER.
[Table: Dev WER for the single-array and multiple-array tracks as the combination grows — more AMs, more arrays, and both front-ends (MaskNet alone vs. MaskNet + CGMM)]
(*) Results without RNN-LM.
Slide 29: Thank you
More informationIndex. Umberto Michelucci 2018 U. Michelucci, Applied Deep Learning,
A Acquisition function, 298, 301 Adam optimizer, 175 178 Anaconda navigator conda command, 3 Create button, 5 download and install, 1 installing packages, 8 Jupyter Notebook, 11 13 left navigation pane,
More informationLecture 7: Semantic Segmentation
Semantic Segmentation CSED703R: Deep Learning for Visual Recognition (207F) Segmenting images based on its semantic notion Lecture 7: Semantic Segmentation Bohyung Han Computer Vision Lab. bhhanpostech.ac.kr
More informationLearning The Lexicon!
Learning The Lexicon! A Pronunciation Mixture Model! Ian McGraw! (imcgraw@mit.edu)! Ibrahim Badr Jim Glass! Computer Science and Artificial Intelligence Lab! Massachusetts Institute of Technology! Cambridge,
More informationSequence Prediction with Neural Segmental Models. Hao Tang
Sequence Prediction with Neural Segmental Models Hao Tang haotang@ttic.edu About Me Pronunciation modeling [TKL 2012] Segmental models [TGL 2014] [TWGL 2015] [TWGL 2016] [TWGL 2016] American Sign Language
More informationDIRECT MODELING OF RAW AUDIO WITH DNNS FOR WAKE WORD DETECTION
DIRECT MODELING OF RAW AUDIO WITH DNNS FOR WAKE WORD DETECTION Kenichi Kumatani, Sankaran Panchapagesan, Minhua Wu, Minjae Kim, Nikko Ström, Gautam Tiwari, Arindam Mandal Amazon Inc. ABSTRACT In this work,
More information3D Convolutional Neural Networks for Landing Zone Detection from LiDAR
3D Convolutional Neural Networks for Landing Zone Detection from LiDAR Daniel Mataruna and Sebastian Scherer Presented by: Sabin Kafle Outline Introduction Preliminaries Approach Volumetric Density Mapping
More informationComparative Evaluation of Feature Normalization Techniques for Speaker Verification
Comparative Evaluation of Feature Normalization Techniques for Speaker Verification Md Jahangir Alam 1,2, Pierre Ouellet 1, Patrick Kenny 1, Douglas O Shaughnessy 2, 1 CRIM, Montreal, Canada {Janagir.Alam,
More informationLMS Sound Camera and Sound Source Localization Fast & Versatile Sound Source Localization
LMS Sound Camera and Sound Source Localization Fast & Versatile Sound Source Localization Realize innovation. Sound Source Localization What is it about?? Sounds like it s coming from the engine Sound
More informationMachine Learning for Speaker Recogni2on and Bioinforma2cs
Machine Learning for Speaker Recogni2on and Bioinforma2cs Man-Wai MAK Dept of Electronic and Informa8on Engineering, The Hong Kong Polytechnic University http://wwweiepolyueduhk/~mwmak/ UTS/PolyU Workshop
More informationTIME-DELAYED BOTTLENECK HIGHWAY NETWORKS USING A DFT FEATURE FOR KEYWORD SPOTTING
TIME-DELAYED BOTTLENECK HIGHWAY NETWORKS USING A DFT FEATURE FOR KEYWORD SPOTTING Jinxi Guo, Kenichi Kumatani, Ming Sun, Minhua Wu, Anirudh Raju, Nikko Ström, Arindam Mandal lennyguo@g.ucla.edu, {kumatani,
More informationImproving Face Recognition by Exploring Local Features with Visual Attention
Improving Face Recognition by Exploring Local Features with Visual Attention Yichun Shi and Anil K. Jain Michigan State University Difficulties of Face Recognition Large variations in unconstrained face
More informationChannel Locality Block: A Variant of Squeeze-and-Excitation
Channel Locality Block: A Variant of Squeeze-and-Excitation 1 st Huayu Li Northern Arizona University Flagstaff, United State Northern Arizona University hl459@nau.edu arxiv:1901.01493v1 [cs.lg] 6 Jan
More informationarxiv: v3 [cs.sd] 1 Nov 2018
AN IMPROVED HYBRID CTC-ATTENTION MODEL FOR SPEECH RECOGNITION Zhe Yuan, Zhuoran Lyu, Jiwei Li and Xi Zhou Cloudwalk Technology Inc, Shanghai, China arxiv:1810.12020v3 [cs.sd] 1 Nov 2018 ABSTRACT Recently,
More informationReverberant Speech Recognition Based on Denoising Autoencoder
INTERSPEECH 2013 Reverberant Speech Recognition Based on Denoising Autoencoder Takaaki Ishii 1, Hiroki Komiyama 1, Takahiro Shinozaki 2, Yasuo Horiuchi 1, Shingo Kuroiwa 1 1 Division of Information Sciences,
More informationAutoencoder. Representation learning (related to dictionary learning) Both the input and the output are x
Deep Learning 4 Autoencoder, Attention (spatial transformer), Multi-modal learning, Neural Turing Machine, Memory Networks, Generative Adversarial Net Jian Li IIIS, Tsinghua Autoencoder Autoencoder Unsupervised
More informationAutomatic Speech Recognition (ASR)
Automatic Speech Recognition (ASR) February 2018 Reza Yazdani Aminabadi Universitat Politecnica de Catalunya (UPC) State-of-the-art State-of-the-art ASR system: DNN+HMM Speech (words) Sound Signal Graph
More informationAUDIO SIGNAL PROCESSING FOR NEXT- GENERATION MULTIMEDIA COMMUNI CATION SYSTEMS
AUDIO SIGNAL PROCESSING FOR NEXT- GENERATION MULTIMEDIA COMMUNI CATION SYSTEMS Edited by YITENG (ARDEN) HUANG Bell Laboratories, Lucent Technologies JACOB BENESTY Universite du Quebec, INRS-EMT Kluwer
More informationAdaptive Dropout Training for SVMs
Department of Computer Science and Technology Adaptive Dropout Training for SVMs Jun Zhu Joint with Ning Chen, Jingwei Zhuo, Jianfei Chen, Bo Zhang Tsinghua University ShanghaiTech Symposium on Data Science,
More informationMultilayer and Multimodal Fusion of Deep Neural Networks for Video Classification
Multilayer and Multimodal Fusion of Deep Neural Networks for Video Classification Xiaodong Yang, Pavlo Molchanov, Jan Kautz INTELLIGENT VIDEO ANALYTICS Surveillance event detection Human-computer interaction
More informationRecurrent Neural Networks
Recurrent Neural Networks 11-785 / Fall 2018 / Recitation 7 Raphaël Olivier Recap : RNNs are magic They have infinite memory They handle all kinds of series They re the basis of recent NLP : Translation,
More informationDeep Learning Applications
October 20, 2017 Overview Supervised Learning Feedforward neural network Convolution neural network Recurrent neural network Recursive neural network (Recursive neural tensor network) Unsupervised Learning
More information