The Hitachi/JHU CHiME-5 system: Advances in speech recognition for everyday home environments using multiple microphone arrays
Slide 1: CHiME 2018 Workshop
The Hitachi/JHU CHiME-5 system: Advances in speech recognition for everyday home environments using multiple microphone arrays
Naoyuki Kanda¹, Rintaro Ikeshita¹, Shota Horiguchi¹, Yusuke Fujita¹, Kenji Nagamatsu¹, Xiaofei Wang², Vimal Manohar², Nelson Enrique Yalta Soplin², Matthew Maciejewski², Szu-Jui Chen², Aswin Shanmugam Subramanian², Ruizhi Li², Zhiqi Wang², Jason Naradowsky², L. Paola Garcia-Perera², Gregory Sell²
¹Hitachi, ²Johns Hopkins University
Slide 2: Step-by-Step Improvements for Dev
[Figure: word error rate (%) on the Dev set after successive AM, front-end, decoding, and LM improvements, for the single-array and multiple-array tracks; y-axis roughly 50–85%]
Slide 3: Acoustic Model Training Pipeline
Step 1. GMM-AM: train mono → tri → LDA-MLLT → SAT models on the 1ch worn L, R, and L+R data.
Step 2. Alignment: align the 1ch worn L+R data, clean it up, then expand the cleaned alignments to the full set, yielding the cleaned full set with phone-state alignments.
Step 3. 1ch AM: on the cleaned full set, train the i-vector extractor, extract acoustic features and i-vectors, and run LF-MMI AM training → LF-MMI AM (1ch).
Step 4. 4ch AM: add a 4ch input branch to the 1ch LF-MMI AM; using Array 1–6 data (CH1 for acoustic features and weighted i-vector extraction, CH1–4 for the 4ch branch), run LF-MMI AM training and then LF-sMBR AM training [1] → LF-sMBR AM (4ch).
[1] Naoyuki Kanda, Yusuke Fujita, Kenji Nagamatsu, "Lattice-free state-level minimum Bayes risk training of acoustic models," Proc. Interspeech.
Slide 4: Acoustic Model Training Pipeline
[Same pipeline diagram, highlighting Step 2: data augmentation expands the training data from 40 h to 4,500 h]
Slide 5: Training Data for 1ch AM
Data sources: worn microphones (L, R, L+R), array CH1 (raw), and array audio after BeamFormIt.
[Figure: effect of data augmentation with the baseline AM — WER vs. training epoch on Worn-Dev and Ref-Array-Dev]
Data augmentation:
- Speed perturbation: factors {0.9, 1.0, 1.1}
- Volume perturbation
- Reverberation(*): impulse responses of simulated rooms generated by the image method, following the {small, medium}-size room settings of [1]
- Noise(*): non-speech regions of the array data added at SNRs of {20, 15, 10, 5, 0} dB
- Bandpass: a randomly selected frequency band is cut off (leaving at least a 1,000 Hz band below 4,000 Hz)
(*) Reverberation and noise perturbation were applied only to the worn-microphone data.
[1] T. Ko et al., "A study on data augmentation of reverberant speech for robust speech recognition," Proc. ICASSP.
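The noise step above mixes non-speech array audio into worn-microphone speech at a chosen SNR. A minimal numpy sketch of that mixing, with synthetic signals standing in for real audio (an illustration, not the authors' code):

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio is `snr_db`, then mix."""
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Gain g such that 10*log10(speech_power / (g^2 * noise_power)) == snr_db
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + gain * noise

rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 300 * np.arange(16000) / 16000)  # 1 s of a 300 Hz tone
noise = rng.standard_normal(16000)                            # stand-in for array non-speech

# The SNR grid used on the slide
augmented = [mix_at_snr(speech, noise, snr) for snr in (20, 15, 10, 5, 0)]
```

Speed and volume perturbation would be applied the same way, as independent passes over the corpus, multiplying the amount of training audio.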
Slide 6: Acoustic Model Training Pipeline
[Same pipeline diagram, highlighting Step 4: 4ch AM training]
Slide 7: 4ch CNN-TDNN-RBiLSTM
[Architecture diagram: a 1ch branch (MFCC with LDA and an i-vector; FBANK) and a 4ch branch (4ch log-spectrogram and phase → 3×3 Conv layers with 128 and 256 channels → 512-unit BiLSTMs with 256 nodes per direction) are concatenated and fed to three stacked 1024-unit Residual BiLSTM (RBiLSTM) blocks with 512 nodes per direction; an RBiLSTM block concatenates the forward and backward LSTM outputs, applies a ReLU, and adds a residual connection]
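An RBiLSTM block of the kind sketched in the diagram could look as follows in PyTorch. This is an illustrative reconstruction; the exact placement of the ReLU relative to the residual add is an assumption on my part, not taken from the paper:

```python
import torch
import torch.nn as nn

class ResidualBiLSTM(nn.Module):
    """One RBiLSTM block: a 1024-wide BiLSTM (512 nodes per direction),
    a ReLU, and a residual connection back to the block input."""

    def __init__(self, dim=1024):
        super().__init__()
        # bidirectional=True concatenates forward and backward outputs -> dim
        self.blstm = nn.LSTM(dim, dim // 2, bidirectional=True, batch_first=True)
        self.relu = nn.ReLU()

    def forward(self, x):          # x: (batch, time, dim)
        y, _ = self.blstm(x)
        return x + self.relu(y)    # residual add keeps the shape (batch, time, dim)

# Stacking three blocks as on the slide
stack = nn.Sequential(ResidualBiLSTM(), ResidualBiLSTM(), ResidualBiLSTM())
out = stack(torch.randn(2, 5, 1024))
```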
Slide 8: 4ch CNN-TDNN-RBiLSTM [same architecture diagram, annotating training stage (1): LF-MMI update]
Slide 9: 4ch CNN-TDNN-RBiLSTM [same architecture diagram, annotating training stage (2): LF-MMI update including the 4ch branch]
Slide 10: 4ch CNN-TDNN-RBiLSTM [same architecture diagram, annotating training stage (3): LF-sMBR update]
Slide 11: Step-by-Step Improvements for Dev [WER chart repeated as a section divider]
Slide 12: Complex Gaussian Mixture Model (CGMM) front-end
Challenges: the target speech may be overlapped with another speaker's speech, or corrupted by strong noise.
CGMM: a 3-class mixture over target, non-target, and noise:
  y_f(t) ~ α^tgt N_C(0, v_f^tgt(t) R_f^tgt) + α^nontgt N_C(0, v_f^nontgt(t) R_f^nontgt) + α^noise N_C(0, v_f^noise(t) R_f^noise)
The masks are estimated with the EM algorithm; the spatial correlation matrices R_f^tgt (target speaker + noise), R_f^nontgt (non-target speaker + noise), and R_f^noise (noise) yield the target-speech and noise masks that drive an MVDR beamformer.
Higuchi, Takuya, et al., "Online MVDR beamformer based on complex Gaussian mixture model with spatial prior for noise robust ASR," IEEE/ACM Transactions on Audio, Speech, and Language Processing 25.4 (2017).
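Given mask-derived spatial covariance matrices, the MVDR weights for one frequency bin can be computed as in this numpy sketch, taking the steering vector as the principal eigenvector of the target covariance. This is a generic illustration of the beamformer, not the system's implementation:

```python
import numpy as np

def mvdr_weights(R_target, R_noise):
    """MVDR weights for one frequency bin: w = R_n^{-1} d / (d^H R_n^{-1} d),
    with the steering vector d taken as the principal eigenvector of R_target."""
    _, eigvecs = np.linalg.eigh(R_target)        # Hermitian eigendecomposition
    d = eigvecs[:, -1]                           # largest-eigenvalue eigenvector
    Rn_inv_d = np.linalg.solve(R_noise, d)       # avoids an explicit matrix inverse
    return Rn_inv_d / (d.conj() @ Rn_inv_d)

# Toy 4-channel example: rank-1 target covariance plus a generic noise covariance
rng = np.random.default_rng(0)
steer = np.exp(1j * np.pi * 0.3 * np.arange(4))  # hypothetical steering vector
R_tgt = np.outer(steer, steer.conj())
A = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
R_nse = A @ A.conj().T / 4 + np.eye(4)           # Hermitian positive definite
w = mvdr_weights(R_tgt, R_nse)                   # distortionless toward d: w^H d = 1
```

The distortionless constraint w^H d = 1 holds by construction, which is what makes MVDR attractive for ASR front-ends: the target direction is passed untouched while other directions are minimized.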
Slide 13: Mask Estimation Neural Network
1. Train a mask estimation (ME) network [1][2] using mixtures of speech (non-speaker-overlapped regions of the worn microphones) and noise (non-speech regions of the array recordings) from the training set. The network maps |STFT| features (513 bins) through a BLSTM and FC+ReLU layers to two FC+Sigmoid heads producing a speech mask (X_mask) and a noise mask (N_mask), trained with binary cross-entropy (BCE) against reference masks.
2. Speaker adaptation(*): a gate on the output passes the estimated mask only for the target speaker's input, so the adapted network learns to suppress non-target speech:

    def gate(mask, speaker, target_speaker):
        # pass the estimated mask through only for the target speaker
        return mask if speaker == target_speaker else 0

(*) Only non-overlapped regions were used for adaptation.
[1] J. Heymann, L. Drude, and R. Haeb-Umbach, "Neural network based spectral mask estimation for acoustic beamforming," Proc. ICASSP, 2016.
[2] H. Erdogan, J. R. Hershey, S. Watanabe, M. I. Mandel, and J. Le Roux, "Improved MVDR beamforming using single-channel mask prediction networks," Proc. Interspeech, 2016.
Slide 14: Mask Estimation Neural Network
3. Mask inference: the target speaker's mask is selected only where the target speaker's output value is higher than every non-target speaker's value. Example with P01 (target) and P02 (non-target):

    X_mask_tgt = X_mask_P01 if X_mask_P01 > X_mask_P02 else 0
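The selection rule above, applied per time-frequency bin across all speakers' mask estimates, can be sketched with numpy; the speaker IDs and array shapes here are illustrative:

```python
import numpy as np

def select_target_mask(masks, target):
    """Keep the target speaker's mask only where it exceeds every
    non-target speaker's estimate; zero it elsewhere.

    masks: dict mapping speaker ID -> mask array of shape (frames, bins)
    """
    tgt = masks[target]
    others = np.stack([m for spk, m in masks.items() if spk != target])
    return np.where(tgt > others.max(axis=0), tgt, 0.0)

# Toy example with two speakers, as on the slide (P01 = target, P02 = non-target)
masks = {
    "P01": np.array([[0.9, 0.2, 0.6]]),
    "P02": np.array([[0.1, 0.8, 0.6]]),
}
out = select_target_mask(masks, "P01")   # -> [[0.9, 0.0, 0.0]]
```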
Slide 15: Step-by-Step Improvements for Dev [WER chart repeated as a section divider]
Slide 16: Language Modeling
Recurrent neural network based word LM:
- 2-layer LSTM with 512 nodes and 50% dropout
- 512-dimensional embeddings
- PyTorch implementation
- Interpolation weights: official LM : forward RNN-LM : backward RNN-LM = 0.5 : 0.25 : 0.25
[Table: WER (%) on the Dev set without / with the RNN-LM, and relative improvement, for the single-array and multiple-array tracks]
(*) Results with model combination and hypothesis deduplication.
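The 0.5 : 0.25 : 0.25 combination is a linear interpolation of the three models' word probabilities. A toy sketch (the distributions below are made up for illustration):

```python
def interpolate(dists, weights):
    """Linearly interpolate per-word probability distributions."""
    assert abs(sum(weights) - 1.0) < 1e-9
    vocab = dists[0].keys()
    return {w: sum(wt * d[w] for wt, d in zip(weights, dists)) for w in vocab}

# Hypothetical next-word distributions from the official LM and the
# forward and backward RNN-LMs
official = {"yeah": 0.6, "no": 0.4}
fwd_rnn  = {"yeah": 0.8, "no": 0.2}
bwd_rnn  = {"yeah": 0.4, "no": 0.6}

mix = interpolate([official, fwd_rnn, bwd_rnn], [0.5, 0.25, 0.25])
# mix["yeah"] = 0.5*0.6 + 0.25*0.8 + 0.25*0.4 = 0.6
```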
Slide 17: Step-by-Step Improvements for Dev [WER chart repeated as a section divider]
Slide 18: Decoding
Hypothesis combination by N-best ROVER:
- 6 AMs := CNN-TDNN-{LSTM, BiLSTM, RBiLSTM} × {3500, 7000} senones
- 2 front-ends := mask network, CGMM
- 6 arrays
We did not select an array; instead, we combined the hypotheses from every array: each (array, front-end, AM) triple produces one hypothesis (Hypothesis_1,1,1 … Hypothesis_6,2,6), and all of them are merged with N-best ROVER into the final result.
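Each (array, front-end, AM) triple yields one hypothesis, so ROVER merges 6 × 2 × 6 = 72 inputs per track. The enumeration is a plain Cartesian product; the component names below are illustrative:

```python
from itertools import product

arrays = [f"Array{i}" for i in range(1, 7)]
frontends = ["MaskNet", "CGMM"]
ams = [f"CNN-TDNN-{arch}-{senones}"
       for arch in ("LSTM", "BiLSTM", "RBiLSTM")
       for senones in (3500, 7000)]

# One hypothesis per (array, front-end, AM) combination, all fed to N-best ROVER
hypotheses = list(product(arrays, frontends, ams))
# 6 arrays x 2 front-ends x 6 AMs = 72 hypotheses
```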
Slide 19: Step-by-Step Improvements for Dev [WER chart repeated as a section divider]
Slide 20: Hypothesis Deduplication (HD)
The same words were sometimes recognized for overlapped utterances:
  P05: "um yeah"
  P08: "can i help with um yeah that looks good"
Duplicated words with lower confidence are excluded from the hypothesis. Because HD is applied after ROVER, precise time boundaries cannot be used; minimum-edit-distance word alignment is used instead to detect duplicated words.
[Table: WER (%) on the Dev set without / with HD, and relative improvement, for the single-array and multiple-array tracks]
(*) Results with the RNN-LM.
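The alignment-based deduplication can be sketched with Python's difflib: align two speakers' word sequences and drop the matched (duplicated) words from whichever side has lower confidence. This is a simplified illustration, not the system's exact procedure:

```python
from difflib import SequenceMatcher

def deduplicate(words_a, conf_a, words_b, conf_b):
    """Remove words of hypothesis A that also appear, aligned, in hypothesis B
    with higher confidence. Returns the surviving words of A."""
    drop = set()
    sm = SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    for block in sm.get_matching_blocks():      # longest matching word runs
        for k in range(block.size):
            i, j = block.a + k, block.b + k
            if conf_a[i] < conf_b[j]:           # A's copy is the less confident one
                drop.add(i)
    return [w for i, w in enumerate(words_a) if i not in drop]

# P05's short hypothesis duplicates two words inside P08's longer one
p05 = ["um", "yeah"]
p08 = ["can", "i", "help", "with", "um", "yeah", "that", "looks", "good"]
kept = deduplicate(p05, [0.3, 0.4], p08, [0.9] * len(p08))
# kept == []  (both duplicated words had lower confidence in P05)
```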
Slide 21: Final Results & Conclusion
Slide 22: Final Results
[Table: WER (%) without / with RNN-LM for the single-array and multiple-array tracks, per session (Dev and Eval) and per room (Kitchen, Dining, Living, Overall), with our best eval result highlighted]
Slide 23: Final Results
[Table repeated: WER (%) without / with RNN-LM per track, session, and room]
Array combination by ROVER worked well on the dev set but was not effective on the eval set. Why? Different types of rooms? Speaker-array distance? In any case, better array combination methods should be pursued.
Slide 24: Conclusion
Our contributions:
- Multiple data augmentation schemes
- 4ch AM with Residual BiLSTM
- Speaker-adaptive mask estimation network / CGMM-based beamformer
- Hypothesis deduplication
- Array combination by ROVER (found not effective for the evaluation set)
Our results: 48.2% WER on the evaluation set, ranked 2nd, only 2.1 points behind the best result.
Thank you for your attention!
Slide 25: Appendix
Slide 26: Comparison of AM Architectures
[Table: Worn-Dev and Ref-Array-Dev WER for the baseline (1ch, LF-MMI), CNN-TDNN-LSTM (1ch, LF-MMI), CNN-TDNN-BiLSTM (1ch, LF-MMI), CNN-TDNN-RBiLSTM (1ch, LF-MMI), CNN-TDNN-RBiLSTM (4ch, LF-MMI; Worn-Dev n/a), and CNN-TDNN-RBiLSTM (4ch, LF-sMBR [1]; Worn-Dev n/a)]
[1] Naoyuki Kanda, Yusuke Fujita, Kenji Nagamatsu, "Lattice-free state-level minimum Bayes risk training of acoustic models," Proc. Interspeech.
Slide 27: Comparison of Frontend Processing
[Table: Dev WER for front-end combinations (1ch input / 4ch input): BeamFormIt (= baseline) / raw, raw / raw, WPE / WPE, CGMM-MVDR / WPE, and speaker-adaptive-mask NN-MVDR / WPE]
Slide 28: Decoding (appendix)
Hypothesis combination by N-best ROVER:
- 6 AMs := CNN-TDNN-{LSTM, BiLSTM, RBiLSTM} × {3500, 7000} senones
- 2 front-ends := mask network, CGMM
- 6 arrays
We did not select an array; instead, we combined the hypotheses from every array with N-best ROVER.
[Table: Dev WER for the single-array and multiple-array tracks as the combination grows — more AMs, more arrays, and both front-ends (MaskNet alone vs. MaskNet + CGMM)]
(*) Results without RNN-LM.
Slide 29: Thank you
More informationIndex. Umberto Michelucci 2018 U. Michelucci, Applied Deep Learning,
A Acquisition function, 298, 301 Adam optimizer, 175 178 Anaconda navigator conda command, 3 Create button, 5 download and install, 1 installing packages, 8 Jupyter Notebook, 11 13 left navigation pane,
More informationLecture 7: Semantic Segmentation
Semantic Segmentation CSED703R: Deep Learning for Visual Recognition (207F) Segmenting images based on its semantic notion Lecture 7: Semantic Segmentation Bohyung Han Computer Vision Lab. bhhanpostech.ac.kr
More informationLearning The Lexicon!
Learning The Lexicon! A Pronunciation Mixture Model! Ian McGraw! (imcgraw@mit.edu)! Ibrahim Badr Jim Glass! Computer Science and Artificial Intelligence Lab! Massachusetts Institute of Technology! Cambridge,
More informationSequence Prediction with Neural Segmental Models. Hao Tang
Sequence Prediction with Neural Segmental Models Hao Tang haotang@ttic.edu About Me Pronunciation modeling [TKL 2012] Segmental models [TGL 2014] [TWGL 2015] [TWGL 2016] [TWGL 2016] American Sign Language
More informationDIRECT MODELING OF RAW AUDIO WITH DNNS FOR WAKE WORD DETECTION
DIRECT MODELING OF RAW AUDIO WITH DNNS FOR WAKE WORD DETECTION Kenichi Kumatani, Sankaran Panchapagesan, Minhua Wu, Minjae Kim, Nikko Ström, Gautam Tiwari, Arindam Mandal Amazon Inc. ABSTRACT In this work,
More information3D Convolutional Neural Networks for Landing Zone Detection from LiDAR
3D Convolutional Neural Networks for Landing Zone Detection from LiDAR Daniel Mataruna and Sebastian Scherer Presented by: Sabin Kafle Outline Introduction Preliminaries Approach Volumetric Density Mapping
More informationComparative Evaluation of Feature Normalization Techniques for Speaker Verification
Comparative Evaluation of Feature Normalization Techniques for Speaker Verification Md Jahangir Alam 1,2, Pierre Ouellet 1, Patrick Kenny 1, Douglas O Shaughnessy 2, 1 CRIM, Montreal, Canada {Janagir.Alam,
More informationLMS Sound Camera and Sound Source Localization Fast & Versatile Sound Source Localization
LMS Sound Camera and Sound Source Localization Fast & Versatile Sound Source Localization Realize innovation. Sound Source Localization What is it about?? Sounds like it s coming from the engine Sound
More informationMachine Learning for Speaker Recogni2on and Bioinforma2cs
Machine Learning for Speaker Recogni2on and Bioinforma2cs Man-Wai MAK Dept of Electronic and Informa8on Engineering, The Hong Kong Polytechnic University http://wwweiepolyueduhk/~mwmak/ UTS/PolyU Workshop
More informationTIME-DELAYED BOTTLENECK HIGHWAY NETWORKS USING A DFT FEATURE FOR KEYWORD SPOTTING
TIME-DELAYED BOTTLENECK HIGHWAY NETWORKS USING A DFT FEATURE FOR KEYWORD SPOTTING Jinxi Guo, Kenichi Kumatani, Ming Sun, Minhua Wu, Anirudh Raju, Nikko Ström, Arindam Mandal lennyguo@g.ucla.edu, {kumatani,
More informationImproving Face Recognition by Exploring Local Features with Visual Attention
Improving Face Recognition by Exploring Local Features with Visual Attention Yichun Shi and Anil K. Jain Michigan State University Difficulties of Face Recognition Large variations in unconstrained face
More informationChannel Locality Block: A Variant of Squeeze-and-Excitation
Channel Locality Block: A Variant of Squeeze-and-Excitation 1 st Huayu Li Northern Arizona University Flagstaff, United State Northern Arizona University hl459@nau.edu arxiv:1901.01493v1 [cs.lg] 6 Jan
More informationarxiv: v3 [cs.sd] 1 Nov 2018
AN IMPROVED HYBRID CTC-ATTENTION MODEL FOR SPEECH RECOGNITION Zhe Yuan, Zhuoran Lyu, Jiwei Li and Xi Zhou Cloudwalk Technology Inc, Shanghai, China arxiv:1810.12020v3 [cs.sd] 1 Nov 2018 ABSTRACT Recently,
More informationReverberant Speech Recognition Based on Denoising Autoencoder
INTERSPEECH 2013 Reverberant Speech Recognition Based on Denoising Autoencoder Takaaki Ishii 1, Hiroki Komiyama 1, Takahiro Shinozaki 2, Yasuo Horiuchi 1, Shingo Kuroiwa 1 1 Division of Information Sciences,
More informationAutoencoder. Representation learning (related to dictionary learning) Both the input and the output are x
Deep Learning 4 Autoencoder, Attention (spatial transformer), Multi-modal learning, Neural Turing Machine, Memory Networks, Generative Adversarial Net Jian Li IIIS, Tsinghua Autoencoder Autoencoder Unsupervised
More informationAutomatic Speech Recognition (ASR)
Automatic Speech Recognition (ASR) February 2018 Reza Yazdani Aminabadi Universitat Politecnica de Catalunya (UPC) State-of-the-art State-of-the-art ASR system: DNN+HMM Speech (words) Sound Signal Graph
More informationAUDIO SIGNAL PROCESSING FOR NEXT- GENERATION MULTIMEDIA COMMUNI CATION SYSTEMS
AUDIO SIGNAL PROCESSING FOR NEXT- GENERATION MULTIMEDIA COMMUNI CATION SYSTEMS Edited by YITENG (ARDEN) HUANG Bell Laboratories, Lucent Technologies JACOB BENESTY Universite du Quebec, INRS-EMT Kluwer
More informationAdaptive Dropout Training for SVMs
Department of Computer Science and Technology Adaptive Dropout Training for SVMs Jun Zhu Joint with Ning Chen, Jingwei Zhuo, Jianfei Chen, Bo Zhang Tsinghua University ShanghaiTech Symposium on Data Science,
More informationMultilayer and Multimodal Fusion of Deep Neural Networks for Video Classification
Multilayer and Multimodal Fusion of Deep Neural Networks for Video Classification Xiaodong Yang, Pavlo Molchanov, Jan Kautz INTELLIGENT VIDEO ANALYTICS Surveillance event detection Human-computer interaction
More informationRecurrent Neural Networks
Recurrent Neural Networks 11-785 / Fall 2018 / Recitation 7 Raphaël Olivier Recap : RNNs are magic They have infinite memory They handle all kinds of series They re the basis of recent NLP : Translation,
More informationDeep Learning Applications
October 20, 2017 Overview Supervised Learning Feedforward neural network Convolution neural network Recurrent neural network Recursive neural network (Recursive neural tensor network) Unsupervised Learning
More information