The Hitachi/JHU CHiME-5 system: Advances in speech recognition for everyday home environments using multiple microphone arrays

1 CHiME2018 workshop
The Hitachi/JHU CHiME-5 system: Advances in speech recognition for everyday home environments using multiple microphone arrays
Naoyuki Kanda 1, Rintaro Ikeshita 1, Shota Horiguchi 1, Yusuke Fujita 1, Kenji Nagamatsu 1, Xiaofei Wang 2, Vimal Manohar 2, Nelson Enrique Yalta Soplin 2, Matthew Maciejewski 2, Szu-Jui Chen 2, Aswin Shanmugam Subramanian 2, Ruizhi Li 2, Zhiqi Wang 2, Jason Naradowsky 2, L. Paola Garcia-Perera 2, Gregory Sell 2

2 Step-by-Step Improvements for Dev
[Figure: word error rate (%) after each improvement stage (AM, Frontend, Decoding, LM), plotted for the multiple-array and single-array tracks.]

3 Acoustic Model Training Pipeline
Step 1. GMM-AM
- Data: 1ch worn L, 1ch worn R, 1ch worn L+R
- Training stages: (mono), (tri), (LDA-MLLT), (SAT)
Step 2. Alignment
- Alignment & Cleanup on 1ch worn L+R -> cleaned 1ch worn L+R & phone-state alignment
- Alignment Expansion to the full set -> cleaned full set & phone-state alignment
Step 3. 1ch AM
- Input: cleaned full set & phone-state alignment
- ivector Training, Acoustic Feature Extraction, ivector Extraction
- LF-MMI AM Training -> LF-MMI AM (1ch)
Step 4. 4ch AM
- LF-MMI AM (1ch) -> add 4ch input branch -> LF-MMI AM (4ch)
- Inputs: Array 1 (4ch) to Array 6 (4ch), used as CH1 and as CH1, 2, 3, 4; Weighted ivector Extraction; Acoustic Feature Extraction
- LF-MMI AM Training -> LF-MMI AM (4ch)
- LF-sMBR AM Training [1] -> LF-sMBR AM (4ch)
[1] Naoyuki Kanda, Yusuke Fujita, Kenji Nagamatsu, "Lattice-free state-level minimum Bayes risk training of acoustic models," Interspeech

4 Acoustic Model Training Pipeline
(Same pipeline as slide 3, with one added callout in Step 2.)
Data augmentation: 40h -> 4,500h

5 Training Data for 1ch AM
[Table: Effect of data augmentation with baseline AM. Columns: Data (Worn; Array (Raw, CH1); Array (BeamFormIt)), Data Augmentation (Speed & Volume; Reverb. & Noise (*); Bandpass), Training Epoch, and WER on Worn-Dev and Ref-Array-Dev. Worn data is listed as L, R, L+R in every row. Cell values are missing in this transcription.]
(*) Reverb. & noise perturbation was applied only to worn microphone data.
- Speed: 0.9, 1.0, 1.1
- Volume: [range missing in this transcription]
- Reverberation: impulse responses of simulated rooms were generated by the image method, following the settings for {small, medium}-size rooms in [1].
- Noise: non-speech regions of the array data were added at an SNR of {20, 15, 10, 5, 0}.
- Bandpass: a randomly selected frequency band was cut off (leaving at least a 1,000 Hz band within the range below 4,000 Hz).
[1] T. Ko, et al.: A study on data augmentation of reverberant speech for robust speech recognition, Proc. ICASSP
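The noise perturbation on this slide can be illustrated with a minimal sketch. It assumes the listed SNR values are in dB and that the noise is scaled before being added to the speech; the function names `mix_at_snr` and `measured_snr_db`, the synthetic sine "speech", and the Gaussian "noise" are all hypothetical stand-ins, not the authors' implementation.

```python
import math
import random


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`, then add it."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    if p_noise == 0.0:
        raise ValueError("noise is silent")
    # Solve p_speech / (gain^2 * p_noise) = 10^(snr_db / 10) for gain.
    gain = math.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(speech, noise)]


def measured_snr_db(speech, mixture):
    """SNR of `mixture` relative to the known clean `speech` (used as a check)."""
    residual = [m - s for s, m in zip(speech, mixture)]
    p_speech = sum(s * s for s in speech)
    p_residual = sum(r * r for r in residual)
    return 10.0 * math.log10(p_speech / p_residual)


rng = random.Random(0)
# Synthetic stand-ins: one second of a 440 Hz tone and white noise at 16 kHz.
speech = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]

SNR_CHOICES_DB = [20, 15, 10, 5, 0]  # SNR set from the slide, assumed to be dB
snr_db = rng.choice(SNR_CHOICES_DB)
mixture = mix_at_snr(speech, noise, snr_db)
```

In a real pipeline the noise would be a non-speech segment cut from the array recordings, as the slide describes, rather than synthetic noise.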

6 Acoustic Model Training Pipeline
(Same pipeline as slide 3, with Step 4 labelled "4ch AM".)

7 4ch CNN-TDNN-RBiLSTM
[Figure: network architecture. Recoverable labels:]
- Inputs: MFCC (1ch), ivector, FBANK (1ch), log-spec (4ch), phase (4ch)
- Layers: LDA; 3x3, 128 Conv; 3x3, 256 Conv
- Two 512 BiLSTM layers, each with 256 nodes per direction (F-LSTM, B-LSTM)
- Three 1024 RBiLSTM layers, each with 512 nodes per direction (F-LSTM, B-LSTM)
- 4ch-branch: joined to the main network by concat, with ReLU
- Residual BiLSTM (RBiLSTM): F-LSTM and B-LSTM outputs are concatenated

8 4ch CNN-TDNN-RBiLSTM
(Same architecture figure as slide 7.)
Training stage (1): LF-MMI update

9 4ch CNN-TDNN-RBiLSTM
(Same architecture figure as slide 7.)
Training stage (2): LF-MMI update, 4ch-branch

10 4ch CNN-TDNN-RBiLSTM
(Same architecture figure as slide 7.)
Training stage (3): LF-sMBR update

12 Complex Gaussian Mixture Model
[Figure: example regions labelled target speech, overlapped speech, and target speech with strong noise.]
3-class mixture: target, non-target, and noise

$$
y_f(t) \sim \alpha^{\mathrm{tgt}}\,\mathcal{N}_C\!\left(0,\; v_f^{\mathrm{tgt}}(t)\,R_f^{\mathrm{tgt}}\right)
+ \alpha^{\mathrm{nontgt}}\,\mathcal{N}_C\!\left(0,\; v_f^{\mathrm{nontgt}}(t)\,R_f^{\mathrm{nontgt}}\right)
+ \alpha^{\mathrm{noise}}\,\mathcal{N}_C\!\left(0,\; v_f^{\mathrm{noise}}(t)\,R_f^{\mathrm{noise}}\right)
$$

Mask estimation using the EM algorithm, followed by an MVDR-based beamformer:
- Spatial correlation matrices R_f: R_f^tgt (target speaker + noise), R_f^nontgt (non-target speaker + noise), R_f^noise (noise)
- EM algorithm outputs: target speech mask, noise mask
- The masks feed the MVDR-based beamformer
Higuchi, Takuya, et al. "Online MVDR beamformer based on complex Gaussian mixture model with spatial prior for noise robust ASR." IEEE/ACM Transactions on Audio, Speech, and Language Processing 25.4 (2017)
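The slide gives the mixture model but not the formulas that turn masks into a beamformer. One commonly used formulation is sketched here as an assumption, not as the authors' exact recipe: the posterior of class $k$ under the mixture serves as the mask $\lambda_f^{k}(t)$, mask-weighted outer products give spatial covariance estimates $\Phi_f^{k}$, and an MVDR filter $w_f$ is built from target and interference statistics. The symbols $\lambda$, $\Phi$, $w_f$, the reference-microphone selector $u$, and the interference covariance $\Phi_f^{\mathrm{int}}$ are introduced for illustration only.

$$
\lambda_f^{k}(t) = \frac{\alpha^{k}\,\mathcal{N}_C\!\left(y_f(t);\,0,\;v_f^{k}(t)\,R_f^{k}\right)}{\sum_{k'} \alpha^{k'}\,\mathcal{N}_C\!\left(y_f(t);\,0,\;v_f^{k'}(t)\,R_f^{k'}\right)},
\qquad k \in \{\mathrm{tgt}, \mathrm{nontgt}, \mathrm{noise}\}
$$

$$
\Phi_f^{k} = \frac{1}{\sum_t \lambda_f^{k}(t)} \sum_t \lambda_f^{k}(t)\, y_f(t)\, y_f(t)^{\mathsf{H}}
$$

$$
w_f = \frac{\left(\Phi_f^{\mathrm{int}}\right)^{-1} \Phi_f^{\mathrm{tgt}}}{\operatorname{tr}\!\left(\left(\Phi_f^{\mathrm{int}}\right)^{-1} \Phi_f^{\mathrm{tgt}}\right)}\, u,
\qquad
\hat{s}_f(t) = w_f^{\mathsf{H}}\, y_f(t)
$$

Here $\Phi_f^{\mathrm{int}}$ stands for whatever statistic represents everything to be suppressed, for example one accumulated from the non-target and noise masks; that choice is not stated on the slide.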

13 Mask Estimation Neural Network
1. Train the mask estimation (ME) network [1][2] using mixtures of speech (worn, non-speaker-overlapped regions) and noise (array non-speech regions) from the training set.
   Network: abs(stft) -> BLSTM -> FC, ReLU -> FC, ReLU -> two FC, Sigmoid outputs (X_mask, N_mask); layer size 513; BCE loss against X_reference
2. Speaker adaptation (*): the same network is adapted with target-spk. speech and non-target-spk. speech, using a gate:
    def gate(x):
        if input.speaker == target_speaker:
            return x
        else:
            return 0
(*) We used only non-overlapped regions for adaptation.
[1] J. Heymann, L. Drude, and R. Haeb-Umbach, "Neural network based spectral mask estimation for acoustic beamforming," in Proc. ICASSP, 2016
[2] H. Erdogan, J. R. Hershey, S. Watanabe, M. I. Mandel, and J. Le Roux, "Improved MVDR beamforming using single-channel mask prediction networks," in Proc. Interspeech, 2016

14 Mask Estimation Neural Network
3. Mask inference
The target speaker's mask is selected only if the target speaker's output value is higher than the values of all non-target speakers.
Example: P01 (target) and P02 (non-target)
- abs(stft) -> ME net P01 -> X_mask_P01
- abs(stft) -> ME net P02 -> X_mask_P02
- gate: X_mask_tgt = X_mask_P01 if X_mask_P01 > X_mask_P02, else 0
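The selection rule on this slide can be written as a short sketch. It assumes the comparison is made independently for each mask value (for example, each time-frequency bin) and that ties count as "not higher"; the function name `select_target_mask` and the example mask values are hypothetical.

```python
def select_target_mask(target_mask, non_target_masks):
    """Keep a target mask value only where it exceeds every non-target value."""
    selected = []
    for i, value in enumerate(target_mask):
        rivals = [mask[i] for mask in non_target_masks]
        # Strict comparison: a tie with any non-target speaker zeroes the value.
        selected.append(value if all(value > r for r in rivals) else 0.0)
    return selected


# Hypothetical outputs of the speaker-adapted ME nets for four bins.
x_mask_p01 = [0.9, 0.2, 0.6, 0.5]  # P01 is the target speaker
x_mask_p02 = [0.1, 0.7, 0.6, 0.3]  # P02 is a non-target speaker

x_mask_tgt = select_target_mask(x_mask_p01, [x_mask_p02])
print(x_mask_tgt)  # [0.9, 0.0, 0.0, 0.5]
```

With more than two speakers, every other speaker's mask would be passed in `non_target_masks`, matching the "higher than all non-target values" condition.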

16 Language Modeling
Recurrent neural network based word-LM
- 2-layer LSTM with 512 nodes, 50% dropout
- 512-dim embeddings
- PyTorch implementation
- Combination weights, Official-LM : forward-rnn-lm : backward-rnn-lm = 0.5 : 0.25 : 0.25
[Table: WER (%) for Dev set (*), without RNN-LM vs. with RNN-LM and % improvement, for the single-array and multiple-array tracks. Cell values are missing in this transcription.]
(*) Results with model combination and hypothesis deduplication
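The 0.5 : 0.25 : 0.25 weighting can be illustrated with a small sketch. The slide does not say in which domain the three models are combined, so this assumes linear interpolation of probabilities; the function name `interpolate_log_prob`, the dictionary keys, and the example probabilities are hypothetical.

```python
import math

# Weights from the slide: Official-LM : forward-rnn-lm : backward-rnn-lm.
WEIGHTS = {"official": 0.5, "forward_rnn": 0.25, "backward_rnn": 0.25}


def interpolate_log_prob(log_probs, weights=WEIGHTS):
    """Linearly interpolate probabilities given as logs; return a log-probability."""
    if abs(sum(weights.values()) - 1.0) > 1e-12:
        raise ValueError("weights must sum to 1")
    prob = sum(weights[name] * math.exp(log_probs[name]) for name in weights)
    return math.log(prob)


# Hypothetical per-model probabilities for one word (or one hypothesis).
scores = {
    "official": math.log(0.02),
    "forward_rnn": math.log(0.04),
    "backward_rnn": math.log(0.08),
}
combined = interpolate_log_prob(scores)
# 0.5 * 0.02 + 0.25 * 0.04 + 0.25 * 0.08 = 0.04
```

A log-linear combination (a weighted sum of log scores) would be an equally plausible reading of the ratio; the sketch would then drop the `exp`/`log` round trip.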

18 Decoding
Hypotheses combination by N-best ROVER
- 6 AMs := CNN-TDNN-{LSTM, BiLSTM, RBiLSTM} x {3500, 7000} senones
- 2 Front-ends := Mask Network, CGMM
- 6 Arrays
We did not select an array; instead, we combined the hypotheses from every array.
[Figure: each (Array, Front-end, AM) combination produces one hypothesis, from Hypothesis_1,1,1 (Array1, Front-end1, AM1) to Hypothesis_6,2,6 (Array6, Front-end2, AM6); all hypotheses feed N-best ROVER, which outputs the result.]
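The set of hypotheses entering ROVER on this slide is the cross product of arrays, front-ends, and acoustic models. The sketch below only enumerates those combinations; it does not implement ROVER itself. The string labels are illustrative, with "CGMM" being a reading of the truncated front-end name on the slide.

```python
from itertools import product

ARCHS = ["CNN-TDNN-LSTM", "CNN-TDNN-BiLSTM", "CNN-TDNN-RBiLSTM"]
SENONES = [3500, 7000]
FRONT_ENDS = ["MaskNet", "CGMM"]  # illustrative labels for the two front-ends
ARRAYS = [1, 2, 3, 4, 5, 6]

# 3 architectures x 2 senone counts = 6 acoustic models.
acoustic_models = list(product(ARCHS, SENONES))

# One hypothesis per (array, front-end, AM), mirroring Hypothesis_a,f,m on the slide.
systems = list(product(ARRAYS, FRONT_ENDS, acoustic_models))

print(len(acoustic_models))  # 6
print(len(systems))          # 72
```

Restricting `ARRAYS` to a single entry leaves 12 combinations, which corresponds to using one array only.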

20 Hypothesis Deduplication (HD)
The same words were sometimes recognized for overlapped utterances.
Example: P05 um yeah P08 can i help with um yeah that looks good
- Duplicated words with lower confidence were excluded from the hypothesis.
- HD was applied after ROVER, so precise time boundaries could not be used.
- Minimum edit distance-based word alignment was used to detect word duplication.
[Table: WER (%) for Dev set (*), without HD vs. with HD and % improvement, for the single-array and multiple-array tracks. Cell values are missing in this transcription.]
(*) Results with RNN-LM
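A minimal sketch of the deduplication idea follows, under stated assumptions: two hypotheses from overlapping utterances are aligned with a standard Levenshtein (minimum edit distance) alignment, and for every pair of identical aligned words the copy with lower confidence is dropped. The function names, the tie-breaking rule, the split of the example words between P05 and P08, and the confidence values are all hypothetical; the authors' actual procedure may differ.

```python
def align_matches(a, b):
    """Return index pairs (i, j) with a[i] == b[j] on a minimum edit distance path."""
    n, m = len(a), len(b)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dist[i - 1][j - 1] + (a[i - 1] != b[j - 1])
            dist[i][j] = min(sub, dist[i - 1][j] + 1, dist[i][j - 1] + 1)
    # Backtrace from the bottom-right corner, collecting exact word matches.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1] and dist[i][j] == dist[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif dist[i][j] == dist[i - 1][j - 1] + 1:
            i, j = i - 1, j - 1  # substitution
        elif dist[i][j] == dist[i - 1][j] + 1:
            i -= 1               # word only in a
        else:
            j -= 1               # word only in b
    return pairs[::-1]


def deduplicate(hyp_a, hyp_b):
    """Hypotheses are lists of (word, confidence); drop the lower-confidence duplicate."""
    words_a = [w for w, _ in hyp_a]
    words_b = [w for w, _ in hyp_b]
    drop_a, drop_b = set(), set()
    for i, j in align_matches(words_a, words_b):
        if hyp_a[i][1] < hyp_b[j][1]:
            drop_a.add(i)
        else:
            drop_b.add(j)  # ties drop the copy in hyp_b (arbitrary choice)
    kept_a = [w for k, w in enumerate(words_a) if k not in drop_a]
    kept_b = [w for k, w in enumerate(words_b) if k not in drop_b]
    return kept_a, kept_b


# Words from the slide's example; the confidences are made up.
p05 = [("um", 0.9), ("yeah", 0.8)]
p08 = [("can", 0.9), ("i", 0.9), ("help", 0.8), ("with", 0.7),
       ("um", 0.3), ("yeah", 0.4), ("that", 0.9), ("looks", 0.9), ("good", 0.9)]
kept_p05, kept_p08 = deduplicate(p05, p08)
```

Because no time boundaries are used, such an alignment would presumably be applied only between utterances known to overlap in time; that restriction is left out of the sketch.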

21 Final results & Conclusion

22 Final Results
WER (%) without RNN-LM / with RNN-LM
[Table: rows for the Single-array and Multiple-array tracks, each with Dev sessions (S02 and one more) and Eval sessions (S01 and one more); columns Kitchen, Dining, Living, Overall. Cell values are missing in this transcription. One entry is marked "Our best eval result".]

23 Final Results
WER (%) without RNN-LM / with RNN-LM
[Table: same layout as slide 22 (Track, Session, Kitchen, Dining, Living, Overall); cell values are missing in this transcription.]
Array combination by ROVER worked well for the dev set but was not effective for the eval set. Why?
- Different types of rooms?
- Speaker-array distance?
In any case, better array combination methods should be pursued.

24 Conclusion
Our contributions
- Multiple data augmentation
- 4-ch AM with Residual BiLSTM
- Speaker adaptive mask estimation network / CGMM-based beamformer
- Hypothesis Deduplication
- Array combination by ROVER (found not effective for the evaluation set)
Our results
- 48.2% WER for the evaluation set
- Ranked 2nd, with only a 2.1-point difference from the best result
Thank you for your attention!

25 Appendix

26 Comparison of AM Architectures
(WER values are missing in this transcription; only the "n/a" entries survive.)

Model              Input  Training     Worn-Dev  Ref-Array-Dev
Baseline           1ch    LF-MMI
CNN-TDNN-LSTM      1ch    LF-MMI
CNN-TDNN-BiLSTM    1ch    LF-MMI
CNN-TDNN-RBiLSTM   1ch    LF-MMI
CNN-TDNN-RBiLSTM   4ch    LF-MMI       n/a
CNN-TDNN-RBiLSTM   4ch    LF-sMBR [1]  n/a

[1] Naoyuki Kanda, Yusuke Fujita, Kenji Nagamatsu, "Lattice-free state-level minimum Bayes risk training of acoustic models," Interspeech

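27 Comparison of Frontend Processing
[Table: columns are Front-end for 1ch input, Front-end for 4ch input, and Dev WER. The row layout and WER values are not recoverable from this transcription. Entries as transcribed, in order: BeamFormIt (= Baseline); Raw; Raw; Raw; WPE; WPE; CGMM-MVDR; WPE; Speaker adaptive mask NN-MVDR; WPE.]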

28 Decoding
Hypotheses combination by N-best ROVER
- 6 AMs := CNN-TDNN-{LSTM, BiLSTM, RBiLSTM} x {3500, 7000} senones
- 2 Front-ends := Mask Network, CGMM
- 6 Arrays
We did not select an array. Instead, we combined the hypotheses from every array.
[Figure: hypotheses from Hypothesis_1,1,1 (Array1, Front-end1, AM1) to Hypothesis_6,2,6 (Array6, Front-end2, AM6) feed N-best ROVER, which outputs the result.]
[Table: WER (%) for Dev set (*). Columns: AM, Array, Frontend, Dev Single-array, Dev Multiple-array. Entries as transcribed: AM 1, Array 1; Frontend values MaskNet, MaskNet, "MaskNet, CGMM", "MaskNet, CGMM". WER values are missing in this transcription.]
(*) Results w/o RNN-LM

29 Thank you


More information

Robust ASR using neural network based speech enhancement and feature simulation

Robust ASR using neural network based speech enhancement and feature simulation Robust ASR using neural network based speech enhancement and feature simulation Sunit Sivasankaran, Aditya Arie Nugraha, Emmanuel Vincent, Juan Andrés Morales Cordovilla, Siddharth Dalmia, Irina Illina,

More information

System Identification Related Problems at

System Identification Related Problems at media Technologies @ Ericsson research (New organization Taking Form) System Identification Related Problems at MT@ER Erlendur Karlsson, PhD 1 Outline Ericsson Publications and Blogs System Identification

More information

System Identification Related Problems at SMN

System Identification Related Problems at SMN Ericsson research SeRvices, MulTimedia and Network Features System Identification Related Problems at SMN Erlendur Karlsson SysId Related Problems @ ER/SMN Ericsson External 2016-05-09 Page 1 Outline Research

More information

Combining Mask Estimates for Single Channel Audio Source Separation using Deep Neural Networks

Combining Mask Estimates for Single Channel Audio Source Separation using Deep Neural Networks Combining Mask Estimates for Single Channel Audio Source Separation using Deep Neural Networks Emad M. Grais, Gerard Roma, Andrew J.R. Simpson, Mark D. Plumbley Centre for Vision, Speech and Signal Processing,

More information

ANALYSING REPLAY SPOOFING COUNTERMEASURE PERFORMANCE UNDER VARIED CONDITIONS

ANALYSING REPLAY SPOOFING COUNTERMEASURE PERFORMANCE UNDER VARIED CONDITIONS 2018 IEEE INTERNATIONAL WORKSHOP ON MACHINE LEARNING FOR SIGNAL PROCESSING, SEPT. 17 20, 2018, AALBORG, DENMARK ANALYSING REPLAY SPOOFING COUNTERMEASURE PERFORMANCE UNDER VARIED CONDITIONS Bhusan Chettri

More information

Separating Speech From Noise Challenge

Separating Speech From Noise Challenge Separating Speech From Noise Challenge We have used the data from the PASCAL CHiME challenge with the goal of training a Support Vector Machine (SVM) to estimate a noise mask that labels time-frames/frequency-bins

More information

Pixel-level Generative Model

Pixel-level Generative Model Pixel-level Generative Model Generative Image Modeling Using Spatial LSTMs (2015NIPS) L. Theis and M. Bethge University of Tübingen, Germany Pixel Recurrent Neural Networks (2016ICML) A. van den Oord,

More information

Machine Learning 13. week

Machine Learning 13. week Machine Learning 13. week Deep Learning Convolutional Neural Network Recurrent Neural Network 1 Why Deep Learning is so Popular? 1. Increase in the amount of data Thanks to the Internet, huge amount of

More information

Confidence Measures: how much we can trust our speech recognizers

Confidence Measures: how much we can trust our speech recognizers Confidence Measures: how much we can trust our speech recognizers Prof. Hui Jiang Department of Computer Science York University, Toronto, Ontario, Canada Email: hj@cs.yorku.ca Outline Speech recognition

More information

Recurrent Neural Networks and Transfer Learning for Action Recognition

Recurrent Neural Networks and Transfer Learning for Action Recognition Recurrent Neural Networks and Transfer Learning for Action Recognition Andrew Giel Stanford University agiel@stanford.edu Ryan Diaz Stanford University ryandiaz@stanford.edu Abstract We have taken on the

More information

LSTM for Language Translation and Image Captioning. Tel Aviv University Deep Learning Seminar Oran Gafni & Noa Yedidia

LSTM for Language Translation and Image Captioning. Tel Aviv University Deep Learning Seminar Oran Gafni & Noa Yedidia 1 LSTM for Language Translation and Image Captioning Tel Aviv University Deep Learning Seminar Oran Gafni & Noa Yedidia 2 Part I LSTM for Language Translation Motivation Background (RNNs, LSTMs) Model

More information

arxiv: v1 [cs.lg] 5 Nov 2018

arxiv: v1 [cs.lg] 5 Nov 2018 UNSUPERVISED DEEP CLUSTERING FOR SOURCE SEPARATION: DIRECT LEARNING FROM MIXTURES USING SPATIAL INFORMATION Efthymios Tzinis Shrikant Venkataramani Paris Smaragdis University of Illinois at Urbana-Champaign,

More information

Recurrent Neural Nets II

Recurrent Neural Nets II Recurrent Neural Nets II Steven Spielberg Pon Kumar, Tingke (Kevin) Shen Machine Learning Reading Group, Fall 2016 9 November, 2016 Outline 1 Introduction 2 Problem Formulations with RNNs 3 LSTM for Optimization

More information

BUILDING CORPORA OF TRANSCRIBED SPEECH FROM OPEN ACCESS SOURCES

BUILDING CORPORA OF TRANSCRIBED SPEECH FROM OPEN ACCESS SOURCES BUILDING CORPORA OF TRANSCRIBED SPEECH FROM OPEN ACCESS SOURCES O.O. Iakushkin a, G.A. Fedoseev, A.S. Shaleva, O.S. Sedova Saint Petersburg State University, 7/9 Universitetskaya nab., St. Petersburg,

More information

System Identification Related Problems at SMN

System Identification Related Problems at SMN Ericsson research SeRvices, MulTimedia and Networks System Identification Related Problems at SMN Erlendur Karlsson SysId Related Problems @ ER/SMN Ericsson External 2015-04-28 Page 1 Outline Research

More information

DEEP CLUSTERING WITH GATED CONVOLUTIONAL NETWORKS

DEEP CLUSTERING WITH GATED CONVOLUTIONAL NETWORKS DEEP CLUSTERING WITH GATED CONVOLUTIONAL NETWORKS Li Li 1,2, Hirokazu Kameoka 1 1 NTT Communication Science Laboratories, NTT Corporation, Japan 2 University of Tsukuba, Japan lili@mmlab.cs.tsukuba.ac.jp,

More information

Manifold Constrained Deep Neural Networks for ASR

Manifold Constrained Deep Neural Networks for ASR 1 Manifold Constrained Deep Neural Networks for ASR Department of Electrical and Computer Engineering, McGill University Richard Rose and Vikrant Tomar Motivation Speech features can be characterized as

More information

Index. Umberto Michelucci 2018 U. Michelucci, Applied Deep Learning,

Index. Umberto Michelucci 2018 U. Michelucci, Applied Deep Learning, A Acquisition function, 298, 301 Adam optimizer, 175 178 Anaconda navigator conda command, 3 Create button, 5 download and install, 1 installing packages, 8 Jupyter Notebook, 11 13 left navigation pane,

More information

Lecture 7: Semantic Segmentation

Lecture 7: Semantic Segmentation Semantic Segmentation CSED703R: Deep Learning for Visual Recognition (207F) Segmenting images based on its semantic notion Lecture 7: Semantic Segmentation Bohyung Han Computer Vision Lab. bhhanpostech.ac.kr

More information

Learning The Lexicon!

Learning The Lexicon! Learning The Lexicon! A Pronunciation Mixture Model! Ian McGraw! (imcgraw@mit.edu)! Ibrahim Badr Jim Glass! Computer Science and Artificial Intelligence Lab! Massachusetts Institute of Technology! Cambridge,

More information

Sequence Prediction with Neural Segmental Models. Hao Tang

Sequence Prediction with Neural Segmental Models. Hao Tang Sequence Prediction with Neural Segmental Models Hao Tang haotang@ttic.edu About Me Pronunciation modeling [TKL 2012] Segmental models [TGL 2014] [TWGL 2015] [TWGL 2016] [TWGL 2016] American Sign Language

More information

DIRECT MODELING OF RAW AUDIO WITH DNNS FOR WAKE WORD DETECTION

DIRECT MODELING OF RAW AUDIO WITH DNNS FOR WAKE WORD DETECTION DIRECT MODELING OF RAW AUDIO WITH DNNS FOR WAKE WORD DETECTION Kenichi Kumatani, Sankaran Panchapagesan, Minhua Wu, Minjae Kim, Nikko Ström, Gautam Tiwari, Arindam Mandal Amazon Inc. ABSTRACT In this work,

More information

3D Convolutional Neural Networks for Landing Zone Detection from LiDAR

3D Convolutional Neural Networks for Landing Zone Detection from LiDAR 3D Convolutional Neural Networks for Landing Zone Detection from LiDAR Daniel Mataruna and Sebastian Scherer Presented by: Sabin Kafle Outline Introduction Preliminaries Approach Volumetric Density Mapping

More information

Comparative Evaluation of Feature Normalization Techniques for Speaker Verification

Comparative Evaluation of Feature Normalization Techniques for Speaker Verification Comparative Evaluation of Feature Normalization Techniques for Speaker Verification Md Jahangir Alam 1,2, Pierre Ouellet 1, Patrick Kenny 1, Douglas O Shaughnessy 2, 1 CRIM, Montreal, Canada {Janagir.Alam,

More information

LMS Sound Camera and Sound Source Localization Fast & Versatile Sound Source Localization

LMS Sound Camera and Sound Source Localization Fast & Versatile Sound Source Localization LMS Sound Camera and Sound Source Localization Fast & Versatile Sound Source Localization Realize innovation. Sound Source Localization What is it about?? Sounds like it s coming from the engine Sound

More information

Machine Learning for Speaker Recogni2on and Bioinforma2cs

Machine Learning for Speaker Recogni2on and Bioinforma2cs Machine Learning for Speaker Recogni2on and Bioinforma2cs Man-Wai MAK Dept of Electronic and Informa8on Engineering, The Hong Kong Polytechnic University http://wwweiepolyueduhk/~mwmak/ UTS/PolyU Workshop

More information

TIME-DELAYED BOTTLENECK HIGHWAY NETWORKS USING A DFT FEATURE FOR KEYWORD SPOTTING

TIME-DELAYED BOTTLENECK HIGHWAY NETWORKS USING A DFT FEATURE FOR KEYWORD SPOTTING TIME-DELAYED BOTTLENECK HIGHWAY NETWORKS USING A DFT FEATURE FOR KEYWORD SPOTTING Jinxi Guo, Kenichi Kumatani, Ming Sun, Minhua Wu, Anirudh Raju, Nikko Ström, Arindam Mandal lennyguo@g.ucla.edu, {kumatani,

More information

Improving Face Recognition by Exploring Local Features with Visual Attention

Improving Face Recognition by Exploring Local Features with Visual Attention Improving Face Recognition by Exploring Local Features with Visual Attention Yichun Shi and Anil K. Jain Michigan State University Difficulties of Face Recognition Large variations in unconstrained face

More information

Channel Locality Block: A Variant of Squeeze-and-Excitation

Channel Locality Block: A Variant of Squeeze-and-Excitation Channel Locality Block: A Variant of Squeeze-and-Excitation 1 st Huayu Li Northern Arizona University Flagstaff, United State Northern Arizona University hl459@nau.edu arxiv:1901.01493v1 [cs.lg] 6 Jan

More information

arxiv: v3 [cs.sd] 1 Nov 2018

arxiv: v3 [cs.sd] 1 Nov 2018 AN IMPROVED HYBRID CTC-ATTENTION MODEL FOR SPEECH RECOGNITION Zhe Yuan, Zhuoran Lyu, Jiwei Li and Xi Zhou Cloudwalk Technology Inc, Shanghai, China arxiv:1810.12020v3 [cs.sd] 1 Nov 2018 ABSTRACT Recently,

More information

Reverberant Speech Recognition Based on Denoising Autoencoder

Reverberant Speech Recognition Based on Denoising Autoencoder INTERSPEECH 2013 Reverberant Speech Recognition Based on Denoising Autoencoder Takaaki Ishii 1, Hiroki Komiyama 1, Takahiro Shinozaki 2, Yasuo Horiuchi 1, Shingo Kuroiwa 1 1 Division of Information Sciences,

More information

Autoencoder. Representation learning (related to dictionary learning) Both the input and the output are x

Autoencoder. Representation learning (related to dictionary learning) Both the input and the output are x Deep Learning 4 Autoencoder, Attention (spatial transformer), Multi-modal learning, Neural Turing Machine, Memory Networks, Generative Adversarial Net Jian Li IIIS, Tsinghua Autoencoder Autoencoder Unsupervised

More information

Automatic Speech Recognition (ASR)

Automatic Speech Recognition (ASR) Automatic Speech Recognition (ASR) February 2018 Reza Yazdani Aminabadi Universitat Politecnica de Catalunya (UPC) State-of-the-art State-of-the-art ASR system: DNN+HMM Speech (words) Sound Signal Graph

More information

AUDIO SIGNAL PROCESSING FOR NEXT- GENERATION MULTIMEDIA COMMUNI CATION SYSTEMS

AUDIO SIGNAL PROCESSING FOR NEXT- GENERATION MULTIMEDIA COMMUNI CATION SYSTEMS AUDIO SIGNAL PROCESSING FOR NEXT- GENERATION MULTIMEDIA COMMUNI CATION SYSTEMS Edited by YITENG (ARDEN) HUANG Bell Laboratories, Lucent Technologies JACOB BENESTY Universite du Quebec, INRS-EMT Kluwer

More information

Adaptive Dropout Training for SVMs

Adaptive Dropout Training for SVMs Department of Computer Science and Technology Adaptive Dropout Training for SVMs Jun Zhu Joint with Ning Chen, Jingwei Zhuo, Jianfei Chen, Bo Zhang Tsinghua University ShanghaiTech Symposium on Data Science,

More information

Multilayer and Multimodal Fusion of Deep Neural Networks for Video Classification

Multilayer and Multimodal Fusion of Deep Neural Networks for Video Classification Multilayer and Multimodal Fusion of Deep Neural Networks for Video Classification Xiaodong Yang, Pavlo Molchanov, Jan Kautz INTELLIGENT VIDEO ANALYTICS Surveillance event detection Human-computer interaction

More information

Recurrent Neural Networks

Recurrent Neural Networks Recurrent Neural Networks 11-785 / Fall 2018 / Recitation 7 Raphaël Olivier Recap : RNNs are magic They have infinite memory They handle all kinds of series They re the basis of recent NLP : Translation,

More information

Deep Learning Applications

Deep Learning Applications October 20, 2017 Overview Supervised Learning Feedforward neural network Convolution neural network Recurrent neural network Recursive neural network (Recursive neural tensor network) Unsupervised Learning

More information