Bottleneck Features from SNR-Adaptive Denoising Deep Classifier for Speaker Identification


Bottleneck Features from SNR-Adaptive Denoising Deep Classifier for Speaker Identification
TAN Zhili & MAK Man-Wai, APSIPA 2015
Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong SAR, China

Contents
1. Motivation of Work
2. Deep Belief Network
3. Denoising Autoencoder
4. Denoising Classifier
5. I-Vector and PLDA for Speaker Recognition
6. Experiments on YOHO Corpus
7. Conclusions

Overview of Speaker Identification
Each utterance (utt. 1 ... utt. n) passes through feature extraction to produce a feature vector (Feature Vector 1 ... Feature Vector n), which is then classified to yield a speaker ID.

Motivation
Features used in speaker identification, e.g. mel-frequency cepstral coefficients (MFCCs):
- are not designed specifically to extract speaker-dependent information
- are not noise robust
Learning-based features outperform traditional handcrafted features in many areas, e.g. computer vision.

Proposed Solution
With noisy speech as input, train a deep neural network (DNN) using supervisory signals from both the clean speech and the speaker ID; then use the output of the bottleneck layer as the feature.
Key properties of the proposed solution: speaker-dependent and noise robust.

Neural Networks
Artificial neural networks (ANNs) are a family of statistical learning models inspired by biological neural networks (the brain). Their aim is to approximate the unknown function from input to target output. [Slide diagram: a single neuron with inputs x1, x2, x3, weights w1, w2, w3 and bias b]
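The weighted-sum-plus-bias computation that the slide's neuron diagram illustrates can be sketched as follows (a minimal sketch; the input values, weights, bias and sigmoid activation are illustrative assumptions, not taken from the slides):

```python
import math

def neuron(x, w, b):
    """One artificial neuron: weighted sum of inputs plus bias,
    squashed by a sigmoid activation into (0, 1)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative inputs x1..x3, weights w1..w3 and bias b.
y = neuron([1.0, 0.5, -0.5], [0.2, -0.4, 0.1], 0.05)  # z = 0.0, so y = 0.5
```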

Deep Belief Network
A deep neural network with pre-training and fine-tuning. Pre-training: restricted Boltzmann machines (RBMs). Fine-tuning: back-propagation. [Slide diagram: stacked RBMs with weights w1, w2 are unrolled into a network whose weights are fine-tuned to w1+ε1, w2+ε2]

Autoencoder
A particular form of deep belief network whose output aims to reconstruct the input; the structure is symmetric about the middle layer. For speech, the input and output are Gaussian: the first layer is pre-trained as a Gaussian-Bernoulli RBM, the last layer uses a linear activation function, and fine-tuning minimises a squared-error function.

Autoencoder (cont'd) [Slide diagram: two stacked RBMs (w1, w2) unrolled into a symmetric network, input layer → middle layers 1-3 → output layer, with decoder weights tied to the encoder: w1+ε1, w2+ε2, w2T+ε3, w1T+ε4]

Denoising Autoencoder
Input: noisy speech. Target output: the corresponding clean counterpart. After fine-tuning, the DAE has the ability to denoise.
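A minimal numpy sketch of this idea follows, using toy data and a single tied-weight hidden layer (the paper's network is deeper, RBM pre-trained, and trained on speech features, so everything here is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: "clean" frames and noisy versions (additive Gaussian noise).
clean = rng.standard_normal((200, 8))
noisy = clean + 0.3 * rng.standard_normal(clean.shape)

# One hidden layer with tied weights; the decoder is W.T, the output layer
# is linear (Gaussian output), and the training target is the clean speech.
W = 0.1 * rng.standard_normal((8, 4))
b_h, b_o = np.zeros(4), np.zeros(8)

def forward(x):
    h = np.tanh(x @ W + b_h)        # hidden code
    return h, h @ W.T + b_o         # linear reconstruction

err_before = np.mean((forward(noisy)[1] - clean) ** 2)

lr = 0.01
for _ in range(500):
    h, out = forward(noisy)
    err = out - clean               # squared-error fine-tuning signal
    dh = (err @ W) * (1 - h ** 2)   # back-prop through tanh
    W -= lr / len(noisy) * (noisy.T @ dh + (h.T @ err).T)
    b_o -= lr * err.mean(axis=0)
    b_h -= lr * dh.mean(axis=0)

err_after = np.mean((forward(noisy)[1] - clean) ** 2)
```

After training, the reconstruction error against the clean targets drops, which is the "denoising ability" the slide refers to.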

DBN Classifier
Target output: class label. Last layer: soft-max function. Fine-tuning: cross-entropy error function. [Slide diagram: stacked RBMs (w1, w2) unrolled into a classifier whose top layer (w3) outputs the class label]

Denoising Classifier
Two RBMs are stacked on top of the denoising deep autoencoder, and the top RBM is connected to the speaker-ID class-label layer. The first RBM, connected to the output of the autoencoder, is also a Gaussian-Bernoulli RBM. The whole classifier is then fine-tuned again by back-propagation.

Denoising Classifier (cont'd) [Slide diagram: the denoising deep autoencoder (noisy speech in, denoised speech out) with two further RBMs (w3, w4) stacked on top; the bottleneck (BN) layer before the speaker-ID output layer provides the BN features]

I-Vector for Speaker Identification
Factor analysis model: μ_s = μ + T x_s, where μ_s is the speaker-dependent supervector, μ is the UBM supervector, T is the low-rank total variability matrix, and x_s is the speaker-dependent i-vector. Instead of the high-dimensional μ_s (e.g. 1024 × 60), we use the low-dimensional (typically 500) i-vector x_s to represent the speaker.
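The factor-analysis relation above can be illustrated numerically. This is only a sketch with toy dimensions: real i-vector extraction is a posterior-mean estimate computed from Baum-Welch statistics, not the plain least-squares solve used here.

```python
import numpy as np

rng = np.random.default_rng(2)
D, R = 120, 10                      # toy supervector and i-vector dimensions
mu = rng.standard_normal(D)         # UBM supervector
T = rng.standard_normal((D, R))     # low-rank total variability matrix
x_s = rng.standard_normal(R)        # i-vector representing the speaker

mu_s = mu + T @ x_s                 # speaker-dependent supervector

# Recover the low-dimensional representation from the high-dimensional
# supervector (simplified: an exact least-squares solve).
x_hat = np.linalg.lstsq(T, mu_s - mu, rcond=None)[0]
```

The point of the model is exactly this dimensionality gap: the speaker lives in the R-dimensional x_s, not the D-dimensional μ_s.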

Probabilistic LDA
Factor analysis model: x_s = m + V z_s + ε_s, where x_s is the i-vector extracted from an utterance of speaker s, m is the global mean of all i-vectors, V defines the speaker subspace, z_s is the speaker factor, and ε_s is residual noise with covariance Σ. Speakers are then compared in the speaker subspace based on z instead of the i-vectors x in the i-vector space, so channel effects are suppressed.

Experimental Setup
Evaluation dataset: speech from 138 speakers in the YOHO corpus, with 96 utterances per speaker as training data. Babble noise was added at SNRs of 15 dB, 6 dB and 0 dB using the FaNT tool, with 40 utterances per SNR condition and per speaker as test data. Baseline: 19 MFCCs plus energy, together with their 1st and 2nd derivatives → 60-dim.

DNN Setup
Structure: D-256-256-256-D-256-60-138, where D is the dimension of the input vectors. Inputs for the DNN: (1) 1 frame of 256-dim spectra (Log-spec BN); (2) 7 frames of 20-dim Mel filter-bank outputs (Log-mel BN); (3) 5 frames of 60-dim MFCCs (MFC BN). Each input is packed with an SNR node and normalized by z-norm.
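The input packing described above (context frames stacked, an SNR node appended, z-norm applied) might look like the sketch below; the random frame values, edge-repetition padding and the 15 dB SNR value are illustrative assumptions:

```python
import numpy as np

def stack_context(frames, left=3, right=3):
    """Stack each frame with its 3 left and 3 right neighbours (7 frames
    total, as for the Log-mel BN input), repeating edge frames as padding."""
    padded = np.vstack([frames[:1]] * left + [frames] + [frames[-1:]] * right)
    return np.hstack([padded[i:i + len(frames)] for i in range(left + right + 1)])

def z_norm(x):
    """Per-dimension zero-mean, unit-variance normalization."""
    return (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-8)

rng = np.random.default_rng(1)
fbank = rng.standard_normal((100, 20))   # 100 frames of 20-dim filter-bank output
snr_node = np.full((100, 1), 15.0)       # SNR node appended to every frame
dnn_input = z_norm(np.hstack([stack_context(fbank), snr_node]))  # (100, 141)
```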

I-Vector/PLDA Setup
Decorrelation of BN features: PCA whitening. GMM-UBM: 256 mixtures. Total variability matrix: 400 total factors. PLDA: SNR-independent, with 138 latent variables. Speaker identification: find the speaker ID with the highest averaged PLDA score for each test utterance.

Denoising Ability of the Autoencoder [Slide diagram: spectrograms of the noisy speech (FaNT, 0 dB SNR), the denoised speech at the output of hidden layer 4, and the clean speech, shown alongside the denoising deep classifier architecture]

Result
Log-mel BN outperforms MFCC under noisy conditions.

PLDA Score Combination
At the PLDA score level, we can fuse the MFCC and BN features to further improve speaker identification performance: the fused score is a linear combination of the two systems' PLDA scores, weighted by the fusion rate.

PLDA Score Combination (cont'd)
Identification accuracy by SNR of the test utterances:

Feature     | Fusion Weight | Clean  | 15dB   | 6dB    | 0dB
MFCC        | 1.00          | 98.31% | 95.61% | 90.05% | 65.65%
Log-spec BN | 0.57          | 99.29% | 97.88% | 94.02% | 79.66%
Log-mel BN  | 0.51          | 99.31% | 98.46% | 94.92% | 82.50%
MFC BN      | 0.53          | 98.79% | 96.41% | 92.80% | 75.45%

Score fusion increases the accuracy significantly; the BN features and MFCC are complementary to each other.
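The score-level fusion can be sketched as follows. The PLDA scores are hypothetical, and the convention that the weight multiplies the BN score (with 1 − w on the MFCC score) is an assumption about the table's "Fusion Weight" column:

```python
def fuse_scores(mfcc_scores, bn_scores, w):
    """Linear fusion of per-speaker PLDA scores; w weights the BN system
    and (1 - w) the MFCC system (assumed convention)."""
    return [w * b + (1.0 - w) * m for m, b in zip(mfcc_scores, bn_scores)]

# Hypothetical PLDA scores of one test utterance against 3 enrolled speakers.
mfcc_scores = [1.2, -0.3, 0.4]
bn_scores = [0.8, -0.5, 1.1]

fused = fuse_scores(mfcc_scores, bn_scores, 0.51)       # Log-mel BN weight
best = max(range(len(fused)), key=lambda i: fused[i])   # identified speaker
```

Identification then picks the enrolled speaker with the highest fused score, exactly as in the unfused setup.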

Results on Speaker Verification
Results in terms of EER (in %) and minDCF (×1000); lower is better.

            | Clean          | 15dB           | 6dB            | 0dB
Feature     | EER   | minDCF | EER   | minDCF | EER   | minDCF | EER    | minDCF
MFCC        | 0.793 | 4.997  | 1.685 | 9.420  | 3.425 | 21.162 | 10.276 | 52.716
Log-mel BN  | 0.688 | 3.733  | 1.287 | 6.838  | 2.587 | 14.374 | 6.677  | 35.777

EER: crossing point of FAR and FRR. DCF: linear combination of FAR and FRR.
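The EER definition on the slide (the operating point where the false-accept rate crosses the false-reject rate) can be computed from raw scores as sketched below; the toy score lists are illustrative, not from the experiments:

```python
def equal_error_rate(target_scores, impostor_scores):
    """Sweep the decision threshold over all observed scores and return
    the error rate where FAR and FRR are closest to each other."""
    best_far, best_frr = 1.0, 0.0
    for t in sorted(target_scores + impostor_scores):
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        frr = sum(s < t for s in target_scores) / len(target_scores)
        if abs(far - frr) < abs(best_far - best_frr):
            best_far, best_frr = far, frr
    return (best_far + best_frr) / 2

targets = [2.0, 1.5, 1.2, 0.4]      # genuine-speaker trial scores
impostors = [0.5, 0.1, -0.2, -1.0]  # impostor trial scores
eer = equal_error_rate(targets, impostors)   # 0.25 for these toy scores
```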

Conclusions
On the speaker identification task, our Log-mel BN features are comparable with the standard MFCC. The BN features and MFCC are complementary to each other, leading to a significant performance gain after fusing the MFCC- and BN-based PLDA scores. On the speaker verification task, our Log-mel BN features from the denoising deep classifier outperform MFCC under all SNR conditions.

THANKS! Q & A

APPENDIX

PLDA Scoring
x_s = m + V z_s + ε_s
x_t = m + V z_t + ε_t

Results of Score Combination [chart not transcribed]