Bottleneck Features from SNR-Adaptive Denoising Deep Classifier for Speaker Identification
TAN Zhili & MAK Man-Wai
APSIPA 2015
Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong SAR, China
Contents
1. Motivation of Work
2. Deep Belief Network
3. Denoising Autoencoder
4. Denoising Classifier
5. I-Vector and PLDA for Speaker Recognition
6. Experiments on YOHO Corpus
7. Conclusions
Overview of Speaker Identification
[Diagram: utterances 1..n → Feature Extraction → feature vectors 1..n → Classification → Speaker ID]
Motivation
- Standard features for speaker identification, e.g. mel-frequency cepstral coefficients (MFCCs), are not designed specifically to extract speaker-dependent information, and they are not noise robust.
- Learning-based features outperform traditional handcrafted features in many areas, e.g. computer vision.
Proposed Solution
- With noisy speech as input, train a deep neural network (DNN) using both the clean speech and the speaker ID as supervisory signals; then use the output of the bottleneck layer as the feature.
- Key properties of the proposed features: speaker-dependent and noise robust.
Neural Networks
Artificial neural networks (ANNs) are a family of statistical learning models inspired by biological neural networks (the brain). Their aim is to approximate an unknown function from input to target output.
[Diagram: inputs x1, x2, x3 with weights w1, w2, w3 and bias b feeding a single neuron]
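The single neuron in the diagram can be sketched in a few lines: a weighted sum of the inputs plus a bias, passed through a nonlinearity (a logistic sigmoid here; the values are illustrative, not from the paper).

```python
import numpy as np

def neuron(x, w, b):
    """One artificial neuron: weighted sum of inputs plus bias,
    passed through a logistic (sigmoid) activation."""
    z = np.dot(w, x) + b
    return 1.0 / (1.0 + np.exp(-z))

# Example: three inputs x1..x3 with weights w1..w3, as in the diagram.
x = np.array([1.0, 0.5, -0.5])
w = np.array([0.2, -0.4, 0.1])
b = 0.1
y = neuron(x, w, b)   # a value in (0, 1)
```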
Deep Belief Network
- A deep neural network with pre-training and fine-tuning
- Pre-training: Restricted Boltzmann Machines (RBMs)
- Fine-tuning: back-propagation
[Diagram: RBMs (weights w1, w2) pre-trained layer by layer, then unfolded and fine-tuned to w1+ε1, w2+ε2]
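A minimal sketch of the greedy layer-wise pre-training idea, assuming Bernoulli-Bernoulli RBMs trained by one-step contrastive divergence (CD-1); the sizes and learning rate are illustrative, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_step(v0, W, b_vis, b_hid, lr=0.1):
    """One CD-1 update for a Bernoulli-Bernoulli RBM.
    v0: batch of visible vectors (n x n_vis)."""
    # Positive phase: hidden probabilities given the data
    ph0 = sigmoid(v0 @ W + b_hid)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: one step of Gibbs sampling
    pv1 = sigmoid(h0 @ W.T + b_vis)
    ph1 = sigmoid(pv1 @ W + b_hid)
    # Approximate gradient and parameter update
    n = v0.shape[0]
    W += lr * (v0.T @ ph0 - pv1.T @ ph1) / n
    b_vis += lr * (v0 - pv1).mean(axis=0)
    b_hid += lr * (ph0 - ph1).mean(axis=0)
    return W, b_vis, b_hid

# Greedy layer-wise pre-training: train RBM 1 on the data, then RBM 2
# on RBM 1's hidden activations, before back-propagation fine-tuning.
X = (rng.random((64, 8)) < 0.5).astype(float)
W1 = 0.01 * rng.standard_normal((8, 4))
bv1, bh1 = np.zeros(8), np.zeros(4)
for _ in range(50):
    W1, bv1, bh1 = cd1_step(X, W1, bv1, bh1)
H1 = sigmoid(X @ W1 + bh1)   # input to the next RBM in the stack
```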
Autoencoder
- A particular form of deep belief network whose output aims to reconstruct the input
- The structure is symmetric with respect to the middle layer
- For speech, the input and output are Gaussian:
  - First layer: Gaussian-Bernoulli RBM pre-training
  - Last layer: linear activation function
- Fine-tuning: squared-error objective
Autoencoder (cont'd)
[Diagram: RBMs w1, w2 pre-trained and unfolded into a symmetric network; the decoder reuses the transposed weights, fine-tuned to w2ᵀ+ε3 and w1ᵀ+ε4]
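The symmetric, tied-weight structure in the diagram can be sketched as a forward pass, assuming sigmoid hidden layers, a linear output layer (since speech features are Gaussian), and the decoder reusing the transposed encoder matrices; all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def autoencode(x, W1, b1, W2, b2, c1, c2):
    """Symmetric autoencoder with tied weights: the decoder applies
    W2.T and W1.T, mirroring the encoder. Linear output layer."""
    h1 = sigmoid(x @ W1 + b1)      # encoder layer 1
    h2 = sigmoid(h1 @ W2 + b2)     # middle (code) layer
    h3 = sigmoid(h2 @ W2.T + c2)   # decoder layer (tied weights)
    return h3 @ W1.T + c1          # linear reconstruction

D, H1, H2 = 10, 6, 3
W1 = 0.1 * rng.standard_normal((D, H1))
W2 = 0.1 * rng.standard_normal((H1, H2))
b1, b2 = np.zeros(H1), np.zeros(H2)
c1, c2 = np.zeros(D), np.zeros(H1)

x = rng.standard_normal((5, D))
x_hat = autoencode(x, W1, b1, W2, b2, c1, c2)
mse = np.mean((x - x_hat) ** 2)    # the squared-error fine-tuning objective
```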
Denoising Autoencoder
- Input: noisy speech
- Target output: the corresponding clean counterpart
- After fine-tuning, the denoising autoencoder (DAE) can remove noise from its input
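The key point is the training-pair construction: the network input is the noisy features and the regression target is the clean counterpart, trained with a squared-error objective. A tiny sketch with synthetic data (in the paper the noise is added with the FaNT tool, not sampled like this):

```python
import numpy as np

rng = np.random.default_rng(2)

# Denoising-autoencoder training pairs: input = noisy features,
# target = clean counterpart. (Synthetic stand-in data here.)
clean = rng.standard_normal((100, 60))
noisy = clean + 0.3 * rng.standard_normal(clean.shape)

def squared_error(target, output):
    """Per-example squared-error fine-tuning objective."""
    return 0.5 * np.sum((target - output) ** 2) / target.shape[0]

# A trained DAE maps `noisy` toward `clean`; the two extremes:
loss_identity = squared_error(clean, noisy)   # noise left untouched
loss_perfect = squared_error(clean, clean)    # ideal reconstruction
```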
DBN Classifier
- Target output: class label
- Last layer: softmax function
- Fine-tuning: cross-entropy objective
[Diagram: stacked RBMs (w1, w2) unfolded into a classifier with output weights w3, fine-tuned to w1+ε1, w2+ε2]
Denoising Classifier
- Two RBMs are stacked on top of the denoising deep autoencoder
- The top RBM is connected to the speaker-ID class-label layer
- The first RBM, connected to the output of the autoencoder, is also a Gaussian-Bernoulli RBM
- The whole classifier is then fine-tuned again by back-propagation
Denoising Classifier (cont'd)
[Diagram: the denoising deep autoencoder (weights w1+ε1 … w1ᵀ+ε4; noisy speech in, denoised speech out) extended with two RBMs (w3, w4) and a speaker-ID output layer (w5); the narrow bottleneck (BN) layer supplies the BN features. The second fine-tuning perturbs all weights again (the ε' terms).]
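Once the denoising classifier is trained, the BN features are simply the activations at the narrow bottleneck layer during a forward pass. A sketch with random (untrained) weights; layer sizes and the bottleneck position are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bottleneck_features(x, weights, biases, bn_index):
    """Forward-propagate up to the bottleneck layer and return its
    activations as the frame-level feature vector."""
    h = x
    for i, (W, b) in enumerate(zip(weights, biases)):
        h = sigmoid(h @ W + b)
        if i == bn_index:
            return h
    return h

# Illustrative tail of the network: ... -> 60-unit BN layer -> 138 speakers
sizes = [60, 256, 60, 138]
weights = [0.1 * rng.standard_normal((a, b)) for a, b in zip(sizes, sizes[1:])]
biases = [np.zeros(b) for b in sizes[1:]]

frames = rng.standard_normal((4, 60))
bn = bottleneck_features(frames, weights, biases, bn_index=1)  # 60-dim BN features
```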
I-Vector for Speaker Identification
Factor analysis model:
    μ_s = μ + T x_s
where μ_s is the speaker-dependent supervector, μ is the UBM supervector, T is the low-rank total variability matrix, and x_s is the speaker-dependent i-vector.
Instead of the high-dimensional μ_s (e.g. 1024 × 60), we use the low-dimensional (typically 500) i-vector x_s to represent the speaker.
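The factor-analysis model is a single affine map; the sketch below only illustrates the dimensions involved (256 mixtures × 60-dim features and 400 total factors, matching the experimental setup later in the deck; the matrices are random, not trained).

```python
import numpy as np

rng = np.random.default_rng(4)

# i-vector factor-analysis model:  mu_s = mu + T @ x_s
C, F, R = 256, 60, 400                       # mixtures, feature dim, total factors
mu = rng.standard_normal(C * F)              # UBM supervector (15360-dim)
T = 0.01 * rng.standard_normal((C * F, R))   # low-rank total variability matrix
x_s = rng.standard_normal(R)                 # low-dim speaker-dependent i-vector

mu_s = mu + T @ x_s                          # speaker-dependent supervector
```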
Probabilistic LDA
Factor analysis model:
    x_s = m + V z_s + ε_s
where x_s is the i-vector extracted from an utterance of speaker s, m is the global mean of all i-vectors, V defines the speaker subspace, z_s is the speaker factor, and ε_s is residual noise with covariance Σ.
Speakers are then compared in the speaker subspace through z rather than through the i-vectors x themselves, so channel effects are suppressed.
Experimental Setup
- Evaluation dataset: speech from 138 speakers in the YOHO corpus
- 96 utterances per speaker as training data
- Babble noise added at SNRs of 15 dB, 6 dB and 0 dB using the FaNT tool
- 40 utterances per SNR condition and per speaker as testing data
- Baseline: 19 MFCCs plus energy, together with their 1st and 2nd derivatives → 60 dimensions
DNN Setup
- Structure: D-256-256-256-D-256-60-138, where D is the dimension of the input vectors
- Inputs to the DNN:
  (1) 1 frame of 256-dim spectra (Log-spec BN)
  (2) 7 frames of 20-dim Mel filter-bank outputs (Log-mel BN)
  (3) 5 frames of 60-dim MFCCs (MFC BN)
- Each input is packed with an SNR node and normalized by z-norm
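The input packing can be sketched as follows: stack a context window of frames, append the utterance SNR as an extra node, then z-normalize each dimension. The exact packing and normalization details are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)

def pack_input(frames, context, snr_db):
    """Stack `context` consecutive frames into one input vector,
    append the SNR as an extra node, then z-normalize per dimension."""
    n, d = frames.shape
    half = context // 2
    packed = []
    for t in range(half, n - half):
        window = frames[t - half : t + half + 1].reshape(-1)  # context x d values
        packed.append(np.append(window, snr_db))              # + SNR node
    packed = np.array(packed)
    # z-norm: zero mean, unit variance per input dimension
    return (packed - packed.mean(axis=0)) / (packed.std(axis=0) + 1e-8)

fbank = rng.standard_normal((100, 20))        # 20-dim Mel filter-bank outputs
X = pack_input(fbank, context=7, snr_db=6.0)  # 7 x 20 + 1 = 141-dim DNN inputs
```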
I-Vector/PLDA Setup
- Decorrelation of BN features: PCA whitening
- GMM-UBM: 256 mixtures
- Total variability matrix: 400 total factors
- PLDA: SNR-independent, with 138 latent variables
- Speaker identification: for each test utterance, pick the speaker ID with the highest average PLDA score
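A standard PCA-whitening sketch of the kind used to decorrelate the BN features before GMM-UBM modelling (eigendecomposition of the sample covariance, then rescaling each component to unit variance):

```python
import numpy as np

def pca_whiten(X, eps=1e-12):
    """PCA whitening: rotate onto the covariance eigenbasis and
    scale each component to unit variance (decorrelates features)."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / (len(X) - 1)
    vals, vecs = np.linalg.eigh(cov)
    return Xc @ vecs / np.sqrt(vals + eps)

rng = np.random.default_rng(6)
# Correlated toy data: random linear mix of independent Gaussians
X = rng.standard_normal((500, 8)) @ rng.standard_normal((8, 8))
Z = pca_whiten(X)   # covariance of Z is (close to) the identity
```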
Denoising Ability of the Autoencoder
[Figure: spectrograms comparing clean speech, the noisy input (babble noise at 0 dB SNR via FaNT), and the denoised speech taken from the output of hidden layer 4 of the denoising deep classifier]
Results
Log-mel BN outperforms MFCC under noisy conditions.
PLDA Score Combination
At the score level, we can fuse the PLDA scores of the MFCC and BN features to further improve speaker-identification performance:
    s_fused = α · s_MFCC + (1 − α) · s_BN
where α is the fusion weight and s denotes the PLDA scores.
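Score-level fusion is a one-line weighted sum; the sketch below assumes the weight α is applied to the MFCC score (the convention is not spelled out on the slide, so this is an assumption), with the table's weights, e.g. 0.51 for Log-mel BN, as example values.

```python
import numpy as np

def fuse_scores(score_mfcc, score_bn, alpha):
    """Linear PLDA-score fusion: s = alpha * s_MFCC + (1 - alpha) * s_BN.
    Which system carries the weight alpha is an assumed convention."""
    return alpha * np.asarray(score_mfcc) + (1 - alpha) * np.asarray(score_bn)

# Toy PLDA scores for two trials, fused with equal weight
s = fuse_scores([2.0, -1.0], [4.0, 0.0], alpha=0.5)
```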
PLDA Score Combination (cont'd)

Feature       Fusion Weight   Clean    15dB     6dB      0dB
MFCC          1.00            98.31%   95.61%   90.05%   65.65%
Log-spec BN   0.57            99.29%   97.88%   94.02%   79.66%
Log-mel BN    0.51            99.31%   98.46%   94.92%   82.50%
MFC BN        0.53            98.79%   96.41%   92.80%   75.45%

- Score fusion increases the accuracy significantly
- BN features and MFCC are complementary to each other
Results on Speaker Verification
Results in terms of EER (in %) and minDCF (×1000); lower is better.

              Clean           15dB            6dB             0dB
              EER    minDCF   EER    minDCF   EER    minDCF   EER     minDCF
MFCC          0.793  4.997    1.685  9.420    3.425  21.162   10.276  52.716
Log-mel BN    0.688  3.733    1.287  6.838    2.587  14.374   6.677   35.777

- EER: crossing point of the FAR and FRR curves
- DCF: linear combination of FAR and FRR
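The two metrics in the table can be sketched directly from their definitions: EER is where FAR and FRR cross, and minDCF is the minimum over thresholds of a weighted FAR/FRR combination. The cost parameters below are the classic NIST values, assumed here since the slide does not state them.

```python
import numpy as np

def eer_and_mindcf(scores, labels, c_miss=10.0, c_fa=1.0, p_tgt=0.01):
    """EER: point where FAR == FRR.  minDCF: min over thresholds of
    Cmiss*Pmiss*Ptgt + Cfa*Pfa*(1-Ptgt).  Brute-force over thresholds."""
    scores = np.asarray(scores, float)
    labels = np.asarray(labels, bool)
    far, frr = [], []
    for th in np.sort(np.unique(scores)):
        accept = scores >= th
        far.append(np.mean(accept[~labels]))    # false acceptance rate
        frr.append(np.mean(~accept[labels]))    # false rejection rate
    far, frr = np.array(far), np.array(frr)
    i = np.argmin(np.abs(far - frr))            # crossing point
    eer = (far[i] + frr[i]) / 2
    dcf = c_miss * frr * p_tgt + c_fa * far * (1 - p_tgt)
    return eer, dcf.min()

# Toy trials: targets (label 1) score above impostors (label 0)
scores = [3.1, 2.7, 2.9, 0.5, 0.4, 1.0]
labels = [1, 1, 1, 0, 0, 0]
eer, mindcf = eer_and_mindcf(scores, labels)   # perfectly separable -> both 0
```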
Conclusions
- On the speaker identification task, our Log-mel BN features are comparable with standard MFCCs.
- The BN features and MFCCs are complementary, leading to a significant performance gain after fusing the MFCC- and BN-based PLDA scores.
- On the speaker verification task, our Log-mel BN features from the denoising deep classifier outperform MFCCs under all SNR conditions.
THANKS! Q & A
APPENDIX
PLDA Scoring
    x_s = m + V z_s + ε_s
    x_t = m + V z_t + ε_t
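Under this model, the PLDA score for a pair of i-vectors is the log-likelihood ratio between "same speaker factor z" and "independent z". A direct (non-optimized) sketch that evaluates the two Gaussian hypotheses explicitly; this is a generic two-covariance formulation, not code from the paper:

```python
import numpy as np
from scipy.stats import multivariate_normal

def plda_llr(x_s, x_t, m, V, Sigma):
    """Log-likelihood ratio that x_s and x_t share the same latent
    speaker factor z, under x = m + V z + eps, z ~ N(0, I),
    eps ~ N(0, Sigma)."""
    B = V @ V.T                 # between-speaker covariance
    T = B + Sigma               # total covariance of one i-vector
    d = len(m)
    pair = np.concatenate([x_s - m, x_t - m])
    same = np.block([[T, B], [B, T]])                      # shared z couples the pair
    diff = np.block([[T, np.zeros((d, d))],
                     [np.zeros((d, d)), T]])               # independent speakers
    mean = np.zeros(2 * d)
    return (multivariate_normal.logpdf(pair, mean, same)
            - multivariate_normal.logpdf(pair, mean, diff))

# Toy 1-D check: identical i-vectors along the speaker subspace
# score higher than opposed ones.
m, V, Sigma = np.zeros(1), np.array([[1.0]]), 0.25 * np.eye(1)
x = np.array([1.0])
llr_same = plda_llr(x, x, m, V, Sigma)
llr_diff = plda_llr(x, -x, m, V, Sigma)
```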
Results of Score Combination