Comparative Evaluation of Feature Normalization Techniques for Speaker Verification

Md Jahangir Alam 1,2, Pierre Ouellet 1, Patrick Kenny 1, Douglas O'Shaughnessy 2

1 CRIM, Montreal, Canada  {Janagir.Alam, Pierre.Ouellet, Patrick.Kenny}@crim.ca
2 INRS-EMT, University of Quebec, Montreal, Canada  dougo@emt.inrs.ca

Abstract. This paper investigates several feature normalization techniques for use in an i-vector speaker verification system based on a mixture probabilistic linear discriminant analysis (PLDA) model. The objective of feature normalization is to compensate for the effects of environmental mismatch. Here, we study the short-time Gaussianization (STG), short-time mean and variance normalization (STMVN), and short-time mean and scale normalization (STMSN) techniques. Our goal is to compare the performance of these techniques on the telephone (det5) and microphone (det1, det2, det3 and det4) conditions of the NIST SRE 2010 corpora. Experimental results show that the performances of the STMVN and STMSN techniques are comparable to that of the STG technique.

Keywords: Speaker verification, feature normalization, STG, STMVN.

1 Introduction

Most state-of-the-art speaker verification systems perform well in controlled conditions, where data are collected in reasonably clean environments. Acoustic mismatch between training and testing environments, however, can severely degrade the performance of these systems, and this degradation has been a barrier to the deployment of speaker recognition technologies. Feature normalization strategies are employed in speaker recognition systems to compensate for the effects of environmental mismatch. These techniques are attractive because they require neither a priori knowledge of the environment nor adaptation data. Most normalization techniques are applied as a post-processing scheme on Mel-frequency cepstral coefficient (MFCC) features.

Normalization techniques can be classified as model-based or data distribution-based. In model-based techniques, certain statistical properties of speech, such as the mean, variance, and higher-order moments, are normalized to reduce the residual mismatch in the feature vectors; cepstral mean normalization (CMN), mean and variance normalization (MVN), STMVN and STMSN fall into this category. Data distribution-based techniques aim at normalizing the feature distribution to a reference distribution; STG and histogram normalization fall into this category.

A number of feature normalization techniques have been proposed for speaker verification, including feature warping [1], STG [2], CMN [3], and RASTA filtering [4-5]. In this paper, our goal is to perform a comparative evaluation of three feature normalization methods, STG, STMVN, and a new method called STMSN (similar in spirit to STMVN), in an i-vector speaker verification system employing a mixture PLDA model. For our experiments, we use the latest NIST 2010 SRE benchmark data. We use a gender-independent i-vector extractor and then form a mixture PLDA model by training and combining two gender-dependent models as in [6], where the gender label is treated as a latent (or hidden) variable.

2 Feature Normalization Techniques

In feature normalization, the components of the feature vector are scaled or warped so as to enable more effective modeling of speaker differences. Most normalization techniques are applied in the cepstral domain, as shown in Fig. 1. Speaker verification systems generally use acoustic front-ends similar to those of speech recognition systems: pre-processing (DC removal and pre-emphasis), short-time spectrum estimation, Mel-filter bank integration, cepstral coefficient computation via the discrete cosine transform (DCT), appending delta and double-delta coefficients, removing silence frames, and finally feature normalization.

Fig. 1. Block diagram of MFCC feature extraction with feature normalization as a post-processing scheme (speech signal → pre-processing → framing & windowing → STFT and spectrum estimation → Mel-filter bank integration → logarithmic nonlinearity → DCT → MFCC → append delta & double-delta → silence removal using VAD labels → feature normalization).
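For concreteness, the following is a minimal sketch of such a front-end in Python using the librosa library. It is not the authors' implementation: librosa's zeroth cepstral coefficient stands in for the log-energy term, the 8 kHz sampling rate is an assumption for telephone-band audio, and the VAD labels are assumed to be supplied externally.

```python
import numpy as np
import librosa

def mfcc_frontend(wav_path, vad_labels, n_mfcc=20):
    """MFCC front-end: static + delta + double-delta, then silence removal.

    vad_labels: hypothetical boolean array, one entry per 10 ms frame
    (True = speech), produced by an external VAD.
    """
    y, sr = librosa.load(wav_path, sr=8000)  # telephone-band audio (assumption)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=int(0.025 * sr),        # 25 ms analysis window
                                hop_length=int(0.010 * sr))   # 10 ms frame shift
    feats = np.vstack([mfcc,
                       librosa.feature.delta(mfcc),
                       librosa.feature.delta(mfcc, order=2)]).T  # (frames, 60)
    n = min(len(vad_labels), len(feats))
    return feats[:n][vad_labels[:n]]  # drop silence frames before normalization
```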

CMN (or MVN) is normally performed over the whole utterance, under the assumption that the channel effect is constant across the entire utterance. Moreover, normalizing a feature vector over the entire utterance is not feasible in real-time applications, as it causes an unnecessarily long processing delay. To relax this assumption and to reduce the processing delay, MFCC features are instead normalized over a sliding window of 3-5 s duration. Here we use a 3 s sliding window (i.e., 300 frames for a frame shift of 10 ms) for all three methods. The feature vector to be normalized is located at the centre of the sliding window.

2.1 Short-time Gaussianization (STG)

STG [2] aims at modifying the short-time feature distribution to follow a reference distribution, for example the standard normal distribution. It begins with a global linear transformation of the features, followed by short-time windowed cumulative distribution function (CDF) matching. The linear transformation in the feature space leads to local independence or decorrelation. If $X$ is the original feature set and $A$ is the transformation matrix, STG can be implemented in two steps [2]: first, linearly transform the original features, $Y = AX$; second, apply short-time windowed feature warping $T$ to $Y$, $\hat{X} = T(Y)$. In STG each feature vector is warped independently [2]. STG performs better than the feature warping technique [1].

2.2 Short-time Mean and Variance Normalization (STMVN)

In the short-time mean and variance normalization (STMVN) technique, the $m$-th frame and $k$-th cepstral coefficient $C(m,k)$ is normalized as
\[
C_{\mathrm{stmvn}}(m,k) = \frac{C(m,k) - \mu_{st}(m,k)}{\sigma_{st}(m,k)},
\]
where $m$ and $k$ denote the frame index and the cepstral coefficient index, respectively, and $L$ is the sliding-window length in frames. The short-time mean $\mu_{st}$ and standard deviation $\sigma_{st}$ are defined as
\[
\mu_{st}(m,k) = \frac{1}{L} \sum_{j=m-L/2}^{m+L/2} C(j,k), \qquad
\sigma_{st}(m,k) = \sqrt{\frac{1}{L} \sum_{j=m-L/2}^{m+L/2} \big(C(j,k) - \mu_{st}(m,k)\big)^{2}}.
\]
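A minimal NumPy sketch of STMVN under these definitions follows. The handling of edge frames, where the full 300-frame window is unavailable, is an assumption (the paper does not specify it); here the window is simply clipped at the utterance boundaries.

```python
import numpy as np

def stmvn(C, L=300):
    """Short-time mean and variance normalization of MFCC features.

    C: (num_frames, num_coeffs) feature matrix.
    L: window length in frames (300 frames = 3 s at a 10 ms frame shift).
    """
    num_frames = C.shape[0]
    out = np.empty_like(C, dtype=float)
    for m in range(num_frames):
        lo, hi = max(0, m - L // 2), min(num_frames, m + L // 2 + 1)
        window = C[lo:hi]                    # window centred on frame m
        mu = window.mean(axis=0)             # short-time mean
        sigma = window.std(axis=0)           # short-time standard deviation
        out[m] = (C[m] - mu) / np.maximum(sigma, 1e-8)  # avoid division by zero
    return out
```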

2.3 Short-time Cepstral Mean and Scale Normalization (STMSN)

Given a lower bound $b_l$ and an upper bound $b_u$ of a feature component $x$, scale normalization can be defined as
\[
\hat{x}_{sn} = \frac{x - b_l}{b_u - b_l}, \tag{1}
\]
so that $\hat{x}_{sn}$ lies in the range $[0,1]$. MVN transforms the feature component $x$ into a random variable with zero mean and unit variance,
\[
\hat{x}_{mvn} = \frac{x - \mu}{\sigma}, \tag{2}
\]
where $\mu$ and $\sigma$ are the sample mean and standard deviation of the feature, respectively. From (1) and (2) we define a new normalization technique, called the mean and scale normalization (MSN) technique, as
\[
\hat{x}_{msn} = \frac{x - \mu}{b_u - b_l}. \tag{3}
\]
In the short-time mean and scale normalization (STMSN) technique, the $m$-th frame and $k$-th cepstral coefficient $C(m,k)$ is normalized as
\[
C_{\mathrm{stmsn}}(m,k) = \frac{C(m,k) - \mu_{st}(m,k)}{d_{st}(m,k)},
\]
where $\mu_{st}(m,k)$ and $d_{st}(m,k)$ are the short-time mean and the short-time difference between the upper and lower bounds, respectively, defined as
\[
\mu_{st}(m,k) = \frac{1}{L} \sum_{j=m-L/2}^{m+L/2} C(j,k), \qquad
d_{st}(m,k) = \max_{m-L/2 \le j \le m+L/2} C(j,k) \;-\; \min_{m-L/2 \le j \le m+L/2} C(j,k).
\]
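Under the same windowing assumptions as the STMVN sketch above, STMSN differs only in the denominator, which becomes the short-time range of each coefficient:

```python
import numpy as np

def stmsn(C, L=300):
    """Short-time mean and scale normalization of MFCC features."""
    num_frames = C.shape[0]
    out = np.empty_like(C, dtype=float)
    for m in range(num_frames):
        lo, hi = max(0, m - L // 2), min(num_frames, m + L // 2 + 1)
        window = C[lo:hi]                               # window centred on frame m
        mu = window.mean(axis=0)                        # short-time mean
        d = window.max(axis=0) - window.min(axis=0)     # short-time range d_st
        out[m] = (C[m] - mu) / np.maximum(d, 1e-8)      # avoid division by zero
    return out
```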

3 I-vector Framework for Speaker Verification

The i-vector framework has set a new performance standard in speaker verification research. An i-vector extractor converts an entire speech recording into a single low-dimensional feature vector called an i-vector [7-9]. The i-vector extractors described in [7-9] are gender-dependent and are followed by gender-dependent generative modeling stages. Similar to [6], we instead use a gender-independent i-vector extractor and a mixture of male and female probabilistic linear discriminant analysis (PLDA) models, where the gender label is treated as a latent variable. The framework used in this paper consists of the following stages: i-vector extraction, generative modeling of i-vectors, and scoring (likelihood ratio computation), as described in [6]. A detailed description of mixture PLDA model-based i-vector speaker verification can be found in [6].

4 Experiments

4.1 Experimental Setup

We conducted experiments on the extended core-core condition of the NIST 2010 SRE extended list. The performance of the feature normalization techniques was evaluated using the following metrics: the equal error rate (EER), the old normalized minimum detection cost function (DCF_old) and the new normalized minimum detection cost function (DCF_new). DCF_old and DCF_new correspond to the evaluation metrics of the NIST SREs in 2008 and 2010, respectively.

4.1.1 Feature Extraction & UBM Training

For our experiments, we use 20 MFCC features (including log-energy) augmented with their delta and double-delta coefficients, giving 60-dimensional feature vectors. The analysis frame length is 25 ms with a frame shift of 10 ms. Delta and double-delta coefficients are calculated using a 2-frame window. Silence frames are then removed according to the VAD labels, after which the feature normalization techniques (STG, STMVN and STMSN) are applied over a 300-frame window. We train a gender-independent, full-covariance universal background model (UBM) of 256 Gaussian components on NIST SRE 2004 and 2005 telephone data.

4.1.2 Training and Extraction of i-vectors

Our gender-independent i-vector extractor is of dimension 800. After training the gender-independent GMM-UBM, we train the i-vector extractor using Baum-Welch (BW) statistics extracted from the following data: the LDC releases of Switchboard II phases 2 and 3, Switchboard Cellular parts 1 and 2, Fisher data, NIST SRE 2004 and 2005 telephone data, NIST SRE 2005 microphone data, and NIST SRE 2008 interview development microphone data. To reduce the i-vector dimension, a linear discriminant analysis (LDA) projection matrix is estimated from the BW statistics by maximizing the objective function
\[
P_{\mathrm{LDA}} = \arg\max_{P} \frac{\left| P^{T} \Sigma_{b} P \right|}{\left| P^{T} \Sigma_{w} P \right|},
\]
where $\Sigma_b$ and $\Sigma_w$ denote the between- and within-class scatter matrices, respectively. $\Sigma_b$ is estimated using all telephone training data excluding Fisher data, and $\Sigma_w$ using all telephone and microphone training data excluding Fisher data. An optimal reduced dimension of 150 was determined empirically. We then extract 150-dimensional i-vectors for all training data (excluding Fisher data) by applying this projection matrix to the 800-dimensional i-vectors. For the test data, BW statistics and then 150-dimensional i-vectors are extracted by the same procedure with the same projection matrix. We also normalize the length of the i-vectors, since it has been found that length-normalizing i-vectors after the LDA projection allows the Gaussian PLDA model to give the same results as the heavy-tailed PLDA model [10], i.e., a PLDA model with heavy-tailed prior distributions [8].
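A minimal sketch of these two post-processing steps, assuming labeled training i-vectors are available; scipy's generalized symmetric eigensolver is used to maximize the objective above, and the small ridge added to the within-class scatter is an assumption for numerical stability:

```python
import numpy as np
from scipy.linalg import eigh

def lda_projection(ivectors, speaker_ids, out_dim=150):
    """Estimate P maximizing |P' Sb P| / |P' Sw P| from labeled i-vectors."""
    dim = ivectors.shape[1]
    global_mean = ivectors.mean(axis=0)
    Sb = np.zeros((dim, dim))
    Sw = np.zeros((dim, dim))
    for spk in np.unique(speaker_ids):
        X = ivectors[speaker_ids == spk]
        mu = X.mean(axis=0)
        diff = (mu - global_mean)[:, None]
        Sb += len(X) * (diff @ diff.T)      # between-class scatter
        Sw += (X - mu).T @ (X - mu)         # within-class scatter
    Sw += 1e-6 * np.eye(dim)                # ridge for numerical stability
    vals, vecs = eigh(Sb, Sw)               # generalized eigenproblem Sb v = lam Sw v
    order = np.argsort(vals)[::-1]          # leading discriminant directions
    return vecs[:, order[:out_dim]]

def length_normalize(ivectors):
    """Map each (projected) i-vector onto the unit sphere."""
    return ivectors / np.linalg.norm(ivectors, axis=1, keepdims=True)
```

Projected, length-normalized test i-vectors would then be obtained as length_normalize(test_ivectors @ P) before PLDA scoring.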

4.1.3 Training the PLDA Model

We train two PLDA models, one for males and one for females, using all the telephone and microphone training i-vectors, and then combine them to form a mixture of PLDA models in the i-vector space [6].

4.2 Results

The feature normalization techniques are evaluated on the NIST SRE 2010 corpora using the i-vector speaker verification system. Results are reported for the five evaluation conditions corresponding to det conditions 1-5 of the evaluation plan [11] (see Table 1). Table 2 presents the EERs for all normalization techniques; Tables 3 and 4 report DCF_old and DCF_new, respectively. In terms of EER and DCF_old, STG performs better than STMVN and STMSN; in terms of DCF_new, the STMSN technique is found to perform better. Regarding computational cost, STG took 16.05 s (in 64-bit MATLAB) to normalize the features extracted from a 139 s speech signal, whereas STMSN and STMVN took 1.71 s and 3.01 s, respectively. STMVN and STMSN are thus very simple methods that normalize MFCC features quickly, whereas STG is considerably more complex to implement and slower.

Table 1: Evaluation conditions (extended core-core) for the NIST 2010 SRE task.

  Condition  Task
  det1       Interview in training and test, same microphone
  det2       Interview in training and test, different microphones
  det3       Interview in training, normal vocal effort phone call over telephone channel in test
  det4       Interview in training, normal vocal effort phone call over microphone channel in test
  det5       Normal vocal effort phone call in training and test, different telephones
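For reference, a small NumPy sketch of how the EER reported in Table 2 can be computed from raw trial scores (hypothetical target/non-target score arrays; the EER is the operating point at which the miss and false-alarm rates are equal):

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """EER: error rate at the threshold where miss rate = false-alarm rate."""
    scores = np.concatenate([target_scores, nontarget_scores])
    labels = np.concatenate([np.ones(len(target_scores)),
                             np.zeros(len(nontarget_scores))])
    order = np.argsort(scores)          # sweep thresholds from low to high
    labels = labels[order]
    miss = np.cumsum(labels) / labels.sum()                 # targets below threshold
    fa = 1.0 - np.cumsum(1 - labels) / (1 - labels).sum()   # non-targets above
    idx = np.argmin(np.abs(miss - fa))
    return 0.5 * (miss[idx] + fa[idx])
```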

Table 2. Male and female det1 to det5 speaker verification results using a mixture PLDA model for the STG, STMVN and STMSN systems, measured by EER (%).

  EER (%)          STG    STMVN   STMSN
  Female   det1    2.5    2.4     2.4
           det2    5.1    4.9     5.1
           det3    3.3    3.6     3.2
           det4    3.9    4.3     4.5
           det5    3.4    3.6     3.5
  Male     det1    1.6    1.5     1.5
           det2    2.7    3.0     3.0
           det3    3.2    3.3     3.6
           det4    2.4    2.5     2.8
           det5    2.6    2.6     2.6

Table 3. Male and female det1 to det5 speaker verification results using a mixture PLDA model for the STG, STMVN and STMSN systems, measured by the normalized minimum DCF (DCF_old).

  DCF_old          STG    STMVN   STMSN
  Female   det1    0.12   0.12    0.12
           det2    0.24   0.24    0.24
           det3    0.18   0.17    0.17
           det4    0.20   0.20    0.21
           det5    0.16   0.18    0.17
  Male     det1    0.07   0.07    0.07
           det2    0.14   0.15    0.14
           det3    0.14   0.16    0.16
           det4    0.11   0.11    0.11
           det5    0.13   0.13    0.13

Table 4. Male and female det1 to det5 speaker verification results using a mixture PLDA model for the STG, STMVN and STMSN systems, measured by the normalized minimum DCF (DCF_new).

  DCF_new          STG    STMVN   STMSN
  Female   det1    0.40   0.38    0.40
           det2    0.67   0.63    0.64
           det3    0.56   0.56    0.56
           det4    0.57   0.56    0.57
           det5    0.47   0.54    0.51
  Male     det1    0.25   0.29    0.26
           det2    0.49   0.47    0.45
           det3    0.53   0.54    0.48
           det4    0.37   0.37    0.33
           det5    0.46   0.48    0.46

5 Conclusion

In this paper a simple feature normalization method, called STMSN, was introduced, and its performance in the context of an i-vector speaker verification system was compared to that of the STG and STMVN techniques. Both the STMVN and STMSN methods provide i-vector speaker verification results comparable to those of STG. STG is considerably more complex and takes longer to normalize MFCC feature vectors than STMSN and STMVN.

References

1. Pelecanos, J., Sridharan, S.: Feature warping for robust speaker verification. In: Proc. Speaker Odyssey: The Speaker Recognition Workshop, Crete, Greece, pp. 213-218 (2001).
2. Xiang, B., Chaudhari, U., Navratil, J., Ramaswamy, G., Gopinath, R.: Short-time Gaussianization for robust speaker verification. In: Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), vol. 1, Orlando, Florida, USA, pp. 681-684 (2002).
3. Furui, S.: Cepstral analysis technique for automatic speaker verification. IEEE Trans. Acoustics, Speech, Signal Process. 29(2), 254-272 (1981).
4. Atal, B.: Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification. J. Acoust. Soc. Am. 55(6), 1304-1312 (1974).
5. Hermansky, H., Morgan, N.: RASTA processing of speech. IEEE Trans. Speech Audio Process. 2(4), 578-589 (1994).
6. Senoussaoui, M., Kenny, P., Brummer, N., de Villiers, E., Dumouchel, P.: Mixture of PLDA models in i-vector space for gender-independent speaker recognition. In: Interspeech 2011 (to appear), Florence, Italy, August (2011).
7. Dehak, N., Kenny, P., Dehak, R., Dumouchel, P., Ouellet, P.: Front-end factor analysis for speaker verification. IEEE Trans. Audio, Speech, Language Process. 19(4), 788-798 (2011).
8. Kenny, P.: Bayesian speaker verification with heavy-tailed priors. In: Proc. Odyssey Speaker and Language Recognition Workshop, Brno, Czech Republic, June (2010).
9. Brümmer, N., de Villiers, E.: The speaker partitioning problem. In: Proc. Odyssey Speaker and Language Recognition Workshop, Brno, Czech Republic, June (2010).
10. Garcia-Romero, D., Espy-Wilson, C.Y.: Analysis of i-vector length normalization in speaker recognition systems. In: Interspeech 2011 (to appear), Florence, Italy, August (2011).
11. National Institute of Standards and Technology: NIST Speaker Recognition Evaluation, http://www.itl.nist.gov/iad/mig/tests/sre/.