Comparative Evaluation of Feature Normalization Techniques for Speaker Verification

Md Jahangir Alam 1,2, Pierre Ouellet 1, Patrick Kenny 1, Douglas O'Shaughnessy 2

1 CRIM, Montreal, Canada  {Janagir.Alam, Pierre.Ouellet, Patrick.Kenny}@crim.ca
2 INRS-EMT, University of Quebec, Montreal, Canada  dougo@emt.inrs.ca

Abstract. This paper investigates several feature normalization techniques for use in an i-vector speaker verification system based on a mixture probabilistic linear discriminant analysis (PLDA) model. The objective of feature normalization is to compensate for the effects of environmental mismatch. Here, we study the short-time Gaussianization (STG), short-time mean and variance normalization (STMVN), and short-time mean and scale normalization (STMSN) techniques. Our goal is to compare the performance of these techniques on the telephone (det5) and microphone (det1, det2, det3 and det4) conditions of the NIST SRE 2010 corpora. Experimental results show that the performances of the STMVN and STMSN techniques are comparable to that of the STG technique.

Keywords: Speaker verification, feature normalization, STG, STMVN.

1 Introduction

Most state-of-the-art speaker verification systems perform well in controlled conditions, where data are collected in reasonably clean environments. Acoustic mismatch between training and testing environments, however, can severely degrade the performance of these systems, and this degradation has been a barrier to the deployment of speaker recognition technologies. Feature normalization strategies are employed in speaker recognition systems to compensate for the effects of environmental mismatch. These techniques are attractive because they require neither a priori knowledge of the environment nor adaptation data. Most normalization techniques are applied as a post-processing scheme on Mel-frequency cepstral coefficient (MFCC) features.

Normalization techniques can be classified as model-based or data distribution-based. In model-based techniques, certain statistical properties of speech, such as the mean, variance, and higher-order moments, are normalized to reduce the residual mismatch in the feature vectors; cepstral mean normalization (CMN), mean and variance normalization (MVN), STMVN and STMSN fall into this category. Data distribution-based techniques aim at normalizing the feature distribution to a reference distribution; STG and histogram normalization fall into this category.

A number of feature normalization techniques have been proposed for speaker verification, including feature warping [1], STG [2], CMN [3], and RASTA filtering [4-5]. In this paper, our goal is to perform a comparative evaluation of three feature normalization methods, STG, STMVN, and a new method called STMSN (similar in spirit to STMVN), in an i-vector speaker verification system employing a mixture PLDA model. For our experiments, we use the latest NIST 2010 SRE benchmark data. We use a gender-independent i-vector extractor and then form a mixture PLDA model by training and combining two gender-dependent models as in [6], where the gender label is treated as a latent (or hidden) variable.

2 Feature Normalization Techniques

In feature normalization, the components of the feature vector are scaled or warped so as to enable more effective modeling of speaker differences. Most normalization techniques are applied in the cepstral domain, as shown in Fig. 1. Speaker verification systems generally use acoustic front-ends similar to those of speech recognition systems: pre-processing (DC removal and pre-emphasis), short-time spectrum estimation, Mel-filter bank integration, cepstral coefficient computation via the discrete cosine transform (DCT), appending delta and double-delta coefficients, removing silence frames, and finally feature normalization.

Fig. 1. Block diagram of MFCC feature extraction with feature normalization as a post-processing scheme (speech signal → pre-processing → framing & windowing → STFT and spectrum estimation → Mel-filter bank integration → logarithmic nonlinearity → DCT → MFCC → append delta & double-delta → silence removal using VAD labels → feature normalization).
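For concreteness, the following is a minimal sketch of such a front-end in Python using the librosa library. It is not the authors' implementation: librosa's zeroth cepstral coefficient stands in for the log-energy term, the 8 kHz sampling rate is an assumption for telephone-band audio, and the VAD labels are assumed to be supplied externally.

```python
import numpy as np
import librosa

def mfcc_frontend(wav_path, vad_labels, n_mfcc=20):
    """MFCC front-end: static + delta + double-delta, then silence removal.

    vad_labels: hypothetical boolean array, one entry per 10 ms frame
    (True = speech), produced by an external VAD.
    """
    y, sr = librosa.load(wav_path, sr=8000)  # telephone-band audio (assumption)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=int(0.025 * sr),        # 25 ms analysis window
                                hop_length=int(0.010 * sr))   # 10 ms frame shift
    feats = np.vstack([mfcc,
                       librosa.feature.delta(mfcc),
                       librosa.feature.delta(mfcc, order=2)]).T  # (frames, 60)
    n = min(len(vad_labels), len(feats))
    return feats[:n][vad_labels[:n]]  # drop silence frames before normalization
```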

CMN (or MVN) is normally performed over the whole utterance, under the assumption that the channel effect is constant across the entire utterance. Moreover, normalizing a feature vector over the entire utterance is not feasible in real-time applications, as it causes an unnecessarily long processing delay. To relax this assumption and to reduce the processing delay, MFCC features are instead normalized over a sliding window of 3-5 s duration. Here we use a 3 s sliding window (i.e., 300 frames for a frame shift of 10 ms) for all three methods. The feature vector to be normalized is located at the centre of the sliding window.

2.1 Short-time Gaussianization (STG)

STG [2] aims at modifying the short-time feature distribution to follow a reference distribution, for example the standard normal distribution. It begins with a global linear transformation of the features, followed by short-time windowed cumulative distribution function (CDF) matching. The linear transformation in the feature space leads to local independence or decorrelation. If $X$ is the original feature set and $A$ is the transformation matrix, STG can be implemented in two steps [2]: first, linearly transform the original features, $Y = AX$; second, apply short-time windowed feature warping $T$ to $Y$, $\hat{X} = T(Y)$. In STG each feature vector is warped independently [2]. STG performs better than the feature warping technique [1].

2.2 Short-time Mean and Variance Normalization (STMVN)

In the short-time mean and variance normalization (STMVN) technique, the $m$-th frame and $k$-th cepstral coefficient $C(m,k)$ is normalized as
\[
C_{\mathrm{stmvn}}(m,k) = \frac{C(m,k) - \mu_{st}(m,k)}{\sigma_{st}(m,k)},
\]
where $m$ and $k$ denote the frame index and the cepstral coefficient index, respectively, and $L$ is the sliding-window length in frames. The short-time mean $\mu_{st}$ and standard deviation $\sigma_{st}$ are defined as
\[
\mu_{st}(m,k) = \frac{1}{L} \sum_{j=m-L/2}^{m+L/2} C(j,k), \qquad
\sigma_{st}(m,k) = \sqrt{\frac{1}{L} \sum_{j=m-L/2}^{m+L/2} \big(C(j,k) - \mu_{st}(m,k)\big)^{2}}.
\]
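A minimal NumPy sketch of STMVN under these definitions follows. The handling of edge frames, where the full 300-frame window is unavailable, is an assumption (the paper does not specify it); here the window is simply clipped at the utterance boundaries.

```python
import numpy as np

def stmvn(C, L=300):
    """Short-time mean and variance normalization of MFCC features.

    C: (num_frames, num_coeffs) feature matrix.
    L: window length in frames (300 frames = 3 s at a 10 ms frame shift).
    """
    num_frames = C.shape[0]
    out = np.empty_like(C, dtype=float)
    for m in range(num_frames):
        lo, hi = max(0, m - L // 2), min(num_frames, m + L // 2 + 1)
        window = C[lo:hi]                    # window centred on frame m
        mu = window.mean(axis=0)             # short-time mean
        sigma = window.std(axis=0)           # short-time standard deviation
        out[m] = (C[m] - mu) / np.maximum(sigma, 1e-8)  # avoid division by zero
    return out
```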

2.3 Short-time Cepstral Mean and Scale Normalization (STMSN)

Given a lower bound $b_l$ and an upper bound $b_u$ of a feature component $x$, scale normalization can be defined as
\[
\hat{x}_{sn} = \frac{x - b_l}{b_u - b_l}, \tag{1}
\]
so that $\hat{x}_{sn}$ lies in the range $[0,1]$. MVN transforms the feature component $x$ into a random variable with zero mean and unit variance,
\[
\hat{x}_{mvn} = \frac{x - \mu}{\sigma}, \tag{2}
\]
where $\mu$ and $\sigma$ are the sample mean and standard deviation of the feature, respectively. From (1) and (2) we define a new normalization technique, called the mean and scale normalization (MSN) technique, as
\[
\hat{x}_{msn} = \frac{x - \mu}{b_u - b_l}. \tag{3}
\]
In the short-time mean and scale normalization (STMSN) technique, the $m$-th frame and $k$-th cepstral coefficient $C(m,k)$ is normalized as
\[
C_{\mathrm{stmsn}}(m,k) = \frac{C(m,k) - \mu_{st}(m,k)}{d_{st}(m,k)},
\]
where $\mu_{st}(m,k)$ and $d_{st}(m,k)$ are the short-time mean and the short-time difference between the upper and lower bounds, respectively, defined as
\[
\mu_{st}(m,k) = \frac{1}{L} \sum_{j=m-L/2}^{m+L/2} C(j,k), \qquad
d_{st}(m,k) = \max_{m-L/2 \le j \le m+L/2} C(j,k) \;-\; \min_{m-L/2 \le j \le m+L/2} C(j,k).
\]
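Under the same windowing assumptions as the STMVN sketch above, STMSN differs only in the denominator, which becomes the short-time range of each coefficient:

```python
import numpy as np

def stmsn(C, L=300):
    """Short-time mean and scale normalization of MFCC features."""
    num_frames = C.shape[0]
    out = np.empty_like(C, dtype=float)
    for m in range(num_frames):
        lo, hi = max(0, m - L // 2), min(num_frames, m + L // 2 + 1)
        window = C[lo:hi]                               # window centred on frame m
        mu = window.mean(axis=0)                        # short-time mean
        d = window.max(axis=0) - window.min(axis=0)     # short-time range d_st
        out[m] = (C[m] - mu) / np.maximum(d, 1e-8)      # avoid division by zero
    return out
```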

3 I-vector Framework for Speaker Verification

The i-vector framework has set a new performance standard in speaker verification research. An i-vector extractor converts an entire speech recording into a single low-dimensional feature vector called an i-vector [7-9]. The i-vector extractors described in [7-9] are gender-dependent and are followed by gender-dependent generative modeling stages. Similar to [6], we instead use a gender-independent i-vector extractor and a mixture of male and female probabilistic linear discriminant analysis (PLDA) models, where the gender label is treated as a latent variable. The framework used in this paper consists of the following stages: i-vector extraction, generative modeling of i-vectors, and scoring (likelihood ratio computation), as described in [6]. A detailed description of mixture PLDA model-based i-vector speaker verification can be found in [6].

4 Experiments

4.1 Experimental Setup

We conducted experiments on the extended core-core condition of the NIST 2010 SRE extended list. The performance of the feature normalization techniques was evaluated using the following metrics: the equal error rate (EER), the old normalized minimum detection cost function (DCF_old) and the new normalized minimum detection cost function (DCF_new). DCF_old and DCF_new correspond to the evaluation metrics of the NIST SREs in 2008 and 2010, respectively.

4.1.1 Feature Extraction & UBM Training

For our experiments, we use 20 MFCC features (including log-energy) augmented with their delta and double-delta coefficients, giving 60-dimensional feature vectors. The analysis frame length is 25 ms with a frame shift of 10 ms. Delta and double-delta coefficients are calculated using a 2-frame window. Silence frames are then removed according to the VAD labels, after which the feature normalization techniques (STG, STMVN and STMSN) are applied over a 300-frame window. We train a gender-independent, full-covariance universal background model (UBM) of 256 Gaussian components on NIST SRE 2004 and 2005 telephone data.

4.1.2 Training and Extraction of i-vectors

Our gender-independent i-vector extractor is of dimension 800. After training the gender-independent GMM-UBM, we train the i-vector extractor using Baum-Welch (BW) statistics extracted from the following data: the LDC releases of Switchboard II phases 2 and 3, Switchboard Cellular parts 1 and 2, Fisher data, NIST SRE 2004 and 2005 telephone data, NIST SRE 2005 microphone data, and NIST SRE 2008 interview development microphone data. To reduce the i-vector dimension, a linear discriminant analysis (LDA) projection matrix is estimated from the BW statistics by maximizing the objective function
\[
P_{\mathrm{LDA}} = \arg\max_{P} \frac{\left| P^{T} \Sigma_{b} P \right|}{\left| P^{T} \Sigma_{w} P \right|},
\]
where $\Sigma_b$ and $\Sigma_w$ denote the between- and within-class scatter matrices, respectively. $\Sigma_b$ is estimated using all telephone training data excluding Fisher data, and $\Sigma_w$ using all telephone and microphone training data excluding Fisher data. An optimal reduced dimension of 150 was determined empirically. We then extract 150-dimensional i-vectors for all training data (excluding Fisher data) by applying this projection matrix to the 800-dimensional i-vectors. For the test data, BW statistics and then 150-dimensional i-vectors are extracted by the same procedure with the same projection matrix. We also normalize the length of the i-vectors, since it has been found that length-normalizing i-vectors after the LDA projection allows the Gaussian PLDA model to give the same results as the heavy-tailed PLDA model [10], i.e., a PLDA model with heavy-tailed prior distributions [8].
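A minimal sketch of these two post-processing steps, assuming labeled training i-vectors are available; scipy's generalized symmetric eigensolver is used to maximize the objective above, and the small ridge added to the within-class scatter is an assumption for numerical stability:

```python
import numpy as np
from scipy.linalg import eigh

def lda_projection(ivectors, speaker_ids, out_dim=150):
    """Estimate P maximizing |P' Sb P| / |P' Sw P| from labeled i-vectors."""
    dim = ivectors.shape[1]
    global_mean = ivectors.mean(axis=0)
    Sb = np.zeros((dim, dim))
    Sw = np.zeros((dim, dim))
    for spk in np.unique(speaker_ids):
        X = ivectors[speaker_ids == spk]
        mu = X.mean(axis=0)
        diff = (mu - global_mean)[:, None]
        Sb += len(X) * (diff @ diff.T)      # between-class scatter
        Sw += (X - mu).T @ (X - mu)         # within-class scatter
    Sw += 1e-6 * np.eye(dim)                # ridge for numerical stability
    vals, vecs = eigh(Sb, Sw)               # generalized eigenproblem Sb v = lam Sw v
    order = np.argsort(vals)[::-1]          # leading discriminant directions
    return vecs[:, order[:out_dim]]

def length_normalize(ivectors):
    """Map each (projected) i-vector onto the unit sphere."""
    return ivectors / np.linalg.norm(ivectors, axis=1, keepdims=True)
```

Projected, length-normalized test i-vectors would then be obtained as length_normalize(test_ivectors @ P) before PLDA scoring.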

4.1.3 Training the PLDA Model

We train two PLDA models, one for males and one for females, using all the telephone and microphone training i-vectors, and then combine them to form a mixture of PLDA models in the i-vector space [6].

4.2 Results

The feature normalization techniques are evaluated on the NIST SRE 2010 corpora using the i-vector speaker verification system. Results are reported for the five evaluation conditions corresponding to det conditions 1-5 of the evaluation plan [11] (see Table 1). Table 2 presents the EERs for all normalization techniques; Tables 3 and 4 report DCF_old and DCF_new, respectively. In terms of EER and DCF_old, STG performs better than STMVN and STMSN; in terms of DCF_new, the STMSN technique is found to perform better. Regarding computational cost, STG took 16.05 s (in 64-bit MATLAB) to normalize the features extracted from a 139 s speech signal, whereas STMSN and STMVN took 1.71 s and 3.01 s, respectively. STMVN and STMSN are thus very simple methods that normalize MFCC features quickly, whereas STG is considerably more complex to implement and slower.

Table 1: Evaluation conditions (extended core-core) for the NIST 2010 SRE task.

  Condition  Task
  det1       Interview in training and test, same microphone
  det2       Interview in training and test, different microphones
  det3       Interview in training, normal vocal effort phone call over telephone channel in test
  det4       Interview in training, normal vocal effort phone call over microphone channel in test
  det5       Normal vocal effort phone call in training and test, different telephones
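For reference, a small NumPy sketch of how the EER reported in Table 2 can be computed from raw trial scores (hypothetical target/non-target score arrays; the EER is the operating point at which the miss and false-alarm rates are equal):

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """EER: error rate at the threshold where miss rate = false-alarm rate."""
    scores = np.concatenate([target_scores, nontarget_scores])
    labels = np.concatenate([np.ones(len(target_scores)),
                             np.zeros(len(nontarget_scores))])
    order = np.argsort(scores)          # sweep thresholds from low to high
    labels = labels[order]
    miss = np.cumsum(labels) / labels.sum()                 # targets below threshold
    fa = 1.0 - np.cumsum(1 - labels) / (1 - labels).sum()   # non-targets above
    idx = np.argmin(np.abs(miss - fa))
    return 0.5 * (miss[idx] + fa[idx])
```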

Table 2. Male and female det1 to det5 speaker verification results using a mixture PLDA model for the STG, STMVN and STMSN systems, measured by EER (%).

  EER (%)          STG    STMVN   STMSN
  Female   det1    2.5    2.4     2.4
           det2    5.1    4.9     5.1
           det3    3.3    3.6     3.2
           det4    3.9    4.3     4.5
           det5    3.4    3.6     3.5
  Male     det1    1.6    1.5     1.5
           det2    2.7    3.0     3.0
           det3    3.2    3.3     3.6
           det4    2.4    2.5     2.8
           det5    2.6    2.6     2.6

Table 3. Male and female det1 to det5 speaker verification results using a mixture PLDA model for the STG, STMVN and STMSN systems, measured by the normalized minimum DCF (DCF_old).

  DCF_old          STG    STMVN   STMSN
  Female   det1    0.12   0.12    0.12
           det2    0.24   0.24    0.24
           det3    0.18   0.17    0.17
           det4    0.20   0.20    0.21
           det5    0.16   0.18    0.17
  Male     det1    0.07   0.07    0.07
           det2    0.14   0.15    0.14
           det3    0.14   0.16    0.16
           det4    0.11   0.11    0.11
           det5    0.13   0.13    0.13

Table 4. Male and female det1 to det5 speaker verification results using a mixture PLDA model for the STG, STMVN and STMSN systems, measured by the normalized minimum DCF (DCF_new).

  DCF_new          STG    STMVN   STMSN
  Female   det1    0.40   0.38    0.40
           det2    0.67   0.63    0.64
           det3    0.56   0.56    0.56
           det4    0.57   0.56    0.57
           det5    0.47   0.54    0.51
  Male     det1    0.25   0.29    0.26
           det2    0.49   0.47    0.45
           det3    0.53   0.54    0.48
           det4    0.37   0.37    0.33
           det5    0.46   0.48    0.46

5 Conclusion

In this paper a simple feature normalization method, called STMSN, was introduced, and its performance in the context of an i-vector speaker verification system was compared to that of the STG and STMVN techniques. Both the STMVN and STMSN methods provide i-vector speaker verification results comparable to those of STG. STG is considerably more complex and takes longer to normalize MFCC feature vectors than STMSN and STMVN.

References

1. Pelecanos, J., Sridharan, S.: Feature warping for robust speaker verification. In: Proc. Speaker Odyssey: The Speaker Recognition Workshop, Crete, Greece, pp. 213-218 (2001).
2. Xiang, B., Chaudhari, U., Navratil, J., Ramaswamy, G., Gopinath, R.: Short-time Gaussianization for robust speaker verification. In: Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), vol. 1, Orlando, Florida, USA, pp. 681-684 (2002).
3. Furui, S.: Cepstral analysis technique for automatic speaker verification. IEEE Trans. Acoustics, Speech, Signal Process. 29(2), 254-272 (1981).
4. Atal, B.: Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification. J. Acoust. Soc. Am. 55(6), 1304-1312 (1974).
5. Hermansky, H., Morgan, N.: RASTA processing of speech. IEEE Trans. Speech Audio Process. 2(4), 578-589 (1994).
6. Senoussaoui, M., Kenny, P., Brummer, N., de Villiers, E., Dumouchel, P.: Mixture of PLDA models in i-vector space for gender-independent speaker recognition. In: Interspeech 2011 (to appear), Florence, Italy, August (2011).
7. Dehak, N., Kenny, P., Dehak, R., Dumouchel, P., Ouellet, P.: Front-end factor analysis for speaker verification. IEEE Trans. Audio, Speech, Language Process. 19(4), 788-798 (2011).
8. Kenny, P.: Bayesian speaker verification with heavy-tailed priors. In: Proc. Odyssey Speaker and Language Recognition Workshop, Brno, Czech Republic, June (2010).
9. Brümmer, N., de Villiers, E.: The speaker partitioning problem. In: Proc. Odyssey Speaker and Language Recognition Workshop, Brno, Czech Republic, June (2010).
10. Garcia-Romero, D., Espy-Wilson, C.Y.: Analysis of i-vector length normalization in speaker recognition systems. In: Interspeech 2011 (to appear), Florence, Italy, August (2011).
11. National Institute of Standards and Technology: NIST Speaker Recognition Evaluation, http://www.itl.nist.gov/iad/mig/tests/sre/.