ABC submission for NIST SRE 2016


1 ABC submission for NIST SRE 2016
Agnitio+BUT+CRIM
Oldrich Plchot, Pavel Matejka, Ondrej Novotny, Anna Silnova, Johan Rohdin, Mireia Diez, Ondrej Glembek, Xiaowei Jiang, Lukas Burget, Martin Karafiat, Lucas Ondel, Frantisek Grezl, Niko Brummer, Albert Swart, Paola García, Jesús Jorrín, Luis Buera, Patrick Kenny, Jahangir Alam, Gautam Bhattacharya
December 11, San Diego, NIST SRE 2016

2 Overview
- General system architecture
- Agnitio
- CRIM
- Data, features, system architecture
- Speaker Classifier Network (SCN)
- Scoring with Beta-Bernoulli Backend
- BUT
- System design
- Clustering analysis
- Results
- BUT subsystems
- Analysis with PLDA system
- Analysis with DPLDA system
- BUT DEV set design
- Different flavours of calibration/fusion
- Results
- Conclusions

3 System architecture
[Diagram; block labels: SUM Fusion, LR calibration (BUT DEV), LR fusion (BUT DEV), Linear cal. (DEV16), SUM Fusion, MMFBG Fusion (DEV16), NIG calibration (DEV16), Calibrated score]

4 Agnitio's System
Conservative and simple system
Interesting:
- Stacking of i-vectors
- NDA + norm
- BNF (pnorm, 5 hidden layers)
Two main systems based on the position of the bottleneck:
- BNF-2: second layer
- BNF-4: fourth layer
Fusion


5 Agnitio's System
Interesting ideas we explored that didn't work well:
- Clustering: language and gender (HC and score PLDA)
- Adaptation of the NDA
- Normalizing according to clustering
In dev we could observe clusters:
- Unlabeled and labeled minor
- Unlabeled major


7 Agnitio's System
Results:
- Development (equalized): EER, mincprim and actcprim for BNF-4, BNF-2 and FUSION [values not preserved]
- Eval (equalized): EER, mincprim and actcprim [values not preserved]
Conclusions and highlights:
- Normalization helped.
- Speaker clustering was difficult and didn't help in normalization.

8 CRIM Site for NIST SRE 2016 Jahangir Alam, Patrick Kenny, Gautam Bhattacharya CRIM NIST SRE 2016 Workshop

9 Outline
- Data preparation
- Feature Extraction
- Training and Extraction of I-vectors
- Speaker Classifier Network
- Beta-Bernoulli Backend
- Results

10 Data preparation
- OBD: Mandarin, Chinese, and Tagalog from NIST SREs 2004-2008.
- PBD: Switchboard + all recordings from NIST SREs, excluding the Mandarin, Chinese, and Tagalog.
- SRE16UNLABELED: unlabeled training data from SRE16.
- OD: OBD + SRE16UNLABELED.

11 Feature Extraction
- MFCC (60-dimensional, MFCC_E_D_A)
- LFCC (60-dimensional, LFCC_E_D_A)
- LPCC (60-dimensional, LPCC_E_D_A)

12 Training and Extraction of I-vectors


13 Speaker Classifier Network
The SCN is two layers deep and uses a sigmoid non-linearity in the hidden layers. Each hidden layer consists of 2000 hidden units.
The softmax output distribution is over the 4323 speakers in the background set (Primary Background Data + Oriental Background Data).
We extract the activations of the last hidden layer and treat them as feature vectors (d-vectors) for speaker verification.
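The d-vector extraction described on this slide can be sketched as a plain forward pass. This is a toy illustration, not CRIM's implementation: the layer sizes are shrunk (the slides use 2000 hidden units and 4323 speakers), the weights are random rather than trained, and feeding an i-vector as the network input is an assumption. All function names are hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(x, weights, biases):
    # weights holds one row per output unit
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, biases)]

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def init_layer(n_in, n_out, rng):
    # Random weights stand in for trained parameters (assumption).
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def scn_forward(x, params):
    """Return (d_vector, speaker_posteriors) for one input vector."""
    (w1, b1), (w2, b2), (w_out, b_out) = params
    h1 = [sigmoid(v) for v in dense(x, w1, b1)]
    h2 = [sigmoid(v) for v in dense(h1, w2, b2)]  # last hidden layer
    posteriors = softmax(dense(h2, w_out, b_out))
    return h2, posteriors

# Toy sizes only; the slides use 2000 hidden units and 4323 speakers.
INPUT_DIM, HIDDEN, N_SPEAKERS = 8, 16, 5
rng = random.Random(0)
params = [init_layer(INPUT_DIM, HIDDEN, rng),
          init_layer(HIDDEN, HIDDEN, rng),
          init_layer(HIDDEN, N_SPEAKERS, rng)]
ivector = [rng.gauss(0.0, 1.0) for _ in range(INPUT_DIM)]
d_vector, posteriors = scn_forward(ivector, params)
```

The softmax layer is only needed for training; at verification time the sketch keeps `d_vector`, whose entries lie in (0, 1) because of the sigmoid.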

14 Beta-Bernoulli Backend (1/2)
For each node in the last hidden layer, we compare the activations on the enrollment side and the test side by supposing them to be generated by a biased coin toss:
- The probability of heads is drawn from a Beta prior.
- One draw for the same-speaker hypothesis.
- Two draws for the different-speaker hypothesis.
The Beta priors (one per node) are trainable.
Reference: T. Minka, "Estimating a Dirichlet distribution", 2012.
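One way to turn the coin-toss description into a score is the log-likelihood ratio between "one draw" and "two draws" of the heads probability, summed over nodes. The sketch below is an assumption about how this could be done, not the authors' code: it treats each sigmoid activation as a fractional count of heads, takes the Beta parameters `(a, b)` as given instead of training them, and uses hypothetical function names.

```python
import math

def log_beta(a, b):
    # log of the Beta function B(a, b)
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def node_llr(x_enroll, x_test, a, b):
    """LLR for one node with a Beta(a, b) prior on the heads probability.

    x_enroll and x_test are activations in [0, 1], used here as
    fractional heads counts (an assumption of this sketch).
    """
    heads = x_enroll + x_test
    tails = 2.0 - heads
    # Same speaker: both observations share a single draw from the prior.
    log_same = log_beta(a + heads, b + tails) - log_beta(a, b)
    # Different speakers: each observation gets its own draw.
    log_diff = (log_beta(a + x_enroll, b + 1.0 - x_enroll) - log_beta(a, b)
                + log_beta(a + x_test, b + 1.0 - x_test) - log_beta(a, b))
    return log_same - log_diff

def beta_bernoulli_llr(enroll, test, priors):
    """Sum the per-node LLRs; priors is a list of (a, b) pairs, one per node."""
    return sum(node_llr(e, t, a, b)
               for e, t, (a, b) in zip(enroll, test, priors))
```

With a uniform prior, `a = b = 1`, two agreeing binary activations give log(4/3) and two disagreeing ones give log(2/3), so agreement on a node pushes the score towards the same-speaker hypothesis.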

15 Beta-Bernoulli Backend (2/2)

16 Results on Evaluation Data

17 Results on Evaluation Data

18 DPLDA_PLP - single best system on DEV, submitted as contrastive 2
*SVM_PLP - failed on eval, caused some miscalibration
*PLDA_MFCC - used NDA instead of LDA, miscalibrated on eval
DPLDA_MFCC - initialized from PLDA_MFCC, much better calibrated
PLDA_TEL_PLP - PLP, only telephone data for PLDA training
PLDA_TEL_PERS - Perseus, only telephone data for PLDA training
PLDA_MFCCSBN - MFCC+Bottleneck features
Bottleneck on Fisher English - fixed condition
Bottleneck on BABEL languages - open condition
* Problems with calibration on eval

19 Analysis on MFCC PLDA system

20 Feature comparison with PLDA system
i-vector system with 2048G/600ivec/L2norm/200lda, WCC(gender,lang,train+unlabeled data), adaptive z+t-norm.
Results computed on all trials from the eval key.
[Table: feadim, EER[%] and mincprim for the features MFCC, PLP, MFCC+SBN80-BABEL (open cond), Perseus and MFCC+SBN80-Fisher; values not preserved]
All features perform about the same. BN features show no superiority of the kind we saw on SRE2010 data.
More analysis and comparisons in our SLT paper.

21 BUT DEV set
- We used the PRISM language condition for calibration/fusion.
- We split the segments into short cuts to reflect the speech duration in enroll/test of DEV16.
- We split it into a calibration part and a test part (no jack-knifing).
- We added multi-enroll trials.
- For the purposes of calibration/fusion, we used only non-English trials.
- We tried to add short segments of non-English training data into the PLDA. This improved neither the PLDA nor the DPLDA results.
- No DEV16 annotated data was used in the BUT part of the submission.
- In DPLDA, we used unlabeled data to form non-target trials.

22 Calibration/Fusion
- LR: optimizing cross-entropy on the supervised DEV set, training shift and scale: s_cal = a*s + b
- Linear: f = ( (s-mnon).^2 - (s-mtar).^2 )/(2*v);
- Quadratic: f = ( (s-mnon).^2/vnon + log(vnon) - (s-mtar).^2/vtar - log(vtar) )/2;
- Normal Inverse Gaussian (NIG) distribution:
  - Can contain fat-tailed, skewed, or classical normal distributions.
  - The NIG process is a special case of a Levy process (Brownian motion, Poisson process).
  - Calibrate: tarlh=nig_logpdf(betat,gammat,deltat,mut,s); nonlh=nig_logpdf(betan,gamman,deltan,mun,s); llr = tarlh - nonlh;
- Multiclass Multivariate Fully Bayesian Classifier (MMFBG)
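The two Gaussian formulas on this slide can be sketched in a few lines. This is an illustrative reading, not BUT's code: it assumes mtar/vtar and mnon/vnon are the mean and variance of the target and non-target scores on a supervised calibration set, and that the shared variance v is their count-weighted pool. The function names are hypothetical.

```python
import math
from statistics import fmean, pvariance

def fit_gaussian_calibration(tar_scores, non_scores):
    """Estimate score statistics from supervised calibration trials."""
    mtar, mnon = fmean(tar_scores), fmean(non_scores)
    vtar = pvariance(tar_scores, mtar)
    vnon = pvariance(non_scores, mnon)
    n_tar, n_non = len(tar_scores), len(non_scores)
    # Pooled variance for the shared-variance model (an assumption here).
    v = (n_tar * vtar + n_non * vnon) / (n_tar + n_non)
    return {"mtar": mtar, "mnon": mnon, "vtar": vtar, "vnon": vnon, "v": v}

def linear_llr(s, p):
    # Shared variance: the s**2 terms cancel, so the LLR is linear in s.
    return ((s - p["mnon"]) ** 2 - (s - p["mtar"]) ** 2) / (2.0 * p["v"])

def quadratic_llr(s, p):
    # Separate variances: the s**2 terms survive, so the LLR is quadratic in s.
    return ((s - p["mnon"]) ** 2 / p["vnon"] + math.log(p["vnon"])
            - (s - p["mtar"]) ** 2 / p["vtar"] - math.log(p["vtar"])) / 2.0

# Made-up scores, purely for illustration.
params = fit_gaussian_calibration([2.0, 3.0, 4.0], [-4.0, -3.0, -2.0])
llr_lin = linear_llr(1.0, params)
llr_quad = quadratic_llr(1.0, params)
```

When the two classes have equal variance, as in the made-up scores above, the quadratic form reduces to the linear one; the two differ only when vtar and vnon differ.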

23 System architecture
[Diagram; block labels: SUM Fusion, LR calibration (BUT DEV), LR fusion (BUT DEV), Linear cal. (DEV16), SUM Fusion, MMFBG Fusion (DEV16), NIG calibration (DEV16), Calibrated score, Contrastive sys.]

24 ABC primary system (NIG_CAL)

25 ABC primary system (Q_CAL)

26 BUT Fusion (contrastive1) - LR

27 no SVM,DPLDA+NDA - MMFBG, BUT DEV, QCAL

28 MMFBG on DEV16, QCAL

29 Results - equalized, NIST scoring tool
[Table: EER[%], mincprim and actcprim for the systems below; values not preserved]
- ABC_PRIMARY_NIGCAL
- ABC_PRIMARY_QCAL
- Agnitio (SUM, NIGCAL)
- CRIM (SUM, QCAL)
- BUT (CONTRASTIVE_FIX on BUT_DEV)
- BUT_DPLDA_PLP (CONTRASTIVE2, LR)

30 Conclusions
- Evaluation was challenging
- Calibration was not a big issue
- We were not able to successfully exploit clustering on unlabeled DEV data
- There is probably a big channel mismatch between SRE16 and all older MIXER data
- Even systems like relevance MAP, eigenchannel comp., SVM, etc. were competitive
- Small models (256G, 400ivec) were performing close to the big ones
- BN features and BN+MFCC were not outperforming single MFCC or PLP system
- Although we had a hard time to fuse on small DEV16
- We designed an out-of-domain dataset that provided good calibration for eval (BUT_DEV)
- We were positively surprised by the DPLDA performance
- There is definitely some room for improvement and tuning

31 THANK YOU We are happy for the dataset with a lot of room for improvement and research :)

32 Analysis with DPLDA

33 Analysis on MFCC PLDA system

34 Analysis on MFCC PLDA system II
s-norm* = mean(t-norm, z-norm); the 500 closest i-vectors were used (based on score).
nne - non-native English cuts, noe - non-English cuts
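The s-norm* definition above can be sketched as follows. This is a minimal reading under stated assumptions, not BUT's implementation: "closest" is taken to mean the highest-scoring cohort entries, the z-norm statistics come from the enrollment model scored against the cohort, and the t-norm statistics come from the test segment scored against the cohort. The function names are hypothetical.

```python
from statistics import fmean, pstdev

def _normalize(score, cohort_scores, top_n):
    # Keep only the top_n highest cohort scores, i.e. the closest cohort entries.
    closest = sorted(cohort_scores, reverse=True)[:top_n]
    # pstdev is zero if all selected scores are equal; callers should guard
    # against that case in real use.
    return (score - fmean(closest)) / pstdev(closest)

def adaptive_snorm(score, enroll_cohort_scores, test_cohort_scores, top_n=500):
    """Mean of adaptive z-norm and adaptive t-norm for one trial score."""
    z = _normalize(score, enroll_cohort_scores, top_n)  # enrollment vs cohort
    t = _normalize(score, test_cohort_scores, top_n)    # test vs cohort
    return 0.5 * (z + t)

# Made-up scores, purely for illustration.
normalized = adaptive_snorm(5.0, [1.0, 2.0, 3.0, -10.0],
                            [2.0, 4.0, 6.0, -10.0], top_n=3)
```

Selecting the cohort per trial is what makes the normalization adaptive: the far-away cohort score of -10 in the example is ignored on both sides.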

35 Analysis with DPLDA
1. Baseline DPLDA system, trained on the telephone part of Mixer+Fisher+Switchboard. Uses PLP-based 600-dim i-vectors.
2. The same as 1, but with SRE16 unlabeled data added to the training.
3. The same as 2 with NAP applied (we use 20 language classes from the training set and one class for unlabeled data).
4. The same as 3, with snorm_easy applied. Snorm is calculated on unlabeled data only. This is what went into the BUT fusion.
5. The same as 4, but instead of snorm_easy, we apply adaptive snorm. The size of the cohort is set to 200; snorm is again calculated only on unlabeled data.
6. Contrastive 2 system. The same as 4, but SRE16 development data is also added to the training set.
7. Similar to what Nuance did: the same as the previous one, but instead of adding the development data once, we add it 6 times.
8. A corrupted version of the development data is added to the training set.
9. The same as 6, but instead of snorm_easy, we apply adaptive snorm.

36 DPLDA results
[Table: EER[%] and mincprim on SRE16 dev and SRE16 eval; values not preserved. Surviving row labels: 1. DPLDA (2048G/600ivec/250LDA, tel data); SRE16 unlab; NAP; snorm; asnorm; SRE16 dev data; xSRE16 dev data; corr SRE16 dev data; SRE16 dev data]

37 BUT - SVM classifier
- One SVM per speaker, trained using the enrollment i-vector(s) as positive samples and the unlabeled major and unlabeled minor data as negative samples.
- Length normalization, WCCN and NAP were applied to the i-vectors. Trained with telephone data from Mixer+Fisher+Switchboard. The classes for NAP were the languages present in the training data.
- ZT-Norm was applied to system scores.
- ZNorm was trained on a subset of Chinese utterances from the training portion of the non-English short cuts, plus the data from the unlabeled major and unlabeled minor sets.
- TNorm was trained with the SVM models trained on Chinese cuts, using the unlabeled major and minor sets as background data (negative samples).
