Vulnerability of Voice Verification System with STC anti-spoofing detector to different methods of spoofing attacks
Vadim Shchemelinin 1,2, Alexandr Kozlov 2, Galina Lavrentyeva 2, Sergey Novoselov 1,2 and Konstantin Simonchik 1,2

1 ITMO University, St. Petersburg, Russia
2 Speech Technology Center Limited, St. Petersburg, Russia

Abstract. This paper explores the robustness of a text-independent voice verification system against different spoofing attacks based on speech synthesis and voice conversion techniques. Our experiments show that the most dangerous attacks are those based on speech synthesis, but a spoofing detection module based on the standard TV-JFA approach can reduce the False Acceptance error rate of the whole speaker recognition system from 80% to 1%.

Keywords: spoofing, anti-spoofing, speaker recognition, TV, SVM

1 Introduction

Speaker verification systems have become widespread in recent years. They are used in many areas of our lives: forensic research, physical access control systems, banking, and the web. The two main roles such systems play in everyday life are usability enhancement and security. To perform these functions, a voice verification system must be highly robust, especially when it guards access to a bank account or personal information. For this reason, it is important to continuously assess the stability of voice verification systems against spoofing attacks. The greatest threat comes from automatable spoofing methods based on speech synthesis or voice conversion. The works [1, 2] show that such attack methods may raise the false acceptance rate to unacceptable values. Alongside the growing security threat, methods for detecting such attacks have been developed. However, the question of their reliability and performance evaluation is still open.
The aim of our study was to determine the most dangerous spoofing methods for a modern verification system working together with a spoofing detection module.
2 Voice Verification System with Anti-spoofing

2.1 Voice Verification Module

One of the standard use-cases of text-independent voice verification systems is creating a client voice model and comparing it with the client's enrolled (etalon) model while the user interacts with an IVR (Interactive Voice Response) system in a call-center. The user calls the call-center and uses voice commands to navigate the IVR menu. Throughout the call session, the client's speech is sent to the verification system, which builds a voice model and estimates whether access to confidential information should be granted. In our experiments an i-vector based speaker recognition system was used. Before feature extraction, a signal preprocessing module was applied. It included energy-based voice activity detection, clipping detection [3], and pulse and multi-tonal noise detection. Pre-emphasis was also applied, and the speech signal was divided into 22 ms window frames with a 50% overlap and, as in the spoofing detection module, multiplied by a Hamming window function. As front-end features, 13 MFCC features of each frame with their first and second derivatives were selected. The derivatives were estimated over a 5-frame context, and cepstral mean subtraction (CMS) was applied to the cepstral coefficients. For acoustic space modelling we used Total Variability super-vectors with the Probabilistic LDA approach (TV-PLDA) to achieve better performance [4, 5]. According to this approach, the super-vector can be expressed as follows:

µ = m + T ω + ϵ,

where µ is the super-vector of the Gaussian Mixture Model (GMM) parameters of the speaker model, m is the super-vector of the Universal Background Model (UBM) parameters, T is the TV matrix defining the basis in the reduced feature space, ω is the i-vector in the reduced feature space, ω ∼ N(0, I), and ϵ is the error vector. In our system the dimension of the TV space was 600 and the UBM was gender-independent with 512 components.
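The front-end post-processing described above (cepstral mean subtraction plus first and second derivatives over a 5-frame context) can be sketched as follows. This is a minimal illustration: the MFCC extraction itself is assumed to be done by an external tool, so the sketch starts from a matrix of 13 MFCCs per frame.

```python
import numpy as np

def deltas(feats, context=2):
    """Regression-based derivatives over a (2*context+1)-frame window
    (5 frames for context=2, as in the text)."""
    denom = 2 * sum(k * k for k in range(1, context + 1))
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    d = np.zeros_like(feats)
    n = len(feats)
    for k in range(1, context + 1):
        d += k * (padded[context + k: context + k + n]
                  - padded[context - k: context - k + n]) / denom
    return d

def front_end(mfcc):
    """13 MFCCs per frame -> CMS -> append deltas and delta-deltas (39-dim)."""
    mfcc = mfcc - mfcc.mean(axis=0, keepdims=True)  # cepstral mean subtraction
    d1 = deltas(mfcc)   # first derivatives
    d2 = deltas(d1)     # second derivatives
    return np.hstack([mfcc, d1, d2])

feats = front_end(np.random.randn(200, 13))
print(feats.shape)  # (200, 39)
```

After CMS the per-coefficient mean of the static features is zero, which removes constant channel effects from the cepstrum.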
The UBM was obtained by standard ML-training on the telephone part of the NIST SRE datasets (all languages, both genders) [6, 7]. In our study we used more than 4000 training speakers in total. We used a diagonal rather than a full-covariance GMM UBM. The i-vector extractor and PLDA matrix were trained on telephone and microphone recordings from the NIST datasets comprising more than 4000 speakers' voices.

2.2 Spoofing Detection Module

The spoofing detection method was used in the considered speaker verification system as a preliminary step. It was first introduced in the ASVspoof Challenge 2015 [8]
and achieved 3.922% EER for unknown types of spoofing attacks and 0.008% EER for known spoofing attacks. It should be mentioned that zero spoofing detection error was achieved for the HMM-based spoofing attacks of the ASVspoof Challenge evaluation base. That was the motivation to include this method in the ASV system. The anti-spoofing method consists of four main components:

- Pre-detector
- Acoustic feature extractor
- TV i-vector extractor
- SVM classifier

The pre-detector checks whether the input signal has zero temporal energy and, in that case, declares the signal a spoofing attack. Otherwise acoustic features are extracted from the signal. As front-end acoustic features we used 12 Mel-Frequency Cepstral Coefficients (MFCC), 12 Mel-Frequency Principal Coefficients (MFPC) and 12 Cos-Phase Principal Coefficients (CosPhasePC) based on the phase spectrum, each with its first and second derivatives. To obtain these coefficients, Hamming windowing was used with a 256-sample window length and 50% overlap. For acoustic space modelling we used the standard TV-JFA approach, which is the state of the art in speaker verification [7, 9, 10]. In this version of joint factor analysis, the i-vector of the Total Variability space is extracted by means of a JFA modification, which is a usual Gaussian factor analyser defined on the mean super-vectors of the Universal Background Model (UBM) and the Total Variability matrix T. The UBM was represented by a Gaussian mixture model (GMM) of the described features; the diagonal-covariance UBM was trained by the standard EM-algorithm. For the anti-spoofing method the UBM was a 1024-component Gaussian mixture model of the described features.

Fig. 1. Voice Verification System with Anti-spoofing scheme
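The detector's control flow (pre-detector, then features/i-vector, then SVM) can be sketched as follows. This is a minimal illustration only: the i-vector extractor and the SVM decision function are hypothetical stand-ins, since the real components require trained models.

```python
import numpy as np

def detect_spoofing(signal, extract_ivector, svm_decision):
    """Sketch of the detector pipeline described above:
    pre-detector -> acoustic features / i-vector -> SVM classifier."""
    # Pre-detector: a signal with zero temporal energy is
    # immediately declared a spoofing attack.
    if np.sum(np.square(signal, dtype=np.float64)) == 0.0:
        return True
    # Otherwise extract an i-vector and let the SVM decide.
    ivec = extract_ivector(signal)
    return svm_decision(ivec) > 0.0

# Hypothetical stand-ins for the trained components.
toy_ivector = lambda s: np.array([float(np.std(s))])
toy_svm = lambda v: 1.0 if v[0] < 0.1 else -1.0   # "too flat" -> spoof

silence = np.zeros(16000)
speech = np.sin(np.linspace(0, 2000, 16000))
print(detect_spoofing(silence, toy_ivector, toy_svm))  # True  (pre-detector fires)
print(detect_spoofing(speech, toy_ivector, toy_svm))   # False
```

The early exit for zero-energy signals is cheap and catches degenerate inputs before any model-based scoring is attempted.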
2.3 Fusion Decision Module

The Fusion Decision Module fuses the speaker recognition module output with the spoofing detection module output, as shown in figure 1. The decision made by the verification and spoofing detection modules is expressed as

P = P_verification * (1 - P_spoofing),

where P_verification is the probability that the speaker in the test recording is the same as the speaker in the etalon recording, and P_spoofing is the probability that the test recording is a spoofing attack. To calculate probabilities from scores, we used the BOSARIS toolkit [18].

3 Experiments with Different Types of Spoofing

Fig. 2. DET curves for verification system without spoofing detection module against different methods of attacks

To examine the vulnerability of the Voice Verification System to different methods of spoofing attacks we used the ASVspoof development dataset [11]. It includes genuine and spoofed speech of 35 speakers, 15 male and 20 female, with 3497 genuine and spoofed trials. The spoofed speech is generated according to one of five spoofing methods (S1 - S5) as follows:
S1 - A voice conversion method using a simplified frame selection algorithm [12, 13]; the converted speech is generated by selecting target speech frames.
S2 - The simplest voice conversion algorithm [14], which adjusts only the first mel-cepstral coefficient in order to shift the slope of the source spectrum towards the target.
S3 - A Hidden Markov model based speech synthesis system using speaker adaptation techniques [15] and only 20 adaptation utterances.
S4 - The same Hidden Markov model based speech synthesis system [15] with 40 adaptation utterances.
S5 - A method based on a voice conversion toolkit built with the Festvox system [16].

First, we checked how strongly the FA error rate increased when the voice verification system did not contain the spoofing detection module. At this step we also wanted to make sure that the spoofing techniques proposed for the ASVspoof Challenge 2015 were a real threat to the verification system. As the baseline we used only the genuine speech of all speakers from the previously described dataset. It is interesting to note that S2, based on conversion of the first mel-cepstral coefficient, gives the greatest detection error [17], while this method has the least impact on the verification system without the spoofing detector, as shown in figure 2.

Fig. 3. DET curves for verification system with spoofing detection module against different methods of attacks
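The fusion rule P = P_verification * (1 - P_spoofing) and the measurement of FA on spoofed trials at the baseline EER-point threshold can be sketched as follows. All score distributions here are synthetic and purely illustrative.

```python
import numpy as np

def fuse(p_verification, p_spoofing):
    """P = P_verification * (1 - P_spoofing), as defined above."""
    return p_verification * (1.0 - p_spoofing)

def eer_threshold(genuine, impostor):
    """Score threshold where false rejection equals false acceptance."""
    candidates = np.sort(np.concatenate([genuine, impostor]))
    gaps = [abs(np.mean(genuine < t) - np.mean(impostor >= t))
            for t in candidates]
    return candidates[int(np.argmin(gaps))]

def fa_rate(scores, threshold):
    """Fraction of trials accepted at the given threshold."""
    return float(np.mean(scores >= threshold))

rng = np.random.default_rng(1)
genuine = rng.normal(2.0, 1.0, 2000)     # target-trial scores
impostor = rng.normal(-2.0, 1.0, 2000)   # impostor-trial scores
spoof = rng.normal(1.5, 1.0, 2000)       # spoofed trials mimic targets

t = eer_threshold(genuine, impostor)     # baseline EER-point threshold
print(round(fuse(0.9, 0.2), 2))          # 0.72
print(fa_rate(spoof, t) > 0.5)           # spoofing inflates FA at this point
```

Fixing the threshold at the baseline EER point, as the next section does for Table 1, lets the FA rate on spoofed trials be compared with and without the detector at the same operating point.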
The results of the experiments with the spoofing detection module enabled are presented in figure 3. Additionally, table 1 compares the FA values at the baseline EER-point threshold with the spoofing detection module switched on and off.

Table 1. FA verification error for spoofing the verification system based on different algorithms (FA at the threshold set at the EER point).

Voice Verification system                     S1      S2     S3      S4      S5
Without spoofing detection module             52.5%   1.7%   68.5%   77.1%   63.7%
With TV-JFA based spoofing detection module   0.36%   0%     0.23%   1.35%   0.98%

As can be seen from the table, adding spoofing detection significantly improves the FA error rate. The results also demonstrate that synthesis-based spoofing methods are more dangerous than those based on voice conversion techniques.

4 Conclusions

In this paper we analyzed the vulnerability of a voice verification system based on state-of-the-art speaker recognition and spoofing detection methods against different spoofing methods based on text-to-speech and voice conversion algorithms. As the experiments demonstrated, spoofing with a TTS voice is more threatening than the other methods: the Hidden Markov model based speech synthesis spoofing method gave a 1.35% False Acceptance error, compared to 0.98% for the method based on a voice conversion toolkit. It can also be concluded that spoofing detection methods should be evaluated together with voice verification systems. Firstly, a spoofing detector may appear reliable only because the spoofing attacks it faces are ineffective. Secondly, the system EER can be increased by the false acceptance errors of the spoofing detector itself. Our results show once again that it is highly necessary to test verification systems against different spoofing methods, and to develop anti-spoofing algorithms that are reliable in real use-cases.

This work was partially financially supported by the Government of the Russian Federation, Grant 074-U01.

References

1. Shchemelinin V., Simonchik K.: Examining Vulnerability of Voice Verification Systems to Spoofing Attacks by Means of a TTS System. In: Proceedings of SPECOM 2013, Plzen, Czech Republic, September 2013 (2013)
2. Shchemelinin V., Topchina M., Simonchik K.: Vulnerability of Voice Verification Systems to Spoofing Attacks with TTS Voices Based on Automatically Labeled Telephone Speech. In: Lecture Notes in Computer Science, vol. 8773 (2014)
3. Aleinik S., Matveev Y.N.: Detection of Clipped Fragments in Speech Signals. International Journal of Electrical, Electronic Science and Engineering, 8(2) (2014)
4. Kenny P.: Bayesian speaker verification with heavy tailed priors. In: Proceedings of the Odyssey Speaker and Language Recognition Workshop, Brno, Czech Republic (2010)
5. Simonchik K., Pekhovsky T., Shulipa A., Afanasyev A.: Supervized Mixture of PLDA Models for Cross-Channel Speaker Verification. In: Proceedings of Interspeech 2012, Portland, Oregon, USA, September 9-13 (2012)
6. Matveev Yu., Simonchik K.: The speaker identification system for the NIST SRE. In: Proceedings of the 20th International Conference on Computer Graphics and Vision, GraphiCon 2010, St. Petersburg, Russia (2010)
7. Kozlov A., Kudashev O., Matveev Yu., Pekhovsky T., Simonchik K., Shulipa A.: SVID speaker recognition system for the NIST SRE. Lecture Notes in Computer Science (LNCS), vol. 8113 (2013)
8. Wu Z., Kinnunen T., Evans N., Yamagishi J.: ASVspoof 2015: Automatic Speaker Verification Spoofing and Countermeasures Challenge Evaluation Plan, Dec. 19, 2014
9. Novoselov S., Pekhovsky T., Simonchik K.: STC Speaker Recognition System for the NIST i-vector Challenge. In: Proc. Odyssey: The Speaker and Language Recognition Workshop (2014)
10. Kinnunen T., Li H.: An overview of text-independent speaker recognition: from features to supervectors. Speech Communication, vol. 52 (2010)
11. Wu Z., Kinnunen T., Evans N., Yamagishi J., Hanilçi C., Sahidullah M., Sizov A.: ASVspoof 2015: the First Automatic Speaker Verification Spoofing and Countermeasures Challenge. spoofingchallenge.org/is2015_asvspoof.pdf
12. Dutoit T., Holzapfel A., Jottrand M., Moinet A., Perez J., Stylianou Y.: Towards a voice conversion system based on frame selection. In: Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP)
13. Wu Z., Virtanen T., Kinnunen T., Chng E., Li H.: Exemplar-based unit selection for voice conversion utilizing temporal information. In: Proc. Interspeech
14. Fukada T., Tokuda K., Kobayashi T., Imai S.: An adaptive algorithm for mel-cepstral analysis of speech. In: Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP)
15. Yamagishi J., Kobayashi T., Nakano Y., Ogata K., Isogai J.: Analysis of speaker adaptation algorithms for HMM-based speech synthesis and a constrained SMAPLR adaptation algorithm. IEEE Trans. Audio, Speech and Language Processing, vol. 17, no. 1, pp. 66-83
16. Festvox project
17. Novoselov S., Kozlov A., Lavrentyeva G., Simonchik K., Shchemelinin V.: STC Anti-spoofing Systems for the ASVspoof 2015 Challenge. wp-content/uploads/2015/06/technical_report_asvspoof2015_stc.pdf
18. BOSARIS Toolkit
More informationWriter Identification In Music Score Documents Without Staff-Line Removal
Writer Identification In Music Score Documents Without Staff-Line Removal Anirban Hati, Partha P. Roy and Umapada Pal Computer Vision and Pattern Recognition Unit Indian Statistical Institute Kolkata,
More informationOptimization of Observation Membership Function By Particle Swarm Method for Enhancing Performances of Speaker Identification
Proceedings of the 6th WSEAS International Conference on SIGNAL PROCESSING, Dallas, Texas, USA, March 22-24, 2007 52 Optimization of Observation Membership Function By Particle Swarm Method for Enhancing
More informationSimultaneous Design of Feature Extractor and Pattern Classifer Using the Minimum Classification Error Training Algorithm
Griffith Research Online https://research-repository.griffith.edu.au Simultaneous Design of Feature Extractor and Pattern Classifer Using the Minimum Classification Error Training Algorithm Author Paliwal,
More informationInput speech signal. Selected /Rejected. Pre-processing Feature extraction Matching algorithm. Database. Figure 1: Process flow in ASR
Volume 5, Issue 1, January 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Feature Extraction
More informationAudiovisual Synchrony Detection with Optimized Audio Features
Audiovisual Synchrony Detection with Optimized Audio Features Sami Sieranoja, Md Sahidullah, Tomi Kinnunen School of Computing University of Eastern Finland, Joensuu, Finland Jukka Komulainen, Abdenour
More informationBiometrics already form a significant component of current. Biometrics Systems Under Spoofing Attack
[ Abdenour Hadid, Nicholas Evans, Sébastien Marcel, and Julian Fierrez ] s Systems Under Spoofing Attack [ An evaluation methodology and lessons learned ] istockphoto.com/greyfebruary s already form a
More informationBaseball Game Highlight & Event Detection
Baseball Game Highlight & Event Detection Student: Harry Chao Course Adviser: Winston Hu 1 Outline 1. Goal 2. Previous methods 3. My flowchart 4. My methods 5. Experimental result 6. Conclusion & Future
More informationDiscriminative training and Feature combination
Discriminative training and Feature combination Steve Renals Automatic Speech Recognition ASR Lecture 13 16 March 2009 Steve Renals Discriminative training and Feature combination 1 Overview Hot topics
More informationOutline. Incorporating Biometric Quality In Multi-Biometrics FUSION. Results. Motivation. Image Quality: The FVC Experience
Incorporating Biometric Quality In Multi-Biometrics FUSION QUALITY Julian Fierrez-Aguilar, Javier Ortega-Garcia Biometrics Research Lab. - ATVS Universidad Autónoma de Madrid, SPAIN Loris Nanni, Raffaele
More informationBiometrics Technology: Multi-modal (Part 2)
Biometrics Technology: Multi-modal (Part 2) References: At the Level: [M7] U. Dieckmann, P. Plankensteiner and T. Wagner, "SESAM: A biometric person identification system using sensor fusion ", Pattern
More informationApplications of Keyword-Constraining in Speaker Recognition. Howard Lei. July 2, Introduction 3
Applications of Keyword-Constraining in Speaker Recognition Howard Lei hlei@icsi.berkeley.edu July 2, 2007 Contents 1 Introduction 3 2 The keyword HMM system 4 2.1 Background keyword HMM training............................
More informationHow accurate is AGNITIO KIVOX Voice ID?
How accurate is AGNITIO KIVOX Voice ID? Overview Using natural speech, KIVOX can work with error rates below 1%. When optimized for short utterances, where the same phrase is used for enrolment and authentication,
More informationSpeech Recognition Lecture 8: Acoustic Models. Eugene Weinstein Google, NYU Courant Institute Slide Credit: Mehryar Mohri
Speech Recognition Lecture 8: Acoustic Models. Eugene Weinstein Google, NYU Courant Institute eugenew@cs.nyu.edu Slide Credit: Mehryar Mohri Speech Recognition Components Acoustic and pronunciation model:
More informationFUSION MODEL BASED ON CONVOLUTIONAL NEURAL NETWORKS WITH TWO FEATURES FOR ACOUSTIC SCENE CLASSIFICATION
Please contact the conference organizers at dcasechallenge@gmail.com if you require an accessible file, as the files provided by ConfTool Pro to reviewers are filtered to remove author information, and
More informationXing Fan, Carlos Busso and John H.L. Hansen
Xing Fan, Carlos Busso and John H.L. Hansen Center for Robust Speech Systems (CRSS) Erik Jonsson School of Engineering & Computer Science Department of Electrical Engineering University of Texas at Dallas
More informationMulti-Modal Human Verification Using Face and Speech
22 Multi-Modal Human Verification Using Face and Speech Changhan Park 1 and Joonki Paik 2 1 Advanced Technology R&D Center, Samsung Thales Co., Ltd., 2 Graduate School of Advanced Imaging Science, Multimedia,
More informationNeetha Das Prof. Andy Khong
Neetha Das Prof. Andy Khong Contents Introduction and aim Current system at IMI Proposed new classification model Support Vector Machines Initial audio data collection and processing Features and their
More informationOn-line Signature Verification on a Mobile Platform
On-line Signature Verification on a Mobile Platform Nesma Houmani, Sonia Garcia-Salicetti, Bernadette Dorizzi, and Mounim El-Yacoubi Institut Telecom; Telecom SudParis; Intermedia Team, 9 rue Charles Fourier,
More informationLARGE-SCALE SPEAKER IDENTIFICATION
2014 IEEE International Conference on Acoustic, Speech and Signal Processing (ICASSP) LARGE-SCALE SPEAKER IDENTIFICATION Ludwig Schmidt MIT Matthew Sharifi and Ignacio Lopez Moreno Google, Inc. ABSTRACT
More informationMultimodal Biometric System by Feature Level Fusion of Palmprint and Fingerprint
Multimodal Biometric System by Feature Level Fusion of Palmprint and Fingerprint Navdeep Bajwa M.Tech (Student) Computer Science GIMET, PTU Regional Center Amritsar, India Er. Gaurav Kumar M.Tech (Supervisor)
More informationEFFECTIVE METHODOLOGY FOR DETECTING AND PREVENTING FACE SPOOFING ATTACKS
EFFECTIVE METHODOLOGY FOR DETECTING AND PREVENTING FACE SPOOFING ATTACKS 1 Mr. Kaustubh D.Vishnu, 2 Dr. R.D. Raut, 3 Dr. V. M. Thakare 1,2,3 SGBAU, Amravati,Maharashtra, (India) ABSTRACT Biometric system
More informationSecure E- Commerce Transaction using Noisy Password with Voiceprint and OTP
Secure E- Commerce Transaction using Noisy Password with Voiceprint and OTP Komal K. Kumbhare Department of Computer Engineering B. D. C. O. E. Sevagram, India komalkumbhare27@gmail.com Prof. K. V. Warkar
More informationChapter 3. Speech segmentation. 3.1 Preprocessing
, as done in this dissertation, refers to the process of determining the boundaries between phonemes in the speech signal. No higher-level lexical information is used to accomplish this. This chapter presents
More informationProduction of Video Images by Computer Controlled Cameras and Its Application to TV Conference System
Proc. of IEEE Conference on Computer Vision and Pattern Recognition, vol.2, II-131 II-137, Dec. 2001. Production of Video Images by Computer Controlled Cameras and Its Application to TV Conference System
More informationRLAT Rapid Language Adaptation Toolkit
RLAT Rapid Language Adaptation Toolkit Tim Schlippe May 15, 2012 RLAT Rapid Language Adaptation Toolkit - 2 RLAT Rapid Language Adaptation Toolkit RLAT Rapid Language Adaptation Toolkit - 3 Outline Introduction
More informationUsing Gradient Descent Optimization for Acoustics Training from Heterogeneous Data
Using Gradient Descent Optimization for Acoustics Training from Heterogeneous Data Martin Karafiát Λ, Igor Szöke, and Jan Černocký Brno University of Technology, Faculty of Information Technology Department
More informationREAL-TIME ROAD SIGNS RECOGNITION USING MOBILE GPU
High-Performance Сomputing REAL-TIME ROAD SIGNS RECOGNITION USING MOBILE GPU P.Y. Yakimov Samara National Research University, Samara, Russia Abstract. This article shows an effective implementation of
More informationEM Algorithm with Split and Merge in Trajectory Clustering for Automatic Speech Recognition
EM Algorithm with Split and Merge in Trajectory Clustering for Automatic Speech Recognition Yan Han and Lou Boves Department of Language and Speech, Radboud University Nijmegen, The Netherlands {Y.Han,
More informationAudio-visual interaction in sparse representation features for noise robust audio-visual speech recognition
ISCA Archive http://www.isca-speech.org/archive Auditory-Visual Speech Processing (AVSP) 2013 Annecy, France August 29 - September 1, 2013 Audio-visual interaction in sparse representation features for
More informationHIDDEN Markov model (HMM)-based statistical parametric
1492 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 20, NO. 5, JULY 2012 Minimum Kullback Leibler Divergence Parameter Generation for HMM-Based Speech Synthesis Zhen-Hua Ling, Member,
More informationLec 08 Feature Aggregation II: Fisher Vector, Super Vector and AKULA
Image Analysis & Retrieval CS/EE 5590 Special Topics (Class Ids: 44873, 44874) Fall 2016, M/W 4-5:15pm@Bloch 0012 Lec 08 Feature Aggregation II: Fisher Vector, Super Vector and AKULA Zhu Li Dept of CSEE,
More informationFigure 1. Example sample for fabric mask. In the second column, the mask is worn on the face. The picture is taken from [5].
ON THE VULNERABILITY OF FACE RECOGNITION SYSTEMS TO SPOOFING MASK ATTACKS Neslihan Kose, Jean-Luc Dugelay Multimedia Department, EURECOM, Sophia-Antipolis, France {neslihan.kose, jean-luc.dugelay}@eurecom.fr
More information2. Basic Task of Pattern Classification
2. Basic Task of Pattern Classification Definition of the Task Informal Definition: Telling things apart 3 Definition: http://www.webopedia.com/term/p/pattern_recognition.html pattern recognition Last
More informationQuery-by-example spoken term detection based on phonetic posteriorgram Query-by-example spoken term detection based on phonetic posteriorgram
International Conference on Education, Management and Computing Technology (ICEMCT 2015) Query-by-example spoken term detection based on phonetic posteriorgram Query-by-example spoken term detection based
More informationGender Classification Technique Based on Facial Features using Neural Network
Gender Classification Technique Based on Facial Features using Neural Network Anushri Jaswante Dr. Asif Ullah Khan Dr. Bhupesh Gour Computer Science & Engineering, Rajiv Gandhi Proudyogiki Vishwavidyalaya,
More informationMulti-modal Person Identification in a Smart Environment
Multi-modal Person Identification in a Smart Environment Hazım Kemal Ekenel 1, Mika Fischer 1, Qin Jin 2, Rainer Stiefelhagen 1 1 Interactive Systems Labs (ISL), Universität Karlsruhe (TH), 76131 Karlsruhe,
More information