Multifactor Fusion for Audio-Visual Speaker Recognition


Proceedings of the 7th WSEAS International Conference on Signal, Speech and Image Processing, Beijing, China, September 15-17, 2007

GIRIJA CHETTY and DAT TRAN
School of Information Sciences and Engineering
University of Canberra, ACT 2601, Australia
girija.chetty@canberra.edu.au, dat.tran@canberra.edu.au

Abstract: In this paper we propose a multifactor hybrid fusion approach for enhancing security in audio-visual speaker verification. Speaker verification experiments conducted on two audiovisual databases, VidTIMIT and UCBN, show that multifactor hybrid fusion, which combines feature-level fusion of lip-voice features with score-level fusion of face-lip-voice features, is a powerful technique for speaker identity verification, as it preserves the synchronisation of the closely coupled modalities of a speaking face, such as face, voice and lip dynamics, through the various stages of authentication. An improvement in error rate of the order of 22-36% is achieved by using feature-level fusion of acoustic and visual feature vectors from the lip region, as compared to the classical late fusion approach.

Key-Words: Audio-visual, Multifactor, Hybrid fusion, Speaker recognition, Impostor attacks

1 Introduction

By using multiple cues concurrently for authentication, systems gain more immunity to intruder attacks [1], as it is more difficult for an impostor to impersonate another person across multiple cues, such as audio and visual cues, simultaneously. In addition, multiple cues such as those from face and voice also help improve system reliability. For instance, while background noise has a detrimental effect on the performance of voice biometrics, it has no influence on face biometrics. Conversely, while the performance of face recognition systems depends heavily on lighting conditions, lighting has no effect on voice quality [2].

The classical approaches to multimodal fusion, late fusion and its variants in particular, have been investigated in great depth. Late fusion, or fusion at the score level, involves combining the scores of different classifiers, each of which has made an independent decision. This means, however, that many of the correlation properties of the joint audio-video data are lost. Fusion at feature level, on the other hand, can substantially improve the performance of multimodal systems, as the feature sets provide a richer source of information than the matching scores and because, in this mode, features are extracted from the raw data and subsequently combined. An appropriate combination of feature fusion and the traditional late fusion approach allows complete modelling of the static and dynamic components of a speaking face, making the speaker identity verification system less vulnerable to impostor attacks.

This paper proposes a novel multifactor hybrid audiovisual fusion technique for addressing impostor attacks under environmental degradations. The hybrid approach, involving feature-level fusion of audio-lip features followed by score-level fusion with face features, allows robust speaker verification: the synchronisation between closely coupled modalities such as voice and lip movements is modelled better when feature fusion is combined with the traditional late fusion approach, making the system less vulnerable to impostor attacks. The remainder of this paper is organised as follows.
The next section describes the speaking-face data used for the experiments. Section 3 describes the feature extraction and fusion methods. Section 4 examines the performance of the proposed hybrid fusion approach in the presence of adverse environmental conditions, such as acoustic noise and visual artefacts, as well as its sensitivity to training data size and utterance length. Section 5 concludes the paper and outlines plans for further work.

2 Speaking Face Data

We used two different types of speaking-face data to evaluate the proposed multifactor fusion. The first is the multimodal person authentication database VidTIMIT [4], which consists of video and corresponding audio recordings of 43 people (19 female and 24 male). The mean duration of each sentence is around 4 seconds, or approximately 100 video frames. The data were recorded with a broadcast-quality digital video camera in a noisy office environment. The video of each person is stored as a sequence of JPEG images with a resolution of 512 × 384 pixels (columns × rows), with the corresponding audio provided as a monophonic, 16-bit, 32 kHz PCM file.

The second is the UCBN database, a free-to-air broadcast news database. Broadcast news is a continuous source of video sequences that can be easily obtained or recorded, and it offers optimal illumination, colour and sound recording conditions. However, some attributes of broadcast news, such as near-frontal images, smaller facial regions, multiple faces and complex backgrounds, require an efficient face detection and tracking scheme. The database consists of 20-40 second video clips of anchor persons and newsreaders, with frontal or near-frontal shots of 10 different faces (5 female and 5 male). Each video sample is a 25 frames-per-second MPEG2 encoded stream with a resolution of 720 × 576 pixels, with corresponding 16-bit, 48 kHz PCM audio.

These two databases represent very different types of speaking-face data: VidTIMIT has audio recorded in a noisy office environment but a clean visual environment, whereas UCBN has clean audio and visual conditions but complex backgrounds. This allows the robustness of feature-level fusion to be examined accurately in our study. Figures 1(a) and 1(b) show sample speaking-face data from the VidTIMIT and UCBN databases.

Fig. 1: Faces from (a) VidTIMIT (b) UCBN

3 Audio Visual Fusion

The hybrid multifactor fusion is performed by late fusion of eigenface features, extracted from illumination-, scale- and pose-normalised frontal faces using principal component analysis [10], with feature-fused audio-lip features derived from the audio and different types of visual features, as described below.

3.1 Acoustic feature extraction

Mel frequency cepstral coefficients (MFCCs) derived from the cepstrum were used as the acoustic features. The pre-emphasised audio signal was processed using a 30 ms Hamming window with one-third overlap, yielding a frame rate of 50 Hz. An acoustic feature vector was determined for each frame by warping 512 spectral bands into 30 mel-spaced bands and computing 8 MFCCs. Cepstral mean normalisation was performed on all MFCCs before they were used for training, testing and evaluation. Before extracting MFCCs, the audio files from the two databases were mixed with factory noise (factory1.wav) from the Noisex-92 database [5] at a signal-to-noise ratio of 6 dB. Channel effects were then added to the noisy PCM files with a telephone line filter to simulate channel mismatch.
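
A minimal sketch of this acoustic front end, assuming librosa is available, is given below. The function names, the FFT size and the noise-mixing helper are our own illustrations rather than the authors' implementation, and the telephone-channel filter is omitted.

```python
# Sketch of the acoustic front end: 8 MFCCs from a 30-band mel filterbank,
# 30 ms Hamming window, 50 Hz frame rate, cepstral mean normalisation,
# with the input degraded by additive noise at a 6 dB SNR.
import numpy as np
import librosa

def add_noise_at_snr(clean, noise, snr_db=6.0):
    """Mix a noise recording into the clean signal at the requested SNR."""
    noise = np.resize(noise, clean.shape)           # loop/trim noise to match length
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
    return clean + scale * noise

def extract_mfcc(signal, sr=32000, n_mfcc=8):
    """8 MFCCs per frame with cepstral mean normalisation."""
    emphasised = librosa.effects.preemphasis(signal)
    win = int(0.030 * sr)                           # 30 ms analysis window
    hop = int(0.020 * sr)                           # 20 ms shift -> 50 frames/s
    mfcc = librosa.feature.mfcc(y=emphasised, sr=sr, n_mfcc=n_mfcc,
                                n_fft=1024, win_length=win, hop_length=hop,
                                window="hamming", n_mels=30)
    return mfcc - mfcc.mean(axis=1, keepdims=True)  # cepstral mean normalisation

# Example (paths are placeholders): degrade an utterance with factory noise, then extract features.
# speech, sr = librosa.load("utterance.wav", sr=32000)
# noise, _ = librosa.load("factory1.wav", sr=32000)
# feats = extract_mfcc(add_noise_at_snr(speech, noise, snr_db=6.0), sr=sr)  # shape (8, n_frames)
```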

3.2 Visual feature extraction

Before the facial features can be extracted, faces need to be detected and recognised. The face detection for video was based on skin colour analysis in the red-blue chrominance colour space, followed by deformable template matching with an average face, and finally verification with rules derived from the spatial and geometrical relationships of the facial components [6]. The lip features were obtained by determining the lip region using derivatives of the hue and saturation functions, combined with geometric constraints. Figures 2(a) to 2(c) show some results of the face detection and lip feature extraction stages. The details of the scheme are described in [6].

Fig. 2(a): A skin colour sample and the Gaussian distribution in red-blue chromatic space
Fig. 2(b): Lip region localisation using hue-saturation thresholding and detection of lip boundaries using pseudo-hue/luminance images
Fig. 2(c): Illustration of geometric features and measured key points for different lip openings

Similarly to the audio files, the video data in both databases were mixed with artificial visual artefacts, namely Gaussian blur and Gaussian noise, using a visual editing tool (Adobe Photoshop). The Gaussian Blur setting was 1.2 and the Gaussian Noise setting was 1.6.
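
The skin-colour stage can be illustrated with a small sketch: a single Gaussian fitted to labelled skin samples in normalised red-blue chromaticity (cf. Fig. 2(a)) is used to score every pixel of a frame. The later deformable-template and rule-based verification stages of [6] are not reproduced here; all function names and thresholds are illustrative assumptions.

```python
# Hedged sketch of the skin-colour stage of the face detector.
import numpy as np

def rb_chromaticity(rgb):
    """Map an HxWx3 RGB image (float or uint8) to normalised (r, b) chromaticity."""
    rgb = rgb.astype(np.float64)
    s = rgb.sum(axis=-1, keepdims=True) + 1e-6
    norm = rgb / s
    return np.stack([norm[..., 0], norm[..., 2]], axis=-1)   # (r, b) per pixel

def fit_skin_gaussian(skin_pixels_rgb):
    """Fit mean/inverse covariance of the skin model from an Nx3 array of skin RGB samples."""
    rb = rb_chromaticity(skin_pixels_rgb.reshape(-1, 1, 3)).reshape(-1, 2)
    mean = rb.mean(axis=0)
    cov = np.cov(rb, rowvar=False) + 1e-6 * np.eye(2)
    return mean, np.linalg.inv(cov)

def skin_likelihood_map(frame_rgb, mean, inv_cov):
    """Unnormalised Gaussian skin likelihood for each pixel of a video frame."""
    rb = rb_chromaticity(frame_rgb)
    diff = rb - mean
    mahal = np.einsum("...i,ij,...j->...", diff, inv_cov, diff)
    return np.exp(-0.5 * mahal)

# Example: mask = skin_likelihood_map(frame, *fit_skin_gaussian(samples)) > 0.5
```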

3.3 Joint Lip-Voice Feature Vector

To evaluate the power of the feature-level fusion part of the overall hybrid fusion strategy in preserving audiovisual synchrony, and hence increasing the robustness of the speaker verification approach, experiments were conducted with both feature fusion (also referred to as early fusion) and late fusion of the audiovisual features. For feature fusion, the audiovisual fusion involved a concatenation of the audio features (8 MFCCs) and the lip features (10 eigen-lip projections + 6 lip dimensions), and the combined feature vector was then fed to a GMM classifier. The audio features, acquired at 50 Hz, and the visual features, acquired at 25 Hz, were appropriately rate-interpolated to obtain synchronised joint audiovisual feature vectors. For late fusion, the audio and visual features were fed to independent GMM classifiers, and the weighted scores (with weight β) [8] from each stage were fed to a weighted-sum fusion unit.
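
A minimal sketch of this early-fusion step follows, assuming the MFCCs arrive at 50 Hz and the 16-dimensional lip features (10 eigen-lip projections + 6 geometric dimensions) at 25 Hz. The array layouts and function names are our own assumptions, not the authors' implementation.

```python
# Rate-interpolate the 25 Hz lip features to the 50 Hz audio frame rate and
# concatenate them with the 8 MFCCs per frame to form the joint feature vector.
import numpy as np

def rate_interpolate(features, src_rate, dst_rate, n_dst_frames):
    """Linearly interpolate a (n_src_frames, dim) feature track to n_dst_frames."""
    t_src = np.arange(features.shape[0]) / src_rate
    t_dst = np.arange(n_dst_frames) / dst_rate
    return np.stack([np.interp(t_dst, t_src, features[:, d])
                     for d in range(features.shape[1])], axis=1)

def fuse_audio_lip(mfcc, lip):
    """Concatenate 8 MFCCs (50 Hz) with 16 lip features (25 Hz) into 24-dim vectors."""
    mfcc = mfcc.T if mfcc.shape[0] == 8 else mfcc          # ensure (frames, 8)
    lip_50hz = rate_interpolate(lip, src_rate=25.0, dst_rate=50.0,
                                n_dst_frames=mfcc.shape[0])
    return np.hstack([mfcc, lip_50hz])                     # (frames, 8 + 16)

# Example shapes: mfcc (8, 200) from the acoustic front end, lip (100, 16) per clip.
# fused = fuse_audio_lip(mfcc, lip)   # -> (200, 24), ready for GMM training
```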

3.4 Hybrid Fusion of Face and Lip-Voice Features

The hybrid fusion was performed by score fusion (late fusion) of the audio-lip feature vector scores with the scores of the eigenface features of the frontal faces. The fusion weights for the audio-lip and face features were obtained by varying the combination weight β for the audio-lip and eigenface feature scores. β is varied from 0 to 1, with increasing β giving increasing weight to the facial feature score. Figure 2(d) shows the proposed hybrid fusion scheme.

Fig. 2(d): Multifactor Hybrid Fusion Scheme
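
The score-level stage of Fig. 2(d) amounts to a weighted sum of the two scores. A small illustrative sketch follows; the linear weighting rule is standard and the variable names are assumptions.

```python
# Weighted-sum score fusion of the audio-lip GMM score and the eigenface score,
# with the weight beta swept from 0 to 1 (beta = 0 uses only audio-lip,
# beta = 1 only the face score). Scores are assumed to be normalised beforehand.
import numpy as np

def hybrid_score(audio_lip_score, face_score, beta):
    """Combine the two modality scores with fusion weight beta."""
    return (1.0 - beta) * audio_lip_score + beta * face_score

# Example sweep over candidate fusion weights for a batch of trials:
# for beta in np.linspace(0.0, 1.0, 5):
#     fused = hybrid_score(audio_lip_scores, face_scores, beta)
```
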
3.5 Likelihood normalisation

The use of normalised likelihoods together with a global threshold leads to improvements in both the performance and the robustness of person authentication systems [4]. The Universal Background Model (UBM) approach [9] is the most popular normalised-likelihood approach when a Gaussian Mixture Model (GMM) classifier is used. For VidTIMIT, the data from the 24 male and 19 female clients were used to create separate gender-specific universal background models, which were then adapted to speaker models using MAP adaptation [9]. The first two utterances, common to all speakers in the corpus, were used for text-dependent experiments, while 6 different utterances for each speaker allowed text-independent verification experiments to be conducted. For the text-independent experiments, four utterances from session 1 were used for training and four utterances from sessions 2 and 3 were used for testing. For the UCBN database, the training data for both text-dependent and text-independent experiments contained 15 utterances from the 5 male and 5 female speakers, with 5 utterances for testing, each recorded in a different session. The utterances were of 20-second duration for text-dependent experiments and of 40-second duration in text-independent mode. As with VidTIMIT, separate UBMs for the male and female cohorts were created for the UCBN data.
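
A sketch of the UBM-GMM modelling assumed here follows: a gender-specific background GMM is trained on pooled features, and client models are obtained by MAP adaptation of its means, following the general recipe of Reynolds et al. [7]. scikit-learn is assumed, and the relevance factor of 16 is our choice, not a value given in the paper.

```python
# UBM training and mean-only MAP adaptation for client models.
import copy
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(background_features, n_mixtures=10):
    """Train the universal background model on pooled background data."""
    ubm = GaussianMixture(n_components=n_mixtures, covariance_type="diag",
                          max_iter=200, random_state=0)
    ubm.fit(background_features)
    return ubm

def map_adapt_means(ubm, client_features, relevance=16.0):
    """Return a client GMM whose means are MAP-adapted from the UBM."""
    resp = ubm.predict_proba(client_features)              # (n_frames, n_mixtures)
    n_k = resp.sum(axis=0) + 1e-10
    e_k = resp.T @ client_features / n_k[:, None]           # per-mixture data means
    alpha = (n_k / (n_k + relevance))[:, None]
    client = copy.deepcopy(ubm)
    client.means_ = alpha * e_k + (1.0 - alpha) * ubm.means_
    return client

# Verification score for an utterance X: average log-likelihood ratio against the UBM.
# score = client.score(X) - ubm.score(X)    # sklearn's score() returns the mean log-likelihood
```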

4 Impostor Attack Experiments

The impostor attack experiments were conducted in two phases: a training phase and a test phase. In the training phase, a 10-mixture Gaussian mixture model of the multifactor audio-visual features of each type was built, reflecting the probability densities of the combined phonemes and visemes in the audiovisual feature space. For testing, the clients' test-set recordings were evaluated against the client model λ by determining the log likelihood log p(X|λ) of the time sequence X of audiovisual feature vectors, under the usual assumption of statistical independence of subsequent feature vectors. In order to obtain suitable thresholds for distinguishing client recordings from impostor recordings, detection error tradeoff (DET) curves and equal error rates (EER) were determined. For all experiments, the threshold was set using data from the test set. A convenient notation is used for referring to the experiments in a particular mode (Table 1). Table 2 shows the number of client trials and impostor attack trials conducted for determining the EERs. The first row of Table 2, for example, refers to experiments with the VidTIMIT database in text-dependent mode for the male-only cohort, comprising a total of 48 client trials (24 clients × 2 utterances per client) and 46 impostor attack trials (23 clients × 2 utterances per client).

Table 1: Notation for the different experiments
  EER       Equal Error Rate
  LF(0.25)  Late fusion with fusion weight β = 0.25
  FF        Feature fusion
  DB1       VidTIMIT database
  DB2       UCBN database
  TDMO      Text-dependent, male-only cohort
  TDFO      Text-dependent, female-only cohort
  TIMO      Text-independent, male-only cohort
  TIFO      Text-independent, female-only cohort

Table 2: Number of client and impostor attack trials
  Speaking face data   Client trials    Impostor attack trials
  DB1TDMO              48 (24 × 2)      46 (23 × 2)
  DB1TDFO              38 (19 × 2)      36 (18 × 2)
  DB1TIMO              144 (24 × 6)     138 (23 × 6)
  DB1TIFO              114 (19 × 6)     108 (18 × 6)
  DB2TDMO              100 (5 10)       40 (4 10)
  DB2TDFO              100 (5 10)       40 (4 10)
  DB2TIMO              100 (5 10)       40 (4 10)
  DB2TIFO              100 (5 10)       40 (4 10)

A simple Z-norm based approach, as proposed in [9], was used for normalisation of all scores. Different sets of experiments were conducted to evaluate the performance of the system in terms of DET curves and equal error rates (EER).
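
The EERs reported below can be computed from pooled client and impostor trial scores. The following is a generic sketch, including the Z-norm step, using a simple threshold sweep rather than any particular DET toolkit; names and tolerances are our own.

```python
# Z-norm score normalisation and equal error rate (EER) from client/impostor scores.
import numpy as np

def z_norm(scores, impostor_cohort_scores):
    """Z-norm: standardise raw scores using an impostor cohort's score statistics."""
    mu = impostor_cohort_scores.mean()
    sigma = impostor_cohort_scores.std() + 1e-12
    return (scores - mu) / sigma

def equal_error_rate(client_scores, impostor_scores):
    """Return the EER by sweeping the decision threshold over all observed scores."""
    thresholds = np.sort(np.concatenate([client_scores, impostor_scores]))
    far = np.array([(impostor_scores >= t).mean() for t in thresholds])  # false accepts
    frr = np.array([(client_scores < t).mean() for t in thresholds])     # false rejects
    idx = np.argmin(np.abs(far - frr))
    return 0.5 * (far[idx] + frr[idx])

# Example: eer = equal_error_rate(z_norm(client_raw, cohort), z_norm(impostor_raw, cohort))
```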

The results for only two data conditions, DB1TIMO (VidTIMIT database, text-independent, male-only cohort) and DB2TDFO (UCBN database, text-dependent, female-only cohort), are reported here. To evaluate the performance improvement achieved with hybrid fusion of audio-lip and face features, experiments with late fusion of the separate audio, lip and face features were carried out in addition. For the first set of experiments, the original data from VidTIMIT and the original files from the UCBN database were used. The DET curves for the baseline performance of the system with the original data, for both hybrid fusion and late fusion, are shown in Figure 3. As can be seen in Figure 3, the baseline EER achieved with hybrid fusion (feature fusion of the audio-lip components) is 3.65% for DB1TIMO and 2.55% for DB2TDFO, compared with 8.1% (DB1TIMO) and 6.8% (DB2TDFO) achieved with late fusion of all three audio-visual components (audio, lip and face) at β = 0.75.

Fig 3: Baseline impostor attack performance (late fusion vs. hybrid fusion)

Figure 4 shows the behaviour of the system when subjected to different types of environmental degradation, as well as the EER sensitivity to variations in training data size. Once again, hybrid fusion involving feature-level fusion of the audio-lip components outperforms the late fusion technique for the different types of acoustic and visual degradation. When the audio is mixed with acoustic noise (factory noise at 6 dB SNR plus channel effects), hybrid fusion gives a significant performance improvement over late fusion at fusion weight β = 0.75. When the video is mixed with visual artefacts, the improvement in performance achieved with feature fusion is about 30.4% compared with LF (β = 0.75), and 18.9% compared with LF (β = 0.25). Table 3 and Figure 4 show the baseline EERs and the EERs achieved with visual artefacts, acoustic noise and shorter utterance lengths. The table also shows the relative drop in performance for late fusion versus feature fusion of the audio-lip components.

Fig 4: Speaker Verification System Sensitivity

The influence of training utterance length on system performance is quite remarkable and different from the other effects. The system is more sensitive to utterance length variation in hybrid fusion mode than in late fusion mode (Table 3). The drop in performance is smaller for late fusion (9.46% at β = 0.75 and 26.57% at β = 0.25) than the 42.32% drop for feature fusion on DB1TIMO; likewise, the drops are 12.15% and 24.53% compared with a 40.96% drop for DB2TDFO. The utterance length was varied from the normal 4 seconds down to 1 second for DB1TIMO, and from 20 seconds down to 5 seconds for DB2TDFO. This drop in performance is due to the larger dimensionality of the joint audiovisual feature vectors (8 MFCCs + 10 eigen-lips + 6 lip dimensions) used in the feature fusion of the audio and lip features, together with the shorter utterance length, which appears insufficient to establish the audiovisual synchrony. Longer utterances, and hence more training speech, would allow the audiovisual correspondence to be learnt and robust speaker verification to be achieved with the hybrid fusion technique.

Table 3: Relative performance with acoustic noise, visual artefacts and variation in training data size (EER in %, relative degradation in parentheses)
  Speaking face data    Baseline EER   Acoustic effects    Visual artefacts    Utterance length effects
  DB1TIMO (LF-0.75)     8.10           9.98 (-23.22%)      11.44 (-41.12%)     8.87 (-9.46%)
  DB2TDFO (LF-0.75)     6.80           8.18 (-20.22%)      9.81 (-44.92%)      7.63 (-12.15%)
  DB1TIMO (LF-0.25)     4.75           6.76 (-42.33%)      6.19 (-30.22%)      6.01 (-26.57%)
  DB2TDFO (LF-0.25)     4.35           6.10 (-40.16%)      5.75 (-32.82%)      5.42 (-24.53%)
  DB1TIMO (FF)          3.65           3.83 (-4.83%)       4.06 (-11.82%)      5.19 (-42.32%)
  DB2TDFO (FF)          2.55           2.60 (-2.06%)       2.89 (-13.42%)      3.59 (-40.96%)

5 Conclusions

In this paper we have shown that hybrid multifactor fusion of audio, lip and face features, with feature-level fusion of the lip-voice features, substantially enhances performance against impostor attacks. The sensitivity of the feature-fusion components to variations in the length of the training utterances has also been recognised. Future work will investigate new types of audio-visual features and fusion techniques for improving robustness against different types of acoustic and visual environmental degradation.

References:
[1] A. Ross and A. K. Jain, Information fusion in biometrics, Pattern Recognition Letters, vol. 24, no. 13, Sept. 2003, pp. 2115-2125.
[2] J. Kittler, G. Matas, K. Jonsson and M. Sanchez, Combining evidence in personal identity verification systems, Pattern Recognition Letters, vol. 18, no. 9, Sept. 1997, pp. 845-852.
[3] G. Chetty and M. Wagner, Liveness verification in audio-video authentication, Proc. Int. Conf. on Spoken Language Processing ICSLP-04, pp. 2509-2512.
[4] C. Sanderson and K. K. Paliwal, Fast features for face authentication under illumination direction changes, Pattern Recognition Letters, vol. 24, pp. 2409-2419.
[5] Signal Processing Information Base (SPIB), http://spib.rice.edu/spib/select_noise.html
[6] G. Chetty and M. Wagner, Automated lip feature extraction for liveness verification in audio-video authentication, Proc. Image and Vision Computing 2004, New Zealand, pp. 17-22.
[7] D. Reynolds, T. Quatieri and R. Dunn, Speaker verification using adapted Gaussian mixture models, Digital Signal Processing, vol. 10, no. 1-3, 2000, pp. 19-41.
[8] M. C. Cheung, K. K. Yiu, M. W. Mak and S. Y. Kung, Multi-sample fusion with constrained feature transformation for robust speaker verification, Proc. Odyssey 2004.
[9] R. Auckenthaler, E. Paris and M. Carey, Improving a GMM speaker verification system by phonetic weighting, Proc. ICASSP 99, 1999, pp. 1440-1444.
[10] M. Turk and A. Pentland, Eigenfaces for recognition, J. Cognitive Neuroscience, vol. 3, no. 1, 1991, pp. 71-86.