
A Bayesian Approach to Audio-Visual Speaker Identification

Ara V. Nefian 1, Lu Hong Liang 1, Tieyan Fu 2, and Xiao Xing Liu 1
1 Microprocessor Research Labs, Intel Corporation, {ara.nefian, lu.hong.liang, xiao.xing.liu}@intel.com
2 Computer Science and Technology Department, Tsinghua University, futieyan00@mails.tsinghua.edu.cn

Abstract. In this paper we describe a text-dependent audio-visual speaker identification approach that combines face recognition and audio-visual speech-based identification systems. The temporal sequences of audio and visual observations obtained from the acoustic speech and the shape of the mouth are modeled using a set of coupled hidden Markov models (CHMM), one for each phoneme-viseme pair and for each person in the database. The use of the CHMM in our system is justified by the capability of this model to describe the natural audio and visual state asynchrony as well as their conditional dependence over time. Next, the likelihood obtained for each person in the database is combined with the face recognition likelihood obtained using an embedded hidden Markov model (EHMM). Experimental results on the XM2VTS database show that our system improves the accuracy of the audio-only or video-only speaker identification at all levels of acoustic signal-to-noise ratio (SNR) from 5 to 30 dB.

1 Introduction

Increased interest in robust person identification systems has led to complex systems that often rely on the fusion of several types of sensors. Audio-visual speaker identification (AVSI) systems are particularly interesting due to their increased robustness to acoustic noise. These systems combine acoustic speech features with facial or visual speech features to reveal the identity of the user. As in audio-visual speech recognition, the key issues for robust AVSI systems are the visual feature extraction and the audio-visual decision strategy. Audio-visual fusion methods [22, 3] can be broadly grouped into two categories: feature fusion and decision fusion systems. In feature fusion systems, the observation vectors, obtained through the concatenation of the acoustic and visual speech feature vectors, are described using a hidden Markov model (HMM). However, the audio and visual state synchrony assumed by these systems may not describe the audio-visual speech generation accurately. In comparison, in decision-level systems the class-conditional likelihood of each modality is combined at the phone or word level. Some of the most successful decision fusion models include the multi-stream HMM [18] and the product HMM [21, 4].
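To make the two fusion categories concrete, the following sketch (not from the paper; array shapes, function names and weights are illustrative assumptions) contrasts feature-level concatenation of frame-aligned acoustic and visual vectors with decision-level combination of per-stream log-likelihoods.

```python
# Illustrative sketch of the two fusion categories discussed above.
# Shapes, names and weights are assumptions, not the authors' code.
import numpy as np

def feature_fusion(audio_feats: np.ndarray, visual_feats: np.ndarray) -> np.ndarray:
    """Feature fusion: concatenate frame-aligned acoustic (T, Da) and visual (T, Dv)
    feature vectors into one (T, Da + Dv) observation sequence for a single HMM."""
    assert audio_feats.shape[0] == visual_feats.shape[0], "streams must be frame-aligned"
    return np.concatenate([audio_feats, visual_feats], axis=1)

def decision_fusion(audio_loglik: float, visual_loglik: float,
                    lambda_a: float = 0.7, lambda_v: float = 0.3) -> float:
    """Decision fusion: combine per-stream class-conditional log-likelihoods
    (e.g. at phone or word level) with stream weights."""
    return lambda_a * audio_loglik + lambda_v * visual_loglik
```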

Bayesian models have also shown high modeling accuracy for face recognition. Recent face recognition systems using embedded Bayesian networks [14] showed improved performance over several template-based approaches [23, 1, 6, 17].

[Fig. 1. The audio-visual speaker identification system: the video sequence is passed through face detection and face feature extraction for face recognition, and through mouth detection and visual feature extraction; together with the acoustic features extracted from the audio sequence these drive the AV speech-based user identification, whose score is combined with the face score for AV speaker identification.]

The Bayesian approach to audio-visual speaker identification described in this paper (Figure 1) starts with the detection of the face and mouth in a video sequence. The facial features are used in the computation of the face likelihood (Section 2), while the visual features of the mouth region together with the acoustic features determine the likelihood of the audio-visual speech (Sections 3 and 4). Finally, the face and audio-visual speech likelihoods are combined in a late integration scheme to reveal the identity of the user (Section 5).

2 The Face Model

While HMMs are very successful in speech or gesture recognition, an equivalent two-dimensional HMM for images has been shown to be impractical due to its complexity [5]. Figure 2a shows a graph representation of the 2D HMM, with the square nodes representing the discrete hidden nodes and the circles describing the continuous observation nodes. In recent years, several approaches to approximate a 2D HMM with computationally practical models have been investigated [20, 14, 7].

[Fig. 2. A 2D HMM for face recognition (a), an embedded HMM (b) and the facial feature block extraction for face recognition (c).]

In this paper, the face images are modeled using an embedded HMM (EHMM) [15]. The EHMM used for face recognition is a hierarchical statistical model with two layers of discrete hidden nodes (one layer for each data dimension) and a layer of observation nodes. In an EHMM, both the "parent" and "child" layers of hidden nodes are described by a set of HMMs (Figure 2b). The states of the HMMs in the "parent" and "child" layers are referred to as the super states and the states of the model, respectively. The hierarchical structure of the EHMM, or of embedded Bayesian networks [14] in general, significantly reduces the complexity of these models compared to the 2D HMM. The sequence of observation vectors for an EHMM is obtained from a window that scans the image from left to right and top to bottom, as shown in Figure 2c. Using the images in the training set, an EHMM is trained for each person in the database by means of the EM algorithm described in [15]. Recognition is carried out via the Viterbi decoding algorithm [12].
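As a rough illustration of the block extraction of Figure 2c, the sketch below (not the authors' code) scans an image with an overlapping window from left to right and top to bottom and keeps the low-frequency 2D-DCT coefficients of each block; the 8 x 8 window, 75% overlap and 3 x 3 DCT region follow the settings reported in Section 6, while the function name and return layout are my assumptions.

```python
# Minimal sketch of EHMM observation extraction: an overlapping window scans the
# face image left to right and top to bottom (Figure 2c) and each block is reduced
# to its nine lowest-frequency 2D-DCT coefficients (the facial features of Section 6).
import numpy as np
from scipy.fft import dctn

def ehmm_observations(image: np.ndarray, win: int = 8, overlap: float = 0.75,
                      dct_size: int = 3) -> list[list[np.ndarray]]:
    """Return one list of observation vectors per horizontal scan line of windows."""
    step = max(1, int(round(win * (1.0 - overlap))))
    rows = []
    for y in range(0, image.shape[0] - win + 1, step):
        row = []
        for x in range(0, image.shape[1] - win + 1, step):
            block = image[y:y + win, x:x + win].astype(float)
            coeffs = dctn(block, norm="ortho")[:dct_size, :dct_size]
            row.append(coeffs.ravel())      # 3 x 3 low-frequency 2D-DCT coefficients
        rows.append(row)
    return rows
```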

3 The Audio-Visual Speech Model

A coupled HMM (CHMM) [2] can be seen as a collection of HMMs, one for each data stream, where the hidden backbone nodes at time t for each HMM are conditioned by the backbone nodes at time t-1 of all the related HMMs. Throughout this paper we use a CHMM with two channels, one for the audio and the other for the visual observations (Figure 3a).

[Fig. 3. A directed graph representation of a two-channel CHMM with mixture components (a) and the state diagram representation of the CHMM used in our audio-visual speaker identification system (b).]

The parameters of a CHMM with two channels are defined as

\pi_0^c(i) = P(q_1^c = i),
b_t^c(i) = P(O_t^c \mid q_t^c = i),
a_{i|j,k}^c = P(q_t^c = i \mid q_{t-1}^a = j,\, q_{t-1}^v = k),

where c \in \{a, v\} denotes the audio and visual channels respectively, and q_t^c is the state of the backbone node in the c-th channel at time t.
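A small sketch (my own notation, not the authors' implementation) of how the two-channel CHMM parameters defined above could be stored and queried; the uniform initialisation and state counts are placeholder assumptions.

```python
# Sketch of the two-channel CHMM parameters defined above: per-channel initial
# state probabilities and transitions conditioned on the previous backbone states
# of BOTH channels. Uniform initialisation is a placeholder assumption.
import numpy as np

class TwoChannelCHMM:
    def __init__(self, n_audio_states: int, n_video_states: int):
        self.n = {"a": n_audio_states, "v": n_video_states}
        # pi^c_0(i) = P(q^c_1 = i)
        self.pi = {c: np.full(n, 1.0 / n) for c, n in self.n.items()}
        # a^c_{i|j,k} = P(q^c_t = i | q^a_{t-1} = j, q^v_{t-1} = k)
        self.trans = {c: np.full((self.n["a"], self.n["v"], n), 1.0 / n)
                      for c, n in self.n.items()}

    def transition_prob(self, channel: str, i: int, prev_audio: int, prev_video: int) -> float:
        """P(q^c_t = i | q^a_{t-1} = prev_audio, q^v_{t-1} = prev_video)."""
        return float(self.trans[channel][prev_audio, prev_video, i])
```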

For a continuous mixture with Gaussian components, the probabilities of the observed nodes are given by

b_t^c(i) = \sum_{m=1}^{M_i^c} w_{i,m}^c\, N(O_t^c;\, \mu_{i,m}^c,\, U_{i,m}^c),

where O_t^c is the observation vector at time t corresponding to channel c, and \mu_{i,m}^c, U_{i,m}^c and w_{i,m}^c are the mean, covariance matrix and mixture weight corresponding to the i-th state, the m-th mixture and the c-th channel. M_i^c is the number of mixtures corresponding to the i-th state in the c-th channel. In our audio-visual speaker identification system, each CHMM describes one of the possible phoneme-viseme pairs as defined in [16], for each person in the database.

4 Training the CHMM

The training of the CHMM parameters for the task of audio-visual speaker identification is performed in two stages. First, a speaker-independent background model (BM) is obtained for each CHMM corresponding to a phoneme-viseme pair. Next, the parameters of the CHMMs are adapted to a speaker-specific model using a maximum a posteriori (MAP) method. To deal with the requirements of a continuous speech recognition system, two additional CHMMs are trained to model the silence between consecutive words and sentences.

4.1 Maximum Likelihood Training of the Background Model

In the first stage, the CHMMs for isolated phoneme-viseme pairs are initialized using the Viterbi-based method described in [13], followed by the expectation-maximization (EM) algorithm [10]. Each of the models obtained in the first stage is extended with one entry and one exit non-emitting state (Figure 3b). The use of the non-emitting states also enforces the phoneme-viseme synchrony at the model boundaries. Next, the parameters of the CHMMs are refined through the embedded training of all CHMMs from continuous audio-visual speech [10]. In this stage, the labels of the training sequences consist only of the sequence of phoneme-visemes, with all boundary information being ignored. We denote the mean, covariance matrices and mixture weights for mixture m, state i, and channel c of the trained CHMM corresponding to the background model as (\mu_{i,m}^c)_{BM}, (U_{i,m}^c)_{BM} and (w_{i,m}^c)_{BM} respectively.
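The per-state observation probability b_t^c(i) defined in Section 3 is a Gaussian mixture; the following sketch (not the authors' implementation) evaluates it for one state and one channel under a diagonal-covariance assumption, consistent with the experimental settings in Section 6, with array shapes as my assumptions.

```python
# Sketch of the mixture observation probability b^c_t(i) from Section 3,
# with diagonal covariance matrices as used in Section 6.
import numpy as np

def state_observation_prob(obs: np.ndarray, weights: np.ndarray,
                           means: np.ndarray, variances: np.ndarray) -> float:
    """b^c_t(i) = sum_m w^c_{i,m} N(O^c_t; mu^c_{i,m}, U^c_{i,m}).

    obs:       (D,)   observation vector O^c_t
    weights:   (M,)   mixture weights w^c_{i,m}
    means:     (M, D) mixture means mu^c_{i,m}
    variances: (M, D) diagonal covariances U^c_{i,m}
    """
    diff = obs[None, :] - means                                    # (M, D)
    log_norm = -0.5 * np.log(2.0 * np.pi * variances).sum(axis=1)  # (M,)
    log_expo = -0.5 * (diff ** 2 / variances).sum(axis=1)          # (M,)
    return float(np.dot(weights, np.exp(log_norm + log_expo)))
```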

4.2 Maximum A Posteriori Adaptation

In this stage of the training, the state parameters of the background model are adapted to the characteristics of each speaker in the database. The new state parameters of all CHMMs, \hat{\mu}_{i,m}^c, \hat{U}_{i,m}^c and \hat{w}_{i,m}^c, are obtained through Bayesian adaptation [19]:

\hat{\mu}_{i,m}^c = \alpha_{i,m}^c\, \mu_{i,m}^c + (1 - \alpha_{i,m}^c)\, (\mu_{i,m}^c)_{BM}    (1)
\hat{U}_{i,m}^c = \alpha_{i,m}^c \left( U_{i,m}^c + (\mu_{i,m}^c)^2 \right) + (1 - \alpha_{i,m}^c) \left( (U_{i,m}^c)_{BM} + ((\mu_{i,m}^c)_{BM})^2 \right) - (\hat{\mu}_{i,m}^c)^2    (2)
\hat{w}_{i,m}^c = \alpha_{i,m}^c\, w_{i,m}^c + (1 - \alpha_{i,m}^c)\, (w_{i,m}^c)_{BM}    (3)

where \alpha_{i,m}^c is a parameter that controls the MAP adaptation for mixture component m in channel c and state i. The sufficient statistics of the CHMM states corresponding to a specific user, \mu_{i,m}^c, U_{i,m}^c and w_{i,m}^c, are obtained using the EM algorithm from the available speaker-dependent data as follows:

\mu_{i,m}^c = \frac{\sum_{r,t} \gamma_{r,t}^c(i,m)\, O_{r,t}^c}{\sum_{r,t} \gamma_{r,t}^c(i,m)},
U_{i,m}^c = \frac{\sum_{r,t} \gamma_{r,t}^c(i,m)\, (O_{r,t}^c - \mu_{i,m}^c)(O_{r,t}^c - \mu_{i,m}^c)^T}{\sum_{r,t} \gamma_{r,t}^c(i,m)},
w_{i,m}^c = \frac{\sum_{r,t} \gamma_{r,t}^c(i,m)}{\sum_{r,t} \sum_k \gamma_{r,t}^c(i,k)},

where

\gamma_{r,t}^c(i,m) = \left[ \sum_j \frac{1}{P_r}\, \alpha_{r,t}(i,j)\, \beta_{r,t}(i,j) \right] \frac{w_{i,m}^c\, N(O_{r,t}^c \mid \mu_{i,m}^c, U_{i,m}^c)}{\sum_k w_{i,k}^c\, N(O_{r,t}^c \mid \mu_{i,k}^c, U_{i,k}^c)},

P_r = \sum_{i,j} \alpha_{r,t}(i,j)\, \beta_{r,t}(i,j), and \alpha_{r,t}(i,j) = P(O_{r,1}, \ldots, O_{r,t}, q_{r,t}^a = i, q_{r,t}^v = j) and \beta_{r,t}(i,j) = P(O_{r,t+1}, \ldots, O_{r,T_r} \mid q_{r,t}^a = i, q_{r,t}^v = j) are the forward and backward variables respectively [10], computed for the r-th observation sequence O_{r,t} = [(O_{r,t}^a)^T, (O_{r,t}^v)^T]^T. The adaptation coefficient is

\alpha_{i,m}^c = \frac{\sum_{r,t} \gamma_{r,t}^c(i,m)}{\sum_{r,t} \gamma_{r,t}^c(i,m) + \delta},

where \delta is the relevance factor, which is set to \delta = 16 in our experiments. Note that as more speaker-dependent data for a mixture m of state i and channel c becomes available, the contribution of the speaker-specific statistics to the MAP state parameters increases (Equations 1-3). On the other hand, when less speaker-specific data is available, the MAP parameters stay very close to the parameters of the background model.
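A hedged sketch of the MAP adaptation step of Equations (1)-(3): the speaker-specific sufficient statistics are interpolated with the background model using the data-dependent coefficient \alpha = n / (n + \delta) with relevance factor \delta = 16. The diagonal-covariance simplification, the variable names and the final weight renormalisation are my assumptions, not the authors' code.

```python
# Sketch of the MAP adaptation of Equations (1)-(3). Diagonal covariances,
# variable names and the final weight renormalisation are assumptions.
import numpy as np

def map_adapt_state(gamma_sum: np.ndarray,        # (M,)   n_m = sum_{r,t} gamma^c_{r,t}(i, m)
                    mu_stat: np.ndarray,          # (M, D) speaker mean statistics mu^c_{i,m}
                    var_stat: np.ndarray,         # (M, D) speaker variance statistics U^c_{i,m}
                    w_stat: np.ndarray,           # (M,)   speaker weight statistics w^c_{i,m}
                    mu_bm: np.ndarray, var_bm: np.ndarray, w_bm: np.ndarray,
                    relevance: float = 16.0):
    """Interpolate speaker statistics with the background model (BM) parameters."""
    alpha = gamma_sum / (gamma_sum + relevance)    # adaptation coefficient alpha^c_{i,m}
    a = alpha[:, None]
    mu_hat = a * mu_stat + (1.0 - a) * mu_bm                                   # Eq. (1)
    second_moment = a * (var_stat + mu_stat ** 2) + (1.0 - a) * (var_bm + mu_bm ** 2)
    var_hat = second_moment - mu_hat ** 2                                      # Eq. (2)
    w_hat = alpha * w_stat + (1.0 - alpha) * w_bm                              # Eq. (3)
    return mu_hat, var_hat, w_hat / w_hat.sum()
```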

5 Recognition

Given an audio-visual test sequence, the recognition is performed in two stages. First, the face likelihood and the audio-visual speech likelihood are computed separately. To deal with the variation in the relative reliability of the audio and visual features of speech at different levels of acoustic noise, we modified the observation probabilities used in decoding such that

\tilde{b}_t^c(i) = [b_t^c(i)]^{\lambda_c}, \quad c \in \{a, v\},

where the audio and video stream exponents satisfy \lambda_a, \lambda_v \geq 0 and \lambda_a + \lambda_v = 1. Then the overall matching score of the audio-visual speech and face models is computed as

L(O^f, O^a, O^v \mid k) = \lambda_f\, L(O^f \mid k) + \lambda_{av}\, L(O^a, O^v \mid k),    (4)

where O^a, O^v and O^f are the acoustic speech, visual speech and facial sequences of observations, L(\cdot \mid k) denotes the observation likelihood for the k-th person in the database, and \lambda_f, \lambda_{av} \geq 0 with \lambda_f + \lambda_{av} = 1 are weighting coefficients for the face and audio-visual speech likelihoods.

6 Experimental Results

The audio-visual speaker identification system presented in this paper was tested on digit enumeration sequences from the XM2VTS database [11]. For parameter adaptation we used four training sequences from each of the 87 speakers in our training set, while for testing we used 320 sequences. In our experiments the acoustic observation vectors consist of 13 Mel frequency cepstral (MFC) coefficients with their first and second order time derivatives, extracted from windows of 25.6 ms with an overlap of 15.6 ms. The extraction of the visual features starts with the face detection system described in [9], followed by the detection and tracking of the mouth region using a set of support vector machine classifiers. The features of the visual speech are obtained from the mouth region through the cascade algorithm described in [8]. The pixels in the mouth region are mapped to a 32-dimensional feature space using principal component analysis. Then, blocks of 15 consecutive visual observation vectors are concatenated and projected onto a 13-class linear discriminant space. Finally, the resulting vectors, with their first and second order time derivatives, are used as the visual observation sequences. The audio and visual features are integrated using a CHMM with three states in both the audio and video chains with no back transitions (Figure 3b). Each state has 32 mixture components with diagonal covariance matrices. In our system, the facial features are obtained using a sampling window of size 8 x 8 with 75% overlap between consecutive windows. The observation vectors corresponding to each position of the sampling window consist of a set of 2D discrete cosine transform (2D DCT) coefficients. Specifically, we used nine 2D DCT coefficients obtained from a 3 x 3 region around the lowest frequency in the 2D DCT domain. The faces of all people in the database are modeled using an EHMM with five super states and 3, 6, 6, 6, 3 states per super state respectively. Each state of the hidden nodes in the "child" layer of the EHMM is described by a mixture of three Gaussian density functions with diagonal covariance matrices. To evaluate the behavior of our speaker identification system in environments affected by acoustic noise, we corrupted the testing sequences with white Gaussian noise at different SNR levels, while we trained on the original clean acoustic sequences.
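To illustrate the two-level combination used at recognition time, the sketch below applies the stream exponents of Section 5 to per-stream log observation probabilities and then combines the face and audio-visual speech scores as in Equation (4); the function names and default weights are my assumptions (the defaults mirror the \lambda_f = 0.3 setting reported for Figure 4).

```python
# Sketch of the stream-exponent weighting and the late fusion of Equation (4).
# Function names and default weights are assumptions.
import numpy as np

def weighted_log_obs(log_b_audio: float, log_b_video: float,
                     lambda_a: float, lambda_v: float) -> float:
    """log b~_t(i) = lambda_a * log b^a_t(i) + lambda_v * log b^v_t(i),
    with lambda_a, lambda_v >= 0 and lambda_a + lambda_v = 1."""
    assert min(lambda_a, lambda_v) >= 0.0 and abs(lambda_a + lambda_v - 1.0) < 1e-9
    return lambda_a * log_b_audio + lambda_v * log_b_video

def identify_speaker(face_loglik: np.ndarray, av_speech_loglik: np.ndarray,
                     lambda_f: float = 0.3, lambda_av: float = 0.7) -> int:
    """Return argmax_k of lambda_f * L(O^f | k) + lambda_av * L(O^a, O^v | k)  (Eq. 4)."""
    scores = lambda_f * face_loglik + lambda_av * av_speech_loglik
    return int(np.argmax(scores))
```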

Figure 4 shows the error rate of the audio-only ((\lambda_v, \lambda_f) = (0.0, 0.0)), video-only ((\lambda_v, \lambda_f) = (1.0, 0.0)), audio-visual (\lambda_f = 0.0) and face + audio-visual ((\lambda_v, \lambda_f) = (0.5, 0.3)) speaker identification systems at different SNR levels.

[Fig. 4. Error rate (%) versus SNR (dB) of the audio-only, video-only, face-only, audio-visual, and audio-visual + face speaker identification systems.]

7 Conclusions

In this paper we described a Bayesian approach to text-dependent audio-visual speaker identification. Our system uses a hierarchical decision fusion approach. At the lower level, the acoustic and visual features of speech are integrated using a CHMM and the face likelihood is computed using an EHMM. At the upper level, a late integration scheme combines the likelihoods of the face and of the audio-visual speech to reveal the identity of the speaker. The use of strongly correlated acoustic and visual temporal features of speech, together with the overall facial characteristics, makes the current system very difficult to break and increases its robustness to acoustic noise. The study of the recognition performance in environments corrupted by acoustic noise shows that our system outperforms the audio-only baseline system by a wide margin.

References

1. P.N. Belhumeur, J. Hespanha, and D.J. Kriegman. Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. In Proceedings of the Fourth European Conference on Computer Vision (ECCV'96), pages 45-58, April 1996.
2. M. Brand, N. Oliver, and A. Pentland. Coupled hidden Markov models for complex action recognition. In IEEE International Conference on Computer Vision and Pattern Recognition, pages 994-999, 1997.
3. C. Chibelushi, F. Deravi, and S.D. Mason. A review of speech-based bimodal recognition. IEEE Transactions on Multimedia, 4(1):23-37, March 2002.
4. S. Dupont and J. Luettin. Audio-visual speech modeling for continuous speech recognition. IEEE Transactions on Multimedia, 2:141-151, September 2000.

5. S. Kuo and O.E. Agazzi. Keyword spotting in poorly printed documents using pseudo 2-D hidden Markov models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(8):842-848, August 1994.
6. S. Lawrence, C.L. Giles, A.C. Tsoi, and A.D. Back. Face recognition: A convolutional neural network approach. IEEE Transactions on Neural Networks, 8(1):98-113, 1997.
7. Jia Li, A. Najmi, and R.M. Gray. Image classification by a two-dimensional hidden Markov model. IEEE Transactions on Signal Processing, 48(2):517-533, February 2000.
8. L. Liang, X. Liu, X. Pi, Y. Zhao, and A.V. Nefian. Speaker independent audio-visual continuous speech recognition. In International Conference on Multimedia and Expo, volume 2, pages 25-28, 2002.
9. R. Lienhart and J. Maydt. An extended set of Haar-like features for rapid object detection. In IEEE International Conference on Image Processing, volume 1, pages 900-903, 2002.
10. X. Liu, L. Liang, Y. Zhao, X. Pi, and A.V. Nefian. Audio-visual continuous speech recognition using a coupled hidden Markov model. In International Conference on Spoken Language Processing, 2002.
11. J. Luettin and G. Maitre. Evaluation protocol for the XM2FDB database. IDIAP-COM 98-05, 1998.
12. A.V. Nefian and M.H. Hayes. Face recognition using an embedded HMM. In Proceedings of the IEEE Conference on Audio and Video-based Biometric Person Authentication, pages 19-24, March 1999.
13. A.V. Nefian, L. Liang, X. Pi, X. Liu, and C. Mao. A coupled hidden Markov model for audio-visual speech recognition. In International Conference on Acoustics, Speech and Signal Processing, volume 2, pages 2013-2016, 2002.
14. Ara V. Nefian. Embedded Bayesian networks for face recognition. In IEEE International Conference on Multimedia and Expo, volume 2, pages 25-28, 2002.
15. Ara V. Nefian and Monson H. Hayes III. Maximum likelihood training of the embedded HMM for face detection and recognition. In IEEE International Conference on Image Processing, volume 1, pages 33-36, 2000.
16. C. Neti, G. Potamianos, J. Luettin, I. Matthews, D. Vergyri, J. Sison, A. Mashari, and J. Zhou. Audio visual speech recognition. In Final Workshop 2000 Report, 2000.
17. P. Jonathon Phillips. Matching pursuit filters applied to face identification. IEEE Transactions on Image Processing, 7(8):1150-1164, August 1998.
18. G. Potamianos, J. Luettin, and C. Neti. Asynchronous stream modeling for large vocabulary audio-visual speech recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing, volume 1, pages 169-172, 2001.
19. D.A. Reynolds, T.F. Quatieri, and R.B. Dunn. Speaker verification using an adapted Gaussian mixture model. Digital Signal Processing, 10:19-41, 2000.
20. F. Samaria. Face Recognition Using Hidden Markov Models. PhD thesis, University of Cambridge, 1994.
21. L.K. Saul and M.I. Jordan. Boltzmann chains and hidden Markov models. In G. Tesauro, D.S. Touretzky, and T. Leen, editors, Advances in Neural Information Processing Systems, volume 7. The MIT Press, 1995.
22. S. Ben-Yacoub, J. Luettin, K. Jonsson, J. Matas, and J. Kittler. Audio-visual person verification. In IEEE International Conference on Computer Vision and Pattern Recognition, pages 580-585, 1999.
23. M. Turk and A. Pentland. Face recognition using eigenfaces. In Proceedings of the International Conference on Pattern Recognition, pages 586-591, 1991.