
Audio-visual Biometrics Using Reliability-based Late Fusion and Deep Neural Networks

Mohammad Rafiqul Alam

This thesis is presented for the degree of Doctor of Philosophy of The University of Western Australia
School of Computer Science and Software Engineering
January 2016

I would like to dedicate this thesis to my loving parents, Mr. Kalimullah and Mrs. Rokeya Begum, and my lovely wife, Airin Sultana, for their love and support.

Abstract

Online fraudulent activities are rapidly increasing with the growing use of web-based applications. This is one example of why biometrics are becoming essential in various commercial, government and forensic applications to verify the identity of an unknown person. Audio-visual biometrics provides a natural choice for person recognition, as both inputs (i.e., speech and facial image) are non-intrusive and provide complementary and correlated information. Recent advancements in mobile phone technology and the emergence of low-cost data acquisition devices have further facilitated non-intrusive data acquisition for audio-visual biometric systems. However, the captured data may be of poor quality due to, in the case of face recognition, variations in pose, illumination and background. A quality-based fusion approach can be used as a solution to this problem: the estimation of a quality index for each input and the use of these indices in the fusion of the two modalities. However, measuring the quality at the signal level is difficult, particularly for the visual inputs, because the source of variation (e.g., illumination, pose and background) is difficult to model.

This thesis analyses the impact of noisy inputs on the matching scores and presents a reliability-based score-level fusion. Then, a late fusion framework is proposed to incorporate both score- and rank-level fusion. In addition, a multimodal deep neural network which infers joint features (i.e., feature-level fusion) is trained using a novel three-step algorithm. The thesis is organized as a set of papers which are already published and/or submitted to journals or internationally refereed conferences.

First, an audio-visual person identification system is presented using a Linear Regression-based Classifier (LRC) for both the audio and visual sub-systems. Although the LRC was previously used for face recognition, this thesis presents a novel LRC-GMM-UBM approach for speaker identification. An identification accuracy of 97.72% is achieved using the proposed LRC-GMM-UBM approach on a subset of the AusTalk database, consisting of 88 persons recorded at the UWA campus. Then, a reliability-based score-level fusion is proposed to boost the overall identification accuracy, especially in noisy conditions, compared to a non-adaptive and a non-learned approach of fusion.

Following the success of the reliability-based score-level fusion, a late fusion framework incorporating both score- and rank-level fusion is developed. In this framework, the matching scores from the classifier of a sub-system are transformed using a novel C-ratio. Modifications to the highest rank and the Borda count fusion rules are proposed to incorporate a confidence factor. Extensive experiments were carried out on an extended subset of the AusTalk database consisting of audio-visual data from 248 persons (recorded at three university campuses across Australia) and the VidTIMIT database containing audio-visual data from 43 persons. It has been shown that the proposed fusion framework is more robust compared to the state-of-the-art late (score-level and rank-level) fusion approaches.

A novel reliability-to-weight mapping function is proposed which can be used with different reliability measures such as the C-ratio and posterior entropy. The proposed mapping function overcomes the bias issue which is persistent in the existing entropy-based score-level fusion rules. A DBM-DNN classifier is used, for the first time, to perform audio-visual person identification. Experimental results show that the DBM-DNN, in conjunction with the proposed reliability-to-weight mapping function, outperforms the state-of-the-art DBN-DNN, LRC and Support Vector Machine (SVM) classifiers on the VidTIMIT and MOBIO databases. Experimental results also show that the proposed mapping function is robust to JPEG compression (in the face image) and acoustic babble noise (in the speech).

Moreover, an audio-visual person identification system based on a joint Deep Boltzmann Machine (jdbm) model is proposed. The proposed jdbm model is trained using a novel three-step algorithm. The activation probabilities of the units in the shared layer of the jdbm model are used as joint features (this can be viewed as fusion at the feature level) and logistic regression is used for classification. Experimental results show that higher identification accuracy is achieved using the joint features generated by the proposed model than using the joint features obtained from the state-of-the-art Deep Auto-encoder (DAE) and Deep Belief Network (DBN) models. The joint features generated by the proposed jdbm model are also more robust to noise and to a missing modality compared to the state-of-the-art.

Acknowledgements

First, I would like to thank almighty Allah for bestowing His mercy and blessings upon me and my family. In the four years of PhD study, I have gained valuable skills and expertise which will hopefully make me an expert in the field of biometrics. It would not have been possible without a lot of patience and courage from me, mental support from my family, and financial and technical support from my supervisors. In particular, I would like to mention my wife, Airin Sultana, who encouraged me from the very beginning of my PhD study and always motivated me to believe in myself. I consider myself a very lucky person to have her in my life.

I sincerely thank the Graduate Research and Scholarship Office (GRSO) for offering me first the UPAIS and then the UPA scholarships. I also thank the GRSO for approving my travel scholarship application, which enabled me to present a paper at the BTAS 2015 conference held in Washington DC, USA. I would like to express my sincere gratitude to the School of Computer Science and Software Engineering (CSSE) for supporting my PhD Completion scholarship application. I thank Professor Mark Reynolds for his support when I sought a tutoring opportunity.

Finally, I would like to thank my supervisors for their continuous support and advice. I am grateful to Winthrop Professor Mohammed Bennamoun for supervising my PhD research. His timely advice helped me to achieve many positive outcomes from my research. I would like to especially mention Professor Roberto Togneri, my co-supervisor, who supported me in improving my writing skills and developing research ideas since the early days of my PhD candidature. His support for my application for the PhD completion scholarship is incomparable and will always be remembered. My sincere thanks also go to Dr. Ferdous Sohel, who has been more like an elder brother to me than a co-supervisor. He helped me to quickly settle in at UWA and understand the terms related to my research. I also thank Dr. Imran Naseem for sharing his thoughts and contributing as a co-author of one of the papers that constitute this thesis.

Table of contents

List of figures
List of tables

1 Introduction
   Motivations of the Research
   Problem Statement
   Research Contributions
   Structure of the Thesis
      A Review of Audio-Visual Biometric Systems (Chapter 2)
      Linear Regression-based Classifier for Audio-visual Person Identification (Chapter 3)
      A Reliability-based Score-level Fusion of Linear Regression-based Classifiers (Chapter 4)
      A Confidence-Based Late Fusion Framework (Chapter 5)
      Audio-visual Person Recognition using Deep Neural Networks and a Novel Reliability-based Fusion (Chapter 6)
      A joint Deep Boltzmann Machine (jdbm) Model for Person Identification using Mobile Phone Data (Chapter 7)

2 A Review of Audio-Visual Biometric Systems
   Biometric systems
      What is biometrics?
      Multimodal biometrics
   Audio-Visual biometric systems
      Preprocessing
      Feature extraction
      Classification
      Fusion
   Audio-visual corpora
   Audio-visual biometric systems evaluated in a controlled environment
   Mobile person recognition
      Speaker recognition systems
      Face recognition systems
      Audio-visual person recognition on MOBIO
   Deep neural networks for person recognition
      A DBN-DNN for unimodal person recognition
      A DBM-DNN for person recognition
   Summary

3 Linear Regression-based Classifier for Audio-Visual Person Identification
   Introduction
   Linear Regression-based Classifier
   LRC-GMM-UBM for Speaker Recognition
   Score Normalization
      Min-max normalization
   Audio Visual Fusion
   Experimental Results and Analysis
   Conclusion

4 A Reliability-based Score-level Fusion of Linear Regression-based Classifiers
   Introduction
   Score Normalization
      Min-max normalization
   Proposed Reliability Estimation Technique
   Experimental Setup
   Results and Analysis
   Discussion

5 A Late Fusion Framework For Audio-Visual Person Identification
   Introduction
      Motivation and Contributions
   Fusion In Multibiometric Identification
      Existing Score-level Fusion Approaches
      Existing Rank-Level Fusion Methods
   Proposed Fusion Framework
      C-ratio Score Fusion
      Confidence-Based Rank-Level Fusion
   Databases And Systems
      AusTalk
      VidTIMIT
      System
   Experiments, Results and Analysis
      Robustness to AWGN
      Robustness to Salt and Pepper Noise
      Rank-level Fusion With AWGN
   Conclusions

6 Audio-visual person recognition using deep neural networks and a novel reliability-based fusion
   Introduction
   Background
      Audio-visual biometrics
      Reliability-to-weight mapping
   Methodology
      Unsupervised training of the DBMs
      Supervised fine-tuning of the DBM-DNNs
      Decision making
      Fusion
      Evaluation criteria
   Experimental setup
   Experimental results
      MOBIO database
      VidTIMIT database
      Identification
      Verification
   Conclusion

7 A joint Deep Boltzmann Machine (jdbm) Model for Person Identification using Mobile Phone Data
   Introduction
   Related Work
   Proposed Learning of the jdbm Model
      Training of Bimodal DBM
   Experimental Setup
      MOBIO Database
      Hand-crafted Features
      Implementation Details
   Results and Analysis
      Unimodal Identification
      Evaluation of Shared Features
      Robustness of the Joint Features
      Robustness to Missing Modality
   Conclusion

8 Conclusion
   Discussion
   Future Work

References

List of figures

2.1 Commonly used biometric traits and the level of cooperation required from the end users during data acquisition
Block diagram of an audio-visual biometric recognition system
A sequence of rotating head movements included in the M2VTS, XM2VTS and VidTIMIT databases (examples are taken from XM2VTS)
AusTalk data capturing environment (taken from [1])
A taxonomy of existing speaker recognition methods using MOBIO
A taxonomy of the existing face recognition methods using MOBIO
Left: DBM; right: DBN
a) Generative DBN and b) discriminative DBN-DNN [2]
Two-stage pretraining of a DBM with two hidden layers. Shaded nodes indicate clamped variables, while white nodes indicate free variables
DBM-DNNs are initialized with DBM weights and discriminatively fine-tuned. The output scores are fused before reaching a final decision
A typical speaker with white Gaussian noise from 0.0 to 0.9 (clockwise from top left)
Identification accuracy with respect to feature vector dimension
Audio-visual fusion performance with noisy audio and clean visual data
Audio-visual fusion performance with noisy visual and clean audio data
Audio-visual fusion performance with respect to AWGN
Identification accuracy at different levels of noise on visual data and mild audio noise (SNR = 40 dB)
Identification accuracy at different levels of audio SNR and mild visual noise (AWGN variance = 0.06)
4.3 Identification accuracy at different levels of noise on visual data (AWGN variance) and high audio noise (SNR = 12 dB)
Identification accuracy at different levels of audio SNR and high visual noise (AWGN variance of 0.2)
Performance of the reliability estimation technique compared to the non-learned approaches in [3] and [4] at high visual noise
Block diagram of an audio-visual biometric system that incorporates sample quality and/or matcher confidence measures in the fusion. Although quality-based fusion has been studied extensively [5], incorporating matcher confidence in the fusion has not been well studied [6]. Moreover, obtaining sample quality in audio-visual biometrics is challenging [3]. The shaded box highlights our contribution
Variation of match scores with the (a) speech and (b) face matcher confidence measures in the training dataset (T). Our proposed matcher confidence measure is able to separate the genuine scores from the impostor scores: the difference between the genuine and impostor scores is high when the proposed matcher confidence measure is high, and vice versa. We calculate the C-ratio of a modality by normalizing the matcher confidence obtained at the evaluation phase (c_m^E) by the maximum value of the corresponding C_m^T
Universal image quality index (Q_I) for a reference image (a) matched against a clean input image (b) from a different user, and against the same image (a) corrupted with AWGN of levels (c) σ² = 0.3, (d) σ² = 0.3 and (e) σ² = 0.9, as well as (f) 25%, (g) 50% and (h) 75% salt and pepper noise
CMC curves for our confidence-based rank fusion methods (conBordaCount and conHighestRank) at different (audio, visual) noise levels, compared against the Borda count (bordaCount), the highest rank (highestRank), the perturbation-factor highest rank (pFactorHighestRank) and the predictor-based Borda count (predictorBasedBorda) methods
a) Left: Boltzmann machine (BM); right: Restricted Boltzmann Machine (RBM). b) Left: Deep Boltzmann Machine (DBM); right: Deep Belief Network (DBN)
6.2 Unsupervised training of the DBMs. The dimension of the inputs to DBM_face^LBP (left) is 3712 (i.e., 58 LBP patterns are extracted from each (8 × 8) block of the input image and then concatenated). The dimension of the inputs to DBM_speech^GMS (right) is 39c, where c is the number of Gaussian mixture components
Reliability-to-weight mapping on MOBIO using the (a) proposed, (b) negative, and (c) inverse mapping
Example of models that can be used for learning shared features
Block diagram of our proposed jdbm-based audio-visual person identification from mobile phone data
A two-layer deep Boltzmann machine (DBM) with a visible layer and two hidden layers
Three-step training of the proposed jdbm model. In the first step, we learn unimodal DBMs corresponding to the audio and visual modalities. In the second step, we learn the shared layer parameters as a Bernoulli-Bernoulli joint RBM. In the third step, the jdbm is fine-tuned after initialization with the parameters of the unimodal DBMs and the parameters of the joint RBM
Top row: frames extracted from different videos of a person; bottom row: detected faces from the video frames
Impacts of JPEG compression noise on the face images
Illustration of the robustness of the features obtained from our unimodal (face) model
Illustration of the robustness of the features obtained from our unimodal (speech) model

List of tables

2.1 AusTalk subjects represent variations in geography, dialect and emotion
AusTalk data collection protocol (time in minutes)
A summary of the MOBIO database
The unbiased MOBIO evaluation protocol used at ICB 2013
Audio-visual biometric systems evaluated in controlled environments
Speaker recognition systems evaluated using MOBIO
Template-based methods of face recognition on MOBIO, taken from [7]
Audio-Visual person recognition systems on MOBIO
Audio-visual fusion performs better in clean conditions
Raw scores and the ranked lists from the audio and visual experts for a probe (belonging to Client 1)
Rank-1 identification at various levels of additive white Gaussian noise on speech and face probes in AusTalk
Rank-1 identification at various levels of additive white Gaussian noise on speech and face probes in VidTIMIT
Rank-1 identification at various levels of additive white Gaussian noise on speech and salt and pepper noise on face probes in AusTalk
Rank-1 identification at various levels of additive white Gaussian noise on speech and salt and pepper noise on face probes in VidTIMIT
Rank-1 identification accuracy (%) on MOBIO
Rank-1 identification accuracy (%) on MOBIO using DBM-DNN when the face modality is corrupted by different levels of JPEG compression noise but the speech modality is clean
MOBIO verification protocol. There are four types of utterances in the MOBIO dataset, designated as p: personal questions, f: free speech, r: short response, and l: long speech
6.4 Face verification on MOBIO using the systems presented at the ICB 2013 face recognition challenge [8], the evaluation of the session variability modelling approach [9], the LRC, the SVM, and DBM-DNN
Speaker verification on MOBIO using the systems presented at the ICB 2013 speaker recognition challenge [8], the evaluation of the session variability modelling approach [9], the LRC, the SVM, and DBM-DNN
Bi-modal recognition on MOBIO for text-independent verification
Rank-1 identification accuracy (%) on VidTIMIT
Rank-1 identification accuracy on VidTIMIT using DBM-DNN when the face modality is corrupted by different levels of JPEG compression noise but the speech modality is clean
EERs (%) on VidTIMIT for face, speaker and bi-modal recognition
Comparison of rank-1 identification accuracy
Rank-1 identification accuracy using the conventional approach (Section 7.3.1) of training a bimodal DBM and testing with noisy inputs
Rank-1 identification accuracy using the three-step approach (Section 7.3.1) of training our proposed jdbm and testing with noisy inputs
Robustness to missing modality

Publications Arising from This Thesis

This thesis contains works that are already published and/or submitted to journals or internationally refereed conferences. The bibliographical details of each work, and where it appears in the thesis, are outlined below.

International Journal Publications

[1] Alam, M.R., Bennamoun, M., Togneri, R., Sohel, F., "A Confidence-based Late Fusion Framework For Audio-Visual Biometric Identification", Pattern Recognition Letters, Jan. (Chapter 5)

[2] Alam, M.R., Bennamoun, M., Togneri, R., Sohel, F., "Audio-Visual Person Recognition Using Deep Neural Networks and a Novel Reliability-based Fusion", Pattern Recognition Letters (under 2nd round of review), Sep. (Chapter 6)

[3] Alam, M.R., Bennamoun, M., Togneri, R., Sohel, F., "A joint Deep Boltzmann Machine (jdbm) Model for Person Identification using Mobile Phone Data", IEEE Transactions on Multimedia (accepted with minor corrections), Jul. (Chapter 7)

Book Chapters

[4] Alam, M.R., Bennamoun, M., Togneri, R., Sohel, F., "Deep Boltzmann Machines For i-vector Based Audio-Visual Person Identification", in T. Bräunl et al. (Eds.): PSIVT 2015, Lecture Notes in Computer Science (LNCS), volume 9431, pp. 1-11. The preliminary ideas of this paper were refined and extended to contribute towards [2], which forms Chapter 6 of this thesis.

[5] Alam, M.R., Bennamoun, M., Togneri, R., Sohel, F., "Deep Neural Networks for Mobile Person Recognition with Audio-Visual Signals", in G. Guo and H. Weschler (Eds.), Mobile Biometrics (with editors for final review), IET. (Chapter 2)

International Conference Publications

[6] Alam, M.R., Bennamoun, M., Togneri, R., Sohel, F., Naseem, I., "Linear Regression-based Classifier For Audio Visual Person Identification", in Proc. of the 1st International Conference on Communications, Signal Processing, and their Applications (ICCSPA 2013), Sharjah, UAE, Feb. (Chapter 3)

[7] Alam, M.R., Bennamoun, M., Togneri, R., Sohel, F., "An Efficient Reliability Estimation Technique For Audio-Visual Person Identification", in Proc. of the 8th IEEE Conference on Industrial Electronics and Applications (ICIEA 2013), Melbourne, Australia, Jun. (Chapter 4)

[8] Alam, M.R., Bennamoun, M., Togneri, R., Sohel, F., "Confidence-based Rank Level Fusion for Audio-Visual Person Identification", in Proc. of the 3rd International Conference on Pattern Recognition Applications and Methods (ICPRAM 2014), Angers, France, 6-8 Mar. The preliminary ideas of this paper were refined and extended to contribute towards [1], which forms Chapter 5 of this thesis.

[9] Alam, M.R., Bennamoun, M., Togneri, R., Sohel, F., "A Deep Neural Network for Audio-Visual Person Recognition", in Proc. of the 7th International Conference on Biometrics: Theory, Applications and Systems (BTAS 2015), Washington D.C., USA, 8-11 Sep. The preliminary ideas of this paper were refined and extended to contribute towards [2], which forms Chapter 6 of this thesis.

Contribution of the Candidate to the Published Work

The contribution of the candidate to all the published papers was 80%. The candidate developed and implemented the algorithms, carried out the experiments and wrote the papers. The other authors contributed by reviewing the papers and providing suggestions for improvements.

Chapter 1

Introduction

Internet crimes are facilitated by the increasing reliance on web-based services for daily life activities such as banking and shopping [10]. Because people share valuable information (e.g., name, password, address, or date of birth) to access various applications on the web, their identities may be compromised if such information is not strongly protected. This is one of the reasons why many applications (e.g., mobile banking [11] [12] and the arrivals SmartGate at Australian airports [13]) now use biometrics to verify the identity of an unknown person. Recent developments in mobile phone technology and the emergence of other low-cost data capturing devices (e.g., webcams and headphones) have made biometrics a cost-effective and easily deployable solution. However, the data captured using these devices may be of poor quality or corrupted by background noise. To overcome this problem, the current research trend is to use multiple biometric traits. Multiple sub-systems are fused either at an early stage (feature level) or at a late stage (score or rank level).

One of the main motivations of this thesis is to analyse the impact of noisy inputs on the matching scores of a classifier and to use that information to develop a robust late fusion framework. In addition, the use of joint features obtained from the shared layer of a joint Deep Boltzmann Machine (jdbm) model is proposed. These joint features are used with logistic regression to perform audio-visual person identification. In this chapter, the research motivations are discussed, followed by a concise statement of the research problem.

1.1 Motivations of the Research

Online fraudulent activities have increased in recent years due to the growing popularity of web-based applications. According to the 2015 Cyber Fraud Report [14] of Veda, the largest credit reference agency in Australia and New Zealand, our increasing presence on the web is making us more vulnerable to becoming victims of identity theft.

The report found that the number of credit application frauds in Australia increased by 33% compared to the previous year. Almost 25% of Australians are now victims of identity theft (up by 7% over the same period), with many more people likely to be unaware that their identities have already been compromised. Apart from this, the Australian Payments Clearing Association (APCA) report published in June 2015 [15] found that fraud on Australian payment cards increased from 46.6c to 58.8c in every AUD 1,000 spent over the preceding 12 months. Moreover, according to the 2014 annual report of the Internet Crime Complaint Center (IC3), the total number of complaints related to internet crimes received from all over the world increased by 2.5% compared to the previous year [16]. The report also found that identity theft was responsible for a total loss of USD 32 million from the victims between 1 June and December. Therefore, the use of a robust biometric system is essential to strengthen the protection against internet crimes such as identity theft and credit application fraud.

A biometric system uses a person's physiological or behavioural trait(s), such as fingerprint, face, ear, DNA, gait, voice, hand geometry, iris, palm print, or retina, for recognising and identifying the person. Although each biometric trait has its own advantages and disadvantages, the choice of a particular one, or the fusion of multiple biometric traits, depends on the application area [17]. Jain et al. [18] classified the applications of biometric systems into three main groups:

- Commercial applications: computer network login, e-commerce, Automatic Teller Machines (ATMs), credit card applications, Personal Digital Assistants (PDAs), cellular phones, medical records, and distance learning.
- Government applications: national ID, driver's license, social security, welfare disbursement, border control, and passport control.
- Forensic applications: corpse identification, criminal investigation, and terrorist identification.

The performance of a unimodal biometric system, which is based on a single biometric trait, is sensitive to the level of noise in the sensed data, intra-class variations, non-universality and spoof attacks. For example, the performance of a speaker recognition system largely depends on the type of microphone used and on channel noise. Moreover, a biometric system which relies on visual information is sensitive to illumination, shadow, background, and occlusion. Besides, if impostors possess a photograph and/or a speech recording of a registered client, then they can easily break through a unimodal system [19]. It has been shown in [20] that a multimodal biometric system, which is based on multiple biometric traits, performs better than a unimodal biometric system.

However, the level of difficulty required to acquire some biometric traits (e.g., voice, face and gait) may be lower than for others (e.g., DNA and retinal image). Hence, some multimodal biometric systems (e.g., audio-visual) are more cost effective and more easily deployable in a real-life scenario than others. For example, data acquisition for an audio-visual biometric system can be carried out using the camera and microphone built into a mobile phone. In human speech perception, face visibility is crucial because the visual signal is correlated with the audio signal [21-24] and also includes complementary information [24, 25]. Therefore, an audio-visual biometric system is a natural choice for recognizing persons and can achieve a better recognition rate compared to a unimodal biometric system [26-29].

The fusion of information is an important step for all multimodal biometric systems. Fusion techniques are developed to achieve a better classification result by reducing one or more of the following: the False Accept Rate (FAR), the False Reject Rate (FRR), the Failure to Enrol Rate (FTE), or the susceptibility to artefacts [30]. The final decision about a claimed identity is reached by fusing the features/outcomes from each modality. Fusion can be carried out at three different levels [31]:

- Feature level: Feature vectors are extracted from the sensed data and concatenated into a single feature vector.
- Score level: The proximity of a feature vector to a template vector is represented as a matching score. In score-level fusion, the matching scores from all sub-systems are combined.
- Decision level: The ranked lists from all sub-systems are combined.

Fusion at the feature level is challenging because the features extracted from the inputs of a biometric system may not be compatible. Apart from this, the relationship between different feature spaces may be difficult to learn. Therefore, score- or decision-level fusion is used by most multimodal biometric systems. The matching scores from a classifier also provide good insight into the quality of the inputs, given that the matcher's decision-making ability is strong under normal circumstances [32]. Therefore, the development of a robust late fusion approach based on statistics calculated from the matching scores constitutes a significant contribution to the field of biometrics.

Recent advancements in computer hardware and machine learning algorithms have triggered interest in the development of biometric systems using deep neural networks (DNNs). These networks can be discriminatively trained by backpropagating the derivative of the mismatch between the target outputs and the actual outputs [33]. Recently, the learning of deep networks for tasks such as speech, vision and language processing has gained in popularity [34]. A number of speaker recognition systems using deep networks have been proposed in [2], [35-37].

Therefore, developing an audio-visual biometric system using DNNs is of great research interest. Since the matching scores from a DNN represent class posterior probabilities, developing an entropy-based score-level fusion approach will also be a significant contribution. Apart from this, obtaining joint features using a multimodal deep model has gained more attention in recent years [38-40]. An audio-visual biometric system using such joint features generated by a deep multimodal model will be a useful study.

1.2 Problem Statement

The research problem of this thesis can be stated as follows. Is it possible to develop a robust approach for: a) late fusion (score-level and rank-level) using a statistical measure calculated from the matching scores, or b) early fusion (feature-level) using a multimodal deep model? The answer is yes, and this thesis outlines how this can be done.

1.3 Research Contributions

The major contributions of this thesis are as follows:

- An audio-visual person identification system is presented using a Linear Regression-based Classifier (LRC) for face recognition, as in [41], and a novel LRC-GMM-UBM approach for speaker identification. (Chapter 3)
- A novel reliability measure is proposed using the ratio of the mean of the top k ranked scores to the overall mean (of all matching scores). A fusion approach is then proposed based on the following observation: in the presence of noise or a poor representation of the inputs, the best-match (rank-1) identities from the sub-systems of a biometric system may not be the same. Therefore, if there is a disagreement about the top-ranked identity, it is proposed that the fusion weights are adapted so that a higher weight is assigned to the more reliable sub-system. (Chapter 4; an illustrative sketch of this reliability-weighted fusion is given after this list.)
- A novel C-ratio is proposed using the ratio of the confidence measure associated with a classifier obtained at the test phase to the maximum value of the confidence measure for that classifier obtained at the training phase. A transformation of the matching scores is then performed using the C-ratio, and the final decision is ruled in favour of the model/template achieving the highest fused score. (Chapter 5)

- A confidence factor is used with the highest rank fusion rule so that the following conditions are met: a) the ranks from the most confident classifier get the highest priority, and b) a possible tie between the final ranks of two different classes is avoided. (Chapter 5)
- The Borda count rule of rank-level fusion accounts for the variability in the ranks due to the use of multiple classifiers. This fusion rule assumes that the classifiers are statistically independent and perform equally well. In practice, a particular classifier may perform poorly for various reasons. Hence, a modification to the Borda count fusion rule is proposed to incorporate the confidence measure of the classifiers. (Chapter 5)
- Deep neural networks have recently been used in the image, audio and speech processing areas [33]. The use of a DBM-DNN [42], a discriminative deep neural network (DNN) initialized with the generative weights of a deep Boltzmann machine (DBM), for audio-visual person recognition is investigated. (Chapter 6)
- The use of DBM-DNNs for i-vector based audio-visual person recognition is proposed. Experimental results show that a higher accuracy can be achieved with the DBM-DNNs compared to the state-of-the-art DBN-DNN approach [33] and a cosine distance based approach [43]. (Chapter 6)
- Since the outputs of a DBM-DNN represent class posterior probabilities, the entropy of the posteriors can be used as a reliability measure. However, the bias problem of the existing entropy-based reliability-to-weight mapping functions was discussed in [44]. A novel reliability-to-weight mapping function is proposed which overcomes the bias issue and contributes to the performance boost of the DBM-DNNs in both clean and noisy conditions. (Chapter 6)
- A joint deep Boltzmann machine (jdbm) model is proposed to perform feature-level fusion. The activation probabilities of the units in the shared layer are used as joint features and logistic regression is used to perform audio-visual person identification. (Chapter 7)
- A novel three-step algorithm is proposed to train the jdbm model. In the first step, unimodal DBMs corresponding to the speech and face modalities are trained. In the second step, the shared layer parameters of the jdbm are pretrained using a joint Restricted Boltzmann Machine (jrbm) model. Finally, in the third step, the jdbm is initialized with the parameters of the unimodal DBMs and the jrbm, and then fine-tuned for a small number of iterations. (Chapter 7)
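The reliability measure and adaptive weighting described in the Chapter 4 contribution above can be illustrated with a minimal sketch. The exact reliability-to-weight mapping is developed in Chapters 4 to 6; the top-k ratio, the weight normalisation and all names below are illustrative assumptions only, and the sketch applies the weighting unconditionally, whereas the thesis adapts the weights only when the sub-systems disagree on the rank-1 identity.

```python
import numpy as np

def reliability(scores, k=5):
    """Ratio of the mean of the top-k matching scores to the mean of all
    scores (higher similarity scores are assumed to indicate better matches)."""
    ranked = np.sort(scores)[::-1]          # scores sorted in descending order
    return ranked[:k].mean() / scores.mean()

def adaptive_sum_fusion(audio_scores, visual_scores, k=5):
    """Weighted-sum score fusion where each modality's weight is derived
    from its reliability; the weights are normalised to sum to one."""
    r_a = reliability(audio_scores, k)
    r_v = reliability(visual_scores, k)
    w_a, w_v = r_a / (r_a + r_v), r_v / (r_a + r_v)
    fused = w_a * audio_scores + w_v * visual_scores
    return int(np.argmax(fused)), (w_a, w_v)   # rank-1 identity and the weights

# Toy example with 10 enrolled clients (scores assumed min-max normalised).
rng = np.random.default_rng(0)
audio = rng.random(10)
visual = rng.random(10)
print(adaptive_sum_fusion(audio, visual, k=3))
```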

1.4 Structure of the Thesis

This thesis is organized as a series of papers published in internationally recognized journals, books and conferences. Each paper contributes independently to audio-visual biometrics with minor overlaps; however, together these papers contribute towards a coherent theme for audio-visual biometrics. Chapters 2 to 7 correspond to publications [45], [46], [32], [47], [48] and [49], respectively. In Chapter 2, a comprehensive background of the existing audio-visual biometric systems in both controlled (e.g., data captured in an office environment) and uncontrolled (e.g., data captured using mobile phones) environments is presented. The core of this thesis is laid out in Chapters 3 to 7. In Chapter 3, the LRC is used for both sub-systems of an audio-visual person identification system. A reliability measure for fusing the sub-systems is proposed in Chapter 4; the impacts of noisy inputs on the matching scores and on the ranking of identities are also analysed in that chapter. Then, in Chapter 5, a complete late fusion framework is presented which includes both score-level and rank-level fusion approaches. In Chapter 6, a DBM-DNN is used as a classifier for both sub-systems. Because the matching scores from the DBM-DNNs represent class posterior probabilities, an entropy-based score-level fusion technique is proposed. In Chapter 7, a jdbm model is proposed to generate joint features given audio and/or visual input(s) to the model. The thesis is concluded in Chapter 8, which includes a summary of the contributions and suggestions for future work. A more detailed overview of each chapter is presented below.

1.4.1 A Review of Audio-Visual Biometric Systems (Chapter 2)

This chapter starts with a general and brief introduction to biometrics. A detailed description of the functions of an audio-visual biometric system is then presented. A review of the existing audio-visual biometric systems evaluated in both controlled and uncontrolled environments is also included in this chapter. A preliminary discussion on the use of DNNs for biometrics is presented.

1.4.2 Linear Regression-based Classifier for Audio-visual Person Identification (Chapter 3)

This chapter presents an audio-visual person identification system using an LRC for each sub-system. First, class-specific models (dictionaries) are built by stacking q-dimensional feature vectors (extracted from the speech and face image) obtained from the training data. A template for a client is estimated by expressing a test feature vector as a linear combination of all the training feature vectors from that client.

The Euclidean distance between a test feature vector and the template is used as a matching score. The matching scores from a classifier are normalized using min-max normalization and are combined using the sum rule of fusion.

1.4.3 A Reliability-based Score-level Fusion of Linear Regression-based Classifiers (Chapter 4)

This chapter presents a reliability measure using both the ranked list and the matching scores obtained from a classifier. The estimated reliability measures are mapped into corresponding modality weights using a novel mapping function. Then, an adaptively weighted summation fusion approach is proposed to fuse the sub-systems at the score level.

1.4.4 A Confidence-Based Late Fusion Framework (Chapter 5)

This chapter presents a confidence-based late fusion framework for an audio-visual person identification system. In the proposed approach, a confidence measure for each classifier is calculated from its matching scores. Then, a transformation of the matching scores is performed using a novel C-ratio. Moreover, modifications to the highest rank and the Borda count rank fusion rules are proposed to incorporate the confidence measure of the classifiers.

1.4.5 Audio-visual Person Recognition using Deep Neural Networks and a Novel Reliability-based Fusion (Chapter 6)

In this chapter, an audio-visual person recognition system is presented using DNNs initialized with the parameters of corresponding DBMs, also referred to as DBM-DNNs.

32 8 Introduction terms of identification accuracy of the proposed system with the state-of-the art support vector machine (SVM), deep belief network (DBN) and the deep auto-encoder (DAE) is presented. The robustness of the joint features generated by the proposed jdbm model are also tested against a missing modality and different levels of the JPEG compression and acoustic babble noise present in the face image and speech, respectively.

Chapter 2

A Review of Audio-Visual Biometric Systems

(Note: this article will be published as Chapter 9 of the forthcoming IET book "Mobile Biometrics", edited by Guodong Guo and Harry Weschler, with the title "Deep Neural Networks for Mobile Person Recognition using Audio-Visual Signals". The content of Section 2.3, which is not included in the original article, has been added in this thesis chapter to review the existing audio-visual biometric systems evaluated in controlled environments.)

Abstract

This chapter starts with a general and brief introduction to biometrics and to audio-visual biometric systems. This is followed by a review of the existing audio-visual biometric systems evaluated in a controlled environment. Then, a review of the existing speaker and face recognition systems which have been evaluated on a mobile biometric database (uncontrolled environment) is presented. We then discuss the key motivations for using DNNs for person recognition. We finally introduce a DBM-DNN based framework for person identification.

2.1 Biometric systems

2.1.1 What is biometrics?

The term "biometrics" refers to the automatic recognition of persons using their behavioural and/or physiological data (e.g., fingerprint, retina, facial image, speech, or their combination). A biometric recognition system compares the features extracted from input biometric samples (e.g., the face image) with pre-stored client model(s). This has been proven to be a more secure option than the traditional knowledge-based or token-based recognition approach, since knowledge or tokens, such as a personal identification number (PIN), a password or an ID card, may be lost, forgotten, fabricated or stolen.

A biometric system operates in one of the following two modes: identification or verification. A biometric "verification" system may be deployed in various real-life applications, e.g., account holder verification for automatic teller machines (ATMs) or online payments, and customs and border protection. A biometric recognition system operates in the "identification" mode, e.g., when an unidentified person is unlikely to claim the identity of one of the registered clients. For example, crime scene and forensic investigations and access control systems operate in the "identification" mode.

Decision making in the identification mode is determined by two cases: a) closed-set and b) open-set. In closed-set identification, it is assumed that an unidentified person is already registered in the database and he/she is classified as one of the N registered persons (i.e., a one-to-N comparison). In open-set identification, however, the system tries to determine whether the unknown person is registered in the database or not. Thus, the task is considered to be an (N + 1)-class identification problem including a reject class (i.e., a one-to-(N+1) comparison). In the verification mode (and in contrast to both forms of identification), the system tries to determine whether a person is who he/she is claiming to be (i.e., a one-to-one comparison). An unknown person's claim is accepted only if the matching score for the client model corresponding to the claimed identity is above a pre-determined threshold.

2.1.2 Multimodal biometrics

A number of biometric traits are used for person recognition (see Fig. 2.1). The choice of a particular trait (or biometric) depends on a number of factors, including: the uniqueness of the trait, the cost and size of the data acquisition sensor, the robustness of the acquired data to noise and various other nuisances, and how easy it is for an impostor to fake the biometric. A categorization of the commonly used biometric traits, based on the level of cooperation required from the user during data acquisition, is shown in Fig. 2.1.

[Figure: high-acceptability traits (ear, face, gait, signature, voice) require low user cooperation; medium-acceptability traits (fingerprint, hand geometry, hand vein, keystroke) require medium cooperation; low-acceptability traits (DNA, iris, retina) require high cooperation.]
Fig. 2.1 Commonly used biometric traits and the level of cooperation required from the end users during data acquisition.
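The closed-set identification, open-set identification and verification decision rules described in Section 2.1.1 above can be summarised in a minimal sketch. The score values, the reject threshold and the function names below are illustrative assumptions, not part of the systems reviewed in this chapter.

```python
import numpy as np

def closed_set_identify(scores):
    """Closed-set identification: assign the probe to the best-matching
    of the N registered clients (one-to-N comparison)."""
    return int(np.argmax(scores))

def open_set_identify(scores, threshold):
    """Open-set identification: as above, but return -1 (the reject class)
    if even the best match falls below a threshold (one-to-(N+1) comparison)."""
    best = int(np.argmax(scores))
    return best if scores[best] >= threshold else -1

def verify(scores, claimed_id, threshold):
    """Verification: accept the claim only if the score for the claimed
    client model exceeds a pre-determined threshold (one-to-one comparison)."""
    return bool(scores[claimed_id] >= threshold)

scores = np.array([0.10, 0.75, 0.40])   # similarity to each of N = 3 clients
print(closed_set_identify(scores),
      open_set_identify(scores, threshold=0.8),
      verify(scores, claimed_id=1, threshold=0.6))
```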

[Figure: the speech input passes through voice activity detection and the video input through face detection and facial image extraction; acoustic and visual features are then extracted and scored by modality-specific classifiers, and the speaker and face recognition outputs are fused for audio-visual person recognition.]
Fig. 2.2 Block diagram of an audio-visual biometric recognition system.

If a recognition system uses a single biometric trait, then it is known as a unimodal system. The decision about the identity of a person is made using a single classifier. However, the captured biometric sample may be of poor quality due to (e.g., in the case of face recognition) variations in pose, illumination and background. Some biometric traits (e.g., voice and signature) are also vulnerable to spoofing. Under these challenging conditions, a unimodal system may produce unreliable recognition results, because it relies on a single input modality. A multimodal biometric system makes decisions by fusing information from multiple modalities at different levels: e.g., fusion at the outcome level of each modality (i.e., score-level fusion) or by using a single system with concatenated inputs (i.e., feature-level fusion). Such systems are considered more robust than their unimodal counterparts. In [50], the following advantages of a multimodal biometric system were reported: a) it addresses the issue of non-universality, b) it facilitates the indexing and filtering of large-scale biometric databases, c) spoofing attacks become increasingly difficult, and d) it addresses the problem of noisy data by using a quality measure of the sensed data during the fusion process. Furthermore, the level of difficulty required to acquire some biometric traits (e.g., voice, face and gait) may be lower than for others (e.g., DNA and retinal image). Therefore, some multimodal biometric systems (e.g., audio-visual) are more cost effective and more easily deployable in a real-life scenario than others (e.g., DNA + retinal image, or fingerprint + palm print). The following sections of this chapter will therefore focus on a standard audio-visual biometric recognition system (see Fig. 2.2).

2.2 Audio-Visual biometric systems

An audio-visual biometric system is designed to recognise unknown persons using their voices and facial images. Such a system (see Fig. 2.2) is generally composed of the following parts: a) a speaker sub-system and b) a face sub-system. Both sub-systems can be fused at the score level, meaning that their matching scores are combined using an appropriate fusion rule. These modalities can also be fused at the feature level or data level. The functions of a standard audio-visual biometric recognition system are briefly described below.

2.2.1 Preprocessing

The captured audio and visual signals are commonly preprocessed before any meaningful features can be extracted. This stage enhances the quality of the captured data and is critical in improving the overall recognition accuracy of the system.

Silence removal: Although the type of preprocessing required depends on the features to be extracted (which are then used for classification), speech enhancement is used by most speaker recognition systems. It is usually performed with the help of a voice activity detection (VAD) algorithm, e.g., [51], which removes any non-voice segments from the speech signals.

Face tracking: Tracking by detection is performed on the video frames, based on the detection of the face from facial features, e.g., the eyes and nose. The Viola-Jones algorithm [52] is the most commonly used algorithm for face detection due to its extremely fast feature computation, efficient feature selection, and scale- and location-invariant detection. Once a face part is detected, a region of interest (ROI) representing that face part is cropped. The ROI can represent various face parts, such as the eyes, nose, or mouth (e.g., [4], [53] and [54]), or the whole face (e.g., [28] and [55-58]). Then, this ROI is photometrically normalized (e.g., using the Tan-Triggs algorithm [59]) to minimize the effects of illumination and background variations.
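A minimal sketch of the visual preprocessing step described above is given below, using OpenCV's Viola-Jones (Haar cascade) face detector. The crop size, the use of histogram equalisation as a simple stand-in for the Tan-Triggs photometric normalisation, and the file name are assumptions made purely for illustration.

```python
import cv2

def extract_face_roi(frame_bgr, size=(64, 64)):
    """Detect the largest face in a video frame, crop it as the ROI, and
    apply a simple photometric normalisation (histogram equalisation here;
    the thesis uses the Tan-Triggs algorithm instead)."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None                                       # no face in this frame
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])    # keep largest detection
    roi = cv2.resize(gray[y:y + h, x:x + w], size)
    return cv2.equalizeHist(roi)

# Example usage on a single frame read from disk (illustrative file name).
frame = cv2.imread("frame_0001.png")
if frame is not None:
    face = extract_face_roi(frame)
```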

2.2.2 Feature extraction

Features are extracted from the preprocessed audio and visual signals. The choice of a particular type of feature depends on factors such as the type of application, the classifier, and the required robustness to channel distortion and session variability. For example, for speaker recognition it is important to extract features that capture speaker-specific characteristics.

Acoustic features

A speech utterance is first divided into overlapping fixed-duration segments (ideally between 20 ms and 30 ms), also known as frames. A frame is ideally generated every 10 ms and multiplied by a window function to smooth the effect of using finite-size segments. The following features are used by most speaker recognition systems:

Mel-frequency cepstral coefficients (MFCCs): A Fast Fourier Transform (FFT) is applied to each frame to obtain a set of P complex spectral values. These coefficients are then converted to Q Mel-spaced filter bank values, such that Q << P, using triangular filter banks spaced according to the Mel scale. Since human hearing exhibits logarithmic compression, the log of each filter bank output is taken. The final step is to convert these Q log filter bank values into R cepstral coefficients using the Discrete Cosine Transform (DCT). Ideally, R = 13 cepstral coefficients, including c_0 which represents the average log-power, are extracted from each frame. The first- and second-order derivatives of these coefficients are appended to form a 39-dimensional MFCC + delta + acceleration feature vector per frame of an utterance (a minimal extraction sketch is given at the end of this subsection). MFCCs are typically used in GMM-UBM based systems, where a client's Gaussian Mixture Model (GMM) [60], represented as λ_j, is built by adapting a global GMM, also known as the Universal Background Model (UBM). The feature vectors extracted from all the frames of the enrolment samples from a client j are used to adapt the UBM.

Gaussian mean supervector (GMS): A GMS is defined as a large vector of length N·M formed by concatenating the N-dimensional means from each of the M mixtures of a GMM. If the UBM is adapted using the features (e.g., MFCCs) from all the frames of a single utterance, one can obtain a GMM for that utterance [61]. The means, µ_i, of the components of the utterance-specific GMM are obtained by adapting the means of the UBM. Therefore, a GMS representing an utterance x is formed by µ_x = [µ_1 µ_2 ... µ_M]. GMSs can be used to build linear models of the clients. In addition, the recent variability modelling techniques [43], [62] and [63] are based on the GMS subspace.

Total Variability Modelling: Inter-session variability (ISV) [62] and joint factor analysis (JFA) [63] are the session variability modelling techniques widely used in the literature. Both methods assume that the observations, x_{i,j}, from the i-th sample of client j are drawn from the following distribution:

    µ_{i,j} = m + U f_{i,j} + d_i                                          (2.1)

where U is the low-dimensional session variability subspace trained with an expectation maximisation (EM) algorithm and f_{i,j} are latent factors, which are set with a standard prior [62]. ISV and JFA differ in their definition of the client-dependent offset. Given latent factors z_i and y_i, the offset is expressed as d_i = D z_i in ISV, where D is a function of the UBM covariance representing the between-class variability subspace. In JFA, however, the offset is expressed as d_i = V y_i + D̂ z_i, where V is a low-dimensional within-class variability subspace. Both V and D̂ are learnt using the EM algorithm. Due to the high dimensionality of the GMS space, JFA can fail to separate between-class and within-class variation into two different subspaces. The Total Variability Modelling (TVM) [43] method was proposed to overcome this issue by treating each sample in the training set as if it comes from a distinct client. The TVM training process assumes that the i-th sample of client j can be represented by a GMS:

    µ_{i,j} = m + T w_{i,j}                                                (2.2)

where T is the low-dimensional total variability matrix and w_{i,j} is the identity vector, commonly referred to as the i-vector representation of the utterance. Cosine similarity scoring [43] and probabilistic linear discriminant analysis (PLDA) [64] are commonly used for evaluating i-vector based systems.
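The 39-dimensional MFCC + delta + acceleration front-end described at the start of this subsection can be sketched with librosa; the 25 ms frame length, 10 ms hop and file name are illustrative assumptions.

```python
import librosa
import numpy as np

def mfcc_features(wav_path):
    """Return a (39, T) matrix of MFCC + delta + acceleration features,
    one column per frame of the utterance."""
    y, sr = librosa.load(wav_path, sr=None)            # keep the native sample rate
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=13,                         # 13 cepstra including c0
        n_fft=int(0.025 * sr),                         # 25 ms frames
        hop_length=int(0.010 * sr))                    # 10 ms hop
    delta = librosa.feature.delta(mfcc)                # first-order derivatives
    delta2 = librosa.feature.delta(mfcc, order=2)      # second-order (acceleration)
    return np.vstack([mfcc, delta, delta2])            # shape (39, T)

features = mfcc_features("utterance.wav")              # illustrative file name
print(features.shape)
```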

Visual features

Visual features are extracted from the normalized facial images. Unlike speech utterances, the dimensions of the facial images are fixed. Therefore, some recognition systems directly use the pixel intensity values as features. Other systems use more sophisticated hand-crafted features extracted from local segments of the image samples. The following feature types are commonly used by face recognition systems:

Appearance-based: The transform vectors of the Region Of Interest (ROI) pixel intensities are used as appearance-based features. The transform techniques used include PCA, LDA, DCT, and wavelet transforms. These are computationally inexpensive and can be extracted dynamically in real time. As pixel intensities are used for the computation, the quality of appearance-based features degrades under intense head-pose and lighting variations. If a gray-scale image of size a × b is represented by x ∈ R^{a×b}, then the image can be transformed to a feature vector such that x ∈ R^{a×b} → x̂ ∈ R^{q×1}, where q = a·b. This type of feature is suitable for classifiers which create dictionaries or templates of the clients and use the Euclidean distance between a test feature vector and a response vector obtained from the dictionaries/templates for scoring.

Parts-based: In this approach, each face image is divided into regions and features are then extracted from each region. Examples of such features include the local binary patterns (LBP) presented in [65]. In its simplest form, LBP_{8,1} (with a radius of 1 pixel and 8 sampling points), patterns are extracted from each pixel of an image by thresholding the values of the 3 × 3 neighbourhood of the pixel against the central pixel. The extracted patterns are then interpreted as binary numbers. An LBP operator with a radius of 2 pixels, 8 sampling points and uniform patterns (i.e., patterns that contain at most two 0-1 or 1-0 transitions when viewed as a circular bit string) is represented as LBP^{u2}_{8,2} and is commonly used in face recognition research. Typically, a histogram of LBP codes is computed in each region after applying the LBP^{u2}_{8,2} operator to the face image. Since only 58 of the 256 possible 8-bit patterns are uniform, the LBP^{u2}_{8,2} operator provides a compact representation when building the LBP histograms. The histograms obtained from all the regions of a face image are concatenated to form a feature vector.
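A minimal sketch of the parts-based LBP^{u2}_{8,2} feature described above is given below, using scikit-image. The 64 × 64 input size and the 8 × 8-pixel block grid are assumptions chosen to match the 3712-dimensional example mentioned in the caption of Fig. 6.2 (64 blocks × 58 uniform bins); scikit-image's 'nri_uniform' method adds one extra bin that collects all non-uniform codes, so the sketch produces 59 bins per block.

```python
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_face_features(gray_face, block=8):
    """Concatenated block histograms of uniform LBP codes (radius 2,
    8 sampling points). Bins 0-57 are the 58 uniform patterns; bin 58
    collects all non-uniform codes."""
    codes = local_binary_pattern(gray_face, P=8, R=2, method="nri_uniform")
    n_bins = 59
    hists = []
    for i in range(0, gray_face.shape[0], block):
        for j in range(0, gray_face.shape[1], block):
            patch = codes[i:i + block, j:j + block]
            hist, _ = np.histogram(patch, bins=n_bins, range=(0, n_bins))
            hists.append(hist)
    return np.concatenate(hists)

# Illustrative usage on a random 64 x 64 "face" (64 blocks x 59 bins = 3776 dims).
face = np.random.randint(0, 256, (64, 64)).astype(np.uint8)
print(lbp_face_features(face).shape)
```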

TVM for face recognition: Recently, the TVM method has also been used for extracting i-vectors from the visual modality inputs [9]. First, feature vectors are extracted from overlapping blocks in the normalized facial images. Feature extraction is carried out by extracting c × c blocks of pixels using an exhaustive overlap. If an image is of size a × b, then a total of k = (a − c + 1)(b − c + 1) blocks are extracted from that image. In [66], different values of c were evaluated and it was shown that a lower error rate can be achieved when c is set to 12. The pixel values of each block are normalized to zero mean and unit variance. Then, the l + 1 lowest-frequency 2D Discrete Cosine Transform (2D-DCT) [67] coefficients are extracted from each block, removing the zero-frequency coefficient. The value of l is commonly set equal to 44. The resultant 2D-DCT coefficients are also normalized to zero mean and unit variance in each dimension with respect to the other feature vectors obtained from the same image. The feature vectors obtained from all the training images can be used to build the UBM. Then, the TVM method (Eq. 2.2) is applied to extract an i-vector given the observations of an image.

2.2.3 Classification

The role of a classifier is to build the client models and to score them in the test phase. In this section, a brief description of two of the most widely used classifiers in recent years is presented.

Support vector machine (SVM)

The SVM is a two-class classifier which fits a separating hyperplane between the two classes. During the training phase, an optimal hyperplane is chosen such that it maximizes the Euclidean distance to the nearest data points on each side of the plane. The nearest data points on either side of the hyperplane are known as support vectors. An SVM classifier is constructed from a sum of kernel functions [68]:

    S(x) = Σ_{i=1}^{L} α_i c_i K(x, x_i) + d                               (2.3)

where α_i is the Lagrange multiplier associated with the i-th support vector, c_i is the corresponding classification label (i.e., c_i = +1 if x_i belongs to the user and c_i = −1 if x_i belongs to an impostor), d is a learned constant, L is the number of support vectors, Σ_{i=1}^{L} α_i c_i = 0, α_i > 0, and the x_i are the support vectors obtained from the training set.

A kernel that is constrained to satisfy the Mercer condition can be expressed as:

    K(x, y) = b(x)^T b(y)                                                  (2.4)

where b(x) is the mapping from the input space to a higher-dimensional separating space. Finally, a verification decision is made based on whether the value of S(x) is above or below a threshold.

Although the SVM classifier is suited to verification, multi-class identification can be performed by designing a one-against-all (OAA) SVM for each of the N registered clients. For example, the SVM for an unknown person claiming to be a registered client j is a two-class system, where class 0 represents the training data from client j and class 1 represents the training data from all the other (N − 1) clients. Identification is thus performed by carrying out a total of N two-class classifications and selecting the SVM with the maximum decision function value. The functional operation of the OAA SVM classifier for a client j can be expressed as:

    S_j(x) = Σ_{i=1}^{L_j} α_i^j c_i^j K(x, x_i^j) + d^j                   (2.5)

where the {α_i^j, x_i^j} and d^j are obtained from the SVM optimization algorithm. Rank-1 identification is performed by determining the SVM classifier with the maximum decision value:

    ĵ = arg max_{1 ≤ j ≤ N} S_j(x)                                         (2.6)
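A minimal sketch of the one-against-all SVM identification scheme described above, using scikit-learn; the toy data, feature dimension and linear kernel are illustrative assumptions.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Toy enrolment set: 3 clients, 5 training vectors each, 20-dimensional features.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(15, 20))
y_train = np.repeat([0, 1, 2], 5)

# One binary SVM per registered client (client vs. all other clients).
oaa_svm = OneVsRestClassifier(SVC(kernel="linear")).fit(X_train, y_train)

# Rank-1 identification: pick the client whose SVM gives the largest
# decision-function value for the probe (Eq. 2.6).
probe = rng.normal(size=(1, 20))
decision_values = oaa_svm.decision_function(probe)     # shape (1, n_clients)
rank1_identity = int(np.argmax(decision_values))
print(rank1_identity)
```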

Fusion

The fusion of multiple modalities may be carried out at an early stage (i.e., feature fusion) or at a late stage (i.e., score- or rank-level fusion). Since feature-level fusion may result in large feature vectors, which slows down the recognition process, fusion at the score level is commonly adopted in most biometric systems. If the matching scores obtained from multiple sub-systems are heterogeneous (e.g., a posterior probability and a Euclidean distance), many fusion rules first transform them into a common domain before fusion can take place. This process of transforming the scores is known as score normalization [69]. Once multiple score sets are normalized, they can be combined into one score using the following fusion approaches.

Non-adaptive fusion

If the weights of the sub-systems are not adjusted based on some specific criterion, the approach is referred to as non-adaptive fusion. This approach may be used if it is assumed that the data from both modalities are clean (Chapter 3 of this thesis). Fusion is performed by combining the matching scores obtained from the classifiers of the sub-systems using an appropriate rule (e.g., the product, sum, min, or max rule). A theoretical framework of the rule-based fusion methods is presented in [20]. If an input vector x_m is presented to the m-th classifier, then the matching score for the j-th class is P(\lambda_j | x_m), which is the probability of x_m belonging to class j. Let c \in \{1, 2, ..., N\} be the class to which the input is finally assigned. The identity c can be determined using the following rules:

Product rule:
c = \arg\max_j \prod_{m=1}^{M} P(\lambda_j | x_m)    (2.11)

Sum rule:
c = \arg\max_j \sum_{m=1}^{M} P(\lambda_j | x_m)    (2.12)

Max rule:
c = \arg\max_j \max_m P(\lambda_j | x_m)    (2.13)

Min rule:
c = \arg\max_j \min_m P(\lambda_j | x_m)    (2.14)

Adaptive fusion

In a real-life scenario, data from a modality may contain noise due to variations in the input data (e.g., illumination, background and pose variations in the face images; additive noise, channel distortion and session variability in the speech samples). Therefore, an adaptive fusion approach becomes essential, where the sub-systems are weighted based on a criterion such as the quality of the sensed data or some statistics of the matching score distribution (Chapter 4 to Chapter 6 of this thesis). The following adaptive rules are used to determine c (a code sketch covering both the non-adaptive and adaptive rules is given after these definitions):

Adaptive product rule:
c = \arg\max_j \prod_{m=1}^{M} P(\lambda_j | x_m)^{\alpha_m}    (2.15)

Adaptive sum rule:
c = \arg\max_j \sum_{m=1}^{M} \alpha_m P(\lambda_j | x_m)    (2.16)

where \alpha_m is the weight assigned to the classifier of the m-th modality and \alpha_1 + \alpha_2 + ... + \alpha_M = 1. The statistics of the matching score distributions are commonly used in the literature to calculate the modality weights, for example, the dispersion of the log-likelihoods [70], the entropy of the posteriors [71] and the C-ratio [47] [72].
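The sketch below illustrates the non-adaptive and adaptive rules of Eqs. 2.11-2.16, assuming NumPy; the function name and the equal-weight default are illustrative assumptions.

```python
# Score-level fusion of M classifiers over N classes. `scores` is an (M x N)
# array in which row m holds the normalized per-class scores P(lambda_j | x_m).
import numpy as np

def fuse(scores, rule="sum", weights=None):
    scores = np.asarray(scores, dtype=float)
    M, _ = scores.shape
    weights = np.full(M, 1.0 / M) if weights is None else np.asarray(weights, float)
    if rule == "product":                 # Eq. 2.11 (weighted form: Eq. 2.15)
        fused = np.prod(scores ** weights[:, None], axis=0)
    elif rule == "sum":                   # Eq. 2.12 (weighted form: Eq. 2.16)
        fused = np.sum(weights[:, None] * scores, axis=0)
    elif rule == "max":                   # Eq. 2.13
        fused = scores.max(axis=0)
    elif rule == "min":                   # Eq. 2.14
        fused = scores.min(axis=0)
    else:
        raise ValueError("unknown rule")
    return int(np.argmax(fused)), fused   # identity index and fused scores
```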

Linear logistic regression (logreg) fusion

The logreg fusion approach combines a set of M classifiers using the sum rule. Let f_m(x_m, \lambda_j) represent the score assigned by classifier m \in \{1, ..., M\} between its input vector x_m and the j-th client model, \lambda_j. The fused score for the j-th client model is then obtained using a linear combination:

f_\beta(x_m, \lambda_j) = \beta_0 + \sum_{m=1}^{M} \beta_m f_m(x_m, \lambda_j)    (2.17)

where \beta = [\beta_0, \beta_1, ..., \beta_M] are the fusion weights, also known as the regression coefficients. These coefficients are calculated by using the maximum likelihood estimation of the logistic regression model on the scores of a development set [8]. Let Y_{cli} be the set of pairs y = \{x_m, \lambda_j\} where the identity of the test sample and that of the client are the same. Similarly, let Y_{imp} be the set of pairs z = \{x_m, \lambda_j\} where the identity of the test sample is different from the identity of the client. Then, the objective function to maximize is:

L(\beta) = - \sum_{z \in Y_{imp}} \log(1 + \exp(f_\beta(z))) - \sum_{y \in Y_{cli}} \log(1 + \exp(-f_\beta(y)))    (2.18)

The maximum likelihood procedure converges to a global maximum. The optimization is carried out using the conjugate-gradient algorithm [73]. This fusion approach has been used to combine heterogeneous speaker [74] [75], face [76] and bimodal recognition systems [9] [77].

Audio-visual corpora

There exist only a few audio-visual databases that are freely available for biometrics research. This is because the field is still fledgling and the data collection process poses challenges such as the synchronization between the audio and video data, storage, and the privacy of the subjects. Existing audio-visual databases vary in size, in the types of recorded utterances and in the recording environment. In this section, a brief review of existing audio-visual databases is presented.

BANCA

The BANCA database [78] was captured in four languages (English, French, Italian and Spanish) using high- and low-quality microphones and cameras. The subjects were recorded over a period of three months and in three different scenarios: controlled, degraded and adverse. There are 208 recorded subjects with an equal number of males and females. Each subject recorded 12 sessions with 2 recordings (1 true client access and 1 informed impostor attack) per session. The subjects were asked to say a random 12-digit number, their name, address and date of birth. A cheap analogue web cam and a high-quality camera were used to record the video data. Similarly, both cheap and high quality microphones were

45 2.2 Audio-Visual biometric systems 21 Fig. 2.3 A sequence of rotating head movements included in the M2VTS, XM2VTS and VidTIMIT databases. (Examples are taken from XM2VTS) used to capture the audio data. Thus, the BANCA database contains realistic and challenging conditions and allows for robustness comparison of different systems. M2VTS and XM2VTS The M2VTS [79] contains audio-visual data of 37 subjects, uttering digits 0 to 9. The data was captured in five sessions, with a gap of at least one week between the sessions. An extended M2VTS (XM2VTS) database was created with the audiovisual data of 295 subjects [80]. In XMT2VTS, data collection was carried out in four sessions uniformly distributed over a period of five months, in order to capture the variabilities that are due to the changes in appearance, mood and physical conditions. The database was acquired using a Sony VX1000E digital cam-corder and DHR1000UX digital VCR. The subjects were asked to read through twice three sentences written on a board positioned just below the camera. The subjects were also asked to rotate their head while they were being recorded (refer to Fig. 2.3 for the sequence head movements). VidTIMIT The VidTIMIT [81] database consists of videos (with speech) of 43 subjects (19 females and 24 males) reciting short sentences and rotating their heads under facial expressions. The sentences were selected from the test portion of the TIMIT corpus. This database is useful for lip reading and multi-view face and speech/speaker recognition research. A broadcast quality digital video camera was used to record the subjects in the database. There are 10 videos per subject collected in 3 sessions. The first six videos were captured in Session 1, the next two in Session 2 and the other two in Session 3. There was an average delay of seven days between Session 1 and 2 and 6 days between Session 2 and 3. The first two sentences for all subjects

46 22 A Review of Audio-Visual Biometric Systems Table 2.1 AusTalk subjects represent variations in geography, dialect and emotion. State Capitals Subjects Regional Subject Other Subjects Total NSW UNSW Emotion USYD Disordered Armidale (UNE) Bathurst (CSU) QLD UQ 120 Townsville (UQ) VIC UMELB 120 Castlemaine (UMELB) SA Flinders NT - - Darwin (CDU) 24 Aboriginal English (CDU) Alice Spring (CDU) 24 WA UWA TAS UTAS ACT UC ANU Total Table 2.2 AusTalk data collection protocol. (Time in minutes) Session 1 Session 2 Session 3 Task Time Task Time Task Time Opening Yes/No 3 Opening Yes/No 2 Opening Yes/No 2 Words 10 Words 10 Words 10 Read narrative 5 Interview 15 Map Task (First run) 20 Re-told narrative Read digits 5 Read digits 5 Map task (Second run) Read sentences 8 Conversation 5 Closing Yes/No 2 Closing Yes/No 2 Closing Yes/No were the same, while the remaining eight were different for each subject. Each subject performed a sequence of head rotations as shown in Fig AusTalk The AusTalk [1] [82] [83] dataset consists of a large collection of audio-visual data acquired from 15 different locations in all states and territories of Australia (see Table 2.1). The project started in 2011 and was funded by an Australian Research Fig. 2.4 AusTalk data capturing environment (taken from [1]).

Table 2.3 A summary of the MOBIO database
Site | Phase I: # subjects (female/male), # sessions, # shots | Phase II: # subjects (female/male), # sessions, # shots
BUT 33 (15/18) (15/17) 6 11
IDIAP 28 (7/21) (5/21) 6 11
LIA 27 (9/18) (8/18) 6 11
UMAN 27 (12/15) (11/14) 6 11
UNIS 26 (6/20) (5/19) 6 11
UOULU 20 (8/12) (7/10) 6 11

Council grant. Twelve identical stand-alone recording set-ups (see Fig. 2.4) were built and shipped to 17 different collection points at 15 locations. Each subject in the database was recorded in three sessions at intervals of at least one week. The database contains spoken words, digits, sentences and paragraphs (see Table 2.2) from 1000 subjects representing the regional and social diversity and linguistic variation of Australian English. The participants' speech was recorded using five microphones and two stereo cameras. Each of the three sessions includes different subsets of the read and spontaneous speech tasks. The participants were prompted to read aloud a list of words, digits and sentences (bold faced in Table 2.2) presented on a computer screen in front of them. These recordings are ideally suited for speaker recognition (e.g., [32], [46] and [47]). The Read narrative and Re-told narrative tasks (underlined in Table 2.2) in Session 1 provide materials for the study of differences between read and spontaneous language. The Interview, Map task and Conversation tasks (italicised in Table 2.2) provide materials for speech analysis. Finally, a set of Yes/No questions recorded at the beginning and end of each session provides a range of positive and negative answers.

MOBIO

MOBIO [84] is a challenging bimodal database containing 61 hours of audio-visual data recorded in 12 sessions. It includes 192 audio-visual signals from each of the 150 participants, acquired at 6 different sites over a period of one and a half years. It is a challenging database because the data were captured with real noise. For example, the video frames contain uncontrolled illumination, expression, near-frontal pose and occlusion, while the speech signals are relatively short. Hence, the MOBIO database is suitable for evaluating systems that would operate in uncontrolled environments (Chapter 6 and Chapter 7 of this thesis). The MOBIO data collection was carried out using two mobile devices: a Nokia N93i mobile phone and a 2008 MacBook laptop. Only the first session data was captured using the laptop, while in the other sessions mobile phones were used.

48 24 A Review of Audio-Visual Biometric Systems Table 2.4 The unbiased MOBIO evaluation protocol used at ICB Set Phase-I Phase-II samples subjects Sess. 1 Sess. 2-6 Sess /subject TR DEV EVAL Training 5p, 10f, 5r, 1l 5p, 10f, 5r, 1l 5p, 5f, 1l Development 5p Evaluation - 10f, 5r 5f Data acquisition was carried out in two phases: Phase I and Phase II. A Dialog Manager (DM) was installed on the mobile phones at each site and for each user. During data recording the DM prompted the participants with predefined and random short response questions, free speech questions and to read a predefined text. For the short response questions, the DM also supplied predefined and fictitious answers to the participants. In addition, the responses to the free speech questions were fictitious and hence did not necessarily relate to the question. The following predefined short response questions were considered: a) What is your name? b) What is your address? c) What is your birth date? d) What is your license number? and e) What is your credit card number? A summary of the MOBIO database is shown in Table Audio-visual biometric systems evaluated in a controlled environment Table 2.5 lists the existing audio-visual biometric systems evaluated in a controlled environment. Their feature extraction, fusion and classification technique have been included in the table. This section provides a brief overview of the systems listed in Table 2.5. While MFCCs are commonly used as acoustic features, there are three types of visual features used by the existing audio-visual biometric systems: a) appearancebased, b) shape-based and c) a combination of appearance- and shape-based features. An appearance-based feature is used by most systems listed in Table 2.5. A transformation (PCA, LDA, DCT, or Wavelet transforms) of the pixel intensity vectors representing a Region Of Interest (ROI) is used to generate the appearance-based features. A ROI can represent the following: eyes, nose, or mouth [4] [53] [54], or the whole face [28] [55 58]. Appearance-based features are computationally inexpensive and can be extracted dynamically in real time. Since a transformation of the pixel intensities are used as appearance-based features, the quality of these features degrades under intense head-pose and lighting variations. A geometric or model-based representation of the face or lip contours has been used for shapebased features [85] [88 90]. However, their extraction requires a robust algorithm which may be computationally intensive. Few audio-visual biometric systems used a

49 2.3 Audio-visual biometric systems evaluated in a controlled environment 25 Table 2.5 Audio-visual biometric systems evaluated in controlled environments Author and year Chibelushi et al. [85], 1993 Brunelli and Falavigna [53], 1995 Dieckmann et al. [54], 1997 Jourlin et al. [87], 1997 Wark et al. [88 90], Ben-Yacoub et al. [55, 91], 1999 Aleksic and Katsaggelos [27], 2003 Chaudhari et al. [28], 2003 Sanderson and Paliwal [56], 2004 Erzin et al. [57], 2005 Fox et al. [4], 2007 Sugiarta et al. [92], 2010 Wong et al. [93], 2011 Sahoo and Prasanna [58], 2011 Features Fusion Classifier Database (subjects #) Reported performance Visual Acoustic Shape-based MFCCs Score-level ANN 10 speakers EER=1.5% (PCA, LDA, (weighted sum) concatenation) Appearancebased Appearancebased (gray level projection and optical flow analysis) Appearancebased and shape-based Shape-based (PCA, LDA) Appearancebased Facial Animation Parameters (FAPs) Appearancebased (DCT) Appearancebased (PCA) Appearancebased (eigenface, DCT) Appearancebased (DCT) DT-CWT, DT-CWPT, PCA Wavelet subbands Appearancebased (PCA, LDA) MFCCs,, 2D Fourier transform LPCs,, MFCCs LPCs MFCCs,, MFCCs MFCCs, DCT MFCCs Wavelet packet tree coefficients MFCCs MFCCs,, Score-level (weighted product) Score-level (2-from-3 approach) Score-level (weighted summation) Score-level (weighted summation) Post classifier using binary classifiers (SVM, Bayesian, FLD, decision tree, MLP) Feature-level (concatenation) Feature-level concatenation, Score-level Score-level (weighted summation), Feature-level (concatenation), SVM, Bayesian classifier Score-level (Adaptive cascade with the ordering based on reliability of classifier) Score-level (weighted sum) Feature-level (concatenation) Feature-level (concatenation) Score-level (summation) VQ 89 speakers IR=98% Synergetics [86] 66 staff RR=93% HMM M2VTS(37) IR=100%, FAR=0.5% GMM M2VTS(37) IR=80% (equal weight fusion, 12.2dB SNR) HMM, XM2VTS(295) EER<1% spherity measure HMM CMU(10) EER=1.71% GMM IBM(304) IR=69.1% GMM VidTIMIT(43) TE vs SNR HMM HMM MVGL- AVD(50) EER=1.4% (clean), and 6.3% (5dB SNR) XM2VTS(295) IR=89.9% VQ VidTIMIT(43) IR=93.7% VQ GMM UNMC-VIER (123), CUAVE (36), XM2VTS(180) IITG-DIT M4(94) RR=98.4% (UNMC- VIER), 97.2% (CUAVE), 83.3% (XM2VTS) EER=5.85% (Cohort fed FV) * : first-order derivative; : second-order derivative; IR: Identification Rate; RR:Recognition Rate; EER:Equal Error Rate; FAR: False Acceptance Rate; TE: Total Error; SNR: Signal to Noise Ratio;

50 26 A Review of Audio-Visual Biometric Systems combination of appearance- and shape-based features [87] [94]. This was typically carried out by concatenating both appearance- and shape-based features. Shivappa et al. [95] presented a review paper that emphasized the fusion approaches used in audio-visual biometrics. They reported that a weighted summation of the matching scores is a popular fusion strategy [53], [57], [85], [87 90]. However, the selection of the weights is an open issue and researchers have addressed this by applying different methods. Chaudhari et al. [28] proposed a combination of feature-level and score-level fusion, but their proposed system achieved an identification rate of 69.1%. Sugiarta et al. [92] proposed a feature-level fusion of the Dual Tree Complex Wavelet Transforms (DT-CWT) which achieved a higher identification rate of 93.7%. Two review papers, presented by Shivappa et al. [95] and Aleksic and Katsaggelos [19], summarized the audio-visual biometric recognition systems proposed up to the year Chibelushi et al. [85] used a database of ten speakers. MFCCs were used as acoustic features, and PCA+LDA for extracting visual features. They proposed a weighted summation of the matching scores which achieved an EER of 1.5%. In another work, Brunelli and Falavigna [53] used a database of 89 speakers recorded in three sessions. Vector Quantization (VQ) was used for speaker identification and three visual classifiers (with features extracted from the nose, mouth, and eyes) were combined using the weighted product rule of fusion. Their proposed system achieved an overall identification accuracy of 98%. Dieackmann et al. [54] collected the audio-visual data of 66 staff at the Fraunhofer-Institute for Integrated Circuits. Unlike the other systems, a Synergetics [86] was used for classification. A score-level fusion (2-from-3 approach of opinion) was used to determine the identity of an unknown person. Their proposed system achieved an identification rate of 93% with the three modalities combined. Jourlin et al. [87] presented a speaker verification approach using the pixel intensity features extracted from speaking faces and a lip tracker to extract shape information from the lips. The proposed system achieved a False Acceptance Rate (FAR) of 0.5%. Wark et al. [88 90] presented few approaches for audio-visual biometrics using shape-based visual features and MFCCs as acoustic features. They carried out their training in clean conditions and the testing in degraded acoustic conditions. Experiments were carried on the M2VTS database using an equally weighted score-level fusion. Moreover, Ben-Yacoub et al. [91] utilized the features extracted from frontal faces and speech recordings in the XM2VTS database for text-dependent and text-independent audio-visual verification. They evaluated their system using the following binary classifiers: SVM, Bayesian classifier, Fisher s linear discriminant, decision tree, and MLP. Aleksic and Katsaggelos [27] presented an audio-visual speaker recognition system using the Hidden Markov Model (HMM) and Facial

51 2.3 Audio-visual biometric systems evaluated in a controlled environment 27 Animation Parameters (FAPs) that are supported by the MPEG-4 standard for representing visual features. They tested their system on the Cernegie Melon University (CMU) audio-visual database. Their audio-visual person identification system achieved an Identification Error (IE) of 5.13% under clean conditions. Moreover, their audio-visual person verification system achieved an Equal Error Rate (EER) of 1.71% under clean conditions. Chaudhari et al. [28] modelled the reliability of the audio and video information streams using time-varying and context-dependent parameters. Their system extracted 23 MFCCs and 24 DCT coefficients to use as acoustic and visual features, respectively. They tested the system on the IBM database, achieving an EER of 1.22% (feature fusion). Sanderson and Paliwal [56] used appearance-based visual features, and MFCCs as acoustic features. The proposed system was tested on the VidTIMIT database. They also analysed several fusion methods: weighted summation, Bayesian classifier, SVM, concatenation, adaptive weighted summation, modified Bayesian post-classifier, and piecewise linear post-classifier. They found that most of the non-adaptive fusion systems were similar and that the performance degraded in noisy conditions. Erzin et al. [57] presented a multimodal open-set speaker identification system that integrated the information from three modalities: audio, face, and lip motion. They proposed an adaptive cascade rule for the fusion of multiple modalities. They tested their system on the MVGL-AVD database. Their proposed system achieved an Equal Error rate (EER) of 1.4% and 6.3% under clean conditions and 5dB Signal-to-Noise Ratio (SNR) in the speech signal, respectively. Fox et al. [4], in 2007, proposed a system that was based on three sub-systems: audio, visual speech, and face. They performed fusion in an automatic unsupervised manner which adapted to the local performance and the reliability of each classifier. Identification experiments were carried out on a 248-subject subset of the XM2VTS database and an identification accuracy of 89.9% was achieved. In another approach, Sugiarta et al. [92] proposed a feature-level fusion of DT-CWT features. Their system achieved an identification accuracy of 93.7% on the Vid- TIMIT database. Wong et al. [93] presented an audio-visual biometric system in the compression domain. A multi-band feature fusion method was used to select the wavelet sub-bands invariant to pose and illumination. Their proposed system achieved 98.4%, 97.2%, and 83.3% identification accuracies on the UNMC-VIER, CUAVE, and XM2VTS databases, respectively. Sahoo and Prasanna [58] presented a system that was tested under degraded conditions. They used MFCCs as the acoustic feature and GMM for speaker modelling. A combination of the PCA and LDA was used for face verification.

52 28 A Review of Audio-Visual Biometric Systems Speaker recognition methods on MOBIO GMM-based Parts-based GMM-UBM Meta-modelling GMM templates Variability modelling ISV JFA TVM Fig. 2.5 A taxonomy of existing speaker recognition methods using MOBIO 2.4 Mobile person recognition The recent popularity of smart phones, tablet and laptop computers has laid the platform for various mobile friendly applications. Many of these applications are required to handle the personal information of their users. Hence, it is important that access is only given to the registered users. Recently, person recognition using mobile phone data has been extensively studied for the development of more robust biometric systems. Two evaluation competitions were also held on face and speaker recognition using MOBIO at ICPR 2010 [96] and at ICB 2013 [7] [97]. At ICB 2013, an unbiased protocol (see Table 2.4) was used to assure a fair evaluation and comparable results. The protocol divides the MOBIO database into three disjoint sets: training, development and evaluation. The training set is used for the learning of the background modelling and/or for score normalization, while the development set can be used for tuning the meta-parameters (e.g., number of Gaussians). Client models are built and biometric systems are evaluated using the evaluation set. The following performance measures are used: EER, Half-Total- Error Rate (HTER) and Detection Error Trade-off (DET). These measures rely on the False Acceptance Rate (FAR) and the False Rejection Rate (FRR), which are calculated for the development and evaluation sets independently.
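The following is a minimal sketch of how the FAR, FRR, development-set EER threshold and HTER might be computed from sets of client and impostor scores, assuming NumPy; the function names and the simple threshold search are illustrative assumptions and not part of the official MOBIO evaluation toolkit.

```python
# FAR/FRR, EER-threshold selection on the development set, and HTER on the
# evaluation set, as used in MOBIO-style protocols.
import numpy as np

def far_frr(impostor_scores, client_scores, threshold):
    far = np.mean(np.asarray(impostor_scores) >= threshold)  # impostors accepted
    frr = np.mean(np.asarray(client_scores) < threshold)     # clients rejected
    return far, frr

def eer_threshold(dev_impostor, dev_client):
    """Pick the threshold at which FAR and FRR are closest on the development set."""
    candidates = np.unique(np.concatenate([dev_impostor, dev_client]))
    gaps = [abs(np.subtract(*far_frr(dev_impostor, dev_client, t))) for t in candidates]
    return candidates[int(np.argmin(gaps))]

def hter(eval_impostor, eval_client, threshold):
    far, frr = far_frr(eval_impostor, eval_client, threshold)
    return 0.5 * (far + frr)
```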

53 2.4 Mobile person recognition 29 Table 2.6 Speaker recognition systems evaluated using MOBIO Reference Modelling Scoring Female Male EER HTER EER HTER Khoury et al. [97] Alpineon TVM PLDA ATVS TVM PLDA CDTA TVM cosine CPqD (sub-i) GMM-UBM LLR CPqD (sub-ii) TVM cosine CPqD (Fusion) Fusion logreg EHU TVM PLDA GIAPSI GMM-UBM LLR IDIAP ISV linear [98] L2F (sub-g) GMM-UBM LLR L2F (sub-s) GMS SVM L2F (sub-i) TVM cosine L2F (Fusion) Fusion logreg L2F-EHU logreg Mines-Telecom GMM-UBM LLR Phonexia TVM PLDA RUN TVM PLDA Khemiri et al. [99] GMM-UBM 3N Khoury et al. [100] GMM-UBM LLR ISV linear TVM linear Fusion logreg Boulkenafet et al. [101] TVM (LDA) cosine TVM (CEA [102]) cosine Roy et al. [103] Slice feature BSC Speaker recognition systems A number of speaker recognition systems using different modelling approaches have been evaluated on MOBIO (see Fig. 2.5). Based on client modelling the speaker recognition systems using the MOBIO corpus can be categorised as follows. GMM-based methods GMM-based modelling is the most commonly used method for client modelling in speaker recognition research. Recent variability modelling methods (e.g., ISV, JFA and i-vector) are also built upon the GMM-based modelling to compensate for the within-class and between-class variations. GMM-UBM: A client model, λ i, is built by adapting the UBM using all the utterances from client i and the maximum a posteriori (MAP) adaptation method [60]. Scoring is carried out by estimating the log likelihood ratio (LLR) of a test sample with regards to the client models. In Table 2.6, a list of speaker recognition systems evaluated on the MOBIO database is presented. In [97], the GMM-UBM was used by the systems identified as: CPqD (sub-i), GIAPSI, L2F (sub-g) and Mines-Telecom. These systems built gender-dependent UBMs with 512 or 1024 Gaussian components. The evaluation results showed that the GIAPSI system achieved the best HTER of 1 Using the mobile-0 protocol [9] 2 Using the MOBIO Phase I database [96]

54 30 A Review of Audio-Visual Biometric Systems 12.81% for the male clients among the systems which used a single decision maker. While none of the CPqD (sub-i), GIAPSI, L2F (sub-g) and Mines- Telecom performed score normalization, a Nearest Neighbour Normalisation (3N) technique was recently proposed [99] which improved the HTERs of the GMM-UBM based systems (see Table 2.6). In this approach, the test utterance is compared with the claimed identity model as well as the other models stored in the database. Then the difference between the LLR of the claimed identity and the maximal LLR of the other models is calculated. The claim is accepted only if the difference is above a predefined threshold. Meta-modelling: In this approach, the GMM-UBM method is utilised for two purposes: a) to generate a utterance-specific GMM (GMS) and b) for variability modelling. GMM templates: A GMM super-vector, µ i,j, corresponding to an utterance is formed by concatenating the means of a GMM λ i,j, representing only that utterance. Such super-vectors are used to build client templates with an SVM, referred to as a GMM-SVM in [104]. A sub-system of the L2F system, identified as L2F (sub-s) in [97], submitted to the 2013 speaker recognition evaluation used a GMM-SVM. Variability modelling: These methods aim to estimate and eliminate the effects of within-class and between-class variations. For example, the ISV [62] and JFA [63] methods attempt to eliminate the within-class variabilities commonly caused by the sensor (i.e., microphone) and the environment (i.e., background) noises. It is assumed that session variability results in an additive offset to the GMM super-vector. The JFA method can fail to separate within-class and between-class variations in two different subspaces, potentially due to the high dimensionality of the GMM super-vectors. The TVM method learns a lower dimensional total variability subspace (T) from which the i-vectors are extracted. In [97], the systems identified as Alpenion, ATVS, CDTA, CPqD (sub-ii), EHU, L2F (sub-i), Phonexia and RUN used i-vectors. Although the Alpenion system achieved the best HTER among all the submitted systems, it used a combination of 9 different TVM based sub-systems. Among the single systems, the EHU achieved the lowest HTER for both male and female clients. However, an ISV based system (IDIAP) performed better than EHU. Similar results were reported in [100], where a comparison between the baseline GMM-UBM, ISV and the i-vector based methods was presented (see Table 2.6). Moreover, a Conformal Embedding Analysis (CEA) [101] was presented as an alternative to the LDA subspace

55 2.4 Mobile person recognition 31 Face recognition methods on MOBIO Template-based GMM-based Raw pixels Hand-crafted features GMM-UBM Meta-modelling Subspace NeuNet Histograms Classifier Disparity GMM templates Variability modelling ISV JFA TVM Fig. 2.6 A taxonomy of the existing face recognition methods using MOBIO learning used in the TVM, but the HTERs achieved were not better than the TVM based systems presented in [97]. Parts-based method A fast parts-based method was presented in [103] using a Boosted Slice Classifier (BSC) and a novel set of features, called slice features, extracted from the speech spectra. The proposed method was inspired from the following object detection algorithms in the computer vision domain: a) rapid object detection using a boosted cascade of Haar features [52] [105], b) fast key-point recognition using random Fern features [106] and c) face detection and verification using LBP [105]. The 1-D spectral vectors derived from speech are considered equivalent to 2-D images. The classifier measures the spectral magnitudes at pairs of frequency points. The most discriminative classifiers is selected using the Discrete Adaboost algorithm. The BSC system was compared with the 17 systems presented by five research groups at the ICPR 2010 [96] for the first speaker recognition evaluation on the MOBIO Phase I database. Although the BSC system did not achieve the best HTERs for the male and female clients (18.9% and 15.5% respectively), it was computationally less complex than the GMM-UBM methods (see Table 2.6) Face recognition systems Different hand-crafted features such as the Patterns of Oriented Edge Magnitude (POEM), Gabor features, Local Binary Patterns (LBP), Local Phase Quantization (LPQ) and texture information have been utilised by the face recognition systems

56 32 A Review of Audio-Visual Biometric Systems Table 2.7 Template-based methods of face recognition on MOBIO taken from [7] HTER Participant Feature Image Block Method Female Male baseline raw subspace 20.94% 17.11% (PCA+LDA) UC-HU raw NeuNet % 6.21% LDA Subspace CDTA LBP Histogram 28.48% 11.92% TUT LBPHS Classifier 13.91% 11.54% (PLS) Idiap Gabor Disparity 12.50% 10.29% UTS Gabor wavelets unknown 8 8 Histogram LPQ unknown % 11.95% GRADIANT POEM Disparity Gabor % 9.52% CPqD LBP Classifier dlbp blocks (SVM) MSLBP locks 11.20% 7.66% MSLBP blocks Histogram UNIJ-ALP 3 cropping subspace - - (PCA) 10.45% 7.45% 3 features using MOBIO. Moreover, the session variability modelling techniques have been recently used in face recognition with some degree of success. In Fig. 2.6, a taxonomy of existing face recognition methods on MOBIO is presented. Template-based In this approach, raw or hand-crafted features are extracted from the enrolment images. Then, client models are built using one of the following: a) an average LBP histogram b) a Partial Least Square (PLS) classifier, c) an SVM classifier, or d) a subspace learning (e.g., PCA and/or LDA analysis). Raw pixel values: In [7], the baseline system was developed by finding a PCA + LDA [107] projection matrix on the raw pixel values taken from histogram equalized images of order pixels (see Table 2.7). The dimensionality of the PCA and LDA subspaces were limited to 200 and 199 respectively. The cosine similarity between the projected features of a model and probe image was used as the score. The baseline system achieved HTERs of 20.94% and 17.11% for the female and male clients respectively. In addition, the system identified as UC-HU in [7] learned features from grayscale images of order pixels using a Convolutional Neural Network similar to the one in [108]. Then, the Fisher LDA approach was used in order to adapt the learned features to the discriminant face aspects of the individuals in the training set. Person specific linear models were learnt by taking into consideration the samples of the person being enrolled as the positive class and all the other

57 2.4 Mobile person recognition 33 samples in the training set for the negative class. The dot product between the model and the probe samples was used as a score. The UC-HU system achieved the lowest HTER, 10.83% and 6.21% for the female and male clients respectively, among all the simple systems which participated in the face recognition evaluation at ICB Hand-crafted features: Table 2.7 shows that most of the systems which participated in the 2013 MOBIO face recognition evaluation [7] used one or more of the following hand-crafted features: a) LBP, b) Gabor wavelet responses (i.e., Gabor Phases), c) POEM and d) colour information. Among all the template-based systems, the best performance was achieved by the UNIJ-ALP system which combined nine sub-systems by extracting three different features (intensity, Gabor and LBP) from each of the three styles of cropped faces (i.e., tight, normal and broadly cropped). This was also referred to as the representation plurality method in [109]. However, a much simpler (single system) Gabor feature based approach was used by the Idiap system and achieved HTERs of 12.50% and 10.29% for the female and male clients. It is also observed in Table 2.7 that a lower HETR can be achieved using larger images and blocks. Moreover, the fusion of multiple SVM classifiers (e.g., CPqD) proved to be a better approach for the LBP based features. GMM-based The success of the GMM-based method for speaker recognition inspired a number of face recognition systems to apply it to the MOBIO dataset. These systems can be divided into two groups: a) GMM-UBM modelling and b) Meta-modelling. The meta-modelling methods can be further divided into two groups: i) GMM template modelling and ii) Variability modelling methods. In [110], a comparison between the GMM-UBM and GMM template based methods was performed using the MOBIO face verification evaluation protocol [96]. It was shown that the GMM-template based methods performed significantly better than the GMM-UBM approach for both female and male clients. On the other hand, different variability modelling based methods have also been applied to face recognition. For example, the ISV and JFA methods were compared for face verification in [66]. It was reported that the HTERs achieved with the ISV modelling were 11.4% and 8.3% for female and male clients, while the corresponding HTERs of the JFA were 13.0% and 7.3%, respectively. On an average, the ISV approach performed better than the JFA, which is consistent with the speaker recognition evaluation. Recently, a score calibration technique [111] and a local ISV [112] method were proposed in an attempt to improve the performance of ISV. In addition, an i-vector based face recognition approach using the TVM was presented in [113]. A comparison of several session

58 34 A Review of Audio-Visual Biometric Systems Table 2.8 Audio-Visual person recognition systems on MOBIO. Methods HTER Reference A: Audio modality Audio Visual Fusion (logreg) V: Audio modality Female Male Female Male Female Male A: GMM-UBM 3 Shen et al. [114] V: LBP histograms 33.1% 33.7% 26.7% 27.9% 19.3% 22.7% McCool et al. [84] A: TVM + PLDA V: LBP histograms 17.7% 18.2% 28.2% 24.1% 13.3% 11.9% Motlicek et al. [77] A: ISV V: ISV 15.3% 8.9% 12.2% 7.5% 9.7% 2.6% Khoury et al. [9] A: S-TVM + PLDA V: F-TVM + cosine 17.36% 11.11% 16.16% 8.9% 9.93% 3.77% A: S-All V: F-All 14.64% 7.89% 11.62% 6.06% 6.30% 1.89% Khoury et al. [8] A: S-1+S S-11 V: F-1+F F % 4.63% 8.47% 6.27% 3.80% 1.78% A: DBM-DNN S-GMS 4 DBM-DNN [115, 116] V: DBM-DNN F-LBP % 9.68% 11.52% 9.75% 5.08% 3.55% A: DBM-DNN S-TVM V: DBM-DNN F-TVM % 13.77% 14.01% 14.37% 8.69% 6.67% A: DBM-DNN S-All V: DBM-DNN F-All 5.08% 3.55% 8.69% 6.76% 3.38% 2.29% compensation and scoring techniques were evaluated. It was reported that a combination of the Within-Class Covariance Normalization (WCCN) along with the Cosine Kernel Normalisation (C-norm) performed better (HTERs of 15.2% and 8.7% for female and male clients respectively) than any other session compensation methods considered using the MOBIO database Audio-visual person recognition on MOBIO Besides the unimodal speaker/face recognition systems, several audio-visual systems have been proposed. A brief overview of existing audio-visual systems is presented in Table 2.8. One common aspect between all these systems is their use of the logreg fusion. In [114], the GMM-UBM and LBP histogram methods were respectively used for the audio and visual modalities. Their system used the evaluation protocol from ICPR 2010 [117], while all the other systems which are listed in Table 2.8 were based on the protocol of ICB 2013 [7] [97]. A combination of i-vector and LBP histogram based recognition systems was studied in [84]. However, much improved results were reported in [77] based on the use of ISV for both modalities. Furthermore, a comparison study (in that same paper) confirmed that ISV modelling achieved lower HTERs than the JFA and the traditional GMM based approaches. Moreover, the TVM method was used for both modalities in [9]. Session compensation, modelling and scoring were carried out using the PLDA [64]. In addition, the use of the cosine similarity measure [43] between the enrolment and test i-vectors was also used in [9]. In the same paper, three different sub-systems 3 Protocol used in the ICPR 2010 face and speech competition [96] 4 A DBM-DNN is a special kind of Deep Neural Network (DNN) which is initialized as a generative Deep Boltzmann Machine (DBM) and trained using the standard back-propagation method (see Section 2.5) for details

59 2.5 Deep neural networks for person recognition 35 Deep Boltzmann Machine Deep Belief Network h 2 h 2 W 1 W 1 h 1 h 1 W 2 W 2 v v Fig. 2.7 left: DBM; right: DBN corresponding to the face and speaker were combined using the logreg fusion to obtain F-All (= F-GMM+F-ISV+F-TVM) and S-All (= S-GMM+S-ISV+S-TVM), respectively. Finally, these combined unimodal systems were fused to obtain HTERs of 6.30% and 1.89% respectively for the female and male subjects of B-All (= S-All+F-All). The best visual modality HTER of 6.06% for the male clients was obtained by the F-All in [9]. The best HTER for the female clients (submitted to the ICB 2013 evaluation) were obtained by fusing eight speaker recognition systems [8]. However, the best audio modality HTERs 3.55% and 5.08% respectively for the male and female clients were achieved with the unimodal fusion of the DBM- DNN based sub-systems of that modality (i.e., DBM-DNN S-All = DBM-DNN S-GMS + DBM-DNN S-TVM and DBM-DNN F-All = DBM-DNN F-LBP + DBM-DNN F-TVM ). Moreover, the bimodal fusion of all the DBM-DNN based sub-systems (i.e., DBM-DNN S-All + DBM-DNN F-All ) resulted in the best HTER of 3.38% for the female clients, while the best fused HTER of 1.78% for the male clients was achieved by fusing the 19 sub-systems in [9]. In summary, the results in Table 2.8 provide evidence that the DBM-DNN based systems are able to play an important role in audio-visual person recognition using mobile phone data. A detailed description of DBM-DNN is presented in the next section. 2.5 Deep neural networks for person recognition A Deep Neural Network (DNN) is a feed-forward Artificial Neural Network (ANN) with multiple layers of hidden units between its input and output layers. Such networks can be discriminatively trained by back-propagating the derivative of the mismatch between the target outputs and the actual outputs [33]. In the training

phase, the initial weights of a DNN can be set to small random values. However, a better way of initialization is to generatively pre-train the DNN as a Deep Belief Network (DBN) or as a Deep Boltzmann Machine (DBM) and then fine-tune it using the enrolment samples [118].

2.5.1 A DBN-DNN for unimodal person recognition

A DNN which is pre-trained generatively as a DBN is referred to as a DBN-DNN [33]. In a DBN, the top two layers are undirected but the lower layers have top-down directed connections (see Fig. 2.7). Undirected models such as restricted Boltzmann machines (RBMs) are ideal for layer-wise pre-training [33]. An RBM is a type of Markov random field (MRF) with a bipartite connectivity graph, no sharing of weights between different units and a subset of unobserved variables. Multiple RBMs can be stacked to form a DBN.

Recently, a DBN-DNN approach for speaker recognition was presented in [2]. First, a global DBN model (Fig. 2.8a) is built by using unlabelled samples from the training set. Then, an impostor selection algorithm and a clustering method are used for each client to achieve a balance between the genuine and impostor samples. When the numbers of positive and negative samples are balanced, the DBN parameters are adapted to each client. The adaptation process starts by pre-training each network initialized by the DBN parameters, using a small number of iterations to avoid over-fitting. Once the adaptation process is completed, a network is fine-tuned by adding a label layer (with two units) on top and then using stochastic back-propagation (Fig. 2.8b). The connection weights between the top label layer and the adjacent layer below are initialized randomly and then pre-trained by back-propagating the error one layer for a few iterations. Finally, full back-propagation is carried out on the whole network.

Fig. 2.8 a) Generative DBN and b) discriminative DBN-DNN [2]

If the genuine and impostor feature vectors are labelled as (l_1 = 1, l_2 = 0) and (l_1 = 0, l_2 = 1) during the training process, the final output score in the testing phase, in log likelihood ratio format, is given by:

LLR = \log(o_1) - \log(o_2)    (2.19)

where o_1 and o_2 represent the outputs of the two top-layer units.
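A minimal sketch of the discriminative fine-tuning and LLR scoring stage of such a DBN-DNN is given below, assuming PyTorch; the network construction, optimizer settings and helper names are illustrative assumptions, and the generative pre-training that supplies the initial weights is not shown.

```python
# Discriminative fine-tuning of a DBN-initialized DNN and LLR scoring (Eq. 2.19).
import torch
import torch.nn as nn
import torch.nn.functional as F

def build_dbn_dnn(pretrained_weights, n_labels=2):
    """pretrained_weights: list of (W, b) torch tensors, one pair per hidden layer."""
    layers = []
    for W, b in pretrained_weights:                 # W: (out, in), b: (out,)
        lin = nn.Linear(W.shape[1], W.shape[0])
        with torch.no_grad():
            lin.weight.copy_(W)
            lin.bias.copy_(b)
        layers += [lin, nn.Sigmoid()]
    layers.append(nn.Linear(pretrained_weights[-1][0].shape[0], n_labels))
    return nn.Sequential(*layers)

def finetune(model, X, labels, epochs=20, lr=1e-3):
    """labels: 0 for genuine samples, 1 for impostor samples."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = F.cross_entropy(model(X), labels)
        loss.backward()
        opt.step()
    return model

def llr_score(model, x):
    log_p = F.log_softmax(model(x), dim=-1)
    return log_p[..., 0] - log_p[..., 1]            # Eq. 2.19: log(o1) - log(o2)
```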

Although the DBN model was designed for a verification scenario, it can be modified and used for identification. For example, during the fine-tuning process, N units (corresponding to all the targets), instead of two, may be added on top of a network initialized by the DBN parameters. Then, back-propagation can be carried out on the whole network by using only the genuine samples of all the target persons. The unit with the maximum value is declared the winner. The DBN-DNN approach presented in [2] was evaluated for speaker recognition on the NIST SRE 2006 corpora. A similar approach can also be adapted and used for audio-visual person recognition.

2.5.2 A DBM-DNN for person recognition

A DBM is a variant of the Boltzmann machine which not only retains the multi-layer architecture but also incorporates top-down feedback (see Fig. 2.7). Hence, a DBM has the potential of learning complex internal representations and dealing more robustly with ambiguous inputs (e.g., image or speech) [119].

DBM training

Consider a two-layer DBM with no within-layer connections, Gaussian visible units v \in R^D and binary hidden units h^1 \in \{0, 1\}^{P_1}, h^2 \in \{0, 1\}^{P_2}. A state of the DBM can be represented by a vector x = \{v, h^1, h^2\}, where v = [v_i]_{i=1,...,D} represents the units in the visible layer, and h^1 = [h_j^1]_{j=1,...,P_1} and h^2 = [h_k^2]_{k=1,...,P_2} respectively represent the units in the first and the second hidden layer. Then, the energy of a state of the DBM is given by:

E(x | \theta) = \sum_{i=1}^{D} \frac{(v_i - b_i)^2}{2\sigma_i^2} - \sum_{i=1}^{D} \sum_{j=1}^{P_1} \frac{v_i}{\sigma_i^2} h_j^1 w_{ij}^1 - \sum_{l=1}^{2} \sum_{j=1}^{P_l} c_j^l h_j^l - \sum_{j=1}^{P_1} \sum_{k=1}^{P_2} h_j^1 h_k^2 w_{jk}^2    (2.20)

where the terms \sigma_i represent the standard deviations of the units in the visible layer, whereas b_i and c_j^l respectively represent the biases of the units in the visible and the l-th hidden layer. In addition, the symmetric interaction terms between the visible-to-hidden and hidden-to-hidden units are contained in W^1 = \{w_{ij}^1\} and W^2 = \{w_{jk}^2\}, respectively.
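As a concrete illustration, the short sketch below evaluates the energy of Eq. 2.20 for a single DBM state, assuming NumPy; the parameter names and shapes are illustrative assumptions.

```python
# Energy of one DBM state (v, h1, h2) under Eq. 2.20.
# Shapes: v, b, sigma: (D,); h1, c1: (P1,); h2, c2: (P2,); W1: (D, P1); W2: (P1, P2).
import numpy as np

def dbm_energy(v, h1, h2, b, sigma, c1, c2, W1, W2):
    visible = np.sum((v - b) ** 2 / (2.0 * sigma ** 2))   # Gaussian visible term
    vis_hid = (v / sigma ** 2) @ W1 @ h1                   # visible-to-hidden interactions
    bias_hid = c1 @ h1 + c2 @ h2                           # hidden-layer biases
    hid_hid = h1 @ W2 @ h2                                 # hidden-to-hidden interactions
    return visible - vis_hid - bias_hid - hid_hid
```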

DBMs can be trained with the stochastic maximization of the log-likelihood function. The partial derivative of the log-likelihood function is:

\frac{\partial L(\theta | v)}{\partial \theta} = \left\langle \frac{\partial E(v^{(t)}, h | \theta)}{\partial \theta} \right\rangle_{data} - \left\langle \frac{\partial E(v, h | \theta)}{\partial \theta} \right\rangle_{model}    (2.21)

where h = [h^1, h^2], and \langle \cdot \rangle_{data} and \langle \cdot \rangle_{model} denote the expectations over the data distribution P(h | \{v^{(t)}\}, \theta) and the model distribution P(v, h | \theta), respectively. Here, \{v^{(t)}\} is the set containing all the training samples. Although the update rules are well defined, it is intractable to compute them exactly. The variational approximation is commonly used to compute the expectation over the data distribution (the first term of Eq. 2.21). The variational parameters for the l-th hidden layer, \mu_j^l, are estimated by:

\mu_j^l \leftarrow f\left( \sum_{i=1}^{P_{l-1}} \mu_i^{l-1} w_{ij}^{l} + \sum_{k=1}^{P_{l+1}} \mu_k^{l+1} w_{jk}^{l+1} + c_j^l \right)    (2.22)

where f(.) is a sigmoid function, \mu_i^0 = v_i and P_{l+1} = 0 for the top layer. The variational approximation method provides the values of the variational parameters which maximise the following lower bound with respect to the current parameters:

\log p(v | \theta) \geq E_{Q(h)}[-E(v, h)] + H(Q) - \log Z(\theta)    (2.23)

where

H(Q) = - \sum_{l=1}^{2} \sum_{j=1}^{P_l} \left( \mu_j^l \log \mu_j^l + (1 - \mu_j^l) \log(1 - \mu_j^l) \right)    (2.24)

is an entropy functional. Hence, the gradient update step increases the variational lower bound of the log-likelihood. Subsequently, different persistent sampling methods [119] [120] can be used to compute the expectation over the model distribution (the second term of Eq. 2.21). The simplest approach is Gibbs sampling, which closely resembles the variational expectation-maximization (EM) algorithm [121]. Learning is carried out by alternating between: a) finding the variational parameters and b) updating the DBM parameters using the stochastic gradient method. The objective of updating the DBM parameters is to maximize the variational lower bound.

Person recognition using DBM-DNN

Similar to the DBN-DNN, a DBM can be converted into a discriminative network, which is referred to as a DBM-DNN [42]. In [115] and [116] we proposed the score-level fusion of DBM-DNNs using the sum rule for audio-visual biometrics. First, two DBMs (one for each modality) are trained. For example, the DBM F-LBP

Fig. 2.9 Two-stage pretraining of a DBM with two hidden layers. Stage 1: find a good set of variational parameters (\mu^2) of Q(h^2) using a DBN. Stage 2: learn a joint RBM that has the predictive power of the variational parameters (\mu^2) given v. Finetune: find a set of DBM parameters that fit the variational parameters (\mu^2). Shaded nodes indicate clamped variables, while white nodes indicate free variables.

64 40 A Review of Audio-Visual Biometric Systems and DBM F-TVM for the face modality, and DBM S-GMS and DBM S-TVM for the speech modality. The steps involved in DBM-DNN based person recognition are detailed below. Generative pretraining of DBM: It is not trivial to start the training from randomly initialized parameters [122]. Hence, the DBMs are pre-trained using the two-stage algorithm shown in Fig.2.9. In this approach, posterior distributions over the hidden units and DBM parameters are obtained separately. The two stages are detailed below: Stage 1: At this stage the objective is to find a good set of variational parameters regardless of the parameter values of the DBM. This is performed by taking the posteriors of the hidden units from another model such as a DBN or a Deep Auto Encoder (DAE). A DBN can be trained efficiently [123] to find a good approximate distribution over units in the even-numbered hidden layer. Hence, a set of initial variational parameters (µ 2 ) for the second hidden layer is found from a DBN (left panel of Fig. 2.9). Stage 2: A joint distribution over the visible vector and variational parameters is learned using another RBM (central panel of Fig. 2.9). The visible layer of the joint RBM corresponds to the visible layer and the even-numbered hidden layer of the DBM that is being pretrained. The connections between the layers of the joint RBM are bidirectional like those of the DBM. Finally, when the joint RBM is trained, the learned parameters are used as initialisations for the training of the DBM (right panel of Fig. 2.9) which corresponds to freeing h 2 from its variational posterior distribution Q(h 2 ) obtained in Stage 1. Discriminative finetuning of DBM-DNN: Once the parameters of a DBM are learnt (see the right panel of Fig. 2.9), they can be used to initialize the hidden layers of a corresponding feed forward DNN. For a bimodal application (e.g., audio-visual person recognition), two such DBMs are generatively trained. These DBMs are then converted into discriminative DBM-DNNs such as the DBM-DNN F-LBP and DBM-DNN S-GMS [115], or the DBM-DNN F-TVM and DBM- DNN S-TVM [116]. This is done by first adding a softmax layer on top of each DBM and then fine-tuning them with the enrolment data by using the standard back-propagation algorithm (see Fig. 2.10). When a set of probes is presented to the system, they are clamped to the visible layer of the corresponding DBM-DNN and the resultant softmax layer is used to generate the scores. Decision making: The outputs of the DBM-DNNs are combined using the sum rule of fusion (see Fig. 2.10), which for a claimed identity j is given by

the following equation:

f_j(v_F, v_S) = \sum_{m} o_{m,j}    (2.25)

where m \in \{audio, visual\} and o_{m,j} represents the probability of the inputs v_F and v_S belonging to person j. An unknown person's claim j is accepted if f_j(v_F, v_S) is above a predetermined threshold.

Fig. 2.10 DBM-DNNs are initialized with DBM weights and discriminatively finetuned. The output scores are fused before reaching a final decision.

Evaluation: The equal error rate (EER) and the half-total-error rate (HTER) are used for the evaluation of the DBM-DNNs in a held-out dataset scenario (e.g., the MOBIO evaluation):

EER = \frac{FAR_{dev}(\theta_{dev}) + FRR_{dev}(\theta_{dev})}{2}    (2.26)

HTER = \frac{FAR_{eval}(\theta_{dev}) + FRR_{eval}(\theta_{dev})}{2}    (2.27)

These measures rely on the false acceptance rate (FAR) and the false rejection rate (FRR), which are calculated for the development and evaluation sets independently:

FAR(\theta) = \frac{|\{ s_{imp} \mid s_{imp} \geq \theta \}|}{|\{ s_{imp} \}|}    (2.28)

FRR(\theta) = \frac{|\{ s_{cli} \mid s_{cli} < \theta \}|}{|\{ s_{cli} \}|}    (2.29)

where s_{cli} are the client scores and s_{imp} are the impostor scores. A score threshold based on the EER of the development set is calculated as:

\theta_{dev} = \arg\min_{\theta} \left| FAR_{dev}(\theta) - FRR_{dev}(\theta) \right|    (2.30)

and the HTER is calculated using this threshold.

2.6 Summary

This chapter has presented an overview of the existing audio-visual person recognition systems, covering both those evaluated in controlled environments and those evaluated in uncontrolled environments. The application of DNN-based systems (DBN-DNN and DBM-DNN) has also been discussed. However, the DBM-DNN discussed in this chapter is only applicable in the context of a held-out database; new clients can be enrolled and scoring can be performed by respectively following the adaptation and LLR scoring strategies used by the DBN-DNN. Despite the promising results reported here using DNNs, it should be emphasized that there is still scope for further investigation, e.g., improving the performance of the learning algorithms and designing novel DNN architectures that are tailored to multimodal person recognition.

Chapter 3

Linear Regression-based Classifier for Audio-Visual Person Identification

This article is published in the Proceedings of the First International Conference on Communications, Signal Processing, and Their Applications (ICCSPA 13), pp. 1-5, Feb 12-14, 2013.

Abstract

This paper presents an audio-visual person identification system using a Linear Regression-based Classifier (LRC). Class-specific models are created by stacking q-dimensional speech and image vectors from the training data. The person identification task is considered a linear regression problem, i.e., a test (speech or image) feature vector is expressed as a linear combination of the (speech or image) model of the class it belongs to. The Euclidean distances between a test feature vector and the estimated response vectors of all the class-specific models are used as matching scores. The matching scores from both modalities are normalized using the min-max score normalization technique and are then combined using the sum rule of fusion. The system was tested on 88 subjects from the AusTalk audio-visual database. Experimental results show that the identification accuracy after audio-visual fusion is higher than the identification accuracy of either individual modality.

3.1 Introduction

The identity of a person can be established by comparing his/her personal information with the reference models stored in a database. A person can present several forms of identification tokens, such as an identification card (ID), a Personal Identification Number (PIN), login credentials, or a combination of these. The robustness

68 44 Linear Regression-based Classifier for Audio-Visual Person Identification of an unimodal system, where decisions are taken based on the information from a single modality, depends on the quality of sensed data, handling of impostor attacks, and non-universality. Multimodal biometric systems can overcome these limitations by using the information extracted from more than one biometric traits. An audio-visual biometric identification system is a multimodal system which uses the captured speech and visual signals as inputs and extracts a set of matching scores against all the user models. Multiple modalities can be fused at three different levels of recognition: the feature extraction level, matching score level, or decision level [31]. In the feature-level fusion, the correlations between the input feature vectors are used. However, the feature vectors should be represented in a common format and the time synchronization between them is also required. In the decision-level fusion, decisions from different modalities are combined to reach a final decision. The matching scores from multiple classifiers are combined, in the score-level fusion, to get a final set of scores. Recent audio-visual biometrics systems use of different techniques for face recognition, while the Mel-Frequency Cepstral Coefficients (MFCCs) are generally used as speech features. For example, Zhao et al. [124] presented a semi-supervised approach for audio-visual person recognition. They used the Local Binary Pattern (LBP) [65] feature extraction method on the most confident detected face from each video. Then, their system learned two discriminant subspaces with both labelled and unlabelled face and voice data. A nearest-template classifier was used to assign the label of the closest template to the test samples. In another work, Yu and Huang [125] presented a person recognition system using audio-visual feature fusion. Visual features were extracted using the Pyramid Gabor-Eigenface (PGE) algorithm [126]. The PGE algorithm uses 1-D filter masks instead of 2-D filtering operation to obtain Gabor features in spatial domain. They proposed a Probabilistic Neural Network (PNN) of four layers: input layer, two hidden layers, and output layer. In the classification phase, audio-visual patterns for each subject were used as the weights of its first hidden layer. The probability density functions (pdf) estimates for an input feature vector belonging to a class were added in the second hidden layer. Finally, the network classified an input to belong to the class with maximum pdf. In addition, Wong et al. [93] presented a video-based face recognition system in the compression domain. A multiband feature fusion method was used to select the wavelet subbbands that remained invariant to illumination and facial expression variations. They used the Radial Basis Function (RBF) Neural Networks for classification. Sahoo and Prasanna [58] presented an audio visual biometric system which was tested under degraded conditions. A combination of Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) was used for face verification, and Gaussian Mixture Model (GMM) [127] was used for speaker modeling. In this paper, we propose an audio-visual identification

system by using the down-sampled images [41] as visual features. This approach is computationally efficient because a high identification accuracy can be achieved even when the feature vector dimension is low. It also performs well under both clean and noisy conditions. Naseem et al. [41] presented the LRC algorithm. It develops class-specific models using down-sampled images of the enrolled users. A least-squares estimation is used to estimate the vectors of parameters for a given probe against all class models. Finally, a decision is given in favor of the class with the most precise estimation. Speaker modeling is the other major part in audio-visual biometrics. The Gaussian Mixture Model (GMM) is commonly used for speaker modeling. It assumes that the feature vectors from a modality follow a Gaussian distribution. In speaker verification, the claimed identity of a speaker is checked by scoring the test feature vector against the claimed speaker model and comparing this score against an impostor model [104]. The impostor model is built using all the models other than the model of the claimed identity. This is referred to as the Universal Background Model (UBM). The UBM can also be used in open-set speaker identification for detecting unknown speakers. It is more reliably trained and accurately models all speaker data. Therefore, it does not suffer from insufficient training or unseen data [104]. In this paper, we propose to use LRC for speaker identification. An individual GMM is adapted from the UBM, which is referred to as the GMM-UBM approach for speaker modeling. The mixture means from the GMM-UBM are then concatenated to form a supervector [128] that is considered a static template of the utterance. This is similar to the LRC used for face recognition. The vectors of parameters are estimated using a least-squares estimation technique and a decision is given in favor of the class with the most precise estimation. This approach is referred to as the LRC-GMM-UBM, which is a novel application of this classification paradigm to speaker recognition.
The rest of the paper is organized as follows: in Section 3.2, the LRC-GMM-UBM approach is described. Brief descriptions of the min-max score normalization technique and the fusion method are presented in Section 3.3 and Section 3.4, respectively. Experimental results on the speech and video recordings of 88 speakers from the AusTalk database are presented in Section 3.5. The paper concludes in Section 3.6.

3.2 Linear Regression-based Classifier
In this paper, we propose the use of LRC for face identification [41] and a novel LRC-GMM-UBM approach for speaker identification. LRC is a linear regression-based classifier that uses class-specific models of each person in the database. The

main concept of LRC is that the samples from a specific class lie on a linear subspace [129] [130]. Therefore, the task of person identification can be considered a linear regression problem. We use the LRC for face identification using the process described in Chapter 2. Here, we present the LRC-GMM-UBM approach for speaker identification.

3.2.1 LRC-GMM-UBM for Speaker Recognition
In this paper, we use the LRC-GMM-UBM, which is a novel approach, for speaker recognition. First, MFCCs are extracted from each speech signal. In the MFCC extraction process, complex spectral values are derived by applying a Fast Fourier Transform (FFT) to each uniformly spaced frame of the speech signal. These values are then converted to K filter bank values through a logarithmic smoothing operation which uses a Mel scale. Then, the K log filter bank spectral values, {log(s_k)}_{k=1}^{K}, are converted to L cepstral coefficients using the DCT. Typically, L = 12 MFCCs are extracted per frame, which comprise the feature vector for that frame. When n = 0, the cepstral coefficient c_0 represents the average log-power of the frame. Then, Cepstral Mean Normalization (CMN) is applied to compensate for channel variabilities. The delta and acceleration coefficients are also computed to capture the temporal dynamics in speech. This is done by using the delta (∆) and acceleration (∆∆) parameters, which are polynomial approximations of the first and second order derivatives. These parameters are then augmented with the 13-dimensional (including c_0) MFCCs. Therefore, a 39-dimensional (D = 39) feature vector (MFCC + delta + acceleration) is created. Since an utterance from a speaker is of variable duration (T), the size of the feature matrix (D × T) is also not fixed.
The GMM is one of the most generic methods in speaker modeling. The basic approach of training a GMM consists of estimating the parameters λ = {P_m, µ_m, Σ_m}_{m=1}^{M} from a training sample S = {x_1, ..., x_T} using Maximum Likelihood (ML) estimation. The higher the likelihood, the more likely it is that the unknown vectors originate from the model λ. The popular Expectation Maximization (EM) algorithm [121] is used to maximize the likelihood with respect to the given data. Adaptation of acoustic models is an important step because of data variability due to different speakers, environments, styles and so on. The adaptation for each utterance of all speakers is done as follows:
- All (D × T) features from all utterances over all speakers are used to train the UBM with M mixtures and dimension D. The UBM can be represented as λ_UBM = {P_m, µ_m, Σ_m}_{m=1}^{M}.

- An utterance from a speaker is used to adapt the GMM-UBM from the UBM. The adaptation formula can be written as

µ̂ = Wξ    (3.1)

where µ̂ is the adapted mean vector, W is the D × (D + 1) transformation matrix (D is the dimensionality of the data), and ξ is the UBM mean vector.
- The means from the GMM-UBM are concatenated to form a supervector of length (M × D), which is of fixed size for any duration of utterance.

Training is carried out by processing the supervectors from the Gaussian mixtures. A supervector is created for each training utterance of a speaker. If there are N speakers and j_i is the number of training utterances for speaker i, each utterance can be represented as

s_i^(k) ∈ R^(q×1), q = p × m    (3.2)

where i = 1, 2, ..., N, k = 1, 2, ..., j_i, p is the number of MFCC coefficients (39 in this paper), and m is the number of Gaussian mixtures (128 in our experiments) used to model the data. Unlike the LRC used for face recognition, the speech vectors are not down-sampled or transformed. Similar to Eq. 2.7, class-specific speaker models S_i are developed by stacking the q-dimensional (q = p × m) speech vectors:

S_i = [s_i^(1) s_i^(2) ... s_i^(j_i)] ∈ R^(q×j_i), i = 1, 2, ..., N    (3.3)

In the testing phase, a test speech sample, s, is transformed into a vector u ∈ R^(q×1). As in Eq. 2.8, the vector u should be represented by a linear combination of the training speech models of the class to which it belongs. Once the response vector for each class (û_i) is predicted, the distance between the original response vector u and the predicted response vector û_i is calculated. A decision is given in favor of the class with the minimum distance.

Fig. 3.1 A typical speaker's face image with white Gaussian noise of variance from 0.0 to 0.9 (clockwise from top left)
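To make the pipeline above concrete, the following is a minimal, illustrative sketch, not the code used in this thesis. The UBM is fitted with scikit-learn, the mean adaptation uses the common relevance-MAP rule as a stand-in for the transform-based adaptation of Eq. 3.1, and the class-specific models and least-squares decision follow Eqs. 3.2-3.3; all function names are hypothetical.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(all_mfcc_frames, n_mixtures=128):
    """all_mfcc_frames: (total_frames, D) MFCCs pooled over all speakers."""
    ubm = GaussianMixture(n_components=n_mixtures, covariance_type='diag',
                          max_iter=200, random_state=0)
    ubm.fit(all_mfcc_frames)
    return ubm

def adapt_means(ubm, utterance, relevance=16.0):
    """Relevance-MAP adaptation of the UBM means to one utterance of shape (T, D)."""
    post = ubm.predict_proba(utterance)                  # (T, M) responsibilities
    counts = post.sum(axis=0)                            # soft counts per mixture
    first_moment = post.T @ utterance / np.maximum(counts, 1e-10)[:, None]
    alpha = (counts / (counts + relevance))[:, None]
    return alpha * first_moment + (1.0 - alpha) * ubm.means_   # adapted means (M, D)

def supervector(ubm, utterance):
    """Fixed-length representation of an utterance: concatenated adapted means."""
    return adapt_means(ubm, utterance).ravel()           # length q = M * D

def lrc_identify(class_models, u):
    """class_models: list of S_i matrices of shape (q, j_i); u: probe supervector (q,).
    Returns the index of the class with the smallest Euclidean distance."""
    distances = []
    for S_i in class_models:
        beta, *_ = np.linalg.lstsq(S_i, u, rcond=None)   # least-squares parameter estimate
        u_hat = S_i @ beta                               # predicted response vector
        distances.append(np.linalg.norm(u - u_hat))
    return int(np.argmin(distances))
```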

3.3 Score Normalization
In the identification mode, a biometric system compares the captured biometrics with the templates of all users in the database for a match. Therefore, an expert (e.g., audio or visual) produces a set of matching scores that represent either likelihood or distance measures. In score normalization, all the scores from an expert are transformed to cover an identical range (e.g., [0,1]) so that a common threshold (particularly in verification) can be used. Many techniques can be used for score normalization, such as the min-max, z-score, double sigmoid function and tanh-estimators. The min-max normalization is the simplest technique and is best suited when the minimum and maximum values of the scores are known. It can still be used if the bounds of the raw scores are unknown, but it is then extremely sensitive to outliers. Although min-max normalization is straightforward, it has comparable performance to other techniques [69]. Therefore, we used the min-max normalization technique in our experiments.

3.3.1 Min-max normalization
Let S denote the set of raw matching scores from an expert, and s denote a score such that s ∈ S. The normalized score of s is denoted by s′. In min-max normalization, the raw scores from the experts are mapped to the interval [0,1] and the original distribution of matching scores is retained. The normalized score s′ is calculated as

s′ = (s − min) / (max − min)    (3.4)

where min is the minimum and max is the maximum value of the raw scores generated by a classifier. The min-max normalization transforms the raw scores into the [0,1] range.

3.4 Audio-Visual Fusion
The normalized scores from the audio and visual classifiers are combined to give the final decision about the identity of a person. A number of approaches for score-level fusion can be found in the literature: the maximum rule, minimum rule, sum rule, and the product rule. In the sum rule, the fused score is computed by adding the scores from each modality. The experimental results in [20] showed that the sum rule of classifier fusion outperforms the other rules. Therefore, we used the sum rule in our experiments, giving equal weights to both modalities. A decision was given in favor of the class with the minimum fused score.
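A minimal sketch of Eq. 3.4 and the equal-weight sum rule is given below, assuming both experts output distance-type scores (so the fused decision is an argmin); function names are illustrative only.

```python
import numpy as np

def min_max(scores):
    """Eq. 3.4: map a set of raw scores to the [0, 1] range."""
    s = np.asarray(scores, dtype=float)
    return (s - s.min()) / (s.max() - s.min() + 1e-12)

def sum_rule_identify(audio_scores, visual_scores):
    """Equal-weight sum rule; the identity with the minimum fused distance wins."""
    fused = min_max(audio_scores) + min_max(visual_scores)
    return int(np.argmin(fused))
```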

3.5 Experimental Results and Analysis
Extensive experiments have been carried out to illustrate the effectiveness of the audio-visual fusion and the LRC approach. A set of 88 speakers (recorded at our UWA campus) from the AusTalk [1] database was used in our experiments. Each speaker was asked to utter twelve different four-digit number sequences. The video data was captured using a Bumble Bee 2 stereo camera. We randomly picked one video frame from each of the twelve videos for a speaker. Then, the Adaboost algorithm proposed by Viola and Jones [131] was applied for face detection. We used eight face images for training and four face images in the testing phase. Similarly, we used eight speech recordings for training and four for testing. All experiments were carried out using face images down-sampled to an order of pixels and audio signals converted from 32-bit to 16-bit. To test the robustness of the LRC and the LRC-GMM-UBM, data from both modalities were degraded using Additive White Gaussian Noise (AWGN). Signal-to-Noise Ratios (SNRs) ranging from 12dB to 40dB in increments of 2dB were used for degrading the speech data, and AWGN variance levels (with zero mean) ranging from 0.06 to 0.2 in increments of 0.01 were used to degrade the face image data.
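The degradation described above can be reproduced with a short sketch such as the one below (illustrative only): speech is corrupted to a target SNR in dB, and face images, assumed scaled to [0, 1], receive zero-mean Gaussian noise of a given variance.

```python
import numpy as np

def degrade_speech(signal, snr_db, rng=None):
    """Add white Gaussian noise so that the resulting SNR equals snr_db."""
    if rng is None:
        rng = np.random.default_rng(0)
    signal = np.asarray(signal, dtype=float)
    noise_power = np.mean(signal ** 2) / (10.0 ** (snr_db / 10.0))
    return signal + rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)

def degrade_image(image, variance, rng=None):
    """Add zero-mean AWGN of the given variance to an image scaled to [0, 1]."""
    if rng is None:
        rng = np.random.default_rng(0)
    noisy = np.asarray(image, dtype=float) + rng.normal(0.0, np.sqrt(variance), size=image.shape)
    return np.clip(noisy, 0.0, 1.0)

# Example grid used in this chapter: speech SNRs 12-40 dB in 2 dB steps and
# image noise variances 0.06-0.2 in 0.01 steps.
```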

The impact of AWGN on a typical face image is shown in Fig. 3.1. In our experiments, we considered the following three degraded conditions: i) clean visual and noisy audio; ii) clean audio and noisy visual; and iii) noisy audio and visual. However, initial experiments were performed keeping both audio and visual data clean. The performance of the audio-visual system in clean conditions is presented in Table 3.1.

Table 3.1 Audio-visual fusion performs better in clean conditions
Modality               Identification accuracy
Audio                  97.72%
Visual                 99.71%
Audio-visual fusion    100%

It can be seen that the audio-visual fusion achieved an identification accuracy of 100% in clean conditions.
The use of down-sampled images makes the face recognition process faster and more efficient. The performance of the visual modality was tested against different feature vector dimensions. In Fig. 3.2, it can be seen that the identification accuracy maintained a maximum level of 98.15% when the feature vector size was 40 or more pixels. Therefore, the LRC approach for face recognition is able to achieve a high identification accuracy even when the feature vector is small.

Fig. 3.2 Identification accuracy with respect to feature vector dimension

In Fig. 3.3, the performance of audio-visual fusion is shown in clean visual and noisy audio conditions. It can be seen that the audio-visual fusion achieved better identification accuracy than either individual modality. This is because the variance of the scores generated by the LRC-GMM-UBM was very low when a higher level of noise was added to the audio data. The low variance of the LRC-GMM-UBM scores did not significantly affect the variance of the fused scores. As a result, the identification accuracy of the audio-visual fusion was at least equal to the accuracy of the clean modality, which was the visual modality in this case.

Fig. 3.3 Audio-visual fusion performance with noisy audio and clean visual data

The opposite was also true when the visual data was noisy and the audio was clean. In Fig. 3.4, it is shown that the overall identification accuracy is at least equal to the accuracy

(97.72%) of the audio modality under clean conditions. Therefore, the audio-visual fusion performs well when one modality is noisy and the other is clean.

Fig. 3.4 Audio-visual fusion performance with noisy visual and clean audio data

The surface plot in Fig. 3.5 shows the audio-visual fusion performance when both modalities are noisy. It can be seen that the audio-visual fusion achieved higher identification accuracy even at the worst conditions (12dB SNR and 0.2 variance of AWGN). For example, the audio modality achieved an identification accuracy of only 10.22% when the SNR was 12dB, and the visual modality achieved an identification accuracy of 26.13% at the AWGN variance of 0.2. However, when both modalities were fused, the identification accuracy increased to 31.3%. Therefore, the audio-visual fusion clearly outperforms an individual modality even when both the modalities are noisy.

Fig. 3.5 Audio-visual fusion performance with respect to AWGN

3.6 Conclusion
In this paper, the LRC is used for audio-visual person identification. The audio-visual fusion at the score level is evaluated under different noise conditions. It has been shown that the audio-visual fusion can achieve better identification accuracy than an individual modality when the conditions are poor. This paper also motivates the development of an adaptive fusion strategy to further improve the identification performance in challenging conditions.

77 Chapter 4 A Reliability-based Score-level Fusion of Linear Regression-based Classifiers Abstract This paper presents a reliability estimation technique for an audio-visual person identification system. The estimated reliability measures are mapped to modality weights by using a mapping function. The weighted sum rule of fusion is used to fuse the audio and visual modalities before reaching a decision about the identity of a person. The proposed technique was tested on 88 subjects from the AusTalk audiovisual database. Experimental results show that the proposed technique improves the identification accuracy of the min-max normalization, compared to when it is used without any reliability estimation (as in Chapter 3). The identification accuracy of the proposed technique is also higher than a non-learned approach of modality weight adaptation. 4.1 Introduction Information fusion refers to the combination information from different sources, either to generate a common format, or to reach a decision [56]. In [56], the following motivations for using information fusion were reported: a) utilization of complementary information, b) use of multiple sensors, c) cost reduction, and d) ease of data acquisition. Information fusion techniques are also used in multimodal biometric systems to facilitate the recognition process. An audio-visual person This article is published in the Proceedings of the 8th IEEE Conference on Industrial Electronics and Applications (ICIEA 13), pp , June 19-21, 2013, with a title "An Efficient Reliability Estimation Technique for Audio-Visual Person Identification".

78 54 A Reliability-based Score-level Fusion of Linear Regression-based Classifiers identification system is a multimodal biometric system where a person s identity is determined by comparing his/her captured speech and face images with the templates of all users stored in the database. The process of data acquisition in audio-visual biometrics is unobtrusive and requires low cost sensors. Therefore, information fusion can play an important role in audio-visual biometrics in the following ways: a) to fuse the information from the audio and visual modalities and b) towards the successful identification of a person. The recognition (identification/verification) performance of a biometric system is dependent on a number of factors that includes the quality of sensed data. For example, if the sensed data is noisy or poorly represented, the recognition accuracy of the biometric system will be low. Multimodal biometric systems can mitigate these problems in the following ways: a) using information from more than one source (e.g., using more than one sensors to acquire the same biometric), b) using different matching algorithms for the same biometric, or c) by using different biometric traits [69]. A multimodal biometric system improves the accuracy, is more robust to noise, and provides reasonable protection from spoofing attack, compared to a unimodal biometric system. However, assigning weights to the modalities is a critical problem. This can be done adaptively by mapping some kind of reliability measure into modality weights. The reliability of a modality can be estimated at the signal level or at the matching score level [3]. A signal based reliability is estimated directly from the signal and prior to feature extraction. Examples include estimation of Signal-to- Noise Ratio (SNR) in [132] and degree of voicing in [71], and fingerprint image quality in [133]. These measures can be included in the feature vector to achieve better classification or they can be used to give higher weight to the more reliable modality. Although SNR can be used for estimating the reliability of the audio modality, there is no corresponding measure for the visual modality in audio-visual biometrics. Using an audio only reliability measure is undesirable, because the integrity of the video signal is vulnerable due to various reasons, such as the high video bandwidth requirement and sensitivity to illumination conditions [3]. In addition, even if the visual signal is of high quality, the model of the subject can be a poor representation (i.e., the pose of the subject may be inconsistent). Therefore, reliability measures calculated from the matching scores have been employed, because it can quantify both train/test mismatch and the confidence in classification decision. Examples include score entropy [71], score dispersion [71], score variance [134], cross classifier coherence [135], score difference [88], scores itself [3], and the difference between two highest ranked scores normalized by the mean score [4]. A mapping between the reliability estimates and the modality weights is also required. This is done by using sigmoid mapping in [71], empirical regression in [136], or a non-learned approach in [3] and [4]. In sigmoid mapping,

79 4.1 Introduction 55 the parameters of the sigmoid curve requires training, which is difficult when the amount of audio-visual data is scarce. In [136], empirical regression was used to map SNR values to weights. However, the non-learned approach presented in [3] and [4] automatically maps the reliability estimates into modality weights by searching for the best weights that maximize the reliability of the combined scores. In a multimodal biometric system, fusion is performed at different levels of recognition: the feature extraction level, matching score level, or decision level [31]. Feature-level fusion is an early fusion strategy which uses the correlation between multiple features. It requires the features to be represented in a common format and time synchronization between them. However, when the number of modalities increases, finding cross-correlation among the features is difficult [137]. In decision-level fusion, decisions from different experts are combined to reach the final decision. Majority voting, combination of ranked lists, AND/OR operators are the common techniques used in decision level fusion [19]. In majority voting [138], the final decision is made when the majority of the classifiers reaches the same decision. In ranked list combination approach [138], the ranked lists from all experts are combined to obtain a final ranked list. In OR fusion, the final decision is made as soon as one of the experts reaches a decision, and in AND fusion, the final decision is made when all experts reach the same decision. On the other hand, fusion at the matching score level is preferred because the matching scores are easily accessible and they can be easily combined using simple rules (e.g., sum rule and product rule) as presented in [20]. However, the matching scores from different experts may not be homogeneous, because one expert may output distance measures and the other expert may output likelihood measures. Hence, transforming the scores into a common numerical scale is necessary before they are combined. This process of transforming the raw scores into a common range is known as score normalization. A brief discussion about the min-max score normalization, a simple normalization technique, is discussed in Section 4.2 of this paper. In this paper, we propose a novel reliability estimation technique which estimates the reliability of a modality from the ratio of the mean of top k ranked scores to the mean of the score distribution. It does not require any prior learning or searching for the best combination of weights. The proposed technique was tested on 88 subjects from the AusTalk audio-visual database. It improves the performance of the min-max normalization, compared to when it is used without a reliability estimation. The proposed technique also performs better in adverse conditions than the non-learned approach presented in [3] and [4]. The rest of the paper is organized as follows: a summary of the min-max score normalization technique is presented in Section 4.2. The proposed reliability estimation technique is discussed in Section 4.3. A description of the experimental

setup is included in Section 4.4. In Section 4.5, experimental results are presented, followed by a discussion in Section 4.6.

4.2 Score Normalization
In the identification mode, a biometric system compares the captured biometrics with the templates of all users in the database for a match. Therefore, an expert (e.g., audio or visual) produces a set of matching scores that represent either likelihood or distance measures. In score normalization, all the scores from an expert are transformed to cover an identical range (e.g., [0,1]) so that a common threshold (particularly in verification) can be used. Many techniques can be used for score normalization, such as the min-max, z-score, double sigmoid function and tanh-estimators. The min-max normalization is the simplest technique and is best suited when the minimum and maximum values of the scores are known. It can still be used if the bounds of the raw scores are unknown, but it is then extremely sensitive to outliers. Although min-max normalization is straightforward, it has comparable performance to other techniques [69]. Therefore, we used the min-max normalization technique in our experiments.

4.2.1 Min-max normalization
Let S denote the set of raw matching scores from an expert, and s denote a score such that s ∈ S. The normalized score of s is denoted by s′. In min-max normalization, the raw scores from the experts are mapped to the interval [0,1] and the original distribution of matching scores is retained. The normalized score s′ is calculated as

s′ = (s − min(S)) / (max(S) − min(S))    (4.1)

where min(S) and max(S) are the minimum and maximum values of the raw matching scores, respectively.

4.3 Proposed Reliability Estimation Technique
In this paper, we propose a reliability estimation technique which is used after the raw scores are normalized using the min-max normalization. We assume that there are N subjects in the database and that each time a probe is presented, N scores are generated by an expert. We also obtain a ranked list of N possible identities from an expert. The audio expert, E_A, generates a set of N scores, S_A = {a_1, a_2, ..., a_N}, and a set of N ranked identities, R_A = {A_1, A_2, ..., A_N}, with the most likely identity listed on top. Similarly, we get a score set S_V = {v_1, v_2, ..., v_N} and a ranked list R_V = {V_1, V_2, ..., V_N} from the visual expert, E_V. The raw scores in S_A and S_V are normalized so that we get two sets of normalized scores S′_A = {a′_1, a′_2, ..., a′_N} and S′_V = {v′_1, v′_2, ..., v′_N}.
We assume that in the presence of noise or a poor representation of the probe, the best match identities from the experts may not be the same. In Table 4.1, an example of the raw score distributions and ranked lists for a test speech and face image from a client (e.g., Client 1) is shown. The audio and visual experts are the LRC-GMM-UBM and LRC, respectively, as in [46]. The matching scores represent the distance between the original response vector (y) and the predicted response vector (ŷ_i) for each user (i = 1, 2, ..., N). The decision is given in favor of the user for whom the distance is minimum.

Table 4.1 Raw scores and the ranked lists from the audio and visual experts for a probe belonging to Client 1. Paired (score, client) columns are given for the audio expert (LRC-GMM-UBM) under clean audio, SNR = 30dB and SNR = 12dB, and for the visual expert (LRC) under clean video, AWGN σ² = 0.1 and AWGN σ² = 0.2.

In Table 4.1, it can be seen that the best match identities of both experts are the same in clean conditions, and the combined decision would be given in favor of the correct user. On the other hand, if one of the modalities is noisy and the other is clean, or both modalities are noisy, there is a mismatch between the best matches of the two experts. The best match identity can still be the same in noisy conditions, but it is more likely that the decision would be given in favor of a wrong user. We propose to assign different weights to the modalities when their best match identities are mismatched. Let the best match (V_1) of the visual expert (E_V) be in the k_1-th place in the ranked list (R_A) of the audio expert (E_A), i.e., V_1 ≠ A_1 and V_1 = A_{k_1}. Similarly, let the best match (A_1) of the audio expert (E_A) be positioned in the k_2-th place in the ranked list (R_V) of the visual expert (E_V), i.e., A_1 ≠ V_1 and A_1 = V_{k_2}. Therefore, the reliability of the modalities (ζ_A and ζ_V) can be calculated using the following equations:

ζ_A = 1 − mean(S″_A) / mean(S′_A)    (4.2)

ζ_V = 1 − mean(S″_V) / mean(S′_V)    (4.3)

where ζ_A and ζ_V are the reliability measures, and S″_A = {a′_1, ..., a′_{k_1}} and S″_V = {v′_1, ..., v′_{k_2}} denote the sets of top k_1 and k_2 ranked normalized scores of the audio and visual experts, respectively. When the best match identities match, ζ_A = ζ_V = 1, which implies that the modalities are equally reliable and equal weights are assigned to each modality. In the worst case, k_1 = k_2 = N, the reliability measures ζ_A and ζ_V are equal to 0. This situation can be handled either by recapturing the test speech and image samples or by giving equal weights to the modalities. In our experiments, we assigned equal weights to handle the worst case, which occurred in only 0.005% of the total tests. The mapping from the reliability measures to the modality weights is governed by the following equation:

α = (1 + |ζ_V − ζ_A|) / 2    (4.4)

where α is the weight assigned to the more reliable modality. For example, if ζ_V > ζ_A then the visual expert E_V is more reliable, and the weights are assigned as w_V = α and w_A = (1 − α). Similarly, if ζ_A > ζ_V then the audio expert is more reliable, and the weights are assigned as w_A = α and w_V = (1 − α). Then, the normalized scores from the audio and visual experts are combined using the weighted sum rule as follows:

F = w_A S′_A + w_V S′_V    (4.5)

where F = {f_1, f_2, ..., f_N} is the set of fused scores. The decision is given in favor of the user for whom the fused score is minimum.
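A compact sketch of this reliability-weighted fusion (Eqs. 4.2-4.5) is given below. It assumes distance-type scores that have already been min-max normalized and ranked lists sorted best-first; the function names are illustrative, not the code used in this chapter.

```python
import numpy as np

def reliability(norm_scores, ranked_ids, other_best_id):
    """Eqs. 4.2/4.3: 1 - mean(top-k ranked scores) / mean(all scores), where k is
    the position of the other expert's best-match identity in this expert's list."""
    k = int(np.where(np.asarray(ranked_ids) == other_best_id)[0][0]) + 1
    ordered = np.sort(np.asarray(norm_scores, dtype=float))   # best (smallest) first
    return 1.0 - ordered[:k].mean() / ordered.mean()

def fuse_scores(s_a, s_v, ranked_a, ranked_v):
    """Weighted sum rule of Eq. 4.5; the identity with the minimum fused score wins."""
    s_a, s_v = np.asarray(s_a, dtype=float), np.asarray(s_v, dtype=float)
    if ranked_a[0] == ranked_v[0]:
        w_a = w_v = 0.5                              # best matches agree: equal weights
    else:
        z_a = reliability(s_a, ranked_a, ranked_v[0])
        z_v = reliability(s_v, ranked_v, ranked_a[0])
        if z_a == 0.0 and z_v == 0.0:                # worst case k1 = k2 = N
            w_a = w_v = 0.5
        else:
            alpha = (1.0 + abs(z_v - z_a)) / 2.0     # Eq. 4.4
            w_v, w_a = (alpha, 1.0 - alpha) if z_v > z_a else (1.0 - alpha, alpha)
    return w_a * s_a + w_v * s_v
```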

4.4 Experimental Setup
We tested the proposed technique on a set of 88 speakers (recorded at our campus) from the AusTalk audio-visual database [1]. Videos were captured using a Bumble Bee 2 stereo camera. We preprocessed the raw videos and extracted face images from each frame using the Adaboost algorithm [131]. The speech signals were converted to a 16-bit format so that speaker models could be created using the Hidden Markov Model Toolkit (HTK) [139]. We used the down-sampled face images and Mel Frequency Cepstral Coefficients (MFCCs) as the visual and acoustic features, respectively. There are 12 sessions for each user, with a different four-digit number uttered in each session. We used 8 speech recordings for training the speaker models, and 10 randomly picked frames from the videos for training the visual models. To test the robustness of the proposed technique, both the audio and visual test signals were degraded at fifteen different levels. Additive White Gaussian Noise (AWGN) was applied to the clean speech data at SNR levels ranging from 40dB to 12dB in decrements of 2dB. Similar noise of zero mean and variance ranging from 0.06 to 0.2 in increments of 0.01 was applied to the visual data. Thus, in our experiments, we carried out a total of 15 × 15 × 4 × 88 = 79,200 tests covering different combinations of audio and visual noise on four test samples per speaker.

Fig. 4.1 Identification accuracy at different levels of noise on visual data and mild audio noise (SNR = 40dB)

Initially, we implemented the min-max technique and fused the modalities assigning equal weights to them. Then, we applied the proposed technique on the min-max normalized scores to demonstrate that the proposed technique improves the identification accuracy. We also implemented the non-learned approach of mapping reliability estimates into weights as presented in [3] and [4]. In [3], the authors proposed a reliability estimation technique that was used for verification. They used the difference between the genuine and impostor scores to estimate the reliability of a modality. On the other hand, the authors in [4] proposed a reliability estimation technique for the identification problem. They used the difference between the best score and the second best score, normalized by the mean, as the reliability estimate of a modality. However, both techniques used a non-learned approach of mapping reliability estimates into modality weights. We implemented the latter technique to compare with our reliability estimation technique, which used Eq. 4.2 and Eq. 4.3 for reliability estimation.

4.5 Results and Analysis
We used the min-max normalization for score normalization because its performance is comparable with other, more complicated normalization techniques. We compared the identification performance between the min-max normalization with the proposed reliability estimation and the min-max normalization without any reliability estimation.

In Fig. 4.1, the comparison of identification accuracy at different levels of noise on the visual modality and mild audio noise is presented. It can be seen that the proposed technique improves the accuracy when the visual data is adversely affected by AWGN but the audio is less noisy (SNR = 40dB). This is because, when the level of noise is mild, the ratio of the mean of the top k ranked scores to the mean of the score distribution is very low. Therefore, at mild noise levels, the modalities are almost equally reliable and receive almost equal weights. When one of the modalities is adversely noisy, the ranked list that it generates becomes less reliable because the ratio of the mean of the top k ranked scores to the mean of the score distribution is high. Therefore, the scores from the noisy modality should get less weight in the fusion process. Equations 4.2, 4.3 and 4.4 presented in this paper ensure that the noisier modality gets a lower weight than the other. In Fig. 4.2, the comparison of identification accuracy at different levels of audio SNR and mild visual noise is shown. It can be seen that the proposed technique improves the accuracy when the audio data is adversely degraded but the visual data is less noisy.

Fig. 4.2 Identification accuracy at different levels of audio SNR and mild visual noise (AWGN variance = 0.06)

Then, we tested the system at adverse noise levels on both the audio and visual modalities. In Fig. 4.3, the identification accuracy is shown for different noise levels on the visual data and high noise on the audio (SNR = 12dB). Although it is difficult to achieve high accuracy when both modalities are highly noisy, our proposed technique performed better than using min-max without reliability estimation. Similarly, in Fig. 4.4, it can be seen that the performance of our proposed technique is better than using min-max without reliability estimation. This is because, when both modalities are noisy, the ranked lists generated by both

of them are unreliable and also the ratio of the mean of the top k ranked scores to the mean of the score distribution is high for both modalities. As a result, there is not much difference between the weights assigned to the modalities.

Fig. 4.3 Identification accuracy at different levels of noise on visual data (AWGN variance) and high audio noise (SNR = 12dB)

Fig. 4.4 Identification accuracy at different levels of audio SNR and high visual noise (AWGN variance of 0.2)

If two different modalities are used and the data degradation is done in different ways, it is difficult to measure which modality is relatively less affected in adverse noisy conditions. Our proposed technique efficiently assigned weights to the modalities and achieved better identification accuracy. On the other hand, the non-learned approach presented in [3] and [4] is an optimization problem and is also time-

consuming, as it searches for the best combination of weights that maximizes the confidence of the combined decision. In Fig. 4.5, it can be seen that our proposed technique also achieved higher accuracy compared to the non-learned approach of assigning weights, at adverse audio and visual noise levels.

Fig. 4.5 Performance of the reliability estimation technique compared to the non-learned approach in [3] and [4] at high visual noise with variance of 0.2

4.6 Discussion
In this paper, a reliability estimation technique for an audio-visual biometric system has been presented. The proposed technique is based on the observation that the best ranked identity from each expert is the same when the conditions are good. However, the ranked list as well as the score distribution change when the data is degraded. The experimental results presented in this paper show that the proposed technique improves the identification accuracy of the min-max normalization without a reliability measure. Our proposed technique also achieved higher identification accuracy than the non-learned approach in adverse noisy conditions.

87 Chapter 5 A Late Fusion Framework For Audio-Visual Person Identification Abstract This paper presents a confidence-based late fusion framework and its application to audio-visual biometric identification. We assign each biometric matcher a confidence value calculated from the matching scores it produces. Then a transformation of the matching scores is performed using a novel confidence-ratio (C-ratio) i.e., the ratio of a matcher confidence obtained at the test phase to the corresponding matcher confidence obtained at the training phase. We also propose modifications to the highest rank and Borda count rank fusion rules to incorporate the matcher confidence. We demonstrate by experiments that our proposed confidence-based fusion framework is more robust compared to the state-of-the-art late (score- and rank-level) fusion approaches. 5.1 Introduction Identification systems have long been used for criminal investigations and are now increasingly being used for various real life applications, e.g., computer login, physical access control, time attendance management [140]. The identification task can be more challenging compared to the verification when the number of enrolled users is large. One way of developing an accurate identification system is to use instances from multiple modalities [141], such as the face image, speech and fingerprint. Multiple modalities are usually combined either at an early or at a late stage of recognition. This article is published in Pattern Recognition Letters, vol. 52, pp , January, 2015, with a title "A confidence-based late fusion framework for audio-visual biometric identification".

88 64 A Late Fusion Framework For Audio-Visual Person Identification Existing score fusion techniques can be categorized into four groups. The first group are the transformation-based fusion methods: the match scores are transformed into (not necessarily) a common range and then simple rules (e.g., product, sum, mean, max, etc.) [20] are applied to them. The second group are the density-based fusion methods: underlying match score densities are first estimated and then the joint likelihood ratio is calculated [142] [143]. The third group are the classifier-based fusion methods: the match scores are considered as features of a fusion classifier [ ]. Recently, another framework reported in [5] is known as the quality-based fusion approach: the modalities are weighted based on the quality measure of the corresponding biometric samples. Although score-level fusion is commonly adopted in multimodal biometrics, rank-level fusion is considered a more viable option [143] for systems operating in the identification mode. The ranked lists from different matchers are combined to reach a final decision [147]. Unlike score-level fusion, the accurate estimation of the underlying genuine/impostor score distributions and normalization are not required in ranklevel fusion Motivation and Contributions In real life scenario, a biometric system may encounter noisy outdoor environments. For example, a missing/wanted person detection system being operated at an airport, train/bus station, or some other public place. The biometric traits to be used in these types of applications must be unobtrusive (e.g., audio-visual) and the user s claim of an identity for verification may not be available. The challenge is, in an outdoor environment, the captured biometric samples may contain noise or corruption due to various environmental conditions (e.g., windy/gloomy atmosphere and low configuration capture devices). Quality-based fusion [5] offers a solution to this problem by measuring the quality of the input samples and passing this bit of additional information to the fusion module. However, measuring the quality at the signal level is particularly difficult from face image samples [3] because the source of statistical deviation is varied and difficult to model. Alternatively, the matching scores from a biometric matcher provide a good indication of the quality of the input samples, given the matcher s decision making ability is strong under normal circumstances. Incorporating a system s confidence in the participating modalities (matchers) has not been well studied and this lack of development has also been highlighted in [6]. Our motivation is to develop a fusion framework that works well when either or all input samples presented are contaminated by noise (e.g., detector noise, bit-error, transmission error and additive noise). The core contributions of this paper are listed below:

Fig. 5.1 Block diagram of an audio-visual biometric system that incorporates sample quality and (or) matcher confidence measures in the fusion. Although quality-based fusion has been studied extensively [5], incorporating matcher confidence in the fusion has not been well studied [6]. Moreover, achieving sample quality in audio-visual biometrics is challenging [3]. The shaded box highlights our contribution.

- We propose a novel C-ratio, which is the ratio of the matcher confidence obtained from the matching scores during the test phase to the maximum value of the matcher confidence obtained at the training phase (Section 5.3.1).
- We also propose a confidence factor to be used in rank-level fusion (Section 5.3.2). Our proposed confidence-based rank-level fusion approach considers that only the ranked lists and the maximum matcher confidence obtained at the training phase are available to the fusion module.
- We evaluate the robustness of our proposed framework and compare its performance with state-of-the-art score fusion approaches (Section 5.5). We also present a comparative analysis of our proposed confidence-based rank-level fusion approach with state-of-the-art rank-level fusion approaches.

In Fig. 5.1, a typical audio-visual biometric recognition system is shown with all possible fusion approaches, including our proposed confidence-based fusion. Our contribution as a whole lies in the shaded box, where the matcher confidence values are calculated from the match scores.

5.2 Fusion In Multibiometric Identification
Let N denote the number of enrolled users and M denote the number of modalities. If s_{m,j} is the score and r_{m,j} the rank provided for the j-th template by the m-th matcher, j = 1, ..., N; m = 1, ..., M, then for a given query we get M × N score and rank matrices as follows:

S = [ s_{1,1} ... s_{1,N}
      s_{2,1} ... s_{2,N}
      ...
      s_{M,1} ... s_{M,N} ],    (5.1)

and

R = [ r_{1,1} ... r_{1,N}
      r_{2,1} ... r_{2,N}
      ...
      r_{M,1} ... r_{M,N} ].    (5.2)

Our objective is to determine the true identity of the given query from S and (or) R. In this section, we briefly discuss the state-of-the-art in late fusion for multibiometric identification.

5.2.1 Existing Score-level Fusion Approaches
Existing score fusion approaches can be categorized into four groups. Here, we briefly discuss these approaches of score-level fusion for multimodal biometric identification.

Transformation-based Score Fusion
An example of a simple transformation-based score fusion approach is the use of the min-max score normalization to transform the raw scores into the [0,1] range, followed by the equally weighted sum rule (EWS) of fusion. The min-max normalization is performed as

s′_{m,j} = (s_{m,j} − min(S_m)) / (max(S_m) − min(S_m)),    (5.3)

where s_{m,j} is the match score provided by the m-th matcher to the j-th identity, and S_m is the m-th row in S that corresponds to the matching scores from the m-th matcher. Then, the matching scores from all the matchers are added without any bias to a particular matcher:

f_j = Σ_{m=1}^{M} w_m s′_{m,j},    (5.4)

where f_j represents the fused score for the j-th identity and w_m is the weight assigned to the m-th matcher such that w_1 = w_2 = ... = w_M.

Density-based Score Fusion
In [141] the authors used a likelihood ratio score fusion [142], which was originally designed for verification under certain assumptions: (i) prior probabilities are equal for all the users, (ii) the match scores for different users are independent of one another, and (iii) the genuine (impostor) match scores of all users are identically distributed. The aim of an identification system is to assign the query an identity I_{j0} that maximizes the posterior probability. The decision rule for closed-set identification is governed by

P(I_{j0} | S) ≥ P(I_j | S), j = 1, ..., N.    (5.5)

For open-set identification, the query is assigned identity I_{j0} only when P(I_{j0} | S) > τ in the above equation. According to Bayes theory [148], we can calculate P(I_j | S) as follows:

P(I_j | S) = p(S | I_j) P(I_j) / p(S).    (5.6)

Now, under the assumption of equal priors P(I_j) for all users [141], the posterior probability P(I_j | S) is proportional to the likelihood p(S | I_j). The likelihood p(S | I_j) can be written as

p(S | I_j) = [f_gen(s_j) / f_imp(s_j)] ∏_{i=1}^{N} f_imp(s_i),    (5.7)

where s_j = [s_{1,j}, ..., s_{M,j}] is the score vector corresponding to user j from the M modalities, and f_gen(s_j) and f_imp(s_j) are the densities of the genuine and impostor match scores, respectively, assuming that they are identically distributed for all users. Thus, the likelihood of observing the score matrix S given that the true identity is I_j is proportional to the likelihood ratio for verification used by the authors in [142]. The authors in [141] assumed that the scores from different matchers are conditionally independent. Hence, the joint density of the genuine (impostor) match scores can be estimated as the product of the marginal densities, which we refer to as LRT-GMM in this paper:

∏_{m=1}^{M} f^m_gen(s_{m,j0}) / f^m_imp(s_{m,j0}) ≥ ∏_{m=1}^{M} f^m_gen(s_{m,j}) / f^m_imp(s_{m,j}), j = 1, ..., N.    (5.8)
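A minimal sketch of the resulting LRT-GMM decision (Eqs. 5.5-5.8) is given below. It assumes that one-dimensional genuine and impostor score densities have already been fitted per matcher on a training set (here with scikit-learn GMMs, in the spirit of [155]) and that the matchers are conditionally independent; all names are illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_score_density(train_scores, n_components=2):
    """Fit a 1-D GMM to a set of genuine or impostor training scores."""
    gmm = GaussianMixture(n_components=n_components, random_state=0)
    gmm.fit(np.asarray(train_scores, dtype=float).reshape(-1, 1))
    return gmm

def lrt_gmm_identify(S, gen_models, imp_models):
    """S: (M, N) score matrix; gen_models/imp_models: per-matcher fitted densities.
    Picks the identity maximizing the product of per-matcher likelihood ratios,
    which under equal priors maximizes the posterior of Eq. 5.5."""
    M, N = S.shape
    log_lr = np.zeros(N)
    for m in range(M):
        col = np.asarray(S[m], dtype=float).reshape(-1, 1)
        log_lr += gen_models[m].score_samples(col) - imp_models[m].score_samples(col)
    return int(np.argmax(log_lr))
```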

Quality-based Score Fusion
In [142], the authors presented the quality-based likelihood ratio (QLR) fusion technique, which can be used provided that sample quality information is available. Inspired by the LRT-GMM method, we can define the QLR framework for the identification problem as follows:

∏_{m=1}^{M} f^m_gen(s_{m,j0}, Q_m) / f^m_imp(s_{m,j0}, Q_m) ≥ ∏_{m=1}^{M} f^m_gen(s_{m,j}, Q_m) / f^m_imp(s_{m,j}, Q_m), j = 1, ..., N.    (5.9)

We use the universal image quality index presented in [149] to represent the face image quality and the NIST-SNR as in [150] to represent the speech signal quality.

5.2.2 Existing Rank-Level Fusion Methods
In rank-level fusion, the ranked lists from different matchers are combined using a number of methods, such as the highest rank, Borda count, logistic regression, etc.

Highest Rank Fusion
In the highest rank method [147], the combined rank r_j of a user j is calculated by taking the lowest rank (r) assigned to that user by the different matchers. One of the shortcomings of the highest rank fusion is that it may produce the same final rank

for multiple users. The authors in [147] proposed to randomly break ties between different users. On the other hand, in [143] a perturbation factor ϵ was introduced to break ties:

r_j = min_{m=1}^{M} r_{m,j} + ϵ_j,    (5.10)

where

ϵ_j = (Σ_{m=1}^{M} r_{m,j}) / K.    (5.11)

The perturbation factor biases the fused rank by considering all the ranks associated with user j, assuming a large value for K.

Borda Count Rank Fusion
In the Borda count method [147], the fused rank is calculated by taking the sum of the ranks produced by the individual matchers for a user j. The Borda count method accounts for the variability in ranks due to the use of a large number of matchers. The major disadvantage of this method is that it assumes that all the matchers are statistically independent and perform equally well. In practice, a particular matcher may perform poorly due to various reasons, such as the quality of the probe data, the quality of the templates in the gallery, etc. In [143], a method which is also known as the Nanson function [151] was used to eliminate the worst rank for a user:

max_{m=1}^{M} r_{m,j} = 0.    (5.12)

This can be extended by eliminating the lowest rank k times before applying the Borda count on the remaining ranks. Another quality-based approach was proposed in the same paper [143] with the inclusion of an input image quality in the Borda count method as follows:

r_j = Σ_{m=1}^{M} Q_{m,j} · r_{m,j},    (5.13)

where Q_{m,j} = min(Q_m, Q_j), and Q_m and Q_j are the quality factors of the probe and gallery fingerprint impressions, respectively. A predictor-based approach was proposed in [152] which calculates the final rank for each user as the weighted sum of the individual ranks assigned by the M matchers. A higher weight is assigned to the ranks provided by the more accurate matcher:

r_j = Σ_{m=1}^{M} w_m · r_{m,j},    (5.14)

where w_m is the assigned weight for matcher m. An additional training phase was used for determining the weights. In [153], a non-linear approach of rank-level fusion was proposed for palm-print recognition. On the other hand, in [154] the ranks of only those identities which appear in at least two classifiers (face, ear and signature) were fused.
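For reference, the rules of Eqs. 5.10-5.14 can be sketched as follows, with R an M × N matrix of ranks (1 = best) and smaller fused values preferred; this is an illustrative sketch, not the evaluation code of this chapter.

```python
import numpy as np

def highest_rank(R, K=100.0):
    """Eqs. 5.10-5.11: best rank per user plus a perturbation term to break ties."""
    return R.min(axis=0) + R.sum(axis=0) / K

def borda_count(R):
    """Plain Borda count: sum of ranks over all matchers."""
    return R.sum(axis=0)

def borda_drop_worst(R):
    """Eq. 5.12 idea (Nanson-style): remove each user's worst rank before summing."""
    return R.sum(axis=0) - R.max(axis=0)

def weighted_borda(R, w):
    """Eq. 5.14: accuracy-weighted Borda count with per-matcher weights w of shape (M,);
    Eq. 5.13 is analogous but uses per-user quality factors as the weights."""
    return (np.asarray(w, dtype=float)[:, None] * R).sum(axis=0)

# In every rule above, the final identity is the argmin of the fused rank vector.
```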

Fig. 5.2 Variation of the match scores with the (a) speech and (b) face matcher confidence measures in the training dataset (T). Our proposed matcher confidence measure is able to separate the genuine scores from the impostor scores. For example, the difference between the genuine and impostor scores is high when our proposed matcher confidence measure is high, and vice versa. We calculate the C-ratio of a modality by normalizing the matcher confidence obtained at the evaluation phase (c^E_m) by the maximum value of the corresponding C^T_m.
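The matcher confidence plotted in Fig. 5.2 and the C-ratio mentioned in its caption are formalized in Section 5.3.1 below (Eqs. 5.15-5.18); as a preview, a small illustrative sketch of their computation from a list of distance-type scores is given here.

```python
import numpy as np

def matcher_confidence(scores, k=5):
    """Eqs. 5.15-5.17: normalized gap between the best (smallest) distance and the
    mean of the distances ranked 2..k; larger values indicate a more confident matcher."""
    s = np.sort(np.asarray(scores, dtype=float))    # ascending distances
    mu = s[1:k].mean()
    return abs(s[0] - mu) / mu

def c_ratio(test_scores, max_train_confidence, k=5):
    """Eq. 5.18: test-phase confidence normalized by the maximum training-phase confidence."""
    return matcher_confidence(test_scores, k) / max_train_confidence

def c_ratio_fuse(S, gammas):
    """Eq. 5.19: fused score vector f = S^T x, with x holding one C-ratio per matcher."""
    return np.asarray(S, dtype=float).T @ np.asarray(gammas, dtype=float)
```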

5.3 Proposed Fusion Framework

5.3.1 C-ratio Score Fusion
It is a well known fact that the difference between the genuine and impostor match scores is usually high under normal circumstances (e.g., clean conditions). In practice, noisy samples may be presented to the system and therefore the decision making task may become difficult for the matchers. We propose to set the confidence of a matcher as the normalized difference between the best match score and the mean of the k subsequent match scores. We obtain a two-column matrix S̃ by first sorting the score matrix S and then keeping the best matching score (i.e., column 1) and the mean of the k subsequent matching scores:

S̃ = [ s¹_1  µ_1
      s¹_2  µ_2
      ...
      s¹_M  µ_M ],    (5.15)

where

µ_m = (1 / (k − 1)) Σ_{n=2}^{k} sⁿ_m.    (5.16)

Then, the matcher confidence for modality m is calculated as

c_m = |s¹_m − µ_m| / µ_m.    (5.17)

Here, a higher value of c_m refers to a strong classification (i.e., clean probe data), and a smaller value of c_m refers to a weak classification (see Fig. 5.2). We propose a novel confidence-ratio (C-ratio) for a matcher m as follows:

γ_m = c^E_m / max(c^T_m),    (5.18)

where c^E_m is the matcher confidence for modality m obtained at the evaluation (E) phase and max(c^T_m) is the maximum matcher confidence for modality m from the training (T) phase. This approach requires that the most likely identity is assigned

the lowest matching score and the other identities get higher scores (e.g., Euclidean distances). If the match scores follow the opposite trend (e.g., likelihoods or probabilities), they must be inverted before our proposed matcher confidence and C-ratio can be applied. We then transform the matching score matrix (S) and perform fusion as follows:

f = Sᵀ x,    (5.19)

where x = [γ_1 ... γ_M]ᵀ is the transformation vector containing the C-ratios of all the matchers and f = [f_1 ... f_N]ᵀ is the fused score vector. The decision is ruled in favor of the template achieving the highest fused score.

5.3.2 Confidence-Based Rank-Level Fusion
In this section, we discuss the confidence-based approach in rank-level fusion. We propose a novel confidence factor to be used with the highest rank fusion rule. We also propose a modification to the Borda count rank fusion.

Confidence-Based Highest Rank Fusion
The confidence measures obtained by Eq. 5.17 can be consolidated into a confidence-based highest rank fusion rule as follows:

r_j = min_{m=1}^{M} r_{m,j} + η_j,    (5.20)

where the term η_j is the confidence factor, which can be calculated as follows:

η_j = (Σ_{m=1}^{M} max(c^T_m) · r_{m,j}) / (Σ_{m=1}^{M} r_{m,j}).    (5.21)

We use the novel confidence factor (η_j) so that the ranks produced by a more confident classifier get more emphasis. The denominator in Eq. 5.21 transforms the confidence factor for a user (j) into the range [0, 1]. Here, we analytically show how the use of the confidence factor η_j can handle ties in the highest rank fusion better than the modified highest rank fusion rule in Eq. 5.10. Let the ranks for a user (j = 1) from the two matchers of a multibiometric system be r_{1,1} = 1 and r_{2,1} = 2, while for another user (j = 2), let r_{1,2} = 2 and r_{2,2} = 1. By the modified highest rank fusion in Eq. 5.10, we obtain r_1 = 1.03 and r_2 = 1.03, when K = 100 as in [143]. On the other hand, let the confidence measure max(c^T_1) for one matcher be 0.3 and max(c^T_2) for the other matcher be 0.9. By using Eqs. 5.20 and 5.21, we get r_1 = 1 + ((0.3 × 1) + (0.9 × 2)) / (1 + 2) = 1.7 and r_2 = 1 + ((0.3 × 2) + (0.9 × 1)) / (1 + 2) = 1.5. Thus, not only

a tie between the final ranks of the users j = 1 and j = 2 is avoided, but also the ranking of the more confident classifier is emphasized.

Confidence-Based Borda Count Fusion
We propose to modify the Borda count method as follows:

r_j = Σ_{m=1}^{M} max(c^T_m) · r_{m,j}.    (5.22)

The proposed confidence-based Borda count fusion rule is indeed the numerator of Eq. 5.21 and is similar to the quality-based Borda count fusion in [143]. Here, instead of quality measures for the probe data, we propose to use confidence measures for the classifiers.

Fig. 5.3 Universal image quality index (Q_I) for a reference image (a) matched against a clean input image (b) from a different user (Q_I = 0.28), and against the same image (a) corrupted with AWGN of variance (c) σ² = 0.3, (d) σ² = 0.6 and (e) σ² = 0.9 (Q_I = 0.029), as well as (f) 25%, (g) 50% and (h) 75% salt and pepper noise.

5.4 Databases And Systems

5.4.1 AusTalk
In our experiments, we used a new audio-visual database, namely the AusTalk [1]. The AusTalk is a large collection of audio-visual data captured at several university campuses across Australia. We used the audio-visual data from 248 individuals, consisting of four-digit utterances recorded in twelve sessions. We used the data in the first six sessions for enrollment of the speaker models/templates and the data in the

98 74 A Late Fusion Framework For Audio-Visual Person Identification remaining six sessions as probes. We randomly selected half the users to be in the training set (T ) and the remaining half in the evaluation set (E). This process was repeated five times and therefore the recognition performances reported in this paper on the AusTalk are the averages of five test runs. We needed the training set (T ) to estimate the genuine/impostor score densities to implement LRT-GMM and QLR. We used the GMM fitting algorithm presented in [155] for density estimation. Since the AusTalk database consists of clean speech and videos recorded in room environment, we degraded the data using additive white Gaussian noise (AWGN) and salt and pepper noise. In Fig. 5.3(b), a value of the universal image quality index is shown for the face image from a person when it is matched with a reference image in Fig. 5.3(a). In Fig. 5.3(c-h), the values of the universal image quality are shown when the reference image is compared with itself, but corrupted at different levels of AWGN and salt and pepper noise VidTIMIT We also used the VidTIMIT database [81] to evaluate the performance of our prosed fusion framework. It comprises audio-visual data from 43 persons (19 females and 24 males) reciting short sentences in 3 sessions. There are 10 sentences per person, with the first six sentences captured in Session 1, the next two sentences in Session 2 and the remaining two in Session 3. In our experiments, we used Session 1 and Session 2 data for the enrollment of speaker models/templates and Session 3 data as probes. We randomly selected 21 speakers (9 females and 12 males) to be in the training set (T ) and the remaining 22 speakers in the evaluation set (E). This process was repeated five times; therefore, the recognition performances reported in this paper on the VidTIMIT database are the averages of five test runs. We used the same mechanisms for density estimation and data degradation as described in Section System We used the LRC-GMM-UBM and LRC-ROI-RAW frameworks that we previously used in our works in [46] and [47] as the matchers of the audio and visual modalities, respectively. The main concept is that the samples from a specific user lie on a linear subspace, and therefore the task of person identification is considered to be a linear regression problem [41]. In the LRC-GMM-UBM, a Universal Background Model (UBM) is trained using the MFCCs extracted from the enrollment data of all the speakers in the training set (T). Then, the enrollment data from an individual speaker is used to adapt a Gaussian Mixture Model (GMM) from the UBM. An adapted GMM from the UBM is also commonly known as

Finally, the means of all the components of a GMM-UBM are concatenated to form a supervector. Speaker-specific templates are created by stacking all the feature vectors (GMM-UBM mean supervectors) from the enrollment data. Similarly, in the LRC-ROI-RAW framework, user-specific templates are created by stacking the feature vectors obtained from down-sampled raw face images. In the test phase, a feature vector is first extracted from the probe data and a response vector is then predicted as a linear combination of the templates of each speaker stored in the gallery. Finally, the Euclidean distance between the test feature vector and a predicted response vector is used as a matching score.

5.5 Experiments, Results and Analysis

In this section, we present experimental results on the AusTalk and the VidTIMIT databases. We evaluated the robustness of our proposed fusion framework considering additive white Gaussian noise (AWGN) and salt and pepper noise on the face images as well as AWGN in the speech samples. We compared the performance of our fusion framework with the LRT-GMM, QLR and min-max normalized equal weighted sum (EWS) methods for score-level fusion. Then, we tested the robustness of our proposed framework for rank-level fusion on the AusTalk database with AWGN only. We compared the performance of our proposed confidence-based rank-level fusion (conbordacount and conhighestrank) with the Borda count (bordacount) and the highest rank (highestrank) fusion as well as the perturbation factor based highest rank (pfactorhighestrank) and the predictor based Borda count (predictorbasedborda) methods of rank-level fusion. The weights w_m in Eq. were computed using the probe data in the training set (T). The ratio between the number of correct identifications and the total number of probes [152], as determined by the matchers, was used as the weight. In our predictor-based experiments, the audio sub-system weight w_1 = 0.98 and the visual sub-system weight w_2 =

5.5.1 Robustness to AWGN

Additive white Gaussian noise is an important case study in the context of robustness because it models the detector noise of the imaging system [156]. The input face images were distorted by adding zero-mean Gaussian noise with three different error variances (see Fig. 5.3(c-e)). The speech samples were distorted by adding white noise at three different SNR levels. In Table 5.1 and Table 5.2, the rank-1 identification accuracies for the LRT-GMM in Eq. 5.8, QLR in Eq. 5.9, and our proposed confidence-based (C-ratio) score fusion in Eq. 5.19, as well as the EWS in Eq. 5.4 with min-max score normalization, are listed for the AusTalk and the VidTIMIT databases, respectively.

Table 5.1 Rank-1 identification (%) at various levels of additive white Gaussian noise on speech and face probes in AusTalk. For each noise level (speech SNR, face σ²) — (clean, 0.3), (clean, 0.6), (clean, 0.9), (30dB, clean), (20dB, clean), (10dB, clean), (30dB, 0.3), (30dB, 0.6), (20dB, 0.3), (20dB, 0.6), (20dB, 0.9) and the average — the accuracies of the speech and face matchers and of the LRT-GMM, QLR, EWS and C-ratio fusion methods are reported.

Table 5.2 Rank-1 identification (%) at various levels of additive white Gaussian noise on speech and face probes in VidTIMIT, with the same structure as Table 5.1.

On the AusTalk database, our proposed C-ratio score fusion outperforms the state-of-the-art density-based, quality-based and transformation-based fusion techniques, particularly when the probes from both modalities are degraded by AWGN. For example, in Table 5.1, when the speech signal is corrupted with AWGN at 30dB SNR and the face images with AWGN of σ² = 0.3, our proposed C-ratio score fusion achieves a rank-1 recognition accuracy of 96.39%, which is significantly higher than the state-of-the-art in score fusion. We achieved 75.15%, 81.18% and 91.56% rank-1 recognition rates with the LRT-GMM, QLR and EWS fusion methods, respectively, at the same noise level.

The overall rank-1 identification accuracy using our proposed C-ratio score fusion on the AusTalk is 87.67%, which is also at least 7.5% higher than any other method. Similarly, our proposed method outperforms the state-of-the-art score fusion techniques on the VidTIMIT database. The overall rank-1 identification accuracy (Table 5.2) obtained using our proposed C-ratio score fusion is 86.19%, which is slightly better than the rank-1 identification accuracy obtained using the EWS method and at least 5% higher than the accuracies achieved using the LRT-GMM and QLR methods.

Table 5.3 Rank-1 identification (%) at various levels of additive white Gaussian noise on the speech and salt and pepper noise on the face probes in AusTalk. For each noise level (speech SNR, fraction of corrupted pixels) — (clean, 0.25), (clean, 0.50), (clean, 0.75), (30dB, 0.25), (30dB, 0.50), (20dB, 0.25), (20dB, 0.50), (20dB, 0.75) and the average — the accuracies of the speech and face matchers and of the LRT-GMM, QLR, EWS and C-ratio fusion methods are reported.

Table 5.4 Rank-1 identification (%) at various levels of additive white Gaussian noise on the speech and salt and pepper noise on the face probes in VidTIMIT, with the same structure as Table 5.3.

5.5.2 Robustness to Salt and Pepper Noise

In the next set of experiments, we tested the robustness of our proposed confidence-based score fusion by considering input probes contaminated with data drop-out and snow in the image simultaneously, usually referred to as salt and pepper noise [157].

Fig. 5.4 CMC curves for our confidence-based rank fusion methods (conbordacount and conhighestrank) at different (audio, visual) noise levels — (a) (clean, 0.3), (b) (clean, 0.997), (c) (30dB, clean), (d) (10dB, clean), (e) (30dB, 0.6) and (f) (20dB, 0.9) — compared against the Borda count (bordacount), the highest rank (highestrank), the perturbation-factor highest rank (pfactorhighestrank) and the predictor-based Borda count (predictorbasedborda) methods. Each panel plots the identification rate (%) against the rank.

This type of noise can be caused by analog-to-digital converter errors and bit errors in transmission [158]. In Table 5.3, a summary of the rank-1 identification accuracies on the AusTalk database is presented. It shows that the proposed C-ratio score fusion also performs better than the state-of-the-art score fusion methods when salt and pepper noise was considered for the face images and AWGN for the speech signals. For example, when the speech signal is corrupted with AWGN at 30dB SNR and 25% of the pixels of the face image are assumed to be contaminated, our proposed C-ratio fusion achieves 99.06% rank-1 accuracy. The overall rank-1 identification accuracy using the proposed C-ratio score fusion is at least 4.8% higher than any other method. Similarly, our proposed C-ratio score fusion outperforms the state-of-the-art on the VidTIMIT database. The overall rank-1 identification accuracy (Table 5.4) using our proposed C-ratio score fusion on VidTIMIT is 84.09%, which is at least 1.5% higher than the accuracies obtained using the EWS, LRT-GMM, and QLR fusion methods.

5.5.3 Rank-level Fusion With AWGN

We also performed experiments on rank-level fusion under various levels of audio and visual degradations on the AusTalk database. In Fig. 5.4(a-f), the Cumulative Match Characteristics (CMC) curves for different (audio, visual) noise levels are shown. Our proposed confidence-based rank fusion approach achieved better rank-1 identification rates than the state-of-the-art highest rank fusion approaches. For example, in Fig. 5.4(a) the CMC curve for clean speech and slightly corrupted face images is shown. Our proposed confidence-based highest rank fusion achieved a 99% rank-1 identification accuracy, which is 2.5% higher than the conventional highest rank fusion (highestrank) and slightly higher than the modified highest rank (pfactorhighestrank) approach in [143]. Fig. 5.4(b-d) shows the CMC curves at other (audio, visual) noise levels. In all settings, the rank-1 recognition rate obtained using the confidence-based highest rank fusion was higher than that of the highest rank (highestrank) and the modified highest rank (pfactorhighestrank) methods. In Fig. 5.4(e-f), we show the CMC curves of our proposed rank fusion approach considering that the data from both modalities are degraded. Although the confidence-based rank fusion approach outperforms the conventional highest rank fusion (highestrank) on both occasions, its performance is almost equal at the (30dB, 0.6) noise level and slightly worse at the (20dB, 0.9) noise level when compared with the modified highest rank fusion (pfactorhighestrank) approach. On the other hand, the performance improvement obtained by using the confidence-based Borda count method was higher for all rank levels (rank-1 to rank-10). Therefore, the confidence-based rank-level fusion clearly improves the recognition accuracy.
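To make the rank-level rules concrete, the following minimal Python/NumPy sketch implements a confidence-weighted Borda count in the spirit of Eq. 5.22. The way the per-matcher scores and confidences are produced, and all function and variable names, are illustrative assumptions rather than the implementation used in our experiments.

```python
# Illustrative sketch (not the thesis code): confidence-based Borda count fusion.
# We assume each matcher m supplies a score vector over the N enrolled users and a
# confidence vector c_m derived from those scores; max(c_m) plays the role of the
# confidence factor in Eq. 5.22. Higher scores are assumed to mean better matches.
import numpy as np

def ranks_from_scores(scores):
    """Rank users so that the best-matching user receives the highest rank value N."""
    order = np.argsort(scores)                 # ascending: worst ... best
    ranks = np.empty_like(order, dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    return ranks

def confidence_borda_count(score_list, confidence_list):
    """Fuse M matchers: weight each matcher's Borda ranks by its peak confidence."""
    fused = np.zeros_like(score_list[0], dtype=float)
    for scores, conf in zip(score_list, confidence_list):
        fused += np.max(conf) * ranks_from_scores(scores)
    return fused                               # the identity with the largest fused value wins

# Toy usage with two matchers (audio, visual) over N = 4 users:
audio_scores  = np.array([0.10, 0.70, 0.15, 0.05])
visual_scores = np.array([0.20, 0.30, 0.35, 0.15])
audio_conf, visual_conf = np.array([0.9, 0.2, 0.1, 0.1]), np.array([0.4, 0.4, 0.3, 0.2])
fused = confidence_borda_count([audio_scores, visual_scores], [audio_conf, visual_conf])
print("rank-1 identity:", int(np.argmax(fused)))
```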

Another interesting observation is that the predictor-based Borda count method [152] does not improve the recognition performance if there is noise in the probe data, because the predictor-based method uses fixed weights for the matchers.

5.6 Conclusions

We have presented a confidence-based late fusion framework and its application to audio-visual biometrics. We showed that the matcher confidence can be calculated from the output match scores and that a novel C-ratio can be calculated for transforming the match scores before they are fused. We have also proposed a novel confidence factor that can successfully break the ties in the highest rank fusion. Finally, experimental results have been presented which give a clear indication that confidence-based fusion can be considered a robust and accurate fusion method for biometric systems operating in the identification mode.

Chapter 6

Audio-visual person recognition using deep neural networks and a novel reliability-based fusion

Abstract

We present an audio-visual person recognition system using deep neural networks (DNNs) initialized as deep Boltzmann machines, referred to as DBM-DNNs. We show that lower error rates can be achieved with DBM-DNNs compared to the state-of-the-art linear regression-based classifier (LRC), support vector machine (SVM), DNN initialized as a deep belief network (DBN-DNN) and a simple baseline DNN approach. We use reliability-based classifier fusion at the score level. We present a novel reliability-to-weight mapping function that can be used with different reliability measures such as the entropy of posteriors and the C-ratio. Our proposed mapping function achieves a better accuracy than the state-of-the-art in both clean and noisy conditions. Our bi-modal verification experiments on the MOBIO dataset show that our proposed fusion achieves competitive error rates compared to the linear logistic regression (LLR) fusion. A similar trend is also observed in our experiments on the VidTIMIT database.

This article is under the second round of review for publication in the Pattern Recognition Letters journal, 2016.

6.1 Introduction

A person recognition system recognizes persons by comparing their physiological [18] or behavioural [159] representations with the pre-stored model(s). In the identification mode, the system produces an ordered list of identities and accepts the highest ranked identity as the winner. In the verification mode, a test sample and a claim of an identity are presented to the system.

Fig. 6.1 (a) left: Boltzmann machine (BM); right: Restricted Boltzmann Machine (RBM). (b) left: Deep Boltzmann Machine (DBM); right: Deep Belief Network (DBN).

A claim is accepted only if the score corresponding to the model of the claimed identity is above an empirically pre-determined threshold. A biometric system may also be characterized as unimodal or multi-modal depending on the number of modalities used. An audio-visual biometric system is a multi-modal system that makes decisions based on people's facial images and speech data [47]. The emergence of smart devices makes such systems more user friendly, cost effective, and easily deployable in real life applications.

Recent advancements in computer hardware and machine learning algorithms have triggered interest in developing biometric systems based on deep artificial neural networks. A deep neural network (DNN) is defined as a feed-forward artificial neural network which has multiple layers of hidden units between its inputs and outputs. Such networks can be discriminatively trained by back-propagating the derivative of the mismatch between the target outputs and the actual outputs [33]. In the training phase, the initial weights of a DNN can be set to small random values. However, a better way of initialization is to generatively pre-train them as a DBN or a DBM and then fine-tune them using a set of labelled samples from the target subjects [118]. Undirected models such as restricted Boltzmann machines (RBMs) are ideal for layer-wise pre-training [33]. An RBM (Fig. 6.1a, right panel) is a type of Markov random field (MRF) with a bipartite connectivity graph, no sharing of weights between different units, and a subset of unobserved variables.

Multiple RBMs can be stacked to produce a single multilayer generative model called a Deep Belief Network (DBN) [160]. In a DBN, the top two layers are undirected but the lower layers have top-down directed connections (see Fig. 6.1b, right panel). A DNN which is pre-trained generatively as a DBN is referred to as a DBN-DNN [33]. On the other hand, a Deep Boltzmann Machine (DBM) is a recently introduced variant of Boltzmann machines [119]. A DBM differs from a DBN in that every layer in a DBM has undirected connections. When a DNN is pre-trained generatively as a DBM, it can be referred to as a DBM-DNN.

Recently, learning deep networks for tasks such as speech, vision and language processing has gained in popularity [34]. A number of speaker recognition techniques have also been proposed, for example, the use of RBMs in [35] and [36] or of generative DBNs in [2] and [37]. DBMs have recently been used to extract joint representations for tasks such as image retrieval [40] and face modelling [39]. In biometrics, learning joint representations may be interpreted as a way of feature-level fusion, which requires synchronization between the input modalities. Extracting such representations also becomes difficult when the number of modalities increases [137]. Hence, we focus on systems that separately learn a model for each modality. For example, in [119] and [161], DBMs with binary-valued inputs were used for handwritten text recognition. In the same paper, an object recognition task was also presented for natural images (integer grey-scale values), which were converted to binary-valued images using an RBM with Gaussian visible and binary hidden units. However, a Gaussian-Bernoulli deep Boltzmann machine (GDBM) can also be used with real-valued inputs [162]. We used GDBMs for pre-training the parameters of DNNs for audio-visual person recognition in [115] and [116]. It was shown that the DBM-DNNs can be used with different types of hand-crafted features such as Local Binary Patterns (LBP) [65], the Gaussian Mean Supervector (GMS) [61] and i-vectors. However, no direct comparison was made between the DBM-DNN, a DNN initialized with small random values, and the DBN-DNN. The performance of the DBM-DNN was also not evaluated in noisy conditions. Moreover, an adaptively weighted fusion strategy was not considered.

In a real life scenario, the quality of the data may vary due to variations in background, pose and illumination. It is therefore necessary to adapt the modality weights in fusion based on a reliability measure. A reliability-based fusion strategy is one which adjusts the contribution of each modality based on a reliability measure commonly calculated from the matching scores [32], for example, the dispersion of log-likelihoods [70], the entropy of the posteriors [71], or the C-ratio [47]. For classifiers which produce posterior probabilities as outputs, the entropy of posteriors has been used as a reliability measure along with a reliability-to-weight mapping function [163]. Since the DBM-DNN, the baseline DNN, and the DBN-DNN evaluated in this paper are probabilistic classifiers, we use entropy as a reliability measure.

In [44], a bias problem of the existing entropy-based reliability-to-weight mapping functions was discussed. For example, the negative entropy mapping has a bias towards low entropies, meaning that when the entropy of one modality is close to the maximum value (= log_2 N), the mapping assigns the weight of the other modality to a value that is close to 1, even when the other entropy is also close to the maximum. In this paper, we present a novel reliability-to-weight mapping function which overcomes such bias and can be used with different reliability measures.

In summary, the contributions of this paper include: 1) The design and training of the DBM-DNN for audio-visual person recognition (Sections 6.3.1 and 6.3.2). To the best of our knowledge, this is the first time DBM-DNNs have been used for audio-visual person recognition. 2) The introduction of a novel reliability-to-weight mapping function (Section 6.3.4). This has been shown to contribute to the performance boost of our system in both clean and noisy conditions. 3) Extensive experiments which show that a higher accuracy can be achieved with our DBM-DNN and the proposed fusion compared to the state-of-the-art (Section 6.5).

The rest of this paper is organized as follows. In Section 6.2, a brief overview of the existing audio-visual biometric systems and the reliability-to-weight mapping functions is presented. Our DBM-DNNs are presented in Section 6.3. Then, we describe the setup of our experiments in Section 6.4. Experimental results are presented in Section 6.5. We conclude the paper in Section 6.6.

6.2 Background

In recent years, several approaches have been proposed for audio-visual person recognition. The entropy of posteriors is commonly used as a reliability measure for systems that produce posterior probabilities as scores. In this section, we present a brief overview of the recent approaches in audio-visual person recognition and reliability-to-weight mapping. We also present the theoretical background of the DBM.

6.2.1 Audio-visual biometrics

A linear regression-based classifier (LRC) [41] was used for audio-visual person identification in [46] and [47]. Class-specific models were created by stacking features extracted from the speech and facial image data. Classification was performed by considering the task of person identification as a linear regression problem. In [9], the evaluation of different session variability modelling techniques for audio-visual person verification was presented. It was shown that the inter-session variability (ISV) modelling using Gaussian mixture models provides a consistently robust system compared to the joint factor analysis (JFA) and the total variability modelling (TVM) [43]. A co-training approach that used semi-supervised machine learning was presented in [124]. Their approach used both labelled and unlabelled data to capture large inter-session variations and to learn discriminative subspaces. In [115], DBM-DNNs were used with LBP and GMS features extracted from the facial images and speech signals, respectively. In another paper [116], TVM was used to extract i-vectors from both modalities. It was also shown that three-layer DBM-DNNs with 800 units in both hidden layers achieved a better accuracy than other DBM setups. However, a comparative analysis of how the pre-training phase helps to improve the accuracy of the system was missing in [115] and [116]. Moreover, the simple and standard sum fusion rule was used in both papers to combine the modalities at the score level. In this paper, we present extensive experiments (Section 6.5) using the DBM-DNNs and compare our results with the state-of-the-art support vector machine (SVM) [104], LRC [41], DBN-DNN, and a simple baseline DNN which is initialized with small random values.

6.2.2 Reliability-to-weight mapping

Information from multiple modalities can be fused at different stages of recognition (e.g., data level, feature level, score/rank level). Fusion at the score/rank level is the most commonly used approach because the matching scores or the ranked list of identities are easily available and can be combined using simple rules (e.g., sum and product rules) [20]. When combining multiple classifiers, a reliability measure is commonly used to adjust the modality weights. A number of reliability measures have been used in the literature [32]. For systems which produce posterior probabilities as scores, the entropy of posteriors is commonly used as a reliability measure [71]. The entropy of the posterior distribution from a modality m is given by

$$H_m = -\sum_{i=1}^{N} p_i \log_2 p_i, \qquad (6.1)$$

where N is the number of target subjects and p_i is the probability assigned to the i-th subject. Several mapping functions have been used to directly obtain modality weights from the entropy, for example, the negative or the inverse mapping functions in [163], which are given by (for a bi-modal case):

$$w_A = \frac{H_{max} - H_A}{\sum_m (H_{max} - H_m)} \qquad (6.2)$$

and

$$w_A = \frac{1/H_A}{\sum_m 1/H_m}, \qquad (6.3)$$

respectively, where w_A is the weight assigned to modality A, and H_max (= log_2 N) is the maximum entropy for N target subjects. The weight of the other modality V is obtained as w_V = 1 − w_A. These mapping functions have a bias towards high and low values of the entropy, respectively [44]. For example, when the entropy of a modality is close to H_max, the negative entropy method assigns the weight of the other modality to a value which is close to 1, irrespective of the entropy of that modality. On the other hand, the inverse mapping function assigns the weight of one modality to a value that is close to 1 when its entropy is close to 0, irrespective of the entropy of the other modality. Such a bias is undesirable in a real life scenario where both modalities can be corrupted due to variations in background, pose and illumination. To avoid any bias as mentioned above, the authors in [44] proposed a mapping on a 3D plane:

$$w_A = \frac{H_V - H_A}{2H_{max}} + \frac{1}{2}, \qquad (6.4)$$

but its performance was almost equivalent to that of the negative entropy mapping function. Hence, the authors argue in [44] and [163] that the mapping should only be sensitive to a certain range of entropy values. They therefore proposed to maintain a histogram of past entropy values (e.g., 15 histogram bins comprising 300 past entropy values from two modalities) and a piecewise-linear mapping based on that histogram. However, the results in [44] and [163] show no significant performance improvement using the piecewise-linear mapping function. In this paper, we present a novel reliability-to-weight mapping function (see Section 6.3.4 for details) which overcomes the bias and can also be used with other reliability measures such as the C-ratio presented by us in [47].

6.3 Methodology

This section presents the steps involved in classification using a DBM-DNN. There are four steps in the process: 1) unsupervised training of the DBMs, 2) supervised fine-tuning of the DBM-DNNs, 3) decision making, and 4) fusion. In the fusion phase, we propose to use a novel reliability-to-weight mapping function that overcomes the bias issue previously mentioned in Section 6.2.2 and can be used with other reliability measures.

6.3.1 Unsupervised training of the DBMs

We use DBMs to model each modality using a set of unlabelled data samples. It is not trivial to start the training from a random set of parameters [119], [122]. Hence, an algorithm that greedily pre-trains each layer of a DBM was presented in [119].

Fig. 6.2 Unsupervised training of the DBMs. The dimension of the inputs to DBM^LBP_face (left) is 64 × 58 = 3712 (i.e., 58 LBP patterns are extracted from each 8 × 8 block of an input image and then concatenated). The dimension of the inputs to DBM^GMS_speech (right) is 39 × c, where c is the number of Gaussian mixture components; the GMS is obtained by MAP adaptation of a GMM to the MFCCs of the speech segments.

They used a special trick to form a DBM from a stack of modified RBMs. For the first RBM, they made two copies of the visible units and tied the weights to each copy. Conversely, for the RBM on top, they made two copies of the hidden units and tied the weights to each copy. In both cases, it is not required that each copy has the same state vector. If there are more than two hidden layers in the DBM, the weight matrix learned by an RBM for each intermediate layer is divided by two. When these three types of RBMs are composed to form a single system, the conditional distributions defined by the composite model become exactly the same as the conditional distributions defined by a DBM. In [122], an alternative approach to DBM pre-training was presented that consists of two stages: a) obtaining an approximate posterior distribution over the hidden units and b) maximizing the variational lower bound given the fixed hidden posterior distributions. In this paper, we use the former approach for pre-training the DBMs.

We use the pre-trained parameters as the starting point of the unsupervised training of the DBMs. The parameters are updated using the stochastic maximization approach. We train a DBM for each modality, DBM^LBP_face and DBM^GMS_speech (see Fig. 6.2), where LBP and GMS represent the features extracted from the facial image and speech, respectively. A detailed description of these features and the other feature used in this paper (the i-vector) is presented in Section 6.4.

6.3.2 Supervised fine-tuning of the DBM-DNNs

When the DBMs for both the face and speech modalities are trained, we use their learned parameters to initialize the corresponding DNNs, which are referred to as DBM-DNNs. Then, a softmax layer is added on top of each DBM-DNN. The networks are discriminatively fine-tuned with the standard back-propagation algorithm and a set of labelled samples from the target subjects. Thus, each unit and its value in the softmax layer represent a target subject and a posterior probability for that subject, respectively.

6.3.3 Decision making

When a set of test samples is presented, they are clamped to the visible layers of the DBM-DNNs. The resultant softmax layers of the DBM-DNNs are used as scores. In the verification scenario, the output layer of a DBM-DNN is taken into consideration for obtaining the genuine and impostor scores. For example, assuming N nodes in the output layer of a DBM-DNN corresponding to the N possible target subjects and a claim of identity j, the value of the j-th node of the softmax layer represents a genuine score. The values of the remaining nodes are treated as impostor scores.

In the identification scenario, the node with the maximum value is declared the winner (rank-1 identification).

6.3.4 Fusion

For a given input, each DBM-DNN produces posterior probabilities at its output. We can combine the DBM-DNNs at the score level using the following adaptive sum rule of fusion:

$$f_j = \sum_m w_m \cdot p_m(v_m, j), \qquad (6.5)$$

where m is the modality index (in our case, m = 1 for the audio and m = 2 for the visual modality), w_m is the weight assigned to modality m, and p_m(v_m, j) represents the probability of the input v_m belonging to person j (i.e., the value assigned by the j-th node of a DBM-DNN for v_m). In a real life scenario, the inputs may contain significant variations in lighting, background and pose. In [32], we showed how the quality of the input data can affect the score distributions. Therefore, we argued that the weights assigned to the modalities should be adjusted based on a reliability measure, which can be calculated from the output scores. The entropy of the posteriors gives a measure of the variation in the matching scores. For example, a low entropy value suggests that there is a peak in the score distribution (i.e., the classifier is able to clearly distinguish the genuine target from the impostor targets), while a high entropy value suggests a flat score distribution (i.e., the classifier cannot clearly distinguish the genuine target from the impostor targets). Several mapping functions have been used to directly obtain the modality weights from the entropy. These mapping functions, however, have a bias issue, and in [44] it was argued that a mapping function should be sensitive to some intervals of entropy values. We propose a mapping function which is not biased but is sensitive to a certain interval of reliability values. Our proposed reliability-to-weight mapping function is given by:

$$w_m = e^{\,k\left(1 - \frac{H_m - \hat{H}}{H_{max}}\right)}, \qquad (6.6)$$

where

$$\hat{H} = \frac{H_1 + H_2 + \cdots + H_M}{M}, \qquad (6.7)$$

H_m is the reliability measure for modality m, M is the total number of modalities and k is a constant which is empirically determined. The constant k determines the interval centered at Ĥ where the mapping function is linear (sensitive). The proposed mapping function is not sensitive to values outside this interval (see Fig. 6.3).
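A minimal sketch of this fusion step is given below, assuming posterior vectors from two classifiers. The analytic form used for Eq. 6.6 follows the expression given above, while the value of k and the final weight normalisation are illustrative assumptions rather than the exact configuration used in our experiments.

```python
# Minimal sketch of the reliability-based score fusion of Eqs. (6.1) and (6.5)-(6.7).
# Function names, the constant k and the weight normalisation are illustrative only.
import numpy as np

def posterior_entropy(p, eps=1e-12):
    """Eq. (6.1): Shannon entropy (bits) of one classifier's posterior distribution."""
    p = np.clip(p, eps, 1.0)
    return -np.sum(p * np.log2(p))

def reliability_weights(entropies, n_subjects, k=5.0):
    """Eqs. (6.6)-(6.7): map each modality's entropy to a fusion weight."""
    h_max = np.log2(n_subjects)
    h_hat = np.mean(entropies)                            # Eq. (6.7)
    w = np.exp(k * (1.0 - (entropies - h_hat) / h_max))   # Eq. (6.6), as written above
    return w / np.sum(w)                                  # normalisation is our addition

def fuse(posteriors, k=5.0):
    """Eq. (6.5): adaptive weighted sum of the per-modality posteriors."""
    posteriors = np.asarray(posteriors)                   # shape (M, N): M modalities, N subjects
    H = np.array([posterior_entropy(p) for p in posteriors])
    w = reliability_weights(H, posteriors.shape[1], k)
    return w @ posteriors                                 # fused score f_j for each subject j

# Toy bi-modal example with N = 4 target subjects:
audio_post  = np.array([0.70, 0.10, 0.10, 0.10])          # peaked -> low entropy -> reliable
visual_post = np.array([0.26, 0.25, 0.25, 0.24])          # flat   -> high entropy -> unreliable
print("identified subject:", int(np.argmax(fuse([audio_post, visual_post]))))
```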

Fig. 6.3 Reliability-to-weight mapping on MOBIO using (a) the proposed, (b) the negative, and (c) the inverse mapping. Each panel plots the speech weight W_speech as a function of the face and speech entropies, H_face and H_speech.
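To complement the description of the training pipeline in Sections 6.3.1-6.3.2, the following minimal PyTorch-style sketch illustrates how a DNN can be initialised from generatively pre-trained weight matrices and discriminatively fine-tuned with back-propagation. The layer sizes, optimiser settings and all names are illustrative assumptions, not the configuration used in this chapter (which was implemented with the Deepmat toolbox, as described in Section 6.4).

```python
# Sketch only: fine-tuning a DBM-DNN-style classifier from pre-trained weights W1, W2
# (e.g. a 3712 -> 800 -> 800 face network), with a softmax layer added on top.
import torch
import torch.nn as nn

class DBMDNN(nn.Module):
    def __init__(self, W1, b1, W2, b2, n_subjects):
        super().__init__()
        self.h1 = nn.Linear(W1.shape[1], W1.shape[0])
        self.h2 = nn.Linear(W2.shape[1], W2.shape[0])
        self.out = nn.Linear(W2.shape[0], n_subjects)      # added softmax (output) layer
        with torch.no_grad():                               # copy the generatively pre-trained parameters
            self.h1.weight.copy_(W1); self.h1.bias.copy_(b1)
            self.h2.weight.copy_(W2); self.h2.bias.copy_(b2)

    def forward(self, x):
        x = torch.sigmoid(self.h1(x))
        x = torch.sigmoid(self.h2(x))
        return self.out(x)                                  # logits; softmax is applied inside the loss

def fine_tune(model, loader, epochs=25, lr=1e-3):
    """Standard back-propagation with labelled samples of the target subjects."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for features, labels in loader:
            opt.zero_grad()
            loss_fn(model(features), labels).backward()
            opt.step()
```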

6.3.5 Evaluation criteria

The final score vector containing the fused scores for a given set of test samples is obtained as f = [f_1, f_2, ..., f_N]. An identity claim j is accepted if the j-th fused score (f_j) is greater than an empirically determined threshold τ. To measure the verification performance of our recognition system on MOBIO (Section 6.5.1), we use the Half Total Error Rate (HTER) and the Detection Error Trade-off (DET) plots as in [84]. The HTER is calculated as the average of the False Positive Rate (FPR) and the False Negative Rate (FNR) at τ on the evaluation partition (EVAL) of the dataset:

$$HTER = \frac{FPR(\tau, EVAL) + FNR(\tau, EVAL)}{2}. \qquad (6.8)$$

Here, τ is defined on the development data as the intersection point of the FPR and FNR. This intersection point is termed the Equal Error Rate (EER). The FPRs and FNRs are calculated as:

$$FPR = \frac{\#\,\text{of false accepts}}{N_a + N_r} \times 100, \qquad FNR = \frac{\#\,\text{of false rejects}}{N_a + N_r} \times 100, \qquad (6.9)$$

where N_a and N_r are the total numbers of trials for the true and impostor clients, respectively. Finally, the DET plots outline the FPR versus the FNR on the evaluation set. For our experiments on VidTIMIT (Section 6.5.2), we used Eq. (6.9) to calculate the EER.

6.4 Experimental setup

In this section, we discuss the preprocessing of the data, the feature extraction, and the tools used in our experiments. We extracted a single face image from a set of randomly selected frames of a video. The locations of the eyes were detected using the Viola-Jones algorithm [52] and these locations were used to extract a face image of size 64 × 64 pixels. The face images were photometrically normalized using the Tan-Triggs algorithm in [59]. We also performed speech enhancement and voice activity detection using the VOICEBOX toolbox [51]. Speech frames were extracted from the silence-removed speech signal with a window size of 20ms and a frame shift of 10ms. Then, 12 cepstral coefficients were extracted from each frame and augmented with the log energy to form a 13-dimensional static feature vector. The delta and acceleration of the static features were appended to form the final 39-dimensional (13 static + 13 delta + 13 acceleration) mel frequency cepstral coefficients (MFCCs) feature vector.
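As an illustration of this front-end, the following sketch extracts 39-dimensional MFCC features with the stated frame settings. It uses the librosa library as a stand-in (the speech enhancement and voice activity detection performed with VOICEBOX in our experiments are omitted), and the use of the 0th cepstral coefficient in place of the log-energy term is our simplification.

```python
# Illustrative 39-dimensional MFCC front-end (13 static + 13 delta + 13 acceleration),
# with a 20 ms analysis window and a 10 ms frame shift. Not the thesis implementation.
import numpy as np
import librosa

def extract_mfcc_39(wav_path):
    y, sr = librosa.load(wav_path, sr=None)
    n_fft = int(0.020 * sr)                  # 20 ms window
    hop = int(0.010 * sr)                    # 10 ms frame shift
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=n_fft, hop_length=hop)
    delta = librosa.feature.delta(mfcc)       # first-order (delta) coefficients
    delta2 = librosa.feature.delta(mfcc, order=2)  # second-order (acceleration) coefficients
    return np.vstack([mfcc, delta, delta2]).T  # shape (n_frames, 39)
```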

Local binary patterns (LBP) were extracted from each block of 8 × 8 pixels of the preprocessed face images. We extracted 58 uniform patterns from each block, resulting in a (64 × 58) = 3712-dimensional visual feature vector for a face image, obtained by concatenating the patterns from each block. Since the input dimensionality of the DBMs is fixed, we extracted fixed-length representations (e.g., GMS and i-vectors) from the speech signals. A Gaussian mean supervector (GMS) was formed by concatenating the means of the GMM components that were obtained by adapting the means of a universal background model (UBM) [104]. Thus, we extracted fixed-length speech feature vectors from the speech samples. The dimensionality of a GMS was 39 × c, where c is the number of UBM components. A large value of c would result in over-fitting, whereas a small value would generate a poor model of the data. Therefore, the value of c was empirically determined to best fit the size of the dataset.

We also used i-vectors in our experiments, extracted using the TVM method. First, blocks from the face images were extracted using an exhaustive overlap, which resulted in 53 × 53 = 2809 blocks per face image. The pixel values of each block were normalized to zero mean and unit variance. Then, 44 2D-DCT coefficients [67] were extracted from each image block, excluding the zero-frequency coefficient. The resultant 2D-DCT vectors were then normalized in each dimension to zero mean and unit variance with respect to the other feature vectors. The total variability matrix, T, was learned by maximizing the likelihood over the background data as in [113]. Finally, 400-dimensional i-vectors were extracted using the T matrix. Similarly, we learned a T matrix for the speech modality, which was used to extract 400-dimensional i-vectors from the speech signals.

We implemented the DBM-DNNs using the Deepmat toolbox [164]. In the unsupervised training of the DBMs, the parameters were updated using Contrastive Divergence (CD), an adaptive learning rate and an enhanced gradient method [165]. The learning rate was set to . In each step, the model was trained for 25 epochs and with mini-batch sizes of 100 and 24 for our experiments on the MOBIO and VidTIMIT databases, respectively. In [116], it was shown that DBMs with 2 hidden layers of 800 units each constitute an optimal setup. We therefore used the same setup for our experiments. We used the MSR Identity Toolbox [166] to build the UBM (used for the GMS and i-vector extraction), to apply TVM for i-vector extraction, and to calculate the FPR and FNR for performance evaluation.
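A hedged sketch of the GMS extraction described above is given next: a UBM is fitted to background MFCC frames, its means are MAP-adapted to one utterance and then concatenated into a 39 × c supervector. The scikit-learn implementation, the relevance factor and all names are illustrative; our experiments used the MSR Identity Toolbox for the UBM.

```python
# Illustrative GMS extraction: UBM training plus relevance-MAP adaptation of the means.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(background_frames, c=64):
    """background_frames: (n_frames, 39) MFCCs pooled over the background speakers."""
    ubm = GaussianMixture(n_components=c, covariance_type='diag', max_iter=100)
    ubm.fit(background_frames)
    return ubm

def gaussian_mean_supervector(ubm, frames, relevance=16.0):
    """MAP-adapt the UBM means to one utterance and concatenate them (length 39 * c)."""
    resp = ubm.predict_proba(frames)                 # (n_frames, c) occupation probabilities
    n_k = resp.sum(axis=0)                           # zeroth-order statistics
    f_k = resp.T @ frames                            # first-order statistics, (c, 39)
    ex = f_k / np.maximum(n_k, 1e-8)[:, None]        # per-component data means
    alpha = (n_k / (n_k + relevance))[:, None]       # data-dependent adaptation coefficient
    adapted_means = alpha * ex + (1.0 - alpha) * ubm.means_
    return adapted_means.reshape(-1)
```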

6.5 Experimental results

Extensive experiments were carried out on the publicly available MOBIO and VidTIMIT databases. In this section, we briefly introduce both databases and the protocols used in our experiments. Experimental results with discussions are also presented.

6.5.1 MOBIO database

MOBIO [84] is a collection of videos (with speech) captured using a laptop computer and a Nokia N90 mobile phone. The database contains audio-visual data from 150 subjects (50 females and 100 males) captured in 12 sessions. Each session contains a number of videos. Because the data was captured using mobile devices, there are significant variations in pose, background and illumination in the database. We evaluated the DBM-DNN and the proposed fusion technique for both the identification and verification tasks.

Identification

In our identification experiments, we randomly selected 100 subjects as targets and the remaining 50 subjects as non-targets. Unlabelled data from the non-target subjects was used in the unsupervised training of the DBMs (see Section 6.3.1), while the labelled data from the target subjects was used for the supervised fine-tuning of the DBM-DNNs (see Section 6.3.2). In each iteration of our identification experiment, we randomly divided the data into two equal groups of sessions. This resulted in 6 sessions for each group. Then, five samples from each session were randomly selected. We used one of the groups (of sessions) for training and the other for testing. Therefore, for each subject, we used 30 samples for training and 30 samples for testing. Hence, the overall size of each of the training and test sets was (100 × 30 =) 3000 samples. In the unsupervised training phase, we used a set of 9600 unlabelled samples (192 each from the 50 non-targets). The experiments were repeated five times and the average identification accuracy is reported (Table 6.1).

Table 6.1 Rank-1 identification accuracy (%) on MOBIO for the face and speech modalities and for entropy-based fusion with the negative (negent), inverse (invent), 3D-plane and proposed mapping functions, reported for the SVM, LRC, DNN, DBN-DNN and DBM-DNN classifiers.

In Table 6.1, we report the rank-1 identification accuracy when a DNN is initialized with small random values (DNN), pre-trained as a DBN (DBN-DNN) and pre-trained as a DBM (DBM-DNN). We also list the identification accuracies of the state-of-the-art LRC and SVM classifiers for the purpose of comparison.

Rank-1 identification accuracies of 90.34% and 92.30% were achieved using the DBM-DNNs for the face and speech modalities, respectively. The results presented in Table 6.1 clearly show that pre-training the DNN helps to achieve higher identification accuracies compared to random initialization. It is also evident from the results that pre-training the DNN as a DBM achieved a better identification accuracy than pre-training it as a DBN. This is because a DBM has undirected connections between the layers, which helps it to learn a good model of the underlying data. Our DBM-DNN also achieved a better accuracy compared to the state-of-the-art LRC and SVM classifiers for the face modality, and comparable results were obtained for the speech modality. Hence, our DBM-DNN can be used for the person identification task, achieving a higher accuracy than the state-of-the-art.

In Table 6.1, we also list the rank-1 identification accuracies of the fused system using our proposed fusion. The combined identification accuracy achieved using the DBM-DNN and our proposed mapping function was 99.23%. This is slightly better than the negative entropy (Eq. 6.2), inverse entropy (Eq. 6.3) and mapping on a 3D plane (Eq. 6.4). We also show that the proposed reliability-to-weight mapping function can be used with the state-of-the-art LRC and SVM classifiers even though they do not produce posterior probabilities. In such cases, we propose to use the C-ratio as a reliability measure. Besides the higher accuracy, another motivation behind our proposed mapping function is its ability to overcome the bias issue of the existing functions. Fig. 6.3 compares the bias of the different mapping functions. It is evident that our proposed mapping function is not sensitive to high/low values of the entropy of a particular modality.

Table 6.2 Rank-1 identification accuracy (%) on MOBIO using the DBM-DNN when the face modality is corrupted by different levels of JPEG compression noise but the speech modality is clean. For each of three JPEG quality factors, the face-only accuracy and the entropy-based fusion accuracies using the negent, invent, 3D-plane and proposed mapping functions are reported.

In Table 6.2, we show the impact of a noisy modality on the identification accuracy of our DBM-DNN. We tested our DBM-DNN with facial image data corrupted with three different levels of JPEG compression, while keeping the speech modality clean. It can be seen that the proposed fusion outperforms the existing techniques even when one of the modalities is noisy. Hence, our proposed fusion can be used with different reliability measures and it can achieve a higher accuracy in both clean and noisy conditions.
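The verification experiments reported next are evaluated with the HTER of Eq. 6.8, computed at the EER threshold determined on the development set. The following minimal sketch illustrates this evaluation, assuming arrays of genuine and impostor scores; it follows the definitions in Eq. 6.9 and is illustrative only.

```python
# Minimal sketch of the HTER evaluation of Eqs. (6.8)-(6.9).
import numpy as np

def fpr_fnr(genuine, impostor, tau):
    total = len(genuine) + len(impostor)             # N_a + N_r trials, as in Eq. (6.9)
    fpr = 100.0 * np.sum(impostor >= tau) / total    # false accepts
    fnr = 100.0 * np.sum(genuine < tau) / total      # false rejects
    return fpr, fnr

def eer_threshold(genuine_dev, impostor_dev):
    """Threshold tau at which FPR and FNR intersect on the development data (the EER point)."""
    taus = np.sort(np.concatenate([genuine_dev, impostor_dev]))
    gaps = [abs(np.subtract(*fpr_fnr(genuine_dev, impostor_dev, t))) for t in taus]
    return taus[int(np.argmin(gaps))]

def hter(genuine_eval, impostor_eval, tau):
    fpr, fnr = fpr_fnr(genuine_eval, impostor_eval, tau)
    return 0.5 * (fpr + fnr)                         # Eq. (6.8)
```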

Verification

Table 6.3 shows the MOBIO verification protocol adopted in this paper. The same protocol was used in the MOBIO challenge [84] held at the ICB 2013 conference. The MOBIO database was partitioned into three segments: training (TR), development (DEV) and evaluation (EVAL) (see Table 6.3). In the training segment (TR) there were data from 50 non-target subjects, which we used as unlabelled data during the unsupervised training of the DBMs. In the DEV and EVAL partitions, a small number of samples from the remaining 42 and 58 target subjects, respectively, were used as labelled data.

Table 6.3 MOBIO verification protocol. There are four types of utterances in the MOBIO dataset, designated as p: personal questions, f: free speech, r: short response, and l: long speech. Each Phase-I session (Session 1 and Sessions 2-6) contains 5p, 10f, 5r and 1l utterances per subject, and each Phase-II session contains 5p, 5f and 1l. The unlabelled training data (TR) uses all of these; enrolment uses the 5p utterances of Session 1; testing uses the 10f and 5r utterances of Phase-I and the 5f utterances of Phase-II.

A number of systems were presented in the challenge highlighting their verification performance. In Table 6.4, we list the systems presented at the ICB 2013 face recognition challenge [8] and identify them by F-1 to F-10. Among the submitted works, some used a single classifier (F-5 to F-9), while the others used a combination of different classifiers. We refer to these systems as simple (F-5 to F-9) and combination (F-1 to F-4) systems, respectively.

Table 6.4 Face verification on MOBIO (EER and HTER, %, for the male and female partitions) using the systems presented at the ICB 2013 face recognition challenge [8], the evaluation of the session variability modelling approach [9], the LRC, the SVM, and the DBM-DNN. Systems: ICB 2013 challenge [8]: F-1 UNILJ-ALP, F-2 GRADIANT, F-3 CPqD, F-4 UTS, F-5 UC-HU, F-6 TUT, F-7 IDIAP, F-8 CDTA, F-9 baseline, F-10 (F-1 + F-); Khoury et al. (2014) [9]: F-11 F-GMM, F-12 F-ISV, F-13 F-TV (PLDA scoring), F-14 F-TV (cosine scoring), F-15 (F-11 + F-12 + F-13 + F-14); this paper: F-16 DBM-DNN^LBP_face, F-17 DBM-DNN^TV_face, F-18 (F-16 + F-17).

We also list the systems analysed in [9] and identify them by F-11 to F-15. We then include our DBM-DNN systems using the LBP and TV (i-vector) features, respectively. The results in Table 6.4 show that the DBM-DNN combination system (F-18) achieved the lowest HTER among all the systems on the evaluation set for both the male and female partitions. Unlike the other systems, the HTERs for the male and female participants using the DBM-DNN based systems (F-16 and F-17) were close. For example, the HTERs for the male and female participants obtained by F-10 were 6.27% and 8.47%, respectively, and by F-15 they were 6.06% and 11.62%, respectively. Hence, pre-training a DNN as a DBM was useful to overcome the gender bias in the MOBIO dataset (where there are more male participants than females). The DBM-DNN^LBP_face system (F-16) also consistently achieved lower HTERs for males than all but one simple system, UC-HU (F-5), which used convolutional neural networks (CNNs) with much larger input images. For females, the DBM-DNN^LBP_face achieved an HTER very close to that of the UC-HU system. Hence, competitive results can be achieved using our DBM-DNN for face verification.

Table 6.5 Speaker verification on MOBIO (EER and HTER, %, for the male and female partitions) using the systems presented at the ICB 2013 speaker recognition challenge [8], the evaluation of the session variability modelling approach [9], the LRC, the SVM, and the DBM-DNN. Systems: ICB 2013 challenge [8]: S-1 Alpenion, S-2 L2F-EHU, S-3 L2F, S-4 CPqD, S-5 Phonexia, S-6 GIAPSI, S-7 IDIAP, S-8 Mines-Telecom, S-9 EHU, S-10 CDTA, S-11 RUN, S-12 ATVS, S-13 (S-1 + S-); Khoury et al. (2014) [9]: S-14 S-GMM, S-15 S-ISV, S-16 S-TV (PLDA scoring), S-17 S-TV (cosine scoring), S-18 (S-14 + S-15 + S-16 + S-17); this paper: S-19 DBM-DNN^GMS_speech, S-23 DBM-DNN^TV_speech, S-20 (S-19 + S-).

In Table 6.5, the speaker recognition systems presented at the ICB 2013 challenge are listed (S-1 to S-13). Among them, the systems identified as S-5 to S-12 are simple systems, while the other systems are combination systems. We also include in the table the speaker recognition systems analysed in [9].

The results in Table 6.5 show that our DBM-DNN^GMS_speech achieved the lowest HTER for the female evaluation among the simple systems S-5 to S-12. The combined system (S-20) using our DBM-DNNs and the proposed fusion achieved an HTER of 11.21% for the females, which is the second best compared to the combined systems presented at ICB 2013. The best system (S-1), namely Alpenion, used a combination of nine sub-systems using different features. We achieved competitive results just by combining two sub-systems. The DBM-DNN^TV_speech achieved lower HTERs for females and comparable HTERs for males compared to the TV + PLDA scoring (S-16) and the TV + cosine scoring (S-17) systems. Therefore, our DBM-DNN^TV_speech provides a better way of evaluating the TVM based systems.

We also evaluated the bi-modal verification performance using the DBM-DNN. We also used the matching scores of three systems submitted to both the speaker and face recognition challenges at ICB 2013 (namely the CPqD, IDIAP, and CDTA systems). Since the scores from these systems are not posterior probabilities, we used the C-ratio [47] as a reliability measure. In Table 6.6, we report the bi-modal recognition results on MOBIO using our proposed fusion and the LLR fusion technique for the three systems presented in both the face and speaker recognition challenges held at ICB 2013. We also report the bi-modal recognition result from [9], where LLR was used for fusion. Finally, we report the bi-modal recognition results achieved using the DBM-DNN and the proposed fusion.

Table 6.6 Bi-modal recognition on MOBIO for text-independent verification (EER and HTER, %, for the male and female partitions). Systems and fusion techniques: ICB 2013 challenge [7][97]: B-1 CPqD, B-2 IDIAP and B-3 CDTA, each fused with LLR and, in this paper, with the proposed fusion; Khoury et al. [9]: B-4 B-GMM, B-5 B-ISV, B-6 B-TV and B-7 (B-4 + B-5 + B-6), fused with LLR; this paper: B-8 DBM-DNNs (LBP + GMS), B-9 DBM-DNNs (TVM) and B-10 (B-8 + B-9), fused with the proposed fusion.

It is evident from the results that the best HTER for the male subjects (B-7) was achieved when the session variability systems evaluated in [9] were combined using LLR fusion. The best HTER for the female subjects was achieved by combining the systems using our proposed fusion (B-10). Although our proposed fusion did not achieve the best HTERs among all the methods reported in Table 6.6, it outperformed the LLR fusion method for both the male and female subjects for systems B-1 to B-3.

6.5.2 VidTIMIT database

The VidTIMIT [81] database contains videos (with speech) of 43 subjects (19 females and 24 males) reciting short sentences and rotating their heads with varying facial expressions. There are 10 videos per subject, collected in 3 sessions. The first six videos were captured in Session 1, the next two in Session 2, and the other two in Session 3. We carried out both identification and verification experiments on the VidTIMIT database.

Identification

In our identification experiments, we used the data captured in Session 1 for training and the data captured in Sessions 2 and 3 for testing, and vice versa.

Since there is only a limited number of subjects in the VidTIMIT database, we used unlabelled samples from the target subjects in the unsupervised training of the DBMs. The average identification accuracy is reported in Table 6.7.

Table 6.7 Rank-1 identification accuracy (%) on VidTIMIT for the face and speech modalities and for fusion with the negent, invent, 3D-plane and proposed mapping functions, reported for the SVM, LRC, DNN, DBN-DNN and DBM-DNN classifiers.

The face and speech modality identification accuracies obtained using our DBM-DNN were 94.38% and 78.49%, respectively. The speech and face modality identification accuracies using DNNs were 86.53% and 58.53%. Hence, pre-training a DNN as a DBM significantly improved the identification accuracy on the VidTIMIT database. The DBM-DNN also outperformed the state-of-the-art LRC and SVM classifiers.

In Table 6.7, we also report the performance of our proposed fusion technique compared to the state-of-the-art. An overall identification accuracy of 97.48% was achieved using our DBM-DNN and the proposed fusion technique, which is better than the other methods listed in Table 6.7. Our proposed reliability-to-weight mapping function can also be used with the LRC and SVM classifiers.

Table 6.8 Rank-1 identification accuracy (%) on VidTIMIT using the DBM-DNN when the face modality is corrupted by different levels of JPEG compression noise but the speech modality is clean. For each of three JPEG quality factors, the face-only accuracy and the entropy-based fusion accuracies using the negent, invent, 3D-plane and proposed mapping functions are reported.

In Table 6.8, we list the rank-1 accuracy using the DBM-DNN under various levels of JPEG compression noise on the facial images. It can be seen that the proposed fusion technique outperforms all other techniques.

Verification

In our verification experiments, we used the samples of Session 1 for the unsupervised training of the DBMs as well as for the supervised fine-tuning of our DBM-DNNs, and the samples of Sessions 2 and 3 for testing our DBM-DNNs. Since there is only a limited number of subjects in the VidTIMIT database, we used unlabelled samples from the target subjects for the unsupervised training of the DBMs.

Table 6.9 EERs (%) on VidTIMIT for face, speaker and bi-modal recognition. The face and speech EERs and the fusion EERs with the negent, invent, 3D-plane and proposed mapping functions are reported for the SVM, DNN, DBN-DNN and DBM-DNN classifiers.

In Table 6.9, the verification performance using the DBM-DNN and our proposed fusion is presented. It can be seen that the lowest errors for both the speech and face modalities were achieved using the DBM-DNNs. The proposed fusion also outperformed all other techniques.

6.6 Conclusion

We have presented an audio-visual biometric system using deep neural networks and a novel reliability-based fusion. Both identification and verification experiments were carried out on two publicly available datasets (MOBIO and VidTIMIT). Our DBM-DNN and the proposed mapping function were evaluated in clean and noisy conditions. Our experimental results showed that pre-training a DNN as a DBM achieved a better accuracy than random initialization. Our DBM-DNN also achieved a better recognition performance compared to the state-of-the-art. We also showed that the proposed mapping function can be used with different reliability measures such as the entropy of posteriors and the C-ratio.

Chapter 7

A joint Deep Boltzmann Machine (jDBM) Model for Person Identification using Mobile Phone Data

Abstract

We propose an audio-visual person identification approach based on a joint deep Boltzmann machine (jDBM) model. The proposed jDBM model is trained in three steps: a) learning the unimodal DBM models corresponding to the audio and visual modalities, b) learning the shared layer parameters using a joint Restricted Boltzmann Machine (jRBM) model, and c) fine-tuning the jDBM model after initialization with the parameters of the unimodal DBMs and the shared layer. The activation probabilities of the units of the shared layer are used as the joint features, and logistic regression is used to perform audio-visual person identification. We show that, by learning the shared layer parameters using a jRBM, a higher accuracy can be achieved compared to the greedy layer-wise initialization. The performance of our proposed model is also compared with the state-of-the-art support vector machine (SVM), deep belief network (DBN), and deep auto-encoder (DAE) models. In addition, our experimental results show that the joint representations obtained from the proposed jDBM model are robust to noise and missing information. Experiments were carried out on the challenging MOBIO database, which includes audio-visual data captured using mobile phones.

This article has been accepted with minor corrections for publication in the IEEE Transactions on Multimedia, 2016.

7.1 Introduction

A biometric system recognizes persons using their behavioral and/or physiological data (e.g., fingerprint, retinal or facial image, or speech). The recognition is performed by extracting features from the input biometric samples and comparing these with the pre-stored client model(s). A person recognition approach based on biometrics is considered a more secure option than the traditional knowledge-based or token-based recognition approaches. This is because a piece of information or a token, such as a personal identification number (PIN), password or an ID card, may be lost, forgotten, fabricated or stolen. A biometric system operates in two modes: identification or verification. In the identification mode, it is assumed that an unidentified person is already registered in the database (the registered persons are referred to as clients) and he/she is classified as one of the N registered clients. In the verification mode, the system tries to determine whether a person is who he/she claims to be. An unknown person's claim is accepted only if the matching score for the claimed client's model is above a predetermined threshold.

A multimodal biometric system performs recognition by fusing the decisions of multiple sub-systems (i.e., score-level fusion) or by using a single system with concatenated inputs (i.e., feature-level fusion). An audio-visual biometric system, which is a multimodal system, offers the following advantages: a) it addresses the issue of non-universality of data and b) it is not prone to spoofing attacks [19]. Furthermore, the data acquisition can be carried out using low-cost devices (e.g., mobile phones) and the level of user participation required while acquiring the data is much less compared to that of other biometric traits (e.g., DNA and retinal image). Therefore, an audio-visual biometric recognition system is a cost-effective and easily deployable solution. Moreover, the recent developments of smart technologies have introduced many mobile-friendly and real-life applications of biometric systems, e.g., e-banking, e-governance, e-education and e-ticketing services. However, the data acquired by hand-held devices (e.g., the MOBIO database [84]) commonly contain noise due to the illumination, pose and background variations in the video data, or additive noise in the speech data. Therefore, developing a robust biometric system using mobile phone data is a challenging task. These issues were addressed in the evaluations of state-of-the-art face [7] and speaker [97] recognition systems using the MOBIO database at the ICB 2013 conference.

In addition, the recent advancements of machine learning algorithms have facilitated the use of unsupervised learning methods for various recognition tasks. For example, the existing unsupervised-learning-based speaker recognition systems either use RBMs [35, 36] or a generative DBN [2, 37]. Apart from this, obtaining joint features using deep models has gained more attention in recent years (details in Section 7.2).

Fig. 7.1 Examples of models that can be used for learning shared features: (a) Bimodal Deep Auto-Encoder (DAE), (b) Bimodal Deep Boltzmann Machine (DBM), (c) Bimodal Deep Belief Network (DBN).

For example, a bimodal DAE (Fig. 7.1a) was used for speech recognition [38], and a bimodal DBM (Fig. 7.1b) was used for face modelling [39] and information retrieval tasks [40]. Recently, we proposed in [115], [116] the score-level fusion of two feed-forward Deep Neural Networks (DNNs) which were initialized with DBM parameters (referred to as DBM-DNNs). However, audio-visual person recognition (identification or verification) using joint features has not been fully explored. This paper aims at filling that void by presenting a jDBM model for audio-visual person identification. Our contributions in this paper are listed below:

- We present a jDBM model for audio-visual person identification (Section 7.3). To the best of our knowledge, this is the first use of joint features for audio-visual person identification. A higher identification accuracy can be achieved using our proposed model compared to the state-of-the-art SVM, DAE and DBN models (Section 7.5.1 and Section 7.5.2).

- We present a novel three-step algorithm for training the jDBM model. Our proposed learning approach achieves a higher accuracy compared to the conventional approach of training a multimodal DBM (Section 7.5.3).

- We show that the joint features obtained from the proposed jDBM model are more robust to noise and missing information than the state-of-the-art models (Section 7.5.4).

The rest of the paper is organized as follows. In Section 7.2, we briefly review the literature which uses joint features for different applications and put our work in this context. In Section 7.3, we describe our approach. Our experimental setup is described in Section 7.4, followed by the results and their analysis in Section 7.5. Finally, this paper is concluded in Section 7.6.

7.2 Related Work

A multimodal (bimodal) deep network can be trained to learn the correlations between multiple data modalities (e.g., audio and visual). The activation probabilities of the units in the shared layer can be used as feature vectors for a recognition task. In recent years, a number of bimodal deep models have been proposed and used for different recognition tasks. We can classify them into two categories: denoising and generative deep models.

Denoising Deep Models: In [38], a bimodal DAE was used in a denoising fashion for reconstructing both inputs (see Fig. 7.1a). The network was initialized with the parameters of a bimodal DBN (see Fig. 7.1c).

Fig. 7.2 Block diagram of the proposed jDBM-based audio-visual person identification from mobile phone data: pre-processing and feature extraction, unimodal DBMs (DBM_face and DBM_speech), jRBM pre-training of the shared-layer parameters, jDBM fine-tuning, and a logistic regression classifier producing the identification decision.

A greedy layer-wise approach was used to train the bimodal DBN, with Gaussian RBMs modelling the visible-hidden interactions (i.e., the connection weights between the units of the visible and first hidden layers) and Bernoulli-Bernoulli RBMs modelling the hidden-hidden interactions (i.e., the connection weights between the units of adjacent hidden layers). The DAE model presented in [38] is suitable for tasks that require the ability to reconstruct one input given the other. In addition, the activation probabilities of the shared layer in the middle (h2 in Fig. 7.1a) can be used as feature vectors for a recognition task (e.g., audio-visual speech recognition) in a noisy environment.

Generative Deep Models: Recently, a number of generative deep models have been proposed which have a shared layer on top. For example, a bimodal DBM (Fig. 7.1b) was trained in [40] to model the joint space of image and text. The model was then used to obtain joint features (i.e., the activation probabilities of the units in the shared layer), which were used for image classification and information retrieval tasks. In that approach, unimodal DBMs were also trained to model the data distribution of each modality. For the image-specific DBM, a Gaussian RBM and Bernoulli-Bernoulli RBMs were used to model the visible-hidden and hidden-hidden interactions, respectively, while for the text-specific DBM a Replicated Softmax model was used for the visible-hidden interactions. A standard DBM training approach, based on mean-field inference and a Markov Chain Monte Carlo (MCMC) stochastic approximation procedure, was then used to estimate the data-dependent expectations and approximate the model's sufficient statistics, respectively. Comparisons with other models, such as the bimodal DAE and the bimodal DBN, showed a noticeable improvement using the bimodal DBM model. A similar approach was also used in [39] for face modelling using facial landmark coordinates and shape-free texture images.

However, in the existing DBM-based models (e.g., [39] and [40]), it is assumed that one modality is more dominant than the other. For example, images and their corresponding tags (e.g., a description of the image) were used as inputs to the bimodal DBM model in [40]. That model was capable of generating a tag given only an image at the input, but not the opposite. This is because tags only carry additional information about the image, not the discriminative features that can be used for image classification. Similarly, the coordinates of facial landmarks and shape-free facial images were used as inputs in [39]. In the case of audio-visual biometrics, however, both the speech and the facial image data carry sufficient information to independently identify people. Therefore, a robust approach is required for scenarios where, e.g., one of the modalities is missing or both inputs are heavily corrupted by noise. Our proposed jDBM model, which is trained using a novel three-step algorithm, attempts to address these issues. In Fig. 7.2, a block diagram of our proposed audio-visual person identification approach is shown (details in Section 7.3).
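As background for the building blocks mentioned above, the sketch below gives a minimal CD-1 (one-step contrastive divergence) update for a Gaussian-Bernoulli RBM. It is an illustrative sketch, not the implementation used in the cited works: unit visible variance, the learning rate, and the weight initialization scale are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GaussianBernoulliRBM:
    """Gaussian visible units (unit variance assumed) and Bernoulli hidden units,
    trained with one step of contrastive divergence (CD-1)."""

    def __init__(self, n_vis, n_hid, lr=1e-3):
        self.W = 0.01 * rng.standard_normal((n_vis, n_hid))
        self.b = np.zeros(n_vis)   # visible biases
        self.c = np.zeros(n_hid)   # hidden biases
        self.lr = lr

    def hidden_probs(self, v):
        # p(h_j = 1 | v) for Gaussian visibles with unit variance
        return sigmoid(v @ self.W + self.c)

    def visible_mean(self, h):
        # E[v | h]; the reconstruction uses the mean rather than a Gaussian sample
        return h @ self.W.T + self.b

    def cd1_update(self, v0):
        # v0: mini-batch of real-valued inputs, shape (batch, n_vis)
        ph0 = self.hidden_probs(v0)
        h0 = (rng.random(ph0.shape) < ph0).astype(float)   # sample hidden states
        v1 = self.visible_mean(h0)                          # "negative" reconstruction
        ph1 = self.hidden_probs(v1)
        batch = v0.shape[0]
        self.W += self.lr * (v0.T @ ph0 - v1.T @ ph1) / batch
        self.b += self.lr * (v0 - v1).mean(axis=0)
        self.c += self.lr * (ph0 - ph1).mean(axis=0)

# A Bernoulli-Bernoulli RBM for the hidden-hidden layers differs only in that the
# visible reconstruction is also passed through a sigmoid (binary/probabilistic units).
```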

Fig. 7.3 A two-layer deep Boltzmann machine (DBM) with a visible layer and two hidden layers.

7.3 Proposed Learning of the jDBM Model

In this section, we present the proposed jDBM model, which infers joint features given audio-visual signals at the input. First, we discuss the conventional approach of training a jDBM (Fig. 7.1b). We then explain how to train the proposed jDBM model using a novel three-step algorithm (Fig. 7.4).

7.3.1 Training of the Bimodal DBM

Conventional approach

A jDBM model, similar to the one presented in [39], can be trained using Gaussian-Bernoulli RBMs to model the visible-hidden interactions and Bernoulli-Bernoulli RBMs to model the hidden-hidden interactions. Let s ∈ R^D and f ∈ R^D represent the vectors corresponding to the speech and facial image inputs, respectively. Two unimodal DBMs (as mentioned in Section 2.5.2), referred to as DBM_speech and DBM_face (see Fig. 7.4a), are trained to model the data of the respective modalities (i.e., speech and face). Now, let h^3 contain the units shared between the unimodal DBMs (see Fig. 7.4c); the joint probability distribution over the inputs s and f is then given by:

\[
P(s, f \mid \theta) = \sum_{h^2_s,\, h^2_f,\, h^3} P(h^2_s, h^2_f, h^3) \Big( \sum_{h^1_s} P(s, h^1_s, h^2_s) \Big) \Big( \sum_{h^1_f} P(f, h^1_f, h^2_f) \Big), \quad (7.1)
\]

where θ = {W^1_s, W^2_s, W_s, W^1_f, W^2_f, W_f} is the set of parameters. The true posterior P(h | s, f; θ) is approximated with the fully factorized distribution given in Eq. 7.2.
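For intuition, the following is a minimal sketch of the mean-field fixed-point updates for the unimodal, two-layer DBM of Fig. 7.3; the bimodal case of Eq. 7.2 adds the second pathway and the shared layer. The weight naming (W1 between the visible layer and h1, W2 between h1 and h2), the number of iterations, and the uniform initialization are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dbm_mean_field(v, W1, W2, b1, b2, n_iter=10):
    """Fully factorized mean-field posterior q(h1) q(h2) for a two-layer DBM.

    v  : observed input, shape (n_vis,)
    W1 : visible-to-h1 weights, shape (n_vis, n_h1)
    W2 : h1-to-h2 weights, shape (n_h1, n_h2)
    Returns (mu1, mu2), the variational parameters q(h = 1 | v).
    """
    mu2 = np.full(W2.shape[1], 0.5)
    for _ in range(n_iter):
        # h1 receives bottom-up input from v and top-down input from h2
        mu1 = sigmoid(v @ W1 + mu2 @ W2.T + b1)
        # h2 receives input from h1 only
        mu2 = sigmoid(mu1 @ W2 + b2)
    return mu1, mu2
```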

Fig. 7.4 Three-step training of the proposed jDBM model: (a) Step 1: learning the unimodal DBM models; (b) Step 2: the Bernoulli-Bernoulli joint RBM; (c) Step 3: the jDBM model. In the first step, we learn unimodal DBMs corresponding to the audio and visual modalities. In the second step, we learn the shared-layer parameters as a Bernoulli-Bernoulli joint RBM. In the third step, the jDBM is fine-tuned after being initialized with the parameters of the unimodal DBMs and of the joint RBM.

The fully factorized approximating distribution is:

\[
Q(h \mid s, f; \mu) = \Bigg( \prod_{j=1}^{P_1} q(h^1_{sj} \mid s, f) \prod_{k=1}^{P_2} q(h^2_{sk} \mid s, f) \Bigg) \Bigg( \prod_{j=1}^{P_1} q(h^1_{fj} \mid s, f) \prod_{k=1}^{P_2} q(h^2_{fk} \mid s, f) \Bigg) \prod_{m=1}^{P_3} q(h^3_m \mid s, f), \quad (7.2)
\]

where h = {h^1_s, h^2_s, h^1_f, h^2_f, h^3} and μ = {μ^1_s, μ^2_s, μ^1_f, μ^2_f, μ^3} are the variational parameters. Learning is carried out by finding the value of μ that maximizes the variational lower bound for the current value of the model parameters θ, which results in a set of mean-field equations. Given μ, the model parameters are then updated using an MCMC-based method. The initial parameters of the unimodal DBMs are learned using a greedy layer-wise approach (details in [119]). Besides, the parameters of the bimodal DBM (Fig. 7.1b) are pre-trained (i.e., their initial values are learned) using a bimodal DBN, which is formed by stacking RBMs.

Proposed approach

We propose a novel three-step training algorithm for the jDBM, using the pre-training strategy proposed in [122] and a jRBM (Fig. 7.4b). Unlike the conventional approach, we divide the task of learning the joint distribution into three sub-tasks (i.e., the three factors of Eq. 7.2 are approximated separately), and then fine-tune the jDBM model for a small number of iterations. Also unlike the conventional approach, we use the parameters of the unimodal DBMs (Fig. 7.4a) to initialize the lower layers of the jDBM (Fig. 7.4c), and pre-train the shared-layer parameters using a jRBM (Fig. 7.4b).

In the first step, we train two unimodal DBMs, referred to as DBM_speech and DBM_face (Fig. 7.4a), which correspond to the speech and face modalities. We use the algorithm presented in [122] to pre-train each DBM. After the pre-training, we use mean-field inference to find the approximate posterior distribution for a given input to a DBM. For example, given an input s to DBM_speech, we obtain the approximate posterior distribution Q(h_s | s), where h_s = {h^1_s, h^2_s}. Similarly, for an input f to DBM_face, we obtain the approximate posterior distribution Q(h_f | f), where h_f = {h^1_f, h^2_f}. Their marginals, q(h^2_{sk} = 1 | s) and q(h^2_{fk} = 1 | f), are used with logistic regression for unimodal person identification (see Section 7.5.1).
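A minimal sketch of how the Step-1 marginals can be consumed is given below, with random placeholder arrays standing in for the real DBM outputs. All sizes, the number of clients, the shared-layer width, and the solver settings are assumptions made only for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Placeholder data standing in for the top-layer marginals q(h2_s = 1 | s) and
# q(h2_f = 1 | f) produced by DBM_speech and DBM_face (e.g., via a mean-field
# routine such as the one sketched earlier).
Q_speech = rng.random((200, 128))       # one row of speech marginals per sample
Q_face = rng.random((200, 128))         # one row of face marginals per sample
labels = rng.integers(0, 10, size=200)  # client identities

# Step 1 evaluation: unimodal identification with logistic (softmax) regression.
speech_clf = LogisticRegression(max_iter=1000).fit(Q_speech, labels)
face_clf = LogisticRegression(max_iter=1000).fit(Q_face, labels)

# Step 2: the same marginals become the two visible blocks (y, z) of the
# Bernoulli-Bernoulli joint RBM; the shared-layer weights start from small
# random values and the visible biases come from the unimodal DBMs.
W_s = 0.01 * rng.standard_normal((Q_speech.shape[1], 256))  # speech-to-h3 weights
W_f = 0.01 * rng.standard_normal((Q_face.shape[1], 256))    # face-to-h3 weights
```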

Fig. 7.5 Top row: frames extracted from different videos of a person; bottom row: detected faces from the video frames.

In the second step, the parameters for the shared layer of the jDBM are pre-trained using a Bernoulli-Bernoulli jRBM (see Fig. 7.4b). The joint distribution of the jRBM is given by:

\[
P(y, z \mid \theta') = \sum_{h^3} P(y, z, h^3) = \frac{1}{Z(\theta')} \sum_{h^3} \exp\big( -E(y, z, h^3) \big), \quad (7.3)
\]

where y = [q(h^2_{sk} | s)]_{k=1,...,P^s_2} and z = [q(h^2_{fk} | f)]_{k=1,...,P^f_2} correspond to the marginals of the approximate posterior distributions obtained from DBM_speech and DBM_face, respectively. In addition, θ' = {W_s, W_f} represents the parameter set of the jRBM. Here, W_s and W_f are initialized with small random values, while the biases of the top layers of DBM_speech and DBM_face are used as the biases for the visible layers of the jRBM. We solve the following optimization in order to maximize the log-likelihood:

\[
\theta' = \arg\max_{\theta'} \ln \mathcal{L}(\theta' \mid y, z). \quad (7.4)
\]

The optimal values of the parameters can be obtained using a method similar to the one presented earlier. An approximation of the true posterior distribution is given by:

\[
Q(h^3 \mid y, z; \mu) = \prod_{m=1}^{P_3} q(h^3_m \mid y, z), \quad (7.5)
\]

where μ represents the mean-field parameters. The objective function (Eq. 7.4) is optimized in order to find the μ that maximizes the variational lower bound for the current value of the model parameters θ'.

In the third step, we use the parameters of the joint RBM as well as those of the unimodal DBMs to initialize the jDBM (see Fig. 7.4c). After this initialization, the jDBM model is fine-tuned for a small number of epochs so that its parameters are slightly adapted to the joint space of the input data modalities. The marginals, q(h^3_m | s, f), of the approximate posterior distribution Q(h | s, f; θ, θ'), where h = {h^1_s, h^2_s, h^1_f, h^2_f, h^3}, can be used as joint features. If one modality is missing, the joint features are obtained by sampling from the posterior P(h^3 | s) when the face modality is missing, or from P(h^3 | f) when the speech modality is missing.
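To make the feature-extraction idea concrete, the sketch below approximates the shared-layer activation probabilities with a single bottom-up pass from the Step-1 marginals; this is a simplification of the full mean-field inference in the fine-tuned jDBM and is shown only as an illustration under that assumption.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def joint_features(y, z, W_s, W_f, b3):
    """Approximate shared-layer activation probabilities q(h3 = 1 | s, f).

    y, z : top-layer marginals from DBM_speech and DBM_face (as in Eq. 7.3)
    W_s, W_f : shared-layer weights; b3 : shared-layer biases.
    A single bottom-up pass is used here instead of full mean-field inference.
    """
    return sigmoid(y @ W_s + z @ W_f + b3)

def joint_features_speech_only(y, W_s, b3):
    # When the face input is missing, only the speech pathway contributes,
    # mimicking inference from P(h3 | s); the face-only case is symmetric.
    return sigmoid(y @ W_s + b3)

# The resulting joint feature vectors are then fed to a logistic regression
# classifier for person identification, as in the block diagram of Fig. 7.2.
```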
