
Listening With Your Eyes: Towards a Practical Visual Speech Recognition System

By Chao Sui

Submitted in fulfilment of the requirements for the degree of Doctor of Philosophy

School of Computer Science and Software Engineering
University of Western Australia

December 2016


Abstract

Speech has long been acknowledged as one of the most effective and natural means of communication between human beings. In recent decades, continuous and substantial progress has been made in the development of Automatic Speech Recognition (ASR) systems. However, in most practical applications, the accuracy of ASR systems is negatively affected by exposure to noisy environments. Research on audio-visual speech recognition has been undertaken to overcome the recognition degradation that occurs in the presence of acoustic noise. Despite these promising application prospects, state-of-the-art Visual Speech Recognition (VSR) systems still cannot achieve adequate performance in practical applications, because the representation of the joint visual and speech information is a largely undeveloped area. Given the limitations of current 2D-based VSR systems, this thesis endeavours to find a better way to represent visual speech information for speech recognition, from two perspectives. The first is to develop a hand-crafted visual speech feature that exploits recent advances in 3D computer vision. The second is to use deep learning techniques to develop an automatic visual feature learning scheme that overcomes the limitations of hand-crafted visual features.

In order to explore the possibility of using 3D visual information for speech recognition, a comprehensive study on the characteristics of both grey and depth visual features was first carried out, and the reasons why depth information can boost the performance of visual speech recognition were extensively analysed. Based on the different information embedded in the grey and depth features, a new Cascade Hybrid Appearance Visual Feature (CHAVF) extraction scheme was proposed, which successfully combines grey and depth visual information into a compact feature vector. This novel feature was evaluated on visual and audio-visual connected digit recognition and isolated phrase recognition. The results showed that depth information is capable of significantly boosting speech recognition performance, and that the proposed visual feature outperforms other commonly used appearance-based visual features on both the visual and audio-visual speech recognition tasks. In particular, the proposed grey-depth visual feature yields an approximately 21% relative improvement over the grey visual feature.

In terms of deep learning based automatic visual feature learning, three learning schemes are proposed in this thesis. First, a visual deep bottleneck feature (DBNF) learning scheme was developed in this research. This learning scheme uses a stacked auto-encoder combined with some popular hand-crafted techniques. Experimental results showed that the proposed deep feature learning scheme yields an approximately 24% relative improvement in visual speech accuracy. Second, unlike the DBNF scheme, which solely learns features from video sequences, a novel learning method based on the Deep Boltzmann Machine (DBM) was then developed, which is able to exploit both the acoustic and the visual information to learn a better visual feature representation. During the test stage, instead of using both audio and visual signals, only the videos are used to generate the missing audio features, and both the given visual and the generated audio features are used to produce a joint representation. Experimental results showed that the DBM based method outperformed the originally proposed DBNF method.

Finally, a novel feature learning framework based on the marginalised stacked auto-encoder, which does not require practitioners to have any deep learning specific knowledge, was also proposed. This method was applied to the task of visual speech recognition, and the proposed feature learning framework outperformed the other feature extraction methods. The method is also a universal solution which can be used for any deep learning based task. Therefore, it was also verified on the popular handwritten digit recognition MNIST database, and experimental results showed that the proposed method is as effective as (if not better than) models that are tuned by experts.


Acknowledgements

First of all, my biggest thanks go to my supervisors, Winthrop Professor Mohammed Bennamoun and Professor Roberto Togneri, who provided me with the opportunity and scholarship to study in this fantastic university. Throughout my study at UWA, they used their patience and experience to guide me and to make this thesis possible. Besides imparting academic knowledge and research skills, Winthrop Professor Bennamoun and Professor Togneri also gave me a great deal of advice on my life and career. I am truly grateful for their sacrifices and efforts during my PhD candidature. Also, I must thank Dr Senjian An for the insightful discussions and helpful guidance on my research papers and thesis. I also appreciate and cherish the friendship built up with my colleagues, especially Dr Yulan Guo and Dr Yinjie Lei. As the first two friends I made in Perth, they gave me a lot of help both in my research and in my life. Along with their selfless help, their sense of humour always refreshed me during frustrating and even hopeless research times. Thanks also to the computer vision group members, Dr Suman Sedai, Dr Munawar Hayat, Dr Said Elaiwat, Dr Syed Afaq Ali Shah, Alam Mohammad, Umar Asif and Salman Khan, who made my time at UWA a wonderful one. Last but not least, special thanks go to all my family members who care about me, particularly my parents.

Their love and support are an important contribution to my research. They raised me up to more than I can be. Without their support and spiritual encouragement, it would have been impossible for me to continue my study in this fantastic university. I also cannot express how grateful I am to my wife, Yingnan, for sharing the hardships of this totally unfamiliar environment when we first moved to Perth. They are the most important people in my life.

Contents

1 Introduction
    Background
    Objectives
    Thesis contributions
    Publications
    Thesis Outline

2 Literature Review
    Introduction
    Hand Crafted Feature Extraction
    Graph Based Visual Feature Representations
    Visual Feature Learning Using Deep Learning
    Discussion

3 A Cascade Gray-Stereo Visual Feature Extraction Method For Visual and Audio-Visual Speech Recognition
    Introduction
    Related Works
    Visual Feature Extraction
        DCT
        LBP-TOP
    Cascade Hybrid Appearance Visual Feature Extraction
    Performance Evaluation and Results
        The AusTalk and OuluVS Corpora
        Speaker-Independent Visual Speech Recognition
        Speaker-Independent Visual Phrase Classification
        Speaker-Independent Audio-Visual Speech Recognition
    Conclusion

4 Visual Feature Learning Using Deep Bottleneck Networks
    Introduction
    Deep Bottleneck Features
    Experiments
        Data Corpus
        Experimental Setup
        Stacked Auto-Encoder Architecture
        Performance of the Augmented Bottleneck Feature
    Conclusion

5 Visual Feature Learning Using Deep Boltzmann Machine
    Introduction
    Related Works and Our Contributions
    Proposed Feature Learning Scheme
        Multimodal Deep Boltzmann Machine
        DBM Pre-training
        DBM Fine-tuning
        Generating Missing Audio Modality
        Deterministic Fine-Tuning
        Augmented Visual Feature Generation
    Experiments
        Data Corpus and Experimental Setup
        Audio and Visual Features
        Learning Model Architecture
        Performance Evaluation
    Conclusion

6 Visual Feature Learning Using PSO Tuned Deep Models
    Introduction
    Related Works
    Proposed PSO-mSDA Feature Learning System
        Marginalised Stacked Denoising Autoencoder
        Particle Swarm Optimization for mSDA
    Experimental Results
        Database
        Performance Evaluation
    Conclusion

7 Conclusion
    Discussion
    Future work

List of Tables

2.1 Summary of the recently proposed multi-speaker and speaker-independent visual-only speech recognition performance on popular and publicly available visual speech corpora
3.1 Bumblebee camera configuration used for building our system
3.2 Digit sequences in the AusTalk data corpus. For the digit 0, there are two possible pronunciations: zero ("z") and oh ("o")
3.3 The 10 phrases in the OuluVS data corpus
3.4 Planar LBP-TOP features. The superscript 1x1 indicates that the entire lip region is fed into the feature extraction procedure without subdivision, while the superscript 2x5 indicates that the mouth region is divided into 10 subregions, as shown in Figure 3.2. The subscript represents the radii of the spatial and temporal axes (i.e., X, Y and T) and the number of neighbouring points in these three orthogonal planes, which are 8 and 3, respectively
3.5 Visual speech recognition performance comparison between popular appearance features and our proposed method
3.6 Comparison between our proposed method and the classifier-level fusion methods
3.7 Visual speech classification comparison on the OuluVS data corpus. The results are reported in terms of speaker-dependent speech classification
3.8 The Audio Weight (AW) and the Video Weight (VW) of the MSHMM
4.1 Digit sequences in the Big ASC data corpus. For the digit 0, there are two possible pronunciations: zero ("z") and oh ("o")
4.2 Evaluation of various stacked denoising auto-encoder architectures
4.3 Visual speech recognition performance comparison between our proposed DBNF and other methods
5.1 Digit sequences in the AusTalk data corpus. For the digit 0, there are two possible pronunciations: zero ("z") and oh ("o")
5.2 Connected digit recognition performance with multi-modal inputs
5.3 Performance comparison between the DBM learned feature and its variants proposed in this paper
5.4 Performance comparison between our proposed method and other feature learning and extraction techniques
6.1 The 10 phrases in the OuluVS data corpus
6.2 Comparison between human-tuned mSDA and PSO-tuned mSDA
6.3 Parameter comparison between the expert-tuned mSDA and PSO-tuned mSDA on MNIST
6.4 Test error rates produced by our proposed method and various other methods on MNIST
6.5 Visual speech classification comparison on the OuluVS data corpus. The results were reported in terms of speaker-dependent speech classification

List of Figures

1.1 Possible application scenarios of VSR. In an acoustically noisy environment, using an intelligent handset to capture and extract visual features is an effective solution for ASR
2.1 Possible application scenarios of VSR. In an acoustically noisy environment, using an intelligent handset to capture and extract visual features is an effective solution for ASR
2.2 Lip spatio-temporal feature extraction using LBP-TOP feature extraction. (a) Lip block volumes; (b) lip images from three orthogonal planes; (c) LBP features from three orthogonal planes; (d) concatenated features for one block volume with appearance and motion
2.3 The idea behind graph-based feature representation methods is to project the original high-dimensional spatio-temporal visual features to a trajectory in a lower-dimensional feature space, thereby reducing the feature dimension to boost the performance of speech recognition. Each point p(w_t) of the projected trajectory represents a frame in the corresponding video. This figure appeared in [131]. In this work, each image x_i of the T-frame video is assumed to be generated by the latent speaker variable h and the latent utterance variable w_i
2.4 Two RBM-based deep models. Blue circles represent input units and red circles represent hidden units. (a): An RBM. (b): A stacked RBM-based auto-encoder
2.5 Different deep models. The blue and orange circles represent input units, the red circles represent hidden units, and the green circles represent representation units. (a): A DBN. (b): A DBM. (c): A multimodal DBM. When comparing (a) with (b), one can note that the DBN model is a directed model, while the DBM model is undirected
3.1 Lip spatio-temporal feature extraction using the LBP-TOP feature extraction. (a) Lip block volumes; (b) lip images from three orthogonal planes; (c) LBP features from three orthogonal planes; (d) concatenated features for one block volume with the appearance and motion
3.2 The mouth region is divided into 10 subregions
3.3 Cascade Hybrid Appearance Visual Feature (CHAVF) extraction
3.4 The recording environments and devices used to collect the AusTalk data
3.5 Sample RGB images and meshes from the AusTalk visual dataset
3.6 The comparison of the amount of relevant information (I(x_j; C)) embedded in different types of feature dimensions
3.7 The performance of visual-only speech recognition using various feature types and feature reduction techniques: planar (gray-level) DCT (Figure 3.7a), stereo (depth-level) DCT (Figure 3.7b), hybrid-level DCT (Figure 3.7c), planar (gray-level) LBP-TOP (Figure 3.7d), stereo (depth-level) LBP-TOP (Figure 3.7e), hybrid-level LBP-TOP (Figure 3.7f)
3.8 The visual-only speech recognition performance of the mRMR-selected LBP-TOP features followed by LDA for further feature dimension reduction
3.9 2D t-SNE visualisation of different visual features with various feature reduction techniques. Figure 3.9a: Planar DCT+LDA; Figure 3.9b: Planar LBP+mRMR+LDA; Figure 3.9c: Stereo DCT+LDA; Figure 3.9d: Stereo LBP+mRMR+LDA; Figure 3.9e: Hybrid-level DCT+LDA; Figure 3.9f: Our proposed CHAVF
3.10 Multistream HMM audio-visual digit classification results with various white noise SNR levels for different types of visual features
4.1 Proposed augmented deep multimodal bottleneck visual feature extraction scheme
5.1 Possible application scenarios of our proposed framework. In a noisy environment, visual features are a promising solution for automatic speech recognition
5.2 Block diagram of our proposed system. The left side of the figure shows the training phase: the visual feature is learned from both the audio and video streams using a multimodal DBM. The right side of the figure shows the testing phase, where the audio signal is not used. In the testing phase, the audio is generated by clamping the video input and sampling the audio input from the conditional distribution
5.3 The generation of the missing audio signals can be divided into two steps: 1. Infer the audio signal from the given visual features. 2. Generate a joint representation using both the reconstructed audio and the given visual features
5.4 Discriminative fine-tuning of our proposed DBM model
5.5 Examples in the AusTalk Corpus. Fig. 5.5a: Original recordings in the corpus. Fig. 5.5b: Corresponding mouth ROI examples extracted from the original examples in Fig. 5.5a
5.6 2D t-SNE visualisation of the DBM learned feature (Fig. 5.6a) and our proposed feature (Fig. 5.6b)
6.1 System overview of our proposed framework

Chapter 1

Introduction

The ultimate goal of artificial intelligence research is to empower machines to perform complex tasks as competently as humans. This also includes providing machines with the ability to understand human language. In typical human-computer interaction, operators use buttons, joysticks, keyboards, mice and so forth to command devices. However, these devices are neither natural nor straightforward enough to operate, especially for disabled users. Automatic Speech Recognition (ASR) systems are more likely to overcome these problems effectively, and they have already been applied in a number of areas, such as in-vehicle environments, robotics and intelligent phones.

Inspired by human perception systems, acoustic signals are not the only source that individuals use to interpret language. Listeners also rely on information from non-verbal sources, such as lip and tongue movements [52] and facial expressions [57]. A speech recognition system that relies solely on visual information is called a Visual Speech Recognition (VSR) system. Moreover, a number of Audio-Visual Speech Recognition (AVSR) systems that add lip features to the speech recognition process have been shown to boost recognition accuracy compared with their acoustic-only counterparts [90].

Despite the promising application prospects of VSR and AVSR, numerous challenges still exist in deploying these technologies commercially. This thesis addresses one of the fundamental issues in the area of VSR: how to embed visual speech relevant information into visual features, and how to use these visual features for speech recognition.

The rest of this chapter is organised as follows: a brief introduction to the research background and objectives is given in Section 1.1 and Section 1.2, respectively. The main contributions of this thesis are presented in Section 1.3, followed by a list of publications in Section 1.4. An outline of the thesis is provided in Section 1.5.

1.1 Background

Given that speech is widely acknowledged to be one of the most effective means of communication between humans, researchers in the ASR community have made great efforts to provide users with a natural way to communicate with intelligent devices. This is particularly important for disabled people, who may be incapable of using a keyboard, mouse or joystick. In recent years, great achievements have been made by the ASR community, such as the application of deep learning [46] techniques to ASR. As a result, it is generally believed that we are getting closer to talking naturally and freely to our computers [22]. Although a number of ASR systems have been commercialised and have entered our daily lives (e.g., Apple's Siri and Microsoft's Cortana), several limitations still exist in this area. One major limitation is that ASR systems are still prone to environmental noise, thereby limiting their applications. Given ASR's vulnerability, research in the area of VSR has emerged to provide an alternative solution to improve speech recognition performance.

Further, VSR systems have a wider range of applications compared to their acoustic-only speech recognition counterparts. For example, as shown in Fig. 1.1, in many practical applications where speech recognition systems are exposed to noisy environments, acoustic signals are almost unusable for speech recognition. Conversely, with the availability of front and rear cameras on most intelligent mobile devices, users can easily record facial movements to perform VSR. In extremely noisy environments, visual information basically becomes the only source that ASR systems can use for speech recognition.

Figure 1.1: Possible application scenarios of VSR. In an acoustically noisy environment, using an intelligent handset to capture and extract visual features is an effective solution for ASR.

1.2 Objectives

To address the challenges presented in Section 1.1, this research aims to develop visual speech feature extraction methods using the most advanced computer vision and machine learning techniques:

1. Motivated by the encouraging performance of 3D computer vision based methods, this research aims to develop a visual and audio-visual speech recognition system that combines grey and depth information to boost visual and audio-visual speech accuracy.

2. Motivated by the great success achieved by deep learning techniques in the area of acoustic speech recognition [23], this research aims to develop an automatic deep visual feature learning technique to overcome the limitations of hand-crafted visual features.

1.3 Thesis contributions

With the purpose of achieving the objectives discussed in Section 1.2, this thesis makes several contributions to the fields of visual speech recognition and machine learning:

1. This research developed a novel cascade feature extraction method that can effectively encode both grey-level and depth-level visual information into a compact visual feature. Moreover, this new visual feature carries both global and local appearance-based visual information, and this research demonstrated that both local and global appearance-based visual information are able to contribute to improvements in speech recognition. Experimental results show that the proposed feature extraction method achieves promising accuracy in different speech recognition tasks. To the best of our knowledge, this is also the first comprehensive work that experimentally shows the efficacy of depth-level visual features on speaker-independent continuous speech recognition on a large-scale (164 speakers) audio-visual corpus. The experimental results also demonstrate an improved performance with the integration of grey-level features. Since using depth-level visual information for VSR is still a largely undeveloped area that has promising potential applications, this research is expected to provide the research community with a new perspective to overcome the limitations of grey-level visual speech features. This feature extraction method is introduced in Chapter 3.

2. In this thesis, a novel augmented deep bottleneck feature (DBNF) extraction method for visual speech recognition is introduced. Experimental results show that the proposed deep feature learning scheme outperformed various hand-crafted features and boosted speech accuracy significantly. The proposed DBNF based method is introduced in Chapter 4.

3. A novel deep feature learning framework is developed which uses both the audio and visual signals to enrich the visual feature representation. Unlike previous works, which only extract visual features from the video data, both audio and visual features are used during the training process. During the test phase, only the visual information is needed, since this feature learning framework is capable of inferring the missing (i.e., degraded) audio modality. Experimental results show that the proposed technique outperforms the hand-crafted features and the features learned by other commonly used deep learning techniques. This feature learning framework is introduced in Chapter 5.

4. This thesis introduces a novel framework for stacked auto-encoder training which does not need practitioners to have any deep learning specific knowledge. This method is also a universal solution which can be used for any task that can be solved by deep learning based methods. Experimental results show that the model automatically trained by the proposed method is as accurate as, if not better than, the model tuned by experts. This automatic parameter tuning framework is introduced in Chapter 6.

1.4 Publications

Outcomes of this research have been disseminated in the following publications.

1. C. Sui, R. Togneri, M. Bennamoun, Deep Feature Learning for Dummies: A Simple Auto-Encoder Training Method Using Particle Swarm Optimisation, Pattern Recognition Letters.

2. C. Sui, R. Togneri, M. Bennamoun, A Cascade Gray-Depth Visual Feature Extraction Method for Visual and Audio-Visual Speech Recognition, Speech Communication.

3. C. Sui, M. Bennamoun, R. Togneri, Visual Speech Feature Representations: Recent Advances, in Advances in Face Detection and Facial Image Analysis, Kawulok, Michal, Celebi, M. Emre, Smolka, Bogdan (Eds.), Springer, 2016.

4. C. Sui, R. Togneri, M. Bennamoun, Listening by Eyes: Towards a Practical Visual Speech Recognition System Using Deep Boltzmann Machines, in Proceedings of the IEEE International Conference on Computer Vision (ICCV2015), 2015.

5. C. Sui, R. Togneri, M. Bennamoun, Extracting Deep Bottleneck Features for Visual Speech Recognition, in Proceedings of the 40th International Conference on Acoustics, Speech and Signal Processing (ICASSP2015), 2015.

6. C. Sui, M. Bennamoun, R. Togneri, S. Haque, A lip extraction algorithm using region-based ACM with automatic contour initialisation, in Proceedings of the 2013 IEEE Workshop on Applications of Computer Vision (WACV2013), 2013.

7. C. Sui, S. Haque, R. Togneri, M. Bennamoun, A 3D Audio-Visual Corpus for Speech Recognition, in Proceedings of the 14th Australasian International Conference on Speech Science and Technology (SST2012), 2012.

8. C. Sui, R. Togneri, S. Haque, M. Bennamoun, Discrimination Comparison Between Audio and Visual Features, in Proceedings of the 46th Annual Asilomar Conference on Signals, Systems, and Computers (Asilomar2012), 2012.

1.5 Thesis Outline

Chapter 2 presents an up-to-date survey on visual speech feature representation and highlights the strengths of two newly developed approaches (i.e., graph-based learning and deep learning) for VSR. In particular, this chapter summarises the methods which use these two techniques to overcome one of the most challenging difficulties in this area, i.e., how to automatically learn informative visual feature representations from facial images to replace the widely used handcrafted features. The chapter concludes with a discussion of the potential visual feature representation solutions that may overcome the challenges in this domain. Chapter 2 was published as a book chapter in Advances in Face Detection and Facial Image Analysis, published by Springer.

Chapter 3 explores the use of depth-level information for VSR. In particular, this chapter addresses three fundamental issues: 1) Will depth-level features benefit visual and audio-visual speech recognition? 2) If so, how much information is embedded in depth-level features? 3) How can both grey-level and depth-level information be encoded in a compact feature vector? In this research, a comprehensive study on the characteristics of both grey-level and depth-level visual features is carried out, and how depth-level information can boost visual speech recognition is extensively analysed. Based on the different information embedded in grey-level and depth-level features, this chapter presents a new visual feature extraction scheme which successfully combines grey-level and depth-level visual information into a compact feature vector. The results show that depth-level information is capable of significantly boosting speech recognition accuracy, and that the proposed visual feature outperforms other commonly used grey-level appearance visual features in the case of both visual and audio-visual speech recognition. Chapter 3 was submitted as a journal paper to Speech Communication.

Chapter 4 presents a visual deep bottleneck feature (DBNF) learning scheme using a stacked auto-encoder combined with other techniques, which is motivated by the recent progress in the use of deep learning techniques for acoustic speech recognition. Experimental results show that the proposed deep feature learning scheme yields an approximately 24% relative improvement in visual speech accuracy. To the best of our knowledge, this is the first study which uses deep bottleneck features over handcrafted visual features, and our work is the first to show that the deep bottleneck visual feature is able to achieve a significant accuracy improvement in the case of visual speech recognition. Chapter 4 was published as a conference paper in the Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP2015).

Chapter 5 presents a novel feature learning method for visual speech recognition using Deep Boltzmann Machines (DBM). Unlike all existing visual feature extraction techniques, which solely extract features from video sequences, the proposed method is able to exploit both acoustic information and visual information to learn a better visual feature representation. During the test stage, instead of using both audio and visual signals, only the videos are used to generate the missing audio features, and both the given visual and the generated audio features are used to produce a joint representation. Experimental results on a large-scale audio-visual corpus show that the proposed technique outperforms the hand-crafted features. Chapter 5 was published as a conference paper in the Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV2015).

Chapter 6 presents a novel framework for stacked auto-encoder training which does not require practitioners to have any deep learning specific knowledge. Instead, this framework is able to automatically choose the type of model and the parameters of the model according to the provided tasks. This method is also a universal solution which can be used for any task that can be solved by deep learning based methods. Experimental results show that the model automatically trained by the proposed method is as accurate as, if not better than, the model tuned by experts. Chapter 6 was submitted as a journal paper to Pattern Recognition Letters (currently under review).

Chapter 7 concludes the thesis with a discussion of all the methods proposed in this thesis and the main conclusions drawn from our work.


Chapter 2

Literature Review

(This chapter was published as a book chapter in Advances in Face Detection and Facial Image Analysis, Springer, 2016.)

Abstract

Exploiting the relevant speech information that is embedded in facial images has been a significant research topic in recent years, because it has provided complementary information to acoustic signals for a wide range of automatic speech recognition (ASR) tasks. Visual information is particularly important in many real applications where acoustic signals are corrupted by environmental noise. This chapter reviews the most recent advances in feature extraction and representation for Visual Speech Recognition (VSR). In comparison with other surveys published in the past decade, this chapter presents a more up-to-date survey and highlights the strengths of two newly developed approaches (i.e., graph-based learning and deep learning) for VSR. In particular, we summarise the methods of using these two techniques to overcome one of the most challenging difficulties in this area, that is, how to automatically learn good visual feature representations from facial images to replace the widely used handcrafted features. This chapter concludes by discussing potential visual feature representation solutions that may overcome the remaining challenges in this domain.

2.1 Introduction

Given that speech is widely acknowledged to be one of the most effective means of communication between humans, researchers in the automatic speech recognition (ASR) community have made great efforts to provide users with a natural way to communicate using intelligent devices. This is particularly important for disabled people, who may be incapable of using a keyboard, mouse or joystick. As a result of the great achievements made by the ASR community in recent years in terms of the application of novel techniques such as deep learning [46], people generally believe that we are getting closer to talking naturally and freely to our computers [22]. Although a number of ASR systems have been commercialised and have entered our daily lives (e.g., Apple's Siri and Microsoft's Cortana), several limitations still exist in this area. One major limitation is that ASR systems are still prone to environmental noise, thereby limiting their applications. Given ASR's vulnerability, research in the area of Visual Speech Recognition (VSR) has emerged to provide an alternative solution to improve speech recognition performance. Further, VSR systems have a wider range of applications compared to their acoustic-only speech recognition counterparts. For example, as shown in Fig. 2.1, in many practical applications where speech recognition systems are exposed to noisy environments, acoustic signals are almost unusable for speech recognition. Conversely, with the availability of front and rear cameras on most intelligent mobile devices, users can easily record facial movements to perform VSR. In extremely noisy environments, visual information basically becomes the only source that ASR systems can use for speech recognition.

Figure 2.1: Possible application scenarios of VSR. In an acoustically noisy environment, using an intelligent handset to capture and extract visual features is an effective solution for ASR.

Moreover, inspired by bimodal human speech production and perception, even in clean and moderate noise conditions, where good-quality acoustic signals are available for speech recognition, visual information can provide complementary information for ASR [90, 91]. Therefore, research on VSR is of particular importance, because once an adequate VSR result is obtained, speech recognition performance can be boosted through the fusion of audio and visual modalities. Despite the wide range of applications of VSR systems, there are two main limitations related to this area: the development of appropriate dynamic audio-visual fusion and the development of appropriate visual feature representations. Regarding dynamic audio-visual fusion, although several high-quality works on this topic have been published recently [101, 84, 27, 105], a similar fusion framework was used in most cases. More specifically, in these works, the quality of both the audio and visual signals was evaluated using different criteria, such as signal-to-noise ratio, dispersion and entropy. Weights were dynamically assigned to the audio and visual streams according to the quality of the audio and visual signals.
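As a concrete illustration of this kind of dynamic fusion, the sketch below combines per-class audio and visual log-likelihoods with an exponent weight derived from an estimated audio SNR. This is a minimal sketch rather than the scheme of any of the cited works; the logistic SNR-to-weight mapping and all function and variable names are assumptions made purely for illustration.

```python
import numpy as np

def audio_stream_weight(snr_db, midpoint=10.0, slope=0.3):
    """Map an estimated audio SNR (dB) to a stream weight in [0, 1].

    Assumed logistic mapping: high SNR -> weight near 1 (trust audio),
    low SNR -> weight near 0 (let the visual stream dominate).
    """
    return 1.0 / (1.0 + np.exp(-slope * (snr_db - midpoint)))

def fuse_log_likelihoods(audio_ll, video_ll, snr_db):
    """Weighted combination of per-class log-likelihoods from the two streams.

    audio_ll, video_ll: arrays of shape (num_classes,) holding log p(o_a | c)
    and log p(o_v | c) for one frame or one utterance.
    """
    w_a = audio_stream_weight(snr_db)
    w_v = 1.0 - w_a                       # weights tied to sum to one
    return w_a * audio_ll + w_v * video_ll

# Toy example: three classes, audio favours class 0, video favours class 1.
audio_ll = np.log(np.array([0.7, 0.2, 0.1]))
video_ll = np.log(np.array([0.2, 0.7, 0.1]))
print(np.argmax(fuse_log_likelihoods(audio_ll, video_ll, snr_db=20.0)))  # clean audio: class 0
print(np.argmax(fuse_log_likelihoods(audio_ll, video_ll, snr_db=-5.0)))  # noisy audio: class 1
```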

However, compared with audio-visual fusion methods, visual feature representation techniques are more controversial. The goal of visual feature representation is to embed spatio-temporal visual information into a compact visual feature vector. This is the most fundamental problem in VSR, because it directly affects the final recognition performance. Hence, in this survey, we mainly focus on the most recent advances in the area of visual feature representation, and we discuss potential solutions and future research directions.

Regarding audio feature representation, Mel-Frequency Cepstral Coefficients (MFCCs) are generally acknowledged to be the most widely used acoustic features for speech recognition. However, unlike audio feature extraction, there is no universally accepted visual feature extraction technique that can achieve promising results for different speakers and different speech tasks, as three fundamental issues remain unresolved [132]: 1) how to extract visual features of constant quality from videos with different head poses and positions; 2) how to remove speech-irrelevant information from the visual data; and 3) how to encode temporal information into the visual features. This chapter summarises recent research that has examined solutions to these issues, and it provides an insight into the relationships between these methods.

This chapter is organised as follows. Section 2.2 introduces handcrafted visual feature extraction methods, which are still the most widely used techniques for visual feature representation and are sometimes used in the pre-processing steps of automatic feature learning. Sections 2.3 and 2.4 respectively describe graph-based feature learning and deep learning-based feature learning methods. Finally, Section 2.5 provides insights into potential solutions for the remaining challenges and possible future research directions in this area.

2.2 Hand Crafted Feature Extraction

Before introducing visual feature learning techniques, this section describes some of the handcrafted visual features that still play a dominant role in VSR. In addition, handcrafted feature extraction methods can be used in the pre-processing steps of many visual feature learning frameworks. In terms of the type of information embedded in the features, visual features can be categorised into two classes: appearance-based and geometric-based features [14].

For appearance-based visual features, entire ROIs (e.g., the mouth, the lower face or even the whole face area) are considered informative regions in terms of VSR. However, it is infeasible to use all the pixels of the ROIs, because the dimensions of the resulting features are too large for the classifiers to process. Hence, appropriate transformations of the ROIs are used to map the images to a much lower-dimensional feature space. More specifically, given an original image I in the feature space R^D (where D is the feature dimension), appearance-based feature extraction methods seek a transformation matrix P that maps I to a lower-dimensional feature space R^d (d << D), such that the transformed feature vector contains the most speech-relevant information with a much smaller feature dimension. The Discrete Cosine Transform (DCT) [90, 91] is among the most commonly used appearance-based visual feature extraction methods. It can be formulated as:

Y(i, j) = \sum_{y=0}^{N-1} \sum_{x=0}^{N-1} I(x, y) \cos\left(\frac{\pi (2y+1) j}{2N}\right) \cos\left(\frac{\pi (2x+1) i}{2N}\right),    (2.1)

for i, j = 0, 1, ..., N-1, where N is the width and height of the mouth ROI (the value of N is a power of two) and I(x, y) is the grey-level intensity value of the ROI at pixel (x, y). To avoid the curse of dimensionality, low-frequency coefficients are selected and used as the static components of the visual feature. To encode the temporal information, the first and second derivatives of the DCT coefficients are used along with the static DCT coefficients (Y(i, j)) as the dynamic components of the visual feature.
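As a minimal illustration of this pipeline, the sketch below computes the 2D DCT of each grey-level mouth ROI with SciPy, keeps a block of low-frequency coefficients, and appends first and second temporal derivatives over a sequence of frames. The number of retained coefficients, the anti-diagonal scan and the use of a simple finite-difference derivative are illustrative assumptions rather than the exact configuration of any cited system.

```python
import numpy as np
from scipy.fftpack import dct

def dct2(roi):
    """2D type-II DCT of a square grey-level mouth ROI (N x N)."""
    return dct(dct(roi, axis=0, norm='ortho'), axis=1, norm='ortho')

def low_freq_select(coeffs, k):
    """Keep the k lowest-frequency DCT coefficients, scanned anti-diagonal by anti-diagonal."""
    n = coeffs.shape[0]
    order = sorted(((i, j) for i in range(n) for j in range(n)), key=lambda p: (p[0] + p[1], p[0]))
    return np.array([coeffs[i, j] for i, j in order[:k]])

def dct_features(roi_sequence, k=30):
    """Static DCT features plus delta and delta-delta components for each frame."""
    static = np.stack([low_freq_select(dct2(f.astype(np.float64)), k) for f in roi_sequence])
    delta = np.gradient(static, axis=0)        # first temporal derivative
    delta2 = np.gradient(delta, axis=0)        # second temporal derivative
    return np.hstack([static, delta, delta2])  # shape: (num_frames, 3 * k)

# Toy usage: 12 random 32x32 "mouth ROIs" stand in for a real lip video.
rois = np.random.rand(12, 32, 32)
print(dct_features(rois).shape)  # (12, 90)
```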

Other techniques can also be used to extract appearance-based visual features [91], such as Principal Component Analysis (PCA) [25], the Hadamard and Haar transforms [98] and the Discrete Wavelet Transform (DWT) [89]. In addition to the methods described above, other appearance-based visual feature extraction methods have been proposed. More specifically, instead of seeking a global transformation that can be applied to the entire ROI, these methods use a feature descriptor to describe a small region centred at each pixel in the ROI, and count the occurrences of the descriptor's responses over the ROI. Typical methods in this category include the Local Binary Pattern (LBP) [77] and the Histogram of Oriented Gradients (HOG) [18]. However, these methods are incapable of extracting temporal dynamic information from the ROIs. Hence, a number of variants have been proposed. For example, Zhao et al. [129] proposed a local spatio-temporal visual feature descriptor for automatic lipreading. This visual feature descriptor can be viewed as an extension of the basic LBP [77]. More specifically, to encode the temporal information into the visual feature vector, Zhao et al. [129] extracted LBP features from Three Orthogonal Planes (LBP-TOP), which contain the spatial axes of the images (X and Y) and the time axis (T), as shown in Fig. 2.2. Although the LBP-TOP feature contains rich visual speech-relevant information, the dimensionality of the original LBP-TOP feature is too large for it to be used directly for VSR. Hence, in [129], AdaBoost was used to select the most informative components from the original LBP-TOP feature for VSR. Numerous works [133, 3, 131] have used LBP-TOP for VSR, and variations of the original LBP-TOP feature have also been proposed.
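The sketch below illustrates the idea behind LBP-TOP using scikit-image's local_binary_pattern: uniform LBP histograms are computed on the XY, XT and YT planes of a mouth-ROI volume and concatenated. Using only the central plane of each axis (rather than accumulating histograms over all planes and spatial sub-blocks) and the chosen neighbourhood parameters are simplifying assumptions for illustration.

```python
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_hist(plane, n_points=8, radius=1):
    """Uniform LBP histogram of one 2D plane."""
    codes = local_binary_pattern(plane, n_points, radius, method='uniform')
    hist, _ = np.histogram(codes, bins=n_points + 2, range=(0, n_points + 2), density=True)
    return hist

def lbp_top(volume):
    """Concatenate LBP histograms from three orthogonal planes of a T x H x W lip volume.

    For brevity only the central XY, XT and YT planes are used; the full
    LBP-TOP descriptor accumulates histograms over all planes and sub-blocks.
    """
    t, h, w = volume.shape
    xy = volume[t // 2, :, :]      # spatial appearance
    xt = volume[:, h // 2, :]      # horizontal motion over time
    yt = volume[:, :, w // 2]      # vertical motion over time
    return np.concatenate([lbp_hist(p) for p in (xy, xt, yt)])

# Toy usage: a 15-frame, 40x60 mouth-ROI volume.
volume = np.random.rand(15, 40, 60)
print(lbp_top(volume).shape)  # (30,)
```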

Figure 2.2: Lip spatio-temporal feature extraction using LBP-TOP. (a) Lip block volumes; (b) lip images from three orthogonal planes; (c) LBP features from three orthogonal planes; (d) concatenated features for one block volume with appearance and motion.

Pei et al. [86] used Active Appearance Models (AAM) to track keypoints on the lips. For each small patch centred around a keypoint of the lips, LBP-TOP and HOG were used to extract texture features. In addition to the texture features, the difference between the patch positions in adjacent frames was used as a shape feature. Given that rich speech-relevant information is embedded in LBP-TOP features, a number of feature reduction techniques have been introduced to extract a more compact visual feature from the original LBP-TOP feature. In addition to the LBP-TOP feature, a number of other appearance-based feature descriptors have been used to extract temporal dynamic information, such as LPQ-TOP [54], LBP-HF [128] and LGBP-TOP [2].

Although both DCT and LBP-TOP are widely used for VSR, they are quite different because they represent visual information from different perspectives. More specifically, as shown in (2.1), each component (Y(i, j)) of the DCT feature is a representation of the entire mouth region at a particular frequency. Hence, the DCT is a global feature representation method.

Conversely, the LBP-TOP feature uses a descriptor to represent the local information in a small neighbourhood; therefore, LBP-TOP is a local feature representation method. Hence, the development of a method that can combine both global and local information in a compact feature vector would be expected to boost visual speech accuracy. Although Zhao et al. [130] showed that combining different types of visual features (LBP-TOP and EdgeMap [34]) can improve recognition accuracy, finding an effective way to combine DCT and LBP-TOP features is still an undeveloped area.

Although the dimensionality of appearance-based visual features is much smaller than the number of pixels in the ROI, it can still make the system succumb to the curse of dimensionality. Hence, a feature dimension reduction process is essential as a prior step to VSR. Among the feature reduction methods, LDA and PCA are the most widely used [90, 91]. In addition, Gurban et al. [40] presented a Mutual Information Feature Selector (MIFS) based scheme to select an informative subset of visual feature components and thus reduce the dimensionality of the visual feature vector. Unlike feature reduction schemes such as PCA and LDA, MIFS analyses each component of the visual feature vector and selects the most informative components using a greedy algorithm. In addition, Gurban et al. [40] proposed that penalising features for their redundancy is essential to yield a more informative visual feature vector.
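The sketch below illustrates this kind of greedy, redundancy-penalised selection in the spirit of MIFS: at each step it adds the feature component with the highest relevance to the class labels minus a penalty proportional to its average similarity with the components already chosen. It uses scikit-learn's mutual_info_classif for the relevance term and a simple correlation-based redundancy term; the penalty weight beta and the use of correlation instead of feature-feature mutual information are simplifying assumptions, not the exact criterion of [40].

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def greedy_mi_selection(X, y, n_select=10, beta=0.5):
    """Greedy MIFS-style selection: relevance minus a redundancy penalty.

    X: (num_samples, num_features) visual feature components, y: class labels.
    Returns the indices of the selected feature components.
    """
    relevance = mutual_info_classif(X, y, random_state=0)    # I(x_j; C) per component
    corr = np.abs(np.corrcoef(X, rowvar=False))              # proxy for pairwise redundancy
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(n_select):
        scores = []
        for j in remaining:
            redundancy = np.mean(corr[j, selected]) if selected else 0.0
            scores.append(relevance[j] - beta * redundancy)
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy usage: 200 samples of a 50-dimensional visual feature, 10 classes.
X = np.random.rand(200, 50)
y = np.random.randint(0, 10, size=200)
print(greedy_mi_selection(X, y, n_select=5))
```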

Geometric visual features explicitly model the shape of the mouth and are potentially more powerful than appearance-based features. However, they are sensitive to lighting conditions and image quality. Geometric-based features include the Deformable Template (DT), Active Shape Model (ASM), Active Appearance Model (AAM) and Active Contour Model (ACM). The DT [65] is a method that uses a parametric lip template to partition an input image into a lip region and a non-lip region. However, this approach is degraded when the shape of the lip is irregular or the mouth is opened wide [58]. The ASM [74] uses a set of landmarks to describe the lip model. The AAM approach [67] can be viewed as an extension of the ASM that incorporates grey-level information into the model. However, as the landmarks need to be manually labelled during training, it is very laborious and time-consuming to train the ASM and AAM for lip extraction. In terms of ACM-based lip extraction, there are two main categories, namely edge-based and region-based. With respect to the edge-based extraction approach, the image gradients are calculated to locate the potential lip boundary [20]. Unfortunately, given that the intensity contrast between the lip and the face region is usually not large enough, the edge-based ACM is likely to achieve incorrect extraction results. Moreover, this method has been confirmed to be prone to image noise, and it is highly dependent on the initial parameters of the ACM [58]. In terms of region-based techniques, the foreground is segmented from the background by finding the optimum intensity energy in the images. Compared to its edge-based counterpart, this method has been shown to be robust with respect to the initial curve selection and the influence of noise [58]. However, because of the appearance of the teeth and tongue, the intensity values inside the lips usually vary. In this situation, a Global region-based ACM (GACM) can fail because all of the pixels inside the lips are taken into consideration. With a Localised region-based ACM (LACM), only the pixels around the object's contour are taken into account, so this method can successfully avoid the influence of the appearance of the teeth and the tongue [17]. However, if the LACM is used solely for lip extraction and the initial contour is far away from the actual lip contour, the curve may converge to a local minimum without finding the correct lip boundary. Therefore, the initial contour needs to be specified near the lip boundary a priori.
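The sketch below shows a region-based active-contour segmentation of a cropped grey-level mouth image using scikit-image's morphological Chan-Vese implementation, which is a global region-based ACM. The elliptical initial level set centred on the image, the iteration count and the smoothing setting are assumptions made for illustration; a localised region-based ACM as in [17, 109] would additionally restrict the intensity energy to a band around the evolving contour.

```python
import numpy as np
from skimage.segmentation import morphological_chan_vese

def elliptical_init(shape, scale=0.35):
    """Binary elliptical level set centred on the image, as a rough lip initialisation."""
    h, w = shape
    yy, xx = np.mgrid[:h, :w]
    return (((yy - h / 2) / (scale * h)) ** 2 + ((xx - w / 2) / (scale * w)) ** 2 <= 1.0).astype(np.int8)

def segment_lip(gray_roi, n_iter=150):
    """Region-based ACM segmentation of a grey-level mouth ROI (values in [0, 1])."""
    init = elliptical_init(gray_roi.shape)
    # Evolves the contour so that the two regions have maximally homogeneous intensities.
    return morphological_chan_vese(gray_roi, n_iter, init_level_set=init, smoothing=2)

# Toy usage: a synthetic dark "lip" ellipse on a brighter "face" background.
roi = np.full((60, 90), 0.8)
roi[elliptical_init(roi.shape, scale=0.25).astype(bool)] = 0.3
print(segment_lip(roi).sum())  # size (in pixels) of one of the two segmented regions
```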

40 Chapter 2. Literature Review the initial contour is to detect several lip corners [64, 19] and to construct an ellipse surrounding the lip. Unfortunately, this approach is either sensitive to the image noise and illuminations or needs a complex training process. In order to effectively solve this problem, Sui et al. [109] presented a new extraction framework that synthesises the advantages of both the global and localised region-based ACMs. Although the geometric feature can explicitly model the shape of the lips, it is difficult to derive an accurate model that can describe the dynamic movement of the mouth. Hence, appearance-based features remain the most widely used features in the VSR community. 2.3 Graph Based Visual Feature Representations In most cases, the dimensions of the visual features that are extracted using handcrafted feature extraction methods are usually too large for the classifiers. Graph-based learning methods that non-linearly map the original visual features to a more compact and discriminatory feature space have also been used in recent years. Initially, graph-based methods were commonly used in human activity recognition [88]. Given that both human activity and speech recognition deal with the analysis of spatial and temporal information, graph-based feature learning can therefore also be used for VSR. The idea behind graph-based learning is that visual features can be represented as the elements of a unified feature space, and the temporal evolution of lip movements can be viewed as the trajectory connecting these elements in the feature space. Hence, after the extracted feature sequences from the videos have been correctly mapped to the corresponding trajectories, the speech can be correctly recognised. In addition, it is generally believed that the dimension of the underlying structure of the 20

Based on the above assumptions, numerous papers have proposed different frameworks that parameterise the original high-dimensional visual features as trajectories in order to extract lower-dimensional features. An illustration of the concept behind graph-based feature representation methods is shown in Fig. 2.3.

Figure 2.3: The idea behind graph-based feature representation methods is to project the original high-dimensional spatio-temporal visual features to a trajectory in a lower-dimensional feature space, thereby reducing the feature dimension to boost the performance of speech recognition. Each point p(w_t) of the projected trajectory represents a frame in the corresponding video. This figure appeared in [131]. In this work, each image x_i of the T-frame video is assumed to be generated by the latent speaker variable h and the latent utterance variable w_i.

Zhou et al. [133] proposed a path-graph based method to map the image sequence of a given utterance to a low-dimensional curve. Their experimental results showed that the recognition rate of this method is 20 per cent higher than the recognition rate reported in [129] on the OuluVS data corpus. Based on this work, the visual feature sequence of a speaker's mouth when talking is further assumed to be generated from a speaker-dependent Latent Speaker Variable (LSV) and a sequence of speaker-independent Latent Utterance Variables (LUV).

Hence, Zhou et al. [131] presented a Latent Variable Model (LVM) that separately represents the video by the LSV and the LUVs, and the LUVs are further used for VSR. Given an image sequence of length T, X = \{x_t\}_{t=1}^{T}, the LVM of an image x_t, which is generated from the inter-speaker variation h (LSV) and the dynamic changes of the mouth w_t, can be formulated by (2.2):

x_t = \mu + F h + G w_t + \epsilon_t,    (2.2)

where \mu is the global mean, F is a factor matrix whose columns span the inter-speaker space, G is the bias matrix that describes the uttering variations and \epsilon_t is the noise term. The model described in (2.2) is a compact representation of high-dimensional visual features. Compared with the 885-dimensional raw LBP-TOP feature, the six-dimensional LUV feature is very compact, and it can yield better accuracy than other features such as PCA [11], DCT [36], AF [95] and AAM [67].
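A minimal numerical illustration of the generative model in (2.2): sample a speaker variable h once per video and an utterance variable w_t per frame, generate high-dimensional frames, and then read the per-frame w_t back out by least squares given known \mu, F and G. The dimensions and the least-squares read-out are illustrative assumptions; [131] learns the model parameters and infers the latent variables probabilistically.

```python
import numpy as np

rng = np.random.default_rng(0)

D, dh, dw, T = 885, 4, 6, 20            # feature dim, LSV dim, LUV dim, frames
mu = rng.normal(size=D)                  # global mean
F = rng.normal(size=(D, dh))             # columns span the inter-speaker space
G = rng.normal(size=(D, dw))             # describes the uttering variations

h = rng.normal(size=dh)                  # one latent speaker variable per video
W = rng.normal(size=(T, dw))             # one latent utterance variable per frame
X = mu + h @ F.T + W @ G.T + 0.01 * rng.normal(size=(T, D))   # x_t = mu + F h + G w_t + eps_t

# Read out the speaker-independent utterance trajectory w_t by least squares,
# assuming mu, F and G are known; this low-dimensional trajectory is what is
# fed to the recogniser in place of the raw high-dimensional feature.
residual = X - mu - h @ F.T
W_hat, *_ = np.linalg.lstsq(G, residual.T, rcond=None)
print(np.allclose(W_hat.T, W, atol=0.05))   # True: the 6-D trajectory is recovered
```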

Pei et al. [86] presented a method based on the concept of unsupervised random forest manifold alignment. In this work, both appearance and geometric visual features were extracted from the lip videos, and the affinity of the patch trajectories in the lip videos was estimated by a density random forest. A multidimensional scaling algorithm was then used to embed the original data into a low-dimensional feature space. Their experimental results showed that this method was capable of handling large datasets and low-resolution videos effectively. Moreover, the exploitation of depth information for VSR was also discussed in this paper. Unlike the unsupervised manifold alignment approach proposed by Pei et al. [86], Bakry and Elgammal [3] presented a supervised visual feature learning framework where each video was first mapped to a manifold by manifold parameterisation [26], and kernel partial least squares was then used in the manifold parameterisation space to yield a latent low-dimensional manifold parameterisation space.

It is well known that different people speak at different rates, even when they are uttering the same word. The varying rates of speech result in random parameterisations of the same trajectory, which leads to failures in speech recognition. Hence, temporal alignment is essential for VSR to remove any temporal variabilities caused by different speech rates. Su et al. [107] applied a statistical framework (introduced in [106]) and proposed a rate-invariant manifold alignment method for VSR. In this method, each trajectory \alpha of a video sequence in the trajectory set M is represented by a Transported Square-Root Vector Field (TSRVF) with respect to a reference point c:

h_\alpha(t) = \frac{\dot{\alpha}(t)_{\alpha(t) \rightarrow c}}{\sqrt{|\dot{\alpha}(t)|}},    (2.3)

where h_\alpha(t) is the TSRVF of trajectory \alpha at time t, \dot{\alpha}(t) is the velocity vector of \alpha at time t, \dot{\alpha}(t)_{\alpha(t) \rightarrow c} denotes its parallel transport from \alpha(t) to the reference point c, and |\cdot| is the norm defined by the Riemannian metric on the Riemannian manifold. Given the TSRVFs of two smooth trajectories \alpha_1 and \alpha_2, these two trajectories can be aligned according to:

\gamma^* = \arg\min_{\gamma \in \Gamma} \int_0^1 \left| h_{\alpha_1}(t) - h_{\alpha_2}(\gamma(t)) \sqrt{\dot{\gamma}(t)} \right|^2 dt,    (2.4)

where \Gamma is the set of all diffeomorphisms of [0, 1]: \Gamma = \{\gamma : [0, 1] \rightarrow [0, 1] \mid \gamma(0) = 0, \gamma(1) = 1, \gamma \text{ is a diffeomorphism}\}. The minimisation over \Gamma in (2.4) can be solved using dynamic programming. After the trajectories have been registered, the mean of the multiple trajectories can be used as a template for visual speech classification.
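In practice, the minimisation over \Gamma in (2.4) is carried out by dynamic programming over a discrete grid of time warps, much like dynamic time warping. The sketch below aligns two feature trajectories in this way; it operates directly on Euclidean feature vectors rather than on TSRVFs, and the step pattern and squared-distance cost are simplifying assumptions for illustration.

```python
import numpy as np

def dp_align(traj_a, traj_b):
    """Dynamic-programming alignment of two trajectories (T_a x d and T_b x d).

    Returns the optimal warping path as a list of (i, j) index pairs, i.e. a
    discrete approximation of the re-parameterisation gamma in (2.4).
    """
    ta, tb = len(traj_a), len(traj_b)
    cost = np.linalg.norm(traj_a[:, None, :] - traj_b[None, :, :], axis=2) ** 2
    acc = np.full((ta, tb), np.inf)
    acc[0, 0] = cost[0, 0]
    for i in range(ta):
        for j in range(tb):
            if i == 0 and j == 0:
                continue
            prev = min(acc[i - 1, j] if i else np.inf,           # stretch traj_b
                       acc[i, j - 1] if j else np.inf,           # compress traj_b
                       acc[i - 1, j - 1] if i and j else np.inf)  # advance both
            acc[i, j] = cost[i, j] + prev
    # Backtrack from the end point to recover the warping path.
    path, i, j = [(ta - 1, tb - 1)], ta - 1, tb - 1
    while i or j:
        candidates = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min((p for p in candidates if p[0] >= 0 and p[1] >= 0), key=lambda p: acc[p])
        path.append((i, j))
    return path[::-1]

# Toy usage: the same "utterance" spoken at two different rates.
t_fast, t_slow = np.linspace(0, 1, 15), np.linspace(0, 1, 25)
a = np.stack([np.sin(2 * np.pi * t_fast), np.cos(2 * np.pi * t_fast)], axis=1)
b = np.stack([np.sin(2 * np.pi * t_slow), np.cos(2 * np.pi * t_slow)], axis=1)
print(dp_align(a, b)[:5])
```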

Although the method introduced in [107] did not produce superior performance over other recent graph-based methods [133, 3, 86, 131], and although only speaker-dependent recognition was reported, this work provided a general mathematical speech-rate-invariant framework for the registration and comparison of trajectories.

Although graph-based methods have shown promising recognition performance compared to conventional feature reduction methods [132] such as LDA and PCA, it should be noted that none of the above graph-based methods were tested on continuous speech recognition. Even though Zhou et al. [131] reported that their method achieved promising results in classifying visemes, which are generic images that can be used to describe a particular sound, it is still unclear whether their graph-based method can be used for continuous speech recognition.

2.4 Visual Feature Learning Using Deep Learning

Section 2.3 introduced various graph-based methods that can map high-dimensional visual features to non-linear feature spaces. However, the use of graph-based methods for VSR requires prior extraction of the visual features, and the classification performance largely depends on the quality of the extracted visual features. In this section, we introduce deep feature learning based methods, which can learn visual features directly from videos. These techniques offer the potential to replace handcrafted features with deep-learned features for the VSR task.

Deep learning techniques were first proposed by Hinton et al. [47], who used a greedy, unsupervised, layer-wise pre-training scheme to train a Restricted Boltzmann Machine (RBM) to model each layer of a Deep Belief Network (DBN), which effectively addressed the difficulty of training neural networks with multiple hidden layers. Later works showed that a similar pre-training scheme could also be used by stacked auto-encoders [7] and Convolutional Neural Networks (CNNs) [93]. These techniques achieved great success in various classification tasks, such as acoustic speech recognition and image set classification [41, 42].

Figure 2.4: Two RBM-based deep models. Blue circles represent input units and red circles represent hidden units. (a): An RBM. (b): A stacked RBM-based auto-encoder.

After deep learning techniques had been successfully applied to feature learning in a single modality, Ngiam et al. [73] applied them to a bimodal (i.e., audio and video) task. This was the first deep learning work in the domain of VSR and Audio-Visual Speech Recognition (AVSR). Since then, a number of other methods have been proposed that employ deep learning techniques to learn visual features for visual speech classification. The deep learning techniques used for VSR and AVSR can be categorised into three types: RBM-based deep models, stacked denoising auto-encoder based methods and CNN-based methods.

The RBM is a particular type of Markov random field with hidden variables h and visible variables v (Fig. 2.4a). The connections W_{ij} between the visible and hidden variables are symmetric, but there are no connections within the hidden variables or within the visible variables. The model defines a probability distribution P(v, h) over v and h via an energy function, which can be formulated as (2.5). The log-likelihood of P(v, h) can be maximised by minimising the energy function in (2.5):

E(v, h; \theta) = -\sum_{i=1}^{m} \sum_{j=1}^{n} W_{ij} v_i h_j - \sum_{i=1}^{m} b_i v_i - \sum_{j=1}^{n} a_j h_j,    (2.5)

where a_j and b_i are the biases of the hidden and visible units respectively, m and n are the numbers of visible and hidden units, and \theta denotes the parameters of the model. As the computation of the gradient of the log-likelihood is intractable, the parameters of the model are usually learned using contrastive divergence [44]. With a proper configuration of the RBM, the visual feature is fed to the first layer of the RBM, the posteriors of the hidden variables (given the visible variables) are obtained using p(h_j = 1 | v) = sigmoid(a_j + W_j^T v), and these posteriors can be used as the new training data for the successive layers of the RBM-based deep network. This process is repeated until the subsequent layers are all pre-trained.
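The sketch below makes the energy function (2.5) and the layer-wise pre-training recipe concrete: it evaluates E(v, h; \theta), computes the hidden posteriors p(h_j = 1 | v), performs one step of CD-1 contrastive divergence, and shows how the hidden posteriors of one RBM become the input of the next. The array shapes, learning rate and single Gibbs step are illustrative; this is a toy sketch rather than a production RBM trainer.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

class RBM:
    def __init__(self, n_visible, n_hidden, lr=0.05):
        self.W = 0.01 * rng.normal(size=(n_visible, n_hidden))  # W_ij, visible i x hidden j
        self.b = np.zeros(n_visible)                             # visible biases b_i
        self.a = np.zeros(n_hidden)                              # hidden biases a_j
        self.lr = lr

    def energy(self, v, h):
        """E(v, h; theta) as in (2.5)."""
        return -v @ self.W @ h - self.b @ v - self.a @ h

    def hidden_probs(self, v):
        """p(h_j = 1 | v) = sigmoid(a_j + W_j^T v), computed for a batch of v."""
        return sigmoid(self.a + v @ self.W)

    def cd1_update(self, v0):
        """One contrastive-divergence (CD-1) parameter update on a batch v0."""
        ph0 = self.hidden_probs(v0)
        h0 = (rng.random(ph0.shape) < ph0).astype(float)          # sample hidden states
        pv1 = sigmoid(self.b + h0 @ self.W.T)                     # reconstruct visibles
        ph1 = self.hidden_probs(pv1)
        self.W += self.lr * (v0.T @ ph0 - pv1.T @ ph1) / len(v0)
        self.b += self.lr * (v0 - pv1).mean(axis=0)
        self.a += self.lr * (ph0 - ph1).mean(axis=0)

# Greedy layer-wise pre-training: the hidden posteriors of one RBM are the
# training data for the next, as described in the text.
data = (rng.random((500, 64)) > 0.5).astype(float)   # toy binary "visual features"
layer_sizes = [64, 32, 16]
inputs = data
for n_vis, n_hid in zip(layer_sizes[:-1], layer_sizes[1:]):
    rbm = RBM(n_vis, n_hid)
    for _ in range(10):
        rbm.cd1_update(inputs)
    inputs = rbm.hidden_probs(inputs)                 # feed posteriors to the next layer
print(inputs.shape)  # (500, 16)
```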

In Ngiam et al.'s work [73], a deep auto-encoder, which consisted of multiple layers of sparse RBMs [62], was used to learn a shared representation of the audio and visual modalities for speech recognition. The authors discussed two learning architectures in their paper. The first model investigated was cross-modality learning, where the model learned to reconstruct both the audio and video modalities, while only the video was used as an input during the training and testing stages. The second model was the multimodal deep auto-encoder, which was trained with both audio and video data. However, two-thirds of the training data had zero values in one of the input modalities (e.g., video), while the original values were used in the other input modality (e.g., audio). Experimental results in [73] showed an improvement over previous handcrafted visual features [67, 40, 129]. However, their bimodal deep auto-encoder did not outperform their video-only deep auto-encoder, because the bimodal auto-encoder might not have been optimal when only the visual input was provided.

Given the inefficiency of the bimodal auto-encoders proposed in [73], Srivastava et al. [102] used a Deep Boltzmann Machine (DBM), which was first proposed in [97], for AVSR. Like the deep learning models introduced above, the DBM is also a member of the Boltzmann machine family of models, and it has the potential to learn complex and non-linear representations of the data. Moreover, it can also exploit information from a large amount of unlabelled data for pre-training purposes. The major difference between the DBM and other RBM-based models is shown in Fig. 2.5. Unlike other RBM-based models, which only employ a top-down approximate inference procedure, the DBM incorporates a bottom-up pass with top-down feedback. Given that the approximate inference procedure of the DBM has two directions, the DBM is an undirected model (Fig. 2.5b), while other RBM-based models are directed (Fig. 2.4b and Fig. 2.5a). Because of this undirected characteristic, the DBM is more capable of handling uncertainty in the data, and it is more robust to ambiguous inputs [97].

Before applying the DBM model to AVSR, Srivastava et al. [103] first applied the DBM to image and text classification, which is also a multimodal learning task. In their work, the image and text data were first trained separately using two single-stream DBMs, and the outputs of these two single-stream DBMs were then merged to train joint representations of the image and text data.

48 Chapter 2. Literature Review their work, the image and text data were trained separately using two single-stream DBMs, and the outputs of these two single-stream DBMs were then merged to train joint representations of the image and text data. As the image and text data are highly correlated, it is difficult for the model proposed in [73] to learn these correlations and produce multimodal representations. In fact, as the approximation inference procedure is directed, the responsibility of the multimodal modelling falls entirely on the joint layer [103]. In contrast, the model introduced in [103] solved this challenge effectively because the DBM can approximate the learning model both from the top-up pass and the bottom-down feedback, which makes the multimodal modelling responsibility spread out over the entire network [103]. Moreover, as shown in Fig. 2.5a, the top two layers of the DBN consist of an RBM (which is an undirected model), while the remaining lower layers form a directed generative model. Hence, the directed DBN model is not capable of modelling the missing inputs. Conversely, as the DBM is an undirected generative model and employs a two-way approximate inference procedure, it can be used to generate a missing modality by clamping the observed modality at the inputs and running the standard Gibbs sampler. In [103], the DBM was shown to be capable of generating missing text tags from the corresponding images. Srivastava et al. [102] then used this model for the task of AVSR. Experimental results on the CUAVE [85] and AVLetters [67] datasets showed that the multimodal DBM can effectively combine features across modalities and achieve slightly better results than the video deep auto-encoder proposed in [73]. Although this work demonstrated that the DBM could combine features effectively for speech recognition across audio and visual modalities, the inference of audio from the visual feature was not discussed. However, it provides a method that may be able to solve the problem proposed in [73]-that is, how to generate the missing audio from the 28

49 Chapter 2. Literature Review video. Despite these promising results, it should be noted that all of the aforementioned deep learning-based VSR methods have the objective of learning a more informative spatio-temporal descriptor that extracts speech-relevant information directly from the video. However, in order to use deep learning techniques for real-world VSR applications, sequential inference-based approaches, which are widely used by the acoustic speech recognition community, need to be developed. In terms of acoustic continuous speech recognition, Mohamed et al. [72] developed acoustic phone recognition using a DBN. In this work, MFCCs were used as an input to the DBN. The DBN was pre-trained layer by layer, followed by a fine-tuning process that used 183 target class labels (i.e., three states for each of the 61 phonemes). The output of the DBN represents the probability distribution over possible classes. The probability distribution yielded by the DBN was fed to a Viterbi decoder to generate the final phone recognition results. Inspired by this method, Huang and Kingsbury [49] presented a similar framework for AVSR. Compared with the Hidden Markov Model/Gaussian Mixture Model (HMM/GMM) framework, the DBN achieved a 7 per cent relative improvement on the audio-visual continuously spoken digit recognition task. This work also presented a mid-level feature fusion method that concatenated the hidden representations from the audio and visual DBN, and the LDA was then used to reduce the dimensionality of the original concatenated hidden representations. At the last stage, the LDA projected representations were used as inputs to train a HMM/GMM model, and achieved a 21 per cent relative gain over the baseline system. However, using the DBN for visual-only speech recognition did not produce any improvements over the standard HMM/GMM model in [49]. In addition to the RBM-based deep learning techniques introduced above, Vincent 29

Figure 2.5: Different deep models. The blue and orange circles represent input units, the red circles represent hidden units, and the green circles represent representation units. (a): A DBN. (b): A DBM. (c): A multimodal DBM. Comparing (a) with (b), one can note that the DBN model is directed, while the DBM model is undirected.

51 Chapter 2. Literature Review et al. [116] proposed a Stacked Denoising Auto-encoder (SDA) based on a new scheme to pre-train a multi-layer neural network. Instead of training the RBM to initialise the hidden units, the hidden units are learned by reconstructing input data from artificial corruption. Palecek [82] explored the possibility of using the auto-encoder to learn useful feature representations for the VSR task. The learned features were further processed by a hierarchical LDA to capture the speech dynamics before feeding them into the HMM for classification. The auto-encoder-learned features produced a 4-8 per cent improvement in accuracy over the standard DCT feature in the case of isolated word recognition. However, only the single-layer auto-encoder was discussed in their paper [82], suggesting that the superiority of the stacked auto-encoder was not fully analysed. In addition to the conventional SDA, deep bottleneck feature extraction methods based on SDA [124, 96, 33] were extensively used in acoustic speech recognition. Inspired by the deep bottleneck audio features for continuous speech recognition, Sui et al. [111] developed a deep bottleneck feature learning scheme for VSR. This technique was successfully used with the connected word VSR, and it demonstrated superior performance over handcrafted features such as DCT and LBP-TOP [111]. Although RBM-based deep networks and SDA-based methods achieved an impressive performance for various tasks, these techniques did not take the topological structure of the input data into account (e.g., the 2D layout of images and the 3D structure of videos). However, topological information is very important for visual-driven tasks, because a large amount of speech-relevant information is embedded in the topological structure of the video data. Hence, developing a method to explore the topological structure of the input should help to boost VSR performance. The CNN model proposed by Lecun et al. [59] can exploit the spatial correlation that is presented in input images. This model has achieved great success in visual-driven tasks in recent years 31

52 Chapter 2. Literature Review [60]. Noda et al. [76, 75] developed a lipreading system based on a CNN to recognise isolated Japanese words. In their paper, the CNN was trained using mouth images as input to recognise the phonemes. The parameters of the fully trained CNN were used as features for the HMM/GMM models. The experimental results showed that their proposed CNN-learned features significantly outperformed those acquired by PCA. A number of deep learning-based methods have achieved promising results in the case of acoustic speech recognition. However, their use in the task of VSR has not yet been explored. For example, deep recurrent neural networks [37] have been recently proposed for acoustic speech recognition. It would be interesting to explore their applications to VSR in future research. 2.5 Discussion This chapter provides an overview of some handcrafted, graph-based and deep learningbased visual features that have recently been proposed. To compare the VSR performance achieved by the different visual feature representations, we list the performance of these methods for three popular publicly available visual speech corpora in Table 2.1. The table shows that graph-based and deep learning-based methods generally perform better than handcrafted feature-based approaches. Although some geometricbased handcrafted features [66, 83, 84] achieved more accurate results compared to the graph-based and deep learning-based methods, it is required that the landmarks on the facial area are laboriously labelled beforehand. On this basis, the VSR research community generally recognises that graph-based and deep learning-based methods should be the focus of future research. Most graph-based and deep learning-based methods have been developed in an at- 32

Table 2.1: Summary of the recently proposed multi-speaker and speaker-independent visual-only speech recognition performance on popular and publicly available visual speech corpora.

Data Corpus   Feature Category   Feature Extraction Method     Classifier   Accuracy
AVLetters     Hand crafted       ASM [67]                      HMM          26.91%
                                 Optical Flow                  SVM          32.31%
                                 AAM [67]                      HMM          41.9%
                                 MSA [67]                      HMM          44.6%
                                 DCT                           SVM          53.46%
                                 LBP-TOP [129]                 SVM          58.85%
              Graph-based        Bakry and Elgammal [3]        SVM          65.64%
              Deep learning      Ngiam et al. [73]             SVM          64.4%
                                 Srivastava et al. [102]       SVM          64.7%
OuluVS        Hand crafted       LBP-TOP [129]                 SVM          62.4%
              Graph-based        Ong and Bowden [79]           SVM          65.6%
                                 Zhou et al. [133]             SVM          81.3%
                                 Bakry and Elgammal [3]        SVM          84.84%
                                 Zhou et al. [131]             SVM          85.6%
                                 Pei et al. [86]               SVM          89.7%
CUAVE         Hand crafted       DCT [40]                      HMM          64%
                                 AAM [83]                      HMM          75.7%
                                 Lucey and Sridharan [66]      HMM          77.08%
                                 Visemic AAM [84]              HMM          83%
              Deep learning      Ngiam et al. [73]             SVM          66.7%
                                 Srivastava et al. [102]       SVM          69.0%

54 Chapter 2. Literature Review tempt to pose lipreading as a classification problem. However, in order to employ VSR for connected and continuous speech applications, the VSR problem should be tackled in a similar way to a speaker-independent acoustic speech recognition task [51]. In terms of continuous speech recognition, instead of extracting holistic visual features from the videos, visual information needs to be represented in a frame-wise mannerthat is, the spatio-temporal visual features should be extracted frame by frame, and the temporal dynamic information needs to be captured by the classifiers (e.g., HMM). Given that acoustic modelling for speech recognition using deep learning techniques has been extensively investigated by the speech community, and given that some of these systems have already been commercialised in recent years [46], it is worth investigating whether these methods can be used for VSR. Another challenge in the area of VSR is that a large-scale and comprehensive data corpus needs to be available. Although there are a large number of data corpora available for VSR research, all of the existing ones cannot satisfy the ultimate goal, which is to build a practical lipreading system that can be used in real-life applications. That is, in order to treat the VSR problem in a way that is similar to continuous speech recognition, which needs to capture the temporal dynamics of the data (e.g., by using HMMs), a large-scale audio-visual data corpus needs to be established, as this will provide visual speech in the same context as audio speech. Currently, popular benchmark corpora such as AVLetters [67], CUAVE [85] and OuluVS [129] are not fully useful because they are limited in both speaker number and speech content. In addition, some largescale data corpora such as AVTIMIT [43], IBMSR [66], IBMIH [50] are not publicly accessible. Although the publicly available XM2VTSDB [69] has 200 speakers, the speech is limited to simple sequences of isolated word and digit utterances. A large-scale and comprehensive data corpus called AusTalk was recently created [12, 119, 13, 110]. 34

55 Chapter 2. Literature Review AusTalk is a large 3D audio-visual database of spoken Australian English recorded at 15 different locations in all states and territories of Australia. The contemporary voices of one thousand Australian English speakers of all ages have been recorded in order to capture the variability in their accent, linguistic characteristics and speech patterns. To satisfy a variety of speech-driven tasks, several types of data have been recorded, including isolated words, digit sequences and sentences. Given that the AusTalk data corpus is a relatively new dataset, only a few works have used this data corpus to date [112, 109, 111]. A comprehensive review on the availability of data corpora can also be found in [132]. This chapter reviewed the recent advances in the area of visual speech feature representation. One can conclude from this survey that graph-based and deep learningbased feature representations are generally considered state-of-the-art. Instead of directly using handcrafted visual features for the VSR task, handcrafted visual feature extraction methods are widely used during the pre-processing phase before the extraction of visual features that are finally used for graph-based and deep learning techniques. Despite the exciting recent achievements by the VSR community, several challenges still need to be addressed before a system is developed that can fulfil the specifications of real-life applications. We have summarised the major challenges and proposed possible solutions in this chapter. 35


57 Chapter 3 A Cascade Gray-Stereo Visual Feature Extraction Method For Visual and Audio-Visual Speech Recognition 1 Abstract Although stereo information has been extensively used in computer vision tasks recently, the incorporation of stereo visual information in Audio-Visual Speech Recognition (AVSR) systems and whether it can boost the speech accuracy still remains a largely undeveloped area. This paper addresses three fundamental issues in this area: 1) Will the stereo features benefit visual and audio-visual speech recognition? 2) If so, how much information is embedded in stereo features? 3) How to encode both planar and stereo information in a compact feature vector? In this study, we propose a comprehensive study on the characteristics of both planar and stereo visual features, and extensively analyse why the stereo information can boost the visual speech recogni- 1 Submitted in Speech Communication (accept with minor revision). 37

58 Chapter 3. A Cascade Gray-Stereo Visual Feature Extraction Method For Visual and Audio-Visual Speech Recognition tion. Based on the different information embedded in planar and stereo features, we present a new Cascade Hybrid Appearance Visual Feature (CHAVF) extraction scheme which successfully combines planar and stereo visual information into a compact feature vector, and evaluate this novel feature on visual and audio-visual connected digit recognition and isolated phrase recognition. The results show that stereo information is capable of significantly boosting the speech recognition, and the performance of our proposed visual feature outperforms the other commonly used appearance-based visual features on both the visual and audio-visual speech recognition tasks. Particularly, our proposed planar-stereo visual feature yields approximately 21% relative improvement over the planar visual feature. To the best of our knowledge, this is the first paper that extensively evaluates the different characteristics of planar and stereo visual features, and we first show that using the stereo feature along with the planar feature can significantly boost the accuracy on a large-scale audio-visual data corpus. Keywords: Audio-visual speech recognition, planar-stereo visual information, hybridlevel visual feature 3.1 Introduction Speech has long been acknowledged as one of the most effective and natural means of communication between human beings. In recent decades, continuous and substantial progress has been made in the development of Automatic Speech Recognition (ASR) systems. However, in most practical applications the accuracy of ASR systems is negatively affected by exposure to noisy environments. Research on audio-visual speech recognition has been undertaken to overcome the recognition degradation that occurs 38

59 Chapter 3. A Cascade Gray-Stereo Visual Feature Extraction Method For Visual and Audio-Visual Speech Recognition in the presence of acoustic noise [91]. Despite the promising application perspective, state-of-the-art audio-visual speech recognition systems still cannot achieve adequate performance in practical applications, because most Visual Speech Recognition (VSR) systems use grey or colour information which is highly sensitive to a number of variables, such as varying illumination and head poses. Given the limitations of the current texture information based VSR systems, exploiting stereo information can be an effective option in overcoming these challenges. Furthermore, with the availability of affordable stereo cameras, the utilisation of 3D information has led to some great successes in the computer vision community [63, 38, 100]. However, using 3D visual information on VSR to boost speech recognition accuracy has not been sufficiently studied. Motivated by the encouraging performance of 3D computer vision based methods, this paper proposes a visual and audio-visual speech recognition system that combines planar and stereo information to boost visual and audio-visual speech accuracy. This paper makes the following two major contributions: First, it develops a novel cascade feature extraction method that can effectively encode both planar and stereo visual information into a compact visual feature. Moreover, this new visual feature carries both global and local appearance-based visual information, and our paper shows that both local and global appearancebased visual information is able to contribute to speech recognition. Experimental results demonstrate that the performance of this proposed feature extraction method achieves promising accuracy in different speech recognition tasks. To the best of our knowledge, this is also the first comprehensive work that experimentally demonstrates the efficacy of stereo visual features for speaker- 39

60 Chapter 3. A Cascade Gray-Stereo Visual Feature Extraction Method For Visual and Audio-Visual Speech Recognition independent continuous speech recognition on a large-scale (162 speakers) audiovisual corpus. The experimental results also demonstrate an improved performance with the integration of planar features. Since using stereo visual information for VSR is still a largely undeveloped area but has promising potential applications, this paper is expected to provide the community a new perspective to overcome the limitations of planar visual speech features. The rest of this paper is organised as follows: Section 3.2 introduces some related works. Section 3.3 introduces two widely used appearance-based feature extraction methods, followed by an introduction of our proposed feature extraction scheme. The system performance is extensively evaluated in Section 3.4. Finally, Section 3.5 sets out the relevant conclusions that can be drawn from this research. 3.2 Related Works In this section, a brief and up-to-date overview of some recent works relating to visual and audio-visual speech recognition is presented. A more comprehensive review can be found in [91, 90, 132]. The visual features used in visual and audio-visual speech recognition can be divided into three main categories: i) lip appearance-based features; ii) lip shape-based features; and iii) lip motion-based features [14]. In most visual and audio-visual speech recognition systems, lip motion-based features work collaboratively with the first two types of visual features to represent both temporal and spatial information of the lips. Appearance-based visual feature extraction methods usually consider the whole lip or the lower face region as the most informative region for visual speech recognition. Among appearance-based visual features, the Discrete Cosine Transform (DCT) 40

61 Chapter 3. A Cascade Gray-Stereo Visual Feature Extraction Method For Visual and Audio-Visual Speech Recognition is the most widely used [40, 27, 104]. In terms of the utilisation of depth information, Galatas et al. [31, 30] have conducted encouraging pioneering research that employs depth DCT information for visual and audio-visual speech recognition. However, the integration of the depth and planar DCT features did not show significant improvement over the planar DCT features. Wang et al. [120, 121] also conducted a similar research using 3D data acquired using a Kinect. However, these works used a small audio-visual data corpus limited in both speaker number and speech content. Furthermore, the differences between stereo and planar features were not analysed. Given the current limitations of the stereo based visual speech recognition research, in our work, we analysed the different characteristics of planar and stereo visual features, and we propose a new feature combination method that can integrate both planar and stereo features into a compact feature vector, and show that the use of stereo information is able to significantly boost speech accuracy on a large-scale audio-visual data corpus. In addition to DCT features, Local Binary Pattern (LBP) features have also been widely used in the computer vision community. This feature representation method has been shown to boost accuracy on various computer vision tasks [127, 10]. Based on the success of the LBP feature, Zhao et al. [129] introduced an LBP based spatiotemporal visual feature, called LBP-Three Orthogonal Planes (LBP-TOP), and this feature achieved impressive results in both speaker-independent and speaker-dependent visual speech recognition tasks. Given the great success of the LBP-TOP features, numerous graph based methods have been proposed in recent years [133, 79, 80, 4, 86, 107]. These methods were able to non-linearly map the original LBP-TOP feature to a more compact and discriminative feature space and achieved very promising classification results. However, it should be noted that none of these graph-based methods have been tested for continuous 41

62 Chapter 3. A Cascade Gray-Stereo Visual Feature Extraction Method For Visual and Audio-Visual Speech Recognition speech recognition. Zhou et al. [131] reported that their method achieved promising results on classifying visemes; however, it is still unclear whether the graph based method can be used for continuous speech recognition. On the other hand, this paper proposes a method that embeds LBP-TOP features into a compact feature vector and experimentally shows that it is effective for continuous speech recognition tasks. For lip shape-based features, the lip contours of speakers are first extracted from an image sequence and a parametric or statistical lip contour model is obtained. The parameters of the lip model are then used as visual features; typical methods used in this category include the Active Contour Model (ACM), the Active Shape Model (ASM) and the hybrid appearance and shape model, i.e., Active Appearance Model (AAM) [68, 67]. These shape-based visual features are able to explicitly capture the shape variations of the lips; however, it should be noted that the training process of the shape-based features is very time-consuming, as a large number of lip landmarks need to be laboriously labelled. This manual labelling process becomes infeasible when the corpus has a large number of recordings. Furthermore, speech accuracy degrades if the model is not appropriately and sufficiently trained [91]. Conversely, appearance-based feature extraction methods are computationally efficient and do not require any training processes. Thus, appearance-based methods are more suitable for robust visual and audio-visual speech recognition tasks with a large number of speakers. Given the advantages of the appearance-based visual features mentioned above, a new framework is presented in this paper that successfully includes both planar and stereo appearance-based features. This approach is shown to boost speech accuracy for visual and audio-visual speech recognition systems. 42

3.3 Visual Feature Extraction

In this section, two of the most widely used appearance-based visual features of recent years, DCT and LBP-TOP, are reviewed [40, 27, 104, 129, 133, 79, 80, 73, 131]. Then, the different types of information embedded in these two appearance-based visual features are discussed, which provides the motivation behind the proposed approach. Motivated by the different characteristics of these appearance-based features, the proposed Cascade Hybrid Appearance Visual Feature (CHAVF) is also introduced in this section.

3.3.1 DCT

The DCT has been widely used in many visual and audio-visual speech recognition systems, as it can preserve speech-relevant information in a feature vector of low dimension. In this study, it is applied to a sequence of frames of the mouth region. The DCT of one frame of the mouth-region video is given by:

Y(i, j) = \sum_{y=0}^{N-1} \sum_{x=0}^{N-1} f(x, y) \cos\!\left(\frac{\pi(2y+1)j}{2N}\right) \cos\!\left(\frac{\pi(2x+1)i}{2N}\right),    (3.1)

for i, j, x, y = 0, 1, 2, ..., N-1, where N is the width and height of the mouth ROI, and f(x, y) is the planar or stereo intensity value of the mouth ROI at pixel (x, y). To reduce the computational cost and retain feature discrimination, 32 low-frequency DCT coefficients were selected in a zig-zag, left-to-right scanning pattern. Each of these 32 DCT coefficients lies in the even columns of the DCT image, due to the lateral symmetry of the mouth region [92]. The 32 first and 32 second temporal derivatives were computed to capture the dynamic information of the utterances. For both planar and

stereo DCT features, these static and dynamic coefficients constitute a 96-dimensional feature vector that represents the speech-related information. Furthermore, utterance-level feature mean normalisation was used to compensate for illumination variations.
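As a concrete illustration of the DCT pipeline of Section 3.3.1, the sketch below selects low-frequency coefficients in zig-zag order and appends their first and second temporal derivatives. The use of SciPy's dctn, NumPy gradients for the derivatives and the simple zig-zag helper are assumptions of this sketch, and the even-column restriction mentioned above is ignored here for brevity.

```python
import numpy as np
from scipy.fft import dctn   # type-II 2D DCT

def zigzag_indices(n):
    """Return (row, col) pairs of an n x n grid in zig-zag order."""
    return sorted(((r, c) for r in range(n) for c in range(n)),
                  key=lambda rc: (rc[0] + rc[1],
                                  rc[1] if (rc[0] + rc[1]) % 2 else rc[0]))

def dct_features(frames, n_coeff=32):
    """frames: (T, N, N) grey or depth mouth ROIs -> (T, 3 * n_coeff) features."""
    idx = zigzag_indices(frames.shape[1])[:n_coeff]
    rows = np.array([r for r, _ in idx])
    cols = np.array([c for _, c in idx])
    # Static low-frequency coefficients, one row per frame
    static = np.stack([dctn(f, norm='ortho')[rows, cols] for f in frames])
    # First and second temporal derivatives capture the utterance dynamics
    delta = np.gradient(static, axis=0)
    delta2 = np.gradient(delta, axis=0)
    feat = np.hstack([static, delta, delta2])      # (T, 96) when n_coeff = 32
    return feat - feat.mean(axis=0)                # utterance-level mean normalisation
```

Running the same function on the grey and the depth mouth sequences and concatenating the two 96-dimensional outputs gives the 192-dimensional hybrid DCT vector used later in the cascade.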

3.3.2 LBP-TOP

As an alternative to DCT, Zhao et al. [129] introduced a spatio-temporal local texture feature based on LBP and used it for visual speech recognition. This feature extraction method extracts LBP information from both the spatial and the temporal domains, i.e., from Three Orthogonal Planes (TOP); it is therefore referred to as LBP-TOP.

Figure 3.1: Lip spatio-temporal feature extraction using LBP-TOP. (a) Lip block volumes; (b) lip images from the three orthogonal planes; (c) LBP features from the three orthogonal planes; (d) concatenated appearance and motion features for one block volume.

Unlike the basic LBP for static images, LBP-TOP extends feature extraction to the spatio-temporal domain, which makes LBP an effective dynamic texture descriptor. Given the two spatial coordinates X, Y and the temporal coordinate T, a histogram is generated to accumulate the occurrences of the different binary patterns in the XY, XT and YT planes (see Figure 3.1). Once the LBP histograms have been generated from the three planes, a feature vector is constructed by concatenating them, so that both the lip appearance and its motion are represented. Uniform patterns [78] were used in this study to reduce the dimension of the feature vector, giving a 177-dimensional LBP-TOP feature. To improve speech recognition performance, the mouth region was further divided into several subregions, as elaborated in [129], and the LBP-TOP features extracted from the subregions were concatenated. We experimentally found that dividing the mouth image into 2 × 5 regions (see Figure 3.2) achieves the best results. Hence, extracting the planar and stereo LBP-TOP features from the 2 × 5 regions results in a 1770-dimensional feature (i.e., 177 × 10) per modality, and a 3540-dimensional hybrid feature is formed after concatenating the planar and stereo features.

Figure 3.2: The mouth region is divided into 10 subregions.
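The sketch below outlines LBP-TOP extraction with scikit-image's uniform LBP (P = 8 neighbours, radius R = 1) applied to the three orthogonal planes of a block volume, followed by the 2 × 5 sub-region concatenation described above. The per-plane slicing, the 59-bin histogram size and the scikit-image calls are assumptions of this simplified sketch rather than a literal re-implementation of [129].

```python
import numpy as np
from skimage.feature import local_binary_pattern

N_BINS = 59   # number of non-rotation-invariant uniform codes for P = 8

def lbp_hist(plane_img, P=8, R=1):
    """59-bin uniform-pattern histogram of one 2D slice."""
    codes = local_binary_pattern(plane_img, P, R, method='nri_uniform')
    hist, _ = np.histogram(codes, bins=N_BINS, range=(0, N_BINS))
    return hist

def lbp_top(volume, P=8, R=1):
    """volume: (T, H, W) block volume -> 177-dim LBP-TOP descriptor (XY, XT, YT)."""
    T, H, W = volume.shape
    hists = np.zeros((3, N_BINS))
    for t in range(T):                       # XY plane: ordinary LBP per frame
        hists[0] += lbp_hist(volume[t], P, R)
    for y in range(H):                       # XT plane: horizontal space-time slices
        hists[1] += lbp_hist(volume[:, y, :], P, R)
    for x in range(W):                       # YT plane: vertical space-time slices
        hists[2] += lbp_hist(volume[:, :, x], P, R)
    hists /= hists.sum(axis=1, keepdims=True)    # normalise each plane's histogram
    return hists.ravel()                         # concatenated appearance + motion

def lbp_top_blocks(volume, rows=2, cols=5):
    """Split the mouth volume into rows x cols sub-regions -> (rows*cols*177,) vector."""
    T, H, W = volume.shape
    feats = [lbp_top(volume[:, r*H//rows:(r+1)*H//rows, c*W//cols:(c+1)*W//cols])
             for r in range(rows) for c in range(cols)]
    return np.concatenate(feats)             # 1770-dim for one grey (or depth) volume
```

Applying lbp_top_blocks to the grey and the depth volumes and concatenating the two outputs yields the 3540-dimensional hybrid feature mentioned above.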

66 Chapter 3. A Cascade Gray-Stereo Visual Feature Extraction Method For Visual and Audio-Visual Speech Recognition Cascade Hybrid Appearance Visual Feature Extraction Comparing two most commonly used appearance features (i.e., DCT and LBP-TOP) introduced above, one can note that these two feature methods extract features from two different information representation perspectives. As detailed in Eq. 3.1, each component Y(i, j) of the DCT feature is a representation of the entire mouth region at a particular frequency. Thus, the DCT is a global feature representation method. Conversely, the LBP-TOP feature uses a descriptor to represent the local information in a small neighbourhood. Therefore, the LBP-TOP is a local feature representation. Also, as experimentally analysed in Section 3.4, these two types of features carry different kinds of information. Therefore, finding a way to embed both global and local information into a compact visual feature vector should achieve better speech accuracy compared to the individual application of these two widely used visual features. However, simply concatenating these two types of features is not practical for speech recognition, as the feature vector would be too large and make the system succumb to the curse of dimensionality. Hence, a feature dimension reduction process is essential to represent both local and global information in a single compact feature vector. The above analysis motivated the development of a cascade feature extraction framework that was able to combine both global and local information using a compact feature vector. Figure 3.3 provides an overview of the proposed system. In the first stage, the mouth image sequence is fed into the feature extraction block that consisted of two parallel procedures; that is, global appearance-based and local appearance-based feature extraction. The global appearance-based feature extraction uses DCT to preserve speech relevant information with a 192 dimensional feature vector (see Section 3.3.1). 46

67 Chapter 3. A Cascade Gray-Stereo Visual Feature Extraction Method For Visual and Audio-Visual Speech Recognition Figure 3.3: Cascade Hybrid Appearance Visual Feature (CHAVF) extraction. LDA is then used to reduce the dimensionality of the DCT feature vector. However, it was found that LDA usually fails to obtain a proper transformation for feature reduction when the raw 3540-dimensional LBP-TOP feature was applied on large-scale continuous speech recognition. Consider the continuous digit sequence recognition introduced in Section 3.4; in this experiment, HMMs were employed to model each of the 11 digits (two pronunciations for zero). Thus, the total target classification for LDA was 330 (i.e., 30 11). However, modelling LDA on 3540-dimensional features for 330 classes would require an extremely large amount of video data. The data corpus used in this work is one of the largest digit sequence corpus available; however, the amount of data (approximately 300,000 frames) was still insufficient for LDA modelling. Besides the conventional LDA, a kernel LDA [99] that nonlinearly maps the original LBP-TOP features to a feature space using the kernel trick could solve this insufficient data problem, a kernel LDA is computationally expensive and, in this case, would have been intractable, as the number of training examples would have been too large for a kernel LDA to process. Yu and Yang [125] proposed a direct LDA to overcome the training difficulty when the number of training samples is smaller than the feature dimension. In our work, we compared our proposed method with the direct LDA 47

based method, and report the results in Section 3.4. In addition to LDA and its variants, numerous graph-based feature reduction methods for LBP-TOP have been proposed in recent years [133, 79, 80, 4, 86, 107]; however, all of these works focus on the speech classification problem, which poses a simpler feature reduction problem than continuous speech recognition, and their high computational requirements limit their applicability to this task. Fortunately, Gurban et al. [40] proposed less computationally demanding methods, called Mutual Information Feature Selectors (MIFS), for visual speech recognition. MIFS was chosen for LBP-TOP feature reduction for two reasons. First, MIFS has a strong theoretical justification that comes from Fano's inequality [28], which gives a lower bound on the probability of error in the system; thus, a feature with high mutual information with the classes is more helpful for classification. Second, MIFS is computationally efficient because no complex training process is required. Thus, given the amount of data in the corpus and the task being performed, MIFS is an effective option, as it strikes a good balance between the quality of the features and the computational time.

Among the different types of MIFS, the simplest method for selecting feature components is Maximum Mutual Information (MMI). Let x_i be the visual feature of frame i (i = 1, 2, ..., n) in the M-dimensional feature space S_M (in our system, M is the dimension of the raw LBP-TOP feature, i.e., 3540), where n is the total number of frames over the entire collection of video sequences used for training. A feature subset S_k (k ≤ M; in this work, k = 310) is selected using MMI:

S_k = S_{k-1} \cup \Big\{ \arg\max_{x_j \in (S_M \setminus S_{k-1})} I(x_j; C) \Big\},    (3.2)

where C is one of the 330 classes used in this work, and x_j \in (S_M \setminus S_{k-1}) means that x_j is in the feature space S_M but does not belong to the subset S_{k-1}. The mutual information I(x_j; C) can be estimated as:

I(x_j; C) = \sum_{x \in x_j} \sum_{c \in C} p(x, c) \log \frac{p(x, c)}{p(x)\,p(c)},    (3.3)

where p(x, c) = p(x_j = x, C = c) is the joint probability distribution of x_j and C. In our system, a histogram with 100 bins is used to compute the mutual information, so p(x) is estimated as the proportion of training samples whose value of x_j falls into the histogram bin containing x. Obviously, MMI (Eq. 3.2) finds the k most informative feature dimensions of the original space in the information-theoretic sense. However, it does not necessarily follow that this set of k feature components is the most informative feature set for classification; rather, this feature set could be highly redundant. Peng et al. [87] proposed another type of MIFS, called minimal-redundancy-maximal-relevance (mrmr), that takes the feature redundancy into consideration:

S_k = S_{k-1} \cup \Big\{ \arg\max_{x_j \in (S_M \setminus S_{k-1})} \Big[ I(x_j; C) - \frac{1}{k-1} \sum_{x_l \in S_{k-1}} I(x_j; x_l) \Big] \Big\},    (3.4)

where I(x_j; x_l) is the mutual information between the feature components x_j and x_l, computed as in Eq. 3.5 below; it acts as a penalty term that approximates the redundancy between different feature components:

I(x_j; x_l) = \sum_{x_1 \in x_j} \sum_{x_2 \in x_l} p(x_1, x_2) \log \frac{p(x_1, x_2)}{p(x_1)\,p(x_2)}.    (3.5)
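A minimal sketch of the histogram-based estimate of Eq. 3.3 and the greedy mrmr selection of Eq. 3.4 is given below, assuming the 100-bin quantisation described above; the function names and the NumPy implementation details are illustrative assumptions.

```python
import numpy as np

def mutual_information(x, labels, n_bins=100):
    """Histogram estimate of I(x; C) for one feature component x (Eq. 3.3)."""
    xq = np.digitize(x, np.histogram_bin_edges(x, bins=n_bins)[1:-1])
    joint = np.zeros((n_bins, labels.max() + 1))
    np.add.at(joint, (xq, labels), 1.0)          # joint counts over (bin, class)
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    pc = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (px @ pc)[nz])).sum())

def mrmr_select(X, labels, k, n_bins=100):
    """Greedy mrmr selection (Eq. 3.4). X: (n_frames, M); labels: 0-based class ids."""
    M = X.shape[1]
    relevance = np.array([mutual_information(X[:, j], labels, n_bins) for j in range(M)])
    selected = [int(np.argmax(relevance))]
    red = np.zeros(M)                            # running sum of I(x_j; x_l) over selected
    while len(selected) < k:
        last = selected[-1]
        # I(x_j; x_l) is estimated with the same histogram estimator by
        # quantising the last selected component into class-like bins
        xl = np.digitize(X[:, last],
                         np.histogram_bin_edges(X[:, last], bins=n_bins)[1:-1])
        red += np.array([mutual_information(X[:, j], xl, n_bins) for j in range(M)])
        score = relevance - red / len(selected)  # relevance minus mean redundancy
        score[selected] = -np.inf                # never re-select a component
        selected.append(int(np.argmax(score)))
    return selected
```

In practice the pairwise terms I(x_j; x_l) dominate the cost, which is why only the most recently selected component is revisited at each step.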

Another MIFS that has been widely used is the Conditional Mutual Information (CMI) [115]. It measures the redundancy between features when the class labels are given:

S_k = S_{k-1} \cup \Big\{ \arg\max_{x_j \in (S_M \setminus S_{k-1})} \Big[ I(x_j; C) - \max_{x_l \in S_{k-1}} I(x_j; x_l \mid C) \Big] \Big\},    (3.6)

where the penalty term I(x_j; x_l | C) takes the class into account and is given by:

I(x_j; x_l \mid C) = \sum_{x_1 \in x_j} \sum_{x_2 \in x_l} \sum_{c \in C} p(x_1, x_2, c) \log \frac{p(c)\,p(x_1, x_2, c)}{p(x_1, c)\,p(x_2, c)}.    (3.7)

Comparing Eq. 3.4 with Eq. 3.6, CMI appears to be more pertinent to the classification task than mrmr, as its penalty term is conditioned on the class. However, estimating the penalty term I(x_j; x_l | C) requires a much larger amount of data than estimating the penalty term I(x_j; x_l) of mrmr [40]. As explained in Section 3.4, it was found that mrmr performs much better on the LBP-TOP feature than MMI and CMI. Consequently, mrmr was chosen as the feature selector for the proposed CHAVF.

After the extraction of the relatively compact DCT and LBP-TOP features, their concatenation was still too large for the classifier to process. Thus, LDA was used once more to further reduce the dimensionality of the concatenated features. With the proposed cascade feature extraction framework, information representing different characteristics (i.e., global and local) and modalities (i.e., texture and stereo) is successfully embedded into a compact feature vector. To the best of our knowledge, this is the first compact visual feature of this kind that represents speech-relevant information from multiple representation perspectives.
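Putting the stages together, the cascade of Figure 3.3 can be sketched as follows, with scikit-learn's LDA standing in for the two LDA stages; the helper functions (dct_features, lbp_top_blocks, mrmr_select), the frame-level HMM-state labels and the chosen output dimensionalities are assumptions carried over from the earlier sketches rather than the exact configuration used in the experiments.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

def train_chavf(dct_hybrid, lbp_hybrid, labels, n_lbp=310, n_final=40):
    """Fit the cascade on training frames.

    dct_hybrid : (n_frames, 192) concatenated grey+depth DCT features
    lbp_hybrid : (n_frames, 3540) concatenated grey+depth LBP-TOP features
    labels     : (n_frames,) frame-level class targets (e.g. HMM states)
    """
    # Branch 1 (global): LDA on the hybrid DCT features
    lda_dct = LDA(n_components=n_final).fit(dct_hybrid, labels)
    # Branch 2 (local): mrmr keeps the most informative LBP-TOP components
    keep = mrmr_select(lbp_hybrid, labels, k=n_lbp)
    # Final stage: LDA on the concatenation of both branches
    fused = np.hstack([lda_dct.transform(dct_hybrid), lbp_hybrid[:, keep]])
    lda_final = LDA(n_components=n_final).fit(fused, labels)
    return lda_dct, keep, lda_final

def chavf(dct_hybrid, lbp_hybrid, lda_dct, keep, lda_final):
    """Apply the trained cascade to new frames -> compact CHAVF vectors."""
    fused = np.hstack([lda_dct.transform(dct_hybrid), lbp_hybrid[:, keep]])
    return lda_final.transform(fused)
```

The returned low-dimensional frame vectors would then be modelled by the HMM/GMM recogniser in the usual way.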

71 Chapter 3. A Cascade Gray-Stereo Visual Feature Extraction Method For Visual and Audio-Visual Speech Recognition 3.4 Performance Evaluation and Results Two data corpora were used in this work. The AusTalk (see Section 3.4.1) was used for speaker-independent visual speech recognition (Section 3.4.2) and speaker-independent audio-visual speech recognition (Section 3.4.4). The OuluVS (see Section 3.4.1) was used for the speaker-independent visual phrase classification experiments (Section 3.4.3) The AusTalk and OuluVS Corpora AusTalk Since this paper mainly focuses on how to employ stereo visual information for speech recognition, a suitable corpus that contains stereo data needs to be used. In terms of existing corpora which contain stereo data, the data from AV@CAR [81] only contains stereo facial expression information and does not have any 3D visual speech related content. Hence, it cannot be used for stereo visual speech recognition. Although the WAPUSK20 [118] and the AVOZES [35] contain 3D speech data, these data corpora are limited in either the number of speakers or speech content. Hence, we used a recently developed data corpus that addresses these limitations. The main data corpus used in this paper was collected by an Australia wide research project called AusTalk, funded by the Australian Research Council [12, 119, 13, 110]. This research project involved more than 30 scientists from 11 Australian universities and resulted in the creation of a large-scale data corpus that can be used in audio-visual speech recognition research. The AusTalk corpus consists of a large 3D audio-visual database of spoken Australian English recorded at 15 different locations 51

72 Chapter 3. A Cascade Gray-Stereo Visual Feature Extraction Method For Visual and Audio-Visual Speech Recognition Figure 3.4: The recording environments and devices used to collect the AusTalk data. in each of Australias states and territories; the contemporary voices of 1,000 Australian English speakers of all ages were recorded to capture variations in accents, linguistic characteristics and speech patterns. To satisfy a variety of speech driven tasks, several types of data was recorded, including isolated words, digit sequences and sentences. To collect the AusTalk data, a Standard Speech Science Infrastructure Black Box (SS- SIBB) was designed [13]. The recording equipment includes head-worn and desktop microphones, digital audio acquisition devices and a stereo camera (see Figure 3.4). A Bumblebee stereo camera, mounted approximately 50cm from the speakers, was used to collect the 3D information of the speakers in addition to the texture (RGB) information. Although the accuracy of this stereo camera is not as high as some more expensive cameras used in other 3D driven tasks (e.g., those used by the 3dMD company [1] for precise surface imaging based applications), it is a low cost stereo camera which can be used in a much wider range of real-life applications. The details for the extraction of the audio and visual features can be found in [110] and the configuration parameters of the Bumblebee camera are listed in Table 3.1. Figure 3.5 represents a few video data samples. To generate the required planar-level and stereo-level information, face detection 52

Table 3.1: Bumblebee camera configuration used for building our system.

Attribute                          Value
Resolution
Disparity Range                    [41, 137]
Stereo Mask                        11
Edge Correlation                   On
Edge Mask                          7
Sub-pixel Interpolation            On
Surface Validation                 On
Surface Validation Size            400
Surface Validation Difference      On
Uniqueness Validation              On
Uniqueness Validation Threshold    1.44
Texture Validation                 On
Texture Validation Threshold       0.4
Back Forth Validation              On

Figure 3.5: Sample RGB images and meshes from the AusTalk visual dataset.

74 Chapter 3. A Cascade Gray-Stereo Visual Feature Extraction Method For Visual and Audio-Visual Speech Recognition is first performed, and the face ROIs are cropped from the original stream using the Haar features and Adaboost [117]. In order to align the face to the same position and to correct the head pose, the Iterative Closest Point (ICP) algorithm [9] is applied on the depth data. To reduce the computational load of the face alignment, only the upper faces are used, because the upper face can be considered as rigid and is therefore less affected by facial expressions [70]. The upper face of the first frame is used as the reference model, and the upper faces of the remaining frames are registered to the reference model. Then, a cubic interpolation is performed to fill in holes and reduce noise (e.g., spikes). Given the aligned point cloud faces, the mouth region can then be easily cropped by applying the Haar features and Adaboost. We use a square of 50mm 50mm centred at the mouth centre to crop the mouth. Since there is a oneto-one correspondence between the points in the face point cloud and the pixels in its corresponding texture image, the 2D mouth images (50 50) can be extracted in this step as well. To show the effectiveness of the proposed method in a large-scale, speaker independent, continuous speech recognition task, experiments were conducted using the digit sequence session of the AusTalk data corpus. In this session, 12 4-digit sequences were collected from each of the speakers (see Table 3.2). For digit 0, two pronunciations (i.e., zero and oh) were used to capture the different speech habits. This set of digit strings was carefully designed to ensure that each digit (i.e., 0-9) occurred at least once in each serial position. Take digit 1 for example, as listed in Table 3.2, it occurs in the 4th recording 123z (in the first position), the 1st recording z123 (in the second position), the 10th recording 3z12 (in the third position), and the 7th recording 23z1 (in the fourth position). The digits in the data corpus were read in a random manner without any unnatural pause to simulate PIN recognition and telephone dialing tasks. 54
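To make the mouth-ROI extraction described at the beginning of this subsection concrete, the sketch below detects the face and the mouth with OpenCV Haar cascades and crops a fixed-size mouth patch from each frame; the cascade file names, the 50 × 50 pixel output size and the lowest-candidate heuristic are assumptions of this sketch, and the depth-based ICP alignment and hole filling are omitted.

```python
import cv2

# Haar cascades: the face model ships with OpenCV; the mouth cascade is
# assumed to be available locally (it is not bundled with every OpenCV build).
face_cascade = cv2.CascadeClassifier(cv2.data.haarcascades +
                                     'haarcascade_frontalface_default.xml')
mouth_cascade = cv2.CascadeClassifier('haarcascade_mcs_mouth.xml')

def crop_mouth(frame_bgr, size=50):
    """Return a size x size grey mouth ROI, or None if detection fails."""
    grey = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(grey, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])        # largest face
    lower_face = grey[y + h // 2: y + h, x: x + w]            # mouth lies in the lower half
    mouths = mouth_cascade.detectMultiScale(lower_face, scaleFactor=1.1, minNeighbors=10)
    if len(mouths) == 0:
        return None                                           # frame discarded, as in the text
    mx, my, mw, mh = max(mouths, key=lambda m: m[1])          # lowest candidate in the region
    roi = lower_face[my: my + mh, mx: mx + mw]
    return cv2.resize(roi, (size, size))
```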

75 Chapter 3. A Cascade Gray-Stereo Visual Feature Extraction Method For Visual and Audio-Visual Speech Recognition This configuration differs to the popular audio-visual data corpus like CUAVE [85] that reads digits in a sequential manner with a relatively long pause between digits. The random digit sequences that were recorded in AusTalk made the recognition task more difficult; moreover, this configuration also ensured that the digit recording of the data corpus was more balanced. To capture any within-speaker variability over time, each participant speaker was encouraged to attend two separate recording sessions. There was at least a one-month gap between the two recording sessions; however, the speech contents recorded in these two sessions were identical. As not all speakers attended a second session, some recordings only contain data from the first session. In this study, data recordings from 162 speakers (around 1,900 utterances) were used. Of these 162 speakers, 148 speakers attended one recording session and 14 speakers attended both recording sessions. However, a small number of recordings were not used for speech recognition due to several technical issues. For example, the speakers involuntarily moved their head and body during the recording, so that the stereo camera failed to extract useful depth information. Given the large amount of data that we used in this research, it is not feasible to manually crop the mouth region from the image. Hence, an Adaboost based face and mouth detection algorithm was used, and in some cases the face and mouth detection failed to detect the mouth from the videos. Like some other works that have had similar issues [129, 133, 79, 80, 131], these recordings are removed from the data corpus, and 1861 (out of 1992) recordings were used in our experiments. In the proposed work, the Hidden Markov Toolkit (HTK) [123] was used to implement HMMs for digit sequence recognition. In this experiment, the digit recognition task was treated as a connected word speech recognition problem with a simple syntax (i.e., any combination of digits and silence was allowed in any order). With respect to 55

76 Chapter 3. A Cascade Gray-Stereo Visual Feature Extraction Method For Visual and Audio-Visual Speech Recognition Table 3.2: Digit sequences in the AusTalk data corpus. For the digit 0, there are two possible pronunciations: zero ( z ) and oh ( o ). No. Digits No. Digits No. Digits 01 z o z o z o z the HMM model, 11 word models were used with 30 states to model 11 digit pronunciations, a 5-state HMM was used to model the silence of the beginning and the end of the recording, and a 3-state HMM was used to model the short pause between the digit utterances. Each HMM state was modelled by nine Gaussian Mixtures with diagonal covariance. In relation to the experimental setup, the 162-speaker digit data recordings were divided into 10 groups; the speakers in the different groups did not overlap. A 10-fold cross validation was then employed to increase the statistical significance of the results. The average speech accuracy of 10 runs was then reported. OuluVS In recent years, the majority of high-quality work in this area has focused on speech classification [132]. OuluVS [129] was used to compare the proposed CHAVF with other state-of-the-art systems. OuluVS, a widely used data corpus that performs speech classification, is a visual-only data corpus comprising of 10 English phrases (see Table 3.3) uttered by 17 male speakers and 3 female speakers. The data in this corpus was collected using a SONY DSR-200AP 3CCD-camera with a frame rate of 25 fps. Each phrase was repeated nine times by each speaker. As in Zhao et al. work [129], 817 sequences from 20 speakers were used in our experiments. The second degree polynomial kernel Support Vector Machine (SVM) was used as the classifier. This is the same 56

77 Chapter 3. A Cascade Gray-Stereo Visual Feature Extraction Method For Visual and Audio-Visual Speech Recognition classifier as the one used in [129, 133]. In terms of the verification process, a leave-onespeaker-out approach was adopted, such that the recordings of 19 speakers was used for the training dataset and the left out speaker was used to test the data for each of the 20 runs. Table 3.3: The 10 phrases in OuluVS data corpus. No. Phrase No. Phrase 01 Excuse me 02 Goodbye 03 Hello 04 How are you 05 Nice to meet you 06 See you 07 I am sorry 08 Thank you 09 Have a good time 10 You are welcome Speaker-Independent Visual Speech Recognition To examine the amount of relevant information for planar and stereo visual features, the mutual information (see Eq. 3.3) of the different visual components are plotted in Figure 3.6. In relation to the DCT features, Figures 3.6a and 3.6b show that the amount of information carried by planar and stereo is quite different. This is an interesting observation firstly revealed by this work, and explains why the integration of planar and stereo features by our proposed method is capable of boosting speech recognition accuracy. Furthermore, these two kinds of visual features also share a common characteristic. The DCT static coefficients had more discriminative information about the classes (i.e., the states of digit HMM models) than the dynamic coefficients. Previous studies found similar results for the hvd words recognition from a linear discriminative perspective [112]. The hvd words are a set of words starting with the letter h, a vowel in the middle and ending with the letter d and they are important for acoustic-phonetic analysis. Both works show that the DCT static features were more 57

discriminative than the dynamic ones. In this study, it was found that the mutual information of feature components that contribute to both visual and audio-visual speech recognition is usually larger than 0.3. Figures 3.6a and 3.6b show that the number of stereo DCT components with considerably large mutual information I (I ≥ 0.3) was much smaller than for the planar counterpart (31 vs. 10). This difference between the planar and the stereo features does not necessarily mean that the planar features were more informative than the stereo features, as high information redundancy may exist in the planar features. For the LBP-TOP features, the mutual information of each mouth subregion (see Figure 3.2) is plotted in Figures 3.6c and 3.6d. From these figures, it can be observed that the planar LBP-TOP components of the lower mouth regions (i.e., regions VI to X in Figure 3.2) were more informative than the components of the upper mouth regions (i.e., regions I to V in Figure 3.2). This is consistent with the observation that, in human speech production, most of the movements involved in talking occur in the lower lip and the jaw. An experiment on the LBP-TOP features was also carried out (see Table 3.4 for the results). As displayed in Table 3.4, the accuracy was about 10% higher after the mouth region was divided into 2 × 5 blocks. Thus, the mouth region subdivision scheme introduced in Section 3.3.2 boosted speech accuracy. Interestingly, unlike the DCT features introduced above, for the LBP-TOP features the temporal features (i.e., those extracted from the XT and YT planes) were generally more informative than the spatial features (i.e., those extracted from the XY plane). These complementary behaviours of the DCT and LBP-TOP features explain why our proposed method is effective: it automatically obtains more static speech-relevant information from the DCT, while incorporating more dynamic information from the LBP-TOP.

[Figure 3.6 appears here as four panels of mutual information plotted against feature dimension: (a) grey-level DCT (static coefficients and their first and second temporal derivatives), (b) depth-level DCT (static coefficients and their first and second temporal derivatives), (c) grey-level LBP-TOP (XY, XT and YT planes over mouth regions I to X) and (d) depth-level LBP-TOP (XY, XT and YT planes over mouth regions I to X).]

Figure 3.6: The comparison of the amount of relevant information (I(x_j; C)) embedded in different types of feature dimensions.

[Figure 3.7 appears here as six panels of visual speech accuracy (%) plotted against feature dimension, comparing the MMI, mRMR, CMI and (for DCT) LDA reduction methods: (a) grey-level DCT, (b) depth-level DCT, (c) hybrid-level DCT, (d) grey-level LBP-TOP, (e) depth-level LBP-TOP and (f) hybrid-level LBP-TOP.]

Figure 3.7: The performance of visual-only speech recognition using various feature types and feature reduction techniques: planar (gray-level) DCT (Figure 3.7a), stereo (depth-level) DCT (Figure 3.7b), hybrid-level DCT (Figure 3.7c), planar (gray-level) LBP-TOP (Figure 3.7d), stereo (depth-level) LBP-TOP (Figure 3.7e) and hybrid-level LBP-TOP (Figure 3.7f).

81 Chapter 3. A Cascade Gray-Stereo Visual Feature Extraction Method For Visual and Audio-Visual Speech Recognition Table 3.4: Planar LBP-TOP feature. The superscript 1 1 which represents the entire lip region are fed into the feature extraction procedure without subdivision, while the superscript 2 5 which represents the mouth region is divided into 10 subregions as shown in Figure 3.2. The subscript represents the radii of the spatial and temporal axes (i.e., X, Y and T ) and the number of neighbouring points in these three orthogonal planes are 8 and 3, respectively. Feature Type Feature Extraction Visual Speech Accuracy Planar LBP TOP8, % LBP TOP8, % Stereo 35.31% LBP TOP8, % LBP TOP Figure 3.7 shows the visual-only speech recognition results using the hybrid-level, planar (grey-level) and stereo (depth-level) DCT and LBP-TOP features with LDA and the MIFS selection methods (i.e., MMI, mrmr and CMI). For the DCT visual features, the use of LDA achieved the highest accuracy for planar, stereo and hybrid-level features (i.e., 54.66%, 55.19%, and 64.93%, respectively). Also, it is interesting to note that with the application of LDA, the accuracy achieved by the stereo DCT feature was almost the same as that achieved by the planar DCT feature (i.e., 55.19% versus 54.66%). This indicates that while the stereo images were quite noisy and the stereo lip regions were barely visible to human eyes (see Figure 3.5), the stereo DCT feature was still capable of representing relevant geometrical information. One reason for the promising accuracy yielded by the low quality stereo image sequences is that the stereo DCT is a global feature representation method that is insensitive to image noise. In relation to the hybrid DCT feature that combines both planar and stereo global information, the visual speech accuracy was approximately 10% higher than either of the corresponding planar and stereo features. In relation to the LBP-TOP visual feature, with the application of mrmr, the

82 Chapter 3. A Cascade Gray-Stereo Visual Feature Extraction Method For Visual and Audio-Visual Speech Recognition dimensional planar, 310-dimensional stereo and 310-dimensional hybrid-level features yielded 53.06%, 45.28% and 52.04%, respectively accuracy (see Figure 3.7). Note that the stereo LBP-TOP feature did not perform as well as the planar LBP-TOP feature. As illustrated in Figure 3.5, the quality of the stereo images was much worse than the quality of the RGB images. As the LBP-TOP is a local information representation method, it is sensitive to the noise of the image. Thus, the performance of the planar LBP-TOP feature was better than that of its stereo counterpart. As can be seen in Figure 3.7, the different speech recognition performance demonstrated by planar and stereo features explains the superiority of our proposed method, i.e., to boost speech recognition accuracy, our method can automatically encode more planar information using a global extraction method, while encoding less stereo information from a local perspective. As explained in Section 3.3.3, after MIFS was applied to select the most informative components from the raw LBP-TOP feature, LDA was used to further reduce the feature dimensionality. In relation to MIFS, mrmr achieved the highest visual speech accuracy for the hybrid-level LBP-TOP with a feature dimension of 310 as evident in Figure 3.7. Thus, a 310-dimensional hybrid LBP-TOP feature vector was used that was selected by mrmr for further dimensionality reduction. Figure 8 shows the visual-only speech recognition performance of the mrmr selected LBP-TOP features followed by LDA for the further feature dimension reduction. As shown in Table 3.5, the accuracy of the planar, stereo and hybrid-level LBP-TOP features with mrmr were 53.06%, 45.28% and 53.54%, respectively. Using LDA for further feature reduction, the visual speech accuracy for the planar, stereo and hybrid-level features were 57.23%, 52.17%, 59.11%, respectively. These results show that the novel cascade feature reduction framework proposed in this paper is very effective for visual speech recognition. As summarised in Table 3.5, for the planar features, both the DCT and the LBP-TOP 62

Table 3.5: Visual speech recognition performance comparison between popular appearance features and our proposed method.

Modality   Feature                            Dimension   Accuracy
Planar     DCT + MMI [40]                     -           -
           DCT + mRMR [40]                    -           -
           DCT + CMI [40]                     -           -
           DCT + LDA                          -           54.66%
           LBP-TOP [129] + MMI [40]           -           -
           LBP-TOP [129] + mRMR [40]          -           53.06%
           LBP-TOP [129] + CMI [40]           -           -
           LBP-TOP + mRMR + LDA               -           57.23%
Stereo     DCT + MMI [40]                     -           -
           DCT + mRMR [129]                   -           -
           DCT + CMI [40]                     -           -
           DCT + LDA                          -           55.19%
           LBP-TOP [129] + MMI [40]           -           -
           LBP-TOP [129] + mRMR [40]          -           45.28%
           LBP-TOP [129] + CMI [40]           -           -
           LBP-TOP + mRMR + LDA               -           52.17%
Hybrid     DCT + MMI [40]                     -           -
           DCT + mRMR [40]                    -           -
           DCT + CMI [40]                     -           -
           DCT + LDA                          -           64.93%
           LBP-TOP [129] + CCA [48]           -           41.53%
           LBP-TOP [129] + MMI [40]           -           -
           LBP-TOP [129] + mRMR [40]          -           53.54%
           LBP-TOP [129] + CMI [40]           -           -
           LBP-TOP [129] + direct LDA [125]   -           52.32%
           LBP-TOP + mRMR + LDA               -           59.11%
           CHAVF using CCA                    -           64.34%
           CHAVF                              -           69.18%

Figure 3.8: The visual-only speech recognition performance of the mRMR-selected LBP-TOP features followed by LDA for further feature dimension reduction (accuracy versus feature dimension for the grey-level, depth-level and hybrid-level LBP-TOP+mRMR+LDA features).

Zhao et al. [129] also found that the planar LBP-TOP feature was superior to the DCT feature. In relation to the stereo visual features, unlike their planar counterparts, the DCT feature outperformed the LBP-TOP feature (55.19% versus 52.17%), as the global information representation (i.e., DCT) was more suitable for extracting information from the noisy stereo images. Clearly, the planar and the stereo DCT and LBP-TOP visual features all contained considerable speech-related information. Thus, combining these information sources to form a more discriminative visual feature was expected to boost speech accuracy. It is clear that, for both the DCT and LBP-TOP features, the integration of the planar and the stereo information yielded better speech accuracy than either of the single modalities (see Table 3.5). Specifically, the accuracy obtained by the hybrid DCT feature was 64.93% (i.e., approximately 10% higher than the standard planar DCT feature at 54.66%).
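The cascade reduction evaluated above (mutual-information-based selection followed by LDA) and the planar-stereo fusion can be sketched in a few lines. This is an illustrative sketch only: scikit-learn's mutual_info_classif is used as a simplified stand-in for the mRMR selector, and all array shapes, labels and target dimensions are placeholders rather than the exact settings of this study.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

# Placeholder per-frame features and labels (the thesis uses 1770-dimensional
# LBP-TOP vectors and 330 HMM states as class targets).
planar_lbp = rng.random((1000, 1770))
stereo_lbp = rng.random((1000, 1770))
hmm_states = rng.integers(0, 330, size=1000)

# Hybrid feature: concatenate the planar and stereo descriptors.
hybrid = np.hstack([planar_lbp, stereo_lbp])

# Stage 1: mutual-information-based selection down to 310 components
# (a simplified stand-in for mRMR, which also penalises redundancy).
selector = SelectKBest(mutual_info_classif, k=310)
selected = selector.fit_transform(hybrid, hmm_states)

# Stage 2: LDA for decorrelation and further dimensionality reduction.
lda = LinearDiscriminantAnalysis(n_components=40)
reduced = lda.fit_transform(selected, hmm_states)
print(reduced.shape)  # (1000, 40)
```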

The stereo LBP-TOP feature did not perform as well as its planar counterpart; however, the integration of the stereo local information with the planar local information still yielded a better accuracy (i.e., 59.11%), approximately 2% higher than the standard planar LBP-TOP feature (i.e., 57.24%). This confirms one significant aspect of this study: even low-quality stereo visual data can be used for speech recognition. Furthermore, visual speech accuracy increases significantly when both stereo and planar visual features are integrated.

As introduced in Section 3.3.3, the conventional LDA cannot be applied to the LBP-TOP feature because of its high dimensionality. Hence, in our experiments we compared our method with an LDA variant, the direct LDA. Our proposed cascade feature reduction scheme showed superiority over the direct LDA method (59.11% vs 52.32%). Since the direct LDA is a special case of LDA, it only works well in applications with well-separated classes [32], and therefore did not outperform our proposed method.

The proposed hybrid visual feature extraction scheme not only combined two separate information sources (i.e., the planar modality and the stereo modality), but also took into account two complementary information representation methods (i.e., local and global information). Thus, the proposed feature extraction framework was able to produce an even higher level of speech accuracy. The visual speech accuracy of the proposed feature (i.e., CHAVF) was 69.18%, which outperformed all the listed visual features (see Table 3.5). Compared with the DCT+LDA feature, the proposed CHAVF yielded an overall improvement of approximately 4.25%.

The major contribution of our work is a feature extraction scheme that can represent visual speech information from different views. We also compared our results with Canonical Correlation Analysis (CCA), which is one of the most popular techniques used for multi-view feature learning [113].

As with the LDA training introduced in Section 3.3.3, the 330 HMM states of the 11 digits were used as class targets for CCA. After employing CCA, two 20-dimensional feature vectors were produced for each of the DCT and LBP-TOP features. Concatenating these two feature vectors results in a 40-dimensional feature, which constitutes the hybrid CCA-based feature. The first experiment used CCA on the planar and stereo LBP-TOP features to learn a hybrid LBP-TOP feature. Experimental results showed that the hybrid LBP-TOP feature with our proposed cascade feature dimension reduction scheme yielded a significantly better result than the CCA learning scheme (59.11% vs 41.53%). The second experiment used CCA to learn a combined local-global hybrid feature. We replaced the LDA at the second stage of our proposed scheme (as shown in Fig. 3.3) with CCA, and our proposed method with LDA outperformed the CCA variant (69.18% vs 64.34%).

To gain insight into why the proposed feature outperformed the other visual features, a recently introduced visualisation method called t-SNE [114] was used to produce 2D embeddings of the visual features. Data points that are close in the high-dimensional feature space are also close in the 2D space produced by t-SNE. Figure 3.9 shows the 2D mapping of the proposed method and of several features that achieved the best visual speech accuracy in their corresponding categories. The data points in Figure 3.9 represent video frames, and different colours correspond to different classes (i.e., the different states of the HMM models). For clarity, the fifth state of each digit HMM model was chosen for visualisation. Figure 3.9 shows that the features integrating planar and stereo information (see Figures 3.9e and 3.9f) were more visually distinctive than the conventional single-modality visual features, which exhibited more dispersion (see Figures 3.9a to 3.9d). Thus, from the data visualisation perspective, Figure 3.9 highlights a key finding of this study: the quality of a visual feature can be improved by the integration of both stereo and planar visual features.

In addition to comparing our proposed method with other feature-level fusion methods, we also compared it with classifier-level fusion methods. In this work, the Multi-Stream HMM (MSHMM) was used to fuse the classification results from the DCT and LBP-TOP features. The HMM training for the DCT and LBP-TOP features was conducted separately. In the test step, the emission likelihood was computed as:

\log b_j(o_t^{fuse}) = \lambda_{dct} \log b_j(o_t^{dct}) + \lambda_{lbp} \log b_j(o_t^{lbp}),   (3.8)

where b_j(o_t^{fuse}), b_j(o_t^{dct}) and b_j(o_t^{lbp}) are the joint emission probability, the DCT stream emission probability and the LBP-TOP stream emission probability, respectively. The weights \lambda_{dct} and \lambda_{lbp} are the weights of the DCT and LBP-TOP streams, with \lambda_{dct} + \lambda_{lbp} = 1. The transition probability of the MSHMM was estimated by the weighted sum of the transitions of each stream. In our experiments, the weight of each stream was carefully adjusted to ensure that the best performance could be achieved.

In our experiment, we used the hybrid DCT+LDA (64.93%) and the hybrid LBP-TOP+mRMR (54.28%) features as the two streams of the MSHMM, because they achieved the best results on the VSR task using a single-stream HMM. Feeding these into the multi-stream HMM with suitably adjusted weights yielded an accuracy of 65.24%. Next, we employed the same multi-stream HMM for the DCT+LDA feature and the LBP-TOP feature processed with our proposed cascade feature extraction scheme, and achieved a very promising result (66.49%). This indicates that our proposed scheme can not only be used for feature-level VSR, but is also effective for classifier-level VSR. Despite the impressive accuracy from the classifier-level fusion, our proposed CHAVF achieved a better result (69.18% vs 66.49%).

Figure 3.9: 2D t-SNE visualisation of different visual features with various feature reduction techniques. Figure 3.9a: Planar DCT+LDA; Figure 3.9b: Planar LBP+mRMR+LDA; Figure 3.9c: Stereo DCT+LDA; Figure 3.9d: Stereo LBP+mRMR+LDA; Figure 3.9e: Hybrid-level DCT+LDA; Figure 3.9f: Our proposed CHAVF.

Table 3.6: Comparison between our proposed method and the classifier-level fusion methods.

Feature                          Weight   Accuracy
DCT + LDA (64.93%)               0.9      65.24%
LBP-TOP + mRMR (54.28%)          -
DCT + LDA (64.93%)               0.8      66.49%
LBP-TOP + mRMR + LDA (59.11%)    -
Proposed CHAVF                   -        69.18%

Speaker-Independent Visual Phrase Classification

To compare the proposed CHAVF with state-of-the-art systems, a visual phrase classification task was performed using an SVM classifier on the popular OuluVS data corpus. In the previous continuous visual speech recognition task (see Section 3.4.2), both DCT and LBP features were extracted and fed into an HMM recogniser frame by frame. In this speech classification task, however, the visual features were extracted from each frame of each video, and average pooling (i.e., the mean vector of all frame features) was used to ensure that the visual feature has a fixed length.

Table 3.7 lists the classification results on the OuluVS dataset, comparing the proposed method to some of the state-of-the-art systems. It shows that the proposed CHAVF was able to outperform the LBP-TOP feature [129] and sequential pattern boosting [79]. The latent variable models [131] achieved a better accuracy than the proposed method; however, their training process was more computationally complex, and this method could not be used on the continuous visual speech recognition task. The proposed method also had an accuracy level similar to that of the transported square-root vector field [107]. However, that method was tested under a speaker-dependent condition [107], while the proposed method was tested under the more difficult speaker-independent condition.

Table 3.7: Visual speech classification comparison on the OuluVS data corpus. The results are reported in terms of speaker-dependent speech classification.

Method                                                    Results
LBP-TOP (TMM 2009 [129])                                  62.4%
Sequential Pattern Boosting (BMVC 2011 [79])              65.6%
Transported Square-Root Vector Field (CVPR 2014 [107])    70.6%
Latent Variable Models (PAMI 2014 [131])                  76.6%
Our proposed CHAVF                                        68.9%

Speaker-Independent Audio-Visual Speech Recognition

To show that the proposed visual features can boost audio-only speech recognition, audio-visual speech recognition experiments were performed under varying noise levels. The MSHMM was used to model the audio and visual signals. The training of the MSHMM was conducted separately for the audio and visual streams under clean acoustic conditions. The tests were conducted under various SNR conditions representing different degrees of degradation of the audio stream. In this study, different levels of additive white noise were used to demonstrate the robustness of the audio-visual speech recognition system to audio degradation. The proposed CHAVF and the most commonly used planar (grey-level) DCT+LDA features were used for the experiment, and the corresponding audio-visual speech recognition results are shown in Figure 3.10. It should be noted that the main aim of this paper was to prove the superiority of the hybrid visual features; the automatic selection of the weights for the audio and visual streams according to different noise levels was beyond the scope of this paper.
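The stream-weighted combination of Eq. 3.8, which is also the rule used for the audio-visual fusion below, reduces to a weighted sum of per-stream emission log-likelihoods. The following is a minimal illustrative sketch; the log-likelihood values are placeholders, since in this work they come from the trained HMM emission models.

```python
import numpy as np

def fuse_stream_loglik(loglik_a, loglik_b, weight_a):
    """Weighted sum of two streams' emission log-likelihoods (weights sum to 1)."""
    return weight_a * loglik_a + (1.0 - weight_a) * loglik_b

# Placeholder per-state emission log-likelihoods for one frame from two streams.
loglik_audio = np.array([-12.3, -10.1, -15.7])
loglik_visual = np.array([-11.0, -13.2, -14.9])
print(fuse_stream_loglik(loglik_audio, loglik_visual, weight_a=0.8))
```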

The audio and visual weights (listed in Table 3.8) were therefore chosen empirically to ensure that the best audio-visual fusion results were achieved.

Table 3.8: The Audio Weight (AW) and the Video Weight (VW) of the MSHMM.

In an acoustically clean environment (i.e., 30dB), the audio-visual fusion results were equivalent to those obtained by the audio features only. However, assigning a small weight to the visual stream still led to audio-visual fusion results (96.87%) that were slightly better than the audio-only results (96.52%). With an increase in the noise level, the recognition performance using audio-only features degraded significantly, from 96.52% (at 30dB) to 17.08% (at -5dB). Conversely, the audio-visual fusion recognition performance experienced a relatively small decrease due to the utilisation of the video signals. Furthermore, it should be noted that in very noisy environments the audio-visual speech accuracy was only slightly above the visual-only accuracy, as the visual modality then took on the dominant role in the recognition process. Encouragingly, with an appropriate choice of audio and visual weights, audio-visual recognition can achieve an improvement of more than 10% over either individual modality; for example, with the proposed CHAVF features, the audio-only and visual-only speech accuracies were 67.86% and 69.18%, respectively, whereas the combined audio-visual system yielded an impressive accuracy (i.e., 80.49%) under the 10dB SNR condition. This confirms that the complementary information provided by the visual features can improve the overall recognition performance in moderate noise conditions (i.e., between 20dB and 0dB SNR).

Figure 3.10: Multi-stream HMM audio-visual digit classification results at various white-noise SNR levels for different types of visual features (audio-only, visual-only and audio-visual results for the grey-level DCT+LDA feature and the proposed CHAVF).

3.5 Conclusion

This study investigated the integration of stereo visual features with traditional planar features to boost audio-visual speech accuracy. Notably, it was shown that the proposed novel feature extraction scheme, which successfully combines planar and stereo visual information, outperformed the state-of-the-art appearance features. We also showed that, even with low-quality stereo visual data, integrating both stereo and planar visual features led to a significant increase in visual speech accuracy. This study also examined the different characteristics of planar and stereo features using information-theoretic techniques and explained how these characteristics could benefit speech recognition.

Furthermore, an analysis of the different characteristics of the planar and stereo features revealed the reasons why the stereo visual features significantly boosted the visual and audio-visual speech recognition results. After the fusion of the audio and visual signals, the experimental results showed that the proposed visual features markedly improved the audio-visual speech recognition performance in the presence of additive white noise interference.

This appears to be the first work to comprehensively analyse the benefits of using hybrid (i.e., combined planar and stereo) visual features for the audio-visual speech recognition task on a newly collected large-scale 3D audio-visual corpus. Thus, this study provides a new perspective on addressing the low visual speech accuracy problem in the area of visual and audio-visual speech recognition.


Chapter 4

Visual Feature Learning Using Deep Bottleneck Networks 1

Abstract

Motivated by the recent progress in the use of deep learning techniques for acoustic speech recognition, we present in this paper a visual deep bottleneck feature (DBNF) learning scheme using a stacked auto-encoder combined with other techniques. Experimental results show that our proposed deep feature learning scheme yields an approximately 24% relative improvement in visual speech accuracy. To the best of our knowledge, this is the first study to use deep bottleneck features for visual speech recognition, and the first to show that deep bottleneck visual features can achieve a significant accuracy improvement in visual speech recognition.

Keywords: Visual speech recognition, stacked denoising auto-encoder, deep bottleneck feature.

1 Published in Proceedings of the 40th International Conference on Acoustics, Speech and Signal Processing,

4.1 Introduction

Although audio-visual speech recognition has achieved significant improvements over audio-only speech recognition in both clean and noisy environments [90, 91, 132], how to encode speech-related information in visual features is still a largely undeveloped area. Given the encouraging performance of deep learning techniques in acoustic speech recognition [24], in this paper we propose a deep visual feature learning scheme that can replace existing hand-crafted visual features and boost visual speech accuracy.

Deep learning techniques were first proposed by Hinton et al. [47], who used a greedy, unsupervised, layer-wise pre-training scheme to overcome the difficulty of training neural networks with multiple hidden layers. Hinton et al. used the restricted Boltzmann machine (RBM) to model each layer of a deep belief network (DBN). Later works showed that a similar pre-training scheme can also be used with stacked auto-encoders [7] and convolutional neural networks (CNNs) [93] to build deep neural networks.

Although the speech recognition community has witnessed some great successes in the utilisation of deep learning techniques, progress on visual speech recognition (VSR) based on deep learning is still limited. Ngiam et al. [73] first explored the possibility of applying deep networks to VSR. In their work, however, the deep auto-encoder features were used to train a support vector machine, which did not take the dynamic characteristics of speech into account. Consequently, their proposed feature learning scheme could not be used for practical speech recognition tasks. Huang et al. [49] trained a DBN to predict the posterior probability of HMM states given the observations, which was further used for continuous speech recognition.

However, the performance of their proposed visual feature learned by deep learning techniques did not show any improvement over the HMM/GMM model. Although hand-crafted visual features still play a dominant role in VSR [132], deep learning techniques offer potential opportunities for replacing these hand-crafted features and boosting speech recognition accuracy.

In this paper, we propose an augmented deep bottleneck feature (DBNF) extraction method for visual speech recognition. Although the DBNF has been extensively evaluated for acoustic speech recognition in recent years [124, 96, 33, 53], to the best of our knowledge, this method has never been explored for visual speech recognition. In this work, a DBNF is first learned by a stacked auto-encoder and fine-tuned by a feed-forward neural network. Then, this DBNF is concatenated with the DCT feature vector, and the dimension of this concatenated feature vector is further reduced using LDA. Experimental results show that our proposed deep feature learning scheme is able to boost speech accuracy significantly.

The rest of this paper is organised as follows: Section 4.2 describes the proposed model for visual feature learning. The system performance is evaluated in Section 4.3. Finally, the paper is concluded in Section 4.4.

4.2 Deep Bottleneck Features

The proposed deep bottleneck visual feature extraction architecture is illustrated in Fig. 4.1. The training process consists of three stages. The first stage is a stacked auto-encoder, which is pre-trained on the video data in a layer-wise, unsupervised manner. This network is then further fine-tuned by adding a hidden layer and a classification layer to predict the class labels (i.e., the states of the HMMs).

Finally, the deep bottleneck feature vector is concatenated with the discrete cosine transform (DCT) feature vector, followed by a linear discriminant analysis (LDA) to decorrelate the feature and reduce the feature dimension to 20.

Figure 4.1: Proposed augmented deep multimodal bottleneck visual feature extraction scheme.

In terms of feature extraction, two features are extracted: DCT [40] and LBP-TOP [129]. For the DCT feature, 32 low-frequency DCT coefficients are selected in a zig-zag, left-to-right scanning pattern, along with their 32 first and 32 second temporal derivatives to capture the dynamic information of the utterances. For the LBP-TOP feature, we use the mouth-region subdivision scheme introduced in [129]: the mouth region is divided into 2×5 subregions, and a 177-dimensional LBP-TOP feature vector is extracted from each of these 10 subregions to form a 1770-dimensional LBP-TOP feature vector.

Since each DCT feature element is a representation of the entire mouth region at a particular frequency, DCT is considered a global feature representation. On the other hand, LBP-TOP extracts local information within a small neighbourhood in both the spatial and the temporal domains. Hence, LBP-TOP is a local information representation [129].

Given the different characteristics of these two appearance-based visual features in terms of information representation, combining these two complementary information sources should be able to boost visual speech accuracy. However, compared with the 96-dimensional DCT feature, the 1770-dimensional LBP-TOP feature is not compact enough for the HMMs to perform classification. In this paper, we propose a deep feature learning based method to generate an augmented feature which embeds both global and local information into one single feature vector.

The first stage, the stacked auto-encoder, is a deep neural network consisting of multiple auto-encoders in which the output of each auto-encoder is wired to the input of the successive auto-encoder. For each individual layer of the stacked auto-encoder, we use a denoising auto-encoder [116] to capture the structure of the video data. The input x is first corrupted using \tilde{x} \sim q_D(\tilde{x} \mid x) to yield a corrupted input \tilde{x}, where q_D is a stochastic process which randomly sets a fraction of the elements of the clean input to zero. With the corrupted input \tilde{x}, the latent representation y is constructed through the encoder using the weights W and the bias b of the hidden layer and the non-linear activation function \sigma_y:

y = \sigma_y(W \tilde{x} + b).   (4.1)

For the decoding process, the reconstruction z of the input is obtained by applying Equation 4.1 with the transposed weight matrix W^T as the new weight and the bias c of the visible layer. The training of the denoising auto-encoder is carried out using the back-propagation algorithm to minimise the loss function L(x, z) between the clean input x and the reconstruction z.

For the first layer of the stacked auto-encoder, which models the LBP-TOP feature, the mean square error is used as the loss function:

L(x, z) = \sum_{i=1}^{n} (x_i - z_i)^2,   (4.2)

where i \in \{1, 2, \ldots, n\} and n is the number of input samples. Since the following layers of the stacked auto-encoder model the probabilities of the hidden units of the corresponding previous layers, the cross-entropy error is used as the loss function:

L(x, z) = -\sum_{i=1}^{n} \left[ x_i \log z_i + (1 - x_i) \log(1 - z_i) \right].   (4.3)

The stacked auto-encoder is trained in a greedy layer-wise manner. To be specific, the first layer is trained to minimise the error L(x, z) between the 1770-dimensional LBP-TOP features and the reconstruction z of the corrupted input. Then, the corrupted activations of the first hidden layer are used as the input to train the second layer. This process is repeated until all subsequent layers are pre-trained.

After the unsupervised pre-training stage, we employ the network fine-tuning strategy proposed in [33]. More specifically, a feed-forward neural network is constructed by adding a hidden layer and a classification layer. In this network, the initial weights of the auto-encoder layers are obtained from the pre-training stage, and the two newly added layers are initialised with random weights sampled uniformly. The classification layer uses a softmax function to predict the class (i.e., the state of the HMM), and this feed-forward neural network is trained using the back-propagation algorithm.

At the third stage of the training process, the bottleneck feature is concatenated with the DCT features to form an augmented feature (DBNF+DCT). LDA is used to decorrelate the DBNF+DCT feature vector and to further reduce the feature dimension to 20. Finally, this augmented feature is fed into an HMM recogniser.

4.3 Experiments

Data Corpus

The data corpus used in our paper was collected through an Australia-wide research project called AusTalk [12, 119, 13]. It is a large 3D audio-visual database of spoken Australian English, including isolated words, digit sequences and sentences, recorded at 15 different locations in all states and territories of Australia. In the proposed work, only the digit sequence data subset is used. This set of 12 four-digit strings, which are chosen randomly to simulate PIN recognition and telephone dialling tasks (see Table 4.1), is carefully designed to ensure that each digit (0-9) occurs at least once in each serial position.

Table 4.1: Digit sequences in the Big ASC data corpus. For the digit 0, there are two possible pronunciations: zero ('z') and oh ('o').

Experimental Setup

With the use of the method detailed in [110], the videos which capture the speakers' lip movements can be obtained, and the corresponding visual features can then be extracted. In our experiments, we partitioned the 125 speakers into 10 non-overlapping subsets, and a 10-fold cross-validation was employed.

For each fold, 8 subsets of data are used for training and 2 subsets are used for testing. We run our experiments in a speaker-independent scenario; therefore, the speakers in the training and test subsets do not overlap.

In order to pre-train the stacked auto-encoder, mini-batch gradient descent with a batch size of 64 and a learning rate of 0.01 is used. A random 20% of the input elements are corrupted to zero by applying masking noise. Each layer of the stacked auto-encoder has 1024 hidden neurons, and the training of each layer is performed for 50 epochs. After the pre-training of the stacked auto-encoder, another 1024-unit hidden layer and a classification layer are added. The whole network is then fine-tuned using mini-batch gradient descent with a batch size of 256. Both the pre-training and fine-tuning processes are carried out on GPUs and implemented with the Theano toolkit [8].

With respect to the HMM model, we use 11 word models with 30 states to model the 11 digit pronunciations. Each HMM state is modelled by a 9-mixture GMM with diagonal covariance. In our experiment, the digit recognition task is treated as a connected-word speech recognition problem with a simple syntax, i.e., any combination of digits and silence is allowed in any order. The HMM is implemented with the HTK toolkit [123].

Stacked Auto-Encoder Architecture

In order to confirm whether the deep feature learning architecture is necessary and can learn a better information representation than shallow feature learning techniques, we evaluate the features learned by stacked auto-encoders with various numbers of hidden layers.

Meanwhile, in order to confirm that the pre-training process benefits the visual feature learning, a stacked auto-encoder without unsupervised pre-training is also evaluated.

Table 4.2: Evaluation of various stacked denoising auto-encoder architectures.

Auto-Encoder Layers   Pre-training   Accuracy
1 (200)               Yes            43.2%
2 ( )                 Yes            55.7%
3 ( )                 Yes            57.3%
3 ( )                 No             49.9%
4 ( )                 Yes            57.1%

Table 4.2 reports the visual speech accuracy using the features learned by the various stacked auto-encoder architectures. From this table, one can observe that a better feature representation is obtained as the number of hidden layers increases. One can also note that the use of pre-training results in a better accuracy. However, the table also shows that, beyond 3 hidden layers, adding further hidden layers does not boost speech accuracy any further. Similar results were also found in acoustic speech recognition tasks [33]: with a sufficiently large number of auto-encoder layers, increasing the number of layers cannot further boost the speech recognition accuracy. A possible explanation is that when the auto-encoder is large enough, adding new layers cannot increase the representative ability of the network. Moreover, adding new layers requires a larger amount of data to ensure the auto-encoder is sufficiently trained.

Performance of the Augmented Bottleneck Feature

Unlike the standard bottleneck feature learning process, the learned bottleneck feature is concatenated with the DCT feature, and LDA is further used to decorrelate the feature vector and reduce its dimensionality. Hence, the superiority of this feature extraction scheme needs to be evaluated.

Table 4.3: Visual speech recognition performance comparison between our proposed DBNF and other methods.

Feature           Reduction   Dimension   Accuracy
DCT               MMI         -           -
DCT               mRMR        -           -
DCT               CMI         -           -
DCT               LDA         -           54.7%
LBP-TOP           MMI         -           -
LBP-TOP           mRMR        -           -
LBP-TOP           CMI         -           -
DBNF              None        200         57.3%
DBNF              LDA         40          63.3%
Augmented DBNF    LDA         20          67.8%

As illustrated in Table 4.3, with the use of LDA, the accuracy of the DBNF increases from 57.3% (200 dimensions) to 63.3% (40 dimensions), which shows that LDA is able to decorrelate the feature learned by the stacked auto-encoder and reduce the feature dimension. Meanwhile, our proposed method yields an accuracy of 67.8% by concatenating the DCT feature with the DBNF. This shows that our proposed augmented DBNF is able to produce an even higher accuracy because it embeds both local and global information into one single feature vector.

In order to demonstrate the superiority of our proposed augmented DBNF, we list some other popular appearance-based visual features in Table 4.3. In particular, we compare our proposed DBNF and augmented DBNF with two features (DCT [40] and LBP-TOP [129]) and two types of feature reduction technique, i.e., LDA [90] and the mutual information feature selectors (MMI, mRMR, CMI) [40]. As shown in Table 4.3, the visual speech accuracy of our proposed augmented DBNF, which takes two complementary information representation methods (i.e., local and global information) into account, outperforms all the listed visual feature types and feature dimension reduction schemes.

More specifically, as shown in Table 4.3, apart from the deep learning techniques proposed in this work, DCT with LDA (DCT+LDA) yields the highest accuracy (54.7%). In our study, we also found that LDA failed to obtain a proper transformation on the raw 1770-dimensional LBP-TOP feature vectors, because modelling such a high-dimensional feature using LDA requires at least 1770 training samples for each of the 308 classes (HMM states). Although the data corpus we used is a relatively large audio-visual connected-digit speech database, the amount of data is still not large enough to perform the LDA reduction on the LBP-TOP features. Compared with the mutual information feature selectors (MMI, mRMR and CMI), using the proposed stacked auto-encoders to learn features achieves a relative improvement of 8% (57.3%), because the deep learning techniques are able to make full use of the information in the LBP-TOP feature, while the mutual information selectors only select several relatively informative components from the original LBP-TOP.

Since LDA is able to decorrelate the feature components and reduce the feature dimension, we use LDA to further optimise the DBNF. After this optimisation, the visual speech accuracy further increases to 63.3%. The reasons for this significant accuracy increase are twofold: 1) compared with the 200-dimensional DBNF, the feature dimension is dramatically reduced to 40 to avoid the curse of dimensionality; 2) since the units of the stacked auto-encoder are fully connected between adjacent layers, the components of the DBNF are correlated, and employing LDA decorrelates them. In this work, we also proposed an augmented DBNF, which is able to embed both local (LBP-TOP) and global (DCT) information into a compact feature vector, and this augmented DBNF yields a 24% relative improvement compared with DCT+LDA.

4.4 Conclusion

In this paper, we propose an augmented DBNF for visual speech recognition. This augmented DBNF is first learned with a stacked denoising auto-encoder, followed by a fine-tuning process using a feed-forward neural network. The DBNF is then augmented by concatenating the DCT feature vector, and LDA is applied to decorrelate the feature and reduce the feature dimension. Experimental results show that our proposed augmented DBNF significantly boosts speech accuracy. Unlike recently proposed works which treat lipreading as a classification problem [3, 86, 131, 107], we tackled it in a manner similar to a speaker-independent acoustic speech recognition task, which needs to capture the temporal dynamics of the data (e.g., by using HMMs). To the best of our knowledge, this is the first work which explores the use of deep bottleneck features for visual speech recognition, and the first to show that deep-learned visual features can achieve a significant improvement over hand-crafted features.
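To summarise the assembly described in this chapter, the following is a minimal illustrative sketch, not the authors' Theano implementation: a single tied-weight denoising auto-encoder layer (cf. Eq. 4.1), whose output stands in for the trained bottleneck feature, concatenated with DCT features and reduced with LDA. All shapes, labels and the untrained weights are placeholder assumptions.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

class DenoisingAutoencoderLayer:
    """One tied-weight denoising auto-encoder layer (cf. Eq. 4.1), untrained here."""
    def __init__(self, n_in, n_hidden, corruption=0.2):
        self.W = rng.normal(0.0, 0.01, size=(n_in, n_hidden))
        self.b = np.zeros(n_hidden)   # hidden bias
        self.c = np.zeros(n_in)       # visible bias
        self.corruption = corruption

    def corrupt(self, x):
        # Masking noise q_D: randomly set a fraction of the inputs to zero.
        return x * (rng.random(x.shape) > self.corruption)

    def encode(self, x):
        return sigmoid(x @ self.W + self.b)     # y = sigma(W x + b)

    def decode(self, y):
        return sigmoid(y @ self.W.T + self.c)   # tied weights W^T

# Placeholder features: 1770-dim LBP-TOP and 96-dim DCT per frame, with state labels.
lbp_top = rng.random((500, 1770))
dct_feat = rng.random((500, 96))
states = rng.integers(0, 30, size=500)

layer = DenoisingAutoencoderLayer(1770, 200)       # bottleneck-sized hidden layer
bottleneck = layer.encode(layer.corrupt(lbp_top))  # stands in for the trained DBNF

augmented = np.hstack([bottleneck, dct_feat])      # DBNF + DCT
lda = LinearDiscriminantAnalysis(n_components=20)
reduced = lda.fit_transform(augmented, states)     # compact feature for the HMM
print(reduced.shape)  # (500, 20)
```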

Chapter 5

Visual Feature Learning Using Deep Boltzmann Machine 1

Abstract

This paper presents a novel feature learning method for visual speech recognition using Deep Boltzmann Machines (DBM). Unlike existing visual feature extraction techniques, which solely extract features from video sequences, our method is able to explore both acoustic and visual information to learn a better visual feature representation. During the test stage, instead of using both audio and visual signals, only the videos are used to generate the missing audio features, and both the given visual features and the generated audio features are used to produce a joint representation. We carried out experiments on a new large-scale audio-visual corpus, and the experimental results show that our proposed techniques outperform hand-crafted features and previously learned features, and can be adopted by other deep learning systems.

1 Published in Proceedings of The IEEE International Conference on Computer Vision,

5.1 Introduction

Continuous efforts have been made towards the development of Automatic Speech Recognition (ASR) systems in recent years, and numerous ASR systems (e.g., Apple Siri and Microsoft Cortana) have come into daily use. Although ASR research has made remarkable progress, practical ASR systems are still prone to environmental noises. A possible solution to overcome the recognition degradation in the presence of acoustic noises is to take advantage of the visual stream, which is able to provide complementary information to the acoustic channel. Despite the promising application prospects of Audio-Visual Speech Recognition (AVSR), the problem of how to extract visual features from videos remains a difficult one. In order to improve visual feature representation techniques, Visual Speech Recognition (VSR), also known as lipreading, has emerged as an attractive research area in recent years [132].

In addition to the combination of visual and audio features to boost speech accuracy, another promising aspect of VSR is its wider range of potential real-life applications compared to acoustic-based ASR. As shown in Fig. 5.1, in many practical applications, ASR systems are exposed to noisy environments, and the acoustic signals are almost unusable for speech recognition. On the other hand, with the availability of front and rear cameras on most mobile devices, users can easily record facial movements to improve speech accuracy. In these extremely noisy environments, the visual information becomes essentially the only source that ASR systems can use for speech recognition.

Although lipreading techniques provide an effective potential solution for overcoming environmental noises in ASR systems, there are still several challenges in this area. Unlike the well-established audio features, such as the Mel Frequency Cepstral Coefficients (MFCC), how to encode the speech-related visual information into a compact feature vector is still a difficult problem, because lip movements are less easily distinguishable between different utterances than the corresponding audio signals.

Figure 5.1: Possible application scenarios of our proposed framework. In a noisy environment, visual features are a promising solution for automatic speech recognition.

Another challenge is that the fusion of the audio and visual signals dramatically degrades the speech recognition performance in the presence of noisy acoustic signals.

Given the aforementioned visual speech recognition challenges, this paper provides a new perspective that addresses both of them. For the first challenge, since the audio features perform much better than the visual features, we use both audio and visual information to learn a more pertinent feature representation for speech recognition. Moreover, the trained feature representation model is also capable of inferring the audio features when only the visual information is available. Hence, during the test stage, the audio signals, which may be severely corrupted by environmental noises, are not required. Instead, the visual feature is used to reconstruct the audio information, and both the given visual and the inferred audio information are used to yield a joint feature representation. Therefore, the second aforementioned challenge can also be solved.

The rest of this paper is arranged as follows: Section 5.2 introduces related works and, based on this review, presents the contributions of this paper. The feature learning scheme is presented in Section 5.3. We extensively evaluate the performance of the different visual features in Section 5.4. Finally, we summarise the paper in Section 5.5.

5.2 Related Works and Our Contributions

Compared with the well-established audio features (e.g., MFCC), there is no universally accepted visual feature for representing lipreading-relevant information [132]. In this section, we first review recent visual feature extraction works and then highlight the key contributions of our work to this area.

Generally speaking, visual features can be divided into two categories: appearance-based features and shape-based features [91]. For appearance-based features, image transformation techniques are performed on raw image pixels to extract visual features, while the parameters of lip shape models are used to extract shape-based features. Although shape-based visual features are able to explicitly capture the shape variations of the lips, an extremely large number of lip landmarks need to be laboriously labelled, which is infeasible for large-scale speech recognition tasks. On the other hand, appearance-based visual features are computationally efficient and do not require any training process. Hence, appearance-based features have been widely adopted in recent years [132].

In terms of appearance-based visual features, Zhao et al. [129] introduced a Local Binary Pattern (LBP) based spatio-temporal visual feature, called LBP-TOP. This feature produced an impressive performance over other existing feature extraction techniques on various lipreading tasks.

Despite the promising performance of LBP-TOP features, the dimensionality of the raw LBP-TOP feature is very large, which makes the system succumb to the curse of dimensionality. Hence, a number of other works have been presented to encode the visual information of the LBP-TOP features in more informative representations [133, 79, 3, 86, 107, 131]. However, these works focused only on isolated word and phrase recognition; they did not consider connected-word or continuous speech recognition, which is highly in demand for modern speech recognition systems using Hidden Markov Models (HMMs) [21]. Given the rich speech-relevant visual information embedded in LBP-TOP visual features, this paper presents a novel feature learning technique which can exploit the speech-relevant information in the raw LBP-TOP features.

Motivated by the great success achieved by deep learning techniques in the area of acoustic speech recognition [23], this paper introduces a new visual feature learning technique to improve lipreading accuracy. In this paper, we use Deep Boltzmann Machines (DBM) [97] to learn the visual features. Encouraging pioneering works employing deep learning techniques for visual speech recognition have been carried out by Ngiam et al. [73] and Huang et al. [49]. However, in [73], the visual features trained by the deep Auto-Encoder (AE) were fed to a Support Vector Machine (SVM), which limited the work mainly to isolated word recognition. Huang et al. [49] trained a Deep Belief Network (DBN) to predict the posterior probability of HMM states given the observations, which can be used for continuous speech recognition. However, the visual feature learned by their deep learning techniques did not show a marked improvement over the benchmark HMM/GMM model. Meanwhile, Ngiam et al. [73] proposed a cross-modality learning framework that used both audio and visual information to train a shared representation. However, this framework failed to yield a better accuracy than the visual-only learning framework.

This cross-modality learning framework does, however, provide a new perspective on overcoming the low lipreading accuracy problem. More specifically, although practical ASR systems are usually exposed to acoustically noisy environments, it is always easy to collect both clean audio and visual data in controlled lab environments. This means that we can train a feature learning model that uses both clean audio and visual signals to learn a better shared representation. When this well-trained system is used in a noisy environment, instead of relying on the noisy acoustic signals, only the captured video signals are used to generate the joint feature representation for visual speech recognition.

The feature learning techniques used in [73, 49] are not able to generate an adequate shared representation when one modality (i.e., audio) is missing. Fortunately, Salakhutdinov and Hinton [97] proposed the Deep Boltzmann Machine (DBM), a deep Restricted Boltzmann Machine (RBM) based model with the ability to infer a missing modality. Unlike the DBN [47], which is a directed model based on RBMs, the DBM is an undirected graphical model with bipartite connections between adjacent hidden layers. The undirected structure of the DBM allows the model to infer a missing modality with a Gibbs sampler [97]. Srivastava and Salakhutdinov first used the DBM for multimodal classification of images and text tags [103], and later demonstrated the superior performance of the DBM in the case of audio-visual speech recognition [102]. However, how to infer the missing audio from the video was not considered in their paper.

In this paper, we propose a novel formulation of the multimodal DBM for the audio-visual connected-word speech recognition task and make the following key contributions:

- Unlike previous works that only extract visual features from video data, we propose a novel framework that uses both the audio and visual signals to enrich the visual feature representation.
- Although both audio and visual features are required during training, we only use the visual features for testing, since our feature learning framework is capable of inferring the missing (i.e., degraded) audio modality. Hence, our proposed framework provides a promising solution for practical automatic speech recognition systems deployed in very noisy environments, as it exploits the more reliable visual modality instead of the audio signals.

Figure 5.2: Block diagram of our proposed system. The left side of the figure shows the training phase: the visual feature is learned from both the audio and video streams using a multimodal DBM. The right side of the figure shows the testing phase, where the audio signal is not used; instead, the audio is generated by clamping the video input and sampling the audio input from the conditional distribution.

5.3 Proposed Feature Learning Scheme

The block diagram of the proposed system is shown in Fig. 5.2. The visual feature learned by the Deep Boltzmann Machine (DBM) is concatenated with the Discrete Cosine Transform (DCT) feature vector, followed by a Linear Discriminant Analysis (LDA) to decorrelate the feature and reduce the feature dimension.

Then, the Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) is used as the classifier for visual speech recognition. In the following section, the DBM model is first introduced.

Multimodal Deep Boltzmann Machine

The DBM consists of a series of Restricted Boltzmann Machines (RBM) stacked on top of each other. The energy of the joint configuration of the visible units v and the hidden units h can be formulated as:

E(v, h; \theta) = -v^{\top} W^{(1)} h^{(1)} - \sum_{i=2}^{n} h^{(i-1)\top} W^{(i-1)} h^{(i)},   (5.1)

where \theta = \{W^{(1)}, W^{(2)}, \ldots, W^{(n-1)}\} is the model parameter, i.e., the set of weights between the different layers. The joint distribution of the model can be formulated as:

P(v; \theta) = \sum_{h^{(1)},\ldots,h^{(n)}} P(v, h^{(1)}, \ldots, h^{(n)}; \theta) = \frac{1}{Z(\theta)} \sum_{h^{(1)},\ldots,h^{(n)}} \exp\left(-E(v, h^{(1)}, \ldots, h^{(n)}; \theta)\right),   (5.2)

where Z(\theta) is the partition function. The training process of the DBM can be divided into two steps, pre-training and fine-tuning, which are introduced in the following.

DBM Pre-training

The pre-training of the DBM is carried out by training RBMs in a greedy layer-wise manner. Since the inputs of the DBM are real-valued and all the hidden units are binary, the RBMs between the input layer and the first hidden layer are Gaussian RBMs, while the RBMs between adjacent hidden layers are binary RBMs.

The DBM has two real-valued input streams, the visual input v_v and the audio input v_a, and a sequence of binary-valued hidden layers. For the D-dimensional input v_i \in \{v_v, v_a\} of each stream, the energy of the D-dimensional input v and the first hidden layer h^{(1)}, which consists of F^{(1)} hidden units, can be modelled as follows:

E(v, h^{(1)}; \theta) = \sum_{i=1}^{D} \frac{(v_i - b_i)^2}{\sigma_i^2} - \sum_{i=1}^{D} \sum_{j=1}^{F^{(1)}} \frac{v_i}{\sigma_i} W_{ij} h_j^{(1)} - \sum_{j=1}^{F^{(1)}} a_j h_j^{(1)},   (5.3)

where \theta = (W, a, b) are the model parameters, W is the weight matrix between the two adjacent layers, a is the bias of the hidden layer, b is the bias of the visible layer, and \sigma_i is the standard deviation of the i-th input. The energy between the k-th hidden layer and the (k+1)-th hidden layer is defined by Eq. 5.4. This process is continued until all RBM layers are pre-trained using Contrastive Divergence (CD) [47]. Once the audio and visual streams have been pre-trained separately, an additional hidden layer is added on top of the audio and visual streams:

E(h^{(k)}, h^{(k+1)}; \theta) = -\sum_{i=1}^{F^{(k)}} \sum_{j=1}^{F^{(k+1)}} h_i^{(k)} W_{ij} h_j^{(k+1)} - \sum_{i=1}^{F^{(k)}} b_i h_i^{(k)} - \sum_{j=1}^{F^{(k+1)}} a_j h_j^{(k+1)}.   (5.4)

DBM Fine-tuning

Once the units in each layer are pre-trained, the joint distribution of the multimodal DBM, as shown in Fig. 5.2, is formulated by applying the visual (first) and visual-hidden (second) terms in Eq. 5.3 and the hidden-hidden (first) term from Eq. 5.4,

and using Eq. 5.2 to yield the joint distribution over the audio and visual inputs:

P(v; \theta) = \frac{1}{Z(\theta)} \sum_{h^{(2)}, h^{(3)}} \exp\Bigg( \sum_{k \in \{a,v\}} \Bigg( -\sum_{i=1}^{D_k} \frac{(v_{ki} - b_{ki})^2}{\sigma_{ki}^2} + \sum_{i=1}^{D_k} \sum_{j=1}^{F_k^{(1)}} \frac{v_{ki}}{\sigma_{ki}} W_{kij} h_{kj}^{(1)} + \sum_{i=1}^{F_k^{(1)}} \sum_{j=1}^{F_k^{(2)}} h_{ki}^{(1)} W_{kij} h_{kj}^{(2)} + \sum_{i=1}^{F_k^{(2)}} \sum_{j=1}^{F_k^{(3)}} h_{ki}^{(2)} W_{kij} h_{kj}^{(3)} \Bigg) \Bigg),   (5.5)

where k \in \{a, v\} indexes the audio (a) and visual (v) streams. The parameters of the model are fine-tuned by approximating the gradient of the log-likelihood of the probability that the model assigns to the visible vectors v, i.e., L(P(v; \theta)), with respect to the model parameters \theta. Following [47], this process can be formulated as:

\frac{\partial L(P(v; \theta))}{\partial \theta} = \alpha \left( \mathbb{E}_{P_{data}}[v h^{\top}] - \mathbb{E}_{P_{model}}[v h^{\top}] \right),   (5.6)

where \alpha is the learning rate, \mathbb{E}_{P_{data}}[\cdot] is the data-dependent expectation, i.e., the expectation of P(v; \theta) with respect to the training data set, and \mathbb{E}_{P_{model}}[\cdot] is the model expectation, i.e., the expectation of P(v; \theta) defined by the model (Eq. 5.5). The approximation of the model parameters can be divided into two separate procedures: for the data-dependent expectation, mean-field inference is used, followed by a Markov Chain Monte Carlo (MCMC) based stochastic approximation procedure to approximate the model-dependent expectation. Further details of the training process can be found in [97].
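The layer-wise pre-training described above relies on contrastive-divergence updates of each RBM. The following is a rough, illustrative sketch of a single CD-1 update for a Gaussian-Bernoulli RBM, assuming inputs standardised to unit variance (sigma_i = 1); it is not the authors' implementation, and all sizes are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cd1_step(v0, W, a, b, lr=0.01):
    """One CD-1 update for a Gaussian-Bernoulli RBM (unit-variance visibles).

    W: (D, F) weights, a: (F,) hidden biases, b: (D,) visible biases.
    """
    # Positive phase: hidden probabilities and samples given the data.
    ph0 = sigmoid(v0 @ W + a)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: Gaussian mean reconstruction, then hidden probabilities again.
    v1 = b + h0 @ W.T
    ph1 = sigmoid(v1 @ W + a)
    # Approximate gradient of the log-likelihood (cf. Eq. 5.6) with one Gibbs step.
    n = v0.shape[0]
    W += lr * (v0.T @ ph0 - v1.T @ ph1) / n
    a += lr * (ph0 - ph1).mean(axis=0)
    b += lr * (v0 - v1).mean(axis=0)
    return W, a, b

# Placeholder sizes: 96-dim standardised inputs, 256 hidden units, one mini-batch.
D, F = 96, 256
W = rng.normal(0.0, 0.01, size=(D, F))
a, b = np.zeros(F), np.zeros(D)
batch = rng.normal(size=(64, D))
W, a, b = cd1_step(batch, W, a, b)
```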

Generating Missing Audio Modality

One highlight of this paper is the introduction of a new perspective for visual feature extraction, wherein the visual feature is learned from both the visual and audio modalities. This technique provides a more practical solution for many speech recognition scenarios in which the audio signals are almost unusable because of environmental noises. However, in order to make this system feasible, the missing audio signals need to be generated by the trained DBM model during classification. Fortunately, since the DBM is an undirected generative model, the audio signals can be inferred from the visual modality. The reconstructed audio feature can then be used as an augmented input to perform visual speech recognition.

More specifically, given the observed visual features, the audio feature is inferred by clamping the visual feature at the input units and applying a standard alternating Gibbs sampler [97] to sample the hidden units from the conditional distributions using the following equations:

P(h_j^{(k)} = 1 \mid h^{(k-1)}, h^{(k+1)}) = \sigma\Big( \sum_i W_{ij}^{(k)} h_i^{(k-1)} + \sum_m W_{jm}^{(k+1)} h_m^{(k+1)} \Big),   (5.7)

P(h_m^{(n)} = 1 \mid h^{(n-1)}) = \sigma\Big( \sum_j W_{jm}^{(n)} h_j^{(n-1)} \Big),   (5.8)

P(v_i = 1 \mid h^{(1)}) = \sigma\Big( \sum_j W_{ij}^{(1)} h_j^{(1)} \Big),   (5.9)

where i, j and m are the indices of the units in the corresponding layers. Once the audio feature is reconstructed, both the generated audio features and the given visual features are used together to generate a joint representation for speech recognition. This process is illustrated in Fig. 5.3, and its details are also shown in Algorithm 1.

Figure 5.3: The generation of the missing audio signals can be divided into two steps: 1) infer the audio signal from the given visual features; 2) generate a joint representation using both the reconstructed audio and the given visual features.

Algorithm 1: Process of generating the missing audio feature
1. Clamp the observed visual feature v_v at the input.
2. For each hidden layer k in the visual stream: Gibbs sample the hidden layer state in a bottom-up manner, and estimate P(h_i^{(k)} = 1 | h^{(k-1)}, h^{(k+1)}) using Eq. 5.7.
3. Gibbs sample the joint layer state, and estimate P(h_m^{(n)} = 1 | h^{(n-1)}) using Eq. 5.8.
4. For each hidden layer k in the audio stream: Gibbs sample the hidden layer state in a top-down manner, and estimate P(h_j^{(k)} = 1 | h^{(k-1)}, h^{(k+1)}) using Eq. 5.7.
5. Infer the missing audio feature using Eq. 5.9.
6. Gibbs sample the joint representation in a bottom-up manner by feeding both the reconstructed audio and the observed visual features into the network.
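A schematic sketch of Algorithm 1 is given below. It is illustrative only: the weight matrices, layer sizes and number of Gibbs sweeps are placeholder assumptions, biases are omitted for brevity, and the audio visibles are treated as Gaussian units with unit variance, so their reconstruction is taken as the mean activation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sample(p):
    """Sample binary states from Bernoulli probabilities."""
    return (rng.random(p.shape) < p).astype(float)

# Placeholder layer sizes: 1770-dim visual input, 39-dim audio input (hypothetical),
# two hidden layers per stream and one joint layer, all 512 units wide.
D_V, D_A, F = 1770, 39, 512
Wv1 = rng.normal(0, 0.01, (D_V, F))   # visual input  -> visual hidden 1
Wv2 = rng.normal(0, 0.01, (F, F))     # visual hidden 1 -> visual hidden 2
Wa1 = rng.normal(0, 0.01, (D_A, F))   # audio input   -> audio hidden 1
Wa2 = rng.normal(0, 0.01, (F, F))     # audio hidden 1 -> audio hidden 2
Wjv = rng.normal(0, 0.01, (F, F))     # visual hidden 2 -> joint layer
Wja = rng.normal(0, 0.01, (F, F))     # audio hidden 2  -> joint layer

def infer_audio(v_visual, n_gibbs=20):
    """Reconstruct the missing audio features with the visual input clamped."""
    n = v_visual.shape[0]
    # Bottom-up pass through the clamped visual stream (Eq. 5.7-style updates).
    hv1 = sample(sigmoid(v_visual @ Wv1))
    hv2 = sample(sigmoid(hv1 @ Wv2))
    # Random initialisation of the audio-side states.
    ha1 = sample(np.full((n, F), 0.5))
    ha2 = sample(np.full((n, F), 0.5))
    v_audio = np.zeros((n, D_A))
    for _ in range(n_gibbs):
        hj = sample(sigmoid(hv2 @ Wjv + ha2 @ Wja))        # joint layer (cf. Eq. 5.8)
        ha2 = sample(sigmoid(hj @ Wja.T + ha1 @ Wa2))      # audio hidden layer 2
        ha1 = sample(sigmoid(ha2 @ Wa2.T + v_audio @ Wa1)) # audio hidden layer 1
        v_audio = ha1 @ Wa1.T                              # Gaussian mean (cf. Eq. 5.9)
    # The joint representation is then recomputed bottom-up from (v_visual, v_audio).
    return v_audio

reconstructed = infer_audio(rng.random((4, D_V)))
print(reconstructed.shape)  # (4, 39)
```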

Figure 5.4: Discriminative fine-tuning of our proposed DBM model.

Deterministic Fine-Tuning

Once the DBM model is fully trained, its weights are used to initialise a deterministic multilayer neural network, as in [97]. For each input vector v_i \in \{v_v, v_a\}, the approximate posterior distribution Q(h_i \mid v_i) is estimated by mean-field inference. The marginals of Q(h_i \mid v_i) are then used together with the input vector v_i to form an augmented input for the deterministic multilayer neural network, as shown in Fig. 5.4. Standard back-propagation is used to discriminatively fine-tune the model.

Augmented Visual Feature Generation

In the last step of our proposed method, the visual feature learned from the DBM model is concatenated with the DCT feature to form an augmented visual feature (as shown in Fig. 5.2).

The LDA is used to decorrelate the augmented visual feature vector and to further reduce the feature dimension. Similar feature augmentation techniques have already been used in both acoustic speech recognition [124] and visual speech recognition [111]. Finally, this augmented visual feature is fed into an HMM recogniser.

5.4 Experiments

Data Corpus and Experimental Setup

Many high-quality recent papers have focused mainly on the task of isolated word/letter recognition or phrase classification [133, 79, 73, 3, 86, 107, 131, 102]. In contrast, we propose a more practical lipreading system that can perform connected speech recognition. In this case, the popular benchmark corpora, such as AVLetters [67], CUAVE [85] and OuluVS [129], or combinations of these corpora (as used in [73, 102]), are not fully useful because they are limited in both speaker number and speech content. In addition, some large-scale data corpora, such as AVTIMIT [43] and IBMIH [49], are not publicly accessible. Although XM2VTSDB [69], which is publicly available, has 200 speakers, the speech is limited to simple sequences of isolated word and digit utterances. In order to evaluate our VSR system on the more difficult connected speech recognition problem, a new large-scale audio-visual data corpus was established and used in our paper.

The data corpus used in our paper was collected through an Australia-wide research project called AusTalk [119]. It is a large-scale audio-visual database of spoken Australian English, including isolated words, digit sequences and sentences, recorded at 15 different locations in all states and territories of Australia.

In the proposed work, only the digit sequence data subset is used. In the digit data subset, 12 four-digit strings are provided for people to read. This set of digit strings is organised in a random manner, without any unnatural pause, to simulate PIN recognition and telephone dialling tasks. Moreover, the digit selection was carefully designed to ensure that each digit (0-9) occurs at least once in each serial position (see Table 5.1).

Table 5.1: Digit sequences in the AusTalk data corpus. For the digit 0, there are two possible pronunciations: zero ('z') and oh ('o').

Some recording examples can be seen in Fig. 5.5a. To generate the required visual information, the mouth ROIs, as illustrated in Fig. 5.5b, are cropped from the original videos using Haar features and AdaBoost [117]. In order to increase the statistical significance of our results, we split the digit session recordings of all 125 speakers into three groups, i.e., the training set, the validation set and the test set. The speakers in the different groups do not overlap. A three-fold cross-validation is then used, and the average word accuracy of the three runs is reported.

Audio and Visual Features

In terms of the audio feature, 13 Mel Frequency Cepstral Coefficients (MFCC) with Cepstral Mean Normalisation (CMN) were extracted, and the zero-th coefficient ap-

Figure 5.5: Examples in the AusTalk Corpus. Fig. 5.5a: original recordings in the corpus. Fig. 5.5b: corresponding mouth ROI examples extracted from the original recordings in Fig. 5.5a.

Audio and Visual Features

In terms of the audio feature, 13 Mel Frequency Cepstral Coefficients (MFCC) with Cepstral Mean Normalisation (CMN) were extracted, and the zero-th coefficient ap-
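As a minimal sketch of the audio front end described above (13 MFCCs with per-utterance cepstral mean normalisation), the following uses librosa; the frame length, hop size, and the choice of librosa itself are assumptions for illustration rather than the exact configuration used in the thesis.

```python
import librosa

def mfcc_with_cmn(wav_path, n_mfcc=13, frame_len=0.025, hop_len=0.010):
    """Extract n_mfcc MFCCs per frame and apply Cepstral Mean Normalisation (CMN),
    i.e. subtract each coefficient's mean over the utterance."""
    signal, sr = librosa.load(wav_path, sr=None)        # keep the native sample rate
    mfcc = librosa.feature.mfcc(
        y=signal,
        sr=sr,
        n_mfcc=n_mfcc,
        n_fft=int(frame_len * sr),
        hop_length=int(hop_len * sr),
    )                                                    # shape: (n_mfcc, n_frames)
    mfcc = mfcc - mfcc.mean(axis=1, keepdims=True)       # CMN over the utterance
    return mfcc.T                                        # one feature vector per frame
```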
