Facial expression recognition for multi-player on-line games


University of Wollongong Theses Collection, Year 2008

Facial expression recognition for multi-player on-line games
Ce Zhan, University of Wollongong

Zhan, Ce, Facial expression recognition for multi-player on-line games, MCompSc thesis, School of Computer Science and Software Engineering, University of Wollongong, 2008. This paper is posted at Research Online.


Facial Expression Recognition for Multi-player On-line Games

A thesis submitted in fulfillment of the requirements for the award of the degree Master of Computer Science from the University of Wollongong

by Ce Zhan
School of Computer Science and Software Engineering
February 2008

© Copyright 2008 by Ce Zhan. All Rights Reserved.

Dedicated to my grandparents, Hongchao Li and Junhui Dong

Declaration

This is to certify that the work reported in this thesis was done by the author, unless specified otherwise, and that no part of it has been submitted in a thesis to any other university or similar institution.

Ce Zhan
February 22, 2008

Abstract

Multi-player on-line games (MOGs) have become increasingly popular because of the opportunity they provide for collaboration, communication and interaction. However, compared with ordinary human communication, MOGs still have several limitations, especially in communication using facial expressions. Although detailed facial animation has already been achieved in a number of MOGs, players have to use text commands to control the expressions of avatars. This thesis proposes an automatic expression recognition system that can be integrated into a MOG to control the facial expressions of avatars. To meet the specific requirements of such a system, a number of algorithms are studied, tailored and extended. In particular, the Viola-Jones face detection method is modified in several aspects to detect small-scale key facial components with wide shape variations. In addition, a new coarse-to-fine method is proposed for extracting 20 facial landmarks from image sequences. The proposed system has been evaluated on a number of databases that are different from the training database and achieved an 83% recognition rate for four emotional state expressions. During the real-time test, the system achieved an average frame rate of 13 fps for the input images on a PC with a 2.80 GHz Intel Pentium processor. Testing results have shown that the system has a practical range of working distances (from user to camera), and is robust against variations in lighting and backgrounds.

Acknowledgments

I would like to take this opportunity to express my sincere gratitude to my supervisors, Dr. Wanqing Li, Prof. Philip Ogunbona and Prof. Farzad Safaei, for their invaluable guidance, advice, criticism and encouragement. I am also very grateful to my colleague Gang Zheng for his help in programming and for numerous discussions, which have given me tremendous confidence and inspiration. I wish to thank Yiyu and Xiaodong, who have taken care of me just like an elder sister and brother. This work was partly supported by the Smart Internet Technology (SIT) CRC, Australia. I would like to thank the SIT CRC for providing a research scholarship.

Contents

Abstract
Acknowledgments
1 Introduction
   Motivation and Objectives
   Contributions
   Publication List
   Organization of Thesis
2 Facial Expression Recognition: Literature Review
   Overview
   Face Detection
      Knowledge-based Methods
      Learning-based Methods
      Discussion
   Facial Feature Extraction
      Geometric Feature Extraction
      Appearance Feature Extraction
      3D Modeling
      Discussion
   Facial Expression Classification
      Frame-based Methods
      Sequence-based Methods
      Discussion
   Issues and Challenges
      Training and Testing Database
      Face Resolution
      Environment Variation
      Pose of the Head
      Individual Differences
   Summary
3 System Overview
   System Requirements and Specifications
   System Design
4 Key Facial Component Detection
   Extended Haar-like Feature Set
   High Hit Rate Cascade Training
   Regional Scanning with a Fixed Classifier
      Candidate Sub-window Selection
      Specialized Classifiers
5 Facial Landmark Localization
   Facial Landmark Detection
   Facial Landmark Tracking
6 Facial Feature Extraction
   Gabor Filters
   Gabor Wavelet Based Feature Extraction
7 Facial Expression Classification
   Support Vector Machines
      Linear Case
      Nonlinear Case
      Non-Separable Data
   Multiple Decision Making
8 Experimental Results
   Database
   Face Detection
   Parameter Tuning
      Gabor Kernel
      SVM Kernel
   Key Facial Component Detection
   Facial Landmark Localization
   System Performance
      Accuracy
      Computational Cost
9 Conclusion
   Summary and Conclusion
   Future Works
A Viola-Jones Face Detection Method
   A.1 Boosting
   A.2 Haar-like Feature Classifiers
   A.3 Cascade Architecture
Bibliography

List of Tables

2.1 Miscellaneous Actions [21]
The approximate relationship between the distance of the user to the camera and the facial component resolution
Mouth detection results for different face resolutions using cascade classifiers with different numbers of stages
Mouth detection results on resized faces and the corresponding original faces
The performance of the face detector on different databases
Coding scheme for combinations of frequencies
The best recognition rates achieved by using different SVM kernels
The performance of the facial landmark localization module on the BioID database (see Figure 8.9 as a reference for landmark locations)
Recognition results for 7 expressions
Recognition results for 4 expressions
Failure samples with corresponding expressions in the training database
The average processing CPU time for each module

List of Figures

2.1 General processing stages of facial expression recognition
2.2 The shape model used in [42]
2.3 Outline of the facial feature detection method proposed in [89]
2.4 Haar-like rectangle features used in [90]
2.5 Selected Haar-like rectangle features for different expressions: (a) neutral, (b) happiness, (c) anger, (d) sadness, (e) surprise, (f) disgust, (g) fear
2.6 Six prototypic emotional expressions: anger, surprise, sadness, disgust, fear, and happiness
2.7 Action units (AUs) [21]
3.1 The architecture of the proposed system
4.1 The Haar-like feature set used in the Viola-Jones face detection method [88]
4.2 The extended Haar-like feature set [54]
4.3 Rotated Summed Area Table (RSAT)
4.4 Calculation scheme for the pixel sum of a rotated rectangle
4.5 Calculation scheme for rotated features
5.1 The coarse-to-fine facial landmark localization process
5.2 Facial landmark estimation
5.3 Illustration of the corner refining method
6.1 Gabor filters with orientations of 0, π/4, π/2 and 3π/4
6.2 Gabor filters with the frequency of 0.25, 0.5, 1 and
6.3 Gabor filters with σ values of π, 2π, 3π and 4π
6.4 A family of Gabor kernels with three different frequencies and six different orientations
6.5 A face image after convolution with the Gabor filters shown in Figure 6.4
7.1 Separating hyperplanes. The left is a random one; the right one maximizes the margin of separability
8.1 Sample images from the JAFFE database
8.2 Sample images from the AR Face database
8.3 Sample images from the AT&T Face database
8.4 Sample images from the BioID Face database
8.5 Sample images from the BioID Face database
8.6 Sample images from the CIT Face database
8.7 Sample images from the FG-NET Face database
8.8 Face detection results from different databases
8.9 Facial landmark points which represent the facial geometry
8.10 The recognition performance when one of the 34 landmarks is removed
8.11 The recognition rate for different frequencies
8.12 The recognition rate for different numbers of orientations
8.13 Facial component detection results
8.14 Facial component detection results from the BioID database
8.15 Facial component detection results from the FG-NET database
8.16 Real-time facial component detection results
8.17 Average facial landmark detection rates for different face resolutions
8.18 Facial landmark localization results from the BioID database
8.19 Facial landmark localization results from the FG-NET database
8.20 Real-time facial landmark localization results
8.21 Recognition rates at different distances
8.22 Recognition results for the real-time test
A.1 Haar-like feature computation with the integral image. The feature value is S1 − S2, with S1 = E − B − D + A and S2 = F − C − E + B
A.2 The cascade architecture

Chapter 1
Introduction

1.1 Motivation and Objectives

A Multi-player On-line Game (MOG) is a type of computer game that is capable of supporting multiple users simultaneously and is played over the Internet. The collaboration, communication and interaction abilities of MOGs enable players to cooperate or compete with each other on a large scale. Thus, players can experience relationships that feel as real as those in the real world. This sense of realism makes MOGs more popular than any other type of computer game, despite the significant amounts of time and money required for playing. Taking young people in China as an example, according to Pacific Epoch's 2006 On-line Game Report [40], China had 30.4 million on-line gamers by the end of that reporting period.

Despite the advances in the interactive realism of MOGs, when compared with real-world human communication, the interfaces are still primitive. For example, in most existing MOGs, text chat is used rather than real-time voice chat, and during a conversation, avatars show no activity related to natural body gestures, facial expressions, etc. This thesis focuses on facial communication. In everyday life, the manifestation of facial expressions is a significant part of our social communication. Birdwhistell's linguistic analogy [7] suggests that the information conveyed by words amounts to only 20-30% of the information conveyed in a conversation. The underlying emotions conveyed by different facial expressions often give the same words different meanings. In order to feel immersed and socially

aware like in the real world, players must have an efficient method for conveying and observing changes in emotional states. Game designers use the following approaches to deal with facial expressions in existing MOGs [63, 62, 64, 93, 1]:

- Text expressions: pure text-based expressions are provided. Slash commands, such as /smile, /frown, and /wink, are used to convey different expressions, and the output expressions are described by text. For example, when a player, John, types the command /smile while having a conversation with another player, Bruce, in a game, "John smiles at Bruce" appears on the screen.

- Text expressions plus facial animation: facial expression animation is provided. The animation is also triggered by slash commands, and the corresponding text expressions accompany the animation. For example, when John types the command /smile while having a conversation with Bruce, John's avatar smiles, and at the same time, "John smiles at Bruce" appears on the screen.

- Text facial expressions plus body animation: body gesture animation is provided to represent facial expressions. For example, in the same case as above, the slash command /smile produces the text expression "John smiles at Bruce" plus a swing of an arm, but the avatar does not smile.

Although a number of existing MOGs have already achieved detailed animation, text commands do not offer an efficient way to control an avatar's expressions easily and naturally. They are simple and straightforward, but not easy to use. First, players have to memorize all the commands, so the more sophisticated the facial expression system is, the harder it is to use. Second, humans convey emotions through expressions in real time; players cannot type text commands every few seconds to update their current mood. Third, facial communication should happen naturally and effortlessly, and typing commands ruins the realism. The most natural way to control an avatar's expressions is to use the player's own expression information. Thus, the goal of this work is to automatically recognize the player's facial expressions, so that the recognition results can be used to drive the facial expression engine of a MOG instead of text commands.
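To make the intended use concrete, the following is a minimal sketch of how a recognizer's output could replace manual slash commands. The game-side interface (send_slash_command), the label set and the command names are hypothetical placeholders, not part of the system described in this thesis.

```python
# Hypothetical glue between an expression recognizer and a MOG chat/animation
# engine: instead of the player typing /smile, the recognized label triggers
# the equivalent command automatically.

# Mapping from recognized emotional states to slash commands the game engine
# already understands (labels and commands are illustrative only).
EXPRESSION_TO_COMMAND = {
    "happiness": "/smile",
    "sadness": "/frown",
    "surprise": "/gasp",
    "anger": "/glare",
}

def send_slash_command(command: str) -> None:
    """Placeholder for the game client's existing command interface."""
    print(f"game engine <- {command}")

def update_avatar(recognized_expression: str, last_expression: str) -> str:
    """Forward a newly recognized expression; avoid re-sending the same state."""
    if recognized_expression != last_expression and recognized_expression in EXPRESSION_TO_COMMAND:
        send_slash_command(EXPRESSION_TO_COMMAND[recognized_expression])
        return recognized_expression
    return last_expression

# Example: a stream of per-frame recognition results drives the avatar.
last = "neutral"
for label in ["neutral", "happiness", "happiness", "surprise"]:
    last = update_avatar(label, last)
```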

1.2 Contributions

In this work, a real-time automatic facial expression recognition system is proposed and implemented for multi-player on-line games. In particular:

- The requirements and specifications of a facial expression recognition system are analyzed in the context of a MOG.
- Based on the requirements and specifications, some existing algorithms are examined, selected and tailored for an efficient implementation of the system.
- The Viola-Jones face detection method is modified in several aspects to detect small-scale facial components with wide shape variations.
- A new coarse-to-fine method is proposed for extracting 20 facial landmarks from image sequences.

1.3 Publication List

1. Ce Zhan, Wanqing Li, Philip Ogunbona, and Farzad Safaei. A real-time facial expression recognition system for on-line games. International Journal of Computer Games Technology, Special Issue on Cyber Games and Interactive Entertainment (invited paper, under review).
2. Ce Zhan, Wanqing Li, Philip Ogunbona, and Farzad Safaei. Real-time facial feature point extraction. Advances in Multimedia Information Processing - PCM 2007, LNCS 4810, Springer-Verlag Berlin Heidelberg.
3. Ce Zhan, Wanqing Li, Farzad Safaei and Philip Ogunbona. Emotional states control for on-line game avatars. Proceedings of the 6th ACM SIGCOMM workshop

on Network and System Support for Games.
4. Ce Zhan, Wanqing Li, Farzad Safaei and Philip Ogunbona. Face to face communications in multiplayer online games: a real-time system. Human-Computer Interaction, Part IV, HCII 2007, LNCS 4553, Springer-Verlag Berlin Heidelberg.
5. Ce Zhan, Wanqing Li, Philip Ogunbona, and Farzad Safaei. Facial expression recognition for multiplayer online game. ACM International Conference Proceeding Series, Vol. 207, Proceedings of the 3rd Australasian Conference on Interactive Entertainment.

1.4 Organization of Thesis

The thesis is organized into nine chapters. Chapter 1 gives the motivation and objectives of this research and lists the major contributions. A literature review on facial expression recognition is conducted in Chapter 2. In the review, related approaches and techniques are surveyed in terms of the three main processing stages of facial expression recognition: face detection, feature extraction and expression classification. In Chapter 3, the requirements of an automatic facial expression recognition system for MOGs are analyzed and specified. Based on the requirements, a real-time facial expression recognition system is proposed and an overview of the proposed system is given. In Chapters 4 through 7, the proposed algorithms for each module of the system are described in detail. The key facial component detection module is a modified version of the Viola-Jones face detector; all the modifications are introduced in Chapter 4. Chapter 5 describes the proposed facial landmark localization method, which is a coarse-to-fine process consisting of estimation, detection, refinement and tracking. After a brief introduction of Gabor filters, details of the feature extraction module are given in Chapter 6. Chapter 7 presents the mathematical principles of the support vector machine and its application to the proposed facial expression recognition system.

Chapter 8 presents the experimental results. A number of experiments were conducted on different databases to select the parameters of the algorithms and evaluate the performance of the system. The system performance is analyzed in terms of accuracy as well as computational cost. In Chapter 9, conclusions are given and some directions for future work are suggested.

Chapter 2
Facial Expression Recognition: Literature Review

2.1 Overview

In computer vision, a facial expression is usually considered as the deformation of facial components and their spatial relations, or changes in the pigmentation of the face. A facial expression recognition system (FERS) is a computer system that attempts to automatically classify these changes or deformations into abstract classes. Computer facial expression recognition, which is based purely on visual information, is distinct from emotion recognition, a problem often studied in the behavioral sciences: for emotion recognition, besides the visual information conveyed by facial expressions, many other factors such as voice, pose, gestures and gaze direction, as well as context, culture and gender, also need to be taken into account to interpret the emotions [8, 73].

As a research topic for psychologists, facial expression recognition has been studied since the work of Darwin in 1872 [14, 75, 19]. The earliest investigation of automatic facial expression recognition was presented by Suwa et al. [78], who analyzed facial expressions by tracking the motion of 20 identified spots in an image sequence. After that, because of its potential utility in application domains such as multimodal human-computer interfaces (HCI), image understanding and retrieval, talking heads and virtual reality, facial expression recognition attracted the interest of more computer vision researchers. Since the 1990s, research in facial expression recognition has

produced significant results, mainly due to the development of related technologies such as face detection, face recognition and human movement analysis, as well as the availability of relatively cheap computational power and image capture devices. To date, a large number of approaches have been proposed to analyze facial expressions from face images or image sequences. The early works were surveyed by Samal and Iyengar [74]. Fasel et al. [24] and Pantic et al. [67] published two comprehensive survey papers which summarized the approaches proposed up to the time of their publication, and Tian et al. [86] presented the more recent advances in facial expression recognition.

In general, three main processing stages can be distinguished in a facial expression recognition system (see Figure 2.1). First, the face must be detected before an expression can be analyzed; then, the facial expression information is extracted from the detected faces; finally, based on the extracted information, input images are classified into abstract expression categories. In the following sections, we elaborate on the processing stages of face detection, facial feature extraction and expression classification.

Figure 2.1: General processing stages of facial expression recognition

2.2 Face Detection

Face detection is the processing stage that automatically locates the face region in an input image or image sequence. As the first step of facial expression recognition, its reliability has a major influence on the performance and usability of the entire system. Numerous face detection methods have been proposed in the past [98, 32]; here we briefly classify them into two categories: knowledge-based methods and learning-based methods.
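As a minimal illustration of the three-stage pipeline in Figure 2.1, the sketch below chains an off-the-shelf face detector with placeholder feature-extraction and classification steps. The OpenCV Haar cascade is used only as a stand-in detector, and extract_features/classify_expression are hypothetical stubs, not the methods proposed in this thesis.

```python
# A minimal sketch of the detect -> extract -> classify pipeline of Figure 2.1.
# The detector is OpenCV's stock frontal-face Haar cascade; the feature
# extractor and classifier are stubs standing in for the modules surveyed below.
import cv2
import numpy as np

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def extract_features(face_patch: np.ndarray) -> np.ndarray:
    # Placeholder: a real system would compute geometric or appearance
    # features (e.g. landmark positions or Gabor responses) here.
    return cv2.resize(face_patch, (48, 48)).flatten().astype(np.float32) / 255.0

def classify_expression(features: np.ndarray) -> str:
    # Placeholder: a trained classifier (neural network, SVM, ...) would go here.
    return "neutral"

def recognize(frame_bgr: np.ndarray) -> list[str]:
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    labels = []
    for (x, y, w, h) in face_cascade.detectMultiScale(gray, scaleFactor=1.2, minNeighbors=5):
        features = extract_features(gray[y:y + h, x:x + w])
        labels.append(classify_expression(features))
    return labels
```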

2.2.1 Knowledge-based Methods

Knowledge-based face detection methods attempt to capture our prior knowledge about the face pattern with explicit rules or models. The rules and models can be based on facial components such as the eyes, nose and mouth and their relationships, the face contour, the texture of the face, skin color, etc. However, it is impossible to encode all human knowledge exactly into criteria that can be accurately interpreted by computers. Thus, when identifying face candidates based on predefined rules or models, mismatches often occur due to pose variations, cluttered scenes or the presence of unusual faces. On the other hand, these heuristic methods are usually simple to implement and work well for detecting frontal faces in uncluttered scenes.

Yang and Huang [97] proposed a hierarchical rule-based method for face detection. The proposed system consists of three levels of rules. The rules at the higher level are general descriptions of what a face looks like, while the rules at lower levels rely on details of facial features. First, a multi-resolution hierarchy of the input is created by averaging and downsampling. Then, the lowest resolution (Level 1) image is searched for all possible face candidates, and the candidates are further processed at finer resolutions. At Level 2, local histogram equalization is performed on the face candidates received from Level 1, followed by edge detection. Candidate regions that pass through the first two levels are then examined at Level 3 with another set of rules that respond to facial features such as the eyes and mouth. Although this method does not achieve a very high detection rate, the coarse-to-fine strategy and the ideas of using a multi-resolution hierarchy and rules to guide searches have been used in later face detection work.

Chetverikov and Lerch [10] presented a simple face detection method using blobs and streaks. Their face model consists of two dark blobs and three light blobs to represent the eyes, cheekbones, and nose. The model uses streaks to represent the outlines of the face, eyebrows, and lips. Two triangular configurations are utilized to encode the spatial relationships among the blobs. A low resolution Laplacian image is generated to

facilitate blob detection. Next, the image is scanned to find occurrences of the specific triangular configurations as candidates. A face is detected if streaks are identified around a candidate. Although the performance of this method is limited, it is a successful early attempt at face detection against cluttered backgrounds.

Miao et al. [60] detect faces by a hierarchical template matching method. In the first stage, an input image is rotated from -20° to 20° in steps of 5° in order to handle rotated faces. A multi-resolution image hierarchy is formed and edges are extracted using the Laplacian operator. The face template consists of the edges produced by six facial components: two eyebrows, two eyes, one nose, and one mouth. Finally, heuristics are applied to determine the existence of a face. This method can handle limited head rotations and shows better results on images containing a single face than on images with multiple faces.

The face detection method presented by Tang and Acton [79] is based on Gaussian distributions of the generic colors of skin and hair. First, a skin color model, estimated by a 2D Gaussian distribution, is used to segment the skin colors from the complex background. After an image that contains only the skin-colored pixel regions has been produced, a curve called the division curve, which delineates the boundary between the hair region and the face region, is detected based on a hair model. The hair model is also based on a 2D Gaussian distribution, and provides a secondary image that contains only the hair-colored pixel regions. Finally, the detected division curve is used to estimate the face region. This method is very effective for up-front face detection; however, the obvious limitation is that it cannot handle faces without hair.

2.2.2 Learning-based Methods

Learning-based methods try to model face patterns with distribution functions or discriminant functions under a probabilistic framework. Methods of this kind are not limited by our describable knowledge of faces but are determined by the capability of the learning model and the training samples. Thus, learning-based methods are able to deal with more complex cases than knowledge-based approaches. Meanwhile, powerful

learning models and large training datasets with great variations lead to high computational complexity. The compromise between detection rate and computational load is always a main issue for learning-based face detection methods.

Rowley et al. [72] use a multilayer neural network to learn face and non-face patterns from the intensities and spatial relationships of pixels in face and non-face images. Each network is trained on windows of a fixed base resolution, where each window is preprocessed to equalize the intensity values across it. The intensity is normalized with a function that also takes into account a simple elliptical model; the elliptical model masks the edges of the training window so as to exclude potential background pixels. The neural networks have retinal connections as their individual input layers and output a single real number that indicates the presence or absence of a face. The output decision is based on the hidden units, which are selected to identify the local features of a human face. The window is then scaled, and the outputs of all the networks are merged to make the final decision on whether the searched pattern is a face or a non-face. By using multiple differently trained networks, this method achieves more accurate detections. However, the computational expense is also high due to the complex structure of the networks, and only upright, frontal faces can be detected by the proposed system.

Moghaddam and Pentland [61] developed a probabilistic visual learning method based on density estimation in a high-dimensional space using an eigenspace decomposition. Principal component analysis (PCA) is used to define the subspace and represent a set of face patterns. The principal components preserve the major linear correlations in the data and discard the minor ones. This method decomposes the vector space into two mutually exclusive and complementary subspaces: the principal subspace and its orthogonal complement. Therefore, the target density is decomposed into two components: the density in the principal subspace and the density in its orthogonal complement. A multivariate Gaussian and a mixture of Gaussians are used to learn the statistics of the local features of a face. These probability densities are then used for object detection based on maximum likelihood estimation. This approach can detect

faces in variable pose, but it only works well for input images containing one face.

Schneiderman and Kanade [76] presented a naive Bayes classifier to estimate the joint probability of local appearance and position of face patterns at multiple resolutions. At each scale, a face image is decomposed into four rectangular subregions. These subregions are projected into a lower dimensional space using PCA and quantized into a finite set of patterns, and the statistics of each projected subregion are estimated from the projected samples to encode local appearance. The method finally decides that a face is present when the likelihood ratio is larger than the ratio of the prior probabilities. This Bayesian approach shows an acceptable detection rate and is able to handle some degree of rotated and profile faces. However, its processing speed is relatively slow.

The real-time face detector proposed by Viola and Jones [88] is considered to be the breakthrough in learning-based face detection methodology, achieving the best compromise between detection rate and speed. The method achieves real-time detection by using very simple, easy-to-compute Haar-like features. High detection rates are obtained by the use of a boosting algorithm, AdaBoost [27], which selects the most representative features from a large set to train a cascade of classifiers. The face detector scans an image with a sub-window at different scales, and each sub-window is tested by the trained cascade classifier. If the sub-window is clearly not a face, it is rejected by one of the early-stage classifiers in the cascade, while sub-windows that are more difficult to discriminate are examined by the more specific classifiers later in the cascade.

The original Viola-Jones face detector was proposed for frontal or near-frontal face detection. Later, Jones and Viola extended their work to multi-view face detection [43]. They employed separate detectors for different views of the face, using the original method with different training samples. A decision tree is then trained to estimate the head pose; once the pose is known, the face detector corresponding to that pose is used to find the face area. The original Viola-Jones algorithm was also extended to multi-view face detection by Li et al. [49]. In their method, instead of using discrete

AdaBoost, the real-valued version of AdaBoost is used, and the algorithm is modified to exclude some of the already selected weak classifiers in order to overcome the non-monotonicity problem of the greedy selection method. To detect faces independently of head pose, a pyramid of coarse-to-fine detectors is trained. In [55, 89, 25], the Viola-Jones detection method is extended to save training time. First, a single-generation genetic algorithm is used to simplify the exhaustive search for the best classifier among all possible features. Secondly, GentleBoost [28] is adopted instead of AdaBoost for feature selection. Finally, the training procedure is modified so that after each single feature, the system can decide whether to test another feature or to make a decision.

2.2.3 Discussion

An ideal face detector for facial expression recognition should be able to locate faces regardless of clutter, occlusions, and variations in face pose and lighting conditions. However, for most of the existing work in face detection, the conditions under which an image is obtained are controlled. In particular, most of the proposed systems can detect only upright faces in frontal or near-frontal view. For those approaches which can handle complex variations, the detection rate or processing speed is always sacrificed.

Facial expression analysis can be sensitive to face pose variations and illumination changes. To combat the effect of these unwanted factors, normalization may be conducted on the detected faces prior to the later processing stages. Lighting variations can be reduced by signal conditioning such as noise removal and filtering. Pose variations are often caused by scale changes as well as head rotations. Scale changes of faces are easy to handle by resizing after determining the size of the detected faces. The center positions of distinctive facial features such as the eyes, nose and mouth are often used as reference points to normalize the face geometrically. Faces with in-plane rotations can be directly aligned based on these reference points. Situations involving out-of-plane rotations are much more complex, since the faces are distorted in comparison to frontal face displays or may even become partly invisible. Limited out-of-plane rotations can be handled by warping techniques, where the normalization is based on some generic face models (e.g. [23, 4]).
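As an illustration of the geometric normalization described above, the sketch below aligns a detected face by an in-plane rotation and rescaling computed from the two eye centers. The eye coordinates are assumed to come from some upstream facial feature detector, and the target geometry (output size, desired eye positions) is an arbitrary choice for the example, not a value prescribed by this thesis.

```python
# Geometric normalization of a face image using the eye centers as reference
# points: rotate so the eyes are horizontal, scale so the inter-ocular
# distance is fixed, and crop to a canonical window.
import cv2
import numpy as np

def normalize_face(gray: np.ndarray,
                   left_eye: tuple[float, float],
                   right_eye: tuple[float, float],
                   out_size: int = 96,
                   eye_dist: float = 40.0) -> np.ndarray:
    (lx, ly), (rx, ry) = left_eye, right_eye
    angle = np.degrees(np.arctan2(ry - ly, rx - lx))   # in-plane rotation of the eye line
    scale = eye_dist / max(np.hypot(rx - lx, ry - ly), 1e-6)
    center = ((lx + rx) / 2.0, (ly + ry) / 2.0)         # midpoint between the eyes

    # Similarity transform that rotates and scales about the eye midpoint ...
    M = cv2.getRotationMatrix2D(center, angle, scale)
    # ... then translates the midpoint to a fixed position in the output window.
    M[0, 2] += out_size / 2.0 - center[0]
    M[1, 2] += 0.35 * out_size - center[1]
    return cv2.warpAffine(gray, M, (out_size, out_size), flags=cv2.INTER_LINEAR)
```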

Although face normalization may be a reasonable processing stage in facial expression recognition, it is not mandatory. A number of approaches, such as the appearance-based model of [47] and the local motion model of [58], have achieved good recognition performance without relying on face normalization.

2.3 Facial Feature Extraction

After the face has been detected, the next step is to extract and represent the information about the facial expression to be recognized. During this stage, the pixel data of the images are converted into a higher-level representation, known as a feature vector, which is then used for the subsequent expression classification. The facial features usually extracted are either geometric features, such as the shapes of the facial components (eyes, mouth, etc.) and the locations of facial landmarks (corners of the eyes, mouth, etc.), or appearance features representing the texture of the facial skin, including wrinkles, bulges, and furrows. The shape and position information of facial features is used directly to form geometric feature vectors. When extracting appearance-based feature vectors, image filters such as Gabor wavelets and Haar-like filters are often applied to the face image.

2.3.1 Geometric Feature Extraction

Cootes et al. [13] proposed the Active Shape Model (ASM) algorithm, which detects facial landmarks through a local search constrained by a global shape model statistically learned from training data. This method has been used extensively for facial deformation tracking, but may fail under large expression transformations. To solve this problem, the algorithm was extended by Hu et al. [42] by using a specific ASM model for each cluster in the embedded space; the shape model is shown in Figure 2.2. In their method, on-line model selection is conducted probabilistically in a cooperative manner with expression classification so as to improve tracking reliability.

Figure 2.2: The shape model used in [42]

In the works presented by Tian et al. [85, 83, 84], to detect and track changes

of facial components in near-frontal face images, multi-state models are developed to extract the geometric facial features. A three-state lip model describes the lip state: open, closed, or tightly closed. A two-state model (open or closed) is used for each of the eyes. Each brow and cheek has a one-state model. Given an image sequence, the region of the face and the approximate locations of individual facial features are detected automatically in the initial frame by employing the method proposed in [72]. The contours of the facial features and components are adjusted manually in the initial frame. After this initialization, all facial feature changes are automatically detected and tracked in the image sequence. The system groups 15 parameters for the upper face and 9 parameters for the lower face, which describe the shape, motion, and state of face components and furrows. To remove the effects of variation in planar head motion and of differences in face size between image sequences, all parameters are computed as ratios of their current values to those in the reference frame.

Some recognition systems directly use the positions of facial landmarks as the features of facial expressions. Typical facial point detection methods, such as [33], which is based on log-Gabor wavelets, and [9], which uses an AdaBoost detection method, cannot provide sufficiently accurate results for such systems: they usually regard the localization of a point as a success if the distance between the automatically labeled point and the manually labeled point is less than 30% of the true inter-ocular distance. To handle

this, Vukadinovic and Pantic [89] developed a high-accuracy facial feature point detector. The method, illustrated in Figure 2.3, consists of four steps: face detection, region of interest (ROI) detection, feature extraction, and feature classification. The Viola-Jones face detection framework is employed to locate the face region. Then, the face region is divided into 20 relevant ROIs, each corresponding to one facial landmark to be detected. In each ROI, a point is detected by feature patch templates, which are GentleBoost templates built from both gray-level intensities and Gabor wavelet features. Although the method provides accurate facial landmark positions automatically, the fact that it employs a three-layer classification process imposes a high computational load, which is unaffordable in real-time applications.

Figure 2.3: Outline of the facial feature detection method proposed in [89]

2.3.2 Appearance Feature Extraction

Gabor wavelets [16] are widely used to extract facial appearance changes as a set of multi-scale and multi-orientation coefficients. Ford [26] and Bartlett et al. [4] applied a family of Gabor wavelets at five spatial frequencies and eight orientations to the whole face image. In order to provide robustness to lighting conditions and to image shifts, they employed a representation in which the outputs of two Gabor filters in quadrature are squared and then summed. This representation, known as Gabor energy filtering, models the complex cells of the primary visual cortex.
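For concreteness, here is a minimal sketch of the kind of Gabor filter bank described above, convolving a face image with kernels at several frequencies and orientations and summing the squared quadrature outputs into an energy response. The particular kernel parameters are illustrative defaults, not the settings used in the cited works or in this thesis.

```python
# Build a small bank of Gabor kernels and compute Gabor energy responses
# (squared real and imaginary quadrature outputs, summed) for a face image.
import cv2
import numpy as np

def gabor_energy_features(face_gray: np.ndarray,
                          wavelengths=(4, 8, 16),     # roughly, spatial frequencies
                          n_orientations: int = 6,
                          ksize: int = 21) -> np.ndarray:
    face = np.float32(face_gray)
    responses = []
    for lam in wavelengths:
        for k in range(n_orientations):
            theta = k * np.pi / n_orientations
            # Quadrature pair: same kernel with phase 0 (even) and pi/2 (odd).
            g_even = cv2.getGaborKernel((ksize, ksize), 4.0, theta, lam, 0.5, 0)
            g_odd = cv2.getGaborKernel((ksize, ksize), 4.0, theta, lam, 0.5, np.pi / 2)
            even = cv2.filter2D(face, cv2.CV_32F, g_even)
            odd = cv2.filter2D(face, cv2.CV_32F, g_odd)
            responses.append(even ** 2 + odd ** 2)     # Gabor energy map
    # One energy map per (frequency, orientation); downstream code could sample
    # these maps at landmark positions to form an appearance feature vector.
    return np.stack(responses, axis=0)
```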

In [56, 85, 84, 99], similar Gabor wavelets are used for feature extraction, but the Gabor filters are applied only at specific locations on the face.

Wen and Huang [91] use a ratio-image based method to extract appearance features, which is independent of a person's face albedo. To limit the effects of tracking noise and individual variation, they extract the appearance features over facial regions instead of at points, and then use a weighted average as the final feature for each region. Eleven regions are defined on the geometric-motion-free texture map of the face. Gabor wavelets with two spatial frequencies and six orientations are used to calculate the Gabor coefficients, and a 12-dimensional appearance feature vector is computed in each of the 11 selected regions by weighted averaging of the Gabor coefficients. To tackle face appearance variations, an appearance model is trained using a Gaussian mixture model based on exemplars. An on-line adaptation algorithm is then employed to progressively adapt the appearance model to new conditions such as lighting changes or differences between individuals.

Wang et al. [90] employ the Viola-Jones face detection framework [88] for facial expression recognition. Four kinds of Haar-like rectangle features (Figure 2.4) are used to represent the image information. The feature value in each case is the difference between the sum of the pixel intensities in the white section and the sum of the intensities in the black section. Since these types of features were proposed for the face detection task, their ability to distinguish different facial expressions is limited. Jung et al. [44] proposed new types of Haar-like rectangle features to extract the facial features of different expressions more efficiently. They suggest a method for selecting discernible rectangle feature types for each facial expression among all possible types in 3×3 matrix form. The selected rectangle feature types are then further selected based on training samples using the AdaBoost algorithm. Some of the selected rectangle features for different expressions are presented in Figure 2.5.

Figure 2.4: Haar-like rectangle features used in [90]

Figure 2.5: Selected Haar-like rectangle features for different expressions: (a) neutral, (b) happiness, (c) anger, (d) sadness, (e) surprise, (f) disgust, (g) fear.
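To make the rectangle-feature idea concrete, the following sketch computes an integral image and uses it to evaluate a simple two-rectangle Haar-like feature in constant time, the mechanism behind both the Viola-Jones detector of Section 2.2.2 and the expression features discussed above. The specific feature layout is an arbitrary example, not one of the features selected in [90] or [44].

```python
# Integral image and a two-rectangle Haar-like feature evaluated in O(1).
import numpy as np

def integral_image(img: np.ndarray) -> np.ndarray:
    # ii[y, x] = sum of img[0:y, 0:x]; padded so rectangle sums need no bounds checks.
    return np.pad(np.cumsum(np.cumsum(img.astype(np.int64), axis=0), axis=1),
                  ((1, 0), (1, 0)), mode="constant")

def rect_sum(ii: np.ndarray, x: int, y: int, w: int, h: int) -> int:
    # Four array references give the pixel sum of any axis-aligned rectangle.
    return int(ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x])

def two_rect_feature(ii: np.ndarray, x: int, y: int, w: int, h: int) -> int:
    # White (left) rectangle minus black (right) rectangle, side by side.
    return rect_sum(ii, x, y, w, h) - rect_sum(ii, x + w, y, w, h)

# Example: evaluate one edge-like feature on a random 24x24 patch.
patch = np.random.randint(0, 256, (24, 24))
ii = integral_image(patch)
print(two_rect_feature(ii, x=4, y=8, w=6, h=8))
```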

2.3.3 3D Modeling

Apart from 2D spatiotemporal facial feature extraction, there are also a few facial expression recognition approaches based on 3D face modeling. Gokturk et al. [29] proposed a method to recognize facial expressions such as brow flashes and smiles based on 3D deformations of the face tracked on stereo image streams using a 19-point face mesh and standard optical flow techniques. Cohen et al. [11] use Bayesian network classifiers for emotion recognition from face video based on facial features tracked by piecewise Bezier volume deformation tracking [80]; the tracker employs an explicit 3D wireframe model consisting of 16 surface patches embedded in Bezier volumes. Cohn et al. [12] focus on the automatic analysis of brow actions and head movements from face video and use a cylindrical head model to estimate the 6 degrees of freedom of head motion [96]. Baker and his colleagues developed several algorithms for fitting 2D and combined 2D+3D Active Appearance Models to images of faces [94, 30]. By employing 3D face modeling, the proposed systems achieved relatively good performance, especially when head rotations occurred. However, almost all of these methods need a large amount of manually annotated training data, and the manual selection of landmark facial points in the first frame of the input video is always required by systems using 3D face modeling.

2.3.4 Discussion

It has been reported that methods based on geometric features are often outperformed by those based on appearance features [6]. However, recent studies [66, 87] show that in some cases geometric features can outperform appearance-based ones, and hybrid feature vectors can be formed using both geometric and appearance features [17, 83, 84]. It seems that using hybrid features might be the best choice in the case of certain facial expressions [66]; certainly, this may depend on the classification method and/or machine learning approach that takes the features as input. Meanwhile, the presence of facial hair, glasses, etc., the variation in size and orientation of the face

in input images, and the presence of noise and occlusions would all affect the efficiency of feature extraction methods. Irrespective of the type of feature extraction approach used, the essential information about the displayed expressions should be preserved: the extracted features should possess high discriminative power and high stability.

2.4 Facial Expression Classification

Facial expression classification is the last stage of facial expression recognition. It is a decision procedure performed by a classifier based on the extracted features. The facial changes can be classified as facial action units (AUs) [17] or as prototypic emotional expressions [18]. Introduced by Ekman and Friesen [20], each of the six prototypic emotional expressions possesses a distinctive content and can be uniquely characterized by a facial expression. These prototypic expressions, also referred to as the basic emotions, are claimed to be universal across human ethnicities and cultures and comprise happiness, sadness, fear, disgust, surprise and anger (see Figure 2.6).

Figure 2.6: Six prototypic emotional expressions: anger, surprise, sadness, disgust, fear, and happiness

An AU is one of the 44 atomic elements of visible facial movement or its associated deformation. Ekman and Friesen first introduced AUs in their Facial Action Coding System (FACS) [21] with the goal of describing all possible perceptible changes that may

occur on the face. The FACS was later revised [22] to allow more accurate representations of facial behaviors. Thirty of the 44 action units are anatomically related to the contraction of a specific set of facial muscles (see Figure 2.7). The remaining 14 AUs are referred to in FACS as miscellaneous actions (see Table 2.1), whose anatomic basis is unspecified. In applications, a facial expression is represented using a single AU or a combination of AUs, with respect to the locations and intensities of the corresponding facial actions.

Figure 2.7: Action units (AUs) [21]

Expression classification methods can be categorized as either frame-based or sequence-based. Frame-based classification methods use only the information of the current frame, with or without a reference image, to recognize the expression in that frame. Sequence-based classification methods, on the other hand, use the temporal information of the image sequence to recognize the expressions for one or more frames.

2.4.1 Frame-based Methods

Table 2.1: Miscellaneous Actions [21]

Zhang et al. [99] employ a neural network for facial expression classification into 7 categories (the six prototypic emotional expressions plus a neutral one). The input to the network consists of the geometric positions of 34 facial points and 612 Gabor wavelet coefficients. The neural network performs a nonlinear reduction of the input dimensionality and makes a statistical decision about the category of the observed expression. Each output unit gives an estimate of the probability that the examined expression belongs to the associated category. The network is trained using resilient propagation [71]. The JAFFE database [38] was used to train and test the network, and a 90.1 percent recognition rate was achieved. The performance of the network was not tested for recognition of the expressions of novel subjects.

Lyons et al. [57] sample the amplitudes of the complex-valued Gabor transform coefficients on a fiducial grid of manually positioned nodes and combine these data into a single vector which they call a labeled-graph vector (LG vector). The ensemble of LG vectors from a training set of images is further subjected to Principal Components Analysis (PCA). The ensemble of LG-PCA vectors from the training set is then analyzed using Linear Discriminant Analysis (LDA) in order to separate vectors

into clusters having different facial attributes. They built six binary classifiers, each of which decides the presence or absence of a particular basic emotion, and combined them into a single facial expression classifier. An input LG vector is classified by being projected along the discriminant vectors calculated for each independently trained binary classifier. For an input image that is positively classified for two or more emotion categories, the normalized distances to the cluster centers are used as the deciding factor, and the input sample is classified as a member of the nearest cluster. An input image that is not positively classified for any category is classified as neutral. The proposed method was also trained and tested on the JAFFE database [38] and obtained a 92 percent recognition rate. When tested on new subjects who were not included in the training set, the average recognition rate of the method dropped to 75 percent.

The works of Tian et al. [85, 83, 84] use three-layer neural networks with one hidden layer to recognize AUs, trained by the standard back-propagation method [72]. Separate networks are used for the upper and lower face. The inputs can be the normalized geometric features, the appearance features, or both, and the outputs are the recognized AUs. The network is trained to respond to the designated AUs whether they occur alone or in combination; when AUs occur in combination, multiple output nodes are excited. The proposed system achieved an overall recognition rate of 95.5% for one neutral expression and 16 AUs, whether they occurred individually or in combination.

2.4.2 Sequence-based Methods

Essa and Pentland [23] used a control-theoretic method to extract a spatio-temporal motion-energy representation of facial motion for an observed expression. By learning 2D motion views for each expression category, they generated spatio-temporal templates for six different expressions: two facial actions (smile and raised eyebrows) and four emotional expressions (surprise, sadness, anger, and disgust). Each template was delimited by averaging the patterns of motion generated by two subjects showing a certain expression. The Euclidean norm of the difference between the motion energy template and the observed image motion energy is used as a metric for measuring

similarity/dissimilarity. When tested on 52 frontal-view image sequences of eight people showing six distinct expressions, a recognition rate of 98% was achieved.

Otsuka and Ohya [65] match the temporal sequence of a 15-dimensional feature vector to models of the six basic emotional expressions by using a left-to-right hidden Markov model (HMM). The HMM consists of five states, namely relaxed (S1, S5), contracted (S2), apex (S3), and relaxing (S4). To facilitate recognition of a single image sequence, a transition from the final state back to the initial state is added; to make recognition of multiple sequences of expression images feasible, transitions from the final state to the initial states of the other categories are added. The transition probability and the output probability of each state are obtained from sample data by using the Baum-Welch algorithm. The initial probability is estimated by applying a K-means clustering algorithm to the sample data, in which the squared sum of the vector components is added as an extra component. The HMM was trained on 120 image sequences captured from two male subjects, and the method was tested on image sequences from the same subjects. Although Otsuka and Ohya claim that the recognition performance was good, results were not reported for unknown subjects.

2.4.3 Discussion

Besides the classification methods mentioned above, many other classical classifiers from pattern recognition have been applied to expression recognition, such as support vector machines (SVMs), linear discriminant analysis (LDA), K-nearest neighbors (KNN), multinomial logistic ridge regression (MLR) and tree-augmented naive Bayes. Each classifier has its own advantages and shortcomings. The performance of a classifier depends on many factors, such as the characteristics of the training datasets and the adopted features. Thus, the choice of facial expression classification method has to be application-context dependent.
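As an illustration of one of the classifiers named above, the sketch below trains a multi-class SVM (the classifier family adopted in Chapter 7) on pre-extracted feature vectors. The synthetic data, the RBF kernel and the parameter values are placeholders for illustration, and the use of scikit-learn is an assumption of the example, not the configuration used in this thesis.

```python
# Multi-class expression classification with an SVM on pre-extracted feature
# vectors (e.g. Gabor coefficients sampled at facial landmarks).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

EMOTIONS = ["neutral", "happiness", "sadness", "surprise"]

# Stand-in data: 200 samples of 60-dimensional feature vectors with labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 60))
y = rng.integers(0, len(EMOTIONS), size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# RBF-kernel SVM; scikit-learn handles the multi-class case via one-vs-one decisions.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0, gamma="scale"))
clf.fit(X_train, y_train)

pred = clf.predict(X_test)
print("accuracy:", np.mean(pred == y_test))
print("example label:", EMOTIONS[int(pred[0])])
```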

2.5 Issues and Challenges

As reviewed in the previous sections, many methods have been proposed for facial expression recognition. However, issues and challenges remain in building a truly automatic and robust facial expression recognition system.

2.5.1 Training and Testing Database

Since the databases used for facial expression recognition are often relatively limited, it is hard to evaluate the performance of different systems, and the generalizability of proposed approaches remains unknown. In most databases found in the literature, only a few facial expressions from a small number of subjects are considered, the subjects are usually of similar age and ethnic background, and the recording conditions are controlled. Practical facial expression recognition cannot be achieved purely on the basis of training data that cannot represent the possible variation in expressions, subjects, contexts and image properties. Without comparative tests on common data, the relative strengths and weaknesses of different approaches are also difficult to determine. Thus, as foundations of facial expression recognition research, more well-defined training databases and a representative benchmark database for testing are needed.

2.5.2 Face Resolution

Most systems proposed in the literature attempt to recognize facial expressions from high resolution faces. However, in real-life applications, face resolution is affected by the quality of the camera and the distance of the user from the camera, so high resolution input cannot be guaranteed. Since facial images with coarse resolution provide less information about facial features, algorithms that work well for high resolution face images can be expected to perform poorly when the resolution of the input degrades. Although post-processing may improve image resolution, the degree of improvement is limited. According to the analysis by Tian et al. [81], most existing face detection methods work only for face regions at resolutions

above a certain boundary; recognition systems using geometric features are not able to achieve good performance once the face resolution drops below a certain level, and appearance-based methods likewise have a resolution boundary of their own. Only a few works attempt to recognize facial expressions from low resolution faces (e.g. [82, 77, 52, 50]). Although some of these works report high recognition rates, the approaches are all tested on different images of the same individuals included in the training data, and the resolutions are in fact not low enough.

2.5.3 Environment Variation

Variations in the recording environment, such as complex background patterns, the presence of other people and uncontrolled lighting conditions, have a potentially negative effect on expression recognition. As discussed above, in most of the training data sets the background of the images is neutral or has a consistent pattern, and only a single person is present in the scene. When input images are captured in a cluttered scene, a face detector trained on a data set without the corresponding variations will struggle to perform reliably. Similar to low resolution input, images acquired under low lighting may also provide less information about facial features. Furthermore, a light source above the subject's head causes shadows to fall below the brows, which can obscure the eyes. Methods that work well in studio lighting conditions may perform badly in more natural lighting, especially when the angle of lighting changes across an image sequence. To avoid the influence of all these environment variations, researchers often use uncluttered backgrounds and controlled illumination, although such conditions do not match the operational environment of practical applications.

2.5.4 Pose of the Head

Since most systems use a single fixed camera setup, constraints are often imposed on the position and orientation of the head relative to the camera to ensure the input image shows the face in frontal or near-frontal view [6, 53]. However, in reality, head

rotations occur frequently, so pose-invariant expression recognition methods need to be developed. As introduced above, in-plane rotations and limited out-of-plane rotations of the head may be partly handled by normalization before facial feature extraction. For large out-of-plane rotations, multiple cameras may be required to support 3D modeling. Pantic and Rothkrantz [68] were the first to use two cameras mounted on a headphone-like device: one camera is placed in front of the face and the other on the right side of the face. The cameras move together with the head, which eliminates the scale and orientation variance of the acquired face images. Without multiple camera setups, only a few efforts have been made to handle large out-of-plane rotations of the head [82, 95].

2.5.5 Individual Differences

Facial characteristics such as shape, texture and color vary with gender, ethnic background, and age. For example, young people have smoother skin and often lack facial hair. The eye opening and the contrast between iris and sclera differ markedly between Asians and Northern Europeans, which may affect the robustness of facial feature extraction more generally. Beards, eyeglasses, or make-up may obscure facial features. Such individual differences in appearance may have important consequences for facial expression recognition, yet few attempts have been made to study their influence. An exception is the study by Zlochower et al. [100], who found that algorithms for optical flow and high-gradient component detection that had been optimized for young adults performed less well when used on infants. The reduced texture of infants' skin, their increased fatty tissue, juvenile facial conformation, and lack of transient furrows may all have contributed to the differences observed in face analysis between infants and adults. In addition to individual differences in appearance, there are individual differences in expressiveness, including the degree of facial plasticity, morphology, frequency of intense expression, and overall rate of expression. Individual differences in these characteristics are well established and are an important aspect of individual identity [59]. To develop algorithms that are robust to individual differences in facial features

and behavior, it is essential to use a large training sample set that is able to represent all kinds of variation.

2.6 Summary

In summary, to attain successful recognition performance, most existing expression recognition approaches require some control over the imaging conditions. At the same time, some assumptions are made by researchers to simplify the problem domain, and thus several practical issues are ignored. However, real-world applications require operational flexibility, and such assumptions can rarely be satisfied. Although deployment of existing methods in fully unconstrained environments is still in the relatively distant future, integrating and extending these algorithms to develop a facial expression recognition system for a specific application context is achievable. In this thesis, a real-time facial expression recognition system is proposed and implemented with the intention of providing a natural interface for communicating facial expressions in multi-player on-line games.

Chapter 3
System Overview

In this chapter, the requirements of an automatic facial expression recognition system for multi-player on-line games are analyzed and specified. The system serves as a client application that communicates facial expressions to the game engine by capturing and analyzing the player's face images. Based on the requirements, a real-time facial expression recognition system is proposed. An overview of the proposed system is given in this chapter; the detailed algorithms and their implementation are described in the following chapters.

3.1 System Requirements and Specifications

For a multi-player on-line game, the facial expression information of players should be obtained in a natural way without any constraints: the acquisition process should run automatically, and the acquired images should be image sequences captured in real time. The obvious solution to these requirements is to use video cameras. Multiple-camera techniques [68] and colored facial marker methods [45] cannot be used in this case, since it is impractical to expect a game player to buy more than one camera or paint his or her face in order to show expressions in a game. Therefore, facial expressions are expected to be captured by a single fixed camera.

The recognition process, including all the processing stages (face detection, facial feature extraction, and facial expression recognition), has to be performed automatically and in real time. Long delays and the involvement of manual effort will ruin the interaction realism.
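To illustrate the real-time constraint in concrete terms, the following sketch shows a per-frame capture-and-recognize loop that measures the achieved frame rate. The recognize function is a stub standing in for the full recognition pipeline, and the camera index and measurement window are placeholder values, not specifications from this thesis.

```python
# A capture loop that times the full per-frame recognition path, the quantity
# that must stay within budget for the system to feel real-time in a game.
import time
import cv2

def recognize(frame) -> str:
    # Stub standing in for face detection + feature extraction + classification.
    return "neutral"

cap = cv2.VideoCapture(0)          # default webcam; index is a placeholder
frame_times = []
while len(frame_times) < 100:      # measure over 100 frames
    ok, frame = cap.read()
    if not ok:
        break
    t0 = time.perf_counter()
    label = recognize(frame)
    frame_times.append(time.perf_counter() - t0)
cap.release()

if frame_times:
    avg = sum(frame_times) / len(frame_times)
    print(f"average processing time {avg * 1000:.1f} ms -> {1.0 / avg:.1f} fps upper bound")
```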

44 3.1. System Requirements and Specifications 29 interaction realism. Serving as a client application of a multi-player on-line game, the recognition system cannot afford too heavy computational load, since limited system resources are available when the main game application is running. On the other hand, due to entertainment purpose of the system, the recognition accuracy rate need not be overly conservative. Meanwhile, the system should be person-independent, so that it could handle players with different sexes, ages, and ethnicities. The recognition should be robust against various lighting conditions, complex background, and distractions like glasses. The property of players camera also needs to be taken into account. For example, professional grade PAL cameras provide very high resolution images. By contrast, web cameras provide images that are seriously degraded. The face resolution also varies with the distance of the player to the camera. When wireless input devices and big screen monitors are used, players may be far away from the camera sometimes, and the face region could be small. Thus, the recognition system should be insensitive to the variation of face resolution. In other words, the system has to offer a practical range of working distance to allow the user to move back and forth in front of the camera. The recognition process is taking place when a player is engaged in an on-line game. During the game, the camera is often attached to the computer screen, facing the player. We can assume that the camera s field of view allows a frontal pose of the player most of the time. This is also justified by the fact that the players must stare at the screen to play a game. However, complete immovability of the player cannot be assumed. The system has to be able to deal with in-plane rotations and small scale out-of-plane rotations of the face. As discussed in Chapter 2, facial expressions can be classified as either facial action units (AU) or six prototypical emotional expressions. Some researchers [24, 67] consider that AUs perform better as classification classes since they can describe almost all possible facial changes especially the subtle ones. However, in the context of multiplayer on-line game, subtle facial changes may not be required. Players may not have enough time to perceive those small scale changes. We believe the simple but meaningful

45 3.2. System Design 30 prototypic emotional expressions would serve a MOG well in most cases. 3.2 System Design Based on the specific requirements discussed above, an automatic facial expression recognition system is proposed for MOGs. The system categorizes each frame of the facial video sequence into a neutral expression or one of the six basic emotions (happiness, sadness, fear, disgust, surprise and anger). Figure 3.1 shows the block diagram of the system with its five components: face detection, key facial component detection, facial landmark localization, feature extraction and classification of the expressions. Figure 3.1: The architecture of the proposed system In the face detection module, according to the system requirements, face regions are expected to be automatically located in the input images in real-time with low computational cost. The knowledge-based face detection methods seem to be able to meet the real-time requirement due to their simplicity. However, these methods are unable to detect faces in complex environment with different expressions since it is impossible to make rules that cover all possible cases. The predefined rules based on facial features would be violated by variations in scale, shape, lighting condition and background, whereas these variations can be handled by learning-based methods through well defined training sample set. Though learning-based methods are more complicated, most of the time and computation expenses are consumed during the off-line training process. In detection phase, less system resources are needed and the

46 3.2. System Design 31 real-time requirement could be met. Among the proposed learning-based face detection methods, Viola-Jones method [88] is adopted in our system to build the face detector. As discussed in Chapter 2, the method achieves the best compromise between detection efficiency and speed. The speed of the detection is notably given by the simplicity of the features chosen. In fact the response of Haar-like rectangle features used in the method is nothing more than the difference of two, three or four rectangular regions at different scales and shapes. Moreover, computation time of the features is further improved by the representation of Integral Image which permits to compute a rectangle area with only 4 elementary operations, i.e additions and subtractions. On the other hand, in the training process, AdaBoost is used to select a small set of features which best separates training examples of faces and non-faces. Each selected feature is used to construct a weak classifier with a simple binary thresholding decision. Then these simple weak classifiers are combined to build more powerful strong classifiers. Finally, such strong classifiers constitute a cascade classifier, which focuses the detection on critical regions of interest. Thus, during detection, non-face area can be quickly eliminated by the first few stages in the cascade based on limited features and the combination of simple weak classifiers give a rapid detection without degrading the detection rates. Although adapted versions of Viola-Jones face detection method have been proposed to detect multi-view faces or save training time, the original Viola-Jones method is employed in the proposed system. Because we want to detect frontal or near frontal view faces with minimum computational cost and more training time is affordable for higher detection rate. For facial feature extraction, among a number of algorithms proposed in the literature, comparative evaluation [17, 99, 24] has demonstrated that Gabor filters are most discriminative for facial expressions. Gabor filters closely model the receptive field properties of cells in the primary visual cortex [69]. They are able to detect line endings and edge borders over multiple scales and with different orientations. These features reveal much about facial expressions, since both transient and intransient facial features often give rise to a contrast change with regard to the ambient facial

47 3.2. System Design 32 tissue. Moreover, Gabor filters remove most of the variability in images that occur due to lighting changes. However, applying Gabor filters to the whole face area is too costly for MOGs. In the proposed system, following Zhang et al. [99] and Tian et al. [84], Gabor filters with different frequencies and orientations are applied to a particular set of facial landmark locations. Thus, not only the real-time requirement can be met due to the reduced amount of data to be processed, the limited localization in space and frequency yields a certain amount of robustness against translation, distortion, rotation and scaling of the images. At the same time, face cropping or alignment is not necessary in the whole recognition process, since feature extraction is conducted at specific locations. Despite the use of sparse sampling, features extracted by Gabor filters still have a large dimensionality. Since system resources are limited, the parameters of Gabor kernel (location, frequency and orientation) should be optimized, so that the extracted feature vectors only contain the most important components with high discriminative power. Thus, experiments are conducted to choose the most useful facial landmarks, frequencies and orientations of Gabor kernels for expression recognition, which are then used as parameters of the feature extraction module. Since Gabor filters are only applied at selected facial landmarks to extract facial information, the landmark localization is crucial for the entire system. As mentioned in Chapter 2, facial point detection methods such as [33, 9] cannot provide results with sufficient accuracy. While, methods which could find accurate position of facial landmarks such as [89] involve multiple classification steps, their high computational costs are not affordable by our system. Furthermore, most of the existing methods locate facial landmarks from images/video captured in a highly controlled laboratory environment and with high spatial resolution, the face regions are always larger than pixels. Such methods are obviously unable to serve a MOG well. Thus, a new method is proposed to locate the facial landmarks more robustly, especially when low resolution face images are presented to the system. By adopting coarseto-fine strategy, the proposed facial landmark localization method is also designed to

48 3.2. System Design 33 achieve a good compromise between computation load and accuracy. As a pre-stage of landmark localization, a key facial component detection module is added to the system. In this detection module, the Viola-Jones face detection method is modified to find the areas of mouth, nose and eyes. Still, the modification is focused on robust detection of the facial components against resolution and shape variations. In the final classification module, support vector machines (SVMs) [34] are employed to categorize different expressions based on extracted Gabor features. SVMs belong to the class of kernel-based supervised learning machines and have been successfully employed in general purpose pattern recognition tasks. Based on statistical learning theory, SVMs find the biggest margin to separate different classes. The kernel functions employed in SVMs are used to efficiently map input data which may not be linearly separable to a high dimensional feature space where linear methods can then be applied. Since there are often only subtle differences between different expressions posed by different people. The high discrimination ability of SVMs plays a major role in designing classifiers that can distinguish such expressions. SVMs also demonstrate relatively good performance when only a modest amount of training data is available, and this also makes SVMs suitable for the system under consideration. Furthermore, only inner products are involved in SVMs computation; the learning and prediction process are much faster than some traditional classifiers such as a multi-layer neural network. In the following Chapters, the proposed algorithms for each module of the system are described in detail. For face detection module, the original Viola-Jones face detection method is adopted and it is described in Appendix A.

49 Chapter 4 Key Facial Component Detection In order to take advantages of the low computation of Haar-like features and highly efficient cascade structure used in Viola-Jones Adaboost face detection method, AdaBoost detection principle is adopted to search key facial components (nose, mouth and eyes) within the detected face area. However, low detection rate was observed when the conventional Viola-Jones method was trained with the facial components and employed in the detection process. This is probably due to the lack of significant structure information of the facial components (compared to the entire face). In general, the structure of the facial components becomes less detectable when the detected face is of low resolution. Table 4.1 shows approximate size of facial components at different distances for a webcam with focal length of 3cm and resolution of Another cause of the low detection rate is probably the substantial variations in the shape of the facial components, especially mouth, among the different expressions conveyed by the same or different people. This is also true for high resolution face images. To solve these problems, we modify the AdaBoost detection method by employing: extended Haar-like features, modified training criteria, regional scanning, probabilistic selection of candidate sub-window and specialized classifiers. 4.1 Extended Haar-like Feature Set An extended feature set with 14 Haar-like features (Figure 4.2) based on [54] is used in the facial component detection. Besides the basic upright rectangle features employed 34

50 4.1. Extended Haar-like Feature Set 35 Table 4.1: The approximate relationship between distance of user to camera and facial component resolution Figure 4.1: The Haar-like feature set used in Viola-Jones face detection method [88] in face detection (Figure 4.1), 45 rotated rectangle features and center-surround features are added into the feature pool. The additional features are more representative for different shapes than the original feature set, thus they would improve the detection performance. The value of each kind of Haar-like feature is the difference between the sum of the pixel intensities in the white section and the sum of the intensities in the black section. In the Viola-Jones face detection method, upright rectangle features are computed very efficiently by integral image representation [88]. Following the insight of integral image, rotated features are also can be computed rapidly by means of Rotated Summed Area Table RSAT (x, y), which is defined as the sum of the pixels of a 45 rotated rectangle with the right most corner at (x, y) and extending till the boundaries of the image (see Figure 4.3): RSAT (x, y) = I (x, y ) (4.1) y y,y y x x It can be calculated with two passes over all pixels. The first pass from left to right Figure 4.2: The extended Haar-like feature set [54]

51 4.1. Extended Haar-like Feature Set 36 Figure 4.3: Rotated Summed Area Table (RSAT) Figure 4.4: Calculation scheme of the pixel sum of rotated rectangle and top to bottom determines RSAT (x, y) = RSAT (x 1, y 1)+RSAT (x 1, y)+i(x, y) RSAT (x 2, y 1) (4.2) with RSAT ( 1, y) = RSAT ( 2, y) = RSAT (x, 1) = 0, (4.3) whereas the second pass from the right to left and bottom to top calculates RSAT (x, y) = RSAT (x, y) + RSAT (x 1, y + 1) RSAT (x 2, y) (4.4) From this the pixel sum of any rotated rectangle r = (x, y, w, h, 45 ) can be determined by four table lookups (see Figure 4.4 and Figure 4.5): Sum (r) = RSAT (x+w, y+w)+rsat (x h, y+h) RSAT (x, y) RSAT (x+w h, y+w+h) (4.5)

52 4.2. High Hit Rate Cascade Training 37 Figure 4.5: Calculation scheme for rotated features 4.2 High Hit Rate Cascade Training In the conventional Viola-Jones detection method, each stage classifier in the cascade is trained to detect almost all positive (object of interest) training samples while rejecting a certain fraction of the negative (non-object) training samples. Thus, later stage classifier faces a more difficult task since it has to reject the same fraction of negative samples among training samples which passed through all the previous stages. To handle negative samples misclassified by previous stages, classifiers in the later stages are more complex and use more subtle features. In other words, adding a stage in the cascade reduces the false positives. However, at the same time, some positives will be missed since more specified features are used and thus, the hit rate reduces with more stages. The number of stages of the cascade is decided based on the overall false positive rate, which is determined by testing the current detector on the training data set. If the overall target false positive rate is not yet met, another stage is added to cascade. When facial components are small, the subtle information is missing and only major features are retained. They can pass through the first few stages of the trained cascade but will be rejected by more complex classifiers in the later stages of the cascade, if the cascade classifier is trained with low false positive rate.

53 4.3. Regional Scanning With a Fixed Classifier 38 Table 4.2 shows mouth detection results for different face resolutions by using cascade classifiers with different stages. The classifiers were trained using 1521 mouth images cropped from BioID [36] database as positive training samples, the only difference between each two classifiers is the number of cascade stages. As can be seen in the table, the classifiers with 11 and 16 stages fail to detect the mouth region. By using a classifier with fewer stages (e.g. 3 and 5) small scale mouths can be detected, however, false positive windows increase when detecting mouth in bigger scale face area. To ensure that small scale (low resolution) facial components could be detected, a minimum overall hit rate is set before training. For each stage in the training, the training goal is set to achieve a high hit rate and an acceptable false positive rate. The number of features used is then increased until the target hit rate and false positive rate are met for the stage. If the overall hit rate is still greater than the minimum value, another stage is added to the cascade to reduce the overall false positive rate. In this way, the trained detectors will detect the facial components at a guaranteed hit rate though some false positives will occur, which can be reduced or removed by the modifications introduced below. 4.3 Regional Scanning With a Fixed Classifier Rather than rescaling the classifier as proposed by Viola and Jones, to achieve multiscale searching, input face images are resized to a range of predicted sizes and a fixed classifier is used for facial component detection. Due to the structure of face, we predict the face size according to the size of facial component used for training. In this way, the computation of the whole image pyramid is avoided. Furthermore, if the facial component size is bigger than the training size, fewer false positives would be produced due to down sampling; when the facial component is smaller than the training sample, expansion of the input image make it still detectable (see Table 4.3 for examples, where the same mouth detector is used to find mouths on resized face images and corresponding original face images).

54 4.3. Regional Scanning With a Fixed Classifier 39 Stages Face size: Face size: Face size: Table 4.2: Mouth detection results for different face resolutions by using cascade classifiers with different stages

55 4.4. Candidate sub-window selection 40 Original face Resized face Table 4.3: Mouth detection results on resized faces and corresponding original faces In addition, prior knowledge of the face structure is used to partition the region of scanning. The top region of the face image is used for eye detection; the central region of the face area is used for nose detection; and mouth is searched in the lower region of the face. By regional scanning, fewer area exists that can produce false positives. It also increases efficiency since fewer features need to be computed. 4.4 Candidate sub-window selection Though the schemes described above are employed to reduce the false positives, due to the high hit rate training, there are often still more than one candidate sub-windows detected by one facial component detector. To select the true sub-window which contains the facial component, it is assumed that the central position of the facial components among different persons follows a normal distribution. Thus, the probability that a [ ] T candidate component at k = x y is the true position can be calculated as: P (k) = ( 1 exp 1 ) 1/2 (2π) sσ 2 (k sm)t sσ 1 (k sm) (4.6)

56 4.5. Specialized classifiers 41 where the mean vector m and the covariance matrix Σ is estimated from normalized face image data set. s is the scale coefficient which can be computed as s = w d /w n, where w d is the width of detected face and w n is the width of normalized training faces. The candidate with maximum probability is selected as the true facial component. 4.5 Specialized classifiers Two cascade classifiers are trained for mouth to handle shape variations: one is trained to detect all kinds of closed mouths, and the other one is specialized for open mouth detection. During scanning, if the closed mouth detector failed to find a mouth, the open mouth detector is used. In addition, two eyes are treated as different objects, so a right eye classifier and a left eye classifier are trained separately.

57 Chapter 5 Facial Landmark Localization The facial landmark localization approach is proposed based on the key facial component detectors described in Chapter 4. Figure 5.1 shows the whole localization process, which automatically extracts 20 facial landmarks from face image sequences. A coarseto-fine strategy is adopted in the approach to reduce the computation. The tracking stage at the end of the process allows the localization to perform more robustly. Details of the localization method are given in this chapter. Figure 5.1: The coarse-to-fine facial landmark localization process 5.1 Facial Landmark Detection As shown in Figure 5.1, the facial landmark detection process entails estimation, localization and refinement. First, approximate positions of 20 facial landmarks are estimated based on the boundary box of detected facial components (eyes, mouth, and 42

58 5.1. Facial Landmark Detection 43 Figure 5.2: Facial landmark estimation nose). The estimation scheme is presented in Figure 5.2. The accuracy of the estimations can be improved by using positive training samples that are cropped tightly around eyes, mouth and nose, when training the facial component detectors. The corresponding actual landmark is considered to be localized within a D D neighborhood of the estimated landmark, D is determined by the size of facial components (4 neighbourhoods of mouth landmarks are indicated on Figure 5.2 for examples). To find the accurate landmark positions, images are converted into grey scale, thus each image can be represented by an intensity function f(x, y). Within a D D neighborhood, the location with the highest variation of intensity function f(x, y) in both x and y directions is considered to be the position of a landmark. In essence the localization is implemented by finding the maximum eigenvalues of local structure matrix C within neighborhoods, where C = w G (r; σ) f x 2 f x f y f x f y f 2 y (5.1) and w G (r; σ) is the Gaussian filter for smoothing the matrix entries. A sub-pixel corner detector derived from [31] is applied to refine the detected landmark positions so as to achieve sub-pixel accuracy. The refinement method is based on

59 5.2. Facial Landmark Tracking 44 the observation (see Figure 5.3) that every vector from the sub-pixel accurate corner q (or a radial saddle point) to a point p located within a neighborhood of q is orthogonal to the image gradient at p and subject to image and measurement noise. Thus, the following equation can be defined: ε i = I T p ii (q p i ), (5.2) Where I pi is the image gradient at point p i in the neighbourhood of q. To find the best estimate for q, ε i needs to be minimized. Equation 5.2 defines a system of equations where ε i is set to zero: ( ) I pi Ip T i q i ( ) I pi Ip T i p i = 0, (5.3) where the gradients are summed up in the neighborhood of q. Calling the first gradient term G and the second gradient term b (b = Gp i ), equation 5.3 can be rewritten: i q = G 1 b. (5.4) Variable q defines a new neighborhood center, and the above process is iterated until q does not move more than a certain threshold. 5.2 Facial Landmark Tracking Sometimes due to out-of-plane rotations of the head, key facial components cannot be detected. And there are also some cases where the true facial landmarks are not located in the D D neighborhood of the estimated landmarks. With the goal of obtaining more accurate and smooth landmark positions, linear Kalman filters are employed to track landmarks detected from the above steps. The linear Kalman filter is a recursive procedure consisting of two stages: prediction and correction. During each iteration, the filter provides an optimal estimate of the current state using the current input measurement, and produces an estimate of the future state using the underlying state model. As we are interested in positional

60 5.2. Facial Landmark Tracking 45 Figure 5.3: Illustration of the corner refining method coordinates, the state vector is formulated as [ s = x y ẋ ẏ ẍ ÿ ] T and the measurement vector is formed as [ m = x y ] T where x, ẋ, ẍ are the landmark position, velocity and acceleration in x direction, and y, ẏ, ÿ are the landmark position, velocity and acceleration in y direction. Thus, according to Newton dynamics, we have: 1 0 t 0 t2 0 2 t t t 0 s k = t s k 1 + t 3 6 t 3 6 t 2 2 t 2 2 t t w k 1 (5.5) or s k = As k 1 + Gw k 1 (5.6)

61 5.2. Facial Landmark Tracking 46 and [ m k = ] s k + v k (5.7) or m k = Hs k + v k (5.8) where t is the sampling time interval, that is the reciprocal of frame rate. w k is the change rate of acceleration which is modeled as a white noise process. v k is the white noise of measurement. The random variables w k and v k are assumed to be independent of each other and have normal probability distributions: p(w) N (0, Q), (5.9) Then the prediction process is governed by the equations: p(v) N (0, R), (5.10) ŝ k = Aŝ k 1 (5.11) P k = AP k 1A T + Q (5.12) Where ŝ k is the prior state estimate at step k given knowledge of the process prior to step k; ŝ k is the posterior state estimate at step k given measurement m k ; given the definition of prior and posterior estimate errors as: e k s k ŝ k (5.13) and e k s k ŝ k (5.14) P k is the prior estimate error covariance: P k = E [ ] e k e T k (5.15) and P k is the posterior estimate error covariance: P k = E [ ] e k e T k (5.16)

62 5.2. Facial Landmark Tracking 47 And the correction process is governed by the equations: K k = P k HT ( HP k HT + R ) 1 ŝ k = ŝ k + K k ( mk Hŝ k ) (5.17) (5.18) P k = (1 K k H) P k (5.19) Kalman filters predict landmark positions in the next frame and correct the localization results in the current frame. The prediction makes the localization process more stable when previous processing stages failed or huge error occurred. At the same time, the correction enhances the accuracy.

63 Chapter 6 Facial Feature Extraction In the feature extraction module, Gabor filters are employed to extract and represent the expression information. The rationale for the choice of Gabor features has been discussed in Chapter 3. Rather than applying Gabor filters to the whole face area, Gabor filters with selected frequencies and orientations are applied to the facial landmarks that are detected by the landmark localization process described in the previous Chpater. After a brief introduction of Gabor filters, this chapter describes our facial feature extraction process. 6.1 Gabor Filters A Gabor kernel ψ j is a plane sinusoids restricted by a Gaussian envelope function [15]: ψ j (x) = k j 2 σ 2 2 e kj x 2 [ 2σ 2 e i kj,x e σ2 2 ] (6.1) The first term in the square brackets determines the oscillatory part of the kernel, and the second term compensates for the DC value of the kernel [46], where i is the complex operator; k j is the wave vector k j = k jx and k jy = k v cos (φ u ) k v sin (φ u ) k v = 2 v+2 2 π (6.2) φ u = µ π n (6.3) 48

64 6.1. Gabor Filters 49 Figure 6.1: Gabor filters with the orientation of 0, π 4, π 2 and 3π 4 Figure 6.2: Gabor filters with the frequency of 0.25, 0.5, 1 and 1.5 φ u specifies the orientation of the filter. This parameter rotates the filter about its center. The orientation of the filter dictates the angle of the edges or bars for which the filter will respond. Being defined as a fraction of π, where µ [0, n], φ u is a set of values from 0 to π. Values from π to 2π are redundant due to the symmetry of the filter. Figure 6.1 shows Gabor filters with φ u of 0, π, π and 3π v defines the frequency of the Gabor filter, or inverse wavelength of the cosine wave. Filters with low frequencies will respond to gradual changes in intensity in the image. Filters with high frequencies will respond to sharp edges and bars. Figure 6.2 shows the effect as the frequency is increased from 0.25 to 1.5. σ specifies the radius of the Gaussian. The size of the Gaussian is sometimes referred to as the filter s basis of support. The Gaussian size determines the amount of the image that effects convolution. In theory the entire image should effect the convolution; however, as the convolution moves further from the center of the Gaussian, the remaining computation becomes negligible. σ usually is set in a relation to the frequency. Figure 6.3 shows the Gabor filters with σ value of π, 2π, 3π and 4π.

65 6.2. Gabor Wavelets Based feature extraction 50 Figure 6.3: Gabor filters with σ value of π, 2π, 3π and 4π 6.2 Gabor Wavelets Based feature extraction By varying the orientation and frequency of Equation 6.1, means by varying k j, a family of Gabor kernels which build a so-called jet can be gained (a family of Gabor kernels with three different frequencies and six different orientations can be seen in Figure 6.4). Such a jet could be convoluted with a small patch of gray values in an image I (x) around a given pixel x = (x, y) J j (x) = I (x ) ψ j (x x ) dx (6.4) This is known as a wavelet transformation because the family of kernels is selfsimilar, as all kernels can be generated from one mother wavelet by dilation and rotation. Based on the Gabor wavelet transformation, an image can be represented by the responses of Gabor filters with different orientations and frequencies at every pixel. Figure 6.5 presents the results for convolving a face image with a family of Gabor filters shown in Figure 6.4. Although more information could be provided, applying Gabor wavelet transformation over the entire image region is too costly especially for our context. For a family of 40 Gabor filters (which are used for most face related feature extractions), for example, the combined Gabor responses over all image pixels consume 80 times as much memory as the single input image, if both the even (cosine) and odd (sine) parts of the response are used. At the same time, the total computational load for all the convolution operations is too heavy. Thus, in the proposed system, Gabor filters are used in a selective way for feature

66 6.2. Gabor Wavelets Based feature extraction 51 Real part imaginary part Figure 6.4: A family of Gabor kernels with three different frequencies and six different orientations extraction. First, the detected face region of input images are normalized into gray scale images. Then, each face image is convolved with a set of Gabor kernels only at the locations of the automatically localized 20 facial landmarks. The kernel set comprise 3 frequencies and 6 different orientations, in specific, v = 2, 4, 6; n = 6; µ = 1,..., 6; σ = π. All the parameters are selected based on experiments, which are described in Chapter 8. Both even and odd Gabor kernels are used, therefore, 18 complex Gabor wavelet coefficients are gained at each facial landmark. In our system, only the magnitudes are used, because they vary slowly with the position while the phases are very sensitive. Upon that, each face image is finally represented by a vector of 360 (3 6 20) elements. This vector serves as an input to the expression classifier which will be described in the next chapter.

67 6.2. Gabor Wavelets Based feature extraction 52 Figure 6.5: A face image after convolution with Gabor filters shown in Figure 6.4

68 Chapter 7 Facial Expression Classification Support vector machines (SVMs) [34] are a machine learning system that is directly based on statistical learning theory. They have successfully been applied to a number of classification tasks such as speech recognition, text categorization and face detection, and tend to often outperform classic machine learning approaches like neural network. As discussed in Chapter 3, SVMs are employed in the proposed system as classifiers due to their high discrimination ability. An introduction to the mathematical principle of the support vector machine is given in this chapter. This is followed by its application to the proposed facial expression recognition system. 7.1 Support Vector Machines Linear Case SVMs are inherently binary classifiers. From the perspective of statistical learning theory, the motivation for considering binary classifier SVMs comes from theoretical bounds on the generalization error. Firstly, the error bound is minimized by maximizing the margin, i.e. the minimal distance between the hyperplane separating two classes and the data points closest to the hyperplane. Secondly, the upper bound on the generalization error does not depend on the dimension of the space. First, we assume that the input data is linearly separable. Under this assumption, two classes can be separated by finding a linear hyperplane between them. Considering a binary classification task with input points x i (i = 1,... l ) having 53

69 7.1. Support Vector Machines 54 corresponding class labels y i = ±1 and let the decision function be: f (x) = sign (w x + b) (7.1) Where w x denotes the inner product and b denotes the bias of the decision function. A point x lying directly on the hyperplane satisfies the condition w x + b = 0 (7.2) The label y i of a data point x i can be determined by evaluating the left side of equation 7.2. The sign of the result tells us about the class-label. Points lying on the right or left side of the hyperplane must satisfy following conditions: x i w + b > 0 (7.3) x i w + b < 0 (7.4) Both equations can be implicitly formulated as: y i (x i w + b 0) (7.5) Classification is correct if Equation 7.5 holds for all input points. It is clear that we therefore have to optimize the parameter w and b. The number of possible combinations of weights w and bias b are large and may be not optimal. There is only one optimal separating hyperplane. An optimal hyperplane is one that maximizes the margin between two sets. The margin between two sets is given by the distances d + and d from the closet point of a set to the separating hyperplane. Figure 7.1 shows two separating hyperplanes, from which the right hyperplane is optimal. It is obvious that the margin (d + + d ) reaches its maximum for d + = d. Let us denote H is a hyperplane which satisfies x i w+b = 1 and H + is a hyperplane which satisfiesx i w + b = 1. Hence, with d = d + = 1, the margin becomes 2. Thus, the w w hyperplane that optimally separates the data is the one that minimizes w 2, subject to constraints in Equation 7.5.

70 7.1. Support Vector Machines 55 Figure 7.1: Separating hyperplanes. The left is a random one, right one maximizes the margin of separability We introduce Lagrangian multipliers α, i = 1,... l, one for each of the inequality constraints of Equation 7.5 and achieve the following Lagrangian, L P 1 2 w 2 l α i y i (x i w + b) + i=1 l α i (7.6) i=1 We must minimize L P with respect to w and b. This requires that the derivatives of L P with respect to all the α i vanish. Taking the derivatives with respect to b and w gives: l α i y i = 0 (7.7) i=1 l w = α i y i x i (7.8) i=1 And re-substituting them back into Equation 6 gives: L D = i α i 1 α i α j y i y j x i x j (7.9) 2 i,j Training is therefore accomplished by maximizing L D with respect to constraints 7.7 and 7.8. In the solution, each Lagrange multiplier α i is associated with a training point. Training points having α i > 0 are called support vectors, hence the name support vector machine. Support vectors lie closest to the separating hyperplane and

71 7.1. Support Vector Machines 56 are therefore critical elements in the dataset. Changing the support vectors gives a different hyperplane. All other points, with α i = 0 do not influence the shape of the decision boundary. Once we have found a solution to the optimization problem, the SVM can attempt to classify unseen instances. An instance x can be classified by determining the side of the decision boundary it falls, i.e. we compute: ( l ) f (x) = sign (w x + b) = sign α i y i (x i x) + b i=1 (7.10) Nonlinear Case We assumed the input data is to be linearly separable. However, real world data is generally not linear separable. To address this issue, SVMs separate non linear data using a trick. The trick is to extend linear SVMs to nonlinear SVMs by mapping the input data nonlinearly into a higher dimensional space called feature space. In sufficient high dimensions, the data becomes linearly separable. With this mapping in mind, the SVM can solve the optimization problem in the feature space as it would do in the input space and find an optimal separating hyperplane. Once the optimal hyperplane is found, it is mapped back into the input space resulting in a non-linear decision surface. For the objective function in Equation 7.9, we notice that the data points x i only appear in the form of an inner product. Thus, the mapping of features into higher dimensional space is achieved through a substitution of the inner product: x i x j φ (x i ) φ (x j ) (7.11) The mapping φ (x) and its dot product in Equation 7.11 must not be computed as it is computational expensive and storage intensive. Instead, the mapping is implicitly defined by a kernel: K (x i, x j ) = φ (x i ) φ (x j ) (7.12)

72 7.1. Support Vector Machines 57 Common kernels used for SVMs are: K (x i, x j ) = (x i x j + 1) d (7.13) K (x i, x j ) = e x j 2 i x 2σ 2 (7.14) K (x i, x j ) = tanh (βx i x j + b) (7.15) Equation 7.13 results in a polynomial classifier with degree d. Equation 7.14 gives Gaussian radial basis function classifiers. Equation 7.15 emulates two layer sigmoid neural networks. The learning task therefore involves maximization of the objective function: L D = i α i 1 α i α j y i y j K (x i, y i ) (7.16) 2 i,j subject to constraints 7.7 and 7.8. Classification remains the same, except the inner product becomes a kernel evaluation: ( l ) f (x) = sign (w x + b) = sign α i y i K (x i, x) + b i=1 (7.17) Non Separable Data Most real world data sets contain noise and cannot be separated by an optimal hyperplane without leading to poor generalization. If there are input points (x i, y i ) within the margin of separation, optimization of Equation 7.5 is not possible. These input points are points which either fall on the correct side of the decision surface, but within the region of separation, or fall on the wrong side of the decision surface. To take violating points into account, the definition of the optimal hyperplane has to be extended. Therefore, we introduce a slack variable ξ for each training sample and require that y i (w x i + b) 1 ξ i (7.18) be satisfied. This extension allows the old constraint 7.5 to be violated, but in a way the violation is penalized. The new optimization problem thus becomes maximization of the margin between two sets and the minimization of misclassifications caused by

73 7.2. Multiple Decision Making 58 points lying in the range of 0 ξ. In training, however, one wants to regularize those two aspects of optimization according to the type of problem one wants to solve. Regularization is done by weighting the error of misclassifications with a value C. A high C value causes a high penalty assigned to classification errors leading to better separation. A low C value causes the margin to be soft with lots of permitted errors and causes the separation to be fuzzy. The C value has to be found empirically in training because it is not know in advance what the training data looks like and how general or specific separation must be done. 7.2 Multiple Decision Making Recognition of seven expressions leads to a multi-class classification problem. However, SVMs are inherently binary classifiers. To construct a multi-class SVM, 21 SVMs are trained to discriminate all pairs of expressions. Then, when making multi-class decisions, SVM outputs are combined by voting. For example, if one SVM makes the decision that the input is happiness not sadness, then the class happiness gets +1 and sadness gets -1. The SVMs make decisions on each pair, and thus cast votes for each category. The votes are summed together and the expression with the highest score is considered to be the final decision.

74 Chapter 8 Experimental Results 8.1 Database A large number of training and testing images/video sequences are required as the basis for implementing and evaluating the proposed system. In this work the following database are used: The Japanese Female Facial Expression (JAFFE) Database [38]: The database contains 213 images of 7 facial expressions (6 basic facial expressions + 1 neutral) posed by 10 Japanese female models. Each image has been rated on 6 emotion adjectives by 60 Japanese subjects. Sample images are shown in Figure 8.1. The AR Face Database [35]: The database contains over 4,000 color images corresponding to 126 people s faces (70 men and 56 women). Images feature frontal view faces with different facial expressions, illumination conditions, and Figure 8.1: Sample images from JAFFE database 59

75 8.1. Database 60 Figure 8.2: Sample images from AR Face database Figure 8.3: Sample images from AT&T Face database occlusions (sun glasses and scarf). The pictures were taken at the Computer Vision Centre of University of Alabama at Birmingham under strictly controlled conditions. No restrictions on wear (clothes, glasses, etc.), make-up, hair style, etc. were imposed to participants. Sample images are shown in Figure 8.2. The AT&T Face Database [37]: The database contains face images of 40 distinct subjects. For each subject, ten different images are provided. The images were taken at different times, varying the lighting, facial expressions (open / closed eyes, smiling / not smiling) and facial details (glasses / no glasses). All the images were taken against a dark homogeneous background with the subjects in an upright, frontal position (with tolerance for some side movement). Sample images are shown in Figure 8.3. The BioID Face Database [36]: The dataset consists of 1521 gray level images with a resolution of 384x286 pixels. Each one shows the frontal face of one of the 23 subjects. All the images are taken in uncontrolled conditions using a web camera in an office environment. For each image in the database, the ground truth of the 20 facial feature points were obtained through manual annotation

76 8.1. Database 61 Figure 8.4: Sample images from BioID Face database Figure 8.5: Sample images from BioID Face database and are supplied with the database. Sample images are shown in Figure 8.4 and the facial feature points are presented in Figure 8.5. The CIT Face Database [41]: The database contains 450 face images with the resolution of 896 x 592 pixels. The images were taken from 27 people, varying the lighting, facial expressions and backgrounds. Sample images are shown in Figure 8.6. The FG-NET Database Facial Expression Database [39]: The database contains 399 video sequences of 6 expressions and one neutral expression from 18 different individuals. Each individual performed all seven expressions three times. Depending on the kind of expression, a single recorded sequence can take up to several seconds. Sample images are shown in Figure 8.7. Figure 8.6: Sample images from CIT Face database

77 8.2. Face detection 62 Figure 8.7: Sample images from FG-NET Face database Table 8.1: The performances of face detector on different databases Database JAFFE CIT BioID FG-NET Detection Rate 100% 99% 97% 100% 8.2 Face detection The face detector is implemented based on training data included in the OpenCV library [51], and performed reliably during testings. Table 8.1 shows the testing results for different databases, some typical samples can be seen in Figure Parameter Tuning Gabor Kernel Recall the Gabor Kernel function introduced in Chapter 6: ψ j (x) = k j 2 σ 2 2 e kj x 2 [ 2σ 2 e i kj,x e σ2 2 ] where and k jy k j = k jx = k v = 2 v+2 2 π k v cos (φ u ) k v sin (φ u ) φ u = µ π n

78 8.3. Parameter Tuning 63 Figure 8.8: Face detection results from different databases Three parameters need to be specified when configuring the Gabor filter set: the orientation φ u, the frequency factor v, and the radius of the Gaussian σ. Since we only apply Gabor filters to facial landmarks, the landmark location is also an important parameter for the feature extraction. All these parameters decide the discriminative power and the dimensionality of extracted feature vectors. Selection of the parameters depends on the specific application, and there is yet no consensus on the best choice. Most of the successful Gabor feature based face recognition and facial expression recognition systems use Gabor filter set with 5 frequencies and 8 orientations, and set σ fixed to π or 2π [92, 17, 5, 84]. Following these works, in the proposed system, σ is set to π for the normalized face images. For φ u and v, although sampling is limited at facial landmarks, using a filter set with 5 frequencies and 8 orientations is still too costly for the limited system resources. Thus, experiments were carried out to select the most useful set of Gabor filters and facial landmarks for the system. JAFFE database were used for the experiments. In the database, every facial expression of each individual has three samples, indexed by 1, 2 and 3. Thus the database can be divided into three groups. Due to the limited number of training and

79 8.3. Parameter Tuning 64 test samples available, a leave-one-group-out strategy was adopted. Three iterations of training and testing were carried out and during each iteration, images from one group were set aside for testing purpose while images from the remaining two groups were used for training. The simple instance-based KNN classifier [3] was employed to evaluate the performance of different parameters, and the value of k was set to 3. During the test, all the facial landmarks were localized manually. First we focus on the effect of different facial landmarks. Figure 8.9 shows 34 facial landmark points which represent the facial geometry. When applying Gabor filter set with 5 frequencies and 8 orientations (v = 0, 2, 4, 6, 8; n = 8; µ = 1,..., 8) to all the 34 landmark locations, the average recognition rate was 78%. If one facial landmark is important for expression recognition, removing it from the 34 landmarks will cause a remarkable change to the final recognition rate. Figure 8.10 shows the experiment results. As can be seen from the curve, removing the landmark of 1,2,3,12,17,27,28,29,30,31 caused no change or very little change, when the landmark of 5,15,19,20,21,22,32,33,34 was removed, there was some degradation in the final recognition rate. Considering that landmark 21,32,33,34 are hard to be automatically localized, these four landmarks together with landmark 1,2,3,12,17,27,28,29,30 are discarded, and the rest 20 facial landmarks are selected. When using the selected landmarks as Gabor kernel locations, it was surprisingly found that the recognition rate was slightly improved to 80%. This may be due to the dimension reduction. In [48], it has been claimed that too many input variables with respect to the training sample size can harm the performance of a sample discriminant rule. The test results also indicate that the information of facial expressions are mainly conveyed by mouth, eyes and eyebrows. After the selection of facial landmarks, we try to find the most useful Gabor kernel frequencies for our task. While keeping the number of orientations (n = 8) and kernel locations (selected 20 facial landmarks) unchanged, all possible combinations from the frequency set {v = 0, 2, 4, 6, 8} were tested. Other frequencies were not considered, since the tested set has already covered a wide frequency band. We code the combinations as showed in Table Figure 8.11 presents the test results. According to

80 8.3. Parameter Tuning 65 Figure 8.9: 34 facial landmark points which represent the facial geometry Figure 8.10: The recognition performance when one of the 34 landmarks is removed

81 8.3. Parameter Tuning 66 Code v = Code v = Code v = ,6 21 0,6, ,8 22 2,4, ,6 23 2,4, ,8 24 2,6, ,8 25 4,6,8 6 0,2 16 0,2,4 26 0,2,4,6 7 0,4 17 0,2,6 27 0,2,4,8 8 0,6 18 0,2,8 28 0,2,6,8 9 0,8 19 0,4,6 39 0,4,6,8 10 2,4 20 0,4,8 30 2,4,6,8 Table 8.2: Coding scheme for combination of frequencies the results, the combination of 22 (v = 2, 4, 6) and 30 (v = 2, 4, 6, 8) achieved the best performance. The frequency defined by v = 6 is more significant to the expression recognition than other ones, and the Gabor filter with the highest frequency (v = 0) extracts useless information for our task. Without the filters at this high frequency, the recognition rate was improved to 82%. For the highest performance with less number of frequencies, v = 2, 4, 6 is selected as optimal parameters. For φ u, the value of µ is always set to 1,..., n, the only parameter to be determined is n. Thus in the test, Gabor filters with different number of orientations and 3 frequencies (v = 2, 4, 6) were applied to the selected 20 facial landmark locations, the results are shown in Figure It is noticed that, reducing the number of orientations from 8 to 6 only caused 1% degradation in recognition rate. That means 6 different orientations could be enough for our task. Thus, though best recognition rate was achieved by using 8 and 7 different orientations, to reduce the dimension and computational cost, n = 6 is selected as the parameter of orientations SVM Kernel The selection of an appropriate kernel function is the most important part on employing an SVM classifier to a particular application domain. Four commonly used kernels were considered for our task: Gaussian radial basis function (RBF) kernel (Equation 7.14)

82 8.3. Parameter Tuning 67 Figure 8.11: The recognition rate for different frequencies Figure 8.12: The recognition rate for different number of orientations

83 8.3. Parameter Tuning 68 Table 8.3: The best recognition rates achieved by using different SVM kernels Anger Surprise Sadness Disgust Fear Happiness Neutral Overall Linear 78% 95% 82% 84% 87% 92% 88% 87% RBF 82% 94% 81% 83% 87% 95% 91% 88% (σ = 5) Sigmoid 66% 72% 55% 61% 63% 70% 68% 65% (β = 0.2, b = 0) Polynomial (d = 1) 79% 89% 82% 80% 75% 92% 87% 83% polynomial kernel (Equation 7.13) sigmoid (Equation 7.15) and linear kernel K (x i, x j ) = x i x j (8.1) To select the most suitable kernel, different multi-class SVMs were trained as described in Chapter 7 to classify the extracted Gabor feature vectors. The same as Section 8.3.1, JAFFE database was used for training and testing, and all facial landmarks were localized manually. Only selected parameters were considered for each kernel, the choice of possible parameters was based on [2]. Table 8.3 presents the best performances achieved by using each kernel. The highest recognition rate was achieved by the RBF kernel with a width parameter (σ) value of 5. Comparing with the performance of simple KNN classifier, the recognition rate was improved from 81% to 88%. Surprisingly, the simple linear kernel almost achieved the same recognition rate as the RBF kernel. This may be due to the high dimensionality of Gabor feature vectors. When the number of features is large, it may be not needed to map data nonlinearly to a higher dimensional space. At least for our task, linear mapping is good enough, thus, linear kernel is selected for the SVMs. Although different parameters have been tried, the sigmoid kernel failed to achieve a recognition rate higher than 65%. The poor testing results of sigmoid kernel show that the choice of kernels plays a critical role in the employment of SVMs.

84 8.4. Key Facial Component Detection Key Facial Component Detection Five cascade classifiers were trained for the key facial component detection module, one each, for left eye, right eye and nose, and two for mouth. Positive training samples of eyes, mouths, noses and negative samples (non-facial components) were cropped from AR database and AT&T database. To accommodate low resolution facial components, the training samples were rescaled to small sizes: 10 6 for eyes, 16 8 for mouth and for nose. For each detector about 1200 positive samples and 5000 negative samples were used for training. The facial component detection module was first tested on BioID database. To evaluate the performance on low resolution input, the test images were downsized to different resolutions to simulate low resolution faces which are not included in the databases. In this way, 300 images were tested at each face resolution. In the testing phase, a detection was regarded as SUCCESS if and only if the distance between the center of the detected and actual facial component was less than 30% of the width of the actual facial component as well as the width of the detected facial component was within ±50% of the actual width. The original Viola-Jones AdaBoost detection method was also implemented and tested based on the same training and testing data for comparison. The testing results are presented in Figure As can be seen from the curves, our improved detection method achieved better performance, especially for low resolution faces. The detection rates of Viola-Jones method dropped dramatically when the face resolution was lower than , and failed to detect any facial components correctly for face images with resolution lower than In contrast, the improved method was able to detect facial components from low resolution face images. Furthermore, even for high resolution face images, our detection method outperformed the original Viola-Jones method, especially for the mouth detection. The improved method achieved an average detection rate of 95% for all the facial components. A few detection examples are shown in Figure The detectors were also tested on FG-NET database. The overall detection rate for all facial components is 94%. Figure

85 8.5. Facial Landmark Localization shows typical detection examples of FG-NET. Figure 8.16 shows snap shots from a real-time example. 8.5 Facial Landmark Localization The facial landmark localization module was tested on BioID database. In the testing phase, images from the same individual were reordered and treated as an image sequence. A detection is regarded as SUCCESS when the distance between the located landmark point and the annotated true point was less than 10% of the inter-ocular distance (distance between left and right eye pupils). Unfortunately, only 14 of the facial landmarks we detected are annotated in the BioID database. The testing results are presented in Table 8.4; the average detection rate for all of the 14 landmarks is 93%. The test results on different resolution faces are shown in Figure 8.17, and typical examples are presented in Figure When testing the localization module on FG- NET database, each of the 20 automatically detected facial landmarks was compared to manually labeled facial point. The average detection rate for all of the landmarks is 91%, and some examples are shown in Figure During the real-time test, the proposed facial landmark localization method exhibited robust performance against variation in face resolutions and facial expressions. The tracking process also enables the proposed method to handle some degree of in-plane and out-of-plane rotations. Figure 8.20 shows a few examples. 8.6 System Performance Accuracy To evaluate the system performance on providing user-independent operation, classifiers of the system were trained using the JAFFE database and the system was tested on the FG-NET database. Not all the sequences from the FG-NET database were used for testing, samples that failed to present an expression (e.g. interrupted by talking,

86 8.6. System Performance 71 Original AdaBoost Improved AdaBoost Figure 8.13: Facial component detection results

87 8.6. System Performance 72 Figure 8.14: Facial component detection results from BioID database Figure 8.15: Facial component detection results from FG-NET database Figure 8.16: Real-time facial component detection results Table 8.4: The performance of facial landmark localization module on BioID database (see Figure 8.9 as a reference for landmark locations) Facial landmark Rate Facial landmark Rate 23: Right mouth corner 96% 25: Left mouth corner 91% 4: Outer end of right eye brow 92% 6: Inner end of right eye brow 94% 7: Inner end of left eye brow 97% 9: Outer end of left eye brow 91% 10: Outer corner of right eye 90% 14: Inner corner of right eye 96% 16: Inner corner of left eye 97% 19: Outer corner of left eye 88% 20: Right nostril 95% 22: Left nostril 94% 24: Center point on outer edge of upper lip 87% 26: Center point on outer edge of lower lip 85%

88 8.6. System Performance 73 Figure 8.17: Average Facial landmark detection rates for different face resolutions Figure 8.18: Facial landmark localization results from BioID database Figure 8.19: Facial landmark localization results from FG-NET database

Figure 8.20: Real-time facial landmark localization results

8.6 System Performance

Accuracy

To evaluate the performance of the system in providing user-independent operation, the classifiers were trained on the JAFFE database and the system was tested on the FG-NET database. Not all sequences from the FG-NET database were used for testing; samples that failed to present an expression (e.g. because the subject was interrupted by talking or laughing) were excluded. The recognition results are presented in Table 8.5.

Table 8.5: Recognition results for 7 expressions

Emotion      Recognition rate
Happiness    85%
Sadness      52%
Fear         74%
Disgust      63%
Surprise     82%
Anger        69%
Neutral      80%

The results show that happiness, surprise and neutral were detected with relatively high accuracy, while the more subtle emotions were harder to recognize, especially sadness. The low recognition rate is thought to be mainly due to people conveying their emotions differently; for more subtle expressions, the variation is wide. Some failure samples, together with the corresponding expressions in the training database, are presented in Table 8.7. During testing, we found that sadness, anger, fear and disgust were frequently confused with each other; even human beings sometimes have difficulty in discriminating them. However, these expressions were seldom confused with the other ones. Thus, if these four expressions are treated as one, together with happiness, surprise and neutral, the user's emotional state can be estimated more accurately. Naming the merged expression "unhappy", the classification results for the 4 expressions, i.e. happy, unhappy, surprise and neutral, are presented in Table 8.6. In this way, the system is able to determine the user's mood with an accuracy of 83%.
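The regrouping of sadness, anger, fear and disgust into a single unhappy class amounts to relabelling the per-sample predictions before computing accuracy. The short Python sketch below illustrates this; the label strings follow the tables above, while the function and variable names are assumptions made for the example.

    # Map the seven expression labels onto the four coarse emotional states.
    TO_COARSE = {
        "happiness": "happy",
        "surprise": "surprise",
        "neutral": "neutral",
        "sadness": "unhappy",
        "anger": "unhappy",
        "fear": "unhappy",
        "disgust": "unhappy",
    }

    def coarse_accuracy(true_labels, predicted_labels):
        """Accuracy after collapsing the four confusable expressions into 'unhappy'."""
        pairs = zip(true_labels, predicted_labels)
        hits = sum(TO_COARSE[t] == TO_COARSE[p] for t, p in pairs)
        return hits / len(true_labels)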

Table 8.6: Recognition results for 4 expressions

Emotion     Recognition rate
Happy       85%
Unhappy     86%
Surprise    82%
Neutral     80%

Figure 8.21: Recognition rates at different distances

Table 8.7: Failure samples with the corresponding expressions in the training database

Tests were also conducted on the system in practical conditions. During these tests, the user's expressions were classified into one of the four expressions (happy, unhappy, surprise and neutral), and a web camera with a 3 cm focal length was used. The recognition results at different distances between the user and the camera are presented in Figure 8.21, and typical samples are shown in Figure 8.22. The results show that the system works well within a practical range of distances, and is robust against variations in lighting and background.

Figure 8.22: Recognition results for real-time test
