FACIAL EXPRESSION RECOGNITION AND EXPRESSION INTENSITY ESTIMATION

FACIAL EXPRESSION RECOGNITION AND EXPRESSION INTENSITY ESTIMATION

BY PENG YANG

A dissertation submitted to the Graduate School New Brunswick, Rutgers, The State University of New Jersey, in partial fulfillment of the requirements for the degree of Doctor of Philosophy, Graduate Program in Computer Science. Written under the direction of Professor Dimitris N. Metaxas and approved by

New Brunswick, New Jersey
May, 0

ABSTRACT OF THE DISSERTATION

Facial Expression Recognition and Expression Intensity Estimation
by Peng Yang
Dissertation Director: Professor Dimitris N. Metaxas

Seventy years ago, psychologists categorized facial expressions into seven categories: anger, disgust, fear, happiness, sadness, surprise, and neutral. By analyzing expressions, psychologists aim to infer the emotions behind them. Because of the many potential applications of human emotion analysis, automatic analysis of human affective expressions has attracted increasing attention from researchers in psychology, computer science, linguistics, neuroscience, and related disciplines. A great deal of work has been done on this topic in the past thirty years, and many promising approaches have been proposed. However, most existing methods handle deliberately displayed and exaggerated expressions of prototypical emotions. Several hard problems remain unsolved for real systems that must handle naturally occurring emotions, such as exploring discriminative features, time warping, and expression intensity estimation. Our work focuses on these practical problems, analyzes the challenges that arise in real systems, and proposes sound solutions for advancing human affect sensing technology.

Acknowledgements

First and foremost, I would like to thank my advisor, Professor Dimitris Metaxas, for his guidance, support, and encouragement throughout my Ph.D. years. He has always pointed me toward the exciting research areas in our field, while allowing me to pursue independent work with great freedom. Without his support, I could not have finished the work in this thesis. I want to thank Qingshan Liu, who has worked closely with me for more than four years. He contributed numerous ideas and insights to the work I have finished; without the brainstorming in our discussions, much of the work in this thesis would not have happened. I also thank my thesis committee members, Professor Ahmed Elgammal, Professor Vladimir Pavlovic, and Professor Yingli Tian. I thank my qualifying exam committee members, Professor Ahmed Elgammal, Professor Vladimir Pavlovic, and Professor Liviu Iftode, for their valuable suggestions regarding my early proposal and this thesis. Last but not least, I want to thank all my colleagues from the Center for Computational Biomedicine Imaging and Modeling (CBIM) and the Computer Science Department. Without their help, I could not have finished the work in this thesis.

Dedication

To my parents, Yiwen Yang and Jiansheng Yang: without their love and support, I could not have achieved anything. To my wife and daughter, Hui Li and Isabella Yang: without Hui's support, I could not have focused on finishing my thesis. My daughter always sang "twinkle, twinkle, little star" while I was working on my thesis, and it gave me more energy to complete it.

Table of Contents

Abstract
Acknowledgements
Dedication
List of Tables
List of Figures
1. Introduction
   Motivations
   The Current Approaches
      Vision Based Methods
   Challenges in Vision Based Methods
      Dynamic Feature Extraction
      Time Resolution Problem
      Expression Intensity Estimation
      Mis-Alignment Problem
      Low Level Expression Recognition
   Our Approach and Contributions
      Encoded Dynamic Features
      Extension on Encoded Dynamic Features
      Ranking Model for Intensity Estimation
      Multiple Instance Feature
      Compositional Features
2. Related Work
   Feature Extraction and Representation
      Appearance Features
         Gabor Wavelets
         Haar-like Features
         Local Binary Pattern Features
      Geometric Features
   Facial Expression Classification
      Template Based Methods
      Neural Network Based Methods
      Statistical Model Based Methods
         Support Vector Machines Based Methods
         Boosting Based Methods
   Facial Expression Databases
   Summary
3. Encoded Dynamic Features
   Motivation
   Dynamic Feature Representation
   Coding the Dynamic Features
   Boosting the Coded Dynamic Features
   Experiment
      Analysis on the Characteristics of the Proposed Method
      Evaluation on the Dynamic Haar-like and the Dynamic Gabor Features
      Analysis of the Window Size L
      Analysis of the Threshold T
      Experiment on Automatically Cropped Data Set
      Experiment on Facial AUs
   Summary
4. Encoded Dynamic Patterns on Time Resolution Problem
   Motivation
   Our Contribution
   Haar-like Facial Appearance Representation
   Clustering Intrinsic Temporal Patterns
   Dynamic Binary Pattern Mapping
   Boosting Classifier for Expression Recognition
   Experiments
      Comparison to 2D Haar-like Features
      Robustness Analysis
   Summary
5. Facial Expression Intensity Estimation
   Motivation
      The Problems
      Related Works
      Our Contribution
   Data Organization
   Ranking Model
      RankBoost
      RankBoost with l Regularization
      Recognition by the Ranking Model
   Experiment
      Training Error Analysis
      Testing Error Analysis
   Summary
6. Multiple Instance Features
   Motivation
   Our Contribution
   Face Appearance Representation
   Learn Multiple Instance Features
   Learning Facial Expression Classifier
   Experiment
      Evaluation on the Cohn-Kanade Database
      Evaluation on the MMI Database
   Summary
7. Exploring Facial Expressions with Compositional Features
   Motivation
   Our Contribution
   Local Patches and Feature Description
   Building Compositional Features
   Boosting Learning
   Experiment
      Results on Apex Data
      Results on Extended Data
8. Conclusion
   Summary
   Major Contribution
   Limitations
   Future Works
References

List of Tables

The area under the ROC curves (based on Haar-like feature and Gabor feature)
The area under the ROC curves with different window sizes, based on the Haar-like feature
The area under the ROC curves with different window sizes, based on the Gabor feature
The area under the ROC curves for different T
The area under the ROC curves (expressions)
The area under the ROC curves (AUs)
The area under the ROC curves (2D Haar-like feature and DBP)
The area under the ROC curves (different sampling strategies)
The area under the ROC curves (training on 7(xxxxxxx) frames)
The area under the ROC curves (training on 7(xxxxxxx) frames)
The area under the ROC curves (training on randomly selected frames)
Performances on the training set
Performances on the testing set
The influence of face misalignment
Recognition rates (%) on different misalignment level sets with SVM and Adaboost on the Cohn-Kanade database
Recognition rates (%) on different misalignment level sets with SVM and Adaboost on the MMI database
Performances on the testing set
The confusion matrix of our proposed method on the apex data
Performances on the extended testing set
The confusion matrix of our proposed method on the extended data

List of Figures

(Left) Gabor kernel, with its real and imaginary parts; (right) Gabor features extracted from a face image
The basic Haar-like feature templates
Haar-like features extracted at different positions and scales on the face image
The LBP operator
LBP feature vectors extracted from different regions of the face image
Demonstration of landmarks on the face
Landmarks on the profile face; tracking is used to detect AUs
Grids on the face with different expressions
The structure of the proposed framework
The flowchart of dynamic feature extraction
Example of Haar-like features superimposed onto a face image
Example of one dynamic feature unit u_{i,j} in an image sequence
Example of coding one dynamic feature unit u_{i,j}
The procedure of expression recognition
Examples of the six basic expressions (anger, disgust, fear, happiness, sadness, and surprise)
Different facial motion speeds of different subjects
ROC curves of expressions based on coded dynamic features and static features
Example of the AUs
ROC curves of AUs based on coded dynamic features and static features
Some examples of smile events from different subjects and different cameras
The structure of our approach
The process of extracting dynamic binary patterns
ROC curves of six expressions in table
ROC curves of six expressions in table
The mean and variance of the results in Tables 4.4, 4.5, and
The example of AU5 at five levels (A, B, C, D, E)
Intensity over time for AU5 (from [])
The intensity change over one sequence, the output of SVM, and the output of the proposed ranking model
The framework of our method
The organization of the training data for ranking the intensity of the happy expression (for the other expressions, the correlative intensity with respect to happy is the inverse of the original intensity; the red rectangle has higher intensity than the green rectangle in each pair)
Error rates on the training data (anger, disgust, fear, happiness, sadness, and surprise)
Error rates on the testing data (anger, disgust, fear, happiness, sadness, and surprise)
Mis-classified examples
The impact of mis-alignment on the normalized face
Multi-scale multiple instance feature: the blue rectangle is the patch, which is treated as the bag, and the red rectangles inside are the corresponding instances
The distribution of AUs on the face (first row) and some superimposed sub-windows on the face image: (a) upper-face action units; (b) lower-face action units (up/down actions); (c) lower-face action units (horizontal actions); (d) lower-face action units (oblique actions); (e) lower-face action units (orbital actions). Bottom: (A) the row of superimposed sub-windows along the eye area; (B) one row of superimposed sub-windows at the bottom of the face
A sample compositional feature
Examples of the six expressions (anger, disgust, fear, happiness, sadness, and surprise): (A) samples of the apex data, which come from the last frame of each sequence; (B) samples of the extended data, which are at a low intensity level
Training error vs. number of features on the apex data
Training error vs. number of features on the extended data
The distribution of the selected features on the apex data. Top row: the top 5 features selected by Haar+boosting; middle row: the top 5 features selected by Haar(constrained)+boosting; bottom row: the first compositional features in our method
The distribution of the selected features on the extended data. Top row: the top 5 features selected by Haar+boosting; middle row: the top 5 features selected by Haar(constrained)+boosting; bottom row: the first compositional features in our method

Chapter 1
Introduction

1.1 Motivations

Future computing environments will be human-centered designs instead of computer-centered designs [][][4][5]. Change in the user's affective state is one of the fundamental components of human communication: some affective states motivate human actions, and others enrich the meaning of human communication. Traditional HCI is based on the user's intentional input and ignores the user's natural affective states, so a large portion of the useful information available in the interaction process is lost. In the future, human-centered interfaces must be able to detect subtle changes in the user's behavior, especially his or her affective state, and to initiate interactions based on this information rather than simply responding to the user's input [6].

Affects have been described by psychologists in terms of discrete categories. The most popular example of this description is the set of prototypical (basic) emotion categories. Izard proposed categorizing facial expressions into six basic expressions: happiness, sadness, disgust, surprise, anger, and fear. These six basic emotions are universal expressions across different races, cultures, and regions. In [7], Ekman and Friesen designed a comprehensive standard, the Facial Action Coding System, to decompose each expression into several specific action units (AUs). These two works can be regarded as the pioneering works of facial expression analysis. Some works have also investigated other deliberately displayed facial expressions such as fatigue [8][40] and pain [9], and mental states such as agreeing, disagreeing, lying, frustration, and thinking [0][].

In the context of the face and nonverbal communication, expression implies a revelation about the characteristics of a person, a message about something internal to the expresser. Therefore, facial expressions are an important channel of nonverbal communication. Even though the human species has acquired the powerful capabilities of a verbal language, the role of facial expressions in person-to-person interactions remains substantial. Automatic facial expression recognition has therefore attracted much attention in recent years due to its potential applications, such as security, entertainment, medical care, and human-computer interfaces. Especially because of the theoretical interest of cognitive scientists, automatic human affect analysis has attracted the interest of many researchers. In affect-related research (e.g., in psychology, psychiatry, behavioral science, and neuroscience),

with the assistance of an automatic affect analysis system, scientists could mine meaningful information from large-scale data and speed up the manual task of processing data on human affective behavior [].

1.2 The Current Approaches

Currently, there are two main approaches to analyzing human affect: vision-based methods and audio-based methods. In recent years, more researchers have realized that integrating information from the audio and visual channels leads to improved recognition of affective behavior occurring in real-world settings. As a result, an increased number of studies on audiovisual human affect recognition have emerged in recent years (e.g., []). However, because many applications cannot rely on audio, vision-based methods are still the most active research area in human affect analysis.

1.2.1 Vision Based Methods

For vision-based methods, the input signal is image based, and feature extraction is the first step in exploring the content of the images. Usually, there are two ways to extract facial features: geometric features and appearance features. Geometric features normally come from the shapes of the facial components and the locations of facial salient points (corners of the eyes, mouth, etc.). For appearance features, the facial texture is formed by wrinkles, bulges, and furrows. Typical examples of geometric-feature-based methods are those of Chang et al. [4] and Pantic and her colleagues [4][5][6], where key landmark points around the nose, mouth, and eyes are used to build geometric features. For the appearance-feature-based methods, Bartlett et al. [7][8] and Guo and Dyer [9] used Gabor wavelets or eigenfaces, Anderson and McOwen [0] used a holistic spatial ratio face template, Chang et al. [4] built a probabilistic recognition algorithm based on the manifold subspace of aligned face appearances, and Shan [] used local binary patterns as the feature descriptors. Some researchers have also tried to combine geometric and appearance features: Lucey et al. [] proposed using the Active Appearance Model (AAM) to capture the characteristics of both the facial appearance and the shape of facial expressions.

1.3 Challenges in Vision Based Methods

Although some methods achieve successful results in facial expression analysis, there are still many unsolved problems in practical systems, such as time resolution, intensity estimation, low-level expression recognition, mis-alignment, illumination, and face pose variation.

1.3.1 Dynamic Feature Extraction

Expression usually implies a change of a visual pattern over time, but just as a static painting can express a mood or capture a sentiment, the face can also express relatively static characteristics. Therefore, previous facial expression recognition works can be categorized into two classes: image-based methods and video-based methods. Image-based methods assume that facial expressions are static, and recognition is conducted on a single image. In reality, a natural facial event evolves over time through the onset, the apex, and the offset, and so does a facial expression. Therefore, image-based expression recognition cannot achieve good performance in practical systems. The video-based methods have been demonstrated to be a good option; they aim to integrate the temporal information of facial appearance. Although much progress has been made by many researchers, achieving high accuracy is still a challenge due to the complicated variation of facial dynamics. How to extract and represent the dynamic features is a key issue for video-based methods.

1.3.2 Time Resolution Problem

There are two key issues in video-based facial expression recognition in practice. One is how to represent the dynamics of the facial expression for recognition; the other is how to segment expression events from the input video. The first concerns exploring discriminative features, and the second focuses on low-level expression recognition. Because subtle changes in expression are hard to detect for event segmentation, the sliding-window technique is still the most popular way to extract subsequences for sequence-based recognition. Although sequence-based models such as HMMs have been applied to video-based expression recognition, such models are hard to train when the feature dimension is high; therefore, dimensionality reduction techniques such as PCA are usually used as preprocessing. Recently, volume-based features have attracted much attention for capturing the dynamics of actions, including facial events, in which the image sequence is modeled as volumetric data. They have the advantage of tightly coupling temporal dynamics with spatial appearance. For example, in [], volume LBP features were proposed for facial expression recognition and achieved much success. However, the volumetric features have a prerequisite that the training and the testing data must have the same length and the same speed, i.e., the same time resolution. It is hard to satisfy this prerequisite in practical systems. For example, different cameras have different capture rates and produce videos at different time resolutions. Even with the same camera, different subjects and different environments also affect the time resolution of facial expressions. Thus, a time warping step should be performed before volume feature extraction can be used in practical applications, but it is

inevitable that the recognition performance will be affected by the time warping operation.

1.3.3 Expression Intensity Estimation

The basic goal of facial expression analysis is to automatically identify an input facial image or sequence as one of the six basic expressions, and some studies have obtained good performance in special cases. However, simply classifying expressions into these six basic categories is insufficient for a deeper understanding of human emotion. Recently, psychological research has demonstrated that, besides the category of an expression, facial expression dynamics is important when attempting to decipher its meaning [4]. Briefly speaking, expression dynamics can be represented by the variation of expression intensity in the temporal domain. Expression intensity estimation has many potential applications in human-robot interaction, patient monitoring, security surveillance, and entertainment. Facial expression intensity estimation is very important for understanding variation in human emotion, yet few works have addressed this issue. One reason is that expression intensity estimation is directly related to low-level expression recognition, which is still a hard problem; most existing works still focus on deliberately displayed expressions. Another reason is that intensity estimation is not a classification problem but a regression problem, so traditional classification methods cannot handle it well. Facial expression is a dynamic process from onset, through apex, to offset [5], but the intensities of the apexes vary across subjects. For example, one person shows happiness with a smile, while another may laugh out loud. It is also hard to give a quantitative measure of expression intensity in practice. These issues make expression intensity estimation more difficult than expression recognition. In fact, expression intensity estimation is not a hard-decision problem, which is why conventional classification methods are not suitable for it.

1.3.4 Mis-Alignment Problem

In practice, almost all face analysis systems perform face alignment to obtain normalized faces based on some landmarks on the face, usually the eye centers. However, it is difficult to obtain well-aligned face images automatically with current face alignment techniques, including eye detection [6], the active shape model (ASM) [7], and the active appearance model (AAM) [8], due to the influence of lighting, pose, and other environmental factors. Face misalignment has been shown to rapidly degrade the performance of face recognition [9].

1.3.5 Low Level Expression Recognition

Most current works focus on recognizing expressions at the apex level and ignore the frames at low intensity levels. In recent years, psychological research has demonstrated that, besides the category of an expression, facial expression dynamics is important when attempting to decipher its meaning [4]. Therefore, recognizing expressions at low intensity is also necessary. However, low-level expression recognition is much harder than apex-level recognition, because current features are still not discriminative enough. How to design a more powerful descriptor is the key problem in this topic.

1.4 Our Approach and Contributions

Our approaches aim to solve the above problems in practical systems. For dynamic feature extraction, we observe the change of appearance features in the time domain and encode these appearance features to build dynamic features. Building on the idea behind the dynamic features, we further extend the encoded dynamic features to solve the time resolution problem. For the expression intensity estimation problem, we propose to use a ranking model instead of a traditional classification solution. To handle the misalignment problem, we embed multiple instance learning into expression recognition and build multiple instance features. Furthermore, we propose compositional features to improve the performance of low-level facial expression recognition.

1.4.1 Encoded Dynamic Features

We explore discriminative dynamic features for video-based facial expression recognition. In order to capture both spatial and temporal information, Haar-like features are used as the raw features to describe the appearance characteristics. Since facial expression variations are dynamic in the temporal domain, it is a trend to use the variation of temporal information for facial expression recognition []. We propose a novel framework for facial expression recognition based on encoded dynamic features, which has three main components. 1) Dynamic feature extraction: we design dynamic Haar-like features to capture the temporal variations of facial expressions; the Haar-like features at the same location along the time axis are treated as a dynamic feature. 2) Encoding dynamic features: the distribution of each raw feature for the corresponding expression is modeled as a Gaussian distribution, and all Gaussian distributions of the feature templates are used as the code book; each raw feature is encoded to a binary value in {1, 0}, and the dynamic features are thereby encoded into binary vectors. 3) Adaboost learning: the general boosting learning algorithm is applied to the encoded dynamic features to select features and build a strong classifier.

Our approach is different from previous volume-based cubic features: the encoded dynamic feature uses the output of a weak classifier (a gate function on a Gaussian distribution) as the descriptor, feature selection is performed at a higher level rather than on the raw features, and the classifier is also built on the encoded dynamic features.

1.4.2 Extension on Encoded Dynamic Features

In order to handle the time resolution problem, we extend the encoded dynamic features to encoded dynamic patterns. Because there are some intrinsic states in each expression, we can use these intrinsic states to encode the appearance features along the time axis and build the encoded dynamic patterns. Basically, we first use Haar-like features to represent the facial appearance, due to their simplicity and effectiveness. Then we perform K-means clustering on the facial appearance features to explore the intrinsic temporal patterns of each expression. Based on the temporal pattern models, we further map the facial appearance variations into dynamic binary patterns, which are independent of the time resolution. Finally, boosting learning is performed to construct the expression classifiers, because the feature dimension is high. Compared to previous work, the dynamic binary patterns encode the intrinsic dynamics of expression, and our method makes no assumption about the time resolution of the data.

1.4.3 Ranking Model for Intensity Estimation

Facial expression intensity estimation is very important for understanding variation in human emotion, yet few works have addressed this issue; most facial expression analysis works focus only on expression recognition. We introduce a new framework for both expression recognition and intensity estimation based on a ranking model. Although it is hard to obtain a quantitative measurement of expression intensity, it is easy to obtain the ordinal relationship between pairwise samples according to temporal variations. Based on this observation, we convert the problem of intensity estimation into a ranking problem, which can be modeled well by RankBoost. The output ranking score given by the ranking model can be used for expression intensity estimation, and we can also use the ranking score for expression recognition.
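To make the ordinal construction above concrete, the sketch below builds pairwise training preferences from one expression sequence, assuming (as in the Cohn-Kanade data described later) that a sequence runs from a neutral onset toward the apex, so a later frame is treated as having intensity no lower than an earlier one. The function name and the simple frame-index ordering are illustrative assumptions, not the exact data organization used in this thesis.

```python
import itertools
import numpy as np

def ranking_pairs_from_sequence(features):
    """Build ordinal (lower, higher) intensity pairs from one sequence.

    features: array of shape (n_frames, n_dims), frames ordered in time.
    Assumes the sequence evolves from a neutral onset toward the apex,
    so frame j (j > i) is treated as having intensity >= frame i.
    Returns a list of (x_low, x_high) feature pairs for a ranking learner
    such as RankBoost.
    """
    pairs = []
    for i, j in itertools.combinations(range(len(features)), 2):
        # earlier frame i is the lower-intensity member of the pair
        pairs.append((features[i], features[j]))
    return pairs

# Hypothetical usage: 10 frames with 5-dimensional appearance features.
seq = np.random.rand(10, 5)
pairs = ranking_pairs_from_sequence(seq)
print(len(pairs), "ordered pairs")  # 45 pairs from 10 frames
```

A ranking learner trained on such pairs only has to respect the relative order within each pair, which is exactly the weak supervision that the temporal structure of a sequence provides.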

1.4.4 Multiple Instance Feature

Almost all previous works focus on designing new algorithms for well-aligned face images. In practice, it is hard to obtain well-aligned face images with current face alignment techniques due to the impact of illumination and pose. To address this problem, we first investigate the influence of face misalignment on facial expression recognition, and then we propose a new framework for facial expression recognition based on multiple instance features, which is robust to face misalignment. Facial expression is generally dominated by a few parts of the face appearance, so we first divide the face image into image patches. To better capture variations of face appearance, a multi-scale appearance representation scheme is developed. For each patch, one multiple instance feature is learned: we use a boosting-based multiple instance learning approach to learn discriminative patterns at the patch level, and we take the outputs of the patches as the features to learn the final facial expression classifiers by Adaboost or SVM.

1.4.5 Compositional Features

To explore a solution for low-level expression recognition, we propose new compositional features as powerful descriptors for facial expression recognition. We first divide the face image into blocks that cover the AU locations, and then we extract local appearance features from each patch. Using a strategy that minimizes the classification error, compositional features are built from the local appearance features, and this process is embedded into the boosting learning structure. In each iteration of boosting learning, L raw features are combined to build one compositional feature, and the final classifier is trained on these compositional features.
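As a rough illustration of combining several raw features into one compositional feature inside a boosting iteration, the sketch below greedily picks L binarized local features and combines them by majority vote so as to minimize the weighted training error. The greedy search and the majority-vote combination rule are assumptions made for this example; the text above only specifies that L raw features are combined under a minimum-classification-error criterion.

```python
import numpy as np

def build_compositional_feature(binary_feats, labels, weights, L=3):
    """Greedily combine L binarized raw features into one compositional feature.

    binary_feats: (n_samples, n_features) array of {0, 1} feature responses.
    labels:       (n_samples,) array of {0, 1} class labels.
    weights:      (n_samples,) boosting weights, summing to 1.
    Returns the indices of the chosen raw features.
    """
    n_samples, n_features = binary_feats.shape
    chosen = []
    for _ in range(L):
        best_j, best_err = None, np.inf
        for j in range(n_features):
            if j in chosen:
                continue
            # majority vote of the candidate combination (assumed rule)
            votes = binary_feats[:, chosen + [j]].mean(axis=1) >= 0.5
            err = np.sum(weights * (votes.astype(int) != labels))
            if err < best_err:
                best_j, best_err = j, err
        chosen.append(best_j)
    return chosen

# Hypothetical usage with random placeholder data
rng = np.random.default_rng(0)
X = (rng.random((100, 50)) > 0.5).astype(int)
y = rng.integers(0, 2, size=100)
w = np.full(100, 1.0 / 100)
print(build_compositional_feature(X, y, w, L=3))
```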

Chapter 2
Related Work

A facial expression is a visible manifestation of the affective state, cognitive activity, intention, personality, and psychopathology of a person. Mehrabian reported that facial expressions have a considerable effect on a listening interlocutor: the facial expression of a speaker accounts for about 55% of the effect, 38% is conveyed by voice intonation, and only 7% by the spoken words. Therefore, facial expression plays a dominant communicative role in interpersonal relations. In the context of the face and nonverbal communication, expression implies a revelation about the characteristics of a person, a message about something internal to the expresser. Facial expressions are an important channel of nonverbal communication. Over the last decades, automatic facial expression analysis has become an active research area with potential applications in areas such as computer interfaces, talking heads, image retrieval, teleconferencing, and human emotion analysis. An automatic facial expression analyzer should explain the human's expression, explore the meaning of the expression, and embed this information into human-computer interaction. Just like a human being, an automatic facial expression recognition system should first find where the face is and then analyze the expression. For the first part, face detection technology [0] has achieved great success, but expression recognition is still a hard problem for computers. Many works have been proposed [][][][][9] on this topic; we review the important works in this chapter and discuss their main advantages and limitations.

2.1 Feature Extraction and Representation

Almost all methods perform facial expression analysis on the detected, normalized face, and the first step is to design descriptors that describe the facial expression. How to extract the useful information, in other words feature extraction, is the key to all of these methods. Feature extraction methods can be categorized according to whether they focus on motion or on deformation of faces and facial features, and whether they act locally or holistically. As discussed in the first chapter, we categorize facial expression recognition methods into two main categories: appearance-feature-based methods and geometric-feature-based methods. Currently, some hybrid methods also take advantage

of the appearance and use the appearance features as low-level input for higher-level processing.

2.1.1 Appearance Features

The discriminative appearance features normally come from the eye area, the mouth, the nose, and wrinkles. How to extract this texture information is a key problem, and the most popular texture features are Gabor wavelets [4], Haar-like features [0], and local binary patterns (LBP) [5]. These features can extract texture information at different scales and orientations, which makes them powerful discriminative descriptors. Because these features are important, we survey them and their extensions in this section.

Gabor Wavelets

In image processing, a Gabor filter, named after Dennis Gabor, is a linear filter used for edge detection. The frequency and orientation representations of the Gabor filter are similar to those of the human visual system, and it has been found to be particularly appropriate for texture representation and discrimination. In the spatial domain, a 2D Gabor filter is a Gaussian kernel function modulated by a sinusoidal plane wave. The Gabor filters are self-similar: all filters can be generated from one mother wavelet by dilation and rotation. The impulse response is defined by a harmonic function multiplied by a Gaussian function. Because of the multiplication-convolution property (convolution theorem), the Fourier transform of a Gabor filter's impulse response is the convolution of the Fourier transform of the harmonic function and the Fourier transform of the Gaussian function. The filter has a real and an imaginary component representing orthogonal directions; the two components may be formed into a complex number or used individually. Figure 2.1 shows the Gabor kernels normally used in computer vision. In [9][9][6], researchers demonstrated that Gabor wavelets can successfully extract the texture information needed for expression recognition, and in [7], Tian further analyzed the impact of Gabor wavelets on expression recognition at different image scales. However, calculating Gabor features is very time-consuming due to the convolution operations.
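To make the description above concrete, the following sketch constructs the real part of a 2D Gabor kernel (a Gaussian envelope modulated by a sinusoidal plane wave) and convolves it with an image at several orientations. The particular kernel size, sigma, and wavelength values are illustrative assumptions, not the filter bank used in the cited works.

```python
import numpy as np
from scipy.signal import convolve2d

def gabor_kernel(size=21, sigma=4.0, wavelength=8.0, theta=0.0):
    """Real part of a 2D Gabor kernel: Gaussian envelope x cosine carrier."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    # rotate coordinates by theta
    x_t = x * np.cos(theta) + y * np.sin(theta)
    y_t = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_t ** 2 + y_t ** 2) / (2.0 * sigma ** 2))
    carrier = np.cos(2.0 * np.pi * x_t / wavelength)
    return envelope * carrier

# Hypothetical usage: filter a 64x64 face crop at four orientations.
face = np.random.rand(64, 64)            # stand-in for a normalized face image
responses = [convolve2d(face, gabor_kernel(theta=t), mode="same")
             for t in np.linspace(0, np.pi, 4, endpoint=False)]
feature_vector = np.concatenate([np.abs(r).ravel() for r in responses])
```

The per-orientation convolutions are exactly the operations that make Gabor features expensive compared to the rectangle-sum features discussed next.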

Figure 2.1: (Left) Gabor kernel, with its real and imaginary parts. (Right) Gabor features extracted from a face image.

Figure 2.2: The basic Haar-like feature templates.

Haar-like Features

Haar-like features are digital image features used in object recognition. They owe their name to their intuitive similarity with Haar wavelets and were used in the first real-time face detector. Historically, working with only image intensities made the task of feature calculation computationally expensive. Viola and Jones [8] adapted the idea of using Haar wavelets and developed the so-called Haar-like features. A Haar-like feature considers adjacent rectangular regions at a specific location in a detection window, sums up the pixel intensities in these regions, and calculates the difference between them. This difference is then used to categorize subsections of an image. Figure 2.2 displays the basic Haar-like feature templates, and the process of Haar-like feature extraction on a face image is shown in Figure 2.3. The key advantage of a Haar-like feature over most other features is its calculation speed: due to the use of integral images, a Haar-like feature of any size can be calculated in constant time. Because of these advantages, researchers have also applied Haar-like features to facial expression analysis [9] and obtained promising performance.
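The constant-time evaluation mentioned above relies on the integral image: once the cumulative sums are precomputed, any rectangle sum needs only four array lookups. The sketch below, with an illustrative two-rectangle (left-minus-right) feature, is a simplified rendering of that idea rather than the exact feature set used in the cited detectors.

```python
import numpy as np

def integral_image(img):
    """Cumulative sums so that any rectangle sum costs four lookups."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii

def rect_sum(ii, r, c, h, w):
    """Sum of pixels in the h x w rectangle whose top-left corner is (r, c)."""
    return ii[r + h, c + w] - ii[r, c + w] - ii[r + h, c] + ii[r, c]

def two_rect_haar_feature(ii, r, c, h, w):
    """Illustrative Haar-like feature: left half minus right half."""
    return rect_sum(ii, r, c, h, w // 2) - rect_sum(ii, r, c + w // 2, h, w // 2)

# Hypothetical usage on a 64x64 face crop
face = np.random.rand(64, 64)
ii = integral_image(face)
print(two_rect_haar_feature(ii, r=20, c=16, h=8, w=16))
```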

Figure 2.3: Haar-like features extracted at different positions and scales on the face image.

Local Binary Pattern Features

The original local binary pattern operator was introduced by Ojala et al. [5] and proved to be a powerful feature for texture description. The operator labels the pixels of an image by comparing the neighborhood of each pixel with the center value, treating each comparison result as a binary digit 0 or 1, and the n-bin histogram of the LBP labels computed over a region is used as a texture descriptor. Normally, the neighborhood size is three by three, but the limitation of this basic LBP operator is that it cannot capture large-scale structures. Therefore, the operator was extended to neighborhoods of different sizes, handling circular neighborhoods of any radius and any number of pixels by bilinearly interpolating the pixel values. Figure 2.4 shows how the basic operator computes the encoded value. For a face image, the image is normally segmented into multiple regions, the LBP vectors are calculated on each region, and these vectors are concatenated as the final feature descriptor. Figure 2.5 shows how the LBP vectors are built. Shan [] applied LBP features to facial expression recognition and also obtained promising performance. In order to handle sequence-based facial expression analysis, Zhao [] extended LBP to the volume local binary pattern (VLBP). Because both spatial and temporal information is considered in VLBP, it obtains much better results than traditional LBP. Inspired by Zhao's work, we also combine spatial and temporal information and build encoded dynamic features for facial expression recognition.
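A minimal sketch of the basic 3x3 LBP operator and the region-histogram descriptor described above is given below. It uses the eight immediate neighbors and a fixed grid of regions, i.e., the simplest variant rather than the circular, multi-radius extension.

```python
import numpy as np

def lbp_3x3(img):
    """Basic LBP: compare each pixel's 8 neighbors with the center pixel."""
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    h, w = img.shape
    center = img[1:-1, 1:-1]
    codes = np.zeros((h - 2, w - 2), dtype=int)
    for bit, (dy, dx) in enumerate(offsets):
        neighbor = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        codes += (neighbor >= center).astype(int) * (1 << bit)
    return codes

def lbp_histogram_descriptor(img, grid=4):
    """Concatenate 256-bin LBP histograms over a grid x grid partition."""
    codes = lbp_3x3(img)
    gh, gw = codes.shape[0] // grid, codes.shape[1] // grid
    hists = []
    for i in range(grid):
        for j in range(grid):
            block = codes[i * gh:(i + 1) * gh, j * gw:(j + 1) * gw]
            hists.append(np.bincount(block.ravel(), minlength=256))
    return np.concatenate(hists)

face = np.random.rand(64, 64)
print(lbp_histogram_descriptor(face).shape)  # (4 * 4 * 256,) = (4096,)
```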

Figure 2.4: The LBP operator.

Figure 2.5: LBP feature vectors extracted from different regions of the face image.

Figure 2.6: Demonstration of landmarks on the face.

2.1.2 Geometric Features

Typical examples of geometric-feature-based methods are those of Chang et al., who used a shape model defined by 58 facial landmarks, and of Pantic and her colleagues [4][5][6], who used a set of facial characteristic points around the mouth, eyes, eyebrows, nose, and chin. Figure 2.6 shows the landmarks in Pantic's work; by tracking these landmarks, motion information is obtained for expression recognition. The key problem with this approach is to precisely locate the landmarks and track them. However, in real applications, due to pose variation and noise from the background, it is still very hard to precisely locate the landmarks. Huang [40] represented the face using a point distribution model (PDM). The PDM was generated from 90 facial feature points that were manually localized in 90 images of 5 subjects showing the six basic emotions. The PDM models the face as a whole and interacts with the estimated face region of an input image as a whole. After an initial placement of the PDM in the input image, the entire PDM can be moved and deformed simultaneously. A gradient-based estimation of the shape parameters, which minimizes the overall gray-level model fitness measure, is applied. The face should be without facial hair and glasses, and no rigid head motion may occur, so the applicability of the method is strongly constrained. Kotsia [4] placed grids on the face and used geometric deformation features for expression classification; Figure 2.8 shows how the grids change for different expressions. Generally, these methods apply active shape models to locate the landmarks with a global constraint and local search strategies. Active shape models (ASMs) are statistical models of the shape of objects which iteratively deform to fit an example of the object in a new image. One disadvantage of ASM is that it only uses shape constraints (together with some information about the image structure near the landmarks) and does not take advantage of all the available information: the texture across the target. Therefore, the active appearance model (AAM), which is related to ASM, was proposed for matching a statistical model of both object shape and appearance to a new image. Like the ASM, the AAM is built during a training phase: a set of images, together with the coordinates of landmarks that appear in all of the images, is provided to the training supervisor. AAM-based approaches can be regarded as hybrid methods based on both geometric and appearance features.

Figure 2.7: Landmarks on the profile face; tracking is used to detect AUs.

Lucey et al. [] used AAM to capture the characteristics of the facial appearance and the shape of facial expressions. In summary, the geometric-feature-based methods try to locate landmarks on the face, track them, and perform recognition based on the motion or deformation information. The common problem of these methods is that the landmark localization cannot be accurate enough, and it becomes even worse when tracking is added. Generally, this problem is called the mis-alignment problem. The mis-alignment problem also negatively affects appearance-based methods; in a real automatic system, both appearance-based and geometric-feature-based methods suffer from it. In order to solve this problem, we propose multiple instance features to handle it; the details are introduced in Chapter 6.

2.2 Facial Expression Classification

After feature extraction, classification is performed in the last stage of an automatic facial expression analysis system. The classification methods can be categorized in various ways; in this section, we group them into the following three categories: 1) template-based methods; 2) neural network-based methods; 3) statistical classification methods. Template-based methods were the first to be proposed, decades ago; currently, most researchers focus on neural network-based methods and statistical classification methods. We discuss representative work using these three kinds of methods respectively.

Figure 2.8: Grids on the face with different expressions.

2.2.1 Template Based Methods

The template-based techniques are simple face representation and classification methods: they compare new images with a learned template, which is normally the average of the images in the same category. These methods have limited recognition capability, because the averaging process smooths out some important individual facial details, and misalignment of the faces also corrupts the template. A more important problem is that there are large inter-personal expression differences. Among template-based methods, Essa and Pentland [4] used a spatio-temporal motion-energy representation of facial motion for an observed expression. From precisely labeled training images, 2D motion was extracted to generate spatio-temporal templates for six different expressions: two facial actions (smile and raised eyebrows) and four emotional expressions (surprise, sadness, anger, and disgust). The Euclidean norm of the motion-energy difference between the template and the observed image is used as a metric for measuring similarity. Although they obtained a 98% recognition rate on 5 frontal-view image sequences from eight subjects with six different expressions, this method still cannot be used in a real system due to the misalignment problem and the large variations between subjects. Kimura [4] proposed fitting a Potential Net to each frame of the labeled facial image sequence. The pattern of the deformed net is compared to the pattern extracted from a neutral face, and the variation in the positions of the net nodes is used for further processing. However, the proposed method could not

work well on image sequences of unknown subjects due to the very small number of training examples and the insufficient diversity of the subjects. In summary, template-based methods were the pioneering works on facial expression analysis, but they cannot describe the variance among different subjects and are not robust to the mis-alignment problem.

2.2.2 Neural Network Based Methods

An artificial neural network (ANN), usually called a neural network (NN), is a mathematical or computational model inspired by the structure and functional aspects of biological neural networks. A neural network consists of an interconnected group of artificial neurons, and it processes information using a connectionist approach to computation. In most cases an ANN is an adaptive system that changes its structure based on external or internal information that flows through the network during the learning phase. Neural networks are usually used to model complex relationships between inputs and outputs or to find patterns in data. Generally, a neural network can be viewed as a black box which greedily maximizes or minimizes an objective function. Because of this advantage, neural networks have often been used for facial expression classification. They have been applied directly to face images and have also been combined with facial feature extraction and representation methods such as PCA, independent component analysis (ICA), and Gabor wavelet filters.

2.2.3 Statistical Model Based Methods

In recent years, machine learning techniques have been broadly used in many areas, including computer vision. For classification in facial expression analysis, statistical classification methods are dominant. The most popular methods are support vector machines, boosting methods, and sequence-based methods such as hidden Markov models. We survey these three kinds of methods on facial expression analysis separately.

Support Vector Machines Based Methods

Support vector machines (SVMs) are a set of related supervised learning methods that analyze data and recognize patterns, used for classification and regression analysis. More formally, a support vector machine constructs a hyperplane or a set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification, regression, or other tasks. As a classifier, an SVM is trained on a set of training examples, each marked as belonging to one of two categories, and the SVM training algorithm builds a model that predicts which category a new example falls into.
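As a small usage illustration of the SVM-based pipeline described above, the sketch below trains a linear SVM on precomputed appearance feature vectors. The random data, feature dimension, and parameter choices are placeholders; they are not the features or settings used in the cited papers.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# Placeholder data: 600 face samples described by 2,000-D appearance features
# (e.g., Gabor or LBP descriptors), labeled with one of six expressions.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 2000))
y = rng.integers(0, 6, size=600)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# A linear kernel is a common choice for high-dimensional appearance features;
# scikit-learn handles the multi-class case internally (one-vs-one for SVC).
clf = SVC(kernel="linear", C=1.0)
clf.fit(X_tr, y_tr)
print("accuracy:", clf.score(X_te, y_te))
```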

As a powerful general classification method, the support vector machine has also been used in many works [9][4][][] on expression analysis. Tian [7] first extracted Gabor features from face images and then trained SVMs on the training samples of six expressions; the experimental results on the CMU data set achieved average recognition above 90%. Shan [44] also applied SVMs to LBP features and obtained similar performance on the same data set. Zhao [] extended LBP to VLBP and likewise applied SVMs to build the classifier.

Boosting Based Methods

Normally, the dimension of the features extracted from a face image is very high, and it is almost impossible to train a classifier directly on such high-dimensional features; therefore, feature selection or dimensionality reduction must be performed as preprocessing. When boosting builds the strong classifier by combining weak classifiers, it performs feature selection at the same time, because each weak classifier corresponds directly to one feature. Littlewort [45] used boosting to do feature selection on extracted Gabor features and then fed the selected features into an SVM for further training, but that work focused only on static images. For spatio-temporal approaches, hidden Markov models (HMMs) are commonly used for facial expression analysis, as they allow modeling the dynamics of facial actions; several HMM-based classification approaches can be found in the literature [46][47].

2.3 Facial Expression Databases

There are several publicly available facial expression databases, and they can be categorized into two groups: 1) 2D image or sequence based databases; 2) 3D static image or 3D sequence based databases. The first group includes the Japanese Female Facial Expression (JAFFE) database [6], the Cohn-Kanade (CK) facial expression database [5], the CMU PIE database [48], and the MMI database [49]. The CK database is the most popular expression database; it contains 97 subjects and 48 video sequences with the six basic expressions. Subjects in every video begin with a neutral expression and end at the apex level, and FACS coding is provided for every video. Another popular database is the MMI facial expression database, which includes 894 sessions; currently 95 sessions are AU coded and 97 sessions are labeled as one of the six basic emotions. The images in the MMI database include both frontal and profile views. The main purpose of the CMU PIE database is the analysis of pose and illumination problems; it includes 41,368 face images of 68 people captured under 13 poses and 43 illumination conditions. Three different expressions, neutral, smile, and blinking, are also included. The Japanese Female Facial Expression (JAFFE) database contains images

of ten expressers who posed 3 or 4 examples of each of the seven basic expressions. All of the above databases contain only 2D images. Recently, Yin [50] built 3D facial expression databases of 3D static images and 3D image sequences; these two databases are called BU-3DFE (Binghamton University 3D Facial Expression) and BU-4DFE (3D + time). The BU-3DFE database presently contains 100 subjects (56% female, 44% male), ranging in age from 18 to 70 years old, with a variety of ethnic/racial ancestries, including White, Black, East Asian, Middle East Asian, Indian, and Hispanic Latino. Each subject in the database performed seven expressions in front of the 3D face scanner. With the exception of the neutral expression, each of the six prototypic expressions includes four levels of intensity. Although many expression databases have been built, none of them collected spontaneous facial expressions: the subjects were instructed to perform specific facial expressions, which are posed rather than the true spontaneous expressions of real life. It is admittedly hard to collect all six basic spontaneous expressions in daily life, and ground-truth labeling is also a hard problem. Another problem is that most static-image-based works on these data sets focus only on recognizing expressions at the apex level and ignore images at low intensity levels, because low-level expression recognition is much more difficult than recognizing exaggerated expressions.

2.4 Summary

In this chapter, we surveyed the important works of recent years from the viewpoints of both features and classifier design. These works achieved great success on the collected databases; however, for a practical automatic expression system, these methods still cannot handle the problems found in real data, such as the time resolution problem, the mis-alignment problem, expression intensity estimation, and low-level expression recognition. In this thesis, we propose encoded dynamic patterns, multiple instance features, ranking models, and compositional features to handle these problems.

Chapter 3
Encoded Dynamic Features

In this chapter, we propose a new approach called encoded dynamic features for facial expression recognition. In order to capture the temporal characteristics of facial expression, we design dynamic Haar-like features to represent facial images. Compared to Gabor features, Haar-like features are very simple, since they are based only on simple addition and subtraction operators [8]; moreover, our experimental results show that Haar-like features are slightly better than Gabor features for describing facial expressions. The dynamic Haar-like features are encoded into binary patterns for further effective analysis, inspired by [5]. Finally, based on the encoded dynamic features, Adaboost is employed to learn a combination of optimal discriminative features to construct the classifier. We test the proposed method on the CMU facial expression database, and extensive experiments demonstrate its effectiveness. We also extend it to AU recognition and obtain promising performance.

3.1 Motivation

Previous works can be categorized into two classes: image-based methods and video-based methods [5][5][5]. The image-based methods take only mug shots as observations, which capture characteristic images at the apex of the expression, and recognize expressions according to appearance features [][54][5][44]. However, a natural facial event is dynamic: it evolves over time through the onset, the apex, and the offset, and so do facial expressions. Therefore, the image-based methods ignore the dynamic characteristics of facial expressions and do not perform well in real-world settings. The idea of the video-based methods is to analyze the dynamics of facial expression for recognition. Extensive experiments have demonstrated the importance of facial dynamics for recognition [][55][56][][4][57], including psychology experiments [58][4], and it has become a trend in recent years to use the variation of temporal information for facial expression recognition. Compared with static image-based methods [][4], the spatio-temporal feature-based methods [][59][60] combine both spatial and temporal information and can therefore achieve better performance. Black and Yacoob did pioneering work on the dynamic analysis of facial expressions []. They used parametric motion models to describe the facial dynamics and recognized the expression according

Figure 3.1: The structure of the proposed framework.

to the parameters of the local motion models. De la Torre et al. [6] used condensation to track the local appearance dynamics with the help of a subspace representation. In [6], the dynamics are represented by key-point tracking, which is based on the Active Shape Model [6]. All these methods depend on low-level image feature representations to some extent, and they are sensitive to noise. In [64][65], manifold learning was employed to explore the intrinsic subspace of facial expression events: [64] used the Lipschitz embedding to build a facial expression manifold, and [65] used multi-linear models to construct a non-linear manifold model. The manifold methods cannot work well in practice due to noise and the complicated facial appearance variations of different subjects. In this chapter, we propose a novel framework for facial expression recognition based on encoded dynamic features. Figure 3.1 illustrates the structure of our framework, which has three main components: dynamic feature extraction, coding of the dynamic features, and Adaboost learning. For the dynamic feature extraction component, we design dynamic Haar-like features to capture the temporal variations of facial expressions. Inspired by binary pattern coding [5], we analyze the distribution of each dynamic feature and create a code book for it. Based on the code books, the dynamic features are further mapped into binary pattern features. Finally, Adaboost is used to learn a set of discriminative coded features for facial expression recognition.

3.2 Dynamic Feature Representation

Haar-like features achieved great performance in face detection [8]. Due to their much lower computational cost compared to Gabor features, we exploit Haar-like features to represent face images and extend them to represent the dynamic characteristics of facial expression. Figure 3.3 shows an example of extracting Haar-like features from a face image. The dynamic Haar-like features are built in two steps: (1) thousands of Haar-like features are extracted in each frame; (2) the same Haar-like features in consecutive frames are combined into dynamic features. Figure 3.2 shows the flowchart of the dynamic Haar-like feature extraction, and the details are described in the following.

Figure 3.2: The flowchart of dynamic feature extraction.

For simplicity, we denote an image sequence with n frames as I = {I_1, I_2, ..., I_n}, and we define H_i = {h_{i,1}, h_{i,2}, ..., h_{i,m}} as the Haar-like feature set of image I_i, where m is the number of Haar-like features. Corresponding to each Haar-like feature h_{i,j}, we build a dynamic Haar-like feature u_{i,j} over a temporal window of length L as u_{i,j} = {h_{i,j}, h_{i+1,j}, ..., h_{i+L-1,j}}.

Figure 3.3: Example of Haar-like features superimposed onto a face image.

Figure 3.4 gives an illustration. We call each u_{i,j} a dynamic feature; the size of a dynamic feature is determined by the temporal window size. The temporal variation of facial expressions can be effectively described by the set of all u_{i,j}.
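A minimal sketch of building the dynamic feature units u_{i,j} from per-frame Haar-like responses is shown below: the per-frame feature matrix is simply stacked over a temporal window of length L. The per-frame extraction itself is the Haar-like computation described earlier; here it is replaced by a random placeholder array.

```python
import numpy as np

def dynamic_feature_units(frame_features, L=3):
    """Stack per-frame features over a temporal window of length L.

    frame_features: array of shape (n_frames, m), row i holding the m
                    Haar-like responses h_{i,1..m} of frame I_i.
    Returns an array of shape (n_frames - L + 1, m, L) where entry [i, j, :]
    is the dynamic feature unit u_{i,j} = (h_{i,j}, ..., h_{i+L-1,j}).
    """
    n_frames, m = frame_features.shape
    units = np.stack([frame_features[i:i + L].T          # (m, L) per start i
                      for i in range(n_frames - L + 1)])
    return units

# Hypothetical usage: 10 frames, 500 Haar-like responses per frame, L = 3.
H = np.random.rand(10, 500)       # placeholder for the per-frame Haar responses
U = dynamic_feature_units(H, L=3)
print(U.shape)                    # (8, 500, 3)
```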

Figure 3.4: Example of one dynamic feature unit u_{i,j} in an image sequence.

3.3 Coding the Dynamic Features

As mentioned above, a dynamic feature unit is composed of a set of Haar-like features at the same position along the temporal window; thus, each dynamic unit is a feature vector. In this section, we further code the dynamic feature unit into a binary pattern, inspired by [4][5]. This coding scheme has two advantages: 1) it is easier to construct weak learners for Adaboost learning from a scalar feature than from a feature vector; 2) the proposed binary coding is based on the statistical distribution, so the encoded features are robust to noise. We create one code book for each facial expression as follows. First, we analyze the distribution of each feature h_{.,j} under each expression, estimate the mean mu_j and the variance sigma_j of this distribution, and adopt the Gaussian distribution N_j(mu_j, sigma_j) to model the distribution of the feature. Second, we obtain the code book (N_1(mu_1, sigma_1), N_2(mu_2, sigma_2), ..., N_m(mu_m, sigma_m)) for each expression. Based on the code book, we map each Haar-like feature h_{i,j} to a binary value C_{i,j} in {1, 0} by Equation (3.1):

    C_{i,j} = 0  if |h_{i,j} - mu_j| / sigma_j > T,
    C_{i,j} = 1  if |h_{i,j} - mu_j| / sigma_j <= T,        (3.1)

where T is a threshold; a detailed analysis of the influence of different T values is given in Section 3.5.4, where we find that our method is not sensitive to T. Based on Equation (3.1), we obtain the binary pattern E_{i,j}. Figure 3.5 shows the procedure of creating the coded feature E_{i,j} of length L, and we can convert the binary vector E_{i,j} to a scalar:

    E_{i,j} = {C_{i,j}, C_{i+1,j}, ..., C_{i+L-1,j}}.        (3.2)
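The following sketch implements Equations (3.1) and (3.2) directly: each per-frame response is compared against the per-feature Gaussian code book (mu_j, sigma_j) estimated from the training data of one expression, and the resulting L binary values are packed into a scalar code. Packing the binary vector into an integer is one reasonable reading of "convert the binary vector to a scalar"; the exact conversion is not spelled out here, so treat that detail as an assumption.

```python
import numpy as np

def binarize(dynamic_units, mu, sigma, T=1.65):
    """Equation (3.1): C = 1 if |h - mu_j| / sigma_j <= T, else 0.

    dynamic_units: (n_windows, m, L) array of Haar responses u_{i,j}.
    mu, sigma:     (m,) per-feature Gaussian code book of one expression.
    """
    z = np.abs(dynamic_units - mu[None, :, None]) / sigma[None, :, None]
    return (z <= T).astype(int)

def to_scalar_codes(binary_units):
    """Equation (3.2): pack each length-L binary vector E_{i,j} into an integer."""
    L = binary_units.shape[-1]
    powers = 2 ** np.arange(L)          # (1, 2, 4, ...) -- assumed encoding
    return binary_units @ powers        # shape (n_windows, m)

# Hypothetical usage, continuing the example above (U has shape (8, 500, 3)).
U = np.random.rand(8, 500, 3)
mu, sigma = np.random.rand(500), np.random.rand(500) + 0.1
codes = to_scalar_codes(binarize(U, mu, sigma, T=1.65))
print(codes.shape, codes.min(), codes.max())   # (8, 500), values in [0, 7]
```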

.4 Boosting the Coded Dynamic Features

As described above, we can extract thousands of haar-like features from one image, so we obtain thousands of coded dynamic features. A set of discriminative features should be selected to construct the final expression classifier. In this section, we use Adaboost learning to achieve this goal, and a weak learner is built on each coded feature as in [8]. For each expression, we set the image sequences of this expression as the positive examples, and the image sequences of the other expressions as the negative examples. Therefore, one classifier for each expression is built from the corresponding coded features. The learning procedure is given in Algorithm 5, and the testing procedure is displayed in Figure .6. We take six expression categories into account: happiness, sadness, surprise, disgust, fear, and anger. Six Adaboost learners are trained to discriminate each expression from the others.

Figure .5: Example of coding one dynamic feature unit u_{i,j}

Algorithm Learning procedure
1: Given training image sequences (x_1, y_1), ..., (x_n, y_n), with y_i in {1, 0} for the specified expression and the other expressions respectively.
2: Initialize weights D_t(i) = 1/N.
3: Extract the dynamic features from each image sequence.
4: Code the dynamic features based on the corresponding code book to get E_{i,j}, and build one weak classifier on each coded feature.
5: Use Adaboost to obtain the strong classifier H(x_i) for expression recognition.
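A minimal sketch of this one-against-all boosting stage is given below, using scikit-learn's AdaBoostClassifier with default decision stumps as a stand-in for the hand-built weak learners on coded features; the coded scalar features are assumed to be precomputed, and all names and sizes are illustrative.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

EXPRESSIONS = ["anger", "disgust", "fear", "happiness", "sadness", "surprise"]

def train_one_vs_all(X, labels):
    # X: (num_sequences, num_coded_features) scalar coded dynamic features.
    # labels: array of expression names, one per sequence.
    classifiers = {}
    for expr in EXPRESSIONS:
        y = (labels == expr).astype(int)              # positive = this expression
        clf = AdaBoostClassifier(n_estimators=200)    # default weak learner is a depth-1 stump
        classifiers[expr] = clf.fit(X, y)
        # clf.decision_function(X) plays the role of the strong classifier H(x).
    return classifiers

def predict(classifiers, X):
    scores = np.column_stack([classifiers[e].decision_function(X) for e in EXPRESSIONS])
    return [EXPRESSIONS[i] for i in scores.argmax(axis=1)]

# Toy usage with random features standing in for coded dynamic features.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 500))
labels = np.array([EXPRESSIONS[i % 6] for i in range(60)])
models = train_one_vs_all(X, labels)
print(predict(models, X[:5]))
```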

.5 Experiment

Our experiments are conducted on the Cohn-Kanade Facial Expression Database [5]. We select 96 subjects and 00 image sequences from the database, and each subject has six basic expression sequences. The six basic expression samples are shown in Figure .7. We randomly select 60 subjects as the training set, and the remaining subjects as the testing set. We use two strategies to organize the data set: one is to manually fix the eye locations and crop the images, to exclude the effect of misalignment on the recognition result; the other is to automatically detect the face [8] and do alignment, to simulate the practical application. The images are cropped and normalized to 64x64 for both strategies. In Section .5.1, we use the manually cropped data set to investigate the characteristics of the proposed method. In Section .5.5, we use the automatically cropped data set to verify the performance of our method in a practical system. In [] [9], recognition ratios on each expression or AU were reported, but the false alarm rate was ignored. To evaluate the performance more completely, the ROC curve is used in our experiments.

Figure .6: The procedure of the expression recognition
Figure .7: Examples of six basic expressions (Anger, Disgust, Fear, Happiness, Sadness and Surprise)

.5.1 Analysis of the Characteristics of the Proposed Method

In order to analyze the performance of our method without the influence of face misalignment [9], we use the manually cropped data set in this section, and compare the dynamic haar-like features with the dynamic Gabor features. In our method, there are only two parameters: T and L. T is the threshold in formula (.) and L is the size of the temporal window in equation (.). We use the last frame of each sequence to evaluate the performance of the static image based method.

.5.2 Evaluation of the Dynamic Haar-like and the Dynamic Gabor Features

In this experiment, we fix the window size L = 3, i.e., E_{i,j} = {C_{i,j}, C_{i+1,j}, C_{i+2,j}}, to evaluate the performance of the haar-like and Gabor features respectively. The influence of different L will be discussed in Section .5.3. We set T = 1.65 for all the expressions, because for the standard normal distribution x ~ N(0, 1), Pr(x <= 1.65) ≈ 95%.

In a practical system, a different optimal threshold T could be used for each expression to get the best performance; the details of the influence of T on the recognition performance are discussed in Section .5.4. The area under the ROC curves of the different features is shown in Table .1. We can see that the encoded dynamic features outperform the static features, and the dynamic haar-like features are a little better than the dynamic Gabor features.

Table .1: The Area under the ROC curves (Based on Haar-like Feature and Gabor Feature); rows: Angry, Disgust, Fear, Happiness, Sadness, Surprise; columns: Static and Encoded Dynamic for the Haar-like and Gabor features.

Although the static feature beats the dynamic feature for the fear expression, this is not surprising because fear expressions vary considerably across people. In the following two parts, we analyze the relation between the recognition result and the two parameters L and T.

.5.3 Analysis of the Window Size L

One of the key parameters in our method is the size of the temporal window, since it is directly related to the dynamic features. In this subsection, we set L = 3, 4, 5, 6, 7 as the size of the temporal window to investigate the performance of our method. Tables .2 and .3 report the experimental results. We can see that different L values have a slight effect on different expressions. This is caused by the fact that different expressions have different motion patterns and different subjects have different facial motion speeds; an example is shown in Figure .8. However, the performance is acceptable over all the L values. For simplicity, we set L = 3 for all the expressions in the following experiments.

.5.4 Analysis of the Threshold T

Another parameter is the threshold T, which quantizes the real feature value to a binary value. In this part, we check the influence of the parameter T. We set the window size to 3, and use the haar-like feature as the basic feature unit, because in Section .5.2 we found that haar-like features are a little better than the Gabor features. Different thresholds T = {0., 0.6, .0, ., .5, 1.65, .9, .4} are used to train the models, and the corresponding recognition results are shown in Table .4.

Figure .8: Different facial motion speed of different subjects.

Table .2: The area under the ROC curves with different window sizes based on the Haar-like feature (Encoded dynamic Haar-like feature + AdaBoost); rows: Angry, Disgust, Fear, Happiness, Sadness, Surprise; columns: Window Size 3 to 7.

From Table .4, we can clearly see that our dynamic features are not sensitive to the threshold T, except for the fear expression. Even when the threshold decreases to 0.6, our method still works well and achieves good performance. The recognition result of the fear expression varies with different T, perhaps because its variance is large.

.5.5 Experiment on the Automatically Cropped Data Set

In this subsection, we test the proposed method on the automatically cropped data set. We use Viola's [8] method to detect the face and normalize it automatically based on the location of the eyes. L is set to 3 and T to 1.65. Figure .9 shows the ROC curves of the six expressions, and the area under the ROC curves is listed in Table .5. We can see that the ROC curves based on the encoded dynamic features are much better than those based on the static features.

.5.6 Experiment on Facial AUs

In this subsection, we extend the proposed method to facial AU recognition. According to Ekman's group description method of the Facial Action Coding System (FACS) [66], there are 7 primitive units, called Action Units (AUs).

Figure .9: ROC curves of the expressions (Angry, Disgust, Fear, Happy, Sadness, Surprise) based on coded dynamic features and static features

Table .3: The area under the ROC curves with different window sizes based on the Gabor feature (Encoded dynamic Gabor feature + AdaBoost); rows: Angry, Disgust, Fear, Happiness, Sadness, Surprise; columns: Window Size 3 to 7.

Table .4: The Area under the ROC curves for different T (Encoded dynamic Haar feature + AdaBoost, window size 3); rows: Angry, Disgust, Fear, Happiness, Sadness, Surprise; columns: the eight threshold values listed in Section .5.4.

Based on the AUs, any expression can be represented by a single AU or a combination of several AUs. We test our approach on a facial AU database which has subjects. Because there are few samples for some AUs, in our experiments we focus on 8 AUs: AU, AU, AU4, AU5, AU0, AU, AU4 and AU0. Examples are shown in Figure .10. We randomly select subjects as the training set, and the remaining subjects as the testing set. The ROC curves are reported in Figure .; we can clearly see that the coded dynamic features are much better than the static features. The area under the ROC curves is listed in Table .6.

Table .5: The Area under the ROC curves (Expression); rows: Angry, Disgust, Fear, Happiness, Sadness, Surprise; columns: Static feature + AdaBoost, Coded dynamic feature + AdaBoost.

Figure .10: Examples of the AUs

Table .6: The Area under the ROC curves (AUs); rows: the 8 AUs listed above; columns: Static feature + AdaBoost, Coded dynamic feature + AdaBoost.

Figure .: ROC curves of the 8 AUs based on coded dynamic features and static features

.6 Summary

This chapter presented a novel approach for video-based expression recognition based on encoded dynamic features. We first extracted dynamic haar-like features to capture the temporal information of facial expressions, and then coded them into binary pattern features inspired by binary pattern coding. Finally, Adaboost was used to learn a set of discriminative encoded features to construct the final classifier. Experiments on the CK facial expression database and our own facial AU database showed that the proposed method has promising performance.

Chapter 4
Encoded Dynamic Patterns on the Time Resolution Problem

In this chapter, we focus on solving the time resolution problem in video-based facial expression recognition. The proposed method aims to handle data with varying time resolution, including a single frame. As an extension of the encoded dynamic features, we also first use the haar-like features to represent facial appearance, due to their simplicity and effectiveness. Then we perform K-Means clustering on the facial appearance features to explore the intrinsic temporal patterns of each expression. Based on the temporal pattern models, we further map the facial appearance variations into dynamic binary patterns, which are independent of the time resolution. Finally, boosting learning is performed to construct the expression classifiers, because the feature dimension is high. Compared to previous work, the dynamic binary patterns encode the intrinsic dynamics of expression, and our method makes no assumption about the time resolution of the data. Extensive experiments carried out on the Cohn-Kanade database show the promising performance of the proposed method.

4.1 Motivation

As discussed in Chapter , video-based facial expression recognition has become a trend in recent years, but there are two key practical issues. One is the temporal segmentation of facial expression events from the input video [67] [68]. The other is how to represent the dynamics of the facial expression for recognition. In this chapter, we focus on the latter. Recently, volume features have attracted much attention for capturing the dynamics of actions, including facial events, in which the image sequence is modeled as volumetric data. They have the advantage of coupling temporal dynamics tightly with the spatial appearance. In [], the volume LBP features were proposed for facial expression and achieved much success. Similar features were proposed for video-based face recognition in [69]. The volume haar-like features obtained encouraging performance on pedestrian detection and action analysis in [70]. Similar to the volumetric features, [7] designed an ensemble of haar-like features in the temporal domain and combined them with a coding scheme for facial expression recognition. However, the volumetric features have a prerequisite that the training and the testing data must have the same length and the same speed rate, i.e., the same time resolution.

However, it is hard to satisfy this prerequisite in practical systems. For example, different cameras have different capture frame rates, and they produce videos at different time resolutions. Even when using the same camera, different subjects and different environments also affect the time resolution of facial expressions. Figure 4.1 shows an illustration. All these video sequences represent smile events from the onset to the apex, but they have different time resolutions due to the use of different subjects and cameras. Thus, a time warping strategy should be performed before volume feature extraction can be used in practical applications, but it is inevitable that the recognition performance will be influenced by the time warping operation.

Figure 4.1: Some examples of smile events from different subjects and different cameras.

Therefore, we propose a new framework for video-based facial expression recognition, in which new dynamic binary patterns are developed that are independent of the time resolution. Our method does not need any time warping operation. Figure 4.2 illustrates the structure of the proposed framework.

4.2 Our Contribution

In this section, the representation of the facial appearance is first introduced. Then we discuss how to cluster the intrinsic temporal patterns of the facial expression, and how to map the facial appearance variations into the dynamic binary patterns according to the intrinsic pattern models. Finally the construction of expression classifiers is described.

4.2.1 Haar-like Facial Appearance Representation

Facial expression is conveyed by facial appearance variations, so we should represent facial appearance first. Since the haar-like features have achieved much success in face detection as facial appearance

descriptors [8], they have been introduced into face recognition [7] and facial expression recognition [7]. They are easy and fast to implement, since they involve only addition and subtraction. In this chapter, we also use them to represent facial appearance.

Figure 4.2: The structure of our approach.

There are thousands of haar-like features in one image. We denote H_i = {h_{i,j}}, j = 1, 2, ..., M, as the haar-like features of the image I_i, where the subscript j indexes the jth haar-like feature in I_i, and M is the number of features. For one image sequence with N frames, S = {I_i}, i = 1, 2, ..., N, we extract the haar-like features from each frame I_i, so we get a set of haar-like features, SH = {H_i}, i = 1, 2, ..., N. In SH, given a haar-like feature at position j, we define its temporal variations {h_{i,j}}, i = 1, 2, ..., N, as a dynamic feature unit u_j. The analysis of facial dynamics is based on all the u_j, j = 1, 2, ..., M.

4.2.2 Clustering Intrinsic Temporal Patterns

An expression is a dynamic event, which evolves over time and can be decomposed into the onset, the apex, and the offset. For simplicity, we only take the dynamics of expression from the onset to the apex into account, due to their importance for recognition. We can assume that each expression process is comprised of several intrinsic states (patterns) along the temporal domain. Since it is difficult to make clear definitions of these intrinsic patterns, in this chapter we adopt an alternative scheme to represent

these intrinsic patterns. We cluster each dynamic feature unit u_j into five levels: start, middle(-), middle, middle(+) and apex, according to the variation of its feature values, and use the five-level models of all the feature units to represent the intrinsic patterns of expression. Because each expression has its own intrinsic patterns, without loss of generality, in the following we discuss how to build the five-level models for an expression E.

Given the training data, we perform the K-Means algorithm on each dynamic feature unit u_j, and its five-level clusters are obtained by setting K = 5. We model each of the five clusters as a Gaussian distribution, N_j^k(µ_j^k, σ_j^k), k = 1, 2, ..., 5, where µ and σ represent the mean and the variance respectively. Thus, for the expression E, we obtain an ensemble of the five-level models as in (4.1), which implicitly encodes the intrinsic patterns of the expression. For convenience, we call this ensemble the temporal pattern models of the expression E.

E = { N_1^1(µ_1^1, σ_1^1), N_1^2(µ_1^2, σ_1^2), ..., N_1^5(µ_1^5, σ_1^5);
      N_2^1(µ_2^1, σ_2^1), N_2^2(µ_2^2, σ_2^2), ..., N_2^5(µ_2^5, σ_2^5);
      ...
      N_M^1(µ_M^1, σ_M^1), N_M^2(µ_M^2, σ_M^2), ..., N_M^5(µ_M^5, σ_M^5) }    (4.1)

4.2.3 Dynamic Binary Pattern Mapping

As we mentioned in Section 4.1, in practice the expression sequences we obtain often have different time resolutions for various reasons. In order to handle this issue, we design the dynamic binary patterns to normalize the expression sequences and embed the dynamics of the expression into the feature representation. Given an expression sequence with N frames, {I_i}, i = 1, 2, ..., N, we first extract the haar-like features h_{i,j}, j = 1, 2, ..., M, for each frame I_i. With the help of the temporal pattern models described above, for each haar-like feature h_{i,j} we find the best match among its corresponding five-level Gaussian models and convert it into a five-dimensional binary vector, i.e., h_{i,j} -> b_{i,j} = (v_k), where k = 1, 2, ..., 5 and v_k = 0 or 1. b_{i,j} is computed by the Bayesian rule:

v_k = 1 if k = argmax_c P(h_{i,j} | N_j^c), c = 1, 2, ..., 5;  v_k = 0 otherwise.    (4.2)

where P(h_{i,j} | N_j^c) is the probability of h_{i,j} under the model N_j^c. Thus, for the binary feature b_{i,j}, exactly one dimension is 1 and the other four dimensions are 0; each haar-like feature is projected into one of its corresponding five clusters. We map all the haar-like features to the five-dimensional binary feature vectors for each frame of the sequence. Inspired by the idea in [4], we compute the histogram of all the binary feature vectors over

all the sequences for each feature, and normalize it as:

φ_j = (1/N) Σ_{i=1}^{N} b_{i,j},  j = 1, 2, ..., M.    (4.3)

where φ_j is still a five-dimensional vector. Based on equation (4.3), the sequence is represented by M five-dimensional features φ_j, and φ_j is independent of the time resolution of the sequence. We call φ_j the dynamic binary pattern. As in [], we convert the binary patterns into decimal values and use them to construct the expression classifier. Figure 4.3 shows an example of the dynamic binary pattern.

Figure 4.3: The process of extracting the dynamic binary pattern.
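Putting the pieces of this section together, the sketch below clusters each feature position into five levels with K-Means, fits one Gaussian per level (Eq. (4.1)), maps each frame by the Bayesian rule (Eq. (4.2)), and averages the one-hot assignments over frames (Eq. (4.3)). Pooling all training frames per feature position is a simplification of clustering each dynamic unit, and the library calls and toy data are assumptions.

```python
import numpy as np
from scipy.stats import norm
from sklearn.cluster import KMeans

def fit_temporal_pattern_models(train_seqs, n_levels=5):
    # train_seqs: list of (n_frames_i, M) haar-like feature matrices for one expression.
    # For each feature position j, cluster its values into n_levels groups and fit a
    # 1-D Gaussian per cluster, giving the temporal pattern models of Eq. (4.1).
    all_frames = np.vstack(train_seqs)                      # (total_frames, M)
    M = all_frames.shape[1]
    models = np.zeros((M, n_levels, 2))                     # (mu, sigma) per level
    for j in range(M):
        vals = all_frames[:, j].reshape(-1, 1)
        labels = KMeans(n_clusters=n_levels, n_init=10, random_state=0).fit_predict(vals)
        for k in range(n_levels):
            cluster = vals[labels == k]
            models[j, k] = cluster.mean(), cluster.std() + 1e-6
    return models

def dynamic_binary_pattern(seq, models):
    # seq: (N, M) haar-like features of one sequence of arbitrary length N.
    # Assign each h_{i,j} to its most likely level (Eq. (4.2)) and average the
    # one-hot assignments over frames (Eq. (4.3)); the result does not depend on N.
    N, M = seq.shape
    n_levels = models.shape[1]
    phi = np.zeros((M, n_levels))
    for j in range(M):
        likel = norm.pdf(seq[:, [j]], loc=models[j, :, 0], scale=models[j, :, 1])  # (N, 5)
        best = likel.argmax(axis=1)
        phi[j] = np.bincount(best, minlength=n_levels) / N
    return phi                                              # (M, 5) dynamic binary patterns

# Toy usage: 4 training sequences of varying length, M = 20 features.
rng = np.random.default_rng(1)
train = [rng.normal(size=(n, 20)) for n in (7, 9, 12, 8)]
models = fit_temporal_pattern_models(train)
print(dynamic_binary_pattern(rng.normal(size=(11, 20)), models).shape)  # (20, 5)
```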

4.2.4 Boosting Classifier for Expression Recognition

Any sequence can be represented by the dynamic binary patterns, and the number of dynamic binary patterns is fixed. However, this number is still large, since it is equal to the number of haar-like features in one image. Moreover, for each expression, only a few local facial components show a distinct response, which means only a subset of the dynamic binary patterns is discriminative for expression recognition. It is well known that Adaboost learning is a good tool for selecting good features and combining them into a strong classifier [8]. Therefore we adopt Adaboost to learn a set of discriminant dynamic binary patterns and use them to construct the expression classifier. In this chapter, we take six basic expressions into account, so it is a six-class recognition problem. Since Adaboost is typically used to discriminate two classes, we use the one-against-all strategy to decompose the six-class problem into multiple two-class problems. For each expression, we set its samples as the positive samples, and the samples of the other expressions as the negative samples. Algorithm 5 summarizes the learning algorithm.

Algorithm Learning procedure
1: Given example image sequences (x_1, y_1), ..., (x_n, y_n), with y_i in {1, 0} for the specified expression and the other expressions respectively.
2: Initialize weights D_t(i) = 1/N.
3: Extract the dynamic features from each image sequence.
4: Code the dynamic features based on the corresponding temporal pattern models to get φ_{i,j}, and build one weak classifier on each coded dynamic binary pattern.
5: Use Adaboost to learn the strong classifier H(x_i).

4.3 Experiments

We conducted our experiments on the Cohn-Kanade facial expression database [5], which is widely used to evaluate facial expression recognition algorithms. We randomly selected 60 subjects as the training set, and the rest of the subjects as the testing set. The face is detected automatically by Viola's [8] face detector and normalized as in Tian [4] based on the location of the eyes. In order to evaluate the performance of the dynamic binary patterns, we compare them with the 3D haar-like volume features [70]. For simplicity, we denote our method as DBP and the haar-like volume features as 3D Haar. We also evaluate the robustness of the proposed method when the training samples and the testing samples have different lengths. We adopt two different sampling strategies on the original sequences to simulate this case: one is uniform sampling, and the other is non-uniform sampling. We use the ROC curve as the measurement tool to evaluate the performance, because it is more general and reliable than the recognition rate.

4.3.1 Comparison to 3D Haar-like Features

Similar to the 3D Haar, the DBP integrates the dynamics into the appearance, but unlike the 3D Haar, the DBP is not sensitive to the time resolution. To demonstrate this, we first compare the DBP to the 3D Haar. For a fair comparison, we compare them under the same framework, and the training samples and the testing samples have the same length, because this is a premise of the 3D Haar based method. Since the sequences in the Cohn-Kanade database have different lengths, we slide a fixed-length window over the sequences to produce fixed-length samples. In this experiment we fix the training samples at 7 frames and 9 frames respectively. Figure 4.4 reports the ROC curves of the comparison experiments, and Table 4.1 reports the area under the ROC curves. We can see that the performance of the DBP is better than that of the 3D Haar. There are two reasons: 1) the dynamic binary patterns are encoded based on statistics and the Bayesian rule, so they are robust to some noise; 2) the samples generated from the fixed-length window have different activation speeds, while the DBPs are insensitive to such speed differences.

Table 4.1: The Area under the ROC curves (3D haar-like feature and DBP); rows: Angry, Disgust, Fear, Happiness, Sadness, Surprise; columns: 3D Haar and DBP for 9-frame (xxxxxxxxx) and 7-frame (xxxxxxx) training samples.

We also investigate how much the two methods are affected if different capture rates are used to record the original video. We use different sampling schemes to simulate this case. We take the samples generated with the 7-frame window as the original sequences, and we apply a sampling operator to them to produce the training and testing sets. For simplicity, we denote the original 7-frame sequence as XXXXXXX in the following; X0X0XXX means that we throw away the second and the fourth frames and keep the other five frames. Figure 4.5 shows the ROC curves, and Table 4.2 gives the areas under the ROC curves. We can see that the variance of the 3D Haar results is large, while the DBP results are stable. This implies that different capture rates have a great influence on the performance of volume feature representations. From the experimental results reported in [], we can also see the influence of the volume length on the performance of the volume local binary pattern features.

Table 4.2: The Area under the ROC curves (Different sampling strategies); rows: Angry, Disgust, Fear, Happiness, Sadness, Surprise; columns: 3D Haar and DBP when training on (x0x0x0x), (xx000xx), and (x0x00xx).
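For clarity, the mask notation used here can be read as in the small helper below, where an 'x' keeps a frame and a '0' drops it; the helper name is illustrative.

```python
def apply_template(frames, template):
    # Keep frame t when the template has an 'x' (or 'X') at position t, drop it on '0'.
    # E.g. template "x0x0xxx" keeps frames 0, 2, 4, 5, 6 of a 7-frame sequence,
    # i.e. it throws away the second and the fourth frames.
    assert len(frames) >= len(template)
    return [f for f, m in zip(frames, template) if m.lower() == "x"]

frames = list(range(7))                       # a toy 7-frame sequence, labelled 0..6
print(apply_template(frames, "x0x0x0x"))      # [0, 2, 4, 6]
print(apply_template(frames, "xx000xx"))      # [0, 1, 5, 6]
```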

4.3.2 Robustness Analysis

In the above experiments, we compared the performance of the DBP and the 3D Haar. The DBP has another advantage over the 3D Haar: it places no requirement on the length of the samples. In the following, we analyze its robustness when the training samples and the testing samples have different lengths. We first fix the training samples to the same length, but let the length of the testing samples vary. Table 4.3 reports a group of experimental results where the sampling is uniform and the length of the testing samples varies. Here the testing images are the ones around the apex if the window size is less than 5. Table 4.4 shows the results where the sampling is non-uniform. We can see that our method is insensitive to the length variance of the testing samples. A larger window size performs slightly better, because a larger window captures more of the dynamics of the expressions.

Table 4.3: The Area under the ROC curves (Training on 7(xxxxxxx) frames); columns: Angry, Disgust, Fear, Happiness, Sadness, Surprise; rows: uniform sampling templates of decreasing length (from xxxxxxxxxxxx down to x), plus the mean and standard variance.

We also investigate the case where both the training and the testing samples are variable. We randomly select training samples whose length changes from frames to 5 frames with different sampling schemes. For each original training image sequence, we randomly select one of the 9 templates (xxxxxxxxxxxx, x0x0x0x0x0x0, xxxxxxxxxx, xx00xx00xx, xxxxxxxx, xxxxxxx, x0x0x0x, xxxxx, x0x0x) to create the training image sequences. The testing samples also vary from frames to 5 frames, and are the same as the ones in Table 4.4. Table 4.5 illustrates the results of non-uniform sampling on the testing samples. Figure 4.6 shows the mean and standard variance of the results in Tables 4.3, 4.4 and 4.5. We can see that the performance is still stable when both the training and the testing samples are variable.

Table 4.4: The Area under the ROC curves (Training on 7(xxxxxxx) frames); columns: Angry, Disgust, Fear, Happiness, Sadness, Surprise; rows: non-uniform sampling templates (xxx00xxxx0xx, x0xxx0xxxxx, xxxxx0x, x000xxx, xxxx00x, x0x000x, xx0x, xx0x), plus the mean and standard variance.

Table 4.5: The Area under the ROC curves (Training on randomly selected frames); columns: Angry, Disgust, Fear, Happiness, Sadness, Surprise; rows: the same non-uniform sampling templates as Table 4.4, plus the mean and standard variance.

4.4 Summary

This chapter presented a novel approach for video-based facial expression recognition, in which dynamic binary patterns are developed to represent the dynamics of the expression. Compared to previous work, our method is robust to the time resolution of the expressions. We first extract the haar-like features to represent the facial appearance, and then we perform K-Means clustering to generate the temporal pattern models of the expressions. Based on the temporal pattern models, the haar-like features in the spatio-temporal domain are mapped to the dynamic binary patterns. The expression classifiers are built by Adaboost learning. Experiments on the well-known Cohn-Kanade facial expression database show the power of the proposed method.

Figure 4.4: ROC curves of the six expressions in Table 4.1 (3D Haar and DBP, trained on 7-frame and 9-frame samples).

Figure 4.5: ROC curves of the six expressions in Table 4.2 (3D Haar and DBP under the sampling templates x0x0x0x, xx000xx, and x0x00xx).

Figure 4.6: The mean and variance of the results in Tables 4.3, 4.4 and 4.5.

Chapter 5
Facial Expression Intensity Estimation

Facial expression intensity estimation is very important for understanding human emotion variation, yet only a few works have addressed this issue; most facial expression analysis work has focused only on expression recognition. In this chapter, we introduce a new framework for both expression recognition and intensity estimation based on a ranking model. Although it is hard to obtain a quantitative measurement of expression intensity, it is easy to obtain the ordinal relationship between pairwise samples according to temporal variations. Based on this observation, we convert the problem of intensity estimation into a ranking problem, which can be modeled well by the RankBoost. In order to further improve the performance of the RankBoost, we propose to introduce the sparsity of l1 regularization into it. The ranking score output by the ranking model can be used for expression intensity estimation, and it can also be used for expression recognition. Experiments on the Cohn-Kanade database show that the proposed method has promising performance compared to the state of the art.

5.1 Motivation

In [7], Izard proposed to categorize facial expression into six basic expressions: happiness, sadness, disgust, surprise, anger, and fear. In [7], Ekman and Friesen designed a comprehensive standard to decompose each expression into several specific action units (AUs), i.e., the Facial Action Coding System. These two works can be regarded as the pioneering works of facial expression analysis. The basic goal of a facial expression analyzer is to automatically identify an input facial image or sequence as one of the six basic expressions, and some studies have obtained good performance in special cases. However, simply classifying expression into these six basic categories is insufficient for further understanding human emotion. Recent psychological research has demonstrated that, besides the category of an expression, facial expression dynamics is important when attempting to decipher its meaning [4]. Briefly speaking, expression dynamics can be represented by the expression intensity variation in the temporal domain. Expression intensity estimation has many potential applications in human-robot interaction, patient monitoring, security surveillance and entertainment. For example, expression intensity helps intelligent robots understand human emotion. So far, just a few works have addressed this issue.

The Problems

In FACS, the intensities of AUs are divided into 5 levels: A, B, C, D, E. Figure 5.1 shows the 5 levels of AU 5, which are specified by how far the lips are parted.

Figure 5.1: The example of AU5 at 5 levels (A, B, C, D, E).

However, how to automatically label these 5 levels is still an open problem. In [], Gaussian models are used to model the distributions of the 5 levels of AU 5 based on Locally Linear Embedding (LLE), and the results are illustrated in Figure 5.2. We can see that large overlaps exist among the five levels. These are caused by two main reasons: 1) FACS does not give a quantitative measure between levels; 2) there are variations among different subjects.

Figure 5.2: Intensity over time for AU5. (From [])

Expression intensity estimation has the same issues. Facial expression is a dynamic process from onset, to apex, to offset [5], but the intensities of the apexes vary across subjects. For example, some people show happiness with a smile, while others may laugh loudly. It is also hard to give a quantitative measure of expression intensity in practice. These issues make expression intensity estimation more difficult than expression recognition. In fact, expression intensity estimation is not a hard-decision problem, so conventional classification methods are unsuitable for it. Conventional regression methods also do not work well, because we cannot obtain ground truths of absolute intensities. In addition, expression intensity estimation is coupled with expression recognition, and it depends on which expressions are focused on.

Related Works

Many studies have been proposed for expression recognition, but only a few works have addressed the issue of expression intensity estimation. Some studies attempted to use the displacements of facial landmarks as features for estimating expression intensity. Reilly [][74] performed LLE and Kernel Principal Component Analysis (KPCA) to reduce the feature dimensions, and then SVM was used to train a classifier for different intensity levels. Although small errors were obtained on the training set, the performance on unseen data was not good, because LLE has an out-of-sample problem and it is not easy to obtain robust facial landmarks. In [75], Ke Keung used ISOMAP for feature dimension reduction, and then SVM and a Cascade Neural Network were used to learn the classifier. Similar to [][74], it could not handle unseen data well either. In [76], a Potential Net Model was proposed to describe the deformation process of facial expression and to estimate the degree of expression, but the experimental results were quite limited. In [77], facial feature point tracking, dense flow tracking, and high gradient component analysis were combined to do expression analysis. They claimed they could estimate expression intensity, but only recognition results were evaluated in their experiments, and no experiments on intensity estimation were reported.

Some researchers [9][78] tried to use the outputs of expression classifiers for intensity estimation. [9] used SVM for expression recognition, and the distance output of the SVM, (w^T x + b) / ||w||, is used directly to estimate the intensity of the expression. This is based on the assumption that a large distance to the hyperplane means the corresponding sample has high intensity, which is not actually guaranteed: the hyperplane of the SVM is formed by the support vectors, which have no constraints or relationships to the intensity. Similar to [9], Koelstra [78] used the output of Gentleboost to predict the intensity of AUs, but did not give a quantitative analysis of the predicted results. Both methods took intensity estimation as a direct byproduct of the learned recognition classifier. However, as we mentioned above, intensity estimation is not a simple hard-decision problem, so a conventional classification scheme does not work well. Figure 5.3 shows the distance output of the SVM [9] on an image sequence of the happiness expression from the Cohn-Kanade database. The sequence goes from the onset to the apex, so the expression intensity should increase monotonically, but the output distances of the SVM do not guarantee a monotonic increase of intensity. In this chapter, we also aim to handle expression recognition and intensity estimation together. Different from these works, our framework is based on a ranking model, which focuses on the ordinal relationships between pairwise data. The output of our ranking model closely approximates the intensity variation, as the bottom row of Figure 5.3 illustrates.

Figure 5.3: The intensity change over one sequence, the output of the SVM, and the output of the proposed ranking model.

5. Our Contribution

To handle the above issues, we propose a novel framework for expression recognition and intensity estimation based on a ranking model. Although it is hard to measure absolute intensities of samples, it is easy to organize the training samples with pairwise ordinal relationships, so we can use ranking learning to estimate expression intensity with a relative measure. Ranking learning can also reduce the influence of subject identity through the pairwise data organization. The proposed framework is shown in Figure 5.4. It consists of three components: 1) facial appearance feature representation: we use the haar-like features to represent facial appearance due to their good properties, especially for facial appearance representation [8][79][80]; 2) ordinal pairwise data organization: we make the data suitable for the ranking model by using the temporal variations of the data; 3) building the ranking model: this is the core component. Since there are a large number of haar-like features, we propose to use the RankBoost [8] to select a subset of haar-like features to construct the final strong ranker. In order to further improve the performance of the RankBoost, we integrate the sparsity of l1 regularization into the RankBoost. The final ranking score given by the ranking function H(x) can be used for expression intensity estimation and recognition. The details are addressed in Section 5.4.

The main contributions of this chapter are: (1) we convert the intensity estimation problem to a ranking problem, and we propose to use the RankBoost for expression intensity estimation; (2) in order to further improve the performance of the RankBoost, we propose to integrate the sparsity of l1 regularization into

the RankBoost; (3) the proposed framework can also handle expression recognition besides intensity estimation.

Figure 5.4: The framework of our method.

In this section, we first introduce how to organize the data, and then we present the RankBoost and how to improve it with l1 regularization. Finally we discuss how to use the resulting ranking model for expression recognition.

5. Data Organization

Given an expression sequence, although it is hard to label the intensity of each instance, which could be a single image or a sub-sequence, we can easily label the ordinal relationship between a pair of instances according to their temporal order. In order to separate different expressions, we train the ranking model with the one vs. all strategy, i.e., when we learn the ranking model for expression E_i, we define the rank order of any data within E_i to be higher than that of the data from the other expressions. Without loss of generality, we take the expression E_i as the expression of interest and describe how to organize the data as follows:

Consider a subject with the expression E_i and another expression E_j. We label the intensity of the E_i sequence as decreasing from the apex to the start state, and then connect it to the sequence from the other expression E_j, so that

R(I_{E_i,apex}) > R(I_{E_i,start}) > R(I_{E_j,start}) > R(I_{E_j,apex}),

where R(I) is the ranking score of the instance I. According to this rule, we obtain the reordered sequence set {S_{E_i}}, and based on {S_{E_i}} we build pairwise instances {(x_k, x_{k+1})} for ranking model learning that satisfy R(x_{k+1}) <= R(x_k). We also define the ranking order of R(I_{E_i,start}) of one subject to always be higher than that of R(I_{E_j,start}) of any subject, so as to produce some pairwise samples between different subjects. Figure 5.5 shows an example of the intensity ranking on a happiness sequence.

Figure 5.5: The organization of training data for ranking the intensity of the happy expression. (For the other expressions, the intensity relative to happy is the inverse of the original intensity. The red rectangle has higher intensity than the green rectangle in each pair.)
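As a concrete illustration of this organization, the sketch below builds ordered (lower, higher) pairs for a target expression from sequences ordered onset to apex. It simplifies the cross-subject rule above to "any target-expression frame outranks any frame of another expression", and the pair interval, function name, and toy data are assumptions for illustration.

```python
def build_ordered_pairs(sequences, target, gap=2):
    # sequences: list of (expression_label, [frame_features ordered onset -> apex]).
    # Returns (lower, higher) pairs for the ranking model of `target`:
    #   within a target sequence, a later frame outranks an earlier one;
    #   within a non-target sequence, an earlier frame outranks a later one
    #   (its intensity relative to the target is inverted, as in Figure 5.5);
    #   any target frame outranks any frame of a different expression (simplified rule).
    pairs = []
    for label, frames in sequences:
        for k in range(len(frames) - gap):
            lo, hi = frames[k], frames[k + gap]
            pairs.append((lo, hi) if label == target else (hi, lo))
    target_frames = [f for lbl, seq in sequences if lbl == target for f in seq]
    other_frames = [f for lbl, seq in sequences if lbl != target for f in seq]
    pairs += [(o, t) for t in target_frames for o in other_frames]
    return pairs

# Toy usage with scalar "features" standing in for haar-like feature vectors.
seqs = [("happiness", [0.1, 0.4, 0.8]), ("sadness", [0.2, 0.5, 0.9])]
print(len(build_ordered_pairs(seqs, "happiness", gap=1)))
```

Each returned pair can then be handed to the ranking learner as one ordered example (x_{i,0}, x_{i,1}).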

5.4 Ranking Model

Ranking is widely used in the fields of information retrieval [8] and econometric modeling [8]. For a ranking problem, it is assumed there is an outcome space Y = {r_1, ..., r_q} with ordered ranks r_q > r_{q-1} > ... > r_1, where > denotes the order between different ranks. Generally, a latent continuous function U(x) should be learned to map a sample x to a value r in the space Y. Ordinal regression is a classical ranking method. However, it is not easy to use ordinal regression to learn the mapping function U(x) for expression intensity estimation directly, because it is hard to label each image with an absolute intensity value. In this chapter, we propose to use the RankBoost model for intensity estimation and recognition, which is based on the ordinal relationship between pairwise data, because we can easily organize the data in ordinal pairwise format according to the temporal variation of expression.

5.4.1 RankBoost

In this work, we also use the haar-like features to represent facial appearance, so we have thousands of haar-like features. It is intractable to use all the haar-like features to build the ranking model. Moreover, each expression is dominated by only part of the facial appearance. Thus, we adopt the RankBoost to build the ranking model over the ordinal pairwise data for intensity estimation. Similar to boosting learning, the RankBoost [8] aims to select a set of weak rankers to build a strong ranker. Given pairwise samples {(x_{i,0}, x_{i,1})}, the RankBoost tries to find a ranking function H(x) = Σ_{t=1}^{T} α_t h_t(x), where h_t(x) is a weak ranker, based on the following loss function:

loss(H) = min Σ_{(x_0, x_1)} exp(H(x_0) - H(x_1)) = min Σ_{(x_0, x_1)} exp( Σ_{t=1}^{T} α_t (h_t(x_0) - h_t(x_1)) )    (5.)

The RankBoost additively selects weak rankers by minimizing this exponential loss function, using a greedy optimization strategy. The detailed algorithm is presented in the algorithm below.

Algorithm RankBoost learning procedure
1: Given example image pairs (x_{1,0}, x_{1,1}), ..., (x_{n,0}, x_{n,1}).
2: Initialize weights D_1(i) = 1/N.
3: for t = 1 ... T do
4:   Train the weak learner using distribution D_t(i).
5:   Get a weak ranking h_t : h_t(x) -> R, subject to equation (5.).
6:   Choose α_t in R.
7:   Update: D_{t+1}(x_{i,0}, x_{i,1}) = D_t(x_{i,0}, x_{i,1}) exp(α_t (h_t(x_{i,0}) - h_t(x_{i,1}))) / Z_t, where Z_t is a normalization factor.
8: end for
9: Output the final ranking H(x) = Σ_{t=1}^{T} α_t h_t(x).
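The sketch below is a compact, self-contained version of the RankBoost loop above, using simple threshold rankers on individual features; the weak-ranker form, threshold grid, and choice of α follow a common RankBoost formulation and should be read as assumptions rather than the exact procedure used in this work.

```python
import numpy as np

def rankboost(X0, X1, n_rounds=50):
    # X0, X1: (n_pairs, n_features); row i of X1 should be ranked above row i of X0.
    # Weak ranker: h(x) = 1[x[f] > theta] for a chosen feature f and threshold theta.
    n, d = X0.shape
    D = np.full(n, 1.0 / n)                       # weight per pair
    strong = []                                   # list of (alpha, feature, theta)
    for _ in range(n_rounds):
        best = None
        for f in range(d):
            for theta in np.quantile(np.r_[X0[:, f], X1[:, f]], [0.25, 0.5, 0.75]):
                diff = (X1[:, f] > theta).astype(float) - (X0[:, f] > theta).astype(float)
                r = np.sum(D * diff)              # weighted ranking correlation in [-1, 1]
                if best is None or abs(r) > abs(best[0]):
                    best = (r, f, theta)
        r, f, theta = best
        alpha = 0.5 * np.log((1 + r + 1e-12) / (1 - r + 1e-12))
        strong.append((alpha, f, theta))
        diff = (X1[:, f] > theta).astype(float) - (X0[:, f] > theta).astype(float)
        D *= np.exp(-alpha * diff)                # down-weight correctly ordered pairs
        D /= D.sum()
    return strong

def score(strong, x):
    # The strong ranker H(x) = sum_t alpha_t * h_t(x); larger means higher intensity.
    return sum(a * float(x[f] > th) for a, f, th in strong)

rng = np.random.default_rng(0)
X0 = rng.normal(0.0, 1.0, size=(200, 30))
X1 = X0 + rng.normal(0.5, 0.3, size=(200, 30))   # toy data: "higher" instances shifted up
H = rankboost(X0, X1, n_rounds=20)
print(score(H, X1[0]) > score(H, X0[0]))
```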

5.4.2 RankBoost with l1 Regularization

In supervised learning settings with many input features, overfitting is a potential problem. It is well known that the sample complexity grows linearly with the VC dimension when unregularized discriminative models are fit to the samples via training error minimization. Further, the VC dimension of most models grows roughly linearly in the number of parameters, which typically grows at least linearly in the number of input features. Thus, unless the training set is large enough relative to the input dimension, some special mechanism, such as regularization that encourages the fitted parameters to be small, is usually needed to prevent overfitting [84]. l1 regularization is a good choice due to its sparsity-inducing character; it takes the sum of the absolute values of the parameters as the penalty term. We use l1 regularization to improve the performance of the RankBoost, inspired by [85], and rewrite the loss function of the RankBoost as

loss(H) = min Σ_{(x_0, x_1)} exp( Σ_{t=1}^{T} α_t (h_t(x_0) - h_t(x_1)) )   s.t.  Σ_{t=1}^{T} α_t <= r,  α_t >= 0.    (5.)

We call this the RegRankBoost, and Algorithm 4 summarizes its learning procedure.

Algorithm 4 RegRankBoost learning procedure
1: Given example image pairs (x_{1,0}, x_{1,1}), ..., (x_{n,0}, x_{n,1}), set U_0 = Ø, r_0 = 0.
2: Initialize weights D_1(i) = 1/N.
3: for t = 1 ... T do
4:   Train the weak learner using distribution D_t(i).
5:   Get a weak ranking h_k : h_k(x) -> R.
6:   Choose α_t in R, and update U_t = U_{t-1} ∪ {h_k}, r_t = r_{t-1} + v α_t.
7:   Solve the convex minimization problem over {α_j}, j in U_t:
       min Σ_{i=1}^{m} exp( Σ_{j in U_t} α_j (h_j(x_{i,0}) - h_j(x_{i,1})) )   s.t.  Σ_{j in U_t} α_j <= r_t,  α_j >= 0 for all j in U_t.
8:   Update the coefficients: a_t(j) = α_j if j in U_t; 0 otherwise.
9:   Update: D_{t+1}(x_{i,0}, x_{i,1}) = D_t(x_{i,0}, x_{i,1}) exp(α_t (h_t(x_{i,0}) - h_t(x_{i,1}))) / Z_t, where Z_t is a normalization factor.
10: end for
11: Output the final ranking H(x) = Σ_{t=1}^{T} α_t h_t(x).

5.4.3 Recognition by the Ranking Model

Conventional expression recognition methods generally assume that samples of the expression of interest are disjoint from samples of the other expressions, denoted as two disjoint subsets X_0 and X_1, and then they aim to find a function F such that F(X_1) = +1 and F(X_0) = -1. From the viewpoint of ranking, they can actually be regarded as bipartite ranking methods [8]: the ranks of all the instances in X_1 are above the ranks of all the instances in X_0, and a feedback function Φ is built such that Φ(X_0, X_1) = 1, Φ(X_1, X_0) = -1, and Φ(X, Y) = 0 for all other pairs. Under this strategy, the subtle ranks within the sets X_0 and X_1 are ignored, and the ranking model degrades to a classifier. Based on this interpretation, we can extend our ranking model to expression recognition directly by the following formula:

F(x) = arg max_m H_m(x) = arg max_m Σ_t α_{m,t} h_{m,t}(x)    (5.4)

where {α_{m,t}} are the coefficients of the selected weak rankers in the corresponding ranking function H_m(x), and m is the class label.

5.5 Experiment

Our experiments are conducted on the Cohn-Kanade facial expression database [86]. We randomly select 66% of the subjects as the training set, and the remaining subjects as the testing set. The face is detected automatically by Viola's [8] face detector and normalized as in [4] by fixing the eye locations. Different from [], where just the last three peak frames are used, we use almost all the frames of each sequence, covering the onset of the expression to the apex, because we focus on expression intensity estimation besides expression recognition. To evaluate the proposed method, we compare it to two related works: the SVM based method [9] and the boosting based method [78]. For a fair comparison, all the methods are based on the haar-like features. For the SVM based method, following [9], we first use Adaboost to select some discriminative features and train an SVM with a linear kernel on the selected features. For [78], we use AdaBoost in place of Gentleboost for convenience, because the two boosting methods have similar performance. For simplicity, we denote them as the AdaSVM and the Adaboost respectively. There are six basic expressions, so we use the one-against-all strategy to build six classifiers or ranking models for each method. In [9] the output distance of the SVM is used directly for intensity estimation, and in [78] the output of the boosted classifier is used to predict intensity. Three criteria are used to evaluate the performance: Detection Rate (DR), Recognition Rate (RR) and Relevant Accuracy (RA) [87]. RA is used for evaluating the performance of intensity estimation, and it is defined as:

RA = (number of correctly ranked relevant pairs) / (number of all relevant pairs)    (5.5)

Given a sequence with n frames whose intensity increases monotonically, we can build C_n^2 relevant pairs (x_0, x_1), where the intensity of x_1 is higher than that of x_0. Because the variations between some consecutive images are too subtle to be distinguished, we build the pairwise data along the rebuilt sequences using a fixed frame interval.
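As a small worked illustration of the RA criterion in equation (5.5), the helper below scores one monotonically increasing sequence; the pair interval argument stands in for the (unspecified) frame interval mentioned above.

```python
from itertools import combinations

def relevant_accuracy(scores, interval=1):
    # scores: predicted intensity scores for frames ordered onset -> apex, so the
    # ground-truth intensity of a later frame is higher than that of an earlier one.
    # A pair (i, j) with j - i >= interval is "correctly ranked" if scores[j] > scores[i].
    pairs = [(i, j) for i, j in combinations(range(len(scores)), 2) if j - i >= interval]
    correct = sum(scores[j] > scores[i] for i, j in pairs)
    return correct / len(pairs)

print(relevant_accuracy([0.1, 0.3, 0.2, 0.6, 0.9], interval=2))  # 1.0 on this toy input
```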

5.5.1 Training Error Analysis

Before we report the results on the testing set, we compare the performance of the four methods on the training set by analyzing DR and RA. For SVM and Adaboost, we calculate DR based on all the frames in the positive set, and make sure the false alarm rate is zero. For RankBoost and RegRankBoost, we use the pairwise instances in the training set to calculate DR. For RA, all the methods compute it on the same pairwise instances of the expression of interest. Table 5.1 reports the DRs and RAs. From Table 5.1, we can see that even though the DRs of the AdaSVM and the AdaBoost are very high, their RAs are low. This means that the output distances of the SVM and the AdaBoost cannot precisely describe the intensities. The RankBoost and the RegRankBoost obtain good DRs and RAs, because they are based on ordinal pairwise data and their outputs are directly related to the intensities. In order to further investigate the performance of the RankBoost and the RegRankBoost, we plot their training errors in Figure 5.6, where the error rate is equal to 1 - RA. We can see from Figure 5.6 that the RegRankBoost slightly outperforms the RankBoost. For the RegRankBoost, almost 400 weak rankers are enough to build the final strong ranker, while the RankBoost needs at least 800 weak rankers to achieve similar performance.

Table 5.1: Performances on the training set; rows: Angry, Disgust, Fear, Happiness, Sadness, Surprise, and mean; columns: DR and RA for AdaSVM (linear), AdaBoost, RankBoost, and RegRankBoost.

Table 5.2: Performances on the testing set; rows: Angry, Disgust, Fear, Happiness, Sadness, Surprise, and mean; columns: RR and RA for AdaSVM (linear), AdaBoost, RankBoost, and RegRankBoost.

Figure 5.6: Error rates on the training data (Anger, Disgust, Fear, Happiness, Sadness and Surprise); each panel plots the training error against the number of weak rankers for RankBoost and RegRankBoost.

Figure 5.7: Error rates on the testing data (Anger, Disgust, Fear, Happiness, Sadness and Surprise); each panel plots the testing error against the number of weak rankers for RankBoost and RegRankBoost.

5.5.2 Testing Error Analysis

On the testing data, we use RR and RA to evaluate the performance of the four methods. Because there are six classifiers, we use the maximal normalized distance as the final recognition result for the AdaSVM and the AdaBoost. For the RankBoost and the RegRankBoost, equation (5.4) is used for recognition. The detailed RRs and RAs are shown in Table 5.2. The results on the testing set are similar to those on the training set: the RankBoost and the RegRankBoost are better than the AdaSVM and the AdaBoost. Although all four methods obtain high DRs on the training set, their RRs drop somewhat on the testing set, because there is no overlap between the training set and the testing set. The RRs in our experiments seem not as good as the results in [], but the testing settings are totally different. In [], only the last three peak frames are used for recognition, while we use all the frames from the onset to the apex for recognition and intensity estimation. Especially for the frames around the onset, i.e., the frames with low intensities, it is hard to identify their categories. The RegRankBoost outperforms the RankBoost in both RRs and RAs. In order to further investigate their performance, we also evaluate their testing errors as the number of weak rankers increases, shown in Figure 5.7, where the testing error is equal to 1 - RA. We can see that the RegRankBoost is a little better than the RankBoost. This demonstrates that the introduction of l1 regularization can effectively improve the performance of the RankBoost. Comparing the RAs in Tables 5.1 and 5.2, we can see that RA drops a little for the RankBoost and the RegRankBoost. Besides the non-overlap between the training set and the testing set, some pairs of data are really hard to rank, because their intensity variations are too subtle to discriminate. Figure 5.8 shows some pairs which are wrongly ranked by the ranking model; we can see that they are very similar. Additionally, from Tables 5.1 and 5.2, the Adaboost gets a slightly higher RA on the testing set than on the training set. This shows that the Adaboost does not take intensity values into account during training and cannot be used for intensity estimation.

5.6 Summary

In this chapter, we proposed to use the RankBoost for facial expression recognition and intensity estimation. Different from previous work, we converted the intensity estimation problem to a ranking problem, and the intensity level is scored by the ranking function. We can also use the ranking score for expression recognition. In order to further improve the performance of the RankBoost, the RegRankBoost is proposed, in which l1 regularization is integrated into the RankBoost. Extensive experiments conducted on the Cohn-Kanade facial expression database demonstrated the power of the proposed method.

Figure 5.8: Mis-classified examples.

Chapter 6
Multiple Instance Features

Almost all previous works focused only on designing new algorithms for well-aligned face images. In practice, it is hard to obtain well-aligned face images with current face alignment techniques due to the impact of illumination and pose. In this chapter, we first investigate the influence of misalignment on facial expression recognition, and we propose a new framework for facial expression recognition based on multiple instance features, which is robust to face misalignment. Facial expression is generally dominated by a few parts of the face appearance, so we first divide the face image into image patches. To better capture variations of face appearance, a multi-scale appearance representation scheme is developed. Considering that face images are not well aligned, we propose to use a boosting based multiple instance learning approach to learn discriminant patterns at the patch level, and we take the outputs of the patches as the features to learn the final facial expression classifiers with Adaboost or SVM. The experiments conducted on the Cohn-Kanade database and the MMI database demonstrate that the proposed method achieves promising performance and, in particular, robustness to misaligned face images.

6.1 Motivation

In practice, it is difficult to obtain well-aligned face images automatically with current face alignment techniques, including eye detection [6], the active shape model (ASM) [7] or the active appearance model (AAM) [8], due to the influence of lighting, pose, or other environmental factors. Face misalignment has been demonstrated to rapidly degrade the performance of face recognition [9]. We argue that face misalignment also strongly affects the performance of facial expression recognition, but little of the literature has taken it into account when designing recognition algorithms. We focus on this practical issue in this chapter.

Face misalignment is a very important and realistic issue in practical facial expression recognition systems. To validate the influence of face misalignment, here we report the experiments we did on the Cohn-Kanade facial expression database [5] to give a quantitative evaluation. This database is widely used to evaluate facial expression recognition algorithms, and its detailed description will be given in the experiments section. Also, this database provides the labeled coordinates of the two eyes in the face images, which makes it easy to set up experiments for evaluating face misalignment. We crop

We crop face images based on the eye coordinates. We take the labeled coordinates as the ground truth, and we randomly add perturbations of ±, ±4, ±6 and ±8 pixels to the eye coordinates respectively to simulate misaligned data. For simplicity, we only take the apex images into account. We use / of the apex data as the training set and the remaining / as the testing set. The recognition algorithm follows [], which used Local Binary Patterns (LBP) as the raw features and an SVM as the classifier. Table 6. reports the experimental results. We can see that the recognition accuracy degrades by .6% and 8.0% respectively when there are ±4 and ±6 pixel perturbations in the eye coordinates. When the perturbation reaches ±8 pixels, the degradation reaches .%.

Table 6.: The influence of face misalignment
Misalignment Size    Accuracy (%)
Normalized images
± pixels disturb
±4 pixels disturb
±6 pixels disturb
±8 pixels disturb

Face misalignment is inevitable in real systems, even when face alignment is done manually. Figure 6. shows some examples of face misalignment, aligned by an ASM-based algorithm [88].

Figure 6.: The impact of misalignment on the normalized face.
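As an illustration of this evaluation protocol, the following sketch simulates one misalignment level by perturbing the labeled eye coordinates before cropping. The cropping proportions, the output size of 64 pixels, and the function names are assumptions chosen for illustration, not the exact procedure used in the experiments.

```python
import numpy as np

def perturb_eyes(left_eye, right_eye, max_shift, rng=np.random):
    """Randomly shift each labeled eye coordinate by up to +/-max_shift pixels in x and y."""
    noise = rng.randint(-max_shift, max_shift + 1, size=(2, 2))
    return tuple(np.add(left_eye, noise[0])), tuple(np.add(right_eye, noise[1]))

def crop_by_eyes(image, left_eye, right_eye, out_size=64):
    """Illustrative eye-based cropping: centre a window on the eye midpoint, scale it by
    the inter-ocular distance, and resize (the proportions here are assumptions)."""
    (lx, ly), (rx, ry) = left_eye, right_eye
    cx, cy = (lx + rx) // 2, (ly + ry) // 2
    d = max(abs(rx - lx), 1)                                  # inter-ocular distance
    top, bottom, half_w = cy - int(0.8 * d), cy + int(1.6 * d), int(1.2 * d)
    crop = image[max(top, 0):bottom, max(cx - half_w, 0):cx + half_w]
    ys = np.linspace(0, crop.shape[0] - 1, out_size).astype(int)   # nearest-neighbour
    xs = np.linspace(0, crop.shape[1] - 1, out_size).astype(int)   # resize with numpy only
    return crop[np.ix_(ys, xs)]

# Usage: simulate the +/-4 pixel misalignment level for one image.
# face = ...  # 2-D grayscale array with labeled eye coordinates
# le, re = perturb_eyes((120, 150), (200, 150), max_shift=4)
# noisy_face = crop_by_eyes(face, le, re)
```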

6. Our Contribution

To handle the issue of face misalignment described above, we propose a new facial expression recognition method based on multiple instance learning. Multiple instance learning differs from conventional pattern-based learning in that the training class labels are only associated with sets of samples (or bags) instead of individual samples (or instances) [89]. If a bag is labeled as positive, at least one instance inside the bag is positive, and a bag is labeled as negative if and only if all the instances in the bag are negative. Multiple instance learning has been successfully used to handle object misalignment in human detection [89][90], and [9] extended multiple instance learning to feature selection to handle the feature misalignment that occurs in object detection. Different from general object detection, facial expression recognition is sensitive to local face appearance variation, and face shape and appearance variations are complicated due to the diversity of subjects. We divide the face image into local image patches, and we adopt a multi-scale patch representation to better capture face appearance variation. In each patch, we extract different sub-areas of the same scale at different locations and use the local binary pattern (LBP) to describe the sub-areas. Considering that face images may not be well aligned, we adopt a boosting based multiple instance learning to learn the discriminant patterns at the patch level, in which the patch is taken as a bag and its sub-areas as its instances, because boosting based multiple instance learning has been demonstrated to be robust to object misalignment and feature misalignment in object detection [9][8][90]. The outputs of all the patch-level learners are combined into the multiple instance features used to build the final facial expression classifiers by SVM [9] or Adaboost [9]. The proposed method is evaluated on two benchmarks, the Cohn-Kanade database [5] and the MMI database [49], and the experimental results demonstrate that it is robust to facial misalignment. To the best of our knowledge, we are the first to handle the issue of face misalignment for facial expression recognition. In [9], the authors investigated the influence of misalignment on face recognition and proposed to enrich the training set by adding noised data and learning a subspace from this enlarged training set; simply adding noised data to the training set, however, only reduces the influence of misalignment to some extent.

6.. Face Appearance Representation

Facial expression variation is reflected by face appearance variation, and the variation of a few parts of the face appearance is usually enough to disclose a facial expression. We first divide the face image into several image patches for analysis, similar to [44]. To better capture facial appearance variations, we adopt a multi-scale scheme to partition the face image, which is shown in Figure 6.. The large scale is beneficial for describing global variations, while the small scale is more discriminative for local variations. At level t, we set the local patch size to n_t × n_t, and / overlaps are allowed between the patches. Because it is difficult to make face images perfectly aligned, there may be misalignments between the corresponding local patches of different face images. To handle this misalignment problem, we perform multiple instance learning at the patch level. Different from traditional pattern-based learning, multiple instance learning does not need the labels of individual samples (instances), but it does need the labels of the bags, which contain the instances [8].

If a bag is labeled as negative, all the instances in the bag must be negative. If a bag is labeled as positive, at least one instance in the bag is positive. In our case, although the face images are not well aligned, it is reasonable to assume that the corresponding patches overlap. We regard each image patch as one bag. The bags from the expression of interest are defined as positive bags; otherwise they are defined as negative bags. In each patch, we extract different sub-areas at different locations as its instances, and each instance is described by an LBP feature, because LBP has been successfully used for facial expression recognition in [] [] and the LBP feature is a kind of statistical feature.

Figure 6.: Multi-scale multiple instance feature. The blue rectangle marks the patch, which is regarded as the bag, and the red rectangles inside are the corresponding instances.
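The bag and instance construction described above can be sketched as follows. The patch sizes, the sub-area ratio, the five sub-area locations, and the LBP parameters are assumptions chosen for illustration rather than the exact settings of our experiments.

```python
import numpy as np
from skimage.feature import local_binary_pattern   # used here as the instance descriptor

def lbp_histogram(region, n_points=8, radius=1):
    """Uniform LBP histogram of one sub-area (instance); the LBP parameters are illustrative."""
    codes = local_binary_pattern(region, n_points, radius, method="uniform")
    hist, _ = np.histogram(codes, bins=n_points + 2, range=(0, n_points + 2))
    return hist / max(hist.sum(), 1)

def extract_bags(face, patch_sizes=(16, 24, 32), sub_ratio=0.75):
    """Each multi-scale patch is a bag; its instances are same-scale sub-areas taken at
    different locations inside the patch (sizes and locations are assumptions)."""
    bags = []
    h, w = face.shape
    for n in patch_sizes:                       # one partition level per scale
        step = n // 2                           # half overlap between neighbouring patches
        s = int(sub_ratio * n)                  # size of each sub-area (instance)
        locs = [(0, 0), (0, n - s), (n - s, 0), (n - s, n - s), ((n - s) // 2, (n - s) // 2)]
        for y in range(0, h - n + 1, step):
            for x in range(0, w - n + 1, step):
                patch = face[y:y + n, x:x + n]
                instances = [lbp_histogram(patch[dy:dy + s, dx:dx + s]) for dy, dx in locs]
                bags.append(np.stack(instances))
    return bags                                  # one (num_instances, feature_dim) array per bag
```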

6.. Learn Multiple Instance Features

As described above, we first perform a boosting-based multiple instance learning on each patch, and its output is called a multiple instance feature. The discriminative multiple instance features are then selected from all the patches and used to learn the final facial expression classifier. Assume a bag X_i has N_i instances x_ij, j = 1, ..., N_i. Given a classifier C, the response score of an instance x_ij over C is y_ij = C(x_ij). The probability of an instance x_ij being positive can be evaluated by the logistic function [89] as p_ij = 1 / (1 + e^{-y_ij}). With a Noisy-OR model, which was proposed to harness the diverse density based multiple instance learning in [94], the probability that a bag X_i is positive can be formulated as

p_i = 1 − ∏_{j=1}^{N_i} (1 − p_ij).

The Noisy-OR model means that the probability of the bag being positive is high when the bag includes at least one instance with a high probability of being positive, while the bag is negative when all the instances inside have low probabilities. To avoid numerical issues when N_i is large, the geometric mean is used to modify the above formulation as

p_i = 1 − ∏_{j=1}^{N_i} (1 − p_ij)^{1/N_i}  [9].

Following [9], a multiple instance feature f_i refers to an aggregation function of multiple instances, and there is also a logistic relationship between the bag's probability p_i and f_i: p_i = 1 / (1 + e^{-f_i}). With all the response scores y_ij of the instances, the multiple instance feature f_i is calculated by

f_i = log( ∏_{j=1}^{N_i} (1 + e^{y_ij})^{1/N_i} − 1 ).   (6.)

Thus, to learn an optimal multiple instance feature f_i we need to maximize the likelihood of the bags defined below, which is equivalent to learning the instance classifier C.
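A minimal numerical sketch of the bag probability and the multiple instance feature defined above, assuming the instance scores y_ij have already been produced by a patch-level classifier C:

```python
import numpy as np

def instance_probs(scores):
    """Logistic model per instance: p_ij = 1 / (1 + exp(-y_ij))."""
    return 1.0 / (1.0 + np.exp(-np.asarray(scores, dtype=float)))

def bag_prob(scores):
    """Geometric-mean Noisy-OR over the bag: p_i = 1 - prod_j (1 - p_ij)^(1/N_i)."""
    p = instance_probs(scores)
    return 1.0 - np.prod((1.0 - p) ** (1.0 / len(p)))

def mil_feature(scores):
    """Multiple instance feature (bag log-odds) from equation (6.):
    f_i = log( prod_j (1 + exp(y_ij))^(1/N_i) - 1 ), so that p_i = 1 / (1 + exp(-f_i))."""
    log_prod = np.mean(np.logaddexp(0.0, scores))   # (1/N_i) * sum_j log(1 + exp(y_ij))
    return np.log(np.expm1(log_prod))

scores = np.array([-2.0, 0.5, 3.0])                 # illustrative instance scores y_ij
p_i, f_i = bag_prob(scores), mil_feature(scores)
assert np.isclose(p_i, 1.0 / (1.0 + np.exp(-f_i)))  # the logistic link between p_i and f_i holds
```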

Multiple instance boosting (MILBoost) has been demonstrated to be a good tool for learning such a classifier C with greedy descent searching [95][89]. The samples in the boosting learning are the multiple instances in the bags. The goal is to build the classifier C, whose output, the score of an instance x_ij, is computed as the weighted sum of the scores of the selected weak classifiers h_t(x_ij):

y_ij = C(x_ij) = ∑_{t=1}^{T} λ_t h_t(x_ij),   (6.)

where h_t(x_ij) ∈ {−1, +1} and the weight λ_t > 0. After getting the score of an instance, we can easily calculate its probability by p_ij = 1 / (1 + e^{-y_ij}). The likelihood assigned to the labeled bags is

L(C) = ∏_i p_i^{t_i} (1 − p_i)^{1 − t_i},   (6.)

where t_i ∈ {0, 1}. The purpose of the boosting is to maximize the likelihood L(C), which is equivalent to maximizing the log-likelihood function

log L(C) = log( ∏_i p_i^{t_i} (1 − p_i)^{1 − t_i} ) = ∑_i t_i log p_i + (1 − t_i) log(1 − p_i).   (6.4)

Taking the partial derivative of the log-likelihood function with respect to each instance score, we have

∂ log L(C) / ∂ y_ij = (t_i − p_i) / (N_i p_i) · p_ij.   (6.5)

Let w_ij = ∂ log L(C) / ∂ y_ij, which is regarded as the weight of the instance sample. The bag-level factor (t_i − p_i) / p_i equals (1 − p_i) / p_i for a positive bag and is always −1 for a negative bag. Therefore, in a positive bag the weight of an instance is w+_ij = (1 − p_i) p_ij / (N_i p_i), and in a negative bag the weight of an instance is w−_ij = −p_ij / N_i. The boosting process selects the optimal hypotheses h_t(x) based on the iteratively updated sample weights. In each round, the selection of the optimal hypothesis is summarized by the following algorithm:

Algorithm 5 Optimal Hypothesis Selection
1: Given the weighted instance samples (x_ij, w_ij) from all the bags,
2: search for the optimal weak classifier h(x) ∈ {−1, +1} that maximizes the energy function ψ(h) = ∑_ij w^t_ij h(x_ij).
3: Based on the previously learned classifier C_t(x) = ∑ λ_t h_t(x), search for the optimal step size λ_{t+1} that maximizes log L(C_t(x) + λ_{t+1} h_{t+1}(x)); the classifier is then updated as C_{t+1}(x) = C_t(x) + λ_{t+1} h_{t+1}(x).

A simple stump classifier is used as the weak learner h_t(x), whose output is in {−1, +1}. In each round of the multiple instance learning, we aim to find the optimal weak classifier h_t(x) that maximizes ∑_ij w_ij h_t(x_ij). The coefficient λ_t is decided by a line search that maximizes the log-likelihood function log L(C + λ_t h_t(x)); in our experiments, the line search uses the range [0, 0] with a small fixed step size.
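One round of the MILBoost procedure above can be sketched as follows. The stump pool, the lambda grid for the line search, and the helper names are illustrative assumptions rather than the exact implementation used in this work.

```python
import numpy as np

def bag_probability(scores):
    """p_i from the geometric-mean Noisy-OR (same as in the previous sketch)."""
    p = 1.0 / (1.0 + np.exp(-scores))
    return 1.0 - np.prod((1.0 - p) ** (1.0 / len(scores)))

def mil_weights(bag_scores, labels):
    """Instance weights from equation (6.5): w_ij = (t_i - p_i) / (N_i p_i) * p_ij."""
    weights = []
    for scores, t in zip(bag_scores, labels):
        p_ij = 1.0 / (1.0 + np.exp(-scores))
        p_i = max(bag_probability(scores), 1e-12)
        weights.append((t - p_i) / (len(scores) * p_i) * p_ij)
    return weights

def log_likelihood(bag_scores, labels):
    """log L(C) = sum_i [ t_i log p_i + (1 - t_i) log(1 - p_i) ]."""
    ps = np.clip([bag_probability(s) for s in bag_scores], 1e-12, 1 - 1e-12)
    t = np.asarray(labels, dtype=float)
    return float(np.sum(t * np.log(ps) + (1 - t) * np.log(1 - ps)))

def make_stump(dim, threshold, polarity=1):
    """Decision stump on one feature dimension, returning +/-1 for every instance."""
    return lambda X: polarity * np.where(X[:, dim] > threshold, 1, -1)

def milboost_round(bag_instances, bag_scores, labels, stumps,
                   lambdas=np.arange(0.0, 10.0, 0.1)):
    """One round: pick the stump maximizing sum_ij w_ij h(x_ij), then line-search its
    weight to maximize the bag log-likelihood (the lambda grid is an assumption)."""
    w = mil_weights(bag_scores, labels)
    h = max(stumps, key=lambda s: sum(float(np.dot(wi, s(Xi)))
                                      for wi, Xi in zip(w, bag_instances)))
    candidates = lambda lam: [y + lam * h(Xi) for y, Xi in zip(bag_scores, bag_instances)]
    lam = max(lambdas, key=lambda l: log_likelihood(candidates(l), labels))
    return h, lam, candidates(lam)

# Usage sketch: bag_instances is a list of (N_i, d) arrays, bag_scores a list of
# zero-initialised score vectors, labels in {0, 1}, stumps = [make_stump(j, th), ...].
```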

6.. Learning Facial Expression Classifier

Assume we have M local patches from the multi-scale partition; we can then learn M multiple instance features for each face. Based on these multiple instance features, we use Support Vector Machine (SVM) and Adaboost learning to build the final facial expression classifiers respectively. Since each expression is revealed by the appearance variations of only a few local patches, it is preferable to select the multiple instance features coming from the dominant local patches to build the final classifiers. Adaboost learning has the built-in ability to select discriminant features (weak classifiers) while building a strong classifier; in our experiments, we set the number of discriminant features to 60 during the Adaboost iterations. SVM learning, however, has no feature selection mechanism, so we first perform a feature selection scheme and then train the SVM on the selected features. We calculate the mean x_pos of each multiple instance feature across all positive samples (the expression of interest), and likewise the mean x_neg across the negative samples (the other expressions). The distance between x_pos and x_neg is then used as the measurement to sort all the components, and the top K components are selected through

{k} = argmax_{top K} { | x_{i,pos} − x_{i,neg} | }.   (6.6)

Finally, we select the top K multiple instance features for SVM learning. We set K = 60 in the experiments, and the linear kernel function is used for simplicity.

Facial expression recognition is a typical multi-class recognition problem. We take the six basic expressions into account in this chapter, so it is a six-class problem. We use the one-against-all strategy to decompose the six-class problem into multiple two-class problems: for each expression, we take the corresponding samples as the positive samples and the samples of the other expressions as the negative samples.
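A small sketch of the feature selection and one-against-all relabelling described above; the variable names and the use of a linear SVM from scikit-learn are illustrative assumptions.

```python
import numpy as np

def select_top_k(features, labels, k=60):
    """Equation (6.6): rank multiple instance features by |mean over positives -
    mean over negatives| and keep the indices of the top k components (K = 60 in the text)."""
    features, labels = np.asarray(features, dtype=float), np.asarray(labels)
    gap = np.abs(features[labels == 1].mean(axis=0) - features[labels == 0].mean(axis=0))
    return np.argsort(gap)[::-1][:k]

def one_vs_all_labels(expression_ids, target):
    """One-against-all relabelling: the expression of interest is positive, the rest negative."""
    return (np.asarray(expression_ids) == target).astype(int)

# Usage sketch (names are illustrative): features is an (num_faces, M) array of multiple
# instance features, expression_ids holds the six basic expression labels per face.
# labels = one_vs_all_labels(expression_ids, "happiness")
# keep = select_top_k(features, labels, k=60)
# A linear-kernel SVM (e.g. sklearn.svm.LinearSVC) is then trained on features[:, keep].
```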

6.4 Experiment

In the experiments, two popular benchmarks are used to evaluate the proposed method: the Cohn-Kanade [5] and MMI [] facial expression databases. For both databases, all the face images are cropped and resized to a common size. We divide each image into overlapping patches by sliding square windows of different scales (6, 5, , 40 pixels), where the slide step is half of the current window size, so finally we have 6 local patches for each face. The testing protocol is based on five-fold cross-validation. To better evaluate the performance of the proposed method, we compare it with two baselines: 1) directly training an SVM classifier on the LBP features (Baseline 1); 2) training an Adaboost classifier on the LBP features (Baseline 2).

6.4. Evaluation on the Cohn-Kanade Database

For our experiments on the Cohn-Kanade database, we selected 5 image sequences from 96 subjects. The selection criterion was that a sequence could be labeled as one of the six basic emotions. The Cohn-Kanade database provides the eye coordinates, which we use to crop and resize the faces. To simulate misalignment, we add the same four levels of pixel disturbance (up to ±8 pixels) to the eye coordinates. The experimental results are listed in table 6..

Table 6.: Recognition rates (%) on different misalignment levels with SVM and Adaboost on the Cohn-Kanade database
Misalignment Size    SVM    Adaboost    Baseline 1    Baseline 2
Normalized images
pixels disturb
4 pixels disturb
6 pixels disturb
8 pixels disturb

We can see that the proposed methods obtain almost the same performance on the normalized images and on the images with the smallest disturbance. This is because a disturbance of a few pixels is also inevitable when labeling the eye coordinates manually. Compared to the two baselines, the proposed multiple instance features are clearly more robust to misalignment. Even when the disturbance reaches 6 pixels, the proposed method still performs similarly to the two baselines under the case of no misalignment. Compared under the same conditions, the proposed method improves on the baselines by over 7%.

6.4. Evaluation on the MMI Database

The MMI database includes students and research staff members aged from 19 to 62, of whom 44% are female, having either a European, Asian, or South American ethnic background. In this database, a number of sequences have been labeled with the six basic expressions, of which 05 sequences show a frontal face, and we conduct our experiments on the data from all 05 sequences. MMI is a more challenging database than Cohn-Kanade. First, the subjects make expressions spontaneously, and different people make the same expression in different ways, so there is a large variance within the same expression across subjects. Second, some subjects wear accessories, such as glasses, head cloth, or a moustache. Additionally, in some sequences the apex frames do not have high expression intensity. All these factors degrade the recognition performance. Because the MMI database does not provide the eye coordinates, we use the ASM to do face alignment and crop the faces. Table 6. reports our experimental results.

Table 6.: Recognition rates (%) with SVM and Adaboost on the MMI database
SVM    Adaboost    Baseline 1    Baseline 2

We can see the efficiency of the proposed multiple instance features. With the SVM classifier, the multiple instance features obtain a recognition rate of 68.70%, while the baseline only has a recognition rate of 6.8%. With Adaboost learning, the recognition rates of the proposed method and the baseline are 69.5% and 57.54% respectively. The recognition rate reported in [96] is slightly better than ours on this database, but that work did not use all the frontal face data: it collected data from only 99 selected sequences, while our experimental data comes from all 05 sequences.

Summary

Automatic facial expression recognition has attracted much attention in recent years in the computer vision and pattern recognition communities, and many algorithms have been proposed; in particular, machine learning approaches have made big progress on accurately recognizing facial expressions. However, almost all previous work has focused on designing new algorithms for well-aligned face images. In practice, it is hard to obtain well-aligned face images with current face alignment techniques due to the impact of illumination and pose. In this chapter, we first investigated the influence of face misalignment on facial expression recognition, and then we proposed a new framework for facial expression recognition based on multiple instance features, which is robust to face misalignment. Facial expression is generally dominated by a few parts of the face appearance, so we first divide the face image into image patches. To better capture variations of the face appearance, a multi-scale appearance representation scheme is developed. Considering that face images are not well aligned, we propose to use a boosting based multiple instance learning approach to learn discriminant patterns at the patch level, and we take the outputs of the patches as the features to learn the final facial expression classifiers by Adaboost or SVM. The experiments conducted on the Cohn-Kanade database and the MMI database demonstrate that the proposed method achieves exciting performance, especially in its robustness to misaligned face images.

Chapter 7
Exploring Facial Expressions with Compositional Features

Most previous work focuses on learning discriminative appearance features over the whole face, without considering the fact that each facial expression is physically composed of several related action units (AUs). However, the definition of an AU in the Facial Action Coding System (FACS) is an ambiguous semantic description, which makes accurate AU detection very difficult. In this chapter, we adopt a compromise scheme that avoids explicit AU detection and instead interprets facial expressions by learning compositional appearance features around the AU areas. We first divide the face image into local patches according to the locations of the AUs, and then we extract local appearance features from each patch. A minimum-error based optimization strategy is adopted to build compositional features from the local appearance features, and this process is embedded into a boosting learning structure.

7. Motivation

Two well-known systems for measuring facial expressions are the one developed by Izard [7] and the Facial Action Coding System (FACS) designed by Ekman and Friesen [7]. FACS is composed of comprehensive standards that decompose each expression into several related action units (AUs). Although AU-based facial expression analysis is much more precise than analysis based on the six emotions, the definitions of the AUs are actually ambiguous semantic descriptions. Therefore, it is not easy to perform accurate AU detection automatically. Thus, in the communities of computer vision and pattern recognition, most automatic facial expression analysis work has focused on identifying an input facial image or sequence as one of the six basic emotions [5][57][68][6][][80][7][55][97][6]. Generally, the first step is to extract appearance features, such as Gabor features [4][9], Haar-like features [98], and local binary patterns (LBP) [44]; then learning methods such as SVM [9][44], Adaboost [98][9], and Adaboost + SVM [9] are adopted to select discriminant features over the whole face and build the classifiers. Although these approaches obtain good performance in some cases, the features they consider or select lack physical interpretation. According to FACS [7], each expression has explicit and intuitive local appearance variations, which correspond to AUs. In other words, each expression should be physically represented by some features with specific spatial information. Moreover, it is rare that a single AU appears alone in an expression.

Usually, several AUs appear simultaneously to show a meaningful facial expression. For example, the happiness expression may involve AU6 + AU12 + AU25 (lips part); the surprise expression may involve AU1 (inner brow raiser) + AU2 (outer brow raiser) + AU5 (upper lid raiser) + AU25 + AU27 (mouth stretch); and the sadness expression may involve AU1 + AU4 (brow lowerer) + AU15 + AU17 [7]. Driven by this observation, Yan [99] developed a method for AU recognition based on the dynamic and semantic relationships among AUs. In [99], the relations among AUs, which come from psychologists' observations, are used as a prior to build the structure of a Bayesian network, and a learning strategy is then applied to update the structure. By taking the co-occurrence of AUs into account, Yan obtained very good recognition results. However, as we know, FACS only gives semantic descriptions of the AUs. For example, AU1 means inner brow raiser, and AU4 means brow lowerer. The definition of an AU's level is quite ambiguous, which makes accurate AU detection very difficult in practice. Therefore, it is hard to do low-level expression recognition through AU analysis in a practical system.

Figure 7.: The distribution of AUs on the face is shown in the first row, together with some superimposed sub-windows on the face image. (a) Upper face action units; (b) lower face action units (up/down actions); (c) lower face action units (horizontal actions); (d) lower face action units (oblique actions); (e) lower face action units (orbital actions). Bottom: (A) the row of superimposed sub-windows along the eye area; (B) one row of superimposed sub-windows at the bottom of the face.

7. Our Contribution

To handle the above issues, we propose a compromise scheme in this chapter. Although FACS does not clearly define the level of an AU, it does point out where each AU occurs on the face. Based on the locations of the AUs, we first divide the face into overlapping blocks that cover the locations of all the possible AUs, and we extract local appearance features from each patch with Haar-like descriptors [8]. Inspired by the observation that each expression is composed of several related AUs, we try to build compositional features with constraints to simulate the combination of AUs.

We develop an optimization method to build compositional features by minimizing the error. The procedure of compositional feature searching is integrated into a boosting framework to construct the final classifiers. The proposed method is tested on the Cohn-Kanade database, and the experimental results demonstrate its efficiency and its consistency with FACS.

7. Local Patches and Feature Description

FACS defines the AUs, and each expression is composed of several AUs. Although it is hard to obtain accurate AU detection according to its definition, the location where each AU occurs is clear, so we divide the face image into blocks that cover almost all the AU locations. Assuming an image of size m × m, we superimpose sub-windows (local patches) of size (m/4, m/4) and slide them with a step of m/8 in both the x and y directions to obtain the local patches. In the experiments, the image size is 64 × 64, so we obtain 49 local patches in total. Figure 7. shows the distribution of action units on the face image; we can see that the red rectangles in (A) almost cover the upper AUs and those in (B) cover the orbital AUs. Therefore, we can use the features within a patch to describe the potential information of the AUs inside it.

As described above, we extract 49 patches (sub-windows) from one facial image, and we first extract the Haar-like features from each patch. For convenience, we denote the patches as {P_1, ..., P_49}, and the Haar-like feature set as {f_{p,i}} for each patch P_p, where p is the index of the patch and i is the feature index within the corresponding patch. There are M Haar-like features in each patch. Therefore, for one facial image, we have a feature pool {{f_{1,i}}, ..., {f_{49,i}}}. Based on these raw features, we want to explore compositional features that combine possible subsets of features together, and we expect such combinations to be consistent with the interpretation of FACS. Figure 7. shows an example of a compositional feature composed of three raw features from three different patches. The details of compositional feature pursuit are addressed in the next section.

Figure 7.: An example of a compositional feature.
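The patch layout described above can be reproduced with a few lines; the function name is illustrative, while the window size, step, and the resulting count of 49 patches follow the text.

```python
def au_patch_grid(m=64):
    """Patch layout described above: (m/4, m/4) windows slid with a step of m/8 in both
    the x and y directions over an m x m face image; for m = 64 this yields 49 patches."""
    win, step = m // 4, m // 8
    return [(y, x, win) for y in range(0, m - win + 1, step)
                        for x in range(0, m - win + 1, step)]

assert len(au_patch_grid(64)) == 49   # 7 window positions per axis -> 7 x 7 patches
```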
