FACIAL EXPRESSION RECOGNITION USING ARTIFICIAL NEURAL NETWORKS

M. Gargesha and P. Kuchi
EEE 511 Artificial Neural Computation Systems, Spring 2002
Department of Electrical Engineering, Arizona State University, Tempe, AZ 85287
Instructor: Dr. Kari Torkkola

Abstract

Analysis and recognition of human facial expressions from images and video forms the basis for understanding image content at a higher semantic level. Expression recognition forms the core task of intelligent systems based on human-computer interaction (HCI). In this paper, we explore the use of Artificial Neural Networks in performing expression recognition. We analyze seven basic types of human expressions: neutral, happy, sad, disgust, anger, surprise and fear. The expression recognition task is divided into two steps: an automatic facial feature extraction step, and a classification step that employs Multi-Layer Perceptrons and Radial Basis Function Networks. Simulation results demonstrate an expression classification accuracy of 73% on the JAFFE database (using MLP networks), which is significant for a system whose preliminary feature extraction step is automatic.

1. Introduction

The ability of humans to recognize a wide variety of facial expressions is unparalleled. Researchers in the recent past have been trying to automate this task on a computer, employing a combination of image/video processing techniques along with machine learning techniques such as ANNs. A brief survey of the existing techniques for facial expression analysis is presented next.

Approaches for facial expression analysis from both static images [3,4,5,7,11] and video [1,2,9,10] have been proposed in the literature. Since temporal information yields crucial cues, it is relatively easier to analyze and recognize expressions from a temporal sequence of images (video). Ekman and Friesen [13] have produced a system for describing all visually distinguishable facial movements, called the Facial Action Coding System (FACS), which has been referred to in recent literature [6,8]. It is based on the enumeration of all Action Units (AUs) on a face that cause facial movements; there are 46 such AUs in FACS that account for changes in facial expression. Researchers have used FACS as the basis for their expression recognition research. For example, Lien et al. [6] have developed a computer vision system that specifically recognizes individual AUs or AU combinations in the upper face. Ding et al. [8] note that discovering rules that relate AUs to the expressions associated with specific emotional states (anger, fear, happiness, disgust, surprise and sadness) is difficult, since the relation cannot be defined by any regular mathematical function. This is where neural networks come into play.

Most of the approaches employing neural networks for facial expression recognition involve a preliminary facial feature extraction/tracking step that uses a wide variety of methods, such as the point contour detection method [5] or optical-flow-based tracking [2]. This is followed by an expression classification step in which various features extracted from the faces (e.g., simple geometric coordinates, Gabor jet coefficients [14], etc.) are fed into neural network structures (MLPs, RBFs, Hopfield neural nets, SVMs, etc.). In this paper, we adopt a similar approach: we first automatically extract the facial features that are essential for discriminating between facial expressions. This is an improvement over the manual annotation method for Facial Characteristic Points (FCPs) described in [14]. Then, we feed geometric data derived from the contour points of the features for the given image and its corresponding neutral image, along with Hu-moment invariants [12] describing the shapes of the facial features enclosed by the contour points, to MLP and RBF networks for classification. We assume the availability of a neutral expression image for a given face and classify facial expressions relative to this neutral image.

We use a subset of the Japanese Female Facial Expression (JAFFE) Database [15] to conduct our experiments. The database has 213 images in all, of which we use 115 for our experiments. The images feature 10 different subjects with 7 different expression categories: neutral, happy, sad, disgust, anger, surprise and fear. Some sample images from the database are shown in Figure 1.

Figure 1. Sample images from JAFFE database

Simulation results suggest that MLPs (73%) perform better than RBFs (65%). The reason for the moderate performance is the automatic feature extraction step, which currently extracts only the positions of the eyes, eyebrows and mouth. However, by extracting other feature points on the face automatically, we can improve this performance.

The rest of the paper is organized as follows. We describe the facial feature extraction step in detail in Section 2. Section 3 discusses the data model used to train the ANNs considered for classification purposes. Simulation results are provided in Section 4, which also describes the optimal structure of MLPs and RBFs for solving the given problem and compares their performance for expression classification. Finally, Section 5 concludes the paper and discusses future work.

2. Facial Feature Extraction

The first step to be performed in expression recognition is localization of the facial feature points whose position and deformation are characteristic of each basic expression. We analyze seven basic types of human expressions: neutral, happy, sad, disgust, anger, surprise and fear. For example, happiness is characterized by a larger separation between the left and right corners of the mouth as compared to the upper and lower lips, while the eyebrows and eyes tend to be relaxed. On the other hand, surprise/fear is generally characterized by the mouth being wide open, which means a smaller separation between the left and right corners of the mouth as compared to the upper and lower lips. Also, the eyes tend to be wide open and hence larger in size, and the eyebrows tend to be drawn upwards as compared to their neutral position. Therefore, an accurate extraction of the contours of these facial features would enable us to recognize these expressions automatically.

Facial feature extraction is performed in two stages: (1) search region identification and adaptive thresholding with connected-components analysis to obtain a tighter bounding box, and (2) an accurate contour fitting step involving active contour models (snakes).

2.1. Search region identification

We studied a few sample images from the JAFFE database to derive some heuristics to isolate search regions for the five facial features: left and right eyebrows, left and right eyes, and mouth. Figure 2(a) shows a sample image, and Figure 2(b) shows the search areas for the facial features. Once the search regions were identified in terms of bounding boxes around them, a morphological edge intensity image [5] was obtained by subtracting an eroded image of the search region from a dilated one. A threshold was chosen adaptively based on the intensity distribution in the gradient image and applied to it. The resulting binary image was analyzed for the connected component with the largest area, from which a tighter bounding box was obtained for the facial feature. Figure 2(c) shows the sample image with a tighter bounding box around the facial features.

Figure 2. (a) Sample image (b) Search regions for facial features (c) Tighter bounding box for facial features
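For illustration, the following is a minimal sketch of this two-stage localization in Python with OpenCV, applied to one already-cropped grayscale search region. The function name and the use of Otsu's method are assumptions on our part; the paper derives its threshold adaptively from the gradient-intensity distribution without specifying the rule.

```python
import cv2
import numpy as np

def tight_bounding_box(search_region: np.ndarray):
    """Return (x, y, w, h) of the largest connected component in the
    thresholded morphological edge image of a grayscale search region."""
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3))
    # Morphological edge intensity image: dilated image minus eroded image.
    edges = cv2.subtract(cv2.dilate(search_region, kernel),
                         cv2.erode(search_region, kernel))
    # Adaptive threshold on the edge image (Otsu used here as a stand-in).
    _, binary = cv2.threshold(edges, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # Connected-components analysis; keep the component of largest area.
    n, _, stats, _ = cv2.connectedComponentsWithStats(binary)
    if n <= 1:  # no foreground found; fall back to the whole region
        return 0, 0, search_region.shape[1], search_region.shape[0]
    largest = 1 + int(np.argmax(stats[1:, cv2.CC_STAT_AREA]))  # row 0 = background
    x, y, w, h = stats[largest, :4]
    return int(x), int(y), int(w), int(h)
```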
2.2. Contour Fitting

After an accurate bounding box has been obtained for a facial feature, we employ an adaptive contour fitting technique that uses an Active Contour Model (or "snake") based on Gradient Vector Flow [16], which is an energy-minimizing spline. We initialize a contour by a set of control points and define an energy function. The objective of the contour fitting process is to minimize this energy function, which is achieved by iteratively moving the control points closer to regions of the image that contain edges and maxima or minima in intensity, curvature, etc. After a finite number of iterations, the snake attains a minimum-energy configuration, either shrinking or spreading out to cover the contour (depending on the initial shape). The final shape of the contour is defined by a set of control points, which can be obtained by subsampling the snake control points. These points give us an accurate estimate of the various Facial Characteristic Points (FCPs) [14] that are needed for facial expression recognition.
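The sketch below approximates this fitting step with scikit-image's active_contour, which implements the classic Kass et al. snake rather than the GVF formulation used in the paper [16]; the elliptical initialization, the parameter values, and the function name are illustrative assumptions.

```python
import numpy as np
from skimage.filters import gaussian
from skimage.segmentation import active_contour

def fit_feature_contour(feature_patch, n_points=20):
    """Initialize an ellipse of control points just inside the feature's
    bounding box and iterate it toward edge/intensity extrema."""
    h, w = feature_patch.shape
    t = np.linspace(0, 2 * np.pi, n_points, endpoint=False)
    # Initial control points in (row, col) order, hugging the box border.
    init = np.column_stack([h / 2 + 0.45 * h * np.sin(t),
                            w / 2 + 0.45 * w * np.cos(t)])
    smoothed = gaussian(feature_patch, sigma=2, preserve_range=True)
    snake = active_contour(smoothed, init,
                           alpha=0.015,  # elasticity (contour tension)
                           beta=1.0,     # rigidity (bending penalty)
                           gamma=0.01)   # iteration time step
    # Subsample the converged snake to obtain FCP-like control points.
    return snake[::max(1, len(snake) // 10)]
```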

Figure 3. Image with detected contour points

Figure 3 shows the contour points on the eyes, eyebrows and mouth for the sample image in Figure 2(a).

3. Data Model for the ANN

The next problem to be addressed is the selection of the right kind of data to be fed into the MLP or RBFN for training. We analyzed the geometric coordinates of the contour points of the features for a given image and compared them with those of a neutral image of the same person. We could immediately note that each expression had a characteristic displacement of these points as compared with the neutral image. Also, the inter-feature distances were characteristic of each expression when compared with those of the neutral image. This motivated us to feed the following data into the neural network: (i) the geometric coordinates of the FCPs for the image carrying an expression and the corresponding neutral image; (ii) the Euclidean distances between the corresponding contour points of the given image and the neutral image; and (iii) the differences in inter-feature distances computed for the given and neutral images.

3.1. Hu-Moment Invariants

In addition to the geometric data, we explored the possibility of utilizing shape information conveyed by the contour points to train the MLP / RBF. Moments have been used to code shape information, very effectively so in optical character recognition. During any expression change there is a change in the shape of the facial features, and since moments describe shape, we can use them for expression recognition. In this regard, we compute the Hu-moment invariants [12], which are invariant to scale, rotation and translation. We use the first 3 of the 7 invariant moments as shape descriptors for each of the 5 facial features, for both the given and neutral images. We discard the other 4 moments because with them the neural network would learn the specifics of the facial feature shape associated with a person rather than with the expression. We feed these moments, along with their differences, into the MLP / RBF.
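As a hypothetical sketch, the snippet below assembles items (i) and (ii) plus the first three Hu moments for a single facial feature (the inter-feature distance differences of item (iii) are omitted for brevity). The function and argument names are illustrative; cv2.moments treats the point set as a polygonal contour.

```python
import cv2
import numpy as np

def feature_vector_for_one_feature(expr_pts, neut_pts):
    """expr_pts, neut_pts: (N, 2) arrays of corresponding contour points
    for the expressed and neutral images (hypothetical names)."""
    feats = []
    # (i) geometric coordinates of the FCPs for both images
    feats.extend(expr_pts.ravel())
    feats.extend(neut_pts.ravel())
    # (ii) Euclidean distances between corresponding contour points
    feats.extend(np.linalg.norm(expr_pts - neut_pts, axis=1))
    # First 3 of the 7 Hu moments for each contour (Section 3.1); the
    # remaining 4 are dropped so the network does not learn the
    # person-specific shape of the feature.
    def hu3(pts):
        contour = pts.reshape(-1, 1, 2).astype(np.float32)
        return cv2.HuMoments(cv2.moments(contour)).ravel()[:3]
    h_expr, h_neut = hu3(expr_pts), hu3(neut_pts)
    feats.extend(h_expr)
    feats.extend(h_neut)
    feats.extend(h_expr - h_neut)  # moment differences fed in alongside
    return np.asarray(feats, dtype=np.float64)
```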

3.2. Dimensions of the data

The feature vector for training the RBF / MLP had a dimension of 395, which consisted of the following: 200 geometric coordinates, 50 Euclidean distances between corresponding feature points in the given and neutral images, 100 differences in inter-feature distances, and 45 Hu-moment values. The 45 Hu-moment values are 15 for the expressed face (3 for each feature), 15 for the neutral image, and 15 differences between the moments of the neutral face and the expressed face.

4. Test Results

The JAFFE database [15] consists of 213 images, of which we have used a subset of 115 images covering all 7 expressions from 10 subjects (with repetitions in subjects and expressions), from which data is extracted automatically. For each trial, 105 images were set aside for training and 10 images for testing/cross-validation. We deviated slightly from the classical method of model selection by cross-validation, because we use the same sizes of training/test sets for both model selection and final performance evaluation.

4.1. Model Selection for MLP network

An MLP with an input layer, a single hidden layer and an output layer was used for training. For selecting the optimal structure (number of hidden units and weight decay parameter), the 115 images were randomly partitioned into a training set of 105 images and a cross-validation set of 10 images. The number of hidden units and the weight decay parameter were varied for the MLP. The result of this model selection is depicted in Figure 4.

After the optimal network structure was selected for the MLP, 10 random partitions of the 115 images into 105 training images and 10 test images were performed, and we averaged the network classification performance over these 10 trials. We note that model selection was performed using only 300 epochs, because the error was driven to a sufficiently small value by then. Once the optimal parameters were determined, we used 500 epochs of training while evaluating network performance. Hence, the mean classification rate shown in Table 1 below (73%) is significantly higher than that obtained during model selection (62.5%).

Figure 4. Model selection for MLP:
  NumberHidden = [6 100 200 300 400 500 600]
  decay = [0.05 0.20 0.30 0.40]
  Best classification performance = 62.5%
  Ideal parameter set: NumberHidden = 200, decay = 0.05
  (300 training cycles with the scg algorithm)

  Trial   Correct Classifications   % Correct
    1            8/10                  80
    2            7/10                  70
    3            8/10                  80
    4            7/10                  70
    5            6/10                  60
    6            7/10                  70
    7            7/10                  70
    8            8/10                  80
    9            7/10                  70
   10            8/10                  80
  Mean classification rate = 73%

Table 1. Classification performance of MLP

4.2. Model Selection for RBF network

An RBF network was used with Gaussian basis functions and a linear output activation function. For selecting the optimal structure (number of basis functions and regularization parameter), the 115 images were randomly partitioned into a training set of 105 images and a cross-validation set of 10 images. The number of basis functions (hidden neurons) and the regularization parameter were varied for the RBF. The resulting performance is shown in Figure 5.
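As a rough analogue of the Figure 4 selection loop, the sketch below grids over hidden-unit counts and weight decay values on a 10-image hold-out split. The original work trained Netlab-style MLPs with the scaled conjugate gradient (scg) algorithm, which scikit-learn does not provide, so MLPClassifier with its L2 penalty `alpha` standing in for weight decay is an assumption.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

def select_mlp(X, y,
               hidden_units=(100, 200, 300, 400, 500, 600),
               decays=(0.05, 0.2, 0.3, 0.4)):
    """Grid over hidden-unit counts and weight decay, scored on a
    10-image cross-validation split as in Section 4.1."""
    X_tr, X_cv, y_tr, y_cv = train_test_split(X, y, test_size=10,
                                              random_state=0)
    best_params, best_acc = None, -1.0
    for h in hidden_units:
        for d in decays:
            clf = MLPClassifier(hidden_layer_sizes=(h,),
                                alpha=d,       # L2 weight decay
                                max_iter=300,  # 300 training cycles
                                random_state=0)
            clf.fit(X_tr, y_tr)
            acc = clf.score(X_cv, y_cv)
            if acc > best_acc:
                best_params, best_acc = (h, d), acc
    return best_params, best_acc
```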

After the optimal network structure was selected for the RBF, 10 random partitions of the 115 images into 105 training images and 10 test images were performed, and we averaged the network classification performance over these 10 trials. A mean classification accuracy of 65% was obtained for the RBFN, as shown in Table 2.

Figure 5. Model selection for RBF:
  number_hidden = [50 100 200 300 400 500]
  lambda = [0 0.05 0.1 0.2 0.3 0.4 0.5]
  Best classification performance = 62%
  Ideal parameter set: NumberHidden = 300, regularization = 0.1
  (300 epochs of training)

  Trial   Correct Classifications   % Correct
    1            8/10                  80
    2            7/10                  70
    3            6/10                  60
    4            6/10                  60
    5            6/10                  60
    6            8/10                  80
    7            6/10                  60
    8            6/10                  60
    9            6/10                  60
   10            6/10                  60
  Mean classification rate = 65%

Table 2. Classification performance of RBFN

From the simulation results, we can see that the MLP networks, on the whole, perform better than the RBF networks. Let us analyze the reason for this behavior. In many cases, there are only very subtle variations in the actual geometric and shape data fed into the neural network, so the classes have a high degree of overlap in the output space. Hence, the RBF network using Gaussian basis functions is not able to learn as well as an MLP with a comparable number of hidden-layer neurons.

5. Conclusions and Future Work

An automatic facial expression recognition system was developed that consists of two stages: a feature extraction engine to determine the contours of the facial feature points that are crucial for discriminating between expressions, and an Artificial Neural Network that trains on specific features of the contour points to perform expression classification. We tested the system on the JAFFE database with both MLP and RBF networks. Simulation results show that the MLP (73% classification accuracy) performs better than the RBF (65% classification accuracy).

The reason for the moderate performance is the automatic feature extraction step, which currently extracts only the positions of the eyes, eyebrows and mouth. By extracting other feature points on the face automatically, we can improve this performance. The results can also be improved by exploring more discriminatory and effective image features to be extracted from the facial images. We also need to improve the accuracy of the contour estimation technique, because this is a crucial step for accurate facial expression recognition. Finally, a better training strategy and a better choice of network structure (e.g., an MLP with multiple hidden layers) may yield better classification results. We will explore these aspects as part of our future work.

References

[1] Essa, I.A.; Pentland, A.P., "Facial expression recognition using a dynamic model and motion energy," Fifth International Conference on Computer Vision, 1995, pp. 360-367.
[2] Essa, I.A.; Pentland, A.P., "Coding, analysis, interpretation, and recognition of facial expressions," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 19, No. 7, July 1997, pp. 757-763.
[3] Yoneyama, M.; Ohtake, A.; Iwano, Y.; Shirai, K., "Facial expressions recognition using discrete Hopfield neural networks," Proceedings, International Conference on Image Processing, 1997, Vol. 1, pp. 117-120.
[4] Pantic, M.; Rothkrantz, L.J.M., "An expert system for multiple emotional classification of facial expressions," Proceedings, 11th IEEE International Conference on Tools with Artificial Intelligence, 1999, pp. 113-120.
[5] Jyh-Yeong Chang; Jia-Lin Chen, "A facial expression recognition system using neural networks," IJCNN '99, International Joint Conference on Neural Networks, 1999, Vol. 5, pp. 3511-3516.
[6] Lien, J.J.; Kanade, T.; Cohn, J.F.; Ching-Chung Li, "Automated facial expression recognition based on FACS action units," Proceedings, Third IEEE International Conference on Automatic Face and Gesture Recognition, 1998, pp. 390-395.
[7] Kobayashi, H.; Hara, F., "Recognition of six basic facial expressions and their strength by neural network," Proceedings, IEEE International Workshop on Robot and Human Communication, 1992, pp. 381-386.
[8] Ding, J.; Shimaniura, M.; Kobayashi, H.; Nakamura, T., "Neural network structures for expression recognition," IJCNN '93-Nagoya, Proceedings of the 1993 International Joint Conference on Neural Networks, 1993, Vol. 2, pp. 1430-1433.
[9] Ira Cohen, Nicu Sebe, Larry Chen, Ashutosh Garg, Thomas S. Huang, "Facial expression recognition from video sequences: temporal and static modelling," Beckman Institute for Advanced Science and Technology, University of Illinois at Urbana-Champaign.
[10] Rosenblum, M.; Yacoob, Y.; Davis, L.S., "Human expression recognition from motion using a radial basis function network architecture," IEEE Transactions on Neural Networks, Vol. 7, No. 5, Sept. 1996, pp. 1121-1138.
[11] Iwano, Y.; Yoneyama, M.; Shirai, K., "Recognition of facial expressions using associative memory," Proceedings, IEEE Digital Signal Processing Workshop, 1996, pp. 243-246.
[12] M.-K. Hu, "Visual pattern recognition by moment invariants," IRE Transactions on Information Theory, Vol. IT-8, pp. 179-187, Feb. 1962.
[13] P. Ekman and W.V. Friesen, Facial Action Coding System, Palo Alto, Calif.: Consulting Psychologists Press Inc., 1978.
[14] Zhengyou Zhang; Lyons, M.; Schuster, M.; Akamatsu, S., "Comparison between geometry-based and Gabor-wavelets-based facial expression recognition using multi-layer perceptron," Proceedings, Third IEEE International Conference on Automatic Face and Gesture Recognition, 1998, pp. 454-459.
[15] JAFFE database: http://www.mic.atr.co.jp/~mlyons/jaffe_temp.html. The paper describing the database: Michael J. Lyons, Shigeru Akamatsu, Miyuki Kamachi and Jiro Gyoba, "Coding facial expressions with Gabor wavelets," Proceedings, Third IEEE International Conference on Automatic Face and Gesture Recognition, April 14-16, 1998, Nara, Japan, IEEE Computer Society, pp. 200-205.
[16] Chenyang Xu and Jerry L. Prince, "Snakes, shapes, and gradient vector flow," Image Analysis and Communications Laboratory, Department of Electrical and Computer Engineering, The Johns Hopkins University, Baltimore, MD.