
Beyond Prototypic Expressions: Discriminating Subtle Changes in the Face

January 15, 1999

James Jenn-Jier Lien and Takeo Kanade, Robotics Institute, Carnegie Mellon University
Jeffrey F. Cohn, Department of Psychology, University of Pittsburgh
Ching-Chung Li, Department of Electrical Engineering, University of Pittsburgh

Corresponding author: Jeffrey F. Cohn, Clinical Psychology Program, 4015 O'Hara Street, Pittsburgh, PA 15260, USA. jeffcohn@vms.cis.pitt.edu

Abstract

Current approaches to automated facial expression analysis focus on a small set of prototypic expressions (e.g., joy or anger). Prototypic expressions occur infrequently in everyday life, however, and emotion expression is far more varied. To capture the full range of emotion expression, automated discrimination of fine-grained changes in facial expression is needed. We developed and implemented a computer vision system that is sensitive to subtle changes in the face. Three convergent modules extract feature information and discriminate action units using discriminant analysis or hidden Markov models. FACS action units are the smallest visibly discriminable muscle actions that combine to form expressions. The modules are feature-point tracking, dense-flow tracking with principal components analysis, and high-gradient component detection. In image sequences from 100 adults, action units and action unit combinations in the brow, eye, and mouth regions were selected for analysis if they occurred a minimum of 25 times in the image database. Each module demonstrated high concurrent validity with manual FACS coding.

1 Introduction

Most computer-vision-based approaches to facial expression analysis [e.g., 3,25] attempt to discriminate only a small set of prototypic expressions of emotion. This focus follows from the work of Darwin [10] and, more recently, Ekman [13] and Izard et al. [21], who proposed that basic emotions (i.e., joy, surprise, anger, sadness, fear, and disgust) each have a prototypic facial expression involving changes in facial features in multiple regions of the face, which facilitates analysis. In everyday life, however, prototypic expressions occur relatively infrequently, and emotion more often is communicated by changes in one or two discrete features, such as tightening the lips in anger [5]. Change in isolated features, especially the brows or eyelids, is also typical of paralinguistic displays (such as raising the brows to signal greeting). To capture the subtlety of human emotion and paralinguistic communication, automated discrimination of fine-grained changes in facial expression is needed.

The Facial Action Coding System (FACS) [15] is a human-observer-based system designed to detect subtle changes in facial features. Using FACS and viewing videotaped facial behavior in slow motion, trained observers can manually code all possible facial displays, which are referred to as action units (AUs). More than 7,000 combinations have been observed [12]. Ekman and Friesen [16] proposed that specific combinations of FACS action units represent prototypic expressions of emotion (i.e., joy, sadness, anger, disgust, fear, and surprise). Emotion expressions, however, are not part of FACS; they are coded in separate systems, such as EMFACS [19]. FACS itself is purely descriptive, uses no emotion or other inferential labels, and provides the necessary ground truth with which to describe facial expression.

1.1 Principal components analysis of grayscale images

Several approaches to automated facial feature extraction and recognition have proven promising. One approach, initially developed for face recognition, uses principal components

analysis (PCA) of grayscale face images and artificial neural networks. High-dimensional face images are reduced to a lower-dimensional set of eigenvectors, referred to as eigenfaces [34], which are used as input to an artificial neural network or other classifier. Using this approach, Padgett, Cottrell, and Adolphs [27] discriminated 86% of six prototypic emotion expressions as defined by Ekman (i.e., joy, sadness, anger, disgust, fear, and surprise). Bartlett and colleagues [1] discriminated 89% of six upper face FACS action units.

Although encouraging, these systems have some limitations. First, because Padgett et al. [27] and Bartlett et al. [1] perform PCA on grayscale values, information about individual identity is encoded along with information about expression, which may impair discrimination. Lower-level image processing may be required to produce more robust discrimination of facial displays. Second, eigenfaces appear highly sensitive to minor variation in image alignment for the task of face recognition [28]. Similar or even better precision in image alignment is required when eigenfaces are used to discriminate facial displays. The image alignment used by Padgett et al. [27] and Bartlett et al. [1] was limited to translation and scaling, which is insufficient to align face images across subjects with face rotation. Third, these methods have been tested only on rather limited image data sets. Padgett et al. [27] analyzed photographs from Ekman and Friesen's [14] Pictures of Facial Affect, which are considered prototypic expressions of emotion. Prototypic expressions differ from each other in many ways, which facilitates automated discrimination. Bartlett et al. [1] analyzed images of subjects many of whom were experts in recognizing and performing FACS action units, and target action units occurred individually rather than being embedded within other facial displays. Fourth, Bartlett et al. performed manual time warping to produce a standard set of six pre-selected frames for each subject. Manual time warping is of variable reliability and is time consuming. In many applications

behavior samples are variable in duration, and standardizing duration may omit critical information [7].

1.2 Optical flow in image sequences

More recent research has taken optical-flow-based approaches to discriminating facial displays. Such approaches are based on the assumption that muscle contraction causes deformation of the overlying skin. In a digitized image sequence, algorithms for optical flow extract motion from the subtle texture changes in skin, and the pattern of such movement may be used to discriminate facial displays. Specifically, the magnitude and direction of pixel movement across the entire face, or within windows selected to cover certain facial regions, are computed between successive frames. Using measures of optical flow, Essa and Pentland [17], Mase [25], and Yacoob and Davis [37] discriminated among emotion-specified displays (e.g., joy, surprise, fear). This level of analysis is comparable to the objective of manual methods that are based on prototypic emotion expressions (e.g., AFFEX [21]).

The work of Mase [25] and Essa and Pentland [17] suggested that more subtle changes in facial displays, as represented by FACS action units, could be detected from differential patterns of optical flow. Essa and Pentland [17], for instance, found increased flow associated with action units in the brow and mouth regions. The specificity of optical flow to action unit discrimination, however, was not tested. Discrimination of facial displays remained at the level of emotion expressions rather than the finer and more objective level of FACS action units. Bartlett et al. [1] discriminated between action units in the brow and eye regions in a small number of subjects.

A question about optical-flow-based methods is whether they have sufficient sensitivity to subtle differences in facial displays, as represented in FACS action units. Work to date has used aggregate measures of optical flow within relatively large facial regions (e.g., forehead or

cheeks), including plurality of flow [3,31] and mean flow within the region [25,26]. For recognition, Black and Yacoob [3] and Black, Yacoob, Jepson, and Fleet [4] also disregard subtle changes in flow that fall below an assigned threshold. Information about small deviations is lost when the flow pattern is aggregated or thresholds are imposed. As a result, the accuracy for discriminating FACS action units may be reduced.

1.3 High-gradient component detection

Another approach involves detection of high-gradient components. Facial motion produces transient wrinkles and furrows perpendicular to the motion direction of the activated muscle, and these transient features supply information relevant to the discrimination of action units. Contraction of the corrugator muscle, for instance, produces vertical furrows between the brows, which is coded in FACS as AU 4, while contraction of the medial portion of the frontalis (AU 1) causes horizontal wrinkling in the center of the forehead. Some of these lines and furrows may become permanent with development. Permanent crow's-feet wrinkles around the outside corners of the eyes (characteristic of AU 6) are common in adults but not in infants [39]. When lines and furrows become permanent facial features, contraction of the corresponding muscles produces changes in their appearance, such as deepening or lengthening. Early approaches to high-gradient component detection required artificial enhancement of facial features, such as the use of make-up [32]. Improved algorithms have overcome this limitation [22].

2 Automated Face Analysis

The objective of the present study was to implement the first version of our system for automated face analysis and to assess its concurrent validity with manual FACS coding. The system (Figure 1) includes three convergent modules for facial feature extraction and quantification. One module, feature-point tracking, tracks the movement of pre-selected feature

points within small feature windows (currently 13 by 13 pixels) and imposes no arbitrary thresholds. The feature points to be tracked are selected based on two criteria: they are in regions of high texture, and they represent the underlying muscle activation of closely related action units. In an early study [8], we found that motion in a small number of pre-selected feature points was sufficient to discriminate between closely related action units. The second module, dense-flow tracking, quantifies flow across the entire face. Dense-flow tracking has reduced sensitivity for localized, small motion but extracts information within larger regions. The third module is high-gradient component detection. Facial motion produces transient wrinkles and furrows perpendicular to the motion direction of the activated muscle. The facial motion associated with a furrow produces gray-value change in the face image.

Insert Figure 1 About Here

Because analysis of dynamic images produces more accurate and robust recognition than that of a single static image [2], action units are recognized in the context of entire image sequences of arbitrary length. Discriminant analysis [11] or hidden Markov models (HMMs) [30] are used for facial expression recognition. HMMs perform well in the spatio-temporal domain, robustly resolve the time-warping problem (compared with [1]), and have demonstrated validity in related domains of human performance (e.g., gesture [38] and speech recognition [30]).

Insert Figure 2 About Here

2.1 Facial Action Coding System (FACS)

Unlike previous methods that recover a small set of prototypic expressions, ours recovers FACS action units, which are the smallest visibly discriminable changes in facial expression. In the present study, 15 FACS action units and action unit combinations are discriminated (Figure 2). These action units were selected because of their importance in emotion and paralinguistic

communication. For instance, AU 4 is characteristic of negative emotion and mental effort, and AU 1+2 is a component of surprise. In all three facial regions, the action units chosen are relatively difficult to discriminate because they involve subtle differences in appearance (e.g., brow narrowing due to AU 1+4 versus AU 4, eye narrowing due to AU 6 versus AU 7, three separate action unit combinations involving AU 17, and mouth widening due to AU 12 versus AU 20). Unless otherwise noted, "action units" as used below refers to both single action units and action-unit combinations.

2.2 Image alignment

To remove the effects of spatial variation in face position, slight head rotation, and facial proportions, images are warped automatically to a standard face model using either an affine [22] or a perspective transformation [36,20]. By automatically controlling for face position, orientation, and magnification in this initial processing step, the feature-point displacement, dense-flow, and high-gradient component vectors in each frame have close geometric correspondence. Face position and size are kept constant across subjects so that these variables do not interfere with action unit discrimination.

An example of image alignment by perspective transformation is shown in Figure 3. In the original image sequence (Row A), the subject turns his head to the right while beginning to smile (AU 12), pitches the head up, increases mouth opening (AU 25/26), and narrows the eyes and raises the cheeks (AU 6) and brows (AU 1+2). The intensity difference images in Row B illustrate the confounding of rigid and non-rigid motion in the original image sequence (Row A). Row C shows the perspective transformation of the original image sequence in Row A, and Row D the intensity differences between Row C and the initial frame in Row A. In comparison with the difference images of the original series (Row B), non-rigid motion is no longer confounded by head motion.

Insert Figure 3 About Here
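To make the alignment step concrete, the following is a minimal sketch of warping a frame to a standard face model from a few landmark correspondences, assuming OpenCV and NumPy are available; the landmark choices, coordinates, and output size are illustrative assumptions, not the model or parameters used in the system described here. A perspective warp can be obtained analogously from four point correspondences with cv2.getPerspectiveTransform and cv2.warpPerspective.

    # Sketch of warping a face image to a standard face model (assumes OpenCV and NumPy;
    # the three landmarks and the "standard" target coordinates below are hypothetical).
    import cv2
    import numpy as np

    # Landmarks located in the input frame (e.g., outer eye corners and base of the nose).
    src_pts = np.float32([[122.0, 145.0],   # left outer eye corner
                          [210.0, 148.0],   # right outer eye corner
                          [166.0, 240.0]])  # base of the nose

    # Corresponding locations in the standard face model used for all subjects.
    dst_pts = np.float32([[100.0, 120.0],
                          [220.0, 120.0],
                          [160.0, 230.0]])

    def align_affine(frame, src, dst, size=(320, 320)):
        """Warp frame so the three source landmarks map onto the standard model."""
        M = cv2.getAffineTransform(src, dst)   # 2x3 affine matrix from 3 correspondences
        return cv2.warpAffine(frame, M, size)

    frame = cv2.imread("frame_0001.png", cv2.IMREAD_GRAYSCALE)
    aligned = align_affine(frame, src_pts, dst_pts)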

2.3 Feature-point tracking

In the first frame, 38 feature points are manually marked with a computer mouse: 6 around the contours of the brows, 8 around the eyes, 14 around the nose, and 10 around the mouth. Based on a comparison of feature-point marking by two operators, mean inter-observer error in manual marking is 2.29 and 2.01 pixels in the horizontal and vertical dimensions, respectively. The Pearson correlation for inter-observer reliability is 0.97 and 0.93 in the horizontal and vertical dimensions, respectively. Automated marking of feature points is feasible and has been partially implemented in recent work [33].

The movement of feature points is automatically tracked in the image sequence using the Lucas-Kanade algorithm [24]. Given an n by n feature region R in a gray-scale image I, the algorithm solves for the displacement vector d = (d_x, d_y) of the original n by n feature region by minimizing the residual E(d), defined as

E(d) = Σ_{x ∈ R} [ I_{t+1}(x + d) − I_t(x) ]²

where x = (x, y) is a vector of image coordinates. The Lucas-Kanade algorithm performs the minimization efficiently by robustly optimizing this sum-of-squares difference, and the displacements d_x and d_y are solved with sub-pixel accuracy. The region size used in the algorithm was 13 by 13 pixels. The algorithm was implemented using an iterative image pyramid [29], with which rapid and large displacements of up to 100 pixels (e.g., as found in sudden mouth opening) can be tracked robustly while maintaining sensitivity to subtle (sub-pixel) facial motion. On a 300 MHz Pentium II computer, processing time is approximately 1 second per frame.
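The tracking step can be sketched with the pyramidal Lucas-Kanade implementation available in OpenCV; the window size and pyramid use below follow the values given in the text, but the file names, points, and termination criteria are illustrative placeholders rather than the authors' implementation.

    # Sketch of pyramidal Lucas-Kanade tracking of pre-marked feature points.
    import cv2
    import numpy as np

    def track_points(prev_gray, next_gray, points):
        """Track feature points from prev_gray to next_gray with sub-pixel accuracy.

        points: (N, 1, 2) float32 array of (x, y) coordinates in prev_gray.
        Returns the new coordinates and a per-point status flag.
        """
        new_points, status, _err = cv2.calcOpticalFlowPyrLK(
            prev_gray, next_gray, points, None,
            winSize=(13, 13),        # 13x13 feature window, as in the text
            maxLevel=3,              # image pyramid for large displacements
            criteria=(cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 30, 0.01))
        return new_points, status

    # Example: propagate marked points through a short sequence (file names are placeholders).
    frames = [cv2.imread(f, cv2.IMREAD_GRAYSCALE) for f in ("f00.png", "f01.png", "f02.png")]
    pts = np.float32([[120, 95], [150, 93]]).reshape(-1, 1, 2)  # two illustrative points
    trajectory = [pts]
    for prev, nxt in zip(frames[:-1], frames[1:]):
        pts, ok = track_points(prev, nxt, pts)
        trajectory.append(pts)

    # Displacements relative to the neutral first frame feed the classifiers described below.
    displacements = (trajectory[-1] - trajectory[0]).reshape(-1, 2)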

Insert Figure 4 About Here

The two images on the right in Figure 4 show an example of feature-point tracking. The subject's face changes from neutral (AU 0) to brow raise (AU 1+2), eye widening (AU 5), and jaw drop (AU 26), which is characteristic of surprise. The feature points are precisely tracked across the image sequence. Lines trailing from the feature points represent changes in their location during the image sequence. As the action units become more intense, the feature-point trajectories become longer.

2.4 Dense-Flow Tracking with Principal Component Analysis (PCA)

Feature-point tracking is sensitive to subtle feature motion and tracks large displacements well, but it omits information from the forehead, cheek, and chin regions, which also contain important expression information. To include information from these regions, wavelet basis functions are used to track dense flow across the entire face image [36] (Figure 5).

Insert Figure 5 About Here

Because our goal is to discriminate expression rather than identify individuals or objects, facial motion is analyzed using dense flow rather than gray values. Compared with Bartlett et al. [1], estimates of facial motion by this approach are unaffected by differences between subjects. To ensure that the pixel-wise flows of each frame have close geometric correspondence, flows are automatically warped to the 2-D face model. The high-dimensional dense flows of each frame are compressed by PCA to low-dimensional representations without losing significant characteristics or inter-frame correlation. Horizontal and vertical flows were analyzed separately within the upper (110 by 240 pixels) and lower (220 by 240 pixels) face. In the upper face, the first 10 horizontal and 10 vertical eigenvectors, or eigenflows, accounted for over 90 percent of the variation in the image data. In the lower face, the first 15 horizontal and 15 vertical eigenflows met this criterion.
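As a rough illustration of the eigenflow idea and of the projection step described in the next paragraph, the following NumPy sketch computes principal directions of training flow fields by singular value decomposition and projects a single frame onto them; the array shapes and the random stand-in data are illustrative only, not the system's training data.

    # Sketch of computing "eigenflows" by PCA over dense-flow fields and projecting a frame.
    import numpy as np

    def fit_eigenflows(flows, n_components):
        """flows: (n_frames, H*W) array of, e.g., vertical flow values per frame."""
        mean_flow = flows.mean(axis=0)
        centered = flows - mean_flow
        # Rows of Vt are the principal directions (eigenflows) of the training flows.
        _U, _S, Vt = np.linalg.svd(centered, full_matrices=False)
        return mean_flow, Vt[:n_components]

    def project(flow_frame, mean_flow, eigenflows):
        """Return the low-dimensional weight vector for one flow field."""
        return eigenflows @ (flow_frame - mean_flow)

    # Upper face: 10 vertical weights per frame (horizontal flows handled the same way).
    upper_v = np.random.rand(200, 110 * 240)          # stand-in for training vertical flows
    mean_v, eig_v = fit_eigenflows(upper_v, n_components=10)
    w_vertical = project(upper_v[0], mean_v, eig_v)   # 10-dimensional weight vector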

Each flow-based frame of an expression sequence is projected onto the flow-based eigenspace by taking its inner product with each element of the eigenflow set. This results in two (horizontal and vertical) 10-dimensional weight vectors in the upper face and two 15-dimensional weight vectors in the lower face (Figure 6). Within each region, the horizontal and vertical vectors are concatenated to form a 20- and a 30-dimensional weight vector, respectively, for each frame.

Insert Figure 6 About Here

2.5 High-Gradient Component Analysis

Facial motion produces transient wrinkles and furrows perpendicular to the motion direction of the activated muscle. The facial motion associated with these furrows produces gray-value changes in the face image. A 5 x 5 Gaussian filter is used to smooth the image data. Next, 3 x 5 (row x column) horizontal-line and 5 x 3 vertical-line detectors are used to detect horizontal lines (i.e., high-gradient components in the vertical direction) and vertical lines in the forehead region, respectively; 5 x 5 diagonal-line detectors are used to detect diagonal lines along the nasolabial furrow; and 3 x 3 edge detectors are used to detect high-gradient components around the lips and in the chin region. On a 300 MHz Pentium PC, processing time is approximately four frames per second.

To verify that the high-gradient components are produced by transient skin or feature deformations and not by permanent characteristics of the individual's face, the gradient intensity of each detected high-gradient component in the current frame is compared with corresponding points within a 3 x 3 region of the first frame. If the absolute value of the difference in gradient intensity between these points is higher than a threshold value, it is considered a valid high-gradient component produced by facial expression and the pixel is assigned a value of one. All other high-gradient components are ignored and their pixels are assigned a value of zero (see Figure 7).
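A minimal sketch of this kind of line detection and first-frame comparison, using SciPy and NumPy, is given below; the kernel weights, Gaussian width, and threshold are illustrative assumptions rather than the values used by the system. The block-wise mean and variance statistics described next would then be computed over the resulting binary map.

    # Sketch of high-gradient (furrow) detection: smooth, convolve with a line detector,
    # and keep only responses that differ from the neutral first frame.
    import numpy as np
    from scipy import ndimage

    def horizontal_line_response(gray):
        """Respond to horizontal lines (high gradients in the vertical direction)."""
        smoothed = ndimage.gaussian_filter(gray.astype(float), sigma=1.0)  # ~5x5 Gaussian
        # 3x5 detector: a dark or bright line between contrasting rows (illustrative weights).
        kernel = np.array([[-1, -1, -1, -1, -1],
                           [ 2,  2,  2,  2,  2],
                           [-1, -1, -1, -1, -1]], dtype=float)
        return ndimage.convolve(smoothed, kernel)

    def transient_furrows(first_frame, current_frame, threshold=20.0):
        """Binary map of furrows present in the current frame but not in the neutral frame."""
        r0 = horizontal_line_response(first_frame)
        r1 = horizontal_line_response(current_frame)
        # Compare against the neighborhood maximum of the first frame (3x3 region), so that
        # permanent wrinkles are not counted as expression-related furrows.
        r0_local = ndimage.maximum_filter(np.abs(r0), size=3)
        return (np.abs(r1) - r0_local > threshold).astype(np.uint8)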

Insert Figure 7 About Here

The forehead (within the upper face) and the lower face regions of the normalized face image are each divided into sixteen blocks. The mean value of each block is calculated by dividing the number of pixels having a value of one by the total number of pixels in the block. The variance of each block is also calculated. For upper- and lower-face expression recognition, the mean and variance values are concatenated to form 32-dimensional mean and variance vectors for each frame.

2.6 Action unit recognition

In method development, all data are divided into training and test sets. The upper and lower face regions are analyzed separately using discriminant analysis (feature-point module only) or HMMs (all modules). The two methods produce comparable results. For feature-point tracking, more image sequences have been analyzed by discriminant analysis than by HMM, so results of discriminant analysis are presented. For dense-flow tracking and high-gradient component analysis, results of HMMs are presented.

Discriminant analysis

In the analyses of the brow region, the measurements consist of the horizontal and vertical displacements of the 6 feature points around the brows. In the analyses of the eye region, the measurements consist of the horizontal and vertical displacements of the 8 feature points around the eyes. In analyses of the mouth region, the measurements consist of the horizontal and vertical displacements of the 10 feature points around the mouth and of four points on either side of the nostrils, because of the latter's relevance to the action of AU 9. Each measurement is therefore represented by a 2p-dimensional vector formed by concatenating the p feature-point displacements; that is,

D = (d_1, d_2, ..., d_p) = (d_{1x}, d_{1y}, d_{2x}, d_{2y}, ..., d_{px}, d_{py}).

The discrimination between action units is done by computing and comparing the a posteriori probabilities of the action units; that is,

D → AU_k  if  p(AU_k | D) > p(AU_j | D) for all j ≠ k,

where

p(AU_i | D) = p(D | AU_i) p(AU_i) / p(D),   p(D) = Σ_{j=1}^{k} p(D | AU_j) p(AU_j).

The discriminant function between AU_i and AU_j is therefore the log-likelihood ratio

f_ij(D) = log [ p(AU_i | D) / p(AU_j | D) ] = log [ p(D | AU_i) p(AU_i) / ( p(D | AU_j) p(AU_j) ) ].

The conditional density p(D | AU_i) is assumed to be a multivariate Gaussian distribution N(u_i, Σ_i), where the mean u_i and the covariance matrix Σ_i are estimated by the sample mean and sample covariance matrix of the training data. This discriminant function is a quadratic discriminant function in general, but if the covariance matrices Σ_i and Σ_j are equal, it reduces to a linear discriminant function. Because we are interested in the descriptive power of the feature-point displacement vector itself, rather than relying on other information (e.g., the relative frequencies of action units in our specific samples), the a priori probabilities p(AU_i) are assumed to be equal.
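The following sketch illustrates such a Gaussian (quadratic) discriminant with equal priors over concatenated displacement vectors, using NumPy and SciPy; the AU labels, sample counts, and randomly generated training arrays are placeholders for illustration only, not the study's data.

    # Sketch of the Gaussian (quadratic) discriminant with equal a priori probabilities.
    import numpy as np
    from scipy.stats import multivariate_normal

    def fit_gaussians(training):
        """training: dict mapping AU label -> (n_samples, 2p) displacement matrix."""
        models = {}
        for au, D in training.items():
            mu = D.mean(axis=0)
            cov = np.cov(D, rowvar=False)
            models[au] = multivariate_normal(mean=mu, cov=cov, allow_singular=True)
        return models

    def classify(models, d):
        """Assign d to the AU whose class-conditional log-likelihood is largest."""
        return max(models, key=lambda au: models[au].logpdf(d))

    # Illustrative use with 12-dimensional brow vectors (6 points x 2 displacements).
    rng = np.random.default_rng(0)
    training = {"AU 1+2": rng.normal(2.0, 1.0, (43, 12)),
                "AU 1+4": rng.normal(0.5, 1.0, (19, 12)),
                "AU 4":   rng.normal(-1.0, 1.0, (32, 12))}
    models = fit_gaussians(training)
    print(classify(models, rng.normal(2.0, 1.0, 12)))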

Hidden Markov models

An HMM is defined by the triplet λ = {A, B, π}, where A is the probabilistic N x N state transition matrix, B is the M x N observable-symbol probability matrix with M discrete observable symbols, and π is the N-length initial state probability vector. Vector quantization [23] is used to reduce the multidimensional input vector X_i to a unidimensional set of symbols. Each element of the state transition probability matrix A[N][N] and the observable-symbol probability matrix B[M][N] is initialized with uniformly distributed random values and normalized to meet stochastic constraints. Because a left-to-right HMM is used to model the image sequence, the initial state distribution vector π[N] is 1 for the first state and 0 for all other states after initialization. The Baum-Welch method is used iteratively to re-estimate the parameters λ = (A, B, π). After each iteration, the observable-symbol probability distributions of the model parameters are smoothed and renormalized to meet stochastic constraints. Because the trained HMM represents the most likely AUs and AU combinations, it may be used to evaluate input data O = {o_1, o_2, ..., o_T}, where T is the number of frames in the image sequence. The goal of the HMM is to learn the mapping of observations (quantized feature vectors) onto FACS action units. The generalizability of the HMM is tested by evaluating performance on novel image sequences.
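To illustrate how a trained left-to-right HMM scores a quantized observation sequence, the sketch below implements the scaled forward algorithm in NumPy and selects the action unit model with the highest log-likelihood; the transition and emission matrices and the symbol sequence are placeholders, and Baum-Welch re-estimation is omitted.

    # Sketch of scoring a quantized observation sequence against left-to-right HMMs.
    import numpy as np

    def forward_log_likelihood(A, B, pi, obs):
        """A: (N, N) transitions, B: (N, M) symbol probabilities indexed [state, symbol],
        pi: (N,) initial state probabilities, obs: sequence of vector-quantized symbols.
        Returns log p(obs | model) using the scaled forward algorithm."""
        alpha = pi * B[:, obs[0]]
        log_likelihood = np.log(alpha.sum())
        alpha /= alpha.sum()
        for o in obs[1:]:
            alpha = (alpha @ A) * B[:, o]
            c = alpha.sum()                 # scaling avoids underflow on long sequences
            log_likelihood += np.log(c)
            alpha /= c
        return log_likelihood

    # Illustrative 3-state left-to-right model over 8 vector-quantization symbols.
    A = np.array([[0.8, 0.2, 0.0],
                  [0.0, 0.8, 0.2],
                  [0.0, 0.0, 1.0]])
    B = np.full((3, 8), 1.0 / 8)            # placeholder emission matrix
    pi = np.array([1.0, 0.0, 0.0])          # left-to-right: start in the first state

    models = {"AU 1+2": (A, B, pi)}         # in practice, one trained model per AU
    obs = [0, 1, 1, 3, 5, 5, 7]             # quantized feature vectors for one sequence
    best_au = max(models, key=lambda au: forward_log_likelihood(*models[au], obs))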

3 Experimental Results

Facial behavior was recorded in 100 adults (65% male, 35% female; 85% European-American, 15% African-American or Asian; ages 18 to 35 years). Subjects sat directly in front of the camera and performed a series of facial expressions that included single action units (e.g., AU 12, or smile) and combinations of action units (e.g., AU 1+2, or brow raise). Each expression sequence began from a neutral face. Action units were coded by a lab member certified in the use of FACS. Seventeen percent of the data were comparison coded by a second certified FACS coder. Inter-observer agreement was quantified with coefficient kappa, which is the proportion of agreement above what would be expected to occur by chance [6,18]. The mean kappa for inter-observer agreement was .

Action units that are important to the communication of emotion and that occurred a minimum of 25 times were selected for analysis. The frequency criterion ensured sufficient data for training and testing. When an action unit occurred in combination with other action units that may modify its appearance, the combination rather than the single action unit was the unit of analysis. Figure 2 shows the 15 action units and action unit combinations thus selected.

Insert Tables 1 Through 5 About Here

Fifteen action units were analyzed by feature-point tracking. Nine of the 15 action units were analyzed by dense-flow tracking, and five of these nine were analyzed by high-gradient component detection. Results are presented in Tables 1-5. Feature-point and dense-flow tracking both showed strong concurrent validity with manual FACS coding (kappa coefficients ranged between 0.79 and 0.91). High-gradient component detection showed strong concordance with manual coding in the brow region (κ = 0.82) and moderate concordance in the mouth region (κ = 0.60).

4 Discussion

Previous studies have used optical flow to discriminate facial expression [1,17,37]. Sample sizes in these studies have been small, and, with the exception of Bartlett et al. [1], this work has focused on the recognition of molar expressions, such as positive or negative emotion or emotion prototypes (e.g., joy, surprise, fear). We developed and implemented three convergent modules for feature extraction: dense-flow tracking, feature-point tracking, and high-gradient component detection. All three were sensitive to subtle motion in facial displays. Feature-point tracking was tested in 504 image sequences from 100 subjects and achieved a level of precision that was as high as or higher than that of previous studies and comparable to that of human coders. Dense-flow tracking and high-gradient component detection were tested in a smaller number of sequences but showed comparable accuracy, with the exception of lower accuracy of high-gradient component detection in the mouth region. Accuracy in the test sets was 80% or higher in each region.

The one previous study to demonstrate accuracy for discrete facial actions [1] used extensive manual preprocessing of the image sequences and was limited to upper face action units in only ten subjects. In the present

study, pre-processing was limited to manual marking of feature points with a pointing device, such as a computer mouse, in the initial digitized image; facial behavior included action units in both the upper and lower face; and the large number of subjects, which included African-Americans and Asians in addition to European-Americans, provided a sufficient test of how well the initial training analyses generalized to new image sequences.

Each module demonstrated moderate to high concurrent validity with human FACS coding. The level of agreement between Automated Face Analysis and manual FACS coding was comparable to that achieved between manual FACS coders. Moreover, the inter-method disagreements that did occur were generally ones that are common among human coders, such as the distinctions between AU 1+4 and AU 4 or between AU 6 and AU 7. These findings suggest that Automated Face Analysis and manual FACS coding are at least equivalent for the type of image sequences and action units analyzed here. One reason for the lack of 100% agreement at the level of action units is the inherent subjectivity of human FACS coding, which attenuates the reliability of human FACS codes. Two other possible reasons were the restricted number of features in feature-point tracking and the lack of aggregation across modules. From a psychometric perspective, aggregating the results of the modules can be expected to optimize discrimination accuracy.

In human communication, the timing of a display is an important aspect of its meaning. For example, the duration of a smile is an important factor in distinguishing between felt and false positive emotion. Until now, hypotheses about the temporal organization of emotion displays have been difficult to test. Human observers have difficulty locating precise changes in behavior as well as estimating changes in expression intensity. Automated Face Analysis, by contrast, can track quantitative changes on a frame-by-frame basis [9]. Sub-pixel changes may be measured, and the temporal dynamics of facial expression can be determined.

Two challenges in analyzing facial displays are controlling for rigid head motion and segmenting displays within the stream of behavior. Perspective alignment performs well for mild out-of-plane rotation [20,36], but more complex models may be needed for larger out-of-plane motion. Segmenting displays within the stream of behavior is a focus of current research. The capacity to recover action units from still images may be useful in the event of occlusion.

Central to the development of Automated Face Analysis is what we learn from substantive applications. For instance, we have found that infants have less skin texture than do adults and that rapid and sudden head motion is more common [39]. Initial findings suggest that these factors may be accommodated by increasing resolution, increasing the number of pyramid levels, or adjusting the feature window size. Other applications include assessment of facial nerve dysfunction [35] and studies of emotion expression in relation to the psychophysiology of emotion. Application areas include mental health, forensic psychology (e.g., lie detection), biomedicine, and human-computer interaction.

In summary, Automated Face Analysis demonstrated high concurrent validity with manual FACS coding. In the cross-validation set, which included subjects of mixed ethnicity, average recognition accuracy for 15 action units in the brow, eye, and mouth regions was 81% to 91%. This is comparable to the level of inter-observer agreement achieved in manual FACS coding and represents a significant advance over computer-vision systems that can recognize only a small set of prototypic expressions that vary in many facial regions. With continued development, Automated Face Analysis will greatly reduce or eliminate the need for manual coding in behavioral research and contribute to the development of multi-modal computer interfaces that can understand human facial and paralinguistic behavior and respond appropriately.

5 Acknowledgement

This research was supported by grant number R01 MH51435 from the National Institute of Mental Health to Jeffrey Cohn and Takeo Kanade.

6 References

[1] M.S. Bartlett, P.A. Viola, T.J. Sejnowski, B.A. Golomb, J. Larsen, J.C. Hager, P. Ekman, Classifying facial action, in: D. Touretzky, M. Mozer, M. Hasselmo (Eds.), Advances in Neural Information Processing Systems 8 (1996).
[2] J.N. Bassili, The role of facial movement and the relative importance of upper and lower areas of the face, Journal of Personality and Social Psychology 37 (1979).
[3] M.J. Black, Y. Yacoob, Recognizing facial expressions under rigid and non-rigid facial motions, International Workshop on Automatic Face and Gesture Recognition, Zurich, 1995.
[4] M.J. Black, Y. Yacoob, A.D. Jepson, D.J. Fleet, Learning parameterized models of image motion, Computer Vision and Pattern Recognition.
[5] J.M. Carroll, J.A. Russell, Facial expressions in Hollywood's portrayal of emotion, Journal of Personality and Social Psychology 72 (1997).
[6] J. Cohen, A coefficient of agreement for nominal scales, Educational and Psychological Measurement 20 (1960).
[7] J.F. Cohn, E. Tronick, Three-month-old infants' reaction to simulated maternal depression, Child Development 54 (1983) 185-193.
[8] J.F. Cohn, A. Zlochower, A computerized analysis of facial expression: Feasibility of automated discrimination, American Psychological Society, NY, June 1995.

[9] J.F. Cohn, A. Zlochower, J. Lien, Y.T. Wu, T. Kanade, in: N.H. Frijda (Ed.), Proceedings of the 9th Conference of the International Society for Research on Emotions, Toronto, Canada, August 1996.
[10] C. Darwin, The Expression of the Emotions in Man and Animals, University of Chicago, 1872/1965.
[11] R.O. Duda, P.E. Hart, Pattern Classification and Scene Analysis, NY: Wiley, 1973.
[12] P. Ekman, Methods for measuring facial action, in: K.R. Scherer, P. Ekman (Eds.), Handbook of Methods in Nonverbal Behavior Research, Cambridge: Cambridge University, 1982.
[13] P. Ekman, Facial expression and emotion, American Psychologist 48 (1993).
[14] P. Ekman, W. Friesen, Pictures of Facial Affect, CA: Consulting Psychologists Press.
[15] P. Ekman, W.V. Friesen, Facial Action Coding System, Consulting Psychologists Press, Palo Alto, CA.
[16] P. Ekman, W.V. Friesen, Facial Action Coding System Investigator's Guide, Consulting Psychologists Press, Palo Alto, CA, 1978.
[17] I.A. Essa, A. Pentland, A vision system for observing and extracting facial action parameters, IEEE Computer Vision and Pattern Recognition.
[18] J.L. Fleiss, Statistical Methods for Rates and Proportions, NY: Wiley.
[19] W.V. Friesen, P. Ekman, EMFACS-7: Emotional Facial Action Coding System, unpublished manuscript, University of California at San Francisco.
[20] W. Hua, Building a facial expression analysis system, unpublished manuscript, Department of Computer Science, University of Pittsburgh, April 1998.

[21] C. Izard, L. Dougherty, E. Hembree, A System for Identifying Emotion by Holistic Judgments (AFFEX), Newark, DE: University of Delaware, Instructional Resources Center.
[22] J.J. Lien, T. Kanade, A. Zlochower, J.F. Cohn, C.C. Li, A multi-method approach for discriminating between similar facial expressions, including expression intensity estimation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[23] Y. Linde, A. Buzo, R. Gray, An algorithm for vector quantizer design, IEEE Transactions on Communications COM-28 (1) (1980).
[24] B.D. Lucas, T. Kanade, An iterative image registration technique with an application in stereo vision, Seventh International Joint Conference on Artificial Intelligence, 1981.
[25] K. Mase, Recognition of facial expression from optical flow, IEICE Transactions E74 (1991).
[26] K. Mase, A. Pentland, Automatic lipreading by optical-flow analysis, Systems and Computers in Japan 22 (6) (1991).
[27] C. Padgett, G.W. Cottrell, B. Adolphs, Categorical perception in facial emotion classification, Proceedings of the Cognitive Science Conference 18 (1996, in press).
[28] J. Philips, Face recognition: Where are we now and where are we going? Panel session conducted at the International Conference on Automatic Face and Gesture Recognition, Killington, VT, October 1996.

[29] C.J. Poelman, The paraperspective and projective factorization method for recovering shape and motion, Technical Report CMU-CS, Carnegie Mellon University, Pittsburgh, PA.
[30] L. Rabiner, B.H. Juang, Fundamentals of Speech Recognition, Englewood Cliffs, NJ: Prentice Hall.
[31] M. Rosenblum, Y. Yacoob, L.S. Davis, Human emotion recognition from motion using a radial basis function network architecture, Proceedings of the Workshop on Motion of Non-rigid and Articulated Objects, Austin, TX, November.
[32] D. Terzopoulos, K. Waters, Analysis of facial images using physical and anatomical models, Proceedings of the International Conference on Computer Vision (ICCV), December 1990.
[33] Y. Tian, Summary of the facial expression analysis system, Technical Report, Robotics Institute, Carnegie Mellon University, December.
[34] M. Turk, A. Pentland, Eigenfaces for recognition, Journal of Cognitive Neuroscience 3 (1) (1991).
[35] G.S. Wachtman, J.F. Cohn, J. VanSwearingen, E.K. Manders, Pixel-wise tracking of facial movement by computer image processing, Plastic and Reconstructive Surgery, Pittsburgh, PA, March.
[36] Y.T. Wu, T. Kanade, J.F. Cohn, C.C. Li, Optical flow estimation using wavelet motion model, Proceedings of the International Conference on Computer Vision (ICCV).
[37] Y. Yacoob, L. Davis, Computing spatio-temporal representations of human faces, Proc. Computer Vision and Pattern Recognition, Seattle, WA, June 1994.
[38] J. Yamato, J. Ohya, K. Ishii, Recognizing human action in time-sequential images using hidden Markov models, Proc. ICCV, IEEE Press, 1992.

[39] A.J. Zlochower, J.F. Cohn, J.J. Lien, T. Kanade, Automated face coding: A computer vision based method of facial expression analysis in parent-infant interaction, International Conference on Infant Studies, Atlanta, Georgia, April 1998.

Table 1. Proportion of Agreement Between Automated Face Analysis and Manual Coding in the Brow Region

Manual coding categories: AU 1+2, AU 1+4, AU 4

Feature-Point Tracking: AU 1+2 (43), AU 1+4 (19), AU 4 (32)
Dense-Flow Tracking: AU 1+2 (21), AU 1+4 (15), AU 4 (22)
High-Gradient Component Detection: AU 1+2 (50), AU 1+4 (30), AU 4 (45)

Note. The number of samples of each AU appears in parentheses. Proportion correct = 0.91, 0.92, and 0.88; κ = 0.87, 0.87, and 0.82 for feature-point tracking, dense-flow tracking, and high-gradient component detection, respectively. Results presented for test samples only.

Table 2. Proportion of Agreement Between Feature-Point Tracking and Manual Coding in the Eye Region

Manual coding categories: AU 5, AU 6, AU 7

Feature-Point Tracking: AU 5 (28), AU 6 (33), AU 7 (14)

Note. The number of samples of each AU appears in parentheses. Proportion correct = 0.88, κ = 0.82. Results shown for the test sample only.

Table 3. Proportion of Agreement Between Feature-Point Tracking and Manual Coding in the Mouth Region

Manual coding categories: AU 27, AU 26, AU 25, AU 12, AU 12+25, AU 20+25, AU 15+17, AU 17+23+24, AU 9+17

Feature-Point Tracking: AU 27 (29), AU 26 (20), AU 25 (22), AU 12 (18), AU 12+25 (35), AU 20+25±16 (26), AU 15+17 (36), AU 17+23+24 (12), AU 9+17±25 (17)

Note. The number of samples of each AU appears in parentheses. Proportion correct = 0.81, κ = . Results shown for the test sample only.

Table 4. Proportion of Agreement Between Dense-Flow Tracking and Manual Coding in the Mouth Region

Manual coding categories: AU 12, AU 12+25, AU 20+25, AU 15+17, AU 17+23+24, AU 9+17

Dense-Flow Tracking: AU 12 (15), AU 12+25 (15), AU 20+25±16 (15), AU 15+17 (15), AU 17+23+24 (15), AU 9+17±25 (15)

Note. The number of samples of each AU appears in parentheses. Proportion correct = 0.92, κ = . Results shown for the test sample only.

Table 5. Proportion of Agreement Between High-Gradient Component Detection and Manual Coding in the Mouth Region

Manual coding categories: AU ..., AU 9+17

High-Gradient Component Detection: AU ... (50), AU 9+17 (30)

Note. The number of samples of each AU appears in parentheses. Proportion correct = 0.81, κ = 0.60. Results shown for the test sample only.

Figure Captions

Figure 1. Overview of Automated Face Analysis.

Figure 2. Examples of FACS action units.

Figure 3. Example of perspective alignment. The original image sequence is shown in Row A. The difference between the first and each subsequent frame is shown in Row B. The perspective alignment of the original images is shown in Row C and the corresponding difference images in Row D.

Figure 4. Example of manually located feature points (leftmost image) and results of automated feature-point tracking (two images on the right). The subject's expression changes from neutral (AU 0) to brow raise (AU 1+2) and mouth open (AU 26).

Figure 5. Example of dense-flow tracking (compare with Figure 4).

Figure 6. Computation of the weighted vertical flow vector for the upper face.

Figure 7. Horizontal line detection in the upper face region.

[Figure 1. Overview of Automated Face Analysis: the input image sequence is aligned, then processed by feature-point tracking (displacement vector sequence), dense-flow tracking with principal component analysis (weight vector sequence), and high-gradient detection (gradient distribution and mean-variance vector sequence); vector quantization, hidden Markov models, and maximum-likelihood decision yield action unit recognition and intensity estimation.]

[Figure 2. Examples of FACS action units.
Upper face action units: AU 4, brows lowered and drawn together; AU 1+4, medial portion of the brows raised and pulled together; AU 1+2, inner and outer portions of the brows raised; AU 5, upper eyelids raised; AU 6, cheeks raised and eye opening narrowed; AU 7, lower eyelids raised.
Lower face action units: AU 25, lips relaxed and parted; AU 26, lips relaxed and parted, mandible lowered; AU 27, mouth stretched open and mandible pulled down; AU 12, lip corners pulled obliquely; AU 12+25, AU 12 with mouth opening; AU 20+25, lips parted and pulled back laterally; AU 9+17, infraorbital triangle and center of the upper lip pulled upward and chin boss raised (AU 17); AU 17+23+24, AU 17 with lips tightened, narrowed, and pressed together; AU 15+17, lip corners pulled down and chin raised.]


More information

A Facial Expression Imitation System in Human Robot Interaction

A Facial Expression Imitation System in Human Robot Interaction A Facial Expression Imitation System in Human Robot Interaction S. S. Ge, C. Wang, C. C. Hang Abstract In this paper, we propose an interactive system for reconstructing human facial expression. In the

More information

First Steps Towards Automatic Recognition of Spontaneuous Facial Action Units

First Steps Towards Automatic Recognition of Spontaneuous Facial Action Units First Steps Towards Automatic Recognition of Spontaneuous Facial Action Units UCSD MPLab TR 2001.07 B. Braathen, M. S. Bartlett, G. Littlewort, J. R. Movellan Institute for Neural Computation University

More information

FAST-FACS: A Computer-Assisted System to Increase Speed and Reliability of Manual FACS Coding

FAST-FACS: A Computer-Assisted System to Increase Speed and Reliability of Manual FACS Coding FAST-FACS: A Computer-Assisted System to Increase Speed and Reliability of Manual FACS Coding Fernando De la Torre (1), Tomas Simon (1), Zara Ambadar (2), and Jeffrey F. Cohn (2) 1. Robotics Institute,

More information

Facial Processing Projects at the Intelligent Systems Lab

Facial Processing Projects at the Intelligent Systems Lab Facial Processing Projects at the Intelligent Systems Lab Qiang Ji Intelligent Systems Laboratory (ISL) Department of Electrical, Computer, and System Eng. Rensselaer Polytechnic Institute jiq@rpi.edu

More information

Face image analysis for expression measurement and detection of deceit

Face image analysis for expression measurement and detection of deceit 8 1999 6th Joint Symposium on Neural Computation Proceedings Face image analysis for expression measurement and detection of deceit Marian Stewart Bartlett* Gianluca Donato Javier R. Movellan U.C. San

More information

Facial Expression Analysis

Facial Expression Analysis Facial Expression Analysis Jeff Cohn Fernando De la Torre Human Sensing Laboratory Tutorial Looking @ People June 2012 Facial Expression Analysis F. De la Torre/J. Cohn Looking @ People (CVPR-12) 1 Outline

More information

Facial Expression Recognition Using Non-negative Matrix Factorization

Facial Expression Recognition Using Non-negative Matrix Factorization Facial Expression Recognition Using Non-negative Matrix Factorization Symeon Nikitidis, Anastasios Tefas and Ioannis Pitas Artificial Intelligence & Information Analysis Lab Department of Informatics Aristotle,

More information

Facial Action Coding Using Multiple Visual Cues and a Hierarchy of Particle Filters

Facial Action Coding Using Multiple Visual Cues and a Hierarchy of Particle Filters Facial Action Coding Using Multiple Visual Cues and a Hierarchy of Particle Filters Joel C. McCall and Mohan M. Trivedi Computer Vision and Robotics Research Laboratory University of California, San Diego

More information

Dynamic Facial Expression Recognition Using A Bayesian Temporal Manifold Model

Dynamic Facial Expression Recognition Using A Bayesian Temporal Manifold Model Dynamic Facial Expression Recognition Using A Bayesian Temporal Manifold Model Caifeng Shan, Shaogang Gong, and Peter W. McOwan Department of Computer Science Queen Mary University of London Mile End Road,

More information

Face analysis : identity vs. expressions

Face analysis : identity vs. expressions Face analysis : identity vs. expressions Hugo Mercier 1,2 Patrice Dalle 1 1 IRIT - Université Paul Sabatier 118 Route de Narbonne, F-31062 Toulouse Cedex 9, France 2 Websourd 3, passage André Maurois -

More information

Short Papers. Coding, Analysis, Interpretation, and Recognition of Facial Expressions 1 INTRODUCTION. Irfan A. Essa and Alex P.

Short Papers. Coding, Analysis, Interpretation, and Recognition of Facial Expressions 1 INTRODUCTION. Irfan A. Essa and Alex P. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 19, NO. 7, JULY 1997 757 Short Papers Coding, Analysis, Interpretation, and Recognition of Facial Expressions Irfan A. Essa and Alex

More information

Face detection and recognition. Many slides adapted from K. Grauman and D. Lowe

Face detection and recognition. Many slides adapted from K. Grauman and D. Lowe Face detection and recognition Many slides adapted from K. Grauman and D. Lowe Face detection and recognition Detection Recognition Sally History Early face recognition systems: based on features and distances

More information

Robust Full-Motion Recovery of Head. by Dynamic Templates and Re-registration Techniques

Robust Full-Motion Recovery of Head. by Dynamic Templates and Re-registration Techniques Robust Full-Motion Recovery of Head by Dynamic Templates and Re-registration Techniques Jing Xiao, Tsuyoshi Moriyama, Takeo Kanade Robotics Institute, Carnegie Mellon University {jxiao, tmoriyam, tk}@cs.cmu.edu

More information

Face Detection and Recognition in an Image Sequence using Eigenedginess

Face Detection and Recognition in an Image Sequence using Eigenedginess Face Detection and Recognition in an Image Sequence using Eigenedginess B S Venkatesh, S Palanivel and B Yegnanarayana Department of Computer Science and Engineering. Indian Institute of Technology, Madras

More information

Implementation of Emotion Recognition by Expression with the Help of Matlab Tool

Implementation of Emotion Recognition by Expression with the Help of Matlab Tool Implementation of Emotion Recognition by Expression with the Help of Matlab Tool Harshit Ramteke 1, Abhishek Shrivastava 2 1 M. Tech (4th semester,)department of CSE, DIMAT, Raipur 2 Department of CSE,

More information

A Prototype for Automatic Recognition of Spontaneous Facial Actions

A Prototype for Automatic Recognition of Spontaneous Facial Actions A Prototype for Automatic Recognition of Spontaneous Facial Actions M.S. Bartlett, G. Littlewort, B. Braathen, T.J. Sejnowski, and J.R. Movellan Institute for Neural Computation and Department of Biology

More information

Image Coding with Active Appearance Models

Image Coding with Active Appearance Models Image Coding with Active Appearance Models Simon Baker, Iain Matthews, and Jeff Schneider CMU-RI-TR-03-13 The Robotics Institute Carnegie Mellon University Abstract Image coding is the task of representing

More information

Principal Component Analysis and Neural Network Based Face Recognition

Principal Component Analysis and Neural Network Based Face Recognition Principal Component Analysis and Neural Network Based Face Recognition Qing Jiang Mailbox Abstract People in computer vision and pattern recognition have been working on automatic recognition of human

More information

LOCAL APPEARANCE BASED FACE RECOGNITION USING DISCRETE COSINE TRANSFORM

LOCAL APPEARANCE BASED FACE RECOGNITION USING DISCRETE COSINE TRANSFORM LOCAL APPEARANCE BASED FACE RECOGNITION USING DISCRETE COSINE TRANSFORM Hazim Kemal Ekenel, Rainer Stiefelhagen Interactive Systems Labs, University of Karlsruhe Am Fasanengarten 5, 76131, Karlsruhe, Germany

More information

Research on Emotion Recognition for Facial Expression Images Based on Hidden Markov Model

Research on Emotion Recognition for Facial Expression Images Based on Hidden Markov Model e-issn: 2349-9745 p-issn: 2393-8161 Scientific Journal Impact Factor (SJIF): 1.711 International Journal of Modern Trends in Engineering and Research www.ijmter.com Research on Emotion Recognition for

More information

Occluded Facial Expression Tracking

Occluded Facial Expression Tracking Occluded Facial Expression Tracking Hugo Mercier 1, Julien Peyras 2, and Patrice Dalle 1 1 Institut de Recherche en Informatique de Toulouse 118, route de Narbonne, F-31062 Toulouse Cedex 9 2 Dipartimento

More information

Feature Tracking and Optical Flow

Feature Tracking and Optical Flow Feature Tracking and Optical Flow Prof. D. Stricker Doz. G. Bleser Many slides adapted from James Hays, Derek Hoeim, Lana Lazebnik, Silvio Saverse, who 1 in turn adapted slides from Steve Seitz, Rick Szeliski,

More information

Facial Expression Representation Based on Timing Structures in Faces

Facial Expression Representation Based on Timing Structures in Faces IEEE International Worshop on Analysis and Modeling of Faces and Gestures (W. Zhao, S. Gong, and X. Tang (Eds.): AMFG 2005, LNCS 3723, Springer), pp. 140-154, 2005 Facial Expression Representation Based

More information

EigenTracking: Robust Matching and Tracking of Articulated Objects Using a View-Based Representation

EigenTracking: Robust Matching and Tracking of Articulated Objects Using a View-Based Representation EigenTracking: Robust Matching and Tracking of Articulated Objects Using a View-Based Representation Michael J. Black and Allan D. Jepson Xerox Palo Alto Research Center, 3333 Coyote Hill Road, Palo Alto,

More information

Facial Expression Recognition

Facial Expression Recognition Facial Expression Recognition Kavita S G 1, Surabhi Narayan 2 1 PG Student, Department of Information Science and Engineering, BNM Institute of Technology, Bengaluru, Karnataka, India 2 Prof and Head,

More information

Application of the Fourier-wavelet transform to moving images in an interview scene

Application of the Fourier-wavelet transform to moving images in an interview scene International Journal of Applied Electromagnetics and Mechanics 15 (2001/2002) 359 364 359 IOS Press Application of the Fourier-wavelet transform to moving images in an interview scene Chieko Kato a,,

More information

Detection of asymmetric eye action units in spontaneous videos

Detection of asymmetric eye action units in spontaneous videos Detection of asymmetric eye action units in spontaneous videos The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters. Citation As Published

More information

Automated Individualization of Deformable Eye Region Model and Its Application to Eye Motion Analysis

Automated Individualization of Deformable Eye Region Model and Its Application to Eye Motion Analysis Automated Individualization of Deformable Eye Region Model and Its Application to Eye Motion Analysis Tsuyoshi Moriyama Dept. of Media and Image Technology, Tokyo Polytechnic University 1583 Iiyama, Atsugi,

More information

Linear Discriminant Analysis in Ottoman Alphabet Character Recognition

Linear Discriminant Analysis in Ottoman Alphabet Character Recognition Linear Discriminant Analysis in Ottoman Alphabet Character Recognition ZEYNEB KURT, H. IREM TURKMEN, M. ELIF KARSLIGIL Department of Computer Engineering, Yildiz Technical University, 34349 Besiktas /

More information

Recognizing Action Units for Facial Expression Analysis

Recognizing Action Units for Facial Expression Analysis IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 23, NO. 2, FEBRUARY 2001 97 Recognizing Action Units for Facial Expression Analysis Ying-li Tian, Member, IEEE, Takeo Kanade, Fellow,

More information

Automatic Pixel Selection for Optimizing Facial Expression Recognition using Eigenfaces

Automatic Pixel Selection for Optimizing Facial Expression Recognition using Eigenfaces C. Frank and E. Nöth, DAGM 2003, Springer-Verlag, (pp. 378 385). Automatic Pixel Selection for Optimizing Facial Expression Recognition using Eigenfaces Lehrstuhl für Mustererkennung, Universität Erlangen-Nürnberg,

More information

Adaptive Skin Color Classifier for Face Outline Models

Adaptive Skin Color Classifier for Face Outline Models Adaptive Skin Color Classifier for Face Outline Models M. Wimmer, B. Radig, M. Beetz Informatik IX, Technische Universität München, Germany Boltzmannstr. 3, 87548 Garching, Germany [wimmerm, radig, beetz]@informatik.tu-muenchen.de

More information

Facial Expression Classification Using RBF AND Back-Propagation Neural Networks

Facial Expression Classification Using RBF AND Back-Propagation Neural Networks Facial Expression Classification Using RBF AND Back-Propagation Neural Networks R.Q.Feitosa 1,2, M.M.B.Vellasco 1,2, D.T.Oliveira 1, D.V.Andrade 1, S.A.R.S.Maffra 1 1 Catholic University of Rio de Janeiro,

More information

Coding, Analysis, Interpretation, and Recognition of Facial Expressions

Coding, Analysis, Interpretation, and Recognition of Facial Expressions M.I.T. Media Laboratory Perceptual Computing Section Technical Report No. 325, April 1995, Submitted for Review (April 1995): IEEE Transactions on Pattern Analysis and Machine Intelligence Coding, Analysis,

More information

Dynamic facial expression recognition using a behavioural model

Dynamic facial expression recognition using a behavioural model Dynamic facial expression recognition using a behavioural model Thomas Robin Michel Bierlaire Javier Cruz STRC 2009 10th september The context Recent interest for emotion recognition in transportation

More information

COMPOUND LOCAL BINARY PATTERN (CLBP) FOR PERSON-INDEPENDENT FACIAL EXPRESSION RECOGNITION

COMPOUND LOCAL BINARY PATTERN (CLBP) FOR PERSON-INDEPENDENT FACIAL EXPRESSION RECOGNITION COMPOUND LOCAL BINARY PATTERN (CLBP) FOR PERSON-INDEPENDENT FACIAL EXPRESSION RECOGNITION Priyanka Rani 1, Dr. Deepak Garg 2 1,2 Department of Electronics and Communication, ABES Engineering College, Ghaziabad

More information

Computers and Mathematics with Applications. An embedded system for real-time facial expression recognition based on the extension theory

Computers and Mathematics with Applications. An embedded system for real-time facial expression recognition based on the extension theory Computers and Mathematics with Applications 61 (2011) 2101 2106 Contents lists available at ScienceDirect Computers and Mathematics with Applications journal homepage: www.elsevier.com/locate/camwa An

More information

MULTICLASS SUPPORT VECTOR MACHINES AND METRIC MULTIDIMENSIONAL SCALING FOR FACIAL EXPRESSION RECOGNITION

MULTICLASS SUPPORT VECTOR MACHINES AND METRIC MULTIDIMENSIONAL SCALING FOR FACIAL EXPRESSION RECOGNITION MULTICLASS SUPPORT VECTOR MACHINES AND METRIC MULTIDIMENSIONAL SCALING FOR FACIAL EXPRESSION RECOGNITION Irene Kotsia, Stefanos Zafeiriou, Nikolaos Nikolaidis and Ioannis Pitas Aristotle University of

More information

Face Tracking : An implementation of the Kanade-Lucas-Tomasi Tracking algorithm

Face Tracking : An implementation of the Kanade-Lucas-Tomasi Tracking algorithm Face Tracking : An implementation of the Kanade-Lucas-Tomasi Tracking algorithm Dirk W. Wagener, Ben Herbst Department of Applied Mathematics, University of Stellenbosch, Private Bag X1, Matieland 762,

More information

Static Gesture Recognition with Restricted Boltzmann Machines

Static Gesture Recognition with Restricted Boltzmann Machines Static Gesture Recognition with Restricted Boltzmann Machines Peter O Donovan Department of Computer Science, University of Toronto 6 Kings College Rd, M5S 3G4, Canada odonovan@dgp.toronto.edu Abstract

More information

Visual Tracking (1) Tracking of Feature Points and Planar Rigid Objects

Visual Tracking (1) Tracking of Feature Points and Planar Rigid Objects Intelligent Control Systems Visual Tracking (1) Tracking of Feature Points and Planar Rigid Objects Shingo Kagami Graduate School of Information Sciences, Tohoku University swk(at)ic.is.tohoku.ac.jp http://www.ic.is.tohoku.ac.jp/ja/swk/

More information

An Approach for Face Recognition System Using Convolutional Neural Network and Extracted Geometric Features

An Approach for Face Recognition System Using Convolutional Neural Network and Extracted Geometric Features An Approach for Face Recognition System Using Convolutional Neural Network and Extracted Geometric Features C.R Vimalchand Research Scholar, Associate Professor, Department of Computer Science, S.N.R Sons

More information

International Journal of Scientific & Engineering Research, Volume 7, Issue 2, February-2016 ISSN

International Journal of Scientific & Engineering Research, Volume 7, Issue 2, February-2016 ISSN ISSN 2229-5518 538 Facial Expression Detection Using FACS Vinaya Tinguria, Sampada Borukar y, Aarti Awathare z,prof. K.T.V. Talele x Sardar Patel Institute of Technology, Mumbai-400058, India, vinaya.t20@gmail.com

More information

3-D head pose estimation from video by nonlinear stochastic particle filtering

3-D head pose estimation from video by nonlinear stochastic particle filtering 3-D head pose estimation from video by nonlinear stochastic particle filtering Bjørn Braathen bjorn@inc.ucsd.edu Gwen Littlewort-Ford gwen@inc.ucsd.edu Marian Stewart Bartlett marni@inc.ucsd.edu Javier

More information

A HYBRID APPROACH BASED ON PCA AND LBP FOR FACIAL EXPRESSION ANALYSIS

A HYBRID APPROACH BASED ON PCA AND LBP FOR FACIAL EXPRESSION ANALYSIS A HYBRID APPROACH BASED ON PCA AND LBP FOR FACIAL EXPRESSION ANALYSIS K. Sasikumar 1, P. A. Ashija 2, M. Jagannath 2, K. Adalarasu 3 and N. Nathiya 4 1 School of Electronics Engineering, VIT University,

More information

Exploring Bag of Words Architectures in the Facial Expression Domain

Exploring Bag of Words Architectures in the Facial Expression Domain Exploring Bag of Words Architectures in the Facial Expression Domain Karan Sikka, Tingfan Wu, Josh Susskind, and Marian Bartlett Machine Perception Laboratory, University of California San Diego {ksikka,ting,josh,marni}@mplab.ucsd.edu

More information

FACIAL expressions in spontaneous interactions are often

FACIAL expressions in spontaneous interactions are often IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 27, NO. 5, MAY 2005 1 Active and Dynamic Information Fusion for Facial Expression Understanding from Image Sequences Yongmian Zhang,

More information

Computing Spatio-Temporal Representations of Human Faces. Yaser Yacoob & Larry Davis. University of Maryland. College Park, MD 20742

Computing Spatio-Temporal Representations of Human Faces. Yaser Yacoob & Larry Davis. University of Maryland. College Park, MD 20742 Computing Spatio-Temporal Representations of Human Faces Yaser Yacoob & Larry Davis Computer Vision Laboratory University of Maryland College Park, MD 20742 Abstract An approach for analysis and representation

More information

Classifying Facial Gestures in Presence of Head Motion

Classifying Facial Gestures in Presence of Head Motion Classifying Facial Gestures in Presence of Head Motion Wei-Kai Liao and Isaac Cohen Institute for Robotics and Intelligent Systems Integrated Media Systems Center University of Southern California Los

More information