Multi-view Facial Expression Recognition Analysis with Generic Sparse Coding Feature
Usman Tariq, Jianchao Yang, Thomas S. Huang
Department of Electrical and Computer Engineering, Beckman Institute for Advanced Science and Technology, University of Illinois at Urbana-Champaign
{utariq2, jyang29, huang}@ifp.illinois.edu
October 13, 2012
Motivation
- In real applications, it is not trivial to always have a frontal face view.
- A detailed analysis of the effect of large pose variations (both pan and tilt angles) on expression recognition performance is needed, e.g. for positioning cameras.
- The bulk of the existing literature assumes a frontal or near-frontal face view, manually or automatically detected key points, and/or the presence of a neutral face.
Highlights
- This work addresses multi-view facial expression recognition from a single image.
- Features are obtained with translation-invariant sparse coding, followed by linear classification.
- Besides achieving state-of-the-art results, the work presents an extensive analysis of the effect of variations in pan and tilt angles.
Database
- The publicly available BU-3DFE database is used. It contains 3D face scans of 100 subjects, each performing 6 expressions at 4 intensity levels.
- Facial expressions in the database: anger (AN), disgust (DI), fear (FE), happy (HA), sad (SA), surprise (SU) (and neutral).
- Out of the 100 subjects, 56 are female. The dataset is quite diverse and contains subjects of various racial ancestries.
- Views with seven pan angles (0°, ±15°, ±30°, ±45°) and five tilt angles (0°, ±15°, ±30°) are generated for each subject and each expression-intensity combination, resulting in a dataset of 84,000 images (100 × 6 × 4 × 7 × 5).
Database - expressions and intensity levels
[Figure: rendered facial images of a subject with various expressions (AN, DI, FE, HA, SA, SU) and intensity levels. The intensity levels are labeled from 1 to 4, with 4 being the most intense.]
Database - pan and tilt angles
[Figure: rendered facial images of a subject at pan angles of 0°, ±15°, ±30°, ±45° and tilt angles of 0°, ±15°, ±30°.]
Translation Invariant Sparse Coding (ScSPM)
- Bag of Features (BoF) model.
- Spatial Pyramid Matching (SPM) framework: the given image is partitioned, BoF histograms are extracted for each of those partitions, and these histograms are then concatenated together.
- When we relax the SPM cardinality constraint by solving a lasso problem and use max-pooling instead of the histogram representation, we arrive at the ScSPM setting:
  - it is robust to translation misalignments, since it is computed in an SPM framework;
  - it gives a significant improvement over SPM.
ScSPM - Learning the code-book
Suppose X is a matrix whose columns are image features, X = [x_1, ..., x_N] ∈ ℝ^{p×N}. We solve the following in an alternating, iterative fashion to obtain V:

$$\min_{\mathbf{W},\mathbf{V}} \sum_{n=1}^{N} \left\| \mathbf{x}_n - \mathbf{V}\mathbf{w}_n \right\|_2^2 + \lambda \left\| \mathbf{w}_n \right\|_1 \quad \text{subject to} \quad \|\mathbf{v}_k\|_2 \le 1,\ k \in \{1,2,\dots,K\} \tag{1}$$

Here, V = [v_1, ..., v_K] ∈ ℝ^{p×K} is the code-book or dictionary, W = [w_1, ..., w_N] ∈ ℝ^{K×N}, and λ is a regularization parameter whose value controls the sparseness of the solution w_n.
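As an illustration of Eq. (1), the sketch below learns a small code-book with scikit-learn's MiniBatchDictionaryLearning, which alternates between lasso coding and dictionary updates with unit-norm atoms. The random descriptors, the reduced sizes, and λ = 0.15 are assumptions for a quick demo; the paper's code-book is 128 × 1024 (see the experimental setup).

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

# Illustrative stand-ins (not the paper's data): 2000 descriptors of dimension p = 128,
# a reduced code-book of K = 256 atoms, and an assumed lambda = 0.15.
N, p, K, lam = 2000, 128, 256, 0.15
X = np.random.randn(N, p)  # rows are descriptors x_n (sklearn convention; Eq. (1) uses columns)

# Alternating minimization of Eq. (1): codes w_n come from a lasso problem,
# dictionary atoms v_k are kept at unit l2-norm.
learner = MiniBatchDictionaryLearning(n_components=K, alpha=lam,
                                      batch_size=256, transform_algorithm='lasso_lars')
W = learner.fit_transform(X)   # sparse codes, shape (N, K)
V = learner.components_.T      # learned code-book, shape (p, K)
```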
ScSPM - Coding
- In the ScSPM framework, image-level features are extracted by first densely sampling patches from the given image.
- Low-level features, such as SIFT, are then extracted on these patches.
- These features are then sparsely coded using the code-book V, by solving
  $\min_{\mathbf{w}_n} \left\| \mathbf{x}_n - \mathbf{V}\mathbf{w}_n \right\|_2^2 + \lambda \left\| \mathbf{w}_n \right\|_1$.
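A minimal sketch of this coding step, assuming a dictionary V has already been learned; scikit-learn's sparse_encode solves exactly this lasso problem for each descriptor. The descriptor count and λ below are illustrative values, not the paper's.

```python
import numpy as np
from sklearn.decomposition import sparse_encode

# Illustrative inputs: 500 dense SIFT descriptors from one image and an assumed lambda.
p, K, lam = 128, 1024, 0.15
descriptors = np.random.randn(500, p)              # rows are the x_n
V = np.random.randn(K, p)                          # code-book in sklearn layout (K atoms x p dims)
V /= np.linalg.norm(V, axis=1, keepdims=True)      # unit-norm atoms, as in Eq. (1)

# Solve min_w ||x_n - V w_n||_2^2 + lambda ||w_n||_1 for every descriptor.
codes = sparse_encode(descriptors, V, algorithm='lasso_lars', alpha=lam)  # shape (500, K)
```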
ScSPM - Pooling
- These sparse vectors are then pooled in an SPM framework, z = Φ(W); for instance, with max-pooling, $z_i = \max\{|w_{i1}|, \dots, |w_{iN}|\}$.
- The resulting image-level feature vector is obtained by concatenating the pooled vectors from the various partitions/levels.
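A small sketch of this max-pooling over a spatial pyramid; the 1×1 / 2×2 / 4×4 pyramid is a common choice assumed here, not necessarily the paper's exact configuration.

```python
import numpy as np

def spm_max_pool(codes, xy, levels=(1, 2, 4)):
    """Max-pool sparse codes over a spatial pyramid (sketch of the ScSPM pooling step).

    codes  : (M, K) sparse codes of the M patches of one image
    xy     : (M, 2) patch centre coordinates, normalised to [0, 1]
    levels : pyramid grid sizes; 1x1, 2x2, 4x4 is an assumed, common choice
    """
    pooled = []
    for g in levels:
        cell = np.clip(np.floor(xy * g).astype(int), 0, g - 1)  # g x g cell of each patch
        cell_id = cell[:, 0] * g + cell[:, 1]
        for c in range(g * g):
            mask = cell_id == c
            if mask.any():
                pooled.append(np.abs(codes[mask]).max(axis=0))   # z_i = max |w_i|
            else:
                pooled.append(np.zeros(codes.shape[1]))
    return np.concatenate(pooled)   # image-level feature of length K * (1 + 4 + 16)

# Example: pool the codes of 500 patches with known positions.
z = spm_max_pool(np.random.randn(500, 1024), np.random.rand(500, 2))
```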
Experimental Setup
- Dense SIFT features are extracted with a 3-pixel shift.
- A code-book V ∈ ℝ^{128×1024} is used for sparse coding in an SPM framework, followed by pooling.
- Experiments are done in a 5-fold, subject-independent cross-validation setting on the 84,000 images rendered from the BU-3DFE database.
- A universal approach is adopted for classification:
  - linear SVMs are used for single-image expression recognition;
  - L2-regularized logistic regression is used to obtain probability estimates for fusion.
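A hedged sketch of how such a setup could be wired up with scikit-learn: GroupKFold gives subject-independent folds, a linear SVM classifies single images, and L2-regularized logistic regression provides probabilities for fusion. All data, sizes, and regularization constants below are placeholders, not the paper's actual values.

```python
import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression

# Placeholder data: ScSPM features, expression labels (0..5 for AN..SU) and subject ids.
n_images, feat_dim = 1000, 2048            # the real ScSPM feature is much higher-dimensional
features = np.random.randn(n_images, feat_dim).astype(np.float32)
labels = np.random.randint(0, 6, n_images)
subjects = np.random.randint(0, 100, n_images)

cv = GroupKFold(n_splits=5)                # 5 folds with disjoint subjects (subject-independent)
for train_idx, test_idx in cv.split(features, labels, groups=subjects):
    svm = LinearSVC(C=1.0).fit(features[train_idx], labels[train_idx])
    acc = svm.score(features[test_idx], labels[test_idx])       # single-image recognition rate

    # L2-regularized logistic regression yields class probabilities that can be fused
    # (e.g. averaged) across views of the same subject and expression.
    lr = LogisticRegression(penalty='l2', C=1.0, max_iter=1000)
    lr.fit(features[train_idx], labels[train_idx])
    probs = lr.predict_proba(features[test_idx])                # shape (n_test, 6)
```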
Confusion matrix
Class confusion matrix for the overall recognition performance (69.1%), averaged over all poses and expression intensity levels. Rows: ground truth; columns: predicted.

        AN     DI     FE     HA     SA     SU
AN    64.2    8.4    4.1    2.2   18.1    3.1
DI    10.9   70.1    5.8    3.9    5.2    4.3
FE     7.5    9.5   51.1   13.7    9.5    8.7
HA     2.1    4.3    9.4   81.2    1.7    1.4
SA    19.6    5.2    7.2    2.3   63.4    2.3
SU     1.8    3.0    4.7    3.0    2.6   85.0
Recognition performance for different intensities
[Figure: average percentage recognition rates for each expression (AN, DI, FE, HA, SA, SU) at intensity levels 1 (min) through 4 (max), together with the average across intensities.]
Performance vs pan and tilt angle variations (expressions)
[Figure, two panels: average percentage recognition rates vs pan angles (−45° to +45°) and vs tilt angles (−30° to +30°) for each expression (AN, DI, FE, HA, SA, SU) and their average, showing the effect of pan and tilt changes on per-expression recognition performance.]
Performance vs pan and tilt angle variations (intensities)
[Figure, two panels: average percentage recognition rates vs pan angles (−45° to +45°) and vs tilt angles (−30° to +30°) for expression intensity levels 1 (min) through 4 (max) and their average, showing the effect of pan and tilt changes on recognition performance per intensity level.]
Performance vs simultaneous pan and tilt variations
[Figure: heat map of average recognition rates (colour scale roughly 64-72%) over all combinations of pan angles (−45° to +45°) and tilt angles (−30° to +30°), showing the effect of simultaneous pan and tilt changes on overall single-image expression recognition performance.]
Comparison with State-of-the-Art
- Zheng et al. and Tang et al. follow the same experimental setting but restrict themselves to the strongest expression intensity level only. Hence, their image dataset consists of 100 × 6 × 7 × 5 = 21,000 images.
- For comparison, we repeat our experiments in the same setting.

Performance comparison with earlier works on the strongest expression intensity, in terms of percentage recognition rates:

Zheng et al.   68.2%
Tang et al.    75.3%
Ours           76.1%
Concluding Remarks
- Our work sets a new state of the art for multi-view facial expression recognition on the BU-3DFE database.
- Unlike many other works, our method requires neither key-point detection nor a neutral face.
- A thorough analysis of how expression recognition varies with changes over a range of pan angles, tilt angles, or both is presented.
- The most subtle expressions are the most difficult to recognize.
- We find no conclusive evidence that non-frontal views give significantly better performance than the frontal view.
The End