Face Alignment Under Various Poses and Expressions

Shengjun Xin and Haizhou Ai
Computer Science and Technology Department, Tsinghua University, Beijing 100084, China
ahz@mail.tsinghua.edu.cn

Abstract. In this paper, we present a face alignment system that deals with various poses and expressions. In addition to the global shape model, we use component shape models, such as a mouth shape model and a contour shape model, to achieve a more powerful representation of face components under complex pose and expression variations. Unlike the 1-D profile texture feature of classical ASM, we use a 2-D local texture feature for higher accuracy, and we represent it by Haar-wavelet features as in [5] for robustness and speed. Extensive experiments are reported to show its effectiveness.

1 Introduction

Face alignment, whose goal is to locate facial feature points such as eyebrows, eyes, nose, mouth and contour, is very important in face information processing, including face recognition, face modeling, and facial expression recognition and analysis. Since face information is critical in human-to-human interaction, enabling machines to process it is a key technology for natural human-machine interaction. In many complex face information processing tasks, such as facial expression analysis, face alignment is a fundamental preprocessing step for collecting and aligning data, and the alignment algorithm must work under various poses and expressions. This paper addresses that problem.

In the literature, Cootes et al. [1] [2] proposed two important methods for face alignment: Active Shape Model (ASM) and Active Appearance Model (AAM). Both methods use the Point Distribution Model (PDM) to constrain a face shape and parameterize the shape by PCA, but their feature models are different.
In ASM, the feature model is a 1-D profile texture feature around every feature point, which is used to search for the appropriate candidate location of every feature point. In AAM, by contrast, a global appearance model is introduced to drive the optimization of the shape parameters. Generally speaking, ASM outperforms AAM in shape localization accuracy and is more robust to illumination, but it suffers from local minima; AAM is sensitive to illumination and noisy backgrounds but can obtain an optimal global texture. In this paper, we focus on ASM.

In recent years, many derivative methods have been proposed: ASM-based ones such as TC-ASM [3], W-ASM [4], and Haar-wavelet ASM [5], and AAM-based ones such as DAM [6] and AWN [7]. However, the problem remains unsolved for practical applications: although these methods usually achieve good results on neutral faces, their performance is very sensitive to large variations in face pose and especially in facial expression. This may be caused by the global shape model, which is not powerful enough to represent changes in face components under complex pose and expression variations.

As mentioned above, classical ASM uses a 1-D profile texture feature perpendicular to the feature point contour as its local texture model. However, this local texture model, which covers only a small area, is not sufficient to distinguish a feature point from its neighbors, so ASM often suffers from local minima in the local search stage. To overcome this problem, we follow the approach in [5] and use a 2-D local texture feature represented by Haar-wavelet features for robustness and high speed.

In this paper, we extend the work of [5] to multi-view faces with expression variations, and we use component shape models, such as a mouth shape model and a contour shape model, in addition to the global shape model to achieve a more powerful representation of face components under complex pose and expression variations. This approach is developed over a very large data set, and the algorithm is implemented in a hierarchical structure as in [8] for efficiency.

This paper is organized as follows. Section 2 gives an overview of the system framework and the pose-based face alignment algorithm. Section 3 reports the experiments. Section 4 gives the conclusion.

J. Tao, T. Tan, and R.W. Picard (Eds.): ACII 2005, LNCS 3784, pp. 40-47, 2005. Springer-Verlag Berlin Heidelberg 2005

2 Overview of the System

The designed system consists of four modules: multi-view face detection (MVFD) [10], facial landmark extraction [11], pose estimation [12], and pose-based alignment, as illustrated in Fig. 1 and Fig. 2 (the first two pictures are from FERET [14]). This paper describes the pose-based alignment module in detail.

Fig. 1. Framework of the system

2.1 Pose Based Shape Models

Considering face pose changes out of the image plane from full profile to frontal (without loss of generality, here we consider right full profile to frontal), five types of global shape for the Point Distribution Model (PDM) are defined, as shown in Fig. 3: 37 points for [-90°, -75°), 50 points for [-75°, -60°), 59 points for [-60°, -45°), and 88 points for [-45°, -15°) and [-15°, +15°]. Thus, five PDMs in total are built over the corresponding training sets as pose-based shape models.
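The mapping from pose interval to shape model size can be written as a small lookup. The following is only an illustrative sketch: the function name `landmark_count` is hypothetical, and the handling of left-side views (reusing the mirrored right-side interval) is an assumption suggested by the "without loss of generality" remark, not something the paper specifies as code.

```python
from bisect import bisect_right

# Lower edges (degrees) of the right-profile-to-frontal intervals from Section 2.1
# and the number of PDM points used in each interval.
_LOWER_EDGES = [-90, -75, -60, -45, -15]
_POINT_COUNTS = [37, 50, 59, 88, 88]


def landmark_count(yaw_deg):
    """Return the number of PDM points for a yaw angle in [-90, 90] degrees."""
    if not -90 <= yaw_deg <= 90:
        raise ValueError("yaw must lie in [-90, 90] degrees")
    # Assumption: a left-side view reuses the model of the mirrored right-side view.
    mirrored = -abs(yaw_deg)
    # Find the last interval whose lower edge is <= the mirrored angle.
    index = bisect_right(_LOWER_EDGES, mirrored) - 1
    return _POINT_COUNTS[index]
```

For example, `landmark_count(-80)` selects the 37-point model and `landmark_count(0)` selects the 88-point frontal model.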

In addition to the above global shape models, component shape models for local shape representation are introduced in order to capture accurate shape changes due to large variations in pose and expression, as shown in Fig. 4. The reason is that the global shape model is too strong a constraint to represent local shape changes. Taking a face with an open mouth as an example (Fig. 5a; the picture is from AR [13]), we found that many mouth feature points do reach their correct positions in the local search stage, but because their contribution at the global level is too small to have a significant effect on the final shape, they are pulled away from their correct positions by the global shape model constraint, as shown in Fig. 5e. However, if a component shape model, namely the mouth shape model, is used, their contribution is large enough to change the final shape, as shown in Fig. 5f.

Fig. 2. Component shape model for frontal face (mean contour and mean mouth)

Fig. 3. Pose based shape model (mean shape) from right full profile to frontal: a) [-90°, -75°) b) [-75°, -60°) c) [-60°, -45°) d) [-45°, -15°) e) [-15°, +15°]

Fig. 4. Pose-based face alignment
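The interplay between the global and component constraints can be illustrated with a minimal PCA shape model. This is a simplified sketch under several assumptions that are not taken from the paper: shapes are flattened coordinate vectors that are already aligned (no pose parameters), shape parameters are clamped to three standard deviations as in common ASM practice, and the component stage re-constrains the local-search candidates directly instead of running a second ASM search. The names `ShapeModel`, `two_stage_constrain` and `comp_idx` are hypothetical.

```python
import numpy as np


class ShapeModel:
    """PCA point distribution model over flattened, pre-aligned shapes."""

    def __init__(self, shapes, var_kept=0.95):
        # shapes: array of shape (n_samples, n_coords)
        self.mean = shapes.mean(axis=0)
        _, s, vt = np.linalg.svd(shapes - self.mean, full_matrices=False)
        eigvals = s ** 2 / max(len(shapes) - 1, 1)
        ratio = np.cumsum(eigvals) / eigvals.sum()
        # Keep the smallest number of modes explaining var_kept of the variance.
        k = int(np.searchsorted(ratio, var_kept)) + 1
        self.modes = vt[:k]
        self.eigvals = eigvals[:k]

    def constrain(self, shape, limit=3.0):
        """Project a shape onto the model and clamp each parameter."""
        b = self.modes @ (shape - self.mean)
        bound = limit * np.sqrt(self.eigvals)
        b = np.clip(b, -bound, bound)
        return self.mean + self.modes.T @ b


def two_stage_constrain(candidates, global_model, component_model, comp_idx):
    """Constrain all points globally, then the component points locally."""
    shape = global_model.constrain(candidates)
    # The component model sees only its own coordinates, so they dominate its fit.
    shape[comp_idx] = component_model.constrain(candidates[comp_idx])
    return shape
```

Here `comp_idx` would hold the coordinate indices of, say, the mouth points; because the component model is trained only on those coordinates, a large mouth deformation is not averaged away by the rest of the face.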

In summary, face alignment consists of two stages: the first stage uses the global ASM model, and the second stage uses the component ASM models, initialized from the result of the first stage; see Fig. 5 for an example. In this way, the accuracy is improved significantly.

Fig. 5. Face alignment using global and component shape models: a) Source image b) Face alignment result c) Refined by contour shape model d) Refined by mouth shape model e) Feature points of mouth before refinement by mouth shape model f) Feature points of mouth after refinement by mouth shape model

2.2 Local Texture Model

The 2-D local texture feature represented by Haar-wavelet features proposed in [5], as illustrated in Fig. 6 (the picture is from AR [13]), is adopted. For each point, the features over the training set are clustered by K-means into several representative templates.

Fig. 6. Haar-wavelet feature extraction

2.3 Alignment

The hierarchical alignment algorithm is shown in Fig. 7. For a given face image, first, assuming that several facial landmark points are known (for example, by manual labeling), a regression method is used to initialize a full shape from those points to start the ASM algorithm. Second, the Haar-wavelet features of every feature point and its neighbors (a 3×3 area) are computed as described in Section 2.2, and the current candidate point is selected based on the Euclidean distance between the current Haar-wavelet feature and the trained templates. Third, the candidate points are projected to the shape space to update the shape parameters and pose parameters. The second and third steps are repeated until the shape converges in the current layer. If this layer is the last layer, the algorithm stops; otherwise it moves to the next layer.

Fig. 7. Flowchart of the hierarchical alignment algorithm

3 Experiment

3.1 Training and Testing Data Set

Different from the view ranges presented in [6], that is, [-90°, -55°), [-55°, -15°), [-15°, 15°], [15°, 55°], [55°, 90°], we divide the pose of full-range multi-view faces into the following intervals based on the visibility of facial feature points and the fine modes of shape variation: [-90°, -75°), [-75°, -60°), [-60°, -45°), [-45°, -15°), [-15°, +15°], (15°, 45°], (45°, 60°], (60°, 75°], (75°, 90°]. The view [-15°, +15°] corresponds to frontal.

The experiments are conducted on a very large data set. For the frontal view, the data set consists of 2000 images of males and females ranging in age from children to old people, many with exaggerated expressions such as open mouths or closed eyes, or with ambiguous contours, especially for old people. The average face size is about 180x180 pixels. We randomly chose 1600 images for training and the remaining 400 for testing. For the other views, we labeled the feature points of 300 images of one side view, such as [-90°, -75°), with a semi-automatic labeling tool as ground truth data for training, and used the 300 mirrored images of its symmetric view, such as (75°, 90°], for testing.

In the system illustrated in Fig. 1, facial landmark extraction [11] is currently implemented for frontal faces, and pose estimation [12] can only be used for the views [-45°, -15°), [-15°, +15°], (15°, 45°]. For the other views, manually picking several points and selecting the corresponding pose interval are therefore necessary to start the experiments.
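The candidate selection in the second step of the alignment algorithm (Section 2.3) can be sketched as a nearest-template search over the 3×3 neighborhood of the current point. This is an illustration only: `patch_feature` is a hypothetical stand-in that flattens a raw pixel patch rather than computing the Haar-wavelet features of [5], the point is assumed to lie far enough from the image border, and `templates` is assumed to hold the K-means cluster centers for that feature point.

```python
import numpy as np


def patch_feature(image, x, y, half=2):
    """Stand-in feature: flattened (2*half+1)^2 pixel patch centered at (x, y)."""
    patch = image[y - half:y + half + 1, x - half:x + half + 1]
    return patch.astype(float).ravel()


def best_candidate(image, x, y, templates, feature_fn=patch_feature):
    """Pick the position in the 3x3 neighborhood closest to any template.

    templates: array of shape (n_templates, feature_dim).
    Returns ((x, y), distance) for the best position.
    """
    best_pos, best_dist = (x, y), np.inf
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            feature = feature_fn(image, x + dx, y + dy)
            # Euclidean distance to the nearest trained template.
            dist = np.min(np.linalg.norm(templates - feature, axis=1))
            if dist < best_dist:
                best_pos, best_dist = (x + dx, y + dy), dist
    return best_pos, best_dist
```

In a full implementation the returned candidates for all feature points would then be projected to the shape space, as in the third step of the algorithm.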

3.2 Performance Evaluation

The accuracy is measured by the relative pt-pt error, which is the point-to-point distance between the alignment result and the ground truth divided by the distance between the two eyes (if the face is not frontal, we use the distance between the visible eye corner and mouth corner). The feature points were initialized by a linear regression from 4 eye corner points and 2 mouth corner points of the ground truth. After the alignment procedure, the errors were measured. In Fig. 8a, the distribution of the overall average error is compared with those of classical ASM [1], Gabor ASM [4], and Haar-wavelet ASM [5]. It shows that the presented method, Haar-wavelet ASM with component model, is better than the other three. In Fig. 8b, the average errors of the 88 feature points are compared. The distributions of the overall average errors of the four non-frontal views are compared in Fig. 9, and the average error of each feature point for these four views is shown in Fig. 10. The average execution time per iteration is listed in Table 1.

Fig. 8. Comparison of classical ASM, Gabor ASM, Haar-wavelet ASM and Haar-wavelet ASM with component model: a) Distribution of relative average pt-pt error b) Relative average pt-pt error for each feature point

Fig. 9. Distribution of relative average pt-pt error of multi-view

Fig. 10. Relative average pt-pt error for each feature point of multi-view
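The relative pt-pt error defined in Section 3.2 is straightforward to compute. The sketch below assumes that landmarks are given as (n_points, 2) arrays and that the two reference points used for normalization (the two eyes, or the visible eye corner and mouth corner for non-frontal faces) are supplied by the caller; the function name and argument names are hypothetical.

```python
import numpy as np


def relative_pt_pt_error(result, truth, ref_a, ref_b):
    """Return (per-point relative errors, their mean).

    result, truth: arrays of shape (n_points, 2).
    ref_a, ref_b: the two ground-truth reference points whose distance
    normalizes the error.
    """
    result = np.asarray(result, dtype=float)
    truth = np.asarray(truth, dtype=float)
    norm = np.linalg.norm(np.asarray(ref_b, dtype=float) - np.asarray(ref_a, dtype=float))
    per_point = np.linalg.norm(result - truth, axis=1) / norm
    return per_point, per_point.mean()
```

The per-point values correspond to the kind of curve plotted per feature point, and the mean corresponds to the overall average error of one image.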

Some experimental results on images from FERET [14], AR [13], and the Internet, which are independent of the training/testing set and show large pose and expression variations, are shown in Fig. 11, Fig. 12, and Fig. 13.

Table 1. The average execution time per iteration

Algorithm                                               Execution time (per iteration)
Classical ASM                                           2 ms
Gabor ASM                                               576 ms
Haar-wavelet ASM                                        30-70 ms
Haar-wavelet ASM with component model of this paper:
  Frontal                                               53 ms
  -45° ~ -15°                                           58 ms
  -60° ~ -45°                                           54 ms
  -75° ~ -60°                                           45 ms
  -90° ~ -75°                                           35 ms

Fig. 11. Multi-view face alignment results

Fig. 12. Some results on face database of AR [13]

Fig. 13. Some results on face database of FERET [14] and Internet pictures

4 Conclusions

In this paper, we extend the work of [5] to multi-view faces with expression variations, using component shape models such as a mouth shape model and a contour shape model in addition to the global shape model. A semi-automatic multi-view face alignment system is presented that combines face detection, facial landmark extraction, pose estimation and pose-based face alignment into a uniform coarse-to-fine hierarchical structure based on Haar-wavelet features. With the component shape models, we can deal with faces with large expression variations and ambiguous contours. Extensive experiments show that the implemented system is very fast, yet robust against variations in illumination, expression and pose. It could be very useful in facial expression recognition approaches, for example, to collect shape data.

Acknowledgements

This work is supported by NSF of China grant No.60332010.

References

1. T. Cootes, D. Cooper, C. Taylor, and J. Graham, Active shape models - their training and application, Computer Vision and Image Understanding, 61(1):38-59, 1995.
2. T. Cootes, G. Edwards, and C. Taylor, Active appearance models, IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(6):681-685, 2001.
3. Shuicheng Yan, Ce Liu, Stan Z. Li, Hongjiang Zhang, Heung-Yeung Shum, Qiansheng Cheng, Texture-Constrained Active Shape Models.
4. Feng Jiao, Stan Li, Heung-Yeung Shum, Dale Schuurmans, Face Alignment Using Statistical Models and Wavelet Feature, Proceedings of IEEE Conference on CVPR, pp. 321-327, 2003.
5. Fei Zuo, Peter H.N. de With, Fast facial feature extraction using a deformable shape model with Haar-wavelet based local texture attributes, Proceedings of IEEE Conference on ICIP, pp. 1425-1428, 2004.
6. S. Z. Li, S. C. Yan, H. J. Zhang, Q. S. Cheng, Multi-View Face Alignment Using Direct Appearance Models, in Proceedings of the 5th International Conference on Automatic Face and Gesture Recognition, Washington, DC, USA, 2002.
7. C. Hu, R. Feris, and M. Turk, Active Wavelet Networks for Face Alignment, in British Machine Vision Conference, East Anglia, Norwich, UK, 2003.
8. Ce Liu, Heung-Yeung Shum, and Changshui Zhang, Hierarchical Shape Modeling for Automatic Face Localization, Proceedings of ECCV, pp. 687-703, 2002.
9. P. Viola and M. Jones, Rapid object detection using a boosted cascade of simple features, in Proc. CVPR, pp. 511-518, 2001.
10. Bo Wu, Haizhou Ai, Chang Huang, Shihong Lao, Fast Rotation Invariant Multi-View Face Detection Based on Real Adaboost, in Proc. the 6th IEEE Conf. on Automatic Face and Gesture Recognition (FG 2004), Seoul, Korea, May 17-19, 2004.
11. Tong Wang, Haizhou Ai, Gaofeng Huang, A Two-Stage Approach to Automatic Face Alignment, in Proceedings of SPIE Vol. 5286, pp. 558-563, 2003.
12. Zhiguang Yang, Haizhou Ai, et al., Multi-View Face Pose Classification by Tree-Structured Classifier, The IEEE Inter. Conf. on Image Processing (ICIP-05), Genoa, Italy, September 11-14, 2005.
13. http://rvl1.ecn.purdue.edu/~aleix/aleix_face_db.html
14. P. J. Phillips, H. Wechsler, J. Huang, and P. Rauss, The FERET database and evaluation procedure for face recognition algorithms, Image and Vision Computing J, Vol. 16, No. 5, pp. 295-306, 1998.