Generic 3D Face Pose Estimation using Facial Shapes

Jingu Heo
CyLab Biometrics Center, Carnegie Mellon University
5000 Forbes Ave, Pittsburgh, PA 15213

Marios Savvides
CyLab Biometrics Center, Carnegie Mellon University
5000 Forbes Ave, Pittsburgh, PA 15213

Abstract

Generic 3D face pose estimation from a single 2D facial image is a crucial requirement for face-related research areas. To address the remaining challenges in face pose estimation identified by Murphy-Chutorian and Trivedi [13], we believe the first step is to create a large corpus of 3D facial shapes in which the statistical relationship between projected 2D shapes and the corresponding pose parameters can be easily observed. Because facial geometry provides the most essential information about facial pose, understanding the effect of pose parameters on 2D facial shapes is a key step toward solving these challenges. In this paper, we present the tasks necessary to reconstruct 3D facial shapes from multiple 2D images and then explain how to generate 2D projected shapes at any rotation interval. To deal with self-occlusions, a novel hidden points removal (HPR) algorithm is also proposed. By flexibly changing the number of points in the 2D shapes, we evaluate two different approaches to generic 3D pose estimation at both coarse and fine levels and analyze the importance of facial shapes for generic 3D pose estimation.

1. Introduction

Face (head) pose estimation¹ has been widely investigated for decades and still has room for improvement on the remaining challenges raised in a recent survey [13]. More accurate and automatic pose estimation is desired for numerous applications such as face tracking and recognition, human-computer interaction, and video database indexing: ideally, it should run in real time from a monocular camera, under various lighting conditions and resolutions, be invariant to identity, handle a full range of head motion, and process multiple people simultaneously. Since human faces in digital imagery are affected by numerous factors, both intrinsic appearance changes (expression, aging, and eyeglasses) and extrinsic changes (illumination, camera geometry, and distortions), achieving invariance to these changes is an ongoing research topic and the ultimate goal not only of pose estimation but also of other face-related research areas, including face detection, face tracking, face alignment, and face recognition. It is known that the simplest 3D pose estimation can be achieved using facial geometry assumptions [6] [7] [10], i.e., generic distances among facial features. However, no rigorous evaluations validating these geometry assumptions against facial shape variations have been reported. Furthermore, little attention has been paid to understanding the relationships among 3D facial shapes, their 2D projected shapes, and the associated pose parameters, because generating and processing a large corpus of 3D face databases is not a trivial task in itself. We believe that facial geometry provides the most important information about facial pose; therefore, understanding the effect of pose parameters on 2D facial shapes is the most essential step toward solving the remaining challenges.

¹ We use "face" instead of "head" in pose estimation, since the three degrees of freedom (DOF) can be reliably estimated from facial features, without information about the ears and the head contour.

In this paper, we provide an efficient solution to this problem using a large corpus of 3D shapes, easily acquired from multiple 2D images. Instead of relying on dense facial shapes, we use at most 79 points in our 3D shape reconstruction. We then rotate each 3D face shape model and obtain 2D projected shapes at any rotation interval. A novel Hidden Points Removal (HPR) algorithm is also proposed to deal with self-occlusions. Using this much smaller number of points, we present two different approaches to generic 3D pose estimation and analyze the importance of facial shapes for pose estimation. We show that generic 3D face pose estimation at both coarse and fine levels can be achieved effectively using a set of important facial feature points.

2. Background

Face pose estimation is the process of retrieving the direction or orientation of a face by transforming the pixel-level representation of a human face into a high-level concept of direction [13], i.e., three degrees of freedom (DOF). Existing face pose estimation methods can be largely divided into two categories based on the type of features they use: appearance-based and shape-based approaches. Appearance-based approaches include flexible model-based methods [21] and ordinal pixel representation methods [2] [16]. These methods often require additional regression models, such as Support Vector Regression (SVR), Neural Networks (NN) [3], and manifold embedding methods [25], in order to learn or analyze the nonlinear relationships between appearance information and pose parameters [26]. Because of the difficulty of learning innumerable appearance changes in facial images, over-fitting or poor generalization is the key limiting factor for these appearance methods. They also carry a large computational burden compared to shape-based methods and typically have problems handling three DOF at a fine level of pose estimation.

Shape-based methods, on the other hand, can be further divided into two groups: point-based methods [23] [15] and geometry-based methods [6] [7], which utilize a set of important facial fiducial points. The challenging task in shape-based methods is not the pose estimator itself but the automatic detection of the facial features needed to achieve three DOF. This automatic facial feature detection normally complicates the evaluation of pose estimation, since poor performance in facial feature detection typically results in poor performance in pose estimation. Multi-view face detectors [11] and view-based flexible models [22] are often utilized to improve or initialize the feature localization step for coping with off-angle face images. However, generic 3D face alignment [8] is another important research topic in its own right and still has room for improvement in dealing with shapes under a wide range of 3D rotation changes. We expect that recently proposed facial feature alignment schemes [4] [12] may reduce this feature localization problem; however, even with accurate facial feature alignment, we believe that generic 3D pose estimation at a fine level is not a completely solved problem. Point-based methods typically require a set of points describing the overall shape of a face (the number of points varies within the range of 50 to 100), while geometry-based methods often use a minimal number of points and typically do not require additional regression models. For these reasons, geometry-based methods have attracted researchers seeking three DOF at a fine resolution level, requiring only the locations of the centers of the two eyes, the nose tip, and the two mouth corners. In [6], the facial normal, which contains the three orientations of a face, is obtained from these five points by utilizing a set of fixed distance ratios between facial features (we detail this method and provide an alternative solution in Section 4). However, many existing pose estimation methods are not suitable for achieving both coarse and fine resolution together; some methods can only estimate pose at a very coarse level and have difficulty achieving fine-level pose estimation, especially appearance-based methods. To build a fine-resolution pose estimator with an appearance-based method, one must consider continuous variations of face images.

As mentioned earlier, learning seamless appearance variations in face images is still a challenging problem. A good example of appearance-based pose estimation at an extremely coarse level is view-based face detection [11], which can retrieve single-DOF pose information relative to the frontal view with three discrete values. Although these view-based or pixel-based approaches are popular for detecting faces, improving face detection with more discrete or continuous pose angles may require significant modification of their algorithms.

To aid readability regarding the three DOF, we characterize three angles: roll, pitch, and yaw. Roll is the rotation about the z axis, resulting in rotation of the xy plane. Pitch is the rotation about the x axis, resulting in rotation of the yz plane. Similarly, yaw is the rotation about the y axis, changing the xz values. Pose estimation is therefore the inverse problem of retrieving these three pose parameters from the projected features in a 2D face image; a minimal sketch of the convention follows.

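For concreteness, this convention can be written as rotation matrices; a minimal sketch in NumPy (the function name and the composition order are our own choices, not specified by the paper):

```python
import numpy as np

def rotation_matrix(pitch, yaw, roll):
    """Compose a 3D rotation from pitch (about x), yaw (about y),
    and roll (about z), all given in radians."""
    cx, sx = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    cz, sz = np.cos(roll), np.sin(roll)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])  # pitch: moves the yz plane
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])  # yaw: moves the xz plane
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])  # roll: moves the xy plane
    return Rz @ Ry @ Rx  # one possible composition order
```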

Since observing the three 3D pose parameters requires a minimum of five facial feature points, the five-point method [6] is one of the most attractive candidates for a generic 3D fine pose estimator. In this paper, we focus on analyzing the five-point method with a much simpler solution and on providing more comprehensive evaluation results for generic people, by using a large corpus of 3D shapes in which the relationship between projected 2D shapes and the corresponding pose parameters can be easily manipulated and observed. Although several databases exist for pose estimation [13], to the best of our knowledge most of them provide projected 2D appearances and no shape information. We believe that facial shape information is the most essential for pose estimation; thus our database can be of great importance for analyzing the statistical shape variations caused by 3D pose changes. In addition, we propose a way to make our pose estimator more accurate by incorporating a larger number of facial features.

The rest of the paper is organized as follows. The detailed procedure for generating the 3D database and obtaining 2D projected shapes is presented in Section 3, along with a new HPR algorithm. Two different methods for generic 3D pose estimation are presented in Section 4, and Section 5 contains the performance evaluation results. Finally, in Section 6, we summarize the proposed work and discuss future work.

3. Proposed Work

In this section, we describe the steps needed to build a sparse 3D shape database and introduce the functions needed to obtain 2D shapes and pose parameters at any angle interval. A visual overview of the proposed work is presented in Fig. 1.

Figure 1. Overview of the proposed work.

Based on each reconstructed 3D shape, we obtain a set of novel 2D projected shapes at a fine angle interval. During the projection step, we identify hidden points. A 3D pose estimator is then obtained by utilizing point sets of different sizes, with or without regression models. We detail each step of the proposed work in the following sections.

3.1. Sparse 3D Face Reconstruction

3D reconstruction from multiple 2D face images is one of the most convenient ways of acquiring 3D facial shape information. A closely related research topic for sparse 3D reconstruction is Structure from Motion (SFM), which has been successfully applied in many computer vision algorithms [9]. The basic theory behind SFM is presented in this section. We define the 2D shape matrix $S_{2 \times n}$ from the 2D coordinates $(x, y)$ of the $n$ vertices:

$$S_{2 \times n} = \begin{pmatrix} x_1 & x_2 & \cdots & x_n \\ y_1 & y_2 & \cdots & y_n \end{pmatrix} \qquad (1)$$

where each column contains a vector of $(x, y)$ coordinates. Similarly, the 3D shape matrix is built from the 3D coordinates $(x, y, z)$:

$$S_{3 \times n} = \begin{pmatrix} x_1 & x_2 & \cdots & x_n \\ y_1 & y_2 & \cdots & y_n \\ z_1 & z_2 & \cdots & z_n \end{pmatrix} \qquad (2)$$

where each column contains a vector of $(x, y, z)$ coordinates. The goal of the 3D reconstruction problem from multiple 2D observations is to recover the 3D shape information under noisy conditions. Since a set of 2D points is an instance of a projection of 3D points with a 2D translation vector $t$, we write:

$$s_{2d} = P s_{3d} + t \qquad (3)$$

where $s_{2d} = (x, y)^T$ and $s_{3d} = (x, y, z)^T$ denote a point in 2D and 3D respectively, and $P$ is the projection matrix, which must be specified depending on the application. For multiple camera observations $i$ and multiple points $n$, the goal of reconstruction is to minimize the overall error:

$$\arg\min_{P^i, t^i, s_{3d_n}} \sum_{n, i} \left\| s^i_{2d_n} - (P^i s_{3d_n} + t^i) \right\|^2. \qquad (4)$$

The factorization algorithm [24] achieves the above minimization under Gaussian noise. The measurement matrix $Y$ is obtained by stacking the 2D observations $S_{2 \times n}$ from all views. The factorization algorithm can then estimate $S_{3 \times n}$ by decomposing $Y = P_{2i \times 3} S_{3 \times n}$ through Singular Value Decomposition (SVD), exploiting the fact that the rank of $Y$ is at most 3; a sketch of this step follows below. A metric upgrade is then performed with additional constraints [24]. Using this SFM technique [24], we reconstructed 249 sparse 3D faces (79 points each) from the first session of the MPIE database [17]. Five images of the same person (within ±30 degrees of the frontal view) under pose variations with the same expression were used for each reconstruction. We utilize these shapes to generate 2D projected shapes at arbitrary 3D rotations.

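A minimal sketch of the rank-3 factorization step, assuming the 2D observations in each view have already been centered (which removes $t$) and omitting the metric upgrade; the function name is ours:

```python
import numpy as np

def factorize(Y):
    """Rank-3 factorization of a measurement matrix (Tomasi-Kanade style [24]).
    Y: (2*i, n) matrix stacking the centered 2D observations of n points
       over i views, one (x-row, y-row) pair per view.
    Returns the affine motion P_hat (2*i, 3) and shape S_hat (3, n)."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    # Under noise-free orthography rank(Y) <= 3, so keep the top 3 components.
    P_hat = U[:, :3] * np.sqrt(s[:3])
    S_hat = np.sqrt(s[:3])[:, None] * Vt[:3, :]
    # A metric upgrade (imposing orthonormality on each view's camera rows)
    # is still required to obtain a Euclidean reconstruction [24].
    return P_hat, S_hat
```
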
3.2. 2D Projected Shapes

The procedure for synthesizing 2D projected shapes at any desired angle can be explained in three steps. First, we normalize the reconstructed 3D shapes in order to compensate for scale and rotation differences. Based on the reconstructed shapes, we perform an initial adjustment to ensure that every 3D shape is frontal and that its 2D projections lie along the z axis (relative to the frontal viewing angle). In other words, we project each 3D shape under the scaled version of orthographic projection (weak-perspective projection), which is a fairly good approximation of projective geometry. Second, we rotate each 3D shape by any desired angle interval and obtain the 2D projected shapes. Depending on the interval, pose estimation can operate at either a coarse or a fine level. In this way, we can estimate pose at both levels without changing the main algorithms; all that needs to be determined is the angular interval within which 2D projected shapes are assigned the same pose (a sketch is given below).

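A minimal sketch of steps one and two under weak-perspective projection (the function name, the scale handling, and the sampling grid are illustrative):

```python
import numpy as np

P = np.array([[1., 0., 0.],
              [0., 1., 0.]])  # orthographic projection matrix P_{2x3}

def project_shape(S3, yaw_deg, pitch_deg, scale=1.0):
    """Rotate a frontal 3D shape S3 (3, n) by yaw (about y) and pitch
    (about x), then apply scaled orthographic (weak-perspective) projection."""
    ay, ax = np.radians(yaw_deg), np.radians(pitch_deg)
    Ry = np.array([[np.cos(ay), 0, np.sin(ay)],
                   [0, 1, 0],
                   [-np.sin(ay), 0, np.cos(ay)]])
    Rx = np.array([[1, 0, 0],
                   [0, np.cos(ax), -np.sin(ax)],
                   [0, np.sin(ax), np.cos(ax)]])
    return scale * (P @ Ry @ Rx @ S3)  # (2, n) projected shape

# Sampling every 2 degrees within [-40, 40] in both angles, e.g.:
# shapes = {(y, p): project_shape(S3, y, p)
#           for y in range(-40, 41, 2) for p in range(-40, 41, 2)}
```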

Finally, in order to handle occlusions while displaying, the visibility of the points must be determined. Current methods for handling visibility need improvement, however, since they typically rely on texture (depth-buffer) or surface normals to decide whether points are visible from a viewing direction. This visibility problem is a crucial element to address, especially for 2D projected shapes under a wide range of pose changes. We detail this problem in the next section, where we develop an improved way of performing hidden points removal on a set of sparse points based on surface normals and depth relative to the viewing angle.

3.3. A New Hidden Points Removal Method

Hidden Points Removal (HPR) is the process of determining the visibility of a point cloud from a given viewpoint. A closely related technique is Hidden Surface Removal (HSR): HPR determines the visibility of points, while HSR emphasizes the visibility of surfaces. Popular approaches for determining visibility include the z-buffer method [14] [20] and surface reconstruction based methods [5] [18] [1]. These methods typically require dense points in order to obtain triangular meshes from which surface normals can be computed smoothly. Recently, an efficient approach was proposed that avoids the computationally expensive surface normal computation [19]. The authors of [19] also provide an automated way of computing the required radius parameter; however, their method cannot successfully handle points with self-occlusions and needs a dense set of points to reliably determine visibility. In order to eliminate the radius selection of [19], we propose a new HPR algorithm based on surface normals and depth information relative to the viewing point. Since surface orientation alone is not sufficient to decide point visibility, relative depth information must also be considered. Our proposed HPR method is similar in spirit to surface reconstruction methods [18] [1]; however, we do not attempt to reconstruct smooth surfaces. Rather, simple triangulation information over a set of points is utilized.

An overview of our proposed HPR method is as follows. We first compute a mesh from the set of points (79 points) obtained by the 3D reconstruction method of the previous section. We then compute surface normals and depth information relative to the viewing point. Each surface normal determines the orientation of its triangle with respect to the viewing angle, while the relative depth information helps determine self-occlusions. By utilizing the surface orientation and relative depth inferred from the triangles adjacent to each point, we can efficiently determine the visibility of every point. Finally, we obtain the 2D projected shapes along with their visibility and the corresponding pose parameters. This determination of visibility is a valuable component for extreme off-angle pose estimation.

We now explain the proposed HPR method in detail. Formally, the goal of an ordinary HPR process for human facial shapes can be stated as follows.

Definition: Given a set of points $S_{3 \times n}$, considered a sampling of a continuous surface $S_{3 \times m}$ with $m \gg n$, and a viewpoint (camera position) $C$, determine which points of $S_{3 \times n}$ are visible from $C$.

We write a 2D shape as an instance of a 2D projection of a 3D shape with a 2D translation vector $t$:

$$S_{2 \times n} = P_w S_{3 \times n} + t. \qquad (5)$$

The weak-perspective projection matrix $P_w$ can be decomposed as $s_c P_{2 \times 3} R$, where $s_c$ is the scale factor, $R$ is the 3D rotation matrix, and $P_{2 \times 3}$ is the orthographic projection matrix:

$$P_{2 \times 3} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}. \qquad (6)$$

The HPR process then separates $S_{2 \times n}$ into visible points $S^v_{2 \times (n - n_h)}$ and occluded points $S^h_{2 \times n_h}$:

$$S_{2 \times n} = \{ S^v_{2 \times (n - n_h)}, \, S^h_{2 \times n_h} \} \qquad (7)$$

where $n_h$ is the number of hidden points. Since our viewing point $C$ lies on the z axis, we set $C = [0 \; 0 \; c]$, where $c$ is an arbitrarily large positive number; this justifies the use of weak-perspective projection for our 2D projected shapes. In addition, all 2D projected shapes and the corresponding rotation parameters are generated with respect to this fixed viewing position $[0 \; 0 \; c]$. The following four steps, based on the toy example in Fig. 2, contain the main idea of the proposed HPR process; a code sketch follows the list.

Figure 2. The proposed HPR algorithm on a toy example. After applying HPR to the original five points (left), the three points on the right side are identified as visible from C.

1. After applying the rotation to the 3D shape relative to $C$, triangulate the shape based on its points. In Fig. 2, for example, the 3D shape comprises the five points S1 to S5, and the Delaunay triangulation yields the anti-clockwise triangles $T_{123}$, $T_{243}$, $T_{135}$, and $T_{345}$.

2. Compute a surface normal for each triangle. For the triangle built on the three points (S1, S2, S3), for example, the surface normal is

$$n_{T_{123}} = (S2 - S1) \times (S3 - S1) \qquad (8)$$

$$n_{T_{123}} \leftarrow n_{T_{123}} / \| n_{T_{123}} \| \qquad (9)$$

where $\times$ is the cross product of two vectors and $\| \cdot \|$ denotes the Euclidean norm.

3. Calculate the angle between $C$ and $n_{T_{123}}$, i.e., $\theta = \arccos\left( \frac{C \cdot n_{T_{123}}}{\| C \| \, \| n_{T_{123}} \|} \right)$. If $\theta \geq 90°$, the triangle is oriented backwards and its points become candidates for occlusion. If every triangle containing a given point is oriented backwards, the point is indeed invisible; if at least one of its triangles has $\theta < 90°$, the point remains a visibility candidate. However, even when $\theta < 90°$, the points of a triangle can still be invisible due to self-occlusion, so the following additional step is used.

4. Compute the distance from $C$ to the center of gravity of each triangle. For each triangle, sorted by this distance, check whether any other point projects inside the triangle; a point that does, and that lies farther from $C$ than the triangle, is indeed occluded.

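A minimal sketch of these four steps, under our own assumptions: the Delaunay mesh is built once on the frontal shape's xy coordinates (so it is stable under rotation), all triangles are oriented counter-clockwise, and the step-4 depth comparison is approximated with the triangle centroid; SciPy and the function name are ours.

```python
import numpy as np
from scipy.spatial import Delaunay

def visibility_mask(S3_frontal, R, C=np.array([0., 0., 1e6])):
    """Boolean visibility of each point of a frontal 3D shape (3, n)
    after rotation R (3, 3), viewed from C on the +z axis."""
    # Step 1: triangulate once on the frontal xy coordinates and enforce
    # a consistent counter-clockwise orientation (+z normals when frontal).
    front = S3_frontal.T
    tris = Delaunay(front[:, :2]).simplices
    for t in tris:
        a, b, c = front[t]
        if np.cross(b - a, c - a)[2] < 0:
            t[[1, 2]] = t[[2, 1]]
    pts = (R @ S3_frontal).T                     # rotated points, (n, 3)
    view = C / np.linalg.norm(C)
    visible = np.zeros(len(pts), dtype=bool)
    facing = []
    for t in tris:
        a, b, c = pts[t]
        n = np.cross(b - a, c - a)               # step 2: triangle normal
        n /= np.linalg.norm(n)
        # Step 3: a point stays a candidate if at least one of its
        # triangles faces the viewpoint (angle to C below 90 degrees).
        if np.arccos(np.clip(n @ view, -1.0, 1.0)) < np.pi / 2:
            visible[t] = True
            facing.append(t)
    # Step 4: a candidate projecting strictly inside a closer,
    # front-facing triangle is self-occluded.
    for t in facing:
        tri = pts[t]
        z_tri = tri[:, 2].mean()                 # centroid depth (approximation)
        B = np.array([tri[1, :2] - tri[0, :2], tri[2, :2] - tri[0, :2]]).T
        for i in np.where(visible)[0]:
            if i in t or pts[i, 2] >= z_tri:
                continue                         # own vertex, or not behind
            lam = np.linalg.solve(B, pts[i, :2] - tri[0, :2])
            if lam.min() > 1e-9 and lam.sum() < 1 - 1e-9:
                visible[i] = False
    return visible
```

Applied to every sampled rotation, the mask realizes the split of eq. (7) into visible and hidden points.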

Based on the above HPR procedure, which infers the visibility of points mainly from triangle information (surface orientation and relative depth), we process the 2D facial shapes obtained while rotating and projecting a 3D shape. The results are shown in Fig. 3.

Figure 3. Examples of the hidden points removal process.

Visible points ($S^v_{2 \times (n - n_h)}$) are depicted in blue, while occluded points ($S^h_{2 \times n_h}$) are plotted in red. As these figures show, we can successfully identify the visibility of the points. The HPR process in this paper is mainly used to identify commonly visible points for a given viewing direction; however, it can easily be applied to face alignment problems that need an automated scheme for detecting points and identifying their visibility simultaneously [8].

4. Generic 3D Pose Estimation Methods

In this section, we present two different types of shape-based methods for achieving generic 3D fine pose estimation: a pure geometry-based method and a shape-based method with a multivariate regression model. A modified version of the geometry-based method, which utilizes five points, is explained first. The evaluation of both methods is discussed in Section 5. For the shape-based method, we apply three different numbers of points (5, 50, and 79) with multivariate regression models. Since we are also interested in learning the relationships between pose parameters and shape changes simultaneously, multivariate regression models enable us to do so. Other approaches, such as SVR and NN, are also good candidates for learning the nonlinear relationship; however, we find that they are only suitable for coarse pose estimation and have difficulty learning multiple pose parameters jointly and simultaneously.

It is important to note that we focus on yaw and pitch estimation throughout the evaluation, since the roll angle can be obtained directly from the centers of the two eyes.

4.1. Modified Geometry-Based Method

In [6], the three DOF are calculated in a spherical coordinate system by estimating the facial normal. Based on a set of fixed distance ratios for generic people, the tilt and slant angles are computed for any viewing direction. Since several ambiguities exist in the original method [6], we re-interpret it with a more intuitive solution. The major difference in our approach is the use of the projected lengths of the facial normal along both axes. In addition, we fix the viewing direction $C$ and do not use the spherical coordinate system. In our work, we first correct the z-axis rotation based on the centers of the two eyes. A visual illustration of this de-rotation about the z axis is shown in Fig. 4.

Figure 4. Illustration of the de-rotation used to reliably compute the facial normal from a single face image.

Once the face axis, the line through the middle point of the two eyes and the middle point of the mouth corners, is retrieved, we take a predetermined fixed point on the face axis and draw the line from it through the nose tip. This line is the facial normal $n_f$, which carries the most important information about the 3D orientation of the face. However, decomposing the facial normal into pose parameters is not trivial, because we have to choose the fixed point, the point where the facial normal attaches, on the face axis. We describe this problem using the four points (fa, fb, fc, fd) shown in Fig. 4. fa and fd are the middle points between the two eyes and between the two mouth corners, respectively. The line through fa and fd is the face axis. fb is the point on the face axis orthogonal to the facial normal, while fc is calculated as

$$fc = (1 - t)\, fa + t\, fd \qquad (10)$$

where $t$ must be determined. The author of [6] chose $t$ empirically from the relative distance ratio

$$L_f = \frac{\| fb - fd \|}{\| fa - fd \|} \qquad (11)$$

where $\| \cdot \|$ denotes the Euclidean distance (for a frontal face fb coincides with fc, so this ratio determines $t$); the value used there was 0.40. However, the average ratio computed from our shape database is about 0.45, and there is significant variation around it. Besides this relative distance ratio, we consider the length of the facial normal, which is closely related to the relative nose height [6]. Unlike the original method, which uses a generic nose height directly [6], we compute the projected distances of the facial normal along the observed y axis and x axis:

$$dy = \| fb - fc \|, \qquad dx = \| n_f - fb \|_x \qquad (12)$$

where $\| \cdot \|_x$ is the Euclidean distance along the x direction. A visual illustration of these distances is given in Fig. 4. However, these projected distances require ground truth, i.e., they vary from person to person. A common average 3D shape, obtained from our 3D shape database, and its projected distances in 2D are good candidates for this purpose; the relationships between the generic nose height and the projected distances in both yaw and pitch should therefore be utilized. Eq. (12) is then normalized by the mean face-axis length and rewritten as

$$dy = \frac{\| fb - fc \|}{\| fa - fd \|}, \qquad dx = \frac{\| n_f - fb \|_x}{\| fa - fd \|}. \qquad (13)$$

The generic relative nose length ratio is

$$l_n = \frac{L_n}{\| fa - fd \|} \qquad (14)$$

where fa and fd are obtained from the average 3D shape, $L_n$ is the length of the average nose height, and $l_n$ is the nose length normalized by the average length of the face axis. Using our 3D database, we set $l_n$ to 0.50, which differs from the value chosen in [6]. Finally, the pitch and yaw angles are computed from the observed relative distances divided by the maximum nose height $l_n$:

$$\theta_x = \arcsin\left( \frac{dy}{l_n} \right), \qquad \theta_y = \arcsin\left( \frac{dx}{l_n} \right). \qquad (15)$$

The intuition behind this equation is that larger rotations lead to larger projected distances, bounded by the maximum nose height. For a 90-degree rotation in yaw, for example, the projected distance along the x axis reaches its maximum, the average nose height. A visual summary of the proposed modified geometry-based method is given in Fig. 5, and a code sketch follows at the end of this subsection.

Figure 5. Summary of the proposed modified geometry-based method for a generic 3D fine pose estimator.

However, both the original and the modified geometry-based methods may suffer for shapes that deviate significantly from the mean: there is substantial variation around the mean, and for people with completely different facial shapes (including nose height), reliable performance cannot be guaranteed.

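A minimal sketch of this five-point estimator. The function name is ours; the constants follow the ratios quoted above; eq. (10) is instantiated with the reading that fb coincides with fc for a frontal face; and the recovered pitch/yaw are magnitudes whose signs can be restored from the side of the face axis on which the nose tip falls.

```python
import numpy as np

L_F = 0.45  # mean relative position of fb on the face axis (our database)
L_N = 0.50  # mean relative nose length l_n (our database)

def pose_from_5_points(eye_l, eye_r, nose, mouth_l, mouth_r):
    """Estimate (pitch, yaw, roll) in degrees from five 2D points,
    following the modified geometry-based method (eqs. 10-15).
    Pitch and yaw are returned as magnitudes (see note above)."""
    eye_l, eye_r = np.asarray(eye_l, float), np.asarray(eye_r, float)
    nose = np.asarray(nose, float)
    fa = (eye_l + eye_r) / 2.0                       # midpoint of the eyes
    fd = (np.asarray(mouth_l, float) + np.asarray(mouth_r, float)) / 2.0
    # Roll from the eye line; de-rotate fd and the nose tip about fa.
    roll = np.arctan2(eye_r[1] - eye_l[1], eye_r[0] - eye_l[0])
    Rz = np.array([[np.cos(-roll), -np.sin(-roll)],
                   [np.sin(-roll),  np.cos(-roll)]])
    fd, nose = Rz @ (fd - fa) + fa, Rz @ (nose - fa) + fa
    axis_len = np.linalg.norm(fd - fa)
    axis_dir = (fd - fa) / axis_len                  # face axis direction
    fc = fa + (1.0 - L_F) * (fd - fa)                # eq. (10): ||fc-fd|| = L_F ||fa-fd||
    fb = fa + ((nose - fa) @ axis_dir) * axis_dir    # foot of the nose tip on the axis
    dy = np.linalg.norm(fb - fc) / axis_len          # eq. (13), axial offset
    dx = abs(nose[0] - fb[0]) / axis_len             # eq. (13), lateral offset
    pitch = np.degrees(np.arcsin(np.clip(dy / L_N, -1.0, 1.0)))  # eq. (15)
    yaw = np.degrees(np.arcsin(np.clip(dx / L_N, -1.0, 1.0)))
    return pitch, yaw, np.degrees(roll)
```
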
We therefore present an alternative way of achieving generic fine 3D pose estimation in the next section.

4.2. Multivariate Linear Regression Model

In this section, we utilize the shape information alone (the locations of the points), without computing any geometry between the points, for pose estimation. As in the geometry-based method, a generic mean 3D shape is used. We employ a multivariate linear regression model, which enables us to learn multiple pose parameters (yaw and pitch) simultaneously through a single regression matrix. Furthermore, we can easily increase the number of points to improve the performance of the regression model. An overview of the proposed approach, using different numbers of points with multivariate linear regression models, is shown in Fig. 6.

Figure 6. Overview of the multivariate regression model for pose estimation with varying numbers of points.

The major concern with this approach is whether a single generic 3D shape and its 2D projected shapes are suitable for learning relationships that generalize to a variety of people with different shapes. We provide an answer to this question below.

The overall procedure of the proposed pose estimator with a multivariate linear regression model can be divided into training and testing stages. In the training stage, we need to learn the relationship between the rotation parameters and the 2D projected shapes. The relationship involving $R_{\theta_y}$ and $R_{\theta_x}$ must be computed by regression analysis (we do not model $\theta_z$, since $\theta_z$ can be obtained directly from the centers of the two eyes). Eq. (3) then becomes:

$$S_{2d} = P R_{\theta_y} R_{\theta_x} S. \qquad (16)$$

We store the centered and energy-normalized 2D shape projection vectors $f_{\theta_y \theta_x}$ over the pose parameters ($\theta_y$ and $\theta_x$) into $F_{\theta_y \theta_x}$, and the corresponding parameter vectors $r_{\theta_y \theta_x}$ into $R_{\theta_y \theta_x}$, respectively. We write $f_{\theta_y \theta_x}$ as:

$$f_{\theta_y \theta_x} = \frac{s_{1d_{2n}}(\theta_y, \theta_x) - \bar{s}_{1d_{2n}}(\theta_y, \theta_x)}{\left\| s_{1d_{2n}}(\theta_y, \theta_x) - \bar{s}_{1d_{2n}}(\theta_y, \theta_x) \right\|_2} \qquad (17)$$

where $s_{1d_{2n}}$ is the vectorized version of $S_{2d}$ and $\bar{s}_{1d_{2n}}$ denotes its mean. The relationship between the 3D pose parameters and the 2D shapes can then be modeled by multivariate regression analysis:

$$M_{\theta_y \theta_x} = R_{\theta_y \theta_x} F^T_{\theta_y \theta_x} \left( F_{\theta_y \theta_x} F^T_{\theta_y \theta_x} + \epsilon I \right)^{-1} \qquad (18)$$

where $\epsilon$ is the regularization parameter and $I$ is the identity matrix. In the testing stage, we estimate the rotation parameters for a given 2D input shape. This is done simply by:

$$(\theta_y, \theta_x)^T = M_{\theta_y \theta_x} \, s_{1d_{2n}} \qquad (19)$$

where $s_{1d_{2n}}$ denotes the centered and energy-normalized input 2D shape.

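A minimal sketch of the training and testing stages (eqs. 17-19). The zeroing of hidden points anticipates the handling described in Section 5, and the regularization value and function names are illustrative.

```python
import numpy as np

def normalize_shape(s, hidden=None):
    """Center and energy-normalize a vectorized 2D shape (eq. 17).
    Hidden coordinates are zeroed first, as done in Section 5."""
    s = np.asarray(s, float).copy()
    if hidden is not None:
        s[hidden] = 0.0
    s -= s.mean()
    return s / np.linalg.norm(s)

def train_regressor(F, R, eps=1e-3):
    """Learn the regression matrix M of eq. (18).
    F: (2n, k) normalized training shapes, one per column;
    R: (2, k) corresponding (yaw, pitch) parameters."""
    return R @ F.T @ np.linalg.inv(F @ F.T + eps * np.eye(F.shape[0]))

def predict_pose(M, s, hidden=None):
    """Estimate (yaw, pitch) for one vectorized 2D input shape (eq. 19)."""
    return M @ normalize_shape(s, hidden)
```
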
5. Experimental Results

Since our goal is generic 3D fine pose estimation, we utilize only a set of fixed ratios and the 2D projected shapes obtained from a single global 3D shape model; we expect that a person-specific 3D shape model would yield an even more accurate pose estimator. We evaluate the two shape-based methods in this section. In order to test generalization, we use only an average model, obtained from the first 100 3D shapes in the MPIE database [17]; the remaining 3D shapes and their 2D projections are used for testing throughout the evaluations. The main advantage of the geometry-based method is that it has no training stage. For the point-based method, we consider three different sets of points and train a corresponding regression model for each. Using an average 3D shape, we generate 1,500 2D projected shapes at a 2-degree interval in each angle. The average errors tend to increase when input shapes are affected by large rotations, especially near the corners of the pose range ($\theta_y = \pm 40°$, $\theta_x = \pm 40°$). The average training and testing results are shown in Table 1.

Table 1. Performance comparison of pose estimation: mean squared errors (MSE) within -40 to 40 degrees in yaw and pitch, reported for the 5-point geometry method (testing only) and for the 5-, 50-, and 79-point regression models (training and testing).

We achieve similar performance for both 5-point approaches, with and without a regression model, in both angles. The best performance comes from the 50-point regression model, with reasonable generalization errors. The overall shape information of the important facial features, such as the shapes of the eyes, nose, and mouth, is an important factor in achieving reasonable performance. We conclude that 5-point methods may serve as an initial but reasonable pose estimator; however, applications that require high pose estimation accuracy should utilize a larger number of points and a set of generic shapes. Throughout the evaluation, we used the commonly visible points (out of the 79) under rotations of -40 to 40 degrees in yaw and pitch from the frontal view. Since some points are invisible under such rotations, we identify them using the proposed HPR approach.

It is worth mentioning that each multivariate regression model requires the same number of points in both the training and testing stages. We achieve this by setting the spatial locations of the invisible points to zero for each multivariate regression model; in this way, we avoid the effect of the hidden points on pose estimation in both stages. We expect that the proposed pose estimator can easily be extended to handle profile faces. One important task is to fix a set of points that are statistically observed in profile faces: about 46 points, obtained from our 3D shape database, can be utilized to handle right profile faces (50 to 100 degrees) and left profile faces (-50 to -100 degrees). Exactly the same procedure as in our pose estimator can then be applied to profile images, although in this paper we present results using frontal shapes within -40 to 40 degrees in yaw and pitch. It is important to note that we do not attempt to deal with face alignment; this paper focuses only on pose estimation. Since 3D face alignment is itself a challenging research topic [8], we expect that reducing the fitting errors in 3D facial alignment is the key step toward our generic 3D pose estimation. Furthermore, the HPR process is also a crucial task to consider in 3D face alignment.

6. Discussion and Future Work

In this paper, we have presented two different ways of achieving generic 3D fine pose estimation by utilizing facial shapes. We have shown that sparse shape information of human faces is a crucial element for generic pose estimation at a fine angle interval. It is our ongoing work to develop a robust 3D face alignment method to accompany the proposed pose estimator, which would make it operate in a fully automated manner.

References

[1] B. Mederos, N. Amenta, L. Velho, and L. H. de Figueiredo. Surface reconstruction from noisy point clouds. Eurographics Symposium on Geometry Processing, pages 53-62, 2005.
[2] D. Beymer. Face recognition under varying pose. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, 1994.
[3] C. M. Bishop. Neural Networks for Pattern Recognition. Clarendon Press, Oxford, 1995.
[4] D. Cristinacce and T. Cootes. Automatic feature localisation with constrained local models. Pattern Recognition, 41(10), 2008.
[5] D. Cohen-Or, Y. Chrysanthou, C. Silva, and F. Durand. A survey of visibility for walkthrough applications. IEEE Transactions on Visualization and Computer Graphics, 9(3), 2003.
[6] A. Gee and R. Cipolla. Determining the gaze of faces in images. Image and Vision Computing, 12(10), 1994.
[7] A. Gee and R. Cipolla. 3D pose estimation of the face from video. In Face Recognition: From Theory to Applications, NATO ASI Series F, Springer-Verlag, 1998.
[8] L. Gu and T. Kanade. 3D alignment of face in a single image. In Proc. of IEEE Int'l Conf. on Computer Vision and Pattern Recognition, 2006.
[9] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2004.
[10] Q. Ji and R. Hu. 3D face pose estimation and tracking from a monocular camera. Image and Vision Computing, 20(7), 2002.
[11] M. Jones and P. Viola. Fast multi-view face detection. Mitsubishi Electric Research Laboratories, MERL-TR, 2003.
[12] S. Milborrow and F. Nicolls. Locating facial features with an extended active shape model. In Proc. of the European Conf. on Computer Vision (ECCV), 2008.
[13] E. Murphy-Chutorian and M. M. Trivedi. Head pose estimation in computer vision: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(4):607-626, 2009.
[14] N. Greene, M. Kass, and G. Miller. Hierarchical z-buffer visibility. SIGGRAPH, 1993.
[15] N. Krüger, M. Pötzsch, and C. von der Malsburg. Determination of face position and pose with a learned representation based on labeled graphs. Image and Vision Computing, 15(8), 1997.
[16] S. Niyogi and W. Freeman. Example-based head tracking. In Proc. IEEE Int'l Conf. on Automatic Face and Gesture Recognition, 1996.
[17] R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker. Multi-PIE. In Proc. of Int'l Conf. on Automatic Face and Gesture Recognition, 2008.
[18] S. Rusinkiewicz and M. Levoy. QSplat: A multiresolution point rendering system for large meshes. SIGGRAPH, 2000.
[19] S. Katz, A. Tal, and R. Basri. Direct visibility of point sets. ACM Transactions on Graphics, 26(3), 2007.
[20] M. Sainz and R. Pajarola. Point-based rendering techniques. Computers & Graphics, 28(6), 2004.
[21] T. Cootes, G. Edwards, and C. Taylor. Active appearance models. In Proc. of the European Conf. on Computer Vision, volume 2, 1998.
[22] T. Cootes, K. Walker, and C. Taylor. View-based active appearance models. In Proc. IEEE Int'l Conf. on Automatic Face and Gesture Recognition, 2000.
[23] T. F. Cootes, C. J. Taylor, D. H. Cooper, and J. Graham. Active shape models: Their training and application. Computer Vision and Image Understanding, 61(1):38-59, 1995.
[24] C. Tomasi and T. Kanade. Shape and motion from image streams under orthography: A factorization method. Int'l Journal of Computer Vision, 9(2):137-154, 1992.
[25] J. Wu and M. M. Trivedi. A two-stage head pose estimation framework and evaluation. Pattern Recognition, 41(3), 2008.
[26] Y. Li, S. Gong, J. Sherrah, and H. Liddell. Support vector machine based multi-view face detection and recognition. Image and Vision Computing, 22(5), 2004.


More information

Mei Han Takeo Kanade. January Carnegie Mellon University. Pittsburgh, PA Abstract

Mei Han Takeo Kanade. January Carnegie Mellon University. Pittsburgh, PA Abstract Scene Reconstruction from Multiple Uncalibrated Views Mei Han Takeo Kanade January 000 CMU-RI-TR-00-09 The Robotics Institute Carnegie Mellon University Pittsburgh, PA 1513 Abstract We describe a factorization-based

More information

An idea which can be used once is a trick. If it can be used more than once it becomes a method

An idea which can be used once is a trick. If it can be used more than once it becomes a method An idea which can be used once is a trick. If it can be used more than once it becomes a method - George Polya and Gabor Szego University of Texas at Arlington Rigid Body Transformations & Generalized

More information

arxiv: v1 [cs.cv] 2 May 2016

arxiv: v1 [cs.cv] 2 May 2016 16-811 Math Fundamentals for Robotics Comparison of Optimization Methods in Optical Flow Estimation Final Report, Fall 2015 arxiv:1605.00572v1 [cs.cv] 2 May 2016 Contents Noranart Vesdapunt Master of Computer

More information

Passive driver gaze tracking with active appearance models

Passive driver gaze tracking with active appearance models Carnegie Mellon University Research Showcase @ CMU Robotics Institute School of Computer Science 2004 Passive driver gaze tracking with active appearance models Takahiro Ishikawa Carnegie Mellon University

More information

Structure from Motion

Structure from Motion 11/18/11 Structure from Motion Computer Vision CS 143, Brown James Hays Many slides adapted from Derek Hoiem, Lana Lazebnik, Silvio Saverese, Steve Seitz, and Martial Hebert This class: structure from

More information

Occlusion Robust Multi-Camera Face Tracking

Occlusion Robust Multi-Camera Face Tracking Occlusion Robust Multi-Camera Face Tracking Josh Harguess, Changbo Hu, J. K. Aggarwal Computer & Vision Research Center / Department of ECE The University of Texas at Austin harguess@utexas.edu, changbo.hu@gmail.com,

More information

Eye Detection by Haar wavelets and cascaded Support Vector Machine

Eye Detection by Haar wavelets and cascaded Support Vector Machine Eye Detection by Haar wavelets and cascaded Support Vector Machine Vishal Agrawal B.Tech 4th Year Guide: Simant Dubey / Amitabha Mukherjee Dept of Computer Science and Engineering IIT Kanpur - 208 016

More information

3D Computer Vision. Structured Light II. Prof. Didier Stricker. Kaiserlautern University.

3D Computer Vision. Structured Light II. Prof. Didier Stricker. Kaiserlautern University. 3D Computer Vision Structured Light II Prof. Didier Stricker Kaiserlautern University http://ags.cs.uni-kl.de/ DFKI Deutsches Forschungszentrum für Künstliche Intelligenz http://av.dfki.de 1 Introduction

More information

Vehicle Dimensions Estimation Scheme Using AAM on Stereoscopic Video

Vehicle Dimensions Estimation Scheme Using AAM on Stereoscopic Video Workshop on Vehicle Retrieval in Surveillance (VRS) in conjunction with 2013 10th IEEE International Conference on Advanced Video and Signal Based Surveillance Vehicle Dimensions Estimation Scheme Using

More information

Textureless Layers CMU-RI-TR Qifa Ke, Simon Baker, and Takeo Kanade

Textureless Layers CMU-RI-TR Qifa Ke, Simon Baker, and Takeo Kanade Textureless Layers CMU-RI-TR-04-17 Qifa Ke, Simon Baker, and Takeo Kanade The Robotics Institute Carnegie Mellon University 5000 Forbes Avenue Pittsburgh, PA 15213 Abstract Layers are one of the most well

More information

A face recognition system based on local feature analysis

A face recognition system based on local feature analysis A face recognition system based on local feature analysis Stefano Arca, Paola Campadelli, Raffaella Lanzarotti Dipartimento di Scienze dell Informazione Università degli Studi di Milano Via Comelico, 39/41

More information

Module 4F12: Computer Vision and Robotics Solutions to Examples Paper 2

Module 4F12: Computer Vision and Robotics Solutions to Examples Paper 2 Engineering Tripos Part IIB FOURTH YEAR Module 4F2: Computer Vision and Robotics Solutions to Examples Paper 2. Perspective projection and vanishing points (a) Consider a line in 3D space, defined in camera-centered

More information

REAL-TIME FACE SWAPPING IN VIDEO SEQUENCES: MAGIC MIRROR

REAL-TIME FACE SWAPPING IN VIDEO SEQUENCES: MAGIC MIRROR REAL-TIME FACE SWAPPING IN VIDEO SEQUENCES: MAGIC MIRROR Nuri Murat Arar1, Fatma Gu ney1, Nasuh Kaan Bekmezci1, Hua Gao2 and Hazım Kemal Ekenel1,2,3 1 Department of Computer Engineering, Bogazici University,

More information

Silhouette-based Multiple-View Camera Calibration

Silhouette-based Multiple-View Camera Calibration Silhouette-based Multiple-View Camera Calibration Prashant Ramanathan, Eckehard Steinbach, and Bernd Girod Information Systems Laboratory, Electrical Engineering Department, Stanford University Stanford,

More information

Using temporal seeding to constrain the disparity search range in stereo matching

Using temporal seeding to constrain the disparity search range in stereo matching Using temporal seeding to constrain the disparity search range in stereo matching Thulani Ndhlovu Mobile Intelligent Autonomous Systems CSIR South Africa Email: tndhlovu@csir.co.za Fred Nicolls Department

More information

Structure from Motion and Multi- view Geometry. Last lecture

Structure from Motion and Multi- view Geometry. Last lecture Structure from Motion and Multi- view Geometry Topics in Image-Based Modeling and Rendering CSE291 J00 Lecture 5 Last lecture S. J. Gortler, R. Grzeszczuk, R. Szeliski,M. F. Cohen The Lumigraph, SIGGRAPH,

More information

Vision Review: Image Formation. Course web page:

Vision Review: Image Formation. Course web page: Vision Review: Image Formation Course web page: www.cis.udel.edu/~cer/arv September 10, 2002 Announcements Lecture on Thursday will be about Matlab; next Tuesday will be Image Processing The dates some

More information

Structure from Motion

Structure from Motion Structure from Motion Outline Bundle Adjustment Ambguities in Reconstruction Affine Factorization Extensions Structure from motion Recover both 3D scene geoemetry and camera positions SLAM: Simultaneous

More information

On the Dimensionality of Deformable Face Models

On the Dimensionality of Deformable Face Models On the Dimensionality of Deformable Face Models CMU-RI-TR-06-12 Iain Matthews, Jing Xiao, and Simon Baker The Robotics Institute Carnegie Mellon University 5000 Forbes Avenue Pittsburgh, PA 15213 Abstract

More information

Agenda. Rotations. Camera models. Camera calibration. Homographies

Agenda. Rotations. Camera models. Camera calibration. Homographies Agenda Rotations Camera models Camera calibration Homographies D Rotations R Y = Z r r r r r r r r r Y Z Think of as change of basis where ri = r(i,:) are orthonormal basis vectors r rotated coordinate

More information

TEXTURE OVERLAY ONTO NON-RIGID SURFACE USING COMMODITY DEPTH CAMERA

TEXTURE OVERLAY ONTO NON-RIGID SURFACE USING COMMODITY DEPTH CAMERA TEXTURE OVERLAY ONTO NON-RIGID SURFACE USING COMMODITY DEPTH CAMERA Tomoki Hayashi 1, Francois de Sorbier 1 and Hideo Saito 1 1 Graduate School of Science and Technology, Keio University, 3-14-1 Hiyoshi,

More information

CSE 252B: Computer Vision II

CSE 252B: Computer Vision II CSE 252B: Computer Vision II Lecturer: Serge Belongie Scribe: Sameer Agarwal LECTURE 1 Image Formation 1.1. The geometry of image formation We begin by considering the process of image formation when a

More information

Project Updates Short lecture Volumetric Modeling +2 papers

Project Updates Short lecture Volumetric Modeling +2 papers Volumetric Modeling Schedule (tentative) Feb 20 Feb 27 Mar 5 Introduction Lecture: Geometry, Camera Model, Calibration Lecture: Features, Tracking/Matching Mar 12 Mar 19 Mar 26 Apr 2 Apr 9 Apr 16 Apr 23

More information

FACE RECOGNITION USING INDEPENDENT COMPONENT

FACE RECOGNITION USING INDEPENDENT COMPONENT Chapter 5 FACE RECOGNITION USING INDEPENDENT COMPONENT ANALYSIS OF GABORJET (GABORJET-ICA) 5.1 INTRODUCTION PCA is probably the most widely used subspace projection technique for face recognition. A major

More information

Mysteries of Parameterizing Camera Motion - Part 1

Mysteries of Parameterizing Camera Motion - Part 1 Mysteries of Parameterizing Camera Motion - Part 1 Instructor - Simon Lucey 16-623 - Advanced Computer Vision Apps Today Motivation SO(3) Convex? Exponential Maps SL(3) Group. Adapted from: Computer vision:

More information

Full-Motion Recovery from Multiple Video Cameras Applied to Face Tracking and Recognition

Full-Motion Recovery from Multiple Video Cameras Applied to Face Tracking and Recognition Full-Motion Recovery from Multiple Video Cameras Applied to Face Tracking and Recognition Josh Harguess, Changbo Hu, J. K. Aggarwal Computer & Vision Research Center / Department of ECE The University

More information

Compositing a bird's eye view mosaic

Compositing a bird's eye view mosaic Compositing a bird's eye view mosaic Robert Laganiere School of Information Technology and Engineering University of Ottawa Ottawa, Ont KN 6N Abstract This paper describes a method that allows the composition

More information

3D Morphable Model Parameter Estimation

3D Morphable Model Parameter Estimation 3D Morphable Model Parameter Estimation Nathan Faggian 1, Andrew P. Paplinski 1, and Jamie Sherrah 2 1 Monash University, Australia, Faculty of Information Technology, Clayton 2 Clarity Visual Intelligence,

More information

Real Time Face Tracking and Pose Estimation Using an Adaptive Correlation Filter for Human-Robot Interaction

Real Time Face Tracking and Pose Estimation Using an Adaptive Correlation Filter for Human-Robot Interaction Real Time Face Tracking and Pose Estimation Using an Adaptive Correlation Filter for Human-Robot Interaction Vo Duc My and Andreas Zell Abstract In this paper, we present a real time algorithm for mobile

More information

Face Alignment Across Large Poses: A 3D Solution

Face Alignment Across Large Poses: A 3D Solution Face Alignment Across Large Poses: A 3D Solution Outline Face Alignment Related Works 3D Morphable Model Projected Normalized Coordinate Code Network Structure 3D Image Rotation Performance on Datasets

More information

Ruch (Motion) Rozpoznawanie Obrazów Krzysztof Krawiec Instytut Informatyki, Politechnika Poznańska. Krzysztof Krawiec IDSS

Ruch (Motion) Rozpoznawanie Obrazów Krzysztof Krawiec Instytut Informatyki, Politechnika Poznańska. Krzysztof Krawiec IDSS Ruch (Motion) Rozpoznawanie Obrazów Krzysztof Krawiec Instytut Informatyki, Politechnika Poznańska 1 Krzysztof Krawiec IDSS 2 The importance of visual motion Adds entirely new (temporal) dimension to visual

More information

Multiple View Geometry

Multiple View Geometry Multiple View Geometry CS 6320, Spring 2013 Guest Lecture Marcel Prastawa adapted from Pollefeys, Shah, and Zisserman Single view computer vision Projective actions of cameras Camera callibration Photometric

More information

Face View Synthesis Across Large Angles

Face View Synthesis Across Large Angles Face View Synthesis Across Large Angles Jiang Ni and Henry Schneiderman Robotics Institute, Carnegie Mellon University, Pittsburgh, PA 1513, USA Abstract. Pose variations, especially large out-of-plane

More information