Image-based Motion-driven Facial Texture

Bing Zhang, Hai Tao, and Alex Pang
Computer Science and Computer Engineering, University of California, Santa Cruz

1 INTRODUCTION

Facial animation is a fundamental and difficult topic in computer graphics. Its applications include character animation, computer games, video conferencing, video telephony, and avatars. The difficulty of facial animation stems mainly from the following factors: complex geometric face models; countless creases and wrinkles; delicate nuances in color and texture; complicated facial movements; and the human instinct for detecting subtle variations in facial expressions. Approaches to facial animation fall into two major categories: geometric manipulation and image manipulation. Geometric manipulation includes keyframing and geometric interpolation, parameterization, physics-based muscle modeling, and pseudo or simulated muscle modeling. Image manipulation includes image morphing, texture manipulation, image blending, and vascular expressions. Because of the complicated facial structure and movements, geometric manipulation is hard to control well enough to produce realistic animation; image manipulation can be combined with it to improve the result.

1.1 Related work

Volker Blanz and Thomas Vetter [1] introduced a technique for modeling textured 3D faces. Starting from a set of 3D face models, they derive a morphable face model by transforming the shape and texture of the models into a vector space representation. New faces and expressions can be modeled by forming linear combinations of the prototypes: 3D faces can either be generated from one or more photographs, or modeled directly through the user interface, and shape and texture can be modified in a natural way. Frederic Pighin et al. [6] present a technique for creating photorealistic textured 3D facial models from photographs of a human subject, and for creating smooth transitions between different facial expressions by morphing between these models. Starting from several uncalibrated views, they deform a generic 3D face model with recovered 3D information, then extract texture from the images for natural-looking animations. Instead of creating new facial animations from scratch for each new model, Junyong Noh and Ulrich Neumann [7] reuse existing animation data in the form of vertex motion vectors, transferring them to target models with different geometric proportions and mesh structures. Facial expressions exhibit not only facial feature motions, but also changes in illumination and appearance caused by changing surface normals. Zicheng Liu et al. [4] proposed a skin-color-independent expression ratio image (ERI) to capture the illumination change of one person's expression and map it to other persons.

1.2 IBMDT

We propose image-based motion-driven textures (IBMDT), as shown in Fig. 1. We are interested not only in recovering realistic facial animation, but also in explicit descriptors of motion vertices and motion textures, so that facial animation can be generated on the fly.

IBMDT avoids the complicated mechanics behind physical face structures. Using only single-view video sequences, IBMDT recovers rigid and non-rigid motions and finds an expression function from motion vertices to motion textures, i.e., motion-driven textures. The project is under construction. The rest of this report is organized as follows: Section 2 describes preprocessing; Section 3 covers registration; Section 4 discusses the motion capture process; Section 5 presents motion-driven textures; and Section 6 concludes and outlines future work.

Fig. 1. IBMDT structure

2 PREPROCESSING

2.1 Data Collection

We obtained a set of laser-scanned 3D face models from the Experimental Psychology lab at UCSC. Focusing on Trevor Chen's neutral face model, we captured his facial expressions with three camcorders at different angles. Note that the 3D models and the video sequences were taken about three months apart. Starting with a neutral expression, each segment of video shows a different facial expression.

The camcorders were not mechanically synchronized, so we used a laser pointer to help align the segments of video around the same point in time, though currently only the sequences from the front view are used in our experiments. Markers were stuck on the face around wrinkled areas so that we could obtain both vertex motions and texture motions from the single-view image sequences.

2.2 Mesh Simplification

The mesh of the 3D face model is very dense: it has 191,594 vertices and 383,184 polygons. To speed up rendering, we used a mesh simplification package, Michael Garland's QSlim 2.0 [2]. It is very efficient, though one small problem is that feature points on the mesh cannot be selectively preserved. In addition, we wrote code to cut unnecessary polygons by thresholding and to delete unneeded ones by interactive vertex picking; extra work was needed to convert between data formats. After simplification, for example, the mesh is reduced to 4,811 vertices and 9,203 polygons.

2.3 Graphical User Interface

We built a user-friendly interface (Fig. 2). There are two windows: one for the 3D face model, the graphics window (GW), on the left, and one for the image, the image window (IMG), on the right. A mapping can be set up between 3D vertices in GW and 2D pixels in IMG. We can also pick vertices in GW or mark pixels in IMG and export them. Other major functions of the GUI include editing the mapping list, synchronized 3D-2D matching, texture mapping, and jumping from frame to frame or playing an animation.

Fig. 2. Graphical user interface: graphics window (left) vs. image window (right)

3 REGISTRATION

At this stage, we align the 3D mesh with the captured 2D images. Starting from the neutral face, we match 3D feature vertices in the graphics window with corresponding 2D feature pixels in the image window by interactively picking and marking (see Fig. 2). These vertex and pixel feature points form a set of 3D-2D pairs, from which we calibrate the camera with a modified version of Tsai's algorithm and recover the camera's intrinsic and extrinsic parameters. We further optimize the camera's rotation matrix with gradient descent and Newton's method. Finally, with the calibrated and optimized camera projection model, we reproject all vertices onto the image to find their corresponding texture coordinates, in order to perform texture mapping on the 3D model.

3.1 Tsai's camera calibration

Assume the projection model $\mathbf{p} = K(R\mathbf{W} + T)$, where $\mathbf{p}$ is the image pixel coordinate, $\mathbf{W}$ is the 3D world coordinate, $K$ is the camera matrix, $R$ is the camera rotation matrix, and $T$ is the camera translation vector. Given matched 3D-2D feature point pairs, we try to recover the camera's intrinsic and extrinsic parameters. For the intrinsics, camera distortion is ignored and the principal point is taken to be the center of the image, so with pixel coordinates measured relative to the principal point the camera matrix is

$$K = \begin{bmatrix} f_x & 0 & 0 \\ 0 & f_y & 0 \\ 0 & 0 & 1 \end{bmatrix}.$$

The camera's extrinsic parameters are the translation vector $T = (t_x, t_y, t_z)^T$ and the rotation matrix $R$ with rows $\mathbf{r}_1, \mathbf{r}_2, \mathbf{r}_3$. Let $\mathbf{W} = (X, Y, Z)^T$ be the 3D world coordinate of a point and $(u, v)$ the corresponding 2D pixel coordinate on the image. Then

$$u = f_x \, \frac{\mathbf{r}_1 \cdot \mathbf{W} + t_x}{\mathbf{r}_3 \cdot \mathbf{W} + t_z}, \qquad (1)$$

$$v = f_y \, \frac{\mathbf{r}_2 \cdot \mathbf{W} + t_y}{\mathbf{r}_3 \cdot \mathbf{W} + t_z}. \qquad (2)$$

Dividing Eq. (1) by Eq. (2) eliminates the depth term. With the aspect ratio denoted $s = f_x / f_y$, this yields

$$u\,(\mathbf{r}_2 \cdot \mathbf{W} + t_y) = s\, v\,(\mathbf{r}_1 \cdot \mathbf{W} + t_x). \qquad (3)$$

Then Tsai's camera calibration algorithm can be implemented in the following steps:

1. Construct $A$, where each matched pair contributes one row $[\,vX,\ vY,\ vZ,\ v,\ -uX,\ -uY,\ -uZ,\ -u\,]$, so that $A\mathbf{v} = 0$ holds for the unknown vector $\mathbf{v} = (s\,r_{11},\ s\,r_{12},\ s\,r_{13},\ s\,t_x,\ r_{21},\ r_{22},\ r_{23},\ t_y)^T / t_y$.
2. Solve $A\mathbf{v} = 0$ up to scale by taking the eigenvector corresponding to the smallest eigenvalue of $A^TA$.
3. Recover $|t_y|$ and the aspect ratio $s$ from the norms of the recovered entries of $\mathbf{r}_1$ and $\mathbf{r}_2$, using the fact that each row of a rotation matrix has unit norm.
4. Obtain the rotation matrix $R$: divide out the recovered scales to get its first two rows, and take their cross product as the third row.
5. To make $R$ orthonormal, compute the singular value decomposition $R = UDV^T$ and update $R = UIV^T$, where $I$ is the $3 \times 3$ identity matrix.
6. Determine the sign of $t_y$ by checking the reprojection of a reference point.
7. Given $R$, $t_x$, and $t_y$, construct a linear system $F\mathbf{x} = \mathbf{b}$ in the remaining unknowns $f$ and $t_z$, one equation per point pair. The least-squares solution is $\mathbf{x} = (F^TF)^{-1}F^T\mathbf{b}$.
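Steps 2 and 5 reduce to standard linear-algebra operations. The following is a minimal numpy sketch of both (the function names are ours, not from the original implementation):

```python
import numpy as np

def solve_homogeneous(A):
    # Step 2: solve A v = 0 in the least-squares sense by taking the
    # eigenvector of A^T A associated with its smallest eigenvalue.
    eigvals, eigvecs = np.linalg.eigh(A.T @ A)
    return eigvecs[:, 0]        # eigh returns eigenvalues in ascending order

def orthonormalize(R):
    # Step 5: project an approximate rotation onto the closest orthonormal
    # matrix via SVD: R = U D V^T  ->  R' = U I V^T.
    U, _, Vt = np.linalg.svd(R)
    return U @ Vt
```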

3.2 Implementation and Modification

With the camera matrix K, rotation matrix R, and translation vector T known, we reproject all vertices of the 3D face mesh onto the image plane with the equation $\mathbf{p} = K(R\mathbf{W} + T)$. Comparing the original pixel coordinates with the reprojected ones for the 3D-2D pairs reveals some problems in Tsai's calibration algorithm. First, the algorithm is very sensitive to the choice and accuracy of the matched feature point pairs.

Second, we get unexpected negative texture coordinates from the reprojection. For the experimental set of 3D-2D pairs (Fig. 2), the mean square error is as high as 218 pixels. At first we tried to normalize the 3D world coordinates and 2D pixel coordinates to improve the reprojection, but it made little difference. We then obtained the camera matrix K from the OpenCV camera calibration filter instead of from Tsai's algorithm, because Zhengyou Zhang's calibration pattern [8] was used when we recorded the video sequences. The filter collected 30 frames of the calibration pattern, then output the focal lengths, distortion factors, and principal point. We modified Tsai's algorithm to use OpenCV's focal lengths. The result is better: the mean square reprojection error drops to 1.56 pixels. But it is still not satisfactory, even with the later optimization step that improves the rotation matrix R. We suspected that the orthonormalization of R in Tsai's algorithm (step 5) might prevent us from finding a well-fitting projection, so we recalculated R without orthonormalization. This turns out to work well for both Tsai's original algorithm and the version modified with OpenCV camera parameters: the mean square reprojection error becomes 0.55 pixels, which shows that step 5 really is a blind step. However, the resulting R is NOT orthonormal, as it is required to be, so further action is needed.

The rotation matrix R is composed from three angles $\alpha$, $\beta$, and $\gamma$, the rotations around the X, Y, and Z axes:

$$R = R_x(\alpha)\,R_y(\beta)\,R_z(\gamma) = \begin{bmatrix} \cos\beta\cos\gamma & -\cos\beta\sin\gamma & \sin\beta \\ \cos\alpha\sin\gamma + \sin\alpha\sin\beta\cos\gamma & \cos\alpha\cos\gamma - \sin\alpha\sin\beta\sin\gamma & -\sin\alpha\cos\beta \\ \sin\alpha\sin\gamma - \cos\alpha\sin\beta\cos\gamma & \sin\alpha\cos\gamma + \cos\alpha\sin\beta\sin\gamma & \cos\alpha\cos\beta \end{bmatrix}$$

From this matrix we can easily derive $\beta$ from the entry $r_{13} = \sin\beta$. The choice for deriving $\alpha$ and $\gamma$ is a little tricky. Intuitively, the matrix R should be close to orthonormal, and the entries on the main diagonal should best represent its three component angles. Thus we choose to derive $\gamma$ from $r_{11} = \cos\beta\cos\gamma$ and $\alpha$ from $r_{33} = \cos\alpha\cos\beta$. The exception is the case $\cos\beta = 0$; however, such a case is rare, and it can easily be avoided by adjusting the camera's angle. If we then recalculate R from the derived $\alpha$, $\beta$, and $\gamma$ using the matrix above, R does not lose much of its own identity, and it is guaranteed to be orthonormal. Therefore, instead of step 5 in Tsai's algorithm, we update the rotation matrix R with the newly derived $\alpha$, $\beta$, and $\gamma$. The reprojection result turns out to be very good, and experiments show that this choice of entries for deriving $\alpha$ and $\gamma$ works best.
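As a sketch (our own helpers, not the authors' code), the angles can be read off the matrix above; using atan2 with the adjacent off-diagonal entries keeps the signs that a plain arccos of the diagonal entries would lose:

```python
import numpy as np

def euler_angles(R):
    # Derive (alpha, beta, gamma) from R = Rx(alpha) Ry(beta) Rz(gamma):
    # r13 = sin(beta), r11 = cos(beta) cos(gamma), r33 = cos(alpha) cos(beta).
    # atan2 with r12 = -cos(beta) sin(gamma) and r23 = -sin(alpha) cos(beta)
    # recovers the signs as well; assumes cos(beta) != 0.
    beta = np.arcsin(R[0, 2])
    gamma = np.arctan2(-R[0, 1], R[0, 0])
    alpha = np.arctan2(-R[1, 2], R[2, 2])
    return alpha, beta, gamma

def rebuild_R(alpha, beta, gamma):
    # Recompose an exactly orthonormal R from the derived angles
    # (the replacement for step 5 of Tsai's algorithm).
    ca, sa = np.cos(alpha), np.sin(alpha)
    cb, sb = np.cos(beta), np.sin(beta)
    cg, sg = np.cos(gamma), np.sin(gamma)
    Rx = np.array([[1, 0, 0], [0, ca, -sa], [0, sa, ca]])
    Ry = np.array([[cb, 0, sb], [0, 1, 0], [-sb, 0, cb]])
    Rz = np.array([[cg, -sg, 0], [sg, cg, 0], [0, 0, 1]])
    return Rx @ Ry @ Rz
```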

3.3 Optimization

Optimization is also used to improve the rotation matrix R with respect to $\alpha$, $\beta$, and $\gamma$. For simplicity, we write Eq. (1) and Eq. (2) in terms of normalized pixel coordinates:

$$\hat{x} = \frac{r_{11}X + r_{12}Y + r_{13}Z + t_x}{r_{31}X + r_{32}Y + r_{33}Z + t_z}, \qquad (4)$$

$$\hat{y} = \frac{r_{21}X + r_{22}Y + r_{23}Z + t_y}{r_{31}X + r_{32}Y + r_{33}Z + t_z}, \qquad (5)$$

where the entries $r_{ij}$ are the trigonometric expressions in $\alpha$, $\beta$, and $\gamma$ given by the rotation matrix above.

Improving R then becomes the minimization problem

$$\min_{\alpha,\beta,\gamma} f(\alpha,\beta,\gamma), \qquad (6)$$

$$f(\alpha,\beta,\gamma) = \sum_i \left[ (\hat{x}_i - x_i)^2 + (\hat{y}_i - y_i)^2 \right], \qquad (7)$$

where $(\hat{x}_i, \hat{y}_i)$ is the reprojected normalized pixel coordinate after camera calibration, and $(x_i, y_i)$ is the corresponding original normalized pixel coordinate from the picked 3D-2D pairs. At the beginning of the optimization, gradient descent is used; Newton's method is applied once $\alpha$, $\beta$, and $\gamma$ are close to the solution. The optimization provides one order of magnitude reduction in the mean square error.

Method of Gradient Descent. Suppose the function $f(\mathbf{x})$ to be minimized has continuous first partial derivatives. The method of gradient descent is defined by the iterative algorithm

$$\mathbf{x}_{k+1} = \mathbf{x}_k - \lambda_k \nabla f(\mathbf{x}_k), \qquad (8)$$

where $\lambda_k$ is a non-negative scalar minimizing $f(\mathbf{x}_k - \lambda_k \nabla f(\mathbf{x}_k))$: from the point $\mathbf{x}_k$, we search along the direction of the negative gradient to a minimum point on that line. Applying gradient descent to Eq. (7), we have

$$\nabla f = \left( \frac{\partial f}{\partial \alpha},\ \frac{\partial f}{\partial \beta},\ \frac{\partial f}{\partial \gamma} \right)^T. \qquad (9)$$

The rotation matrix R can thus be improved by iteratively updating its three component angles with

$$(\alpha, \beta, \gamma)_{k+1} = (\alpha, \beta, \gamma)_k - \lambda_k \nabla f(\alpha_k, \beta_k, \gamma_k). \qquad (10)$$

Newton's Method. With Newton's method, the function being minimized is approximated locally by a quadratic function, and this approximate function is minimized exactly. Near $\mathbf{x}_k$, $f$ can be approximated by the truncated Taylor series

$$f(\mathbf{x}) \approx f(\mathbf{x}_k) + \nabla f(\mathbf{x}_k)^T (\mathbf{x} - \mathbf{x}_k) + \tfrac{1}{2} (\mathbf{x} - \mathbf{x}_k)^T H(\mathbf{x}_k) (\mathbf{x} - \mathbf{x}_k). \qquad (11)$$

The right-hand side is minimized at

$$\mathbf{x}_{k+1} = \mathbf{x}_k - H(\mathbf{x}_k)^{-1} \nabla f(\mathbf{x}_k). \qquad (12)$$

Eq. (12) is the pure form of Newton's method. Assume that at a relative minimum point the Hessian matrix $H$ is positive definite; then if $f$ has continuous second partial derivatives, Newton's method is well defined near the solution. We therefore use Newton's method at the second optimization level for the rotation matrix R when the Hessian matrix is positive definite. When it is not, a scaled portion of the identity matrix is added to the Hessian to make it positive definite. Applying Newton's method to Eq. (7) gives

$$H = \begin{bmatrix} \partial^2 f/\partial\alpha^2 & \partial^2 f/\partial\alpha\,\partial\beta & \partial^2 f/\partial\alpha\,\partial\gamma \\ \partial^2 f/\partial\beta\,\partial\alpha & \partial^2 f/\partial\beta^2 & \partial^2 f/\partial\beta\,\partial\gamma \\ \partial^2 f/\partial\gamma\,\partial\alpha & \partial^2 f/\partial\gamma\,\partial\beta & \partial^2 f/\partial\gamma^2 \end{bmatrix}, \qquad (13)$$

and the rotation matrix R is then improved by iteratively updating its three component angles with

$$(\alpha, \beta, \gamma)_{k+1} = (\alpha, \beta, \gamma)_k - H^{-1} \nabla f(\alpha_k, \beta_k, \gamma_k). \qquad (14)$$
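A compact numerical sketch of this two-stage scheme, using central-difference gradients and Hessians (our own illustration; the report gives no code, a fixed step size stands in for the line search, and the iteration counts are placeholders):

```python
import numpy as np

def num_grad(f, t, h=1e-5):
    # Central-difference gradient of f at the parameter vector t.
    g = np.zeros_like(t)
    for i in range(len(t)):
        e = np.zeros_like(t); e[i] = h
        g[i] = (f(t + e) - f(t - e)) / (2 * h)
    return g

def num_hess(f, t, h=1e-4):
    # Central-difference Hessian, built column by column from gradients.
    I = np.eye(len(t))
    return np.column_stack([
        (num_grad(f, t + h * I[i]) - num_grad(f, t - h * I[i])) / (2 * h)
        for i in range(len(t))])

def refine_angles(f, t, gd_iters=200, newton_iters=20, lr=1e-3):
    # Stage 1: gradient descent (Eq. 10) while far from the solution.
    for _ in range(gd_iters):
        t = t - lr * num_grad(f, t)
    # Stage 2: Newton steps (Eq. 14); if H is not positive definite,
    # add a scaled identity matrix to make it so.
    for _ in range(newton_iters):
        H = num_hess(f, t)
        lam = np.min(np.linalg.eigvalsh(H))
        if lam <= 0:
            H = H + (1e-3 - lam) * np.eye(len(t))
        t = t - np.linalg.solve(H, num_grad(f, t))
    return t
```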

3.4 Implementation Results

Fig. 3. Registration result

Fig. 3 shows the final result of the registration stage. The same purple marks on the bare 3D model (left) and the texture-mapped 3D model (right) demonstrate that the texture is well aligned with the 3D mesh. Note that the original image was shot from the front, so the texture mapping here is not view-dependent.

4 MOTION CAPTURE

The facial motion in the image can be classified into rigid motion and non-rigid motion. Rigid motion is caused by body and/or head movements with respect to the camera,

and non-rigid motion is caused by the movements of facial muscles, which also produce the different facial expressions. To capture both rigid and non-rigid motions, we put markers on the subject's face surrounding the highly wrinkled areas. By tracking those markers and some additional feature points from frame to frame, we can recover the rigid and non-rigid motions; vertex motions from frame to frame can then be derived with Gaussian basis functions. The steps at this stage are:

1. Detect markers (optional);
2. Determine feature points for rigid motion, either by interactively picking them in the GUI or automatically from marker detection. These feature points should be relatively still; they may be markers or non-markers;
3. Track the rigid-motion feature points frame by frame;
4. Determine feature points for non-rigid motion, again either by interactive picking or by marker detection. These feature points should be relatively active; they too may be markers or non-markers;
5. Track the non-rigid-motion feature points frame by frame;
6. Interpolate vertex motions with Gaussian basis functions.

Note that the picked or detected feature points may not align exactly with mesh vertices. In such cases, we snap them to the closest pixels that sit right on 3D vertices.

4.1 Marker Detection

The markers used are red, green, and blue. Because perceived color is better represented in HSV color space than in RGB, we first convert the image from RGB to HSV. By sampling some red, green, and blue markers, we obtain means and variances in HSV space for each color set. We then impose low and high thresholds based on these means and variances to detect the red, green, and blue markers, as sketched below. Experiments show that detecting with only one channel is not enough, so we use all three channels: hue, saturation, and value. One implementation problem is that red markers are harder to detect because of the skin tone; some dark spots on the face may also be misdetected as red markers. On the other hand, green markers are easier to detect than the other two kinds.
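A minimal sketch of such a per-color threshold test, assuming sampled HSV means and standard deviations and a hypothetical tolerance factor k (the report does not give its exact thresholds):

```python
import cv2
import numpy as np

def detect_markers(img_bgr, mean_hsv, std_hsv, k=2.5):
    # Keep pixels whose hue, saturation, AND value all fall within
    # k standard deviations of the sampled marker statistics.
    hsv = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2HSV).astype(np.float32)
    lo = np.asarray(mean_hsv) - k * np.asarray(std_hsv)
    hi = np.asarray(mean_hsv) + k * np.asarray(std_hsv)
    return np.all((hsv >= lo) & (hsv <= hi), axis=-1)  # boolean mask
```

One mask is computed per marker color; the connected components of each mask give the marker locations.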

4.2 Tracking

Starting with the neutral frame, tracking of the feature points is performed frame by frame. For a feature point in the i-th frame, we take a correlation window of size (2n+1) by (2n+1) centered at that point, then select a search area of size (2m+1) by (2m+1) around the same location in the next frame. Comparing the correlation window in the i-th frame with each correlation window inside the search area of the following frame, we pick the one with the largest correlation coefficient and mark it as the match for the feature point. Experiments show that the sizes of the correlation window and of the search area are both critical to tracking quality. A good correlation window size is close to the marker's size. The search area cannot be too large, because the tracker might then be confused by other nearby markers, and the search becomes time-consuming; on the other hand, it cannot be too small, because the tracked points might move out of it. When tracking for rigid motion, the initial set of feature points is critical: we favor relatively still areas such as the ears, the inner eye corners, and the tip of the nose. When tracking for non-rigid motion, by contrast, we favor relatively active areas, which usually surround wrinkles. In either case, the selected feature points should not be nearly coplanar. Fig. 4 shows a series of frames with tracked feature points; the matching step is sketched below.
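A straightforward sketch of the window search with the normalized correlation coefficient (our own illustration; the window sizes n and m are placeholders):

```python
import numpy as np

def track_point(prev, curr, x, y, n=5, m=12):
    # Match the (2n+1)x(2n+1) window around (x, y) in the previous frame
    # against every candidate center in a (2m+1)x(2m+1) search area of the
    # current frame; return the center with the highest normalized
    # cross-correlation coefficient.
    tpl = prev[y - n:y + n + 1, x - n:x + n + 1].astype(np.float64)
    tpl = tpl - tpl.mean()
    best_score, best = -np.inf, (x, y)
    for dy in range(-m, m + 1):
        for dx in range(-m, m + 1):
            if y + dy - n < 0 or x + dx - n < 0:   # window leaves the image
                continue
            win = curr[y + dy - n:y + dy + n + 1,
                       x + dx - n:x + dx + n + 1].astype(np.float64)
            if win.shape != tpl.shape:
                continue
            win = win - win.mean()
            denom = np.sqrt((tpl ** 2).sum() * (win ** 2).sum())
            if denom == 0:
                continue
            score = (tpl * win).sum() / denom
            if score > best_score:
                best_score, best = score, (x + dx, y + dy)
    return best
```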

Fig. 4. Tracking of feature points

4.3 Rigid Motion

Body and/or head movements are classified as rigid motions. Without loss of generality, we assume that rigid motions are caused by camera motion instead, so that there are no body or head movements at all. In other words, when only rigid motions are involved, the mesh vertices do not move, but their projections on the image plane may change as the camera moves. In Fig. 5 (a), W is a 3D vertex on the face mesh; as the camera moves from position C1 to position C2, W's projection on the image moves from p to p′.

Fig. 5. Rigid motion (a) vs. non-rigid motion (b)

We also assume that rigid motions are continuous and smooth from frame to frame, so a Taylor expansion can be used to derive the camera's motion.

Camera motion consists of translations along and rotations about the X, Y, and Z axes, $\mathbf{m} = (t_x, t_y, t_z, \alpha, \beta, \gamma)$. Following the notation of Eq. (4) and Eq. (5), each tracked point contributes two rows to the Jacobian matrix

$$J = \begin{bmatrix} \partial\hat{x}/\partial t_x & \partial\hat{x}/\partial t_y & \partial\hat{x}/\partial t_z & \partial\hat{x}/\partial\alpha & \partial\hat{x}/\partial\beta & \partial\hat{x}/\partial\gamma \\ \partial\hat{y}/\partial t_x & \partial\hat{y}/\partial t_y & \partial\hat{y}/\partial t_z & \partial\hat{y}/\partial\alpha & \partial\hat{y}/\partial\beta & \partial\hat{y}/\partial\gamma \end{bmatrix}. \qquad (15)$$

From frame k to frame k+1, truncating the expansion at first order (the second-order, Hessian term is taken to be zero), we have

$$\mathbf{p}_{k+1} - \mathbf{p}_k = J\,\Delta\mathbf{m}. \qquad (16)$$

Therefore, from Eq. (16), the camera's motion from frame k to frame k+1 is the least-squares solution over all tracked points,

$$\Delta\mathbf{m} = (J^TJ)^{-1}J^T(\mathbf{p}_{k+1} - \mathbf{p}_k). \qquad (17)$$
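In code, Eq. (17) is a single least-squares solve; a small numpy sketch, assuming the stacked Jacobian (two rows per tracked point) is already available:

```python
import numpy as np

def camera_motion(J, p_prev, p_next):
    # Eq. (17): solve J @ dm = (p_next - p_prev) in the least-squares sense
    # for the six motion parameters (tx, ty, tz, alpha, beta, gamma).
    # J: (2N x 6) stacked Jacobian; p_prev, p_next: (N x 2) tracked points.
    dp = (p_next - p_prev).ravel()
    dm, *_ = np.linalg.lstsq(J, dp, rcond=None)
    return dm
```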

4.4 Non-rigid Motion

At this step, we first track a set of feature points around the highly active motion areas of the face. As before, the image changes from frame to frame; besides rigid motions, there are now also non-rigid motions due to facial movements such as smiling or frowning. In Fig. 5 (b), we observe that the 3D mesh vertex moves from W to W′. Due to the combined effects of rigid and non-rigid motion, the image projection changes from p to p′ (the dashed lines in the figure are for comparison with the pure rigid-motion case in (a)). Note that we do not actually know the depth of W′, i.e., its 3D position; the only thing we are sure of is that W′ is located somewhere close to W along the projection line C2p′. Considering that many facial motions are along the face's tangent surface, we assume that W′ lies on the intersection of C2p′ with W's tangent plane, as shown in Fig. 6 (a), where N is vertex W's normal. One may argue about this heuristic, but it is simple, and we do not really need it to be accurate at this step, as we will see in the following sections.

Fig. 6. Heuristic solution to non-rigid motion: (a) tangent-plane intersection, (b) normal update

Given W = (X0, Y0, Z0) and its normal N = (Nx, Ny, Nz) for the k-th frame, together with the normalized pixel coordinate (PX, PY) from tracking and the camera motion (rigid motion) for the (k+1)-th frame, we express the ray and plane constraints in the camera frame of the (k+1)-th frame and construct

$$A = \begin{bmatrix} 1 & 0 & -P_X \\ 0 & 1 & -P_Y \\ N_x & N_y & N_z \end{bmatrix}, \qquad \mathbf{b} = \begin{bmatrix} 0 \\ 0 \\ N_x X_0 + N_y Y_0 + N_z Z_0 \end{bmatrix}.$$

The first two rows state that W′ projects to (PX, PY), and the third row states that W′ lies on W's tangent plane. The solution for W′ = (X, Y, Z) in the (k+1)-th frame is then $\mathbf{W}' = A^{-1}\mathbf{b}$.
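A direct numpy sketch of this solve (our own helper; it assumes W, N, and (PX, PY) are already expressed in the camera frame of frame k+1):

```python
import numpy as np

def nonrigid_vertex(W, N, px, py):
    # Intersect the viewing ray X = px * Z, Y = py * Z with the tangent
    # plane N . (W' - W) = 0 and return the new vertex position W'.
    A = np.array([[1.0, 0.0, -px],
                  [0.0, 1.0, -py],
                  [N[0], N[1], N[2]]])
    b = np.array([0.0, 0.0, float(np.dot(N, W))])
    return np.linalg.solve(A, b)
```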

Starting with the neutral face, the normals of the feature vertices in the first frame are directly available. But how do we estimate normals for the following frames, as the vertices move with non-rigid motion? We could repeatedly use the normals from the first frame, but that would let the new vertices drift further and further away from the face mesh. Instead, we use the heuristic shown in Fig. 6 (b), an enlarged version of (a). In the figure, the triangles represent the neutral face mesh, and the dashed line extended from C2 through p′ and W′ intersects one of the triangles at Q. We obtain Q's normal on that triangle by tri-linear interpolation, and by assigning this interpolated normal to W′ we make the new vertices follow the face skeleton to some extent. There are certainly other options: constraints may also be imposed on how far vertices can move, and when those limits are reached, motion along the normal may be considered as well.

4.5 Vertex Motion

Disregarding the complexity of facial muscle structures, vertex motions can be understood as a simulation of the real physical motions produced by different facial expressions. We use Gaussian basis functions to simplify the derivation of vertex motions.

The feature points picked for non-rigid motion tracking are used as control points for interpolating the vertex motions of the model mesh. Note that vertex motions do not involve rigid motions; they are caused by non-rigid motions only.

Fig. 7. Gaussian basis functions

Gaussian Basis Functions. As shown in Fig. 7, there are n control points, marked in blue, at their neutral positions $\mathbf{p}_1, \ldots, \mathbf{p}_n$. Their displacements from the neutral positions in one of the following frames are $\mathbf{d}_1, \ldots, \mathbf{d}_n$. The circles with Gaussian curves indicate that each control point affects its surrounding area in the manner of a Gaussian function. The red point $\mathbf{p}$ in the figure stands for any ordinary vertex on the mesh, and its motion is determined by

$$\mathbf{d}(\mathbf{p}) = \sum_{i=1}^{n} \mathbf{c}_i \, e^{-\|\mathbf{p} - \mathbf{p}_i\|^2 / (2\sigma^2)}. \qquad (18)$$

Here the $\mathbf{c}_i$ are the coefficients of the control points. They are derived by requiring Eq. (18) to reproduce the known displacements at the control points themselves,

$$\mathbf{d}_j = \sum_{i=1}^{n} \mathbf{c}_i \, e^{-\|\mathbf{p}_j - \mathbf{p}_i\|^2 / (2\sigma^2)}, \qquad j = 1, \ldots, n, \qquad (19)$$

or, collecting the Gaussian weights into an $n \times n$ matrix $G$ with $G_{ji} = e^{-\|\mathbf{p}_j - \mathbf{p}_i\|^2 / (2\sigma^2)}$ and stacking the displacements and coefficients into $D$ and $C$, $GC = D$. The solution for the coefficients is then $C = G^{-1}D$.
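A small numpy sketch of Eqs. (18) and (19) (our own helpers; σ is a width parameter whose value the report does not specify):

```python
import numpy as np

def rbf_coefficients(ctrl_pos, ctrl_disp, sigma):
    # Eq. (19): solve G C = D for the control-point coefficients, where
    # G[j, i] = exp(-||p_j - p_i||^2 / (2 sigma^2)).
    d2 = np.sum((ctrl_pos[:, None, :] - ctrl_pos[None, :, :]) ** 2, axis=-1)
    G = np.exp(-d2 / (2.0 * sigma ** 2))
    return np.linalg.solve(G, ctrl_disp)       # n x 3 coefficients

def vertex_motion(p, ctrl_pos, coeffs, sigma):
    # Eq. (18): displacement of an arbitrary mesh vertex p.
    d2 = np.sum((ctrl_pos - p) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2)) @ coeffs
```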

5 MOTION DRIVEN TEXTURES

Different facial expressions are caused by different, complicated muscle movements, while the corresponding textures are driven by different but simple vertex motions. To learn the relationship between textures and vertex motions, we train a neural network. In our experiments we use feedforward networks, trained with backpropagation, a gradient descent technique that propagates error computations backwards through the network [5]. Properly trained backpropagation networks tend to give reasonable answers when presented with inputs they have never seen: typically, a new input leads to an output similar to the correct output if the input vectors used in training are similar to the newly

presented input. This generalization property makes it possible to train a network on a representative set of input/target pairs and get good results without training the network on all possible input/output pairs. People's expressions are numerous, but they are highly correlated, and the same is true of vertex motions. A neural network therefore provides a convenient way for us to extract the driving force behind the vertex motions and textures. The steps in the training process are:

1. Assemble the training data. Because both the motion vertices and the textures are highly correlated, we first apply Principal Component Analysis (PCA) [3] to each of them to reduce redundancy. The network's input is derived from the motion vertices and its output from the textures.
2. Create the network object.
3. Train the network.
4. Simulate the network's response to new inputs.

This completes the mechanism for motion-driven textures: when we feed the network new motion vertices, we should get the new desired textures.

5.1 PCA on Vertex

Using the GUI, we pick n vertices in an area of interest (such as a highly wrinkled area) on the first neutral mesh, then stack their 3D world coordinates into a column vector, V1 = [x1, y1, z1, ..., xn, yn, zn]^T. For the following (m−1) frames, we know the 3D coordinates of the same vertex set from the non-rigid motions, and we stack them similarly into column vectors V2, ..., Vm. Because these m vectors are highly correlated, we apply principal component analysis to them, obtaining some number i of significant principal components, which can be far less than m. We can then reconstruct the vertex vectors from linear combinations of the i significant principal components without losing much fidelity. Part of the PCA transformation matrix, i.e., the coefficients of the linear combinations used for reconstruction, serves as input to the neural network, as sketched below. In current experiments, with four out of thirty principal components, the reconstruction error is already very small.
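A compact sketch of this reduction via SVD (our own helper, not the authors' code):

```python
import numpy as np

def pca_reduce(V, i):
    # V: d x m matrix whose columns V1..Vm are the stacked vertex vectors.
    # Returns the i significant principal components and the per-frame
    # coefficients that are fed to the neural network.
    mean = V.mean(axis=1, keepdims=True)
    U, S, Vt = np.linalg.svd(V - mean, full_matrices=False)
    basis = U[:, :i]                    # principal components
    coeffs = basis.T @ (V - mean)       # i x m reconstruction coefficients
    return basis, coeffs, mean

# Frame j is reconstructed as mean + basis @ coeffs[:, j].
```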

5.2 PCA on Texture Motion

Facial textures are also highly correlated, but PC analysis on textures is more complicated. To perform it, we first need textures of the same size. We therefore find the texture patch, i.e., the bounding box in the first neutral image of the projection of the picked vertices. The patch, shown as a pink pixel grid on the left side of Fig. 8, is the image projection of the partial mesh covered by the picked vertices (red and blue points). We then select the three vertices (red points on the left) whose projections are closest to the middle-top, middle-left, and middle-right of the texture patch, and use them as a basis for warping, i.e., an affine transformation.

Fig. 8. Warping

Due to non-rigid motions, the corresponding picked vertices in the following (m−1) frames may move, and so do their image projections (see Fig. 8, right). By aligning the three red-marked points in each following image with those in the first image (arrows), we can derive the warping (affine transformation) matrices against the first neutral frame.
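Three point correspondences determine the six parameters of a 2D affine transform exactly; a minimal numpy sketch (our own helper):

```python
import numpy as np

def affine_from_3_points(src, dst):
    # Solve for the 2x3 affine matrix M such that M @ [x, y, 1]^T maps
    # each source point onto its destination point.
    S = np.hstack([np.asarray(src, float), np.ones((3, 1))])   # 3x3
    D = np.asarray(dst, float)                                 # 3x2
    return np.linalg.solve(S, D).T                             # 2x3 matrix
```

Mapping each pixel of the neutral patch through this transform into a later frame and sampling the RGB values there implements the backward projection described next.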

Next, for each pixel of the texture patch in the first frame, we backward project into each of the following (m−1) frame images and assign the interpolated RGB values to it. In this way we obtain (m−1) texture patches for the following frames; they are all of the same size, though their content may be warped. Finally, as with the PCA on vertices, we stack each texture patch into a column vector of the form [r1, g1, b1, r2, g2, b2, ...]^T and apply PC analysis to these m vectors. Because they too are highly correlated, a few significant principal components again suffice for reconstruction with small mean square error. Part of the PCA transformation matrix, i.e., the coefficients of the linear combinations used for reconstruction, serves as the target of the neural network.

5.3 Neural Network

A neural network can have multiple layers; Fig. 9 shows an m-layer network. Each layer has a weight matrix and a bias vector. The transfer function may differ from layer to layer, as may the number of neurons, which determines the size of that layer's output. The input to the first layer is the training input P, and the output of the last layer is the training target T. Input vectors and the corresponding target vectors are used to train the network until it approximates the function associating the input vectors with the specific output vectors. Networks with biases, a sigmoid layer, and a linear output layer are capable of approximating any function with a finite number of discontinuities.
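As an illustration of such a network, here is a minimal two-layer sketch with a sigmoid hidden layer and a linear output layer, trained by plain gradient-descent backpropagation (the report itself uses the MATLAB Neural Network Toolbox [5]; the hidden-layer size, learning rate, and epoch count here are placeholders):

```python
import numpy as np

def train_network(P, T, hidden=10, lr=0.01, epochs=5000):
    # Two-layer feedforward network: sigmoid hidden layer, linear output.
    # P: n_in x m training inputs (vertex-motion PCA coefficients);
    # T: n_out x m training targets (texture PCA coefficients).
    rng = np.random.default_rng(0)
    W1 = rng.normal(scale=0.1, size=(hidden, P.shape[0]))
    b1 = np.zeros((hidden, 1))
    W2 = rng.normal(scale=0.1, size=(T.shape[0], hidden))
    b2 = np.zeros((T.shape[0], 1))
    m = P.shape[1]
    for _ in range(epochs):
        H = 1.0 / (1.0 + np.exp(-(W1 @ P + b1)))   # forward: hidden layer
        Y = W2 @ H + b2                            # forward: linear output
        E = Y - T                                  # output error
        dH = (W2.T @ E) * H * (1.0 - H)            # backpropagated error
        W2 -= lr / m * (E @ H.T); b2 -= lr / m * E.sum(1, keepdims=True)
        W1 -= lr / m * (dH @ P.T); b1 -= lr / m * dH.sum(1, keepdims=True)
    return W1, b1, W2, b2

def simulate(P_new, W1, b1, W2, b2):
    # Network response to new motion-vertex coefficients: new textures.
    H = 1.0 / (1.0 + np.exp(-(W1 @ P_new + b1)))
    return W2 @ H + b2
```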

Fig. 9. Neural network

5.4 Motion Driven Textures

We created both one-layer and two-layer neural networks. Although their training performance differs, there is not much difference in the simulation results. Feeding the trained network with motion vertices (reused, not new), we get back new textures, the motion-driven textures. Fig. 10 displays a sequence of motion-driven textures in the left eye corner area, which appear to match the real textures well. Note that because of memory limitations, our programs currently cannot generate bigger texture patches unless smaller ones are stitched together.

Fig. 10. Animation of motion-driven textures

6 CONCLUSIONS

Future work includes: more experiments on different models and different expressions; view-dependent texture mapping; improved tracking; exploring different algorithms for finding non-rigid motions; adapting the PCA to large data sets; building a facial texture database; and generating real-time animations.

ACKNOWLEDGMENTS

We would like to thank Trevor Chen, Michael M. Cohen, and Rashid Clark at Experimental Psychology, UCSC, for their help with data collection and for their suggestions.

References

[1] Volker Blanz and Thomas Vetter. A morphable model for the synthesis of 3D faces. Computer Graphics (ACM SIGGRAPH Proceedings), August 1999.
[2] Michael Garland. QSlim simplification software. garland/software/qslim.html.
[3] Rafael Gonzalez and Richard Woods. Digital Image Processing. Addison-Wesley.
[4] Zicheng Liu, Ying Shan, and Zhengyou Zhang. Expressive expression mapping with ratio images. Computer Graphics (ACM SIGGRAPH Proceedings), 2001.
[5] MathWorks. Neural Network Toolbox, version 4.
[6] F. Pighin, J. Hecker, D. Lischinski, R. Szeliski, and D. H. Salesin. Synthesizing realistic facial expressions from photographs. Computer Graphics (ACM SIGGRAPH Proceedings), pages 75–84, August 1998.

[7] Jun-yong Noh and Ulrich Neumann. Expression cloning. Computer Graphics (ACM SIGGRAPH Proceedings), 2001.
[8] Zhengyou Zhang. A flexible new technique for camera calibration. zhang/calib/.


Lecture 7: Image Morphing. Idea #2: Align, then cross-disolve. Dog Averaging. Averaging vectors. Idea #1: Cross-Dissolving / Cross-fading Lecture 7: Image Morphing Averaging vectors v = p + α (q p) = (1 - α) p + α q where α = q - v p α v (1-α) q p and q can be anything: points on a plane (2D) or in space (3D) Colors in RGB or HSV (3D) Whole

More information

Acquisition and Visualization of Colored 3D Objects

Acquisition and Visualization of Colored 3D Objects Acquisition and Visualization of Colored 3D Objects Kari Pulli Stanford University Stanford, CA, U.S.A kapu@cs.stanford.edu Habib Abi-Rached, Tom Duchamp, Linda G. Shapiro and Werner Stuetzle University

More information

Interactive Deformation with Triangles

Interactive Deformation with Triangles Interactive Deformation with Triangles James Dean Palmer and Ergun Akleman Visualization Sciences Program Texas A&M University Jianer Chen Department of Computer Science Texas A&M University Abstract In

More information

TEXTURE OVERLAY ONTO NON-RIGID SURFACE USING COMMODITY DEPTH CAMERA

TEXTURE OVERLAY ONTO NON-RIGID SURFACE USING COMMODITY DEPTH CAMERA TEXTURE OVERLAY ONTO NON-RIGID SURFACE USING COMMODITY DEPTH CAMERA Tomoki Hayashi 1, Francois de Sorbier 1 and Hideo Saito 1 1 Graduate School of Science and Technology, Keio University, 3-14-1 Hiyoshi,

More information

Structure from Motion and Multi- view Geometry. Last lecture

Structure from Motion and Multi- view Geometry. Last lecture Structure from Motion and Multi- view Geometry Topics in Image-Based Modeling and Rendering CSE291 J00 Lecture 5 Last lecture S. J. Gortler, R. Grzeszczuk, R. Szeliski,M. F. Cohen The Lumigraph, SIGGRAPH,

More information

Progressive Mesh. Reddy Sambavaram Insomniac Games

Progressive Mesh. Reddy Sambavaram Insomniac Games Progressive Mesh Reddy Sambavaram Insomniac Games LOD Schemes Artist made LODs (time consuming, old but effective way) ViewDependentMesh (usually used for very large complicated meshes. CAD apps. Probably

More information

HIGH-RESOLUTION ANIMATION OF FACIAL DYNAMICS

HIGH-RESOLUTION ANIMATION OF FACIAL DYNAMICS HIGH-RESOLUTION ANIMATION OF FACIAL DYNAMICS N. Nadtoka, J.R. Tena, A. Hilton, J. Edge Centre for Vision, Speech and Signal Processing, University of Surrey {N.Nadtoka, J.Tena, A.Hilton}@surrey.ac.uk Keywords:

More information

Motion Interpretation and Synthesis by ICA

Motion Interpretation and Synthesis by ICA Motion Interpretation and Synthesis by ICA Renqiang Min Department of Computer Science, University of Toronto, 1 King s College Road, Toronto, ON M5S3G4, Canada Abstract. It is known that high-dimensional

More information

All the Polygons You Can Eat. Doug Rogers Developer Relations

All the Polygons You Can Eat. Doug Rogers Developer Relations All the Polygons You Can Eat Doug Rogers Developer Relations doug@nvidia.com Future of Games Very high resolution models 20,000 triangles per model Lots of them Complex Lighting Equations Floating point

More information

Sculpting 3D Models. Glossary

Sculpting 3D Models. Glossary A Array An array clones copies of an object in a pattern, such as in rows and columns, or in a circle. Each object in an array can be transformed individually. Array Flyout Array flyout is available in

More information

Vision-based Control of 3D Facial Animation

Vision-based Control of 3D Facial Animation Eurographics/SIGGRAPH Symposium on Computer Animation (2003) D. Breen, M. Lin (Editors) Vision-based Control of 3D Facial Animation Jin-xiang Chai,1 Jing Xiao1 and Jessica Hodgins1 1 The Robotics Institute,

More information