Model-Based Facial Feature Extraction


Master's Thesis project done at the Image Coding Group, Linköping University

by Anna Nordgren and Johan Scott

Reg. nr. LiTH-ISY-EX-1926

Examiner: Robert Forchheimer
Advisor: Jörgen Ahlberg

Image Coding Group, Department of Electrical Engineering, Linköping University, Linköping, Sweden


Abstract

In this report, a model-based method for extraction of facial features is introduced. The proposed model is piecewise linear, and the method is based on detection of tangents of edges in the areas and directions defined by the model. The method has been implemented for a model of an open mouth with two visible rows of teeth, and results are presented. A survey of the previous work done in the area of facial feature analysis/extraction is also presented.

Keywords: facial feature analysis, extraction, survey, model-based image coding, MPEG-4

Swedish title: Modellbaserad extrahering av former i ansikten


CONTENTS

1. INTRODUCTION
   1.1 Background
   1.2 Model-Based Coding
   1.3 The MPEG-4 Standard
   1.4 Problem Definition and Delimitations
   1.5 Disposition
2. SURVEY OF PREVIOUS WORK
   2.1 Spatial and Temporal Methods
   2.2 Degree of Detail
   2.3 Basic Techniques
   2.4 Pattern Matching
   2.5 Gradient Methods
   2.6 Statistical and Probabilistic Methods
   2.7 Comparisons and Conclusions
3. A MODEL-BASED APPROACH
   3.1 The Model-Based Feature Extraction Process
   3.2 Using Previous Knowledge in a Model
   3.3 Model Definition
   3.4 A Mouth Model
   3.5 Other Relevant Models
4. LINE EXTRACTION
   4.1 Differentiating the Image
   4.2 Noise Reduction
   4.3 Line Mapping and Evaluation
   4.4 Line Merging
5. ADAPTING THE MODEL
   5.1 Line Selection
   5.2 Line Model Robustness Enhancement
   5.3 The Mouth Model Extraction
   5.4 Constraint Motivation
   5.5 Contour Interpolation
   5.6 Experimental Results
   5.7 Feature Point Extraction
6. MODEL GENERALIZATION
   6.1 Connection of Lines in Succeeding Angles
   6.2 Introducing New Model Primitives
   6.3 Generalized Models
   6.4 Adaption of Curves of Second Degree
7. FUTURE WORK
   7.1 Further Combination Constraints
   7.2 Increased Line Property Utilization
   7.3 Generic Line Extraction
   7.4 Model Evaluation
   7.5 Local Refinement
   7.6 Utilization of Colour Information
   7.7 Further Adaptability of Tangent Direction Selection
   7.8 Usage of Temporal Information
8. REFERENCES


1 INTRODUCTION

In this chapter, a background of facial feature extraction is given, as well as a presentation of its most important application: model-based, or semantic, coding of video sequences, which is also included in the MPEG-4 standard. The issue covers a wide range of aspects, and it is a difficult problem if the characteristics of the images are not greatly limited. At the end of the chapter, delimitations and assumptions are discussed along with the problem definition used in this work.

1.1 Background

The work described in this report has been performed at the Department of Electrical Engineering at Linköping University, for the Image Coding Group. It is a part of the work done in the field of model-based coding, which will be introduced briefly below. The area of model-based coding can be divided into three parts: image analysis, parameter coding, and synthesis. This work deals with the first issue, image analysis with respect to facial feature identification.

1.2 Model-Based Coding

Model-based image coding, or semantic coding, is an approach where the shape and texture of an object in an image are separated. By identifying these properties and only transmitting the shape, or model deformations and repositioning, very high compression rates can be achieved for some types of image sequences. In the case of videotelephony, we are dealing with images that are similar in the sense that they all contain a face. Therefore, we can construct a model of a face, and assume that it will be possible to describe all the images with the help of this model. This implies a significant reduction of the data to transmit in the case of facial images, but it also makes it impossible to code images that do not fit the model.

The three parts of model-based coding mentioned above will be presented here. First, the image has to be analysed, in order to adapt the given model to the image. In the case of facial images, this means finding the location of the head, and further the locations of the facial features, possibly the rotation and tilt of the head, and the shape of the face and the facial parts. The texture of the face is analysed for the first frame, and this information is transmitted once. When the model is fitted, the image coding is done by calculating the model parameters, or the parameter differences from the previous frame. The parameters are transmitted, and the receiver synthesises the image by construction and projection of a corresponding model.

1.3 The MPEG-4 Standard

A new standard for model-based image coding of human faces is included in MPEG-4, which became an international standard in early 1998 [20][21]. The standard is actually a composition of coding schemes for several different target objects at different compression ratios, such as still images, audio and video sequences, where the model-based approach is one type of facial video sequence coding. The model-based facial image coding standard of MPEG-4 defines 84 feature points in the face. 3D coordinates for these points are used to tell the decoder how to reshape its face model. Alternatively, a 3D mesh can be used, describing the facial shape in detail. Facial Animation Parameters (FAPs) are defined as descriptors for facial expressions. 68 FAPs are defined, of which 2 are high-level descriptors that tell the decoder to show a face with a specified expression at a specified intensity; for example, the expression surprise can be shown with an intensity of 44, on a scale ranging from 0 to 63. The remaining 66 FAPs are low-level descriptors that express movement of one feature point in a certain predefined direction. Facial texture is also considered in MPEG-4. Textures can be downloaded to be mapped onto the face model. The texture must be accompanied by 2D coordinates telling which points on the texture correspond to the feature points. The texture is compressed by the scalable still-image wavelet coder defined in MPEG-4 Visual.

See Figure 1.1 for a complete illustration of the MPEG-4 feature points [21].

Figure 1.1: The MPEG-4 feature point definitions, with information about which feature points are affected by FAPs.

1.4 Problem Definition and Delimitations

The goal has been to study and present the previous work and results accomplished in the area of facial feature analysis/extraction, and to develop, implement and evaluate an alternative method for facial feature analysis. We have concentrated on analysing mouth shapes, maybe the most difficult facial object due to its wide range of deformability. We assume previous knowledge about the approximate size and location of the mouth. Further, we presuppose that the image quality and resolution are good enough; the resolution must let the lips have a width of at least three pixels, so that both lip edges can be distinguished from each other.

The images we have dealt with are intensity images; no colour information has been used. The method should be able to extract feature points on the contour of the mouth according to some definition, e.g. the feature points defined by the MPEG-4 standard, but other points are also of interest, since the actual face polygon model is normally a lot more complex than the frame given by MPEG-4. To illustrate the method, a fairly simple model of an open mouth with two tooth rows will be used.

1.5 Disposition

In the next chapter, a discussion of the work done earlier in the area of facial feature detection and facial feature analysis is provided. The methods are classified, discussed and compared at the concept level. Next, the model-based approach and its features are presented. In the following chapters, a new method with a slightly different approach and ambition is presented, followed by a theoretical generalization and some suggestions for future work.

2 SURVEY OF PREVIOUS WORK

The goal of this chapter is to present a summary of the main work done in the area of mouth feature detection/analysis. This work is connected to other areas such as face detection, face recognition and facial expression analysis, and therefore it is of great importance in a large area of research. The methods for detecting and analysing facial features can be classified in several natural ways, depending on which criteria we consider the most important. Three different classification criteria for detection of the mouth are presented and discussed, followed by a more thorough presentation of six of the suggested methods.

2.1 Spatial and Temporal Methods

In the field of facial feature detection, we often have image sequences, where the goal is to find a part of the face and track it. In temporal methods we utilize the information from previous frames, whereas in spatial methods we only consider one frame at a time. There are many ways to make use of the information from the previous frames. One of the most obvious is to calculate the difference between two consecutive frames, to detect movement in the image [8]. In the case of faces, we know, for example, that the motion is greatest in the areas of the eyes and the mouth. Another way to utilize temporal information is to keep the position of the model from the previous frame, and adapt it to the new frame [2][9][17]. This is often achieved with the help of some energy minimization function. In temporal methods, Kalman filtering is often used for prediction and smoothing [4][10]. Although temporal information can be very useful, this survey is focused more on the spatial methods. This is due to the fact that the new method described in this report utilizes only spatial information. The temporal methods are also mostly used for tracking the approximate position of the feature of interest, although they can also be utilized for more accurate positioning.
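As a minimal sketch of the frame-differencing idea (not taken from [8]; the threshold value here is an assumption), movement between two consecutive frames can be detected as follows:

import numpy as np

def motion_mask(prev_frame, curr_frame, threshold=15):
    # Binary mask of pixels that changed noticeably between two
    # consecutive intensity frames (uint8 arrays of equal shape).
    diff = np.abs(curr_frame.astype(np.int16) - prev_frame.astype(np.int16))
    return diff > threshold

In a facial sequence, such a mask is typically densest around the eyes and the mouth, which is what makes frame differencing useful for coarse feature localization.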

2.2 Degree of Detail

Different methods of facial feature extraction use different degrees of detail, depending on the application. The methods can be divided into two main groups according to this criterion:

Some of the methods extract only one or two points of a part of the face, for example the midpoint/corners of a mouth or an eye, or the tip of the nose. This can be sufficient for an approximate adaptation of a face model, or a calculation of the 3D global position and rotation of the head. This degree of detail is applied in [4][8][17].

Other methods analyse the shape of a part of the face, and approximate this shape according to a model; this could be a set of parabolas, parts of ellipses, linear functions or feature points. A more detailed description than the one above is needed in all applications where it is necessary to follow the local facial movements. Applications of this degree of detail are found in [1][2][12][14][15][16][19].

2.3 Basic Techniques

The methods found in the literature can be classified according to the underlying techniques used. Three different classes are described in more detail below:

Pattern matching, Section 2.4: A technique that computes the correlation between different parts of the image and a known template, to find out which part of the image is most similar to the template.

Gradient methods, Section 2.5: These methods are based on detecting the edges in the image, and from this information drawing conclusions about the features.

Statistical and probabilistic methods, Section 2.6: Statistical properties of the face are utilized. This could for example be statistical knowledge about the geometrical properties of the face, or information about the colour distribution in different parts of the face.

Many of the actual methods make use of several of these techniques, or use additional techniques, such as colour distribution and profile tracking. It is possible to use the chromatic information in the picture; the skin colour distribution is limited to a relatively small part of the colour space, which makes it possible to decide with high statistical certainty whether a pixel is a skin pixel or not [4]. In [7], a method for locating the face using colour information is used. It is shown that the skin colour distribution of a specific person is very limited. When computing a colour histogram, a few bins will be significantly filled, and these will specify the face. Chromatic information has also been used to distinguish the lips from the skin, for example in [12][16].
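As an illustration of the histogram-based skin classification idea (a sketch, not the implementation of [4] or [7]; the bin count, value range and threshold are assumptions):

import numpy as np

def build_skin_histogram(skin_pixels, bins=32):
    # Normalized 2D histogram over the chromatic components (e.g. Cb, Cr)
    # of pixels taken from known skin regions; skin_pixels is (N, 2).
    hist, _, _ = np.histogram2d(skin_pixels[:, 0], skin_pixels[:, 1],
                                bins=bins, range=[[0, 256], [0, 256]])
    return hist / hist.sum()

def is_skin(cb, cr, hist, bins=32, threshold=1e-4):
    # A pixel is classified as skin if its histogram bin carries
    # significant probability mass.
    return hist[int(cb * bins / 256), int(cr * bins / 256)] > threshold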

In the case that additional information in the form of an image of the head's profile is available, this can be used to extract information about the position of some features, by analysing the profile's curvature [3]. Work has also been done in which auditory information from speech has been utilized to help estimate the shape of the mouth [13].

2.4 Pattern Matching

Work 1: A method using pattern matching to detect the corners of eyes and mouth

In [17] a method based on pattern matching is used to find the corners of the mouth and the eyes. This method is applied on sequences, and a face tracking algorithm is used. From this algorithm we obtain information about the location of the centres of the eyes and the mouth, and this makes it possible to greatly reduce the search areas for the corner points.

Figure 2.1: Global search areas extracted from the tracking algorithm information.

The templates used were created from a set of test sequences. The eye and mouth corners from these sequences were extracted and rescaled to equal size, and then the average eye and mouth corners were computed. These average eye and mouth corners were used as templates. The templates were scaled using the information about the size of the head, and inclined according to the head's inclination. The correlation between the image and the corresponding template is computed in each search area, and points with high values of correlation are extracted as potential corner point positions. A probability measure f is evaluated:

    f(x, y) = \sum_{factor} c_{k, factor}(x, y),    (2.1)

where c_{k, factor}(x, y) is the correlation between the real image k and a corner template with size factor 0.8, 0.9, ..., 1.4. The point with the highest value of f in each potential area is selected as the 2D corner point coordinates. When the corner points have been estimated, their correctness is verified according to the geometric conditions of the face, e.g.:

1. The lines connecting the corner points of the mouth and of each eye are approximately parallel to the line connecting the centre points of the eyes.
2. The length of the left eye is equal to that of the right eye.
3. The distance from the centre of the mouth to the right corner point is equal to the distance from the centre point to the left corner point.
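A sketch of the multi-scale correlation score of Eq. 2.1 (the summation over size factors and the use of normalized cross-correlation are assumptions about the exact form used in [17]):

import numpy as np

def ncc(patch, template):
    # Normalized cross-correlation between two equally sized arrays.
    p = patch - patch.mean()
    t = template - template.mean()
    return (p * t).sum() / (np.sqrt((p * p).sum() * (t * t).sum()) + 1e-9)

def corner_score(image, x, y, templates):
    # Probability measure f of Eq. 2.1: correlations between the image
    # around (x, y) and the corner template at several size factors
    # (0.8 ... 1.4, pre-scaled in `templates`) are accumulated.
    score = 0.0
    for t in templates:
        h, w = t.shape
        patch = image[y:y + h, x:x + w]
        if patch.shape == t.shape:      # skip templates crossing the border
            score += ncc(patch, t)
    return score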

2.5 Gradient Methods

Work 2: Extraction of feature points in the mouth area using gradient summation

A method described in [18] makes use of gradient methods to find a limited number of feature points belonging to the mouth. First, the horizontally oriented gradient in the mouth region is summed along vertical lines. The positions of the corners of the mouth are on the maximum negative and positive slope, respectively.

Figure 2.2: Detection of the positions of the mouth corner points (summation of the horizontally oriented gradient projection along vertical lines).

Then the absolute value of the vertically oriented gradient projection is summed along horizontal lines, and the local maxima of the resulting function are taken as candidate points for the lip outlines.

Figure 2.3: Detection of candidate points for lip outlines.

The candidates for the lip outlines are then evaluated by taking the texture of a candidate lip, defined by two candidate points, and mapping it onto the average upper or lower lip from a database of face images. The candidate lips that correspond to the best maps for the upper and lower lip are kept. The texture of the lower lip has a greater variation than that of the upper lip, and therefore the position of the upper lip is determined first; then constraints are put on the lower lip according to the properties of the upper lip, e.g. the thickness of the lower lip is supposed to be correlated to the thickness of the upper lip.
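The two projection steps can be sketched as follows (an illustration of the idea in [18], not its implementation; the exact slope and maxima handling is an assumption):

import numpy as np

def mouth_corner_columns(region):
    # Sum the horizontally oriented gradient along each vertical line,
    # then locate the maximum negative and positive slopes of the
    # resulting 1D profile (cf. Figure 2.2).
    gx = np.abs(np.diff(region.astype(np.float64), axis=1))
    slope = np.diff(gx.sum(axis=0))
    return np.argmin(slope), np.argmax(slope)   # left, right corner columns

def lip_outline_rows(region):
    # Sum |vertical gradient| along each horizontal line; interior local
    # maxima are candidate rows for the lip outlines (cf. Figure 2.3).
    profile = np.abs(np.diff(region.astype(np.float64), axis=0)).sum(axis=1)
    return [r for r in range(1, len(profile) - 1)
            if profile[r] > profile[r - 1] and profile[r] > profile[r + 1]]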

Work 3: Mouth shape detection using a deformable mesh

A method to detect the shape of the mouth using an active mesh is described in [15]. The mesh is contained in a rectangular box and consists of 24 nodes. Six of the nodes are active, which means that they are allowed to move freely. Four nodes are fixed at the corners of the box. The rest of the nodes, called fictitious nodes, are calculated from the positions of these ten nodes and user-defined constants.

Figure 2.4: The mesh structure for a non-tilted mouth (fixed, fictitious and active nodes).

Even though the active nodes are allowed to move, their movement is constrained; the mouth corner points are on the same horizontal line, and the four active nodes on the lip outlines are aligned on the same vertical line. The mesh belonging to a tilted mouth is simply a rotation of a mesh of a non-tilted mouth. The feature detection algorithm is performed in four steps:

1. A box is positioned so that it contains the mouth. The size of the box is assumed to be estimated in advance, and the positioning is performed by searching for the location where the summation of the magnitude of the image gradient along the vertical direction is largest.

2. Generally there exists a middle, approximately horizontal line in the box, along which the luminance intensity is lowest; this line will be referred to as the valley line. If the mouth is closed, this line is located where the lips meet. If the mouth is open, it is positioned along the upper outline of the lower lip or the lower outline of the upper lip, or, if the mouth is widely open, between the teeth. In any of these cases, the valley line passes through the corners of the mouth.

The location of the valley line is determined by searching for local minima in each vertical stripe in a selected horizontal slit in the centre of the box.

Figure 2.5: Illustration of the valley line detection.

3. Now the intensity profile along the valley line (Figure 2.6) is calculated by summing vertically neighbouring pixels. The profile usually has the shape of a basin, and the corners of the mouth are detected by finding the left and right edges of the basin. Note that the corner points are not necessarily on the same horizontal line, since the vertical position of the valley line may vary. The angle between the line connecting these corner points and the horizontal is used as an estimate of the tilt angle θ.

Figure 2.6: Typical intensity profile along the valley line, with the left and right mouth corners at the edges of the basin.

4. The four central nodes are detected by analysing the intensity profile along a vertical line (tilted with θ) in the centre of the box. Three types of profiles are considered: one corresponding to a closed mouth, one to an open mouth with teeth in between the lips, and one to an open mouth with a dark middle region. Based on this profile, the type and the nodal positions are determined. Now the positions of the rest of the nodes can be calculated, and the mesh structure is complete.

In [15], a method for nodal tracking is also described, using an energy minimization approach.
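Steps 2 and 3 can be sketched as follows (the slit height and the number of summed neighbours are free parameters of this sketch, and the basin-edge search is left out):

import numpy as np

def valley_line(box, half_height):
    # Step 2: for each column, take the row of minimum intensity inside
    # a central horizontal slit; the poly-line of these rows
    # approximates the valley line.
    h, w = box.shape
    top, bottom = h // 2 - half_height, h // 2 + half_height
    return top + np.argmin(box[top:bottom, :], axis=0)

def valley_profile(box, rows, nb=1):
    # Step 3: intensity profile along the valley line, summing a few
    # vertically neighbouring pixels per column; the left and right
    # edges of the resulting basin indicate the mouth corners.
    h, w = box.shape
    return np.array([box[max(r - nb, 0):min(r + nb + 1, h), x].sum()
                     for x, r in enumerate(rows)])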

2.6 Statistical and Probabilistic Methods

Work 4: A statistical method for automatic estimation of the mouth features using deformable templates

In [16] a method is described that adapts a deformable template consisting of parabolas to the shape of the mouth, utilizing statistical methods. Colour information is also utilized; the input image is supposed to be provided in a colour space.

First, the method in [15] (Work 3, described in Section 2.5) is employed to find the corner points of the mouth. The line connecting these points will be called the midline. Further, the edges are detected, and the new image is binarized and morphologically adjusted. Candidates for the lip outlines are chosen as the edges intersecting with the line perpendicular to the midline, see Figure 2.7.

Figure 2.7: Candidates for the lip outlines (candidate points and corner points).

If the number of candidates both above and below the midline is greater than or equal to two, the mouth is considered to be open; otherwise it is assumed to be closed. This assumption is later verified, based on subsequent calculations.

Figure 2.8: Mouth-closed and mouth-open deformable template, respectively.

The cost function to be minimized in the case of an open mouth is defined as:

    f_o = k_1 f_1 + k_2 f_2 + k_3 f_3,    (2.2)

where f_1 is

    f_1 = -\sum_{i=1}^{4} \frac{1}{L_i} \int_{y_i} E_Y(X) \, ds,    (2.3)

with E_Y being the edge strength, computed from the Y component of the picture, and L_i the lengths of the four parabolas y_i (i = 1, 2, 3, 4), chosen as a set of four of the candidate points (Figure 2.7) that forms a template (Figure 2.8, mouth-open case).

f_2 is

    f_2 = -(|m_u - m_o| + |m_l - m_o|) + \sigma_u + \sigma_o + \sigma_l    (2.4)

and f_3 is

    f_3 = |\sigma_u - \sigma_o| + |\sigma_o - \sigma_l| + |\sigma_l - \sigma_u|,    (2.5)

where m_u, m_l, m_o, \sigma_u^2, \sigma_l^2 and \sigma_o^2 are the means and the variances of the colour components of the image in the regions of the upper lip, the lower lip and the region between the lips. The coefficients k_i (i = 1, 2, 3) are weighting factors, which were set to 1 in the experiments. The term f_2 makes sure that the colour values differ between the regions, while varying as little as possible within each region. The term f_3 reflects that the variance of the camera noise should be the same in each region.

Now, the upper lip outline parameters are selected from the candidates above the midline, and the lower lip outline parameters among the candidates below the midline, Figure 2.7. For each possible combination, f_o is calculated, and the combination with the minimum value of f_o is selected as the lip outline parameters.

The cost function to be minimized in the case of a closed mouth is defined as:

    f_c = k_1 f_1 + k_2 f_2 + k_3 f_3,    (2.6)

where f_1 is

    f_1 = -\sum_{i=1}^{3} \frac{1}{L_i} \int_{y_i} E_Y(X) \, ds,    (2.7)

analogous to f_1 in the mouth-open case, Eq. 2.3, except that only three parabolas are considered at a time, according to Figure 2.8. f_2 is

    f_2 = |m_u - m_l| + \sigma_u + \sigma_l    (2.8)

and f_3 is

    f_3 = |\sigma_l - \sigma_u|,    (2.9)

where the notation is the same as in the mouth-open case above. Lip outline parameters are selected from the candidates, Figure 2.7. For each possible combination, the value of f_c is calculated, and the combination with the minimum value of f_c is chosen as the lip outline parameters. When the lip outline parameters have been estimated, they are verified according to the geometrical conditions of the mouth, i.e. the upper lip cannot be thicker than the lower lip. The parameters are also verified according to the colour distribution, i.e. in the case of an open mouth, the value of the C_r component must be greater on the lips than between them. If the mouth is closed, it is verified that the values of the C_r component of the upper and lower lip do not differ by more than a known threshold.
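A sketch of how one candidate template could be scored with the cost function above (the signs in f_2 follow the reconstruction of Eqs. 2.4 and 2.5 given here, which is itself an assumption; the edge term f_1 is taken as precomputed):

def open_mouth_cost(f1, regions, k=(1.0, 1.0, 1.0)):
    # regions holds (mean, sigma) for the upper lip, the lower lip and
    # the region between the lips, for the chosen colour component.
    (m_u, s_u), (m_l, s_l), (m_o, s_o) = regions
    # f2: reward differing means between regions, penalize variation.
    f2 = -(abs(m_u - m_o) + abs(m_l - m_o)) + s_u + s_o + s_l
    # f3: camera-noise variance should be the same in all regions.
    f3 = abs(s_u - s_o) + abs(s_o - s_l) + abs(s_l - s_u)
    return k[0] * f1 + k[1] * f2 + k[2] * f3

The combination of candidate parabolas with the smallest cost is then kept, exactly as described above.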

Work 5: A probability-based method using deformable templates

Described in [12] is a method using deformable templates for eye and lip tracking.

Figure 2.9: Deformable template consisting of two parabolas, used for eye and mouth.

The template consists of two parabolas and is completely specified by the 5-dimensional parameter \lambda = (x_1, x_2, y_1, y_2, y_3), Figure 2.9. For each pixel value, given in the colour space, we can compute the likelihood that the pixel is part of the feature of interest, or part of the background. The image is divided into two parts, foreground (eye or mouth, inside the template) and background. For these regions we have different probability density functions, and we can compute:

    P(I | \lambda) = \prod_{(x,y) \in fg} b_{fg}(I(x, y)) \prod_{(x,y) \in bg} b_{bg}(I(x, y)),    (2.10)

where I is the image, \lambda is the parameter describing the template, I(x, y) is the pixel value at location (x, y), and b_{fg} and b_{bg} are the foreground and background probability density functions. This value is used as the energy function to be maximized over \lambda. To avoid an unreasonable amount of computation, we need to know the approximate position of the template in advance, and in [9] this method is used for tracking. A real-time tracking system is proposed, where lookup tables are used for storing the log-likelihood,

    P(x, y) = \log(b_{bg}(I(x, y))) - \log(b_{fg}(I(x, y))),    (2.11)

for every possible pixel value, and a log search is used to find the best-fitting template.
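The lookup-table idea can be sketched as follows (a sketch under the sign convention of Eq. 2.11 as reconstructed here; the density functions and the value range are assumptions):

import numpy as np

def build_loglik_table(b_fg, b_bg, n_values=256):
    # One table entry per possible pixel value, so no density needs to
    # be evaluated while templates are being searched.
    return np.array([np.log(b_bg(v) + 1e-12) - np.log(b_fg(v) + 1e-12)
                     for v in range(n_values)])

def template_energy(image, fg_mask, table):
    # Energy of one template hypothesis: accumulate the table over the
    # foreground pixels of an integer-valued (e.g. uint8) image; lower
    # is better with this sign convention.
    return table[image[fg_mask]].sum()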

Work 6: A probabilistic method for facial feature tracking

This method, described in [11], is a method for tracking a head with varying 3D pose, and the first part of it estimates the global, rigid motion of the head. The second part, the local feature tracking, is performed with a Bayesian network. A Bayesian network combines probabilistic information from different areas to make the decision that is most likely to be correct. One important area where Bayesian networks have been used is support for decisions regarding medical diagnoses. The network is trained from front-view images only, but after applying the global transformation to it, it can be used to model images from any view.

Figure 2.10: The structure of the probabilistic network, from the root (eye corners and nostrils position) via the group centres (mouth, eyebrows, eyes) down to the local feature positioning.

The structure of the network is shown in Figure 2.10. The root node contains the 2D positions of the triangle formed by the outer eye corners and the midpoint of the nostrils. This triangle is also used for the global head pose estimation. The next level contains the positions of the centres of feature groups such as mouth and eyes; these positions are modelled as a 2D Gaussian distribution relative to the root triangle. Further down in the network, it is described how an individual feature point is correlated with other feature points in the same group.

Figure 2.11: Feature point probability density functions around the mouth.

Consider the four feature points in the mouth group, as in Figure 2.11. The positions of these points are p_l, p_r, p_u and p_b. The 2D distribution of p_l when the other three points are given is a 2D Gaussian distribution denoted P(p_l | p_r, p_u, p_b). The point p_l will be given a value

    S_{p_l} = \alpha P(p_l | p_r, p_u, p_b) + c(p_l),

where c(p_l) is the result of local template matching at p_l, and \alpha is the strength of the constraint. The goal is to find (p_l, p_r, p_u, p_b) that gives the maximum value of

    S = (S_{p_l} + S_{p_r} + S_{p_u} + S_{p_b}) P(p_c),    (2.12)

where p_c is the position of the centre of the mouth, and P(p_c) is the probability of this centre given a certain root node triangle. The goal is reached by searching for the maximum value of each of the feature points iteratively, with the rest of the points fixed. The iteration is continued until it converges.

The Bayesian network is a way to make use of the information from the whole face while searching for each feature point. This is done by structuring the impact the different points and groups of points have on each other, and combining this high-level information with the low-level information from the template matching.
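The iterative maximization can be sketched as a simple coordinate ascent (illustrative only; the score function, candidate sets and iteration cap are assumptions, not details from [11]):

def refine_feature_points(points, candidates, score, max_iter=20):
    # Maximize S of Eq. 2.12 by optimizing one feature point at a time
    # with the other points held fixed, until no point moves.
    points = list(points)
    for _ in range(max_iter):
        moved = False
        for i in range(len(points)):
            best = max(candidates[i], key=lambda p: score(i, p, points))
            if best != points[i]:
                points[i] = best
                moved = True
        if not moved:       # converged
            break
    return points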

2.7 Comparisons and Conclusions

As has been established earlier, it is hard to compare the performance of the methods, due to the differences in prerequisites, the widely varying testing methods, etc. In spite of this, we will make an attempt to draw some conclusions. An important issue is the possibility of real-time usage, since one of the most obvious applications in this area is video conferencing. Here we can conclude that the pattern matching approach is rather time consuming. In the methods using deformable templates, the templates and decision functions need to be relatively simple, in order to reduce computation time. As has been discussed in Section 2.2, there are considerable differences in the degree of detail of the results of the different methods, and consequently different methods are appropriate for different applications. The methods resulting in information about a very small number of feature points, like the pattern matching method described in Section 2.4 and the first of the methods described in Section 2.5, Gradient Methods, are more appropriate for calculating an approximate position of a face model, or maybe as starting points for a more accurate method. For each application, we have to appraise the conditions of our data and weigh the required level of detail of the result against performance in time and robustness, and from this decide which methods and basic techniques are best suited for the specific problem.


3 A MODEL-BASED APPROACH

In this chapter, a model-based approach is introduced as a means of utilizing semantic knowledge in the approximation of the shape of the mouth. In fact, every feature extraction scheme uses a model of some kind, but in this approach the model is in focus and explicitly given. A model definition is suggested, and an example of a mouth model based on this definition is presented. This model will be used later, in the extraction process. The chapter ends with some examples of other relevant models in the area of facial feature extraction, relevant since this method is intended to be generic.

3.1 The Model-Based Feature Extraction Process

The core idea of this approach is to detect edges, or tangents of curved edges, in predefined directions. This is done in areas and directions where we expect to get good responses; for example, the open non-tilted mouth will have strong oblique tangents in the outer lip areas. Edges not considered to be reliable according to semantic knowledge are avoided, e.g. vertical tangents in the open mouth case, since strong edge responses there are often caused by a number of teeth. This semantic knowledge is used to build a model. An illustration of a feature extraction process for an open mouth with two tooth rows can be viewed in Figure 3.1. When the model extraction is completed, an interpolation is performed to increase the natural appearance of the model, and also to achieve more accurate extraction of feature points along the curves. Otherwise, i.e. if keeping to the linear model, the corners of the model will differ from the edges in the image.

The model-based approach does not limit the number of points that can be extracted. This is an advantage, since we have a robust method and we can adapt the degree of detail according to the application.

Figure 3.1: The extraction process for the open mouth with two rows of teeth (real image, extraction, model, interpolation, result).

3.2 Using Previous Knowledge in a Model

There is a lot of previous knowledge regarding the mouth, and this is of course the case for other facial parts as well. The basic idea is to utilize as much of this knowledge as possible in order to analyse the shape of the mouth. Some characteristics of how a model can be used, and the consequences that can appear, are listed below:

The models that have been used are piecewise linear, and will be described more carefully in Section 3.4, A Mouth Model.

Since we have previous knowledge about the approximate shape of the mouth, we know where to search for edges in different directions. This gives us the possibility to extract the information in the image that is likely to concern the shape of the mouth, and ignore information of less interest.

To be able to adapt a piecewise linear model to the actual non-linear curves that appear in the image and constitute the mouth, we have to detect tangents of these curves.

As the tangents we are searching for are dependent on the model, the result will depend on the model and not only on the image. If the model is well designed for our purpose, it will be easy to achieve good results. If we, on the other hand, use a model that is not appropriate, it will be very difficult to find a good approximation to the real-world data. We must find the best balance between what we can assume to know and what we want to find out.

If we need a better approximation than the linear model can give, we can use the information from this model to adapt another, more sophisticated model based on curve segments.

3.3 Model Definition

With the restriction that the model should be piecewise linear, we are able to model the facial part of interest in many ways. The number of lines in the model is an important factor determining how well the model will approximate the facial part. Whether or not the directions of the lines are adaptable is another issue of importance. One reason to search only in a limited number of directions is that we gain robustness when we increase the knowledge of what we are looking for. This results in a lower demand on resolution, and the computations become a lot easier. Further, by searching for lines with a predetermined direction, we reduce the number of degrees of freedom to one. Another reason is that since we have previous knowledge about the shapes of the facial parts, it is likely that we can choose the right directions and thereby avoid most of the deviation from the real curves. Consequently, our models consist of a number of straight lines, positioned in relation to each other in a determined way.

Figure 3.2: Examples of possible models: a closed mouth consisting of 7 line segments, a closed happy mouth with 16 line segments, and an open mouth with 23 line segments.

When a model is defined, we search for lines in the model-defined positions and, possibly, directions. This search will leave us with a number of possible lines, and the next problem is to decide which line is the most probable for each line position in the model. It is suitable to perform the search in a predetermined order, as some tangents can be assumed to be stronger and/or more unambiguously located. Information about the locations of these more easily determinable lines is then used to constrain the location of more difficult or weaker lines. The same technique can be used to allow concave models, e.g. by first determining a convex hull of some kind. The line selection is done by weighting combinations of lines according to their agreement with the model. This weighting starts with combinations of a few lines, and their correspondence to some part of the model. Then these combinations are combined with each other to determine the best-fitting line combination for the complete model. There are several constraints on the model, and on parts of it, that are not visible in the figures. These constraints are based on semantic information regarding the relations of the lines, and are very important for deciding which line combination is the most probable one. In this weighting process, the weights of the original lines are also taken into consideration. Further, the constraints for the model correspondence can be of different importance for different parts of the model.

For example, the oblique lip edges are usually strong and easy to detect without the help of semantic constraints, whereas the horizontal outer edge of the lower lip is easily confused with the edge of a shadow below the lip. To avoid this confusion, we can utilize information about the thickness of the oblique parts of the lower lip. The line selection decisions must be made with respect to previous knowledge about the mouth. This can, for example, be performed as described in Section 5.1, Line Selection. If we want to be able to extract a generic mouth, we will probably need several different models, depending on the degree of detail we want to achieve. Maybe this can be done hierarchically by a tree search, and maybe this can be done for the whole face by gradually increasing the degree of detail in different areas. See Chapter 7, Future Work, for a further discussion.

3.4 A Mouth Model

The model that has been used in our work is a model of an open mouth with two rows of teeth visible, see Figure 3.3.

Figure 3.3: The mouth model for the case of an open mouth with two visible rows of teeth.

The model is completely defined by the x- and y-positions of the twelve vertices that define the lips in the model, and two y-positions for the teeth. In the method described in Chapter 4, we search for lines in a number of directions to find the correct angles of the eight oblique lines. This is not the case for the horizontal lines, which we suppose are possible to find in the predetermined direction. After this search, there is a number of candidate lines for every line in the model, and the best combination of lines shall be extracted. The model includes several constraints that are not shown in the figures, but they are still a very important part of the model. There are actually no rules for how these constraints shall be defined, as long as they are associated with the lines. The constraints should capture the semantic information about the object. Some examples of how the constraints could be formulated are listed here, with a small illustrative check following the list:

- Symmetry of different types must be fulfilled, e.g. lip widths or centres of lines in different projection directions.
- The lower lip should be thicker than the upper lip.
- The edges should have specific signs; for example, the upper lip should be darker than the skin.
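As an illustration of how such constraints could be checked for one candidate line combination (a sketch; the field names, coordinate convention and threshold are invented for the example and are not the thesis implementation):

def plausible_combination(c, symmetry_tolerance=3):
    # y grows downward, so lip thicknesses are simple differences of
    # the vertical line positions.
    upper_thickness = c["upper_inner_y"] - c["upper_outer_y"]
    lower_thickness = c["lower_outer_y"] - c["lower_inner_y"]
    symmetric = abs((c["right_corner_x"] - c["centre_x"])
                    - (c["centre_x"] - c["left_corner_x"])) <= symmetry_tolerance
    return (lower_thickness >= upper_thickness      # lower lip thicker
            and symmetric                           # corner symmetry
            and c["upper_outer_edge_sign"] < 0)     # lip darker than skin

Combinations failing such predicates can be discarded before the weighting, or the violations can instead be folded into the combination weights.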

3.5 Other Relevant Models

The model definition from Section 3.3 can also be used to model parts of the face other than the mouth. Models for the eyes or the nose can easily be constructed with the help of previous knowledge about their constitution, for example as in Figure 3.4.

Figure 3.4: Examples of two nose models, two eye models and a face outline model.

As illustrated above, the models can be defined using different degrees of detail, and information from a coarsely extracted model can be used to extract other features. If the face outline is extracted, then good approximations of, for example, the locations of the eyes and the mouth can be made.


4 LINE EXTRACTION

In this chapter, a method for extracting the lowest primitive used by the model-based approach, the tangent line segment, is presented. This is done by differentiating the image along different directions and analysing lines orthogonal to these directions, to estimate how probable an edge at each line is. Post-processing of the set of lines is also performed, as described at the end of the chapter.

4.1 Differentiating the Image

In order to detect tangents of curves in the image, we calculate the gradient's projection onto a certain direction \nu. The directional derivative of a continuous field f is given by:

    f_\nu(x, y) = \nabla f(x, y) \cdot \nu,    (4.1)

    \nabla f(x, y) = (f_x(x, y), f_y(x, y)).    (4.2)

In the case of a discrete field, an image I, the gradient calculation has to be discrete. To approximate the derivative we use a simple three-point mid difference, see Figure 4.1, which has been shown to work well at the current resolution (approximately 40 x 70 pixels, with the object occupying the main part of the image).

Figure 4.1: The approximated discrete one-dimensional derivative (three-point mid difference) along the axis of derivation.

The choice of kernel size depends, in general, on the range of frequencies we want to detect. A smaller filter without the midpoint, a simple two-point difference, could also be used. This, however, would respond to the noise with the highest frequency, and we would like the significant lines to be reinforced to a greater degree. Larger filters, on the other hand, did not improve the performance, and of course larger filters limit the narrowest detectable lip width, which we must allow to be rather small in order to be able to detect all lip edges.

To extend the kernel from Figure 4.1 to two dimensions, the new kernel size will be 3 x 3 pixels, and different kernels can be used to accomplish this. Two slightly different methods will be presented, with insignificant difference in performance, as far as we can judge. In the first method, we approximate the x- and y-components of the gradient by lowpass filtering in the orthogonal direction and differentiating in the wanted direction. The gradient is then formed as a linear combination of these components. An argument for this solution is that the edge detection is improved by lowpass filtering along the edges, but on the other hand, we are not lowpass filtering in exactly the right direction. The lowpass filtering is done by calculating a simple arithmetic mean, but since we are only interested in the relative intensity changes we drop the normalizing factor, which leaves the filtering with a plain sum.

Figure 4.2: Two discrete 2D derivation filter kernels in orthogonal directions.

    d(x, y) = (d_x(x, y), d_y(x, y)) = \Big( \sum_{i=-1}^{1} [I(x+1, y+i) - I(x-1, y+i)], \; \sum_{i=-1}^{1} [I(x+i, y+1) - I(x+i, y-1)] \Big).    (4.3)

If we now use the angle \Theta as the polar angle in the direction of the unit vector \nu, the resulting differentiated image D_\Theta(I) can be expanded into:

    D_\Theta(I)(x, y) = d(x, y) \cdot \nu(\Theta) = d_x(x, y) \cos\Theta + d_y(x, y) \sin\Theta    (4.4)

    = \cos\Theta (I(x+1, y) - I(x-1, y)) + \sin\Theta (I(x, y+1) - I(x, y-1))
      + (\sin\Theta + \cos\Theta)(I(x+1, y+1) - I(x-1, y-1))
      + (\sin\Theta - \cos\Theta)(I(x-1, y+1) - I(x+1, y-1)).    (4.5)

Note that we gain four terms, each consisting of a difference between two opposite points with respect to the central pixel. The coefficients in front of these difference images depend only on \Theta, so just by calculating different weighted sums of these four images we can generate the approximated derivative at an arbitrary angle.
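The practical value of Eq. 4.5 is that the four difference images can be precomputed once; a sketch (rows are taken as y and columns as x, and borders are left at zero for brevity):

import numpy as np

def difference_images(I):
    # The four opposite-point differences of Eq. 4.5.
    I = np.asarray(I, dtype=np.float64)
    dx = np.zeros_like(I)
    dy = np.zeros_like(I)
    dd = np.zeros_like(I)   # diagonal
    da = np.zeros_like(I)   # anti-diagonal
    dx[1:-1, 1:-1] = I[1:-1, 2:] - I[1:-1, :-2]   # I(x+1,y) - I(x-1,y)
    dy[1:-1, 1:-1] = I[2:, 1:-1] - I[:-2, 1:-1]   # I(x,y+1) - I(x,y-1)
    dd[1:-1, 1:-1] = I[2:, 2:] - I[:-2, :-2]      # I(x+1,y+1) - I(x-1,y-1)
    da[1:-1, 1:-1] = I[2:, :-2] - I[:-2, 2:]      # I(x-1,y+1) - I(x+1,y-1)
    return dx, dy, dd, da

def directional_derivative(dx, dy, dd, da, theta):
    # D_theta(I) per Eq. 4.5: a new angle costs only four multiplications
    # per pixel once the difference images exist.
    c, s = np.cos(theta), np.sin(theta)
    return c * dx + s * dy + (s + c) * dd + (s - c) * da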

The other method differentiates between three equidistant, bilinearly interpolated points on the line through the central pixel in the direction of differentiation, see Figure 4.3.

Figure 4.3: Differentiation with a 1D filter kernel, along an arbitrary line in the 2D space.

The bilinear interpolation, with inter-pixel distance 1, is done as usual as

    I(x + \alpha, y + \beta) = \alpha\beta I(x+1, y+1) + \alpha(1-\beta) I(x+1, y) + (1-\alpha)\beta I(x, y+1) + (1-\alpha)(1-\beta) I(x, y),    (4.6)

with constants \alpha and \beta as polar transforms of \Theta:

    \alpha = \cos\Theta, \quad \beta = \sin\Theta.    (4.7)

A drawback of this method is that several cases have to be considered, depending on the current quadrant. In the first quadrant, the expression for the differentiated image D_\Theta(I) yields:

    D_\Theta(I)(x, y) = I(x + \alpha, y + \beta) - I(x - \alpha, y - \beta)    (4.8)

    = \alpha\beta I(x+1, y+1) + \alpha(1-\beta) I(x+1, y) + (1-\alpha)\beta I(x, y+1) + (1-\alpha)(1-\beta) I(x, y)
      - [\alpha\beta I(x-1, y-1) + \alpha(1-\beta) I(x-1, y) + (1-\alpha)\beta I(x, y-1) + (1-\alpha)(1-\beta) I(x, y)]    (4.9)

    = \cos\Theta \sin\Theta I(x+1, y+1) + \cos\Theta(1 - \sin\Theta) I(x+1, y) + (1 - \cos\Theta)\sin\Theta I(x, y+1)
      - \cos\Theta(1 - \sin\Theta) I(x-1, y) - (1 - \cos\Theta)\sin\Theta I(x, y-1) - \cos\Theta \sin\Theta I(x-1, y-1).    (4.10)

The expression is a bit more complicated than for the first method, and here we need nine difference images with different displacements to form D_\Theta(I) in an arbitrary direction, instead of four. The images are divided into groups of six, depending on the quadrant of interest. The six displacement images needed to form a differentiation image in the first quadrant have been evaluated in Eq. 4.10.
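A sketch of the bilinear variant for the first quadrant, following Eq. 4.10 (the other quadrants would need their own sign handling, as noted above):

import numpy as np

def bilinear_directional_derivative(I, theta):
    # First-quadrant case of Eqs. 4.8-4.10: difference between two
    # bilinearly interpolated samples at distance 1 on either side of
    # each pixel, along the direction theta (0 <= theta <= pi/2).
    I = np.asarray(I, dtype=np.float64)
    a, b = np.cos(theta), np.sin(theta)        # alpha, beta of Eq. 4.7
    D = np.zeros_like(I)
    D[1:-1, 1:-1] = (a * b * (I[2:, 2:] - I[:-2, :-2])
                     + a * (1 - b) * (I[1:-1, 2:] - I[1:-1, :-2])
                     + (1 - a) * b * (I[2:, 1:-1] - I[:-2, 1:-1]))
    return D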

Worth noticing is that in the corresponding 2 x 2 case, the two methods yield exactly the same result. This is not true in the 3 x 3 case, as we have seen above. The difference is shown in Figure 4.4, and reveals that the bilinear method has a sharper response, which is due to the lowpass filtering effect of the first method. This influences the unwanted noise too, and it also gives a sharper response to edges far from the wanted angle, so the question of which method gives the best result with respect to our application is still open, and has not been clarified by our tests. However, the second method has been used in the results which follow.

Figure 4.4: The filter response with differentiation angle 60° from the first differentiation method, the second method using bilinear interpolation, and the magnified difference between them.

4.2 Noise Reduction

Since a filter kernel of only 9 pixels is used, the resulting differential image will be quite sensitive to noise, and since in practice we can only detect four pure directions (disregarding signs), it will also have a very limited performance in determining the right tangential angle. An image could contain high energy and, consequently, give a high response in many directions. As a result, the differential image will contain long curve sectors that originate from lines in a wide angular spectrum. To restrain this phenomenon we can use information from the differential image in the orthogonal direction; we want the difference in the orthogonal direction to be less significant. Let D_{//} denote the differential image of the wanted direction and D_\perp the differential image of the orthogonal direction. The new filtered image D'_{//} can then be calculated as follows:

    D'_{//} = \mathrm{sgn}(D_{//}) (|D_{//}| - k |D_\perp|)  if |D_{//}| > k |D_\perp|,
    D'_{//} = 0  otherwise.    (4.11)

Now, if the differential image in the orthogonal direction has some response, we subtract the absolute value of that response (weighted with a constant k) from the absolute value of the original response. A large disturbance, or a tangent not exactly in the wanted direction, will then be suppressed according to this error, weighted with k. In Figure 4.5 one can see how tangents with an erroneous direction, and noise, are suppressed with increasing k, but the tangent of 60° in the first quadrant still remains. From now on we will assume that k is set to 1.

Figure 4.5: A differentiated image, differentiation angle 60°, with noise reduction using different constants k (k = 0, 0.5, 1, 1.5, 2).
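Eq. 4.11 translates directly into an element-wise operation; a sketch:

import numpy as np

def suppress_orthogonal(d_par, d_ort, k=1.0):
    # Responses whose magnitude does not exceed k times the orthogonal
    # response are zeroed; the rest are reduced by the weighted
    # orthogonal magnitude while keeping their sign (Eq. 4.11).
    mag = np.abs(d_par) - k * np.abs(d_ort)
    return np.where(mag > 0, np.sign(d_par) * mag, 0.0)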

4.3 Line Mapping and Evaluation

To find tangents with a certain direction, we will have to concatenate the tangential response points from the differential image in some way. A natural way to do this is along straight lines, orthogonal to the direction of differentiation, determined by the angle \Theta. In this way we can continue lowpass filtering along the orthogonal direction at the same time. Basically, the line mapping is a rotation, but it is also practical to change the representation to indexed polar coordinates and simultaneously do some filtering. The transform, or mapping, can therefore be illustrated, for the line i, as in Figure 4.6.

Figure 4.6: The differential image to line mapping.

Here r_i denotes the shortest distance from the centre of the image to the tangential line, and f_i(x) the line function, i.e. the differentiated image sliced out along that line. The calculation of r_i and f_i(x), from which the weight shall later be determined, is at this point given in a straightforward manner by:

    r_i = i,    (4.12)

    f_i(x) = D_{//}(r_i \cos\Theta + x \sin\Theta, \; r_i \sin\Theta - x \cos\Theta).    (4.13)
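A sketch of how f_i(x) can be sliced out of the differentiated image according to Eq. 4.13 (the nearest-neighbour sampling and the image-centre origin are assumptions of this sketch):

import numpy as np

def line_function(D_par, r, theta, half_length):
    # Sample the differentiated image along the line at signed distance
    # r from the image centre, orthogonal to the differentiation
    # direction theta.
    h, w = D_par.shape
    cx, cy = w / 2.0, h / 2.0
    f = np.zeros(2 * half_length + 1)
    for i, x in enumerate(range(-half_length, half_length + 1)):
        px = int(round(cx + r * np.cos(theta) + x * np.sin(theta)))
        py = int(round(cy + r * np.sin(theta) - x * np.cos(theta)))
        if 0 <= px < w and 0 <= py < h:
            f[i] = D_par[py, px]
    return f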

The result can then be lowpass filtered again if needed, since the very high, but not very wide, peaks have already been suppressed. Let us now, before we deal with the weight calculation, introduce some new concepts. Let the offsets o_ri and o_li denote the maximum distances from the search line - the line emerging from the midpoint in the direction of the differentiation - to the right and left end, respectively, of the section of line i where the tangent response is most significant. We can easily determine where the actual line endings o_ri and o_li are by starting at a point s_x that we know is on the line segment, and then following the function until it reaches a local minimum (or drops below a low threshold). The starting point s_x can be a centre-of-mass estimate, if we only take the positive values into account. The offsets can be used for validation and confirmation of different hypotheses later on, see Section 7.2, Increased Line Property Utilization. If we have one fixed model, two offsets are not really needed, because the line segments should then be extended so that they form a closed polygon. But the offsets are used internally anyway, when calculating which part of the line should be included in the weight determination.

The evaluation of the line function f_i(x) has shown to be a tricky problem, since the function can be very noisy. To calculate the weight, we only want to sum the values that are included in the actual tangent section, and not the noisy values outside, which could be very high. Ordinary lowpass filtering is not exactly what we want, since it only reduces the very high noise peaks, which could still be too high, although we do want a smoothing of the function. A median filter is good at removing single peaks, but here we cannot always rely on a single value; we still want a smoothing effect. A middle way is to sort the values in the filter kernel and then take not just the middle value, as in a median filter, but a mean value which weights the middle value with the highest weight factor and the outermost values with the lowest, as sketched below.

Figure 4.7: Example of some values that are lowpass filtered by a Gaussian type of filter after they have been sorted. The grey dot indicates the result of the filtered point using a five-point kernel.
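A sketch of this sorted, Gaussian-weighted mean (the five weight values are an assumption; they only need to peak at the middle of the sorted window):

import numpy as np

def sorted_gaussian_filter(f, weights=(0.1, 0.2, 0.4, 0.2, 0.1)):
    # Within each five-point window the values are sorted and combined
    # with a Gaussian-like weight profile, so the result behaves like a
    # median filter against single peaks but still smooths the function.
    w = np.asarray(weights)
    k = len(w) // 2
    out = np.asarray(f, dtype=np.float64).copy()
    for i in range(k, len(out) - k):
        out[i] = (np.sort(out[i - k:i + k + 1]) * w).sum()
    return out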


Multimedia Computing: Algorithms, Systems, and Applications: Edge Detection Multimedia Computing: Algorithms, Systems, and Applications: Edge Detection By Dr. Yu Cao Department of Computer Science The University of Massachusetts Lowell Lowell, MA 01854, USA Part of the slides

More information

CCSSM Curriculum Analysis Project Tool 1 Interpreting Functions in Grades 9-12

CCSSM Curriculum Analysis Project Tool 1 Interpreting Functions in Grades 9-12 Tool 1: Standards for Mathematical ent: Interpreting Functions CCSSM Curriculum Analysis Project Tool 1 Interpreting Functions in Grades 9-12 Name of Reviewer School/District Date Name of Curriculum Materials:

More information

Announcements. Edges. Last Lecture. Gradients: Numerical Derivatives f(x) Edge Detection, Lines. Intro Computer Vision. CSE 152 Lecture 10

Announcements. Edges. Last Lecture. Gradients: Numerical Derivatives f(x) Edge Detection, Lines. Intro Computer Vision. CSE 152 Lecture 10 Announcements Assignment 2 due Tuesday, May 4. Edge Detection, Lines Midterm: Thursday, May 6. Introduction to Computer Vision CSE 152 Lecture 10 Edges Last Lecture 1. Object boundaries 2. Surface normal

More information

Lecture 9: Hough Transform and Thresholding base Segmentation

Lecture 9: Hough Transform and Thresholding base Segmentation #1 Lecture 9: Hough Transform and Thresholding base Segmentation Saad Bedros sbedros@umn.edu Hough Transform Robust method to find a shape in an image Shape can be described in parametric form A voting

More information

Chapter 9 Object Tracking an Overview

Chapter 9 Object Tracking an Overview Chapter 9 Object Tracking an Overview The output of the background subtraction algorithm, described in the previous chapter, is a classification (segmentation) of pixels into foreground pixels (those belonging

More information

Announcements. Edge Detection. An Isotropic Gaussian. Filters are templates. Assignment 2 on tracking due this Friday Midterm: Tuesday, May 3.

Announcements. Edge Detection. An Isotropic Gaussian. Filters are templates. Assignment 2 on tracking due this Friday Midterm: Tuesday, May 3. Announcements Edge Detection Introduction to Computer Vision CSE 152 Lecture 9 Assignment 2 on tracking due this Friday Midterm: Tuesday, May 3. Reading from textbook An Isotropic Gaussian The picture

More information

Image Processing. BITS Pilani. Dr Jagadish Nayak. Dubai Campus

Image Processing. BITS Pilani. Dr Jagadish Nayak. Dubai Campus Image Processing BITS Pilani Dubai Campus Dr Jagadish Nayak Image Segmentation BITS Pilani Dubai Campus Fundamentals Let R be the entire spatial region occupied by an image Process that partitions R into

More information

Why study Computer Vision?

Why study Computer Vision? Why study Computer Vision? Images and movies are everywhere Fast-growing collection of useful applications building representations of the 3D world from pictures automated surveillance (who s doing what)

More information

Image Segmentation Techniques for Object-Based Coding

Image Segmentation Techniques for Object-Based Coding Image Techniques for Object-Based Coding Junaid Ahmed, Joseph Bosworth, and Scott T. Acton The Oklahoma Imaging Laboratory School of Electrical and Computer Engineering Oklahoma State University {ajunaid,bosworj,sacton}@okstate.edu

More information

Topic 4 Image Segmentation

Topic 4 Image Segmentation Topic 4 Image Segmentation What is Segmentation? Why? Segmentation important contributing factor to the success of an automated image analysis process What is Image Analysis: Processing images to derive

More information

IMAGE PROCESSING >FILTERS AND EDGE DETECTION FOR COLOR IMAGES UTRECHT UNIVERSITY RONALD POPPE

IMAGE PROCESSING >FILTERS AND EDGE DETECTION FOR COLOR IMAGES UTRECHT UNIVERSITY RONALD POPPE IMAGE PROCESSING >FILTERS AND EDGE DETECTION FOR COLOR IMAGES UTRECHT UNIVERSITY RONALD POPPE OUTLINE Filters for color images Edge detection for color images Canny edge detection FILTERS FOR COLOR IMAGES

More information

CS334: Digital Imaging and Multimedia Edges and Contours. Ahmed Elgammal Dept. of Computer Science Rutgers University

CS334: Digital Imaging and Multimedia Edges and Contours. Ahmed Elgammal Dept. of Computer Science Rutgers University CS334: Digital Imaging and Multimedia Edges and Contours Ahmed Elgammal Dept. of Computer Science Rutgers University Outlines What makes an edge? Gradient-based edge detection Edge Operators From Edges

More information

Obtaining Feature Correspondences

Obtaining Feature Correspondences Obtaining Feature Correspondences Neill Campbell May 9, 2008 A state-of-the-art system for finding objects in images has recently been developed by David Lowe. The algorithm is termed the Scale-Invariant

More information

Biomedical Image Analysis. Point, Edge and Line Detection

Biomedical Image Analysis. Point, Edge and Line Detection Biomedical Image Analysis Point, Edge and Line Detection Contents: Point and line detection Advanced edge detection: Canny Local/regional edge processing Global processing: Hough transform BMIA 15 V. Roth

More information

Edge and corner detection

Edge and corner detection Edge and corner detection Prof. Stricker Doz. G. Bleser Computer Vision: Object and People Tracking Goals Where is the information in an image? How is an object characterized? How can I find measurements

More information

Computer Vision I. Announcements. Fourier Tansform. Efficient Implementation. Edge and Corner Detection. CSE252A Lecture 13.

Computer Vision I. Announcements. Fourier Tansform. Efficient Implementation. Edge and Corner Detection. CSE252A Lecture 13. Announcements Edge and Corner Detection HW3 assigned CSE252A Lecture 13 Efficient Implementation Both, the Box filter and the Gaussian filter are separable: First convolve each row of input image I with

More information

MPEG-7 Visual shape descriptors

MPEG-7 Visual shape descriptors MPEG-7 Visual shape descriptors Miroslaw Bober presented by Peter Tylka Seminar on scientific soft skills 22.3.2012 Presentation Outline Presentation Outline Introduction to problem Shape spectrum - 3D

More information

convolution shift invariant linear system Fourier Transform Aliasing and sampling scale representation edge detection corner detection

convolution shift invariant linear system Fourier Transform Aliasing and sampling scale representation edge detection corner detection COS 429: COMPUTER VISON Linear Filters and Edge Detection convolution shift invariant linear system Fourier Transform Aliasing and sampling scale representation edge detection corner detection Reading:

More information

Segmentation and Grouping

Segmentation and Grouping Segmentation and Grouping How and what do we see? Fundamental Problems ' Focus of attention, or grouping ' What subsets of pixels do we consider as possible objects? ' All connected subsets? ' Representation

More information

Analysis of Image and Video Using Color, Texture and Shape Features for Object Identification

Analysis of Image and Video Using Color, Texture and Shape Features for Object Identification IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 16, Issue 6, Ver. VI (Nov Dec. 2014), PP 29-33 Analysis of Image and Video Using Color, Texture and Shape Features

More information

Terrain Rendering using Multiple Optimally Adapting Meshes (MOAM)

Terrain Rendering using Multiple Optimally Adapting Meshes (MOAM) Examensarbete LITH-ITN-MT-EX--04/018--SE Terrain Rendering using Multiple Optimally Adapting Meshes (MOAM) Mårten Larsson 2004-02-23 Department of Science and Technology Linköpings Universitet SE-601 74

More information

Filtering Applications & Edge Detection. GV12/3072 Image Processing.

Filtering Applications & Edge Detection. GV12/3072 Image Processing. Filtering Applications & Edge Detection GV12/3072 1 Outline Sampling & Reconstruction Revisited Anti-Aliasing Edges Edge detection Simple edge detector Canny edge detector Performance analysis Hough Transform

More information

Change detection using joint intensity histogram

Change detection using joint intensity histogram Change detection using joint intensity histogram Yasuyo Kita National Institute of Advanced Industrial Science and Technology (AIST) Information Technology Research Institute AIST Tsukuba Central 2, 1-1-1

More information

Adaptive Skin Color Classifier for Face Outline Models

Adaptive Skin Color Classifier for Face Outline Models Adaptive Skin Color Classifier for Face Outline Models M. Wimmer, B. Radig, M. Beetz Informatik IX, Technische Universität München, Germany Boltzmannstr. 3, 87548 Garching, Germany [wimmerm, radig, beetz]@informatik.tu-muenchen.de

More information

Region Segmentation for Facial Image Compression

Region Segmentation for Facial Image Compression Region Segmentation for Facial Image Compression Alexander Tropf and Douglas Chai Visual Information Processing Research Group School of Engineering and Mathematics, Edith Cowan University Perth, Australia

More information

A New Technique of Extraction of Edge Detection Using Digital Image Processing

A New Technique of Extraction of Edge Detection Using Digital Image Processing International OPEN ACCESS Journal Of Modern Engineering Research (IJMER) A New Technique of Extraction of Edge Detection Using Digital Image Processing Balaji S.C.K 1 1, Asst Professor S.V.I.T Abstract:

More information

Chapter 3: Intensity Transformations and Spatial Filtering

Chapter 3: Intensity Transformations and Spatial Filtering Chapter 3: Intensity Transformations and Spatial Filtering 3.1 Background 3.2 Some basic intensity transformation functions 3.3 Histogram processing 3.4 Fundamentals of spatial filtering 3.5 Smoothing

More information

Experiments with Edge Detection using One-dimensional Surface Fitting

Experiments with Edge Detection using One-dimensional Surface Fitting Experiments with Edge Detection using One-dimensional Surface Fitting Gabor Terei, Jorge Luis Nunes e Silva Brito The Ohio State University, Department of Geodetic Science and Surveying 1958 Neil Avenue,

More information

CHAPTER 2 TEXTURE CLASSIFICATION METHODS GRAY LEVEL CO-OCCURRENCE MATRIX AND TEXTURE UNIT

CHAPTER 2 TEXTURE CLASSIFICATION METHODS GRAY LEVEL CO-OCCURRENCE MATRIX AND TEXTURE UNIT CHAPTER 2 TEXTURE CLASSIFICATION METHODS GRAY LEVEL CO-OCCURRENCE MATRIX AND TEXTURE UNIT 2.1 BRIEF OUTLINE The classification of digital imagery is to extract useful thematic information which is one

More information

Geometric Computations for Simulation

Geometric Computations for Simulation 1 Geometric Computations for Simulation David E. Johnson I. INTRODUCTION A static virtual world would be boring and unlikely to draw in a user enough to create a sense of immersion. Simulation allows things

More information

CHAPTER 6 PERCEPTUAL ORGANIZATION BASED ON TEMPORAL DYNAMICS

CHAPTER 6 PERCEPTUAL ORGANIZATION BASED ON TEMPORAL DYNAMICS CHAPTER 6 PERCEPTUAL ORGANIZATION BASED ON TEMPORAL DYNAMICS This chapter presents a computational model for perceptual organization. A figure-ground segregation network is proposed based on a novel boundary

More information

Feature Detectors and Descriptors: Corners, Lines, etc.

Feature Detectors and Descriptors: Corners, Lines, etc. Feature Detectors and Descriptors: Corners, Lines, etc. Edges vs. Corners Edges = maxima in intensity gradient Edges vs. Corners Corners = lots of variation in direction of gradient in a small neighborhood

More information

Module 1 Lecture Notes 2. Optimization Problem and Model Formulation

Module 1 Lecture Notes 2. Optimization Problem and Model Formulation Optimization Methods: Introduction and Basic concepts 1 Module 1 Lecture Notes 2 Optimization Problem and Model Formulation Introduction In the previous lecture we studied the evolution of optimization

More information

5. Feature Extraction from Images

5. Feature Extraction from Images 5. Feature Extraction from Images Aim of this Chapter: Learn the Basic Feature Extraction Methods for Images Main features: Color Texture Edges Wie funktioniert ein Mustererkennungssystem Test Data x i

More information

MIME: A Gesture-Driven Computer Interface

MIME: A Gesture-Driven Computer Interface MIME: A Gesture-Driven Computer Interface Daniel Heckenberg a and Brian C. Lovell b a Department of Computer Science and Electrical Engineering, The University of Queensland, Brisbane, Australia, 4072

More information

Eugene Borovikov. Human Head Pose Estimation by Facial Features Location. UMCP Human Head Pose Estimation by Facial Features Location

Eugene Borovikov. Human Head Pose Estimation by Facial Features Location. UMCP Human Head Pose Estimation by Facial Features Location Human Head Pose Estimation by Facial Features Location Abstract: Eugene Borovikov University of Maryland Institute for Computer Studies, College Park, MD 20742 4/21/1998 We describe a method for estimating

More information

EXAM SOLUTIONS. Image Processing and Computer Vision Course 2D1421 Monday, 13 th of March 2006,

EXAM SOLUTIONS. Image Processing and Computer Vision Course 2D1421 Monday, 13 th of March 2006, School of Computer Science and Communication, KTH Danica Kragic EXAM SOLUTIONS Image Processing and Computer Vision Course 2D1421 Monday, 13 th of March 2006, 14.00 19.00 Grade table 0-25 U 26-35 3 36-45

More information

C E N T E R A T H O U S T O N S C H O O L of H E A L T H I N F O R M A T I O N S C I E N C E S. Image Operations II

C E N T E R A T H O U S T O N S C H O O L of H E A L T H I N F O R M A T I O N S C I E N C E S. Image Operations II T H E U N I V E R S I T Y of T E X A S H E A L T H S C I E N C E C E N T E R A T H O U S T O N S C H O O L of H E A L T H I N F O R M A T I O N S C I E N C E S Image Operations II For students of HI 5323

More information

This chapter explains two techniques which are frequently used throughout

This chapter explains two techniques which are frequently used throughout Chapter 2 Basic Techniques This chapter explains two techniques which are frequently used throughout this thesis. First, we will introduce the concept of particle filters. A particle filter is a recursive

More information

Moving Object Segmentation Method Based on Motion Information Classification by X-means and Spatial Region Segmentation

Moving Object Segmentation Method Based on Motion Information Classification by X-means and Spatial Region Segmentation IJCSNS International Journal of Computer Science and Network Security, VOL.13 No.11, November 2013 1 Moving Object Segmentation Method Based on Motion Information Classification by X-means and Spatial

More information

DOWNLOAD PDF BIG IDEAS MATH VERTICAL SHRINK OF A PARABOLA

DOWNLOAD PDF BIG IDEAS MATH VERTICAL SHRINK OF A PARABOLA Chapter 1 : BioMath: Transformation of Graphs Use the results in part (a) to identify the vertex of the parabola. c. Find a vertical line on your graph paper so that when you fold the paper, the left portion

More information

Module 7 VIDEO CODING AND MOTION ESTIMATION

Module 7 VIDEO CODING AND MOTION ESTIMATION Module 7 VIDEO CODING AND MOTION ESTIMATION Lesson 20 Basic Building Blocks & Temporal Redundancy Instructional Objectives At the end of this lesson, the students should be able to: 1. Name at least five

More information

Local Image preprocessing (cont d)

Local Image preprocessing (cont d) Local Image preprocessing (cont d) 1 Outline - Edge detectors - Corner detectors - Reading: textbook 5.3.1-5.3.5 and 5.3.10 2 What are edges? Edges correspond to relevant features in the image. An edge

More information

CS443: Digital Imaging and Multimedia Binary Image Analysis. Spring 2008 Ahmed Elgammal Dept. of Computer Science Rutgers University

CS443: Digital Imaging and Multimedia Binary Image Analysis. Spring 2008 Ahmed Elgammal Dept. of Computer Science Rutgers University CS443: Digital Imaging and Multimedia Binary Image Analysis Spring 2008 Ahmed Elgammal Dept. of Computer Science Rutgers University Outlines A Simple Machine Vision System Image segmentation by thresholding

More information

Applications. Foreground / background segmentation Finding skin-colored regions. Finding the moving objects. Intelligent scissors

Applications. Foreground / background segmentation Finding skin-colored regions. Finding the moving objects. Intelligent scissors Segmentation I Goal Separate image into coherent regions Berkeley segmentation database: http://www.eecs.berkeley.edu/research/projects/cs/vision/grouping/segbench/ Slide by L. Lazebnik Applications Intelligent

More information

IMPLEMENTATION OF SPATIAL FUZZY CLUSTERING IN DETECTING LIP ON COLOR IMAGES

IMPLEMENTATION OF SPATIAL FUZZY CLUSTERING IN DETECTING LIP ON COLOR IMAGES IMPLEMENTATION OF SPATIAL FUZZY CLUSTERING IN DETECTING LIP ON COLOR IMAGES Agus Zainal Arifin 1, Adhatus Sholichah 2, Anny Yuniarti 3, Dini Adni Navastara 4, Wijayanti Nurul Khotimah 5 1,2,3,4,5 Department

More information

Facial Expression Analysis for Model-Based Coding of Video Sequences

Facial Expression Analysis for Model-Based Coding of Video Sequences Picture Coding Symposium, pp. 33-38, Berlin, September 1997. Facial Expression Analysis for Model-Based Coding of Video Sequences Peter Eisert and Bernd Girod Telecommunications Institute, University of

More information

FACE RECOGNITION USING INDEPENDENT COMPONENT

FACE RECOGNITION USING INDEPENDENT COMPONENT Chapter 5 FACE RECOGNITION USING INDEPENDENT COMPONENT ANALYSIS OF GABORJET (GABORJET-ICA) 5.1 INTRODUCTION PCA is probably the most widely used subspace projection technique for face recognition. A major

More information

6. Applications - Text recognition in videos - Semantic video analysis

6. Applications - Text recognition in videos - Semantic video analysis 6. Applications - Text recognition in videos - Semantic video analysis Stephan Kopf 1 Motivation Goal: Segmentation and classification of characters Only few significant features are visible in these simple

More information

SHAPE, SPACE & MEASURE

SHAPE, SPACE & MEASURE STAGE 1 Know the place value headings up to millions Recall primes to 19 Know the first 12 square numbers Know the Roman numerals I, V, X, L, C, D, M Know the % symbol Know percentage and decimal equivalents

More information

Image Analysis - Lecture 5

Image Analysis - Lecture 5 Texture Segmentation Clustering Review Image Analysis - Lecture 5 Texture and Segmentation Magnus Oskarsson Lecture 5 Texture Segmentation Clustering Review Contents Texture Textons Filter Banks Gabor

More information

Part-Based Skew Estimation for Mathematical Expressions

Part-Based Skew Estimation for Mathematical Expressions Soma Shiraishi, Yaokai Feng, and Seiichi Uchida shiraishi@human.ait.kyushu-u.ac.jp {fengyk,uchida}@ait.kyushu-u.ac.jp Abstract We propose a novel method for the skew estimation on text images containing

More information

Implementation and Comparison of Four Different Boundary Detection Algorithms for Quantitative Ultrasonic Measurements of the Human Carotid Artery

Implementation and Comparison of Four Different Boundary Detection Algorithms for Quantitative Ultrasonic Measurements of the Human Carotid Artery Implementation and Comparison of Four Different Boundary Detection Algorithms for Quantitative Ultrasonic Measurements of the Human Carotid Artery Masters Thesis By Ghassan Hamarneh Rafeef Abu-Gharbieh

More information

Short Survey on Static Hand Gesture Recognition

Short Survey on Static Hand Gesture Recognition Short Survey on Static Hand Gesture Recognition Huu-Hung Huynh University of Science and Technology The University of Danang, Vietnam Duc-Hoang Vo University of Science and Technology The University of

More information

EE368 Project Report CD Cover Recognition Using Modified SIFT Algorithm

EE368 Project Report CD Cover Recognition Using Modified SIFT Algorithm EE368 Project Report CD Cover Recognition Using Modified SIFT Algorithm Group 1: Mina A. Makar Stanford University mamakar@stanford.edu Abstract In this report, we investigate the application of the Scale-Invariant

More information

CS 664 Segmentation. Daniel Huttenlocher

CS 664 Segmentation. Daniel Huttenlocher CS 664 Segmentation Daniel Huttenlocher Grouping Perceptual Organization Structural relationships between tokens Parallelism, symmetry, alignment Similarity of token properties Often strong psychophysical

More information

Comparative Study of ROI Extraction of Palmprint

Comparative Study of ROI Extraction of Palmprint 251 Comparative Study of ROI Extraction of Palmprint 1 Milind E. Rane, 2 Umesh S Bhadade 1,2 SSBT COE&T, North Maharashtra University Jalgaon, India Abstract - The Palmprint region segmentation is an important

More information

Artifacts and Textured Region Detection

Artifacts and Textured Region Detection Artifacts and Textured Region Detection 1 Vishal Bangard ECE 738 - Spring 2003 I. INTRODUCTION A lot of transformations, when applied to images, lead to the development of various artifacts in them. In

More information

Coarse-to-fine image registration

Coarse-to-fine image registration Today we will look at a few important topics in scale space in computer vision, in particular, coarseto-fine approaches, and the SIFT feature descriptor. I will present only the main ideas here to give

More information

Chapter 11 Image Processing

Chapter 11 Image Processing Chapter Image Processing Low-level Image Processing Operates directly on a stored image to improve or enhance it. Stored image consists of a two-dimensional array of pixels (picture elements): Origin (0,

More information

Computer Vision I - Basics of Image Processing Part 2

Computer Vision I - Basics of Image Processing Part 2 Computer Vision I - Basics of Image Processing Part 2 Carsten Rother 07/11/2014 Computer Vision I: Basics of Image Processing Roadmap: Basics of Digital Image Processing Computer Vision I: Basics of Image

More information

coding of various parts showing different features, the possibility of rotation or of hiding covering parts of the object's surface to gain an insight

coding of various parts showing different features, the possibility of rotation or of hiding covering parts of the object's surface to gain an insight Three-Dimensional Object Reconstruction from Layered Spatial Data Michael Dangl and Robert Sablatnig Vienna University of Technology, Institute of Computer Aided Automation, Pattern Recognition and Image

More information

GENERAL AUTOMATED FLAW DETECTION SCHEME FOR NDE X-RAY IMAGES

GENERAL AUTOMATED FLAW DETECTION SCHEME FOR NDE X-RAY IMAGES GENERAL AUTOMATED FLAW DETECTION SCHEME FOR NDE X-RAY IMAGES Karl W. Ulmer and John P. Basart Center for Nondestructive Evaluation Department of Electrical and Computer Engineering Iowa State University

More information

Range Imaging Through Triangulation. Range Imaging Through Triangulation. Range Imaging Through Triangulation. Range Imaging Through Triangulation

Range Imaging Through Triangulation. Range Imaging Through Triangulation. Range Imaging Through Triangulation. Range Imaging Through Triangulation Obviously, this is a very slow process and not suitable for dynamic scenes. To speed things up, we can use a laser that projects a vertical line of light onto the scene. This laser rotates around its vertical

More information

Facial Expression Detection Using Implemented (PCA) Algorithm

Facial Expression Detection Using Implemented (PCA) Algorithm Facial Expression Detection Using Implemented (PCA) Algorithm Dileep Gautam (M.Tech Cse) Iftm University Moradabad Up India Abstract: Facial expression plays very important role in the communication with

More information

Segmentation and Tracking of Partial Planar Templates

Segmentation and Tracking of Partial Planar Templates Segmentation and Tracking of Partial Planar Templates Abdelsalam Masoud William Hoff Colorado School of Mines Colorado School of Mines Golden, CO 800 Golden, CO 800 amasoud@mines.edu whoff@mines.edu Abstract

More information

SIFT: SCALE INVARIANT FEATURE TRANSFORM SURF: SPEEDED UP ROBUST FEATURES BASHAR ALSADIK EOS DEPT. TOPMAP M13 3D GEOINFORMATION FROM IMAGES 2014

SIFT: SCALE INVARIANT FEATURE TRANSFORM SURF: SPEEDED UP ROBUST FEATURES BASHAR ALSADIK EOS DEPT. TOPMAP M13 3D GEOINFORMATION FROM IMAGES 2014 SIFT: SCALE INVARIANT FEATURE TRANSFORM SURF: SPEEDED UP ROBUST FEATURES BASHAR ALSADIK EOS DEPT. TOPMAP M13 3D GEOINFORMATION FROM IMAGES 2014 SIFT SIFT: Scale Invariant Feature Transform; transform image

More information

Motivation. Gray Levels

Motivation. Gray Levels Motivation Image Intensity and Point Operations Dr. Edmund Lam Department of Electrical and Electronic Engineering The University of Hong ong A digital image is a matrix of numbers, each corresponding

More information

Other Linear Filters CS 211A

Other Linear Filters CS 211A Other Linear Filters CS 211A Slides from Cornelia Fermüller and Marc Pollefeys Edge detection Convert a 2D image into a set of curves Extracts salient features of the scene More compact than pixels Origin

More information

Chapter 4. Clustering Core Atoms by Location

Chapter 4. Clustering Core Atoms by Location Chapter 4. Clustering Core Atoms by Location In this chapter, a process for sampling core atoms in space is developed, so that the analytic techniques in section 3C can be applied to local collections

More information

An Efficient Single Chord-based Accumulation Technique (SCA) to Detect More Reliable Corners

An Efficient Single Chord-based Accumulation Technique (SCA) to Detect More Reliable Corners An Efficient Single Chord-based Accumulation Technique (SCA) to Detect More Reliable Corners Mohammad Asiful Hossain, Abdul Kawsar Tushar, and Shofiullah Babor Computer Science and Engineering Department,

More information

Automatic Tracking of Moving Objects in Video for Surveillance Applications

Automatic Tracking of Moving Objects in Video for Surveillance Applications Automatic Tracking of Moving Objects in Video for Surveillance Applications Manjunath Narayana Committee: Dr. Donna Haverkamp (Chair) Dr. Arvin Agah Dr. James Miller Department of Electrical Engineering

More information

Structural Analysis of Aerial Photographs (HB47 Computer Vision: Assignment)

Structural Analysis of Aerial Photographs (HB47 Computer Vision: Assignment) Structural Analysis of Aerial Photographs (HB47 Computer Vision: Assignment) Xiaodong Lu, Jin Yu, Yajie Li Master in Artificial Intelligence May 2004 Table of Contents 1 Introduction... 1 2 Edge-Preserving

More information

Histograms of Oriented Gradients

Histograms of Oriented Gradients Histograms of Oriented Gradients Carlo Tomasi September 18, 2017 A useful question to ask of an image is whether it contains one or more instances of a certain object: a person, a face, a car, and so forth.

More information

MA 323 Geometric Modelling Course Notes: Day 21 Three Dimensional Bezier Curves, Projections and Rational Bezier Curves

MA 323 Geometric Modelling Course Notes: Day 21 Three Dimensional Bezier Curves, Projections and Rational Bezier Curves MA 323 Geometric Modelling Course Notes: Day 21 Three Dimensional Bezier Curves, Projections and Rational Bezier Curves David L. Finn Over the next few days, we will be looking at extensions of Bezier

More information