Facial Landmark Detection using Active Shape Models


Departament de Teoria del Senyal i les Comunicacions
Escola d'Enginyeria de Terrassa
Universitat Politècnica de Catalunya

Facial Landmark Detection using Active Shape Models

Author: Gonzalo Lopez Lillo
Professor: Josep Ramon Morros
Barcelona, October 2014

Acknowledgments

I would like to thank my brother, who encouraged me to study this degree and to do this project in English. A special appreciation to the blonde who lives in the attic. I am thankful to my team for their great human quality. Also a special appreciation to the coffee girl with a flower in her head. And last, but not least, a special mention to Ramon Morros, who gave me the opportunity to be here.

Contents

Illustrations
Chapter 1: Introduction
    1.1 Introduction
    1.2 Project objectives
    1.3 Some clarifications of the project
    1.4 Project description
Chapter 2: History and Literature Review
    2.1 Active Shape Model
    2.2 Active Shape Model with SIFT
    2.3 MARS
    2.4 Active Shape Model with SIFT and MARS
Chapter 3: Active Shape Model
    3.1 Aligning the Train Set
    3.2 The Shape Model
        3.2.1 PCA in the Shape Model
    3.3 Classic Active Shape Model
    3.4 Multi-resolution
    3.5 The New Active Shape Model
        3.5.1 HAT & SIFT
            Scale-Invariant Feature Transform (SIFT)
            Histogram Array Transform (HAT)
    3.6 Multivariate Adaptive Regression Splines (MARS)
Chapter 4: System Description
    4.1 Block Diagram
    4.2 Face Detector
    4.3 Database model ASM
    4.4 The ASM model
    4.5 Init Models and Detector
        4.5.1 Parameters bmax and neigs
    4.6 Align the Shape
        4.6.1 Align with the Rectangle
        4.6.2 Align with the Eyes
        4.6.3 Align with the Eyes and Mouth
    4.7 Select ASM Model
    4.8 ROI Shape to frame
    4.9 Mouth Identification
Chapter 5: Database
    5.1 Facial Dataset
    5.2 Open and closed mouth dataset
Chapter 6: Open or Closed Mouth State Detection
    6.1 Active Shape Model in Mouth
    6.2 Vector of features
    6.3 Classifier
        6.3.1 Training
        6.3.2 Classification
Chapter 7: Experimental Results
    7.1 NEW ASM
        7.1.1 The me17 measure
        7.1.2 Compare the different start shape models
        7.1.3 Graphs for comparing models
            Image error with me17
            Landmark error with me17
            Classic ASM vs New ASM
            Empirical cumulative distribution function
        7.1.4 Visual evaluation with empirical test
    7.2 Mouth State Detector
        7.2.1 Confusion Matrix
        7.2.2 The ROC curve
        7.2.3 The ROC curve for dataset validation
        7.2.4 Results
Chapter 8: Conclusions and Possible Future Research
    8.1 Conclusions
    8.2 Possible Future Research
Appendix A: Face landmarks
Bibliography

Illustrations

Figure 1: Face image with landmarks. Image from the BioID database.
Figure 2: Aligning the position, rotation and scale of the shape.
Figure 3: Algorithm for aligning the train set.
Figure 4: The mean shape.
Figure 5: Varying the first three shape parameters between -3\sqrt{\lambda_i} and 3\sqrt{\lambda_i}.
Figure 6: Applying an Active Shape Model to a face and iterating until convergence.
Figure 7: Three resolutions of the image.
Figure 8: Block diagram of the New Active Shape Model.
Figure 9: Block diagram of SIFT.
Figure 10: An area in the patch map used to extract a gradient.
Figure 11: Converting the gradient with 8 bins to a histogram.
Figure 12: General block diagram of the software.
Figure 13: Block diagram of the software in small modules.
Figure 14: Input image before face detection, and the ROI image after face detection.
Figure 15: Block diagram of the model training used by Stephen Milborrow and Fred Nicolls to create a frontal model.
Figure 16: Aligning the ROI with the rectangle model.
Figure 17: Aligning the ROI with the eye detector.
Figure 18: Aligning the ROI with the eye and mouth detectors.
Figure 19: Selecting and adjusting the size of the start shape by triangulating the eye positions with the mouth position.
Figure 20: Fitting the shape using HAT and MARS.
Figure 21: The BioID face database.
Figure 22: Frames from "Maria Konnikova: How to think like Sherlock Holmes", Big Think channel.
Figure 23: Frames from "The Importance of Knowing", Big Think channel.
Figure 24: Frames from "How to Revise like Sherlock (Mind Palace)", Maddie Moate channel.
Figure 25: Frames from "Sam Harris BigThink (ALL)", Big Think channel.
Figure 26: Telenotícies TV3 (TN).
Figure 27: Landmarks located in the mouth.
Figure 28: The distance between the two inner-lip landmarks and the mouth-width distance.
Figure 29: Histograms with Gaussian fits. The left histogram is for the closed mouth and the right histogram is for the open mouth.
Figure 30: The two histograms with their Gaussian fits in the same plane, used to find the intersection between the Gaussians and set the threshold.
Figure 31: Images with many landmark errors found in the graph results. Ordered left to right and top to bottom, these images correspond to numbers 259, 282, 392, 901, 423, 640, 740 and 803 of Graph 2.
Figure 32: BioID landmarks.
Figure 33: Classic ASM with 68 landmarks.
Figure 34: New ASM with 77 landmarks.


Chapter 1

1. Introduction

1.1 Introduction

In today's digital world the consumer electronics industry is undergoing a massive transformation, as consumers become more connected than ever to a growing array of digital products and devices, anywhere, anytime. Devices such as smartphones and tablets are commonplace, and the amount of visual data they generate, both video and images, is increasing exponentially. This creates the need to know how to deal with the information that data contains. One of the most important pre-processing issues is how to retrieve the information related to people from those multimedia devices. Learning how to use this information lets us interact with systems using cameras to automate processes.

One of the most relevant features of a person is their face. In the last couple of years there have been extensive efforts around facial detection and recognition; the key in these algorithms is the detection of facial features (eyes, mouth, nose, etc.). By effectively identifying these facial features we can improve facial recognition techniques, for example estimating the pose of the face or recognizing emotions.

This project focuses on one of the tools used to localise facial features, specifically on some variations of the Active Shape Model [1]. ASM is a method based on a model built from interest points; it uses algorithms that try to find the maximum coincidence between the model and the image data. The information obtained from ASM will be used to recognize expressions. This information has several applications; for example, eye detection could let you control your smartphone with your eyes. Here the focus will be on how to use this information to create an open or closed mouth state detector, which could be used to improve sound recording in videos.

1.2 Project objectives

The main objective is to adapt the Stasm code by Stephen Milborrow and Fred Nicolls [11] for facial recognition and extraction of facial features to ImagePlus (the new C++ development platform of the Video and Image Processing Group at UPC). In order to achieve the main objective the following tasks have been completed. All the objectives are focused on relating images to faces and facial features:

- Given an image with a face, find the positions of the face features. The set of local points describing each face, as shown in Figure 1, are called landmarks.

Figure 1: Face image with landmarks. Image from the BioID database.

- To compare the new Active Shape Model with the old one. The old model uses the SIFT [10] descriptor and PCA [16] and has 68 landmarks; the New Active Shape Model uses the SIFT descriptor and MARS [18] and has 77 landmarks.

- To recognize facial expressions, specifically open and closed mouth detection, using the Active Shape Model and a binary classifier.

1.3 Some clarifications of the project

The feature extraction process is extremely complex and there are many techniques, and obviously this project does not start from scratch. In order to avoid confusion or pitfalls, it is necessary to point out some aspects that are not included in this project:

- Face detection. The face position must be located before using the Active Shape Model. The Viola-Jones [17] detector of OpenCV [14] is used to check whether any faces are present in the image.
- Only frontal face images are accepted. Our ASM implementation does not contain any side-view models.
- Face recognition. This project does not identify the people present in an image.

1.4 Project description

The project is divided into seven main parts:

- State of the art, a brief review of the literature.
- ASM, SIFT, MARS and HAT theoretical contents.
- The software structure.
- Database validation set and test set.
- Open and closed mouth detection.
- Summary of results and a comparison with published results.
- Conclusions, discussion and possible future research.

Chapter 2

2. History and Literature Review

2.1 Active Shape Model

This statistical approach is used for shape modelling and feature extraction. ASM was introduced by Cootes [1]; in the same document he also presented AAM. The technique is able to estimate the shape of an object with a high level of accuracy. In addition, there are many variants of the algorithm with published results, which establishes a comparative framework that has allowed the development and improvement of different techniques.

In October 2004, Fei Zuo [2] and Peter H. N. de With created a face recognition system. They used a deformable shape model with Haar-wavelet based local texture attributes. In 2005, Kwok-Wai [3] proposed an ASM able to adapt to facial images under different orientations; to represent the face in different poses, one model is used per view. In 2006, Mohammad H. [4] proposed extracting the facial features with RGB information, using an enhanced active shape model.

2.2 Active Shape Model with SIFT

There are many proposals related to the usage of SIFT inside ASM; we mention just a few. In 2007, Kanaujia and Metaxas [7] used SIFT descriptors with multiple shape models clustered on different face poses. In 2009, Zhou [5] built an automatic landmarker with a SIFT descriptor and Mahalanobis distances; Zhou and Petrovska improved the eye and mouth distances. In the same year, Li Z. [6] searched for facial features using statistical models and SIFT descriptors. Finally, in 2011 Belhumeur used SIFT descriptors with SVM and RANSAC [8].

2.3 MARS

Multivariate Adaptive Regression Splines (MARS) is a general-purpose regression technique introduced by Jerome Friedman in 1991. Leathwick et al. in 2005 used MARS to predict the distributions of New Zealand's freshwater diadromous fish. Vogel et al. used MARS to show that mRNA concentration can explain two thirds of the protein abundance variation in a human cell line [9].

MARS has demonstrated good performance in different applications, although it is not well known in the image processing community.

2.4 Active Shape Model with SIFT and MARS

In 2014, Stephen Milborrow and Fred Nicolls used the ASM with a SIFT descriptor and MARS [11]. They used a HAT descriptor, a simplified version of SIFT: the HAT descriptor is essentially an unrotated SIFT descriptor with a fixed scale. This project is based on Milborrow's model. From now on, this model will be referred to as the New Active Shape Model [11].

Chapter 3

3. Active Shape Model

The Active Shape Model is used to represent objects in images. The shape of the object is represented by a set of n points; in the case of faces these points are two-dimensional. These points are called landmarks.

    x = (x_1, y_1, x_2, y_2, \ldots, x_n, y_n)^T        (Equation 3-1)

In order to choose the best landmark points it is necessary to be able to find the same points in other images. They could be placed at clear corners of object boundaries; in the case of faces, landmarks are usually located in places that lie on boundaries and resemble the shape of the face with the most precision. The objective is to build a model that allows us to create new shapes and to synthesise shapes similar to those in a training set. The training set comes from hand annotation, though there are automatic landmarking systems.

3.1 Aligning the Train Set

It is necessary to align the train set so that all shapes are in the same reference system. By aligning the position, rotation and scale of each shape, we achieve the minimum distance between the shapes.

Figure 2: Aligning the position, rotation and scale of the shape.

The procedure is as follows (a code sketch is given after the list):

1. Translate each example so that its centre of gravity is at the origin.
2. Choose one example as an initial estimate of the mean shape and scale it so that |x̄| = 1.
3. Record this first estimate as x̄_0 to define the default reference frame.
4. Align all the shapes with the current estimate of the mean shape.
5. Re-estimate the mean from the aligned shapes.
6. Apply constraints on the current estimate of the mean by aligning it with x̄_0 and scaling so that |x̄| = 1.
7. If not converged, return to step 4.

Figure 3: Algorithm for aligning the train set.
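As an illustration, the alignment loop above can be sketched in a few lines of NumPy (a minimal sketch, not the thesis's C++ implementation; shapes are assumed to be (n, 2) arrays of landmark coordinates, and a fixed number of iterations stands in for a convergence test):

import numpy as np

def align(shape, target):
    # Similarity-align `shape` to `target` (both centred on the origin):
    # closed-form Procrustes fit of the scale and rotation.
    denom = (shape ** 2).sum()
    a = (shape * target).sum() / denom                    # s*cos(theta)
    b = (shape[:, 0] * target[:, 1]
         - shape[:, 1] * target[:, 0]).sum() / denom      # s*sin(theta)
    rot = np.array([[a, -b], [b, a]])
    return shape @ rot.T

def align_training_set(shapes, n_iter=10):
    shapes = [s - s.mean(axis=0) for s in shapes]         # step 1: centre
    mean = shapes[0] / np.linalg.norm(shapes[0])          # step 2: |mean| = 1
    ref = mean.copy()                                     # step 3: reference frame
    for _ in range(n_iter):                               # step 7: iterate
        shapes = [align(s, mean) for s in shapes]         # step 4
        mean = np.mean(shapes, axis=0)                    # step 5
        mean = align(mean, ref)                           # step 6: constrain
        mean /= np.linalg.norm(mean)
    return shapes, mean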

3.2 The Shape Model

After aligning all the shapes and before building the shape model, we have vectors of 2n dimensions. With them we can generate new examples that are similar to those in the original training set, and we can examine new shapes to decide whether they are valid examples. In order to reduce the dimensionality of the data from 2n, we use Principal Component Analysis (PCA).

3.2.1 PCA in the Shape Model

In our case each shape has n = 77 landmarks in two dimensions:

    x_i = (x_{i1}, y_{i1}, x_{i2}, y_{i2}, \ldots, x_{in}, y_{in})^T        (Equation 2)

We estimate the mean shape as the average of the n_shapes aligned training shapes:

    \bar{x} = \frac{1}{n_{shapes}} \sum_{i=1}^{n_{shapes}} x_i        (Equation 3)

Figure 4: The mean shape.

The covariance matrix S of the training shape points is computed:

    S = \frac{1}{n_{shapes} - 1} \sum_{i=1}^{n_{shapes}} (x_i - \bar{x})(x_i - \bar{x})^T        (Equation 4)

Then the eigenvectors \phi_i and corresponding eigenvalues \lambda_i of S are found. If \Phi contains the t eigenvectors corresponding to the largest eigenvalues, we can approximate any shape of the training set, x, by using

    x \approx \bar{x} + \Phi b        (Equation 5)

where b is a vector given by

    b = \Phi^T (x - \bar{x})        (Equation 6)

and t is the number of eigenvectors considered. By applying the limits -3\sqrt{\lambda_i} \le b_i \le 3\sqrt{\lambda_i} to the parameters b_i, we guarantee that the shape generated is similar to those in the original training set.

Figure 5 presents the effect of varying the first three shape parameters between -3\sqrt{\lambda_i} and 3\sqrt{\lambda_i} from the mean value, leaving all other parameters at zero.

Figure 5: Varying the first three shape parameters between -3\sqrt{\lambda_i} and 3\sqrt{\lambda_i}.
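A minimal sketch of building this PCA shape model and generating a plausible new shape, under the same assumptions as the alignment sketch above (aligned shapes flattened to vectors (x_1, y_1, ..., x_n, y_n); function names are illustrative):

import numpy as np

def build_shape_model(aligned_shapes, t=20):
    X = np.stack([s.ravel() for s in aligned_shapes])   # (n_shapes, 2n)
    x_mean = X.mean(axis=0)                             # Equation 3
    S = np.cov(X, rowvar=False)                         # Equation 4
    eigvals, eigvecs = np.linalg.eigh(S)                # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:t]               # keep the t largest
    return x_mean, eigvals[order], eigvecs[:, order]    # x_mean, lambda, Phi

def generate_shape(x_mean, eigvals, Phi, b):
    lim = 3 * np.sqrt(eigvals)                          # |b_i| <= 3*sqrt(lambda_i)
    b = np.clip(b, -lim, lim)
    return x_mean + Phi @ b                             # Equation 5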

3.3 Classic Active Shape Model

The ASM fits the model points to a new image using an iterative technique. A search is made around the current position of each point to find a nearby point which best matches the model. The parameters of the shape model controlling the point positions are then updated to move the model points closer to the points found in the image.

Before calculating b for the new image, we need to find T. T is a similarity transform which maps the model space into the image space:

    T(x, y) = \begin{pmatrix} X_t \\ Y_t \end{pmatrix} + \begin{pmatrix} s\cos\theta & -s\sin\theta \\ s\sin\theta & s\cos\theta \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix}        (Equation 7)

The transform is needed because the face shape x could be anywhere in the image space, so the image has to be mapped into the model's reference frame in order to work from the same origin. Cootes and Taylor [1], section 4.8, describe an iterative algorithm for finding b and T in

    x \approx T(\bar{x} + \Phi b)        (Equation 8)

Instead of starting from scratch we start from an approximation. Once the parameters b have been chosen, the shape of the object is defined in object-centred coordinates, and we can create an instance X of the model in the image frame. Nevertheless, model points do not always settle on the strongest edge in the locality; they might represent a weaker secondary edge or another image structure.

The best approach is to use the training set to learn what to look for in the target image. For this we build statistical models of the image structure around each point and, during the search, simply find the points which best match the model.

During the training stage we build a model for each landmark: we sample along a profile k pixels on either side of the model point in the i-th training image. We have 2k+1 samples, which can be put in a vector g_i. We normalise the sample:

    g_i \rightarrow \frac{g_i}{\sum_j |g_{ij}|}        (Equation 9)

We create a mean profile \bar{g} and a covariance matrix S_g:

    \bar{g} = \frac{1}{n_{shapes}} \sum_{i=1}^{n_{shapes}} g_i        (Equation 10)

    S_g = \frac{1}{n_{shapes} - 1} \sum_{i=1}^{n_{shapes}} (g_i - \bar{g})(g_i - \bar{g})^T        (Equation 11)

The assumption is that the profiles are approximately distributed as a multivariate Gaussian, and thus can be described by their mean and covariance matrix. The distance between a search profile g and the model profile \bar{g} is calculated using the Mahalanobis distance:

    \mathrm{Distance} = (g - \bar{g})^T S_g^{-1} (g - \bar{g})        (Equation 12)

Minimising this distance is equivalent to maximising the probability that g comes from the distribution. This process is repeated for each landmark; Cootes and Taylor [1], section 4.8, describe an iterative algorithm.
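The profile model and the Mahalanobis matching can be sketched like this (illustrative only; each training profile is assumed to be a length-(2k+1) vector sampled along the whisker, and a pseudo-inverse guards against a singular covariance):

import numpy as np

def normalise(g):
    return g / np.abs(g).sum()                       # Equation 9

def train_profile_model(profiles):
    G = np.stack([normalise(g) for g in profiles])
    g_mean = G.mean(axis=0)                          # Equation 10
    S_g = np.cov(G, rowvar=False)                    # Equation 11
    return g_mean, np.linalg.pinv(S_g)

def profile_distance(g, g_mean, S_inv):
    d = normalise(g) - g_mean
    return d @ S_inv @ d                             # Equation 12

# During the search, the candidate position whose profile has the smallest
# distance is chosen:
# best = min(candidates, key=lambda g: profile_distance(g, g_mean, S_inv))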

Figure 6: Applying an Active Shape Model to a face and iterating until convergence.

3.4 Multi-resolution

In order to improve the efficiency and robustness of the algorithm, a multi-resolution framework is implemented. This means first searching in a coarse image and then refining the result on images of finer resolution.

Figure 7: Three resolutions of the image.

The search starts at the coarsest resolution; when the fit at the current resolution has converged, the search moves on to the next, finer resolution, and it stops after converging at the finest level. Techniques for testing convergence are described by Cootes and Taylor [1] in section 7.3.
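The coarse-to-fine loop can be sketched as below (an illustrative skeleton: search_at_level stands in for a full single-level ASM search to convergence, shapes are NumPy arrays, and the pyramid is built with OpenCV's pyrDown, which halves the resolution at each level):

import cv2

def multi_resolution_search(image, start_shape, search_at_level, n_levels=3):
    pyramid = [image]
    for _ in range(n_levels - 1):
        pyramid.append(cv2.pyrDown(pyramid[-1]))     # coarser and coarser
    shape = start_shape / 2 ** (n_levels - 1)        # map shape to coarsest level
    for img in reversed(pyramid):                    # coarsest level first
        shape = search_at_level(img, shape)          # refine at this resolution
        shape = shape * 2                            # map to the next finer level
    return shape / 2                                 # undo the final doubling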

3.5 The New Active Shape Model

The New Active Shape Model uses the classic ASM machinery to locate landmarks. However, it uses a simplified form of SIFT descriptors for template matching, replacing the 2D gradient descriptor profiles. This simplified form of SIFT is called HAT. It also incorporates MARS to measure descriptor matches, replacing the Mahalanobis distance.

Figure 8: Block diagram of the New Active Shape Model (HAT, MARS, aligning the face, shape model, image resolution).

In the next section the new descriptor, HAT, is described, along with the MARS algorithm used to measure descriptor matches.

3.5.1 HAT & SIFT

Scale-Invariant Feature Transform (SIFT)

Figure 9: Block diagram of SIFT (scale-space extrema detection, keypoint localization, orientation assignment, keypoint descriptor, correspondence of points of interest).

SIFT is an algorithm to detect and describe local features of images. Its main applications are object detection, recognizing people, video tracking and 3D modelling. The algorithm locates points within images based on the amount of information surrounding each point. This local information can be edges, textures or stable transformation points. The original SIFT is made up of the following processes:

1. Scale-space extrema detection. The Difference of Gaussians (DoG) is searched over different sizes and regions for local extrema over space and scale. This produces a strong response at corners whose size fits the scale.

2. Keypoint localization. The points of interest are located, refined to sub-pixel accuracy, and a threshold is used to discard those which are not relevant.

3. Orientation assignment. An orientation is assigned to each point of interest to ensure invariance with respect to image rotation. To do this, the neighbouring points of each interest point are taken and the magnitude and direction of their gradients are calculated. A histogram of these directions is then built, weighted by the gradient magnitudes. The largest peak in the histogram indicates the orientation of the interest point. If there are other peaks above 80% of the highest one, they are used to create additional points of interest at the same position and scale but with different orientations.

4. Keypoint descriptor. For each point a neighbourhood of 16x16 points is taken. It is divided into 4x4 sub-blocks and for each sub-block an orientation histogram is created.

5. Correspondence between points of interest. The correspondence between the points of interest of two images is obtained through a search for the nearest points in the space of descriptors.

Histogram Array Transform (HAT)

HAT is the descriptor used in the new version of ASM (New ASM). It is a simplified form of SIFT. This version takes advantage of work already done in the ASM pipeline to simplify the descriptor's task; it does not repeat processes, and is therefore a smaller version of SIFT.

SIFT first makes a scale-space analysis of the image to discover which points in the image are keypoints, and the intrinsic scale of a descriptor is determined in an additional pre-processing step. In contrast, in ASM the keypoints are predetermined: they are the facial landmarks, and the face is scaled to a constant size before the ASM search begins, so this analysis is unnecessary. Likewise, SIFT analyses the local structure of the image around the point of interest to determine the gradient orientation; in ASM, before beginning the search, we rotate the entire image so the eyes are horizontal, so SIFT's automatic orientation assignment is also unnecessary.

When SIFT localizes each pixel in the patch, the pixel must be mapped to a position in the array of histograms. To do this mapping, SIFT scales and rotates patches. HAT does not use that scaling or rotation; its mapping depends only on the histogram array dimensions and the image patch width.

This step generates a descriptor for the local image region that is highly distinctive and as invariant as possible to the remaining variations, such as changes of illumination. The descriptor is created by computing the gradient magnitudes in a rectangular patch around the image point of interest.

Figure 10: An area in the patch map used to extract a gradient.

We use a 15x15 patch and an array of histograms of dimension 4x5; these dimensions were determined during training. We use 8 bins per histogram, that is, 360º/8 = 45 degrees per bin. The array of histograms is stored internally as a vector of 4x5x8 = 160 elements.

Figure 11: Converting the gradient with 8 bins to a histogram.

The gradient magnitude at a pixel is added to the histogram bin designated for its orientation, down-weighted for smoothness by the Gaussian distance of the pixel from the centre of the patch. Since a small change in the orientation at a pixel could otherwise cause an abrupt jump in assignment from one histogram bin to another, each gradient's weight is shared between neighbouring bins; for example, a gradient with a 45º orientation is shared equally between the bin for 0-45º and the bin for 45-90º.
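An illustrative HAT-style descriptor is sketched below (a simplification, not Milborrow's implementation: the 15x15 patch, 4x5 histogram grid and 8 bins follow the text, but the Gaussian down-weighting and the sharing of a gradient between neighbouring bins are omitted here):

import numpy as np

def hat_descriptor(patch, grid=(4, 5), n_bins=8):
    gy, gx = np.gradient(patch.astype(float))            # image gradients
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), 2 * np.pi)          # orientation in [0, 2*pi)
    h, w = patch.shape
    hist = np.zeros(grid + (n_bins,))
    for r in range(h):
        for c in range(w):
            cell = (min(r * grid[0] // h, grid[0] - 1),  # histogram cell
                    min(c * grid[1] // w, grid[1] - 1))
            b = int(ang[r, c] / (2 * np.pi) * n_bins) % n_bins
            hist[cell][b] += mag[r, c]                   # weight by magnitude
    return hist.ravel()                                  # 4 x 5 x 8 = 160 values

# hat_descriptor(np.random.rand(15, 15)) returns a length-160 vector.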

3.6 Multivariate Adaptive Regression Splines (MARS)

MARS is a type of regression analysis. It is a non-parametric regression technique that can be seen as an extension of linear models: it automatically builds non-linear models with interactions between the variables. Here MARS is used to measure how well a descriptor matches the facial feature of interest. A MARS model is a weighted sum of hinge functions of the form max(0, x - t) and max(0, t - x):

    match = c_0 + \sum_k c_k \max(0, \pm(b_{i_k} - t_k))        (Equation 13)

where the b_i are histogram bins of the descriptor and the coefficients c_k and knots t_k are generated by the MARS model-building algorithm from training data. The formula of Equation 13 estimates the descriptor match at the bottom-left eyelid at full scale; its fitted terms involve bins such as b_10 and b_13 with coefficients and knots such as 1.514, 0.111, 2.092, 1.255 and 1.574. There is a similar formula for each landmark at each pyramid level. The bins enter the formula via the max functions rather than directly, as they would in a linear model, because the innate structure of the image feature makes some bins more important than others.
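To make the structure concrete, the sketch below evaluates a MARS formula of this kind on a descriptor vector. The terms and coefficients here are hypothetical placeholders, not the fitted values from the thesis; only the hinge-function form is what MARS actually produces:

# Each term is (coefficient, bin index, knot, sign): sign=+1 gives
# coef * max(0, b[idx] - knot), sign=-1 gives coef * max(0, knot - b[idx]).
EXAMPLE_TERMS = [(0.111, 10, 2.092, -1),   # hypothetical term
                 (1.574, 13, 1.255, +1)]   # hypothetical term

def mars_match(b, intercept=0.0, terms=EXAMPLE_TERMS):
    match = intercept
    for coef, idx, knot, sign in terms:
        match += coef * max(0.0, sign * (b[idx] - knot))
    return match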

Chapter 4

4. System Description

In this chapter, the system used to carry out face feature extraction is described. The extraction is based on section 3.5 and the following modules.

Figure 12: General block diagram of the software (Input Frame -> Face Detector -> NEW ASM -> Mouth Classification, with the ASM model and the mouth model loaded from disk).

Input Frame: the image or set of images entered into the face detector in order to extract the associated features. These tools are intended for video input.

Face detection: to find the facial features, the faces must first be detected. When the images enter the system, the Viola-Jones technique is used to perform the face detection.

NEW ASM: this module is responsible for aligning the shape and extracting the face features. It looks for the initial shape, adjusts it to the face, and fits the landmarks to the shape of the face. This module is based on C++ code developed by Stephen Milborrow and Fred Nicolls [11], which has been adapted for use in ImagePlus.

Mouth identification: this module is responsible for recognizing whether the person has their mouth open or closed. It uses the features extracted in the previous module and the trained model to determine the state of the mouth.

4.1 Block Diagram

Figure 13: Block diagram of the software in small modules.

4.2 Face Detector

All the faces in an image are extracted by a face detection system. The face detector used in this project is the one found in the ImagePlus library; it is implemented by the OpenCV library [14], which is based on the algorithm developed by Viola-Jones.

Figure 14: Input image before face detection, and the ROI image after face detection.

Viola-Jones uses a method that accumulates the results of many weak classifiers, each of them based on very simple image features. It uses AdaBoost, a machine learning algorithm that combines different weak classifiers in order to create a very powerful one. It also uses an algorithm to construct a cascade of classifiers, which increases detection performance while radically reducing computation time. This implementation can work with frontal and side-profile faces, but our face feature detection only works with frontal faces, so we configure the face detector for frontal faces only.
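For reference, the same detection stage can be reproduced with OpenCV's stock Viola-Jones cascade (a sketch using the standard OpenCV Python API rather than the ImagePlus wrapper; the image path is a placeholder):

import cv2

img = cv2.imread("face.jpg")                       # placeholder input image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in faces:
    roi = img[y:y + h, x:x + w]                    # ROI handed to the ASM stage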

4.3 Database model ASM

Figure 15: Block diagram of the model training used by Stephen Milborrow and Fred Nicolls to create a frontal model (manual annotation -> database shapes -> aligning position, rotation and scale -> reduce dimensionality using PCA -> create shape model).

4.4 The ASM model

The ASM model is based on training data. This data comes from a set of face images that have previously been manually labelled with facial feature interest points. These images are part of the XM2VTS database, which only includes frontal faces; therefore the model can only be used for frontal faces. The model was trained by Stephen Milborrow and Fred Nicolls [11]. All the models are built using all 77 landmarks; Appendix A describes all the landmarks and their positions. The annotation and model creation process is explained in chapter 3; in section 3.2.1, the algorithm used to build the model is described.

4.5 Init Models and Detector

This module belongs to the facial_features_detector class. In this module the detectors and the configuration files are loaded and initialized. In this case we only have one version of ASM; this model is named yaw00. The model class is configured and initialized with the following attributes:

- Estart: the detectors that will be used to align and adapt the shape to the face, described in section 4.6.
- The shape model, initialized as mentioned before.
- The parameters bmax and neigs, whose values come from empirical testing.

4.5.1 Parameters bmax and neigs

During the landmark search, a shape is generated by profile matching and the suggested shape is conformed to the shape model. For the best results, appropriate neigs and bmax parameters for the shape model must be chosen. neigs is the number of eigenvectors used by the shape model; bmax is a parameter controlling the maximum allowable value of the elements of b in Equation 8. For the face shape there are 154 eigenvectors for the 77 landmarks. The values chosen by Stephen Milborrow and Fred Nicolls after empirical testing are neigs = 20 and bmax = 1.5.
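Conforming a suggested shape to the model with these two parameters amounts to projecting onto the first neigs eigenvectors and clipping each coefficient (a minimal sketch reusing the names of Equations 5 and 6, not the Stasm source):

import numpy as np

def conform_to_model(x, x_mean, eigvals, Phi, neigs=20, bmax=1.5):
    b = Phi[:, :neigs].T @ (x - x_mean)          # Equation 6, truncated to neigs
    lim = bmax * np.sqrt(eigvals[:neigs])        # |b_i| <= bmax * sqrt(lambda_i)
    b = np.clip(b, -lim, lim)
    return x_mean + Phi[:, :neigs] @ b           # Equation 5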

4.6 Align the Shape

The goal of this module is to align the start shape with the ROI, given the face rectangle. Depending on the estart field in the model, we detect the eyes and mouth and use them to help fit the start shape.

4.6.1 Align with the Rectangle

With the model estart = estart_rect_only, the start shape is created by aligning the model's mean face shape to the rectangle (the face rectangle found by the face detector).

Figure 16: Aligning the ROI with the rectangle model.

This aligns the mean shape to the face detector rectangle and returns it as the start shape, ignoring the eyes and mouth.

4.6.2 Align with the Eyes

With the model estart = estart_eyes the start shape is created as follows. Using the face rectangle found by the face detector, Stasm [11] searches for the eyes (using the Viola-Jones algorithm for eye detection, OpenCV [14]) in the appropriate subregions within the rectangle.

Figure 17: Aligning with the eye detector.

If both eyes are found, the face is rotated so the eyes are horizontal. The start shape is then formed by aligning the mean training shape to the eyes. If either eye is not found, the start shape is aligned to the face detector rectangle.

4.6.3 Align with the Eyes and Mouth

With the model estart = estart_eye_and_mouth the start shape is generated as above, but it also searches for the mouth (using the Viola-Jones algorithms for eye and mouth detection, OpenCV) and uses it if it is detected.

Figure 18: Aligning with the eye and mouth detectors.

The central idea is to form a triangle from the eyes and bottom-of-mouth found with the face detector parameters, and to align the same triangle in the mean shape to this triangle.

4.7 Select ASM Model

The goal of this module is to provide a shape aligned with the face, to adjust the size of the shape and to choose the ASM model. Here the size of the face is estimated by triangulating the eye positions with the mouth position; the locations of the eyes and mouth were found in the previous module.

Figure 19: Selecting and adjusting the size of the start shape by triangulating the eye positions with the mouth position.

4.8 ROI Shape to frame

This is the final module of the ASM. In this module the optimal value of the parameter b is searched for: we iterate the shape model until convergence (see section 3.3). Then the HAT descriptors are calculated on the image resized by the previous module. Finally, we fit the shape using the MARS algorithm. These techniques are described in section 3.5.

Figure 20: Fitting the shape using HAT and MARS.

4.9 Mouth Identification

This module uses the features detected and extracted in the previous modules, namely those of the mouth. The goal is to determine whether the person has the mouth open or closed. This section has three important blocks:

- The characteristics necessary for detection.
- The classification algorithm.
- The evaluation system.

These techniques are described in chapter 6.

Chapter 5

5. Database

In this chapter we describe all the databases that are used: databases for training, validating and testing the facial features detector models, and others for detecting whether mouths are open or closed.

5.1 Facial Dataset

For the facial features detector two manually landmarked datasets are used. For the creation of the ASM model (created by Stephen Milborrow) the following database was used. We include this information because, although we do not work directly with this database, we work with the model, and it is important to know the characteristics that may influence the behaviour of the detector.

1. The University of Surrey XM2VTS database [15], frontal sets I and II. The XM2VTS frontal image sets contain 2360 colour images of 295 subjects. The pose and lighting are uniform, with a flat background. The faces are manually landmarked with 77 points. The XM2VTS data must be purchased and cannot in general be reproduced. Figure A.1 shows a landmarked XM2VTS face.

To test the facial features detector and compare results with other publications, we have used the following manually landmarked dataset.

2. The BioID face database with FGnet markup [12]. The BioID dataset consists of 1521 monochrome images (384 x 286 pixels) of 23 different subjects. The images are frontal views with a variety of backgrounds and face sizes; the background is an office interior. The faces are manually landmarked with 20 points. The dataset is freely downloadable and it seems that the faces may be reprinted without legal issues.

Figure 21: The BioID face database.

5.2 Open and closed mouth dataset

In this case several datasets have been used: two for training the classifier, two for testing it, and a different one for validation. For training the classifier, videos of two video bloggers from YouTube have been used.

1. "Maria Konnikova: How to think like Sherlock Holmes", from the Big Think channel. This video consists of 1280 x 720 colour frames of one subject. The pose and lighting are uniform, with a white background. The subject is a woman who speaks in front of the camera. Many frames are frontal faces with open and closed mouths.

Figure 22: Frames from "Maria Konnikova: How to think like Sherlock Holmes", Big Think channel.

2. "The Importance of Knowing: An Introduction to Epinets", from the Big Think channel. This video consists of 1280 x 720 colour frames of one subject. The pose and lighting are uniform, with a white background. The subject is a man who speaks in front of the camera, sometimes showing his better profile. Many frames are frontal faces with open and closed mouths; he does not open his mouth much.

Figure 23: Frames from "The Importance of Knowing", Big Think channel.

For validating the training set and regulating the sensitivity of the classifier, one dataset has been used.

3. The database of the Artificial Intelligence Laboratory of FEI. FEI contains 640 x 480 colour images of 200 subjects between 19 and 40 years old. The pose and lighting are uniform, with a white homogeneous background. The subjects are men and women in frontal positions with open and closed mouths. The number of male and female subjects is exactly the same and equal to 100.

Finally, two video bloggers from YouTube and another database have been used for the test set.

4. "How to Revise like Sherlock (Mind Palace)", from the Maddie Moate channel. This video consists of 1280 x 720 colour frames of one subject. The pose and lighting are not uniform and there is a shelf in the background. The subject is a woman who speaks in front of the camera.

Figure 24: Frames from "How to Revise like Sherlock (Mind Palace)", Maddie Moate channel.

5. "Sam Harris BigThink (ALL)", from the Big Think channel. This video consists of 640 x 480 colour frames of one subject. The pose and lighting are uniform, with a white homogeneous background. The subject is a man who speaks in front of the camera.

Figure 25: Frames from "Sam Harris BigThink (ALL)", Big Think channel.

6. Telenotícies TV3 (TN). TN contains 720 x 576 colour frames of 3 subjects: two women and a man, all TV presenters. The poses of the subjects are frontal and the lighting is uniform. The background is not homogeneous.

Figure 26: Telenotícies TV3 (TN).

The strategy is to use many frames of these videos. For training we use 162 frames with open mouths and 162 frames with closed mouths from the Big Think channel (databases 1 and 2). For validation we have used 200 images from the FEI dataset: 100 frames with closed mouths and 100 with open mouths. Finally, for testing we used, for open mouths, 23 frames from the TN database and 82 frames from the Big Think channel (databases 1 and 2); for closed mouths, 37 frames from the TN database and 93 frames from the Big Think and Maddie Moate channels (databases 4 and 5).

Chapter 6

6. Open or Closed Mouth State Detection

The information obtained from the Active Shape Model can be used for the recognition of expressions. In this chapter, we focus on how to use this information to create an open or closed mouth state detector.

6.1 Active Shape Model in Mouth

In the New Active Shape Model we have many landmarks on the mouth, 17 in particular. These landmarks are situated all over the mouth (upper lip, lower lip, inner lip...). With these landmarks, we can know the position of the lips and extract the distance between them. For this we focus on three groups:

- Outer lip (landmarks 62 and 74)
- Inner lip (landmarks 67 and 70)
- Mouth corners (landmarks 59 and 65)

Figure 27: Landmarks located in the mouth.

6.2 Vector of features

Once we have the positions of the landmarks, we need to calculate the distance between them. This distance will be used to determine whether the subject has the mouth open or closed. The initial strategy was to use the outer upper lip and outer lower lip; the problem with this approach is that lips vary widely, and if the subject has big lips we could get many false positives. Instead, we take a two-dimensional vector v_70 with the position of the inner upper lip and another vector v_67 with the position of the inner lower lip.

    v_70 = (x_70, y_70)
    v_67 = (x_67, y_67)

    d_{lip} = \sqrt{(x_{70} - x_{67})^2 + (y_{70} - y_{67})^2}

This is the height of the mouth opening. However, it is not enough on its own to determine whether the subject has an open or closed mouth, because it depends on the size of the mouth and of the face. To solve this, we use the width of the mouth. It is not a static distance, because the width of the mouth varies even for the same subject, and it has some proportionality with the height of the mouth. The mouth-corner landmarks are used to calculate this distance:

    v_65 = (x_65, y_65)
    v_59 = (x_59, y_59)

    d_{width\,mouth} = \sqrt{(x_{65} - x_{59})^2 + (y_{65} - y_{59})^2}

    D = \frac{d_{lip}}{d_{width\,mouth}}

This parameter D is used to classify whether the subject has the mouth open or closed.

Figure 28: The distance between the two inner-lip landmarks and the mouth-width distance.
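Computing D from the 77-point shape returned by the New ASM is then a few lines (a sketch; the landmark indices follow Figure 27 and the shape is assumed to be a (77, 2) array of (x, y) coordinates):

import numpy as np

def mouth_openness(shape):
    d_lip = np.linalg.norm(shape[70] - shape[67])           # inner-lip height
    d_width_mouth = np.linalg.norm(shape[65] - shape[59])   # corner-to-corner width
    return d_lip / d_width_mouth                            # the feature D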

6.3 Classifier

The classifier is a binary classifier: it uses a threshold to decide the state of the mouth. The training data set has been used to decide the value of the threshold.

6.3.1 Training

In this section the training data set is used to decide the value of the threshold. First, the frames are manually labelled as open or closed mouth. Then each frame goes through the face detector, its features are extracted using the New ASM, and the parameter D is calculated for every frame. Once all parameters have been calculated, the histogram of the D values of the closed-mouth frames is built; the same process is repeated for the open-mouth set. The histograms are modelled using Gaussian distributions.

Figure 29: Histograms with Gaussian fits. The left histogram is for the closed mouth and the right histogram is for the open mouth.

Once the two histograms have been calculated, they are placed with their Gaussian fits in the same plane. The intersection point between the two Gaussians is taken as the threshold; theoretically it is the point that accumulates the fewest mistakes.

Figure 30: The two histograms with their Gaussian fits in the same plane, used to find the intersection between the Gaussians and set the threshold.

The threshold value T was chosen using the training and validation sets. The validation process is described in chapter 7.

6.3.2 Classification

The parameter T calculated previously is compared with the parameter D; the comparison determines the mouth state of the subject:

    D \le T: the subject's mouth is closed
    D > T: the subject's mouth is open
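The training and classification described above can be sketched as follows (illustrative only: Gaussians are fitted to the D values of each class and the threshold is taken where the two densities intersect, assuming the closed-mouth mean is below the open-mouth mean):

import numpy as np

def fit_threshold(d_closed, d_open):
    m0, s0 = np.mean(d_closed), np.std(d_closed)
    m1, s1 = np.mean(d_open), np.std(d_open)
    # The two Gaussian densities are equal where a quadratic in T vanishes.
    a = 1 / s0**2 - 1 / s1**2
    b = 2 * (m1 / s1**2 - m0 / s0**2)
    c = m0**2 / s0**2 - m1**2 / s1**2 - 2 * np.log(s1 / s0)
    roots = np.roots([a, b, c])
    return roots[(roots > m0) & (roots < m1)][0]   # the root between the means

def classify_mouth(D, T):
    return "open" if D > T else "closed"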

Chapter 7

7. Experimental Results

In this chapter all the experimental results are presented and compared with others, and the measurement methods used to compare them are explained. For the New ASM the following points are analysed:

- Search for the best way to align the start shape with the ROI and look for a start shape model.
- Look for the worst and the best principal landmarks and compare the landmarks across the methods analysed in the previous step.
- Look at all the cases where the New ASM has failed.
- Finally, compare the classic ASM with the New ASM.

On the other hand, for the mouth state detection the following points are analysed:

- The method used to measure the classifier.
- Analysis of the histograms and the ROC curve.
- The validation process.
- The final test and results with the test dataset.

7.1 NEW ASM

7.1.1 The me17 measure

The me17 is a normalized measure of the error in a facial landmark fit. It is computed over the 17 landmarks common to all the shape models, which are the most important landmarks: the internal BioID landmarks (Appendix A). Following Cristinacce [13], the me17 is calculated as follows:

1. Calculate the distance between each of the 17 points located by the search and the corresponding manually landmarked point:

    v_{manual} = (x_i, y_i)
    v_{search} = (x_s, y_s)
    d_i = \sqrt{(x_i - x_s)^2 + (y_i - y_s)^2}

2. Calculate the distance between the eye pupils from the manually landmarked points.

    v_{eyeL} = (x_l, y_l)
    v_{eyeR} = (x_r, y_r)
    d_{eyes} = \sqrt{(x_r - x_l)^2 + (y_r - y_l)^2}

3. Divide the distance for each of the 17 points (step 1) by the distance between the eye pupils (step 2):

    D_i = \frac{d_i}{d_{eyes}}

The me17 of an image m is the mean over the 17 landmarks:

    I_m = \frac{1}{17} \sum_{i=1}^{17} D_i

If there is more than one image, we take the mean me17 over all images. With this measure we can compare this model with the other ones and draw conclusions; a code sketch of the computation is given below, after the list of start-shape methods.

7.1.2 Compare the different start shape models

In section 4.6 we described different ways to align the start shape with the detected face and look for a start shape model. The authors propose three methods:

- ASM0: align with the rectangle and find the initial shape by looking for the left eye, then find the right eye by symmetry.
- ASM1: align with the rectangle, searching for the eyes in the appropriate subregions within the rectangle, and find the initial shape by looking for the left and right eyes.
- ASM2: align with the rectangle, searching for the eyes and mouth in the appropriate subregions within the rectangle, and find the initial shape by triangulating the eyes and bottom-of-mouth from the face detector parameters.
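Returning to the me17 measure of section 7.1.1, it can be sketched as follows (assuming the 17 searched and manual landmarks are given as (17, 2) arrays and the pupils as manual 2-vectors; names are illustrative):

import numpy as np

def me17(found, manual, pupil_left, pupil_right):
    d = np.linalg.norm(found - manual, axis=1)         # per-landmark error d_i
    d_eyes = np.linalg.norm(pupil_right - pupil_left)  # inter-pupil normaliser
    return d.mean() / d_eyes                           # I_m for this image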

7.1.3 Graphs for comparing models

Image error with me17

In this step the me17 measure is calculated for every image of the BioID database, and ASM0, ASM1 and ASM2 are compared. BioID has 1521 images, but we only work with 1233 of them because the face detector did not detect the faces in the others. In the following graphs the X axis shows the identifier of the image and the Y axis the me17 error for that image. For ASM0 we have the following results:

Graph 1: All the images from the BioID data set with the me17 error, using ASM and aligning the start shape with the input rectangle.

In this case we generally have good results. However, we can differentiate three groups in this graph:

1. Between 0 and 0.2 we have very good results, with little error.
2. Between 0.2 and 0.4 we have good results but with some error. The majority of the images fall in this group.
3. Between 0.4 and 1 the outcome can be considered poor, with plenty of errors.

The images with large errors are displayed in section 7.1.4.

For ASM1 we have the following results:

Graph 2: All the images from the BioID data set with the me17 error, using ASM and aligning the start shape with the eye detector.

In this case we find better results. However, there are several cases where ASM1 has failed and shows a large error, and a few groups of images with some error. The failed cases are studied in the next step. ASM1 is better than ASM0.

For ASM2 we have the following results:

Graph 3: All the images from the BioID data set with the me17 error, using ASM and aligning the start shape with the eye and mouth detectors.

In this case we have generally good results. Nevertheless, there are several cases where ASM2 has failed and shows an error, and ASM2 is worse than ASM1 at several points.

Graph 4: All the images from the BioID data set with the me17 error. Blue is ASM0, green is ASM2 and red is ASM1.

If we compare the three ASMs we see that, although all give good results, ASM1 has a better fit and less error, whereas ASM0 is the worst of the three.

Table 1: Mean me17, ASM0 vs ASM1 vs ASM2 (ASM1: 0.0733).

Landmark error with me17

In this step the me17 measure is calculated for every landmark over the BioID database, comparing ASM0, ASM1 and ASM2. The work was done with the 1233 images, calculating the maximum, minimum and mean me17 for each landmark. Additionally, groups of landmarks have been separated, for example the landmarks of the right eye, the left eye, the mouth...

The following results are for ASM0:

Graph 5: Landmark distance error, aligning the start shape with the input rectangle. The red points are the maximum error, the green points the minimum error and the blue points the mean error of each landmark.

In this graph it is easy to see that there is little error in the eye landmarks; in contrast, the mouth landmarks have more error. The landmarks with the maximum error are those located on the edges of the face at eye level.

The following results are for ASM1:

Graph 6: Landmark distance error, aligning the start shape with the eye detector. The red points are the maximum error, the green points the minimum error and the blue points the mean error of each landmark.

If we compare this graph with the previous one, we can see that it is better: the mean error is smaller. The best landmarks are again those located on the eyes, and the mouth landmarks also improve. Despite this, the maximum error is now at one landmark on the mouth, and the worst mean is still at the landmarks located on the edge of the face.

The following results are for ASM2:

Graph 7: Landmark distance error, aligning the start shape with the eye and mouth detectors. The red points are the maximum error, the green points the minimum error and the blue points the mean error of each landmark.

The results of this graph are worse than those of ASM1, but ASM2 is better than ASM0. We see the same pattern as for ASM1, but with a higher error.

Table 2: Mean and maximum landmark me17, ASM0 vs ASM1 vs ASM2.

Classic ASM vs New ASM

Graph 8: All the images from the BioID data set with the me17 error, using the classic ASM from UPC.

This classic ASM is described in section 3.2. Its graph is the worst: we can observe many errors, even though there are some images without error.

    Model          Mean me17
    Classic ASM    0.3790
    ASM 1          0.0733

Table 3: Mean me17, ASM1 vs classic ASM.

Graph 9: Landmark distance error using the classic ASM. The red points are the maximum error, the green points the minimum error and the blue points the mean error of each landmark.

This graph follows the same pattern as the other ones. Paradoxically, the landmark located on the chin is better in this model than in the other models.

Table 4: Mean and maximum landmark me17, ASM1 vs classic ASM.

Empirical cumulative distribution function

Graph 10: The empirical cumulative distribution function. The red line represents ASM1, the blue line ASM0, the grey line ASM2 and the green line the classic ASM.

The empirical cumulative distribution function is a standard way to compare several models: the faster the curve rises and the sooner it reaches the top, the better the model. By this criterion the best model is ASM1.

Given a vector of me17 measurements vme17, MATLAB code to plot an ECDF is:

x = sort(vme17);
y = (1:length(vme17)) / length(vme17);
plot(x, y);

7.1.4 Visual evaluation with empirical test

Figure 31: Images with many landmark errors found in the graph results. Ordered from left to right and top to bottom, these images correspond to numbers 259, 282, 392, 901, 423, 640, 740 and 803 of Graph 2.

In these graphs we have seen some errors: specific errors that appear in only one image, and errors that appear in several images. Looking at the images, the worst is 259, where the ASM does not find the face features at all. A typical mistake, appearing for some subjects, is that the ASM confuses the nose with the mouth. Finally, in some images the ASM does not fit the landmarks located on the edge of the face.

7.2 Mouth State Detector

7.2.1 Confusion Matrix

The confusion matrix is a specific layout that allows visualization of the performance of an algorithm and evaluation of the classifier. Each column of the matrix represents the instances of an actual class and each row represents the instances of a predicted class.

                            Positive condition      Negative condition
    Test outcome positive   True Positive (TP)      False Positive (FP)
    Test outcome negative   False Negative (FN)     True Negative (TN)

Table 5: Confusion matrix.

- Condition positive: everything the ground truth identifies as having the positive condition (PC).
- Condition negative: everything the ground truth identifies as having the negative condition (NC).
- True positive: the prediction matches the actual class; in a binary case, the case we are searching for is detected.
- False positive: the test result has the positive condition, but the ground truth indicates that it is not positive.
- True negative: the prediction matches the actual class and it has the negative condition.
- False negative: the test result has the negative condition, but the ground truth indicates that it is not negative.

Once we have the confusion matrix we can calculate the statistical measures. These measures display the performance and allow us to compare the classifier with other classifiers.

Recall

Recall is the sensitivity: it measures the proportion of actual positives which are correctly identified as such.

    Recall = \frac{TP}{PC}

    Recall_{total} = \frac{1}{2} \sum_{i=0}^{1} Recall_i

Precision

Precision measures the proportion of cases identified as positive that really are positive:

    Precision = \frac{TP}{TP + FP}

    Precision_{total} = \frac{1}{2} \sum_{i=0}^{1} Precision_i

F_score

The F_score is a measure of the test's accuracy that combines precision and recall; it can be interpreted as a weighted average of the two:

    F_{score} = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}

    F_{score,total} = \frac{1}{2} \sum_{i=0}^{1} F_{score,i}

Accuracy

Accuracy is the proportion of correct results, both true positives and true negatives, in the population:

    Accuracy = \frac{TP + TN}{TotalPopulation}

7.2.2 The ROC curve

The ROC curve positive likelihood ratio relates the true positives to the false positives; the ROC curve negative likelihood ratio relates the true negatives to the false negatives. In our case these curves display the probability that the mouth is open or closed. The faster the curve rises and the sooner it reaches the top, the better the classifier. The point (0, 1) would be perfect classification and the point (1, 0) the worst. A classifier is considered good if it lies above the random-guess line: points above the line are better than chance and points below it are worse.

7.2.3 The ROC curve for dataset validation

Graph 11: ROC curves. The left graph is the ROC curve positive likelihood ratio for the open mouth and the right one the ROC curve negative likelihood ratio for the closed mouth. The red line is the random-guess line.

To validate the classifier the ROC curve was analyzed, which shows the quality of the classifier and whether the threshold is good. True positives correspond to open mouths and true negatives to closed mouths. As can be seen, the ROC curve positive likelihood ratio is above the random-guess line: the classifier finds 60% of the open mouths without error. The ROC curve negative likelihood ratio is also above the random-guess line, but it is not as good as the positive one; that is, the recall for the open mouth is likely to be higher than the recall for the closed mouth.

Graph 12: The ROC curves with the validation runs (green points) and the final test (red points).

Plenty of runs were done with the validation database, and the results were better than expected. Consequently, a good threshold was chosen, and the final result is good because it lies above the random-guess line (red point).

7.2.4 Results

                        Actual open mouth    Actual closed mouth
    Predicted open              91                   25
    Predicted closed            13                   59

Table 6: Confusion matrix of the test database.

In this table the confusion matrix can be seen. The classifier produced 91 true positives and 25 false positives (positive in our case is open mouth). On the other hand, it produced 59 true negatives and 13 false negatives (negative in our case is closed mouth).

                    Recall    Precision    F_score
    Open mouth      0.875     0.784        0.827
    Closed mouth    0.702     0.819        0.756

Table 7: Statistical measures (computed from the counts in Table 6).

Here the measures of the classifier can be seen. As anticipated by the ROC curves, the recall for the open mouth is higher than the recall for the closed mouth. The F_score, which combines precision and recall, is better for the open mouth.

    Total: Accuracy 0.798, Precision 0.802, Recall 0.789, F_score 0.792

Table 8: Averages of the statistical measures (computed from the counts in Table 6).

Finally, we can see that the mean of the total measures is around 80%.
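The statistical measures of section 7.2.1 follow directly from the four counts of Table 6 (a sketch; the totals are unweighted averages over the two classes):

def binary_metrics(tp, fp, fn, tn):
    per_class = {}
    # For the "closed" class the roles of the counts are mirrored.
    for cls, (hit, false_alarm, miss) in {"open": (tp, fp, fn),
                                          "closed": (tn, fn, fp)}.items():
        recall = hit / (hit + miss)
        precision = hit / (hit + false_alarm)
        f_score = 2 * precision * recall / (precision + recall)
        per_class[cls] = (recall, precision, f_score)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return per_class, accuracy

# With the counts of Table 6: binary_metrics(91, 25, 13, 59) gives an
# accuracy of about 0.80.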

Chapter 8

8. Conclusions and Possible Future Research

8.1 Conclusions

The objective of this project was to study face features. This information enables interaction with systems using cameras and automated processes. To achieve this, the classic ASM was studied and we tried to improve on it by using a variation of ASM created by Stephen Milborrow and Fred Nicolls. It differs from the classic version in that the new ASM uses the HAT descriptor instead of 2D gradient descriptors, and MARS instead of the Mahalanobis distance.

For the new ASM, the three methods of aligning the rectangle and finding the initial shape were analyzed. The best one is ASM1, which aligns the rectangle by searching for the eyes in the appropriate subregions within it and finds the initial shape from the left and right eyes. In contrast, the worst one is ASM0. This was an unexpected result, since the initial assumption was that ASM2 would be better: it aligns the rectangle by searching for the eyes and the mouth in the appropriate subregions and finds the initial shape by triangulating the eyes and bottom-of-mouth from the face detector parameters. However, ASM2 performs poorly when it does not find the mouth, and the mouth is sometimes not well detected, for example when the subject has a moustache.

On the other hand, the Active Shape Model has been used to recognize facial expressions. We focused on how to use this information to create an open or closed mouth state detector. A future application could be to help with audio recording in video: it is well known that the people who record the audio for a video have problems knowing when the actor begins to speak when there is noise.

The classifier is a basic binary classifier based on a threshold. We obtained good results, as the positive and negative ROC curves show. This was surprising, taking into account that no more complex algorithm, such as SVM or KNN, was designed.

In conclusion, we have demonstrated that ASM is a good method for extracting facial features and that the information obtained from the Active Shape Model is useful for recognizing expressions.

8.2 Possible Future Research

ASM is a technique that gives good results, although some of its aspects still have to be improved. In this project only frontal faces were accepted, because no side-face model exists. A future project would be the creation of this side-face model; it would only be necessary to build one side, because the face is approximately symmetric. On the other hand, the open and closed mouth state detector could be built with other classification algorithms, for example SVM or KNN.

Appendix A

Face landmarks

Figure 32: BioID landmarks

    LEyeBottom         0        LEyeInner            10
    REyeBottom         1        REyeInner            11
    LMouthCorner       2        REyeOuter            12
    RMouthCorner       3        RJaw1                13
    LOuterEyeBrow      4        NoseTip              14
    LInnerEyeBrow      5        LNoseBot             15
    RInnerEyeBrow      6        RNoseBot             16
    ROuterEyeBrow      7        MouthTopOfTopLip     17
    LTemple            8        MouthBotOfBotLip     18
    LEyeOuter          9        TipOfChin            19

Table 9: Landmark descriptors from BioID
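For use in code, Table 9 can be transcribed as a simple name-to-index mapping. The snippet below is purely illustrative and not part of the project's software:

    # Hypothetical lookup table transcribing Table 9 (BioID landmark indices).
    BIOID_LANDMARKS = {
        "LEyeBottom": 0,    "REyeBottom": 1,    "LMouthCorner": 2,
        "RMouthCorner": 3,  "LOuterEyeBrow": 4, "LInnerEyeBrow": 5,
        "RInnerEyeBrow": 6, "ROuterEyeBrow": 7, "LTemple": 8,
        "LEyeOuter": 9,     "LEyeInner": 10,    "REyeInner": 11,
        "REyeOuter": 12,    "RJaw1": 13,        "NoseTip": 14,
        "LNoseBot": 15,     "RNoseBot": 16,     "MouthTopOfTopLip": 17,
        "MouthBotOfBotLip": 18, "TipOfChin": 19,
    }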

Figure 33: Classic ASM with 68 landmarks

Figure 34: New ASM with 77 landmarks
