Integral Channel Features with Random Forest for 3D Facial Landmark Detection


MSc Artificial Intelligence
Track: Computer Vision
Master Thesis

Integral Channel Features with Random Forest for 3D Facial Landmark Detection

by Arif Qodari
February EC

Supervisor/Examiner: Prof. dr. Theo Gevers, Sezer Karaoglu
Assessor: dr. Leo Dorst, dr. Jacco Vink

Informatics Institute
University of Amsterdam

Abstract

Detecting facial landmarks is important for understanding human faces. While 2D image-based approaches have been well studied in the literature, 3D-based detection remains a challenging problem for several reasons, e.g. degraded performance on noisy data, long run times caused by model complexity, and limited robustness to pose variations. In this thesis, we investigate the performance of random forest based models combined with integral channel features for detecting 3D facial landmarks. We study the influence of using heterogeneous information computed from multiple channel features to obtain an accurate 3D landmark detector. A variant of the random forest algorithm that utilizes multiple channel features is proposed to localize 3D facial landmarks. Multiple channel features provide rich and diverse information. These features are efficiently computed using integral images from noisy RGB-Depth images. Finally, we present our experimental results evaluated on the Biwi Kinect dataset, which contains a large range of head pose angles. The results show that adding more channel features, specifically gray and gradient channels, improves the accuracy of our detector compared with using a single depth channel. Moreover, the additional gray and gradient channels also increase the robustness of the detector against head pose variations. We also demonstrate that our approach produces higher mean accuracy than a 2D-based state-of-the-art method.

Acknowledgements

I would like to thank Theo Gevers for giving me the opportunity to work on an interesting topic under his supervision. Many thanks to Sezer Karaoglu for his valuable advice and feedback, which improved the quality of this thesis. He helped me intensively with the project and the writing process. During the project, we had many meetings to discuss both the theoretical and technical aspects. Despite his busy schedule, I could always contact him to discuss the problems I faced. I also want to thank my wife and my parents for supporting me along the way.

Contents

Abstract
Acknowledgements
Contents
List of Figures
List of Tables
1 Introduction
    Motivation
    Goal
    Thesis Outline
2 Random Forest for 3D Facial Landmark Detection
    Related Work
        3D Facial Landmark Detection
        Random Forest
    Training Forest
        Binary Test
        Objective Function
    Testing
        Vote Clustering
3 Integral Channel Features
    Related Work
    Integral Image
    Channel Features
    Evaluation on Computation Time
4 Experiments
    Experiment Setup
        Dataset Labelling
        Parameter Settings
        Evaluation Measure
    Results
        Number of Trees
            Accuracy
            Accuracy vs Efficiency
        Channel Features Comparison
            Frontal face dataset
            Full dataset
            Analysis
        Performance under Head Pose Variations
        Comparison with 2D-based State-of-the-Art
Conclusion
Bibliography

List of Figures

The pipeline of training process
Vote clustering
Integral image
Multiple registered image channels
Landmark annotation result
The influence of number of trees
Accuracy
Accuracy vs efficiency in frontal face test set
Accuracy vs efficiency in full test set
Channel features performance comparison in frontal face dataset
Examples of failure cases in frontal face test set
Channel features performance comparison in full test set
Examples of failure cases in full test set
Performance under head pose variations
Examples of failure cases in large head poses

List of Tables

Average time needed to generate image channel
Summary of the performance of channel features
Evaluation result on 4-fold cross validation
Performance comparison with 2D-based method

Chapter 1 Introduction

1.1 Motivation

Facial landmark detection is the problem of detecting points of interest (e.g. eyes, mouth corners, and nose tip) on human faces. Facial landmarks are important for many facial analysis tasks such as face recognition [1], facial expression recognition [2], and facial animation [3]. Therefore, detecting facial landmarks is an essential step in understanding faces.

Many approaches have been developed to robustly detect facial landmarks. These methods can be categorized into 2D image-based and 3D-based approaches. 2D image-based approaches operate in 2D and hence return the detected landmarks in 2D coordinates, while 3D-based approaches detect 3D facial landmarks from depth images.

Facial landmark detection from 2D images has been well studied in the literature. Recently, a number of real-time approaches have achieved high detection accuracy on face images collected in the wild [4][5][6][7][8]. However, the performance of 2D image-based methods usually deteriorates under varying illumination conditions (e.g. highlights, shadows and dim light). In addition, 2D image-based methods often require an initial face region obtained by a face detection algorithm. As a consequence, the performance of these methods is limited by the accuracy of the face detector.

Prior work on 3D-based approaches shows that they are more robust than 2D image-based approaches against lighting conditions and head pose variations [9][10]. For instance, the method proposed by Baltrušaitis et al. [9] integrates depth and intensity images to alleviate the problems caused by poor lighting conditions. Since depth information is independent of light, the appearance of objects in the depth channel is not affected by lighting conditions. The authors reported accurate results for detection and tracking of 3D facial landmarks.

Another relevant example is the method by Papazov et al. [10]. The method exploits depth features to detect 3D facial landmarks under varying head poses and obtains high detection accuracy in real time.

Detecting 3D facial landmarks remains a challenging problem for several reasons: degraded performance due to noisy input data, long run times due to computational complexity, and limited robustness to pose variations and local deformations. Another challenge is the lack of annotated 3D facial landmark datasets. Some of the prior work used a synthetic head model [9] or high-quality face scans [11][12], leading to a performance gap when tested on noisy depth data acquired by depth sensors.

In our experiments, we used the Biwi Kinect dataset [13], which has more than 15K RGB-Depth frames covering various head poses. This dataset does not provide any landmark annotations. However, head rotation angles and head center locations are provided. In section 4.1.1, we present an algorithm to annotate rigid landmark points through all frames using the provided head rotation angles and center locations.

Random forest has been widely used in many computer vision tasks, including 2D [14][7] and 3D facial analysis [12][14][13]. A random forest consists of multiple decision trees. A single tree maps complex feature spaces into simpler decision spaces. Random forest can handle large data inputs efficiently. Moreover, the randomness in its construction helps to avoid overfitting.

A number of random forest based models have been proposed for 2D facial landmark detection. Dantone et al. [14] proposed a conditional regression forest to detect 2D facial landmarks in 2D images. The proposed regression forest is conditional on global face properties, e.g. head pose. This method employs multiple channel features: raw and normalized gray values and Gabor filter banks [15] computed for varying parameters. The authors demonstrated the benefit of using a random forest based model to map complex high-dimensional features into a multi-class decision model.

A similar approach, namely Ensemble Regression Trees (ERT), was proposed by Kazemi et al. [7]. Unlike the method proposed by Dantone et al., the ensemble trees work in a cascaded architecture and are trained by a gradient boosting algorithm. The method utilizes differences of intensity values at pairs of pixels as features. With these features, the ensemble predicts 2D facial landmarks accurately and efficiently (about 1 ms per image).

In the context of 3D facial landmark detection, Fanelli et al. [12] proposed a random forest to localize 3D facial landmarks under various facial expressions from high-quality face scans. The method utilizes generalized Haar-like features [16] computed from the depth channel. The authors reported real-time and accurate detection results.

Today, advances in device technology allow us to record depth information as well as RGB information at low cost, e.g. with the MS Kinect and Asus Xtion. Inspired by the work of Fanelli et al. [12], we employed a similar approach to detect 3D facial landmarks from RGB-D images. The difference is that we exploited multiple sources of information, i.e. RGB and depth, and analyzed the influence of data diversity on the 3D facial landmark detection performance.

A number of methods have shown that integral channel features [17] are effective for many computer vision tasks, including object recognition [18], pedestrian detection [19], and local region matching [20]. Integral channel features capture rich information from the different and diverse channels in images. In addition, the features can be efficiently computed using integral images. For these reasons, we combined integral channel features with a random forest model to detect 3D facial landmarks.

Our main contribution is the combination of integral channel features with a random forest to detect 3D facial landmarks using RGB-D images. We study the influence of various channel features on the 3D facial landmark detection performance. We also investigate the robustness of our approach under varying head poses.

1.2 Goal

This thesis focuses on investigating the combination of different channel features for 3D facial landmark detection. The research questions are as follows:

1. How to integrate multiple channel features into a random forest based model for 3D facial landmark detection?
2. What are the best performing channels for detecting 3D facial landmarks?
3. How does the landmark detector perform under various head poses?

1.3 Thesis Outline

In Chapter 2, we first summarize prior work related to 3D facial landmark detection and to random forest based approaches for facial analysis tasks. We then explain the details of the random forest algorithm specific to 3D facial landmark detection. Chapter 3 describes the integral channel features and how the features are integrated into a random forest. It discusses three different channel types: depth, gray and the gradient histogram. This chapter also provides a discussion of the computational time. Implementation details (e.g. dataset annotation, parameter settings, and evaluation metric) and experiments are discussed in Chapter 4. Finally, we present our conclusions and possibilities for future research.

Chapter 2 Random Forest for 3D Facial Landmark Detection

Typical random forest algorithms work in a supervised way, i.e. the algorithm constructs trees from a set of training data annotated with the desired output labels. We call this the training process. A tree is constructed to maximize the information gain by mapping complex input spaces into simpler discrete (classification) or continuous (regression) output spaces. The mapping is done in every non-leaf node, while the leaf nodes store the information to be used for prediction.

Once the forest is constructed, a testing process is conducted to evaluate how well the trained forest generalizes to unseen data. A set of testing data is propagated down the trees, where each tree gives a prediction vote. The forest determines the final prediction by either averaging the votes or choosing the majority vote.

This chapter discusses a specific variant of the random forest algorithm for 3D facial landmark detection. Section 2.1 presents a literature review related to 3D facial landmark detection and random forest based solutions for facial analysis. Training and testing are discussed in sections 2.2 and 2.3, respectively.

2.1 Related Work

2.1.1 3D Facial Landmark Detection

A number of methods have been proposed in the literature for detecting 3D facial landmarks from noisy and high-quality input.

Baltrušaitis et al. [9] proposed a 3D Constrained Local Model (CLM-Z), which is an extension of the Constrained Local Model [21], for facial landmark tracking under varying pose. Depth and intensity channels were integrated to reduce missed detections caused by poor lighting conditions. This model has shown robust performance under varying lighting conditions and poses.

Ju et al. [22] combined a 3D shape descriptor with binary neural networks to detect the nose tip and eyes. The descriptor is invariant to illumination variations. The reported accuracy was over 99.6% in the presence of facial expressions.

Zhao et al. [23] introduced the Statistical Facial Model (SFAM), which combines local variations of texture and geometry around each landmark with global variations between landmarks. A robust fitting algorithm was proposed to localize landmarks under facial expressions and occlusions. Although highly accurate results were reported, the proposed algorithm is computationally expensive.

Papazov et al. [10] proposed Triangular Surface Patch (TSP) features extracted from 3D point clouds to jointly estimate the head pose and 3D facial landmarks. The authors demonstrated that these features are efficient to compute, viewpoint-independent, and insensitive to pose changes. The proposed approach achieved high accuracy in real time.

2.1.2 Random Forest

Random forest, as introduced by Breiman [24], is an ensemble learning method that consists of multiple decision trees [25]. Each tree in the forest is constructed from a randomly sampled subset of the training data. Starting from the root node, every non-leaf node generates a number of candidate splits and finds the optimal split of the incoming data. The optimal split $\phi^*$ is defined as the one which maximizes the information gain:

\[ \phi^* = \arg\max_{\phi} IG(\phi), \tag{2.1} \]

\[ IG(\phi) = H(P) - \sum_{i \in \{L, R\}} w_i \, H(P_i(\phi)), \tag{2.2} \]

where $w_i$ is the fraction of the input data propagated to child node $i$ and $H(P)$ is an uncertainty measure of the input set $P$. After the split, the resulting subsets are sent to the left and right child nodes. The procedure is then repeated until all leaves are created.
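The split selection in equations 2.1 and 2.2 can be sketched in a few lines. The following is a minimal illustration, not the thesis implementation: it assumes Shannon entropy over class labels as the uncertainty measure $H$ (the classification measure that section 2.2.2 adopts), and the function names `entropy`, `information_gain` and `best_split` are hypothetical. A candidate split is represented simply by the list of left/right decisions it produces for the samples at the node, which is assumed to be non-empty.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy of a list of class labels (assumed choice for H)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def information_gain(labels, goes_right):
    """IG = H(P) - sum_i w_i * H(P_i) for one candidate split.

    labels     : class label of every sample reaching the node
    goes_right : parallel list of booleans, the outcome of the candidate test
    """
    left = [c for c, r in zip(labels, goes_right) if not r]
    right = [c for c, r in zip(labels, goes_right) if r]
    n = len(labels)
    # w_i is the fraction of samples sent to child i.
    weighted = sum(len(part) / n * entropy(part) for part in (left, right))
    return entropy(labels) - weighted


def best_split(labels, candidates):
    """Return (index of the candidate with the highest gain, all gains)."""
    gains = [information_gain(labels, c) for c in candidates]
    best = max(range(len(candidates)), key=gains.__getitem__)
    return best, gains


if __name__ == "__main__":
    labels = [0, 0, 1, 1]
    useless = [False, True, False, True]   # mixes both classes in each child
    perfect = [False, False, True, True]   # separates the classes
    print(best_split(labels, [useless, perfect]))
```

In this toy case the second candidate separates the two classes completely, so its gain equals the entropy of the parent node, while the first candidate leaves both children as mixed as the parent and gains nothing.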

In the context of 3D face analysis, random forest based approaches have been applied to estimate head pose from high-quality head scans [11]. The authors achieved real-time performance without requiring a Graphics Processing Unit (GPU). The authors extended their work [26] to use noisy depth data obtained from a consumer depth camera and still managed to obtain low regression error. However, the result was not as accurate as the previous system because the input data were noisier. In their subsequent paper [12], the authors extended their work to facial landmark detection. The method was evaluated on high-quality face scans containing facial expressions and head pose rotations, and highly accurate results were reported. In another relevant work, Fanelli et al. [13] proposed a random regression forest to steer the fitting of an Active Appearance Model (AAM) [27]. The authors achieved robust performance by integrating depth and intensity channels.

2.2 Training Forest

A forest is basically a collection of decision trees. To construct a decision tree $T_t$ in the forest $\mathcal{T} = \{T_t\}$, a set of randomly sampled training images is provided. Every image has multiple registered channels, which will be discussed in chapter 3. Next, a set of fixed-size image patches is extracted from each training image and each channel. The patches are extracted around the facial landmark points (positive samples) and outside the face region (negative samples). More specifically, a patch is considered a positive sample for a landmark point $k$ if the distance $d_k$ between the center of the patch and the landmark point is below a certain radius. We follow the parameter setting from [12], in which the radius is defined as one fifth of the radius of an average human face. In other words, $d_k \leq 0.2r$, where $r$ is the radius of an average human face. Figure 2.1 illustrates the pipeline of the training process.

Each patch $P_i$ consists of multiple channel features $I_i = (I_i^1, \dots, I_i^C)$ and is annotated with a class label $c_i \in \{0, 1, \dots, K\}$ and an offset vector $\theta_i = (\theta_i^1, \dots, \theta_i^K)$. $K$ is the number of landmark points, and $c_i = 0$ means that the patch is sampled from the background, e.g. hair or body. The offset vector $\theta^k = (\theta_x^k, \theta_y^k, \theta_z^k)$ represents the position of landmark point $k$ relative to the patch center.

Each tree is constructed using a different set of training patches to make sure that the trees are less correlated. Reducing the correlation between any two trees in the forest reduces the error rate [24]. This is because a single decision tree can be seen as a predictor with a high variance. Adding more trees and averaging the results moves the final prediction closer to the actual value.
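The patch annotation described above can be illustrated with a short sketch. It is written under stated assumptions rather than taken from the thesis: the function name `label_patch` is hypothetical, a patch that lies within $0.2r$ of several landmarks is assigned to the nearest one (the text does not say how such ties are resolved), and every patch farther than $0.2r$ from all landmarks is labelled 0, whereas the thesis samples its negative patches specifically from outside the face region.

```python
import math


def label_patch(patch_center, landmarks, face_radius):
    """Return (class_label, offsets) for one training patch.

    patch_center : (x, y, z) of the patch center
    landmarks    : list of K landmark positions (x, y, z); landmark k is index k - 1
    face_radius  : r, radius of an average face, in the same unit as the coordinates

    class_label is k in 1..K if the patch center lies within 0.2 * r of
    landmark k (nearest landmark wins, an assumed tie-break), otherwise 0.
    offsets[k - 1] is the vector from the patch center to landmark k.
    """
    offsets = [tuple(l - p for l, p in zip(lm, patch_center)) for lm in landmarks]
    dists = [math.sqrt(sum(o * o for o in off)) for off in offsets]
    nearest = min(range(len(landmarks)), key=dists.__getitem__)
    label = nearest + 1 if dists[nearest] <= 0.2 * face_radius else 0
    return label, offsets


if __name__ == "__main__":
    # Two made-up landmarks and a made-up face radius, for illustration only.
    landmarks = [(0.0, 0.0, 0.0), (100.0, 0.0, 0.0)]
    print(label_patch((10.0, 0.0, 0.0), landmarks, face_radius=100.0))
    print(label_patch((50.0, 0.0, 0.0), landmarks, face_radius=100.0))
```

The first patch is 10 units from the first landmark, inside the 20-unit radius, so it becomes a positive sample for that landmark; the second patch is 50 units from both landmarks and receives label 0.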


Figure 2.1: The pipeline of the training process: (1) RGB and depth images are aligned using a calibration matrix. (2) Multiple channels are generated from each image in the training set. (3) A set of positive and negative training patches is extracted from the registered image channels. (4) The training patches are used to construct trees.

A tree is grown from its root node until all leaf nodes are created, i.e. when either the maximum tree depth is reached or fewer than a certain number of patches are left. The algorithm for growing a tree in the forest is summarized as follows:

1. Sample N training images with replacement from the original training set.
2. Randomly extract a number of positive and negative samples from the training images.
3. Starting from the root node:
   (a) Generate different sets of parameters for the binary tests, $\{\phi = \{f, R_1, R_2, \tau\}\}$. The binary test is described in detail in section 2.2.1.
   (b) Perform the binary tests for all generated parameters.
   (c) Select the optimum parameters, i.e. those which maximize the objective function. The optimum parameters are then stored in the current node. The objective function is described in detail in section 2.2.2.
   (d) Divide the incoming patches $P$ into two subsets $P_L$ and $P_R$ and send them to the appropriate child nodes.
4. Repeat step 3 until all leaves are created.

Once a leaf node $L$ is created, it stores two kinds of information:

(a) The probability of each class in that leaf, $p(c = k \mid L)$, computed as the ratio of positive samples of class $k$ that arrive at that leaf.

(b) The distribution over offset vectors for each facial landmark. The distribution is simply modelled by a multivariate Gaussian, similar to [14]:

\[ p(\theta^k \mid L) = \mathcal{N}(\theta^k; \bar{\theta}^k, \Sigma^k), \tag{2.3} \]

where $\bar{\theta}^k$ and $\Sigma^k$ are the mean and covariance matrix of the offset vectors of facial landmark $k$.

2.2.1 Binary Test

As described in the previous section, a binary test is performed to split the incoming patches into two subsets. In order to find the optimum split, typically a large number of candidate splits are generated. This means generating a large number of candidate parameters and then evaluating them using a binary test. The binary test is defined as follows [12]:

\[ \frac{1}{|R_1|} \sum_{q \in R_1} I^f(q) - \frac{1}{|R_2|} \sum_{q \in R_2} I^f(q) > \tau, \tag{2.4} \]

where $I$ is the image channel, $f$ is the channel's index, $R_1$ and $R_2$ are two rectangular sub-patches within the patch, and $\tau$ is a threshold. The parameters $f$, $R_1$, $R_2$, and $\tau$ are generated randomly, and the result of this test determines how to split the incoming image patches. A patch is sent to the right child node if the test returns true; otherwise it is sent to the left child node.

Equation 2.4 shows that the test compares the difference between the average values of two rectangular sub-patches with the threshold. Using average pixel values reduces the effect of missing information in noisy data. Section 3.2 discusses how to compute the sum of pixel values over any rectangular region $R$ using integral images.

2.2.2 Objective Function

As mentioned in section 2.1.2, a forest is trained to maximize the information gain in every node of the tree, which minimizes the uncertainty measure. In this particular case, we want to localize facial landmarks in image patches. Therefore, the term $H(P)$ in equation 2.2 can be replaced by a classification uncertainty measure, defined by:

\[ H(P) = -\sum_{k=0}^{K} p(c = k \mid P) \log p(c = k \mid P), \tag{2.5} \]

where the sum runs over the $K + 1$ classes (the $K$ landmarks plus the background class $k = 0$) and $p(c = k \mid P)$ is the probability of class $k$ in the patch set $P$. The probability $p(c = k \mid P)$ is approximated by the ratio of positive patches for landmark $k$ in the set $P$. The complete objective function is obtained by substituting equation 2.5 into equation 2.2. The optimum split is the one which maximizes this objective function.

2.3 Testing

Once a complete forest has been trained, we would like to test how well the trained forest detects 3D landmarks in unseen images. A dense set of patches is extracted from a test image with a predefined stride parameter, which controls the distance between patches. These patches are then sent to the trained trees. In each tree, the binary test with the stored optimum parameters guides a patch from the root node until it reaches a leaf node. The information in the leaf node is used to compute a prediction vote. So, for each patch $P$, we obtain a set of prediction votes from the trees. However, not all votes are considered. A leaf node $L$ is allowed to vote for the location of landmark point $k$ if the following conditions are met:

1. The probability of class $k$ stored in the leaf node is at least a threshold, i.e. $p(c = k \mid L) \geq tr_{prob}$.
2. The trace of the corresponding covariance matrix (equation 2.3) is below a maximum variance, i.e. $\mathrm{Tr}(\Sigma^k) < tr_{var}$.

The values used for $tr_{prob}$ and $tr_{var}$ are 0.75 and 300, respectively. These values were found by trial-and-error experiments. These criteria ensure that only votes with high confidence are considered for prediction.

After sending all patches to the trees, $K$ different sets of votes $\{v_i^k\}$ are obtained. Each set contains the location candidates for the corresponding landmark $k$. A location candidate is calculated by adding the mean offset vector $\bar{\theta}^k$ stored in the leaf node to the patch center coordinates. Finally, mean shift clustering [28] is performed on each vote set $k$ to get the final prediction. The next section describes the vote clustering algorithm.
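The two voting conditions and the computation of a location candidate can be sketched as follows. This is an illustration under assumptions, not the thesis code: a leaf is represented by a plain dictionary with hypothetical keys `prob`, `mean` and `cov`, and the function name `collect_votes` is made up. The default thresholds are the values 0.75 and 300 quoted above.

```python
def trace(cov):
    """Trace of a square matrix given as nested lists."""
    return sum(cov[i][i] for i in range(len(cov)))


def collect_votes(patch_center, leaves, k, tr_prob=0.75, tr_var=300.0):
    """Location candidates for landmark k from the leaves reached by one patch.

    leaves : one dict per tree (hypothetical layout) with
             leaf["prob"][k] -> p(c = k | L)
             leaf["mean"][k] -> mean offset vector of landmark k
             leaf["cov"][k]  -> 3x3 covariance matrix of those offsets
    """
    votes = []
    for leaf in leaves:
        if leaf["prob"][k] < tr_prob:
            continue  # condition 1 failed: class probability too low
        if trace(leaf["cov"][k]) >= tr_var:
            continue  # condition 2 failed: offsets too spread out
        # Candidate location = patch center + mean offset stored in the leaf.
        votes.append(tuple(c + o for c, o in zip(patch_center, leaf["mean"][k])))
    return votes


if __name__ == "__main__":
    tight = [[10.0, 0.0, 0.0], [0.0, 10.0, 0.0], [0.0, 0.0, 10.0]]
    loose = [[200.0, 0.0, 0.0], [0.0, 200.0, 0.0], [0.0, 0.0, 200.0]]
    leaves = [
        {"prob": {1: 0.9}, "mean": {1: (1.0, 2.0, 3.0)}, "cov": {1: tight}},
        {"prob": {1: 0.5}, "mean": {1: (5.0, 5.0, 5.0)}, "cov": {1: tight}},
        {"prob": {1: 0.9}, "mean": {1: (9.0, 9.0, 9.0)}, "cov": {1: loose}},
    ]
    print(collect_votes((10.0, 20.0, 30.0), leaves, k=1))
```

Only the first leaf passes both conditions here, so the patch contributes a single candidate location for landmark 1.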

2.3.1 Vote Clustering

Since our approach does not involve any face or head detection, a bottom-up clustering with a predefined radius (the radius of the average human face) is performed to localize head positions and to filter out outliers. A cluster is treated as an outlier if its number of votes is below a threshold that defines the minimum number of votes. We follow [12] to set the threshold value.

Within each head cluster, mean shift clustering is performed for each landmark $k$. Mean shift is a non-parametric iterative algorithm that can be used to find the mode of a density function. The algorithm assumes that the given data are sampled from a probability density function whose dense regions correspond to the local maxima, or modes, of the density function. Starting from an arbitrary location, mean shift defines a window around it and computes the weighted mean of the data within the window. The window size is defined by a kernel function, for which there are many choices, e.g. a flat kernel or a Gaussian kernel. Next, the center of the window is shifted to the new weighted mean. This procedure is repeated until it converges or reaches a maximum number of iterations.

Given a set of landmark votes $\{v_i^k\}$ and a Gaussian kernel $K$, the clustering procedure is summarized as follows:

1. Set the initial estimate $m^k_{t=0}$ to the mean of the landmark votes.
2. Repeat until $m^k$ converges or the maximum number of iterations is reached:
   (a) Update the weighted mean $m^k$:

\[ m^k_{t+1} = \frac{\sum_{v_i^k} K(v_i^k - m_t^k) \, v_i^k}{\sum_{v_i^k} K(v_i^k - m_t^k)}, \tag{2.6} \]

where $K(v_i^k - m_t^k) = \exp\left(-\frac{\lVert v_i^k - m_t^k \rVert^2}{2h^2}\right)$ and $h = 0.2r$.

The bandwidth parameter $h$ determines the size of the clustering window; we set its value to one fifth of the radius of an average face.


In each iteration, the window determined by the Gaussian kernel $K$ is shifted to a denser region, so that it eventually reaches a peak of the density function. The final prediction for landmark $k$ is given by the final value of the weighted mean $m^k$. Figure 2.2 illustrates the votes for all landmark points and the final prediction.

Figure 2.2: All votes for each landmark are represented as point clouds in different colors. The centers of the circles represent the final prediction of landmark positions.
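The mean shift update of equation 2.6 can be sketched directly from the two steps listed above. This is a minimal illustration that assumes the Gaussian kernel with bandwidth $h$ given in the text; the function name `mean_shift_mode`, the convergence tolerance and the iteration limit are hypothetical choices, and the preceding bottom-up head clustering is not included.

```python
import math


def mean_shift_mode(votes, h, max_iter=100, tol=1e-6):
    """Mean shift with a Gaussian kernel, started from the mean of the votes.

    votes : non-empty list of equally sized coordinate tuples
    h     : kernel bandwidth (0.2 * r in the text)
    """
    dim = len(votes[0])
    # Step 1: initial estimate is the plain mean of all votes.
    m = [sum(v[d] for v in votes) / len(votes) for d in range(dim)]
    for _ in range(max_iter):
        # Gaussian weight of every vote relative to the current estimate.
        weights = [
            math.exp(-sum((v[d] - m[d]) ** 2 for d in range(dim)) / (2.0 * h * h))
            for v in votes
        ]
        total = sum(weights)
        if total == 0.0:
            break  # all votes are too far away for the kernel to register
        # Step 2(a): the new estimate is the kernel-weighted mean of the votes.
        new_m = [sum(w * v[d] for w, v in zip(weights, votes)) / total
                 for d in range(dim)]
        shift = math.sqrt(sum((a - b) ** 2 for a, b in zip(new_m, m)))
        m = new_m
        if shift < tol:
            break
    return tuple(m)


if __name__ == "__main__":
    # Four votes close together plus one far outlier (made-up numbers).
    votes = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0),
             (0.0, 0.0, 1.0), (100.0, 100.0, 100.0)]
    print(mean_shift_mode(votes, h=20.0))
```

Although the starting point is pulled towards the outlier, the kernel weight of the outlier is negligible after the first update, so the estimate settles on the dense group of votes rather than on the overall mean.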

Chapter 3 Integral Channel Features

In the previous chapter, we explained random forest based models for 3D landmark detection. The performance of these models is determined not only by the learning algorithm itself, but also by the feature representation. Thus, the choice of features is an important aspect of developing a robust landmark detector. In this chapter, we discuss integral channel features and how these features are integrated into a random forest model.

The idea of integral channel features is simple but effective. A number of image channels are generated from a given image. These channels can be generated in many different ways. For instance, depth and color channels are obtained directly from an image. A channel can also be computed using a linear transformation (e.g. Gabor filters), a non-linear transformation (e.g. gradient) or even a pointwise transformation. Once the channels are generated, features such as local sums, histograms, and Haar features are computed for each channel. The features capture heterogeneous and rich information from different types of channels. Furthermore, these features can be computed efficiently using integral images.

In the first section, we summarize the related work on integral channel features that have been applied in different computer vision tasks. Section 3.2 describes integral images. Section 3.3 explains the different channel types, and how the channels and features are integrated into a random forest model. Lastly, in section 3.4, we present the evaluation of the different channels in terms of computational time.

3.1 Related Work

The notion of integral channel features is inseparable from the concept of an integral image. The first work adopting integral images in the computer vision domain was that of Viola and Jones [29]. Viola and Jones proposed cascaded AdaBoost classifiers with Haar-like features for object detection. They achieved real-time performance with high detection accuracy. Their work was a breakthrough in computer vision, as their proposed feature representations proved to be both efficient and effective for object detection. Later, similar frameworks were adopted in many other applications. Integral channel features, in particular, have proven to be effective for many computer vision tasks, e.g. object recognition [18], pedestrian detection [19] and local region matching [20].

In the medical imaging domain, Tu et al. [30] introduced a probabilistic boosting tree framework with various image channels for MRI brain segmentation. The authors computed Gabor filters and edge response channels at different scales, combined with 3D Haar filter channels on top of them.

Dollar et al. [31] trained an edge detector using a large number of channel features. These channels included gradients at various scales, Gabor filters and Gaussian filters, and the detector obtained high accuracy. In their subsequent paper [17], Dollar et al. explored different types of channel features and studied the performance of different channel types for pedestrian detection. Their proposed method successfully outperformed other features, including Histogram of Oriented Gradients (HoG).

A variant of integral channel features, named aggregate channel features, was proposed by Yang et al. [32] to train a multi-view face detector. The authors adopted the Viola-Jones learning framework and utilized different types of color and gradient channels to deal with different poses, ranging from frontal faces to profile faces. The algorithm achieved high detection accuracy for face images in the wild.

Although many methods using integral channel features have been reported for different tasks, only a few have utilized integral channel features in 3D-related problems. In this thesis, we are interested in exploiting multiple channel features computed from RGB-Depth images.

3.2 Integral Image

In image processing, an integral image is a representation used to efficiently calculate the sum of values (pixel values) in a rectangular image area. Figure 3.1 illustrates how the integral image algorithm works. At each location $(x, y)$, an integral image contains the sum of the pixel values above and to the left of $(x, y)$, inclusive. It is formally defined by:

\[ I(x, y) = \sum_{x' \leq x} \sum_{y' \leq y} i(x', y'), \tag{3.1} \]

where $i$ is the input image and $i(x', y')$ is the pixel value at location $(x', y')$ in the input image. Once the integral image has been computed, the sum of values over any rectangular area $(x_0, y_0, x_1, y_1)$ within the image can be calculated in constant time $O(1)$ using four references:

\[ \sum_{\substack{x_0 < x \leq x_1 \\ y_0 < y \leq y_1}} i(x, y) = I(x_0, y_0) + I(x_1, y_1) - I(x_0, y_1) - I(x_1, y_0). \tag{3.2} \]

Figure 3.1: (a) Input image and (b) computed integral image. The sum of values in region A can be computed using four references: $L_1 + L_4 - L_2 - L_3$.

The concept of integral images has been extended in various ways. For instance, integral images can also be used to compute the local product over any rectangular area within an image. This can be done by taking the log of the pixel values and computing the sum, since $\exp\left(\sum_i \log x_i\right) = \prod_i x_i$. Lienhart and Maydt [33] extended the integral image representation to compute the sum of pixels in rotated rectangular regions. They proposed rotated Haar-like features with boosted classifiers for object detection. These features were reported to produce more robust and accurate detection. Another variant is the integral volume representation, which is the three-dimensional generalization of the integral image. Ke et al. [34] exploited this kind of volumetric feature in the spatiotemporal domain for event detection in video sequences. The method achieved real-time performance with low errors.
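To make equations 3.1 and 3.2 concrete, and to show how they support the binary test of equation 2.4, the following sketch builds an integral image and evaluates the test with a constant number of lookups per rectangle. It is an illustration under assumptions, not the thesis code: the function names are hypothetical, images are plain nested lists, rectangles are given in image coordinates as half-open ranges `(x0, y0, x1, y1)`, and the table is padded with a zero row and column so that no border cases arise.

```python
def integral_image(img):
    """Return an (H+1) x (W+1) table S with S[y][x] = sum of img[0:y][0:x].

    The extra zero row and column avoid special cases at the image border.
    """
    h, w = len(img), len(img[0])
    S = [[0.0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0.0
        for x in range(w):
            row_sum += img[y][x]
            S[y + 1][x + 1] = S[y][x + 1] + row_sum
    return S


def rect_sum(S, x0, y0, x1, y1):
    """Sum of img over columns x0..x1-1 and rows y0..y1-1 with four lookups."""
    return S[y1][x1] + S[y0][x0] - S[y0][x1] - S[y1][x0]


def rect_mean(S, rect):
    """Average pixel value inside a non-empty rectangle (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = rect
    return rect_sum(S, x0, y0, x1, y1) / ((x1 - x0) * (y1 - y0))


def binary_test(integrals, f, r1, r2, tau):
    """Equation 2.4: mean of channel f over r1 minus mean over r2, against tau.

    integrals : list of integral images, one per channel
    """
    S = integrals[f]
    return rect_mean(S, r1) - rect_mean(S, r2) > tau


if __name__ == "__main__":
    img = [[1, 2, 3],
           [4, 5, 6],
           [7, 8, 9]]
    S = integral_image(img)
    print(rect_sum(S, 1, 1, 3, 3))                              # 5 + 6 + 8 + 9
    print(binary_test([S], 0, (1, 1, 3, 3), (0, 0, 2, 2), 3.0))
```

Because each rectangle costs four lookups regardless of its size, the integral image of a channel only has to be computed once per image, after which any number of candidate tests on any patch can be evaluated cheaply.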


3.3 Channel Features

In Section 2.2.1, we defined features such as the average pixel values over two rectangular regions within a patch. Features of this kind can be computed from any type of channel. The only prerequisite is that a channel C has to be translationally invariant: if two images I and I' are related by a translation, the generated channels C and C' are related by the same translation. This criterion allows us to efficiently compute features from any rectangular region within the image channel. An image channel C is generated only once per image rather than for every image patch, and features in an image patch are then computed using integral images.

In this section, we study three different channel types: depth, gray and gradient histograms. These channels are illustrated in Figure 3.2. The rationale behind selecting these channels is that they capture local information about the face surface and its contours. In particular, depth values indicate the shape of the face surface. Gray values capture the texture of the face surface. The image gradients capture information about the rate of texture change and edge responses along different angles.

Figure 3.2: Examples of generated image channels: depth, gray, and gradient along 4 different angles. The sum over any rectangular region within a channel is computed using integral images.

1. Depth channel. This channel is obtained directly from the RGB-D image. The sum over any rectangular region of the depth channel is computed directly using integral images.
2. Gray channel. The gray channel is generated from the RGB color channels. Normalized (raw) gray values are used to minimize the effect of illumination variations.
3. Gradient histogram. The algorithm to compute histograms using integral images was first introduced by Porikli [35]. Gradient histograms are the most commonly used variants of integral histograms. Gradient histograms are generated by quantizing the gray image into a number of gradient angles. Each value within the quantized image is weighted by its gradient magnitude:

\[ Q_\theta(x, y) = G(x, y)\,\mathbf{1}[\Theta(x, y) = \theta], \tag{3.3} \]

where 1 is the indicator function, θ is a gradient angle, and G(x, y) and Θ(x, y) are the gradient magnitude and the quantized gradient angle at pixel location (x, y), respectively.

In our settings, instead of combining the obtained quantized images into histograms, we adopted the quantized images themselves as multiple individual channels. The only parameter to be set here is the number of quantized images that are computed. This parameter influences the performance of the model, and its impact is discussed in Chapter 4. This technique can also be applied to approximate HoG features, as in [19], by combining all quantized images into a histogram and normalizing it with the gradient image computed at a different scale.

Channel features are integrated into our random forest model as follows. Since RGB-D images are used, the RGB and depth images have to be aligned first. In the training process, each training image is transformed into multiple image channels. A set of patches extracted from these channels is then used to grow trees. During the training stage, the learning algorithm selects, for every non-leaf node, the channel that maximizes the information gain. The same approach is applied in the testing phase. Intuitively, the more channels are used in the model, the richer the information the model collects to classify patches correctly. However, adding more channels also increases the complexity of the model and the computation time. Our experiments study which combination of channel features produces the best performance for detecting 3D facial landmarks.
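The quantized gradient channels of Equation 3.3 can be sketched as follows. This is only an illustration under stated assumptions: the thesis does not specify the gradient operator or the angle range, so the central differences, the border clamping, the unsigned orientation range [0, π) and the name `gradient_channels` are all choices made for this example.

```python
import math


def gradient_channels(gray, n_bins=4):
    """Split a gray image into n_bins channels Q_theta (Eq. 3.3).

    Each pixel's gradient magnitude G is written into the one channel whose
    quantized angle matches the pixel's gradient angle; all other channels
    keep 0 at that pixel.
    """
    h, w = len(gray), len(gray[0])
    channels = [[[0.0] * w for _ in range(h)] for _ in range(n_bins)]
    for y in range(h):
        for x in range(w):
            # Central differences, clamped at the image border (assumption).
            gx = gray[y][min(x + 1, w - 1)] - gray[y][max(x - 1, 0)]
            gy = gray[min(y + 1, h - 1)][x] - gray[max(y - 1, 0)][x]
            magnitude = math.hypot(gx, gy)
            # Unsigned orientation in [0, pi) (assumption).
            angle = math.atan2(gy, gx) % math.pi
            bin_index = min(int(angle / math.pi * n_bins), n_bins - 1)
            channels[bin_index][y][x] = magnitude
    return channels


# A vertical step edge: the gradient points along x, so it falls in bin 0.
edge = [[0, 0, 10, 10] for _ in range(4)]
channels = gradient_channels(edge, n_bins=4)
print(channels[0][0])  # [0.0, 10.0, 10.0, 0.0]
```

At every pixel the channels sum to the gradient magnitude, and each channel can be given its own integral image so that rectangle sums over it stay O(1).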
Our experimental results are reported in Chapter 4.

Evaluation on Computation Time

To demonstrate the efficiency of integral channel features, we measured the average time needed to generate each individual channel and each combination of channels. All experiments were conducted on the same standard PC. Table 3.1 reports the average time needed to generate different channel types, including the computation of integral images from each channel. Computing integral images from depth and gray images takes around 2 ms. The time to compute the sum of values over any rectangular region within a channel is also negligible, since it is an O(1) operation.

Table 3.1: Average time needed to generate image channels. Columns: Channel Types, Time (ms). Rows: 1 Gradient; Gradients; Gradients; Depth + Gray + 1 Gradient; Depth + Gray + 4 Gradients; Depth + Gray + 9 Gradients.

Chapter 4

Experiments

4.1 Experiment Setup

Dataset Labelling

We evaluated our model on the Biwi Kinect head pose dataset¹. The dataset contains 24 sequences of 20 subjects (14 men and 6 women) with more than 15K frames in total. Each frame has an RGB image and a depth image, as well as information about the head rotation and location. The head rotation angles vary within ±60° pitch, ±75° yaw, and ±50° roll. The dataset has no landmark annotation. To annotate landmarks for each subject, we used the following algorithm:

1. Manually annotate landmark points in the first frame. Any facial landmark detector can also be used to automate this step. In our setting, we annotated the first frames using the 2D landmark detector proposed in [6]. This step results in facial landmark annotations in 2D coordinates.
2. From the 2D landmarks, compute 3D landmarks using the corresponding depth image and the intrinsic matrix:

\[ p = M^{-1} x \tag{4.1} \]

where x is a vector representing the 2D landmarks, p denotes the landmarks in 3D, and M is the camera intrinsic matrix.

¹ Biwi Kinect dataset is available at:


3. Shift the location of the head center to the origin of the coordinate system. In other words, subtract the position of the head center from the landmark position.
4. Transform the landmarks with the inverse rotation matrix. This results in landmarks in 3D camera coordinates.

\[ p_0 = R_1^{-1} p \tag{4.2} \]

where R_1 is the rotation matrix at the first frame.
5. From frame 1 until frame N, transform the landmarks p_0 using the rotation matrix at each frame. Final landmark positions are obtained by translating the transformed landmark positions with the original head center location. Figure 4.1 illustrates landmark annotation results for different head poses.

\[ p_n = (R_n p_0) + h \tag{4.3} \]

where p_n and R_n are respectively the final 3D landmarks and the rotation matrix at frame n, and h is the head center location.

Figure 4.1: Examples of annotation results for different head poses. The landmarks (green dots) are visualized in 2D. The black dots represent landmarks that are not visible when projected in 2D.

We identified that for large head poses, several landmarks are not aligned with the face surface. In addition, these points are visible when the 3D image is projected onto a 2D image, as illustrated by the third and fourth images in Figure 4.1. Considering this, we performed an additional step to verify whether a landmark is located on the face surface. A landmark that has neighbouring points of the point cloud within a certain radius is categorized as visible; otherwise it is not visible. This visibility information is used when evaluating the performance of the landmark detector. Only visible landmarks are considered in the evaluation.

Once the dataset was annotated, we followed the settings of [12] to split the dataset into training and testing sets. The testing set contains only 2 subjects: a man and a woman with large pose variations (subjects 01 and 12). The remaining subjects are used in the training set, except for subjects 06, 17 and 19. These subjects have facial expressions and missing depth data in one or more fiducial points, e.g. eye corners. As a consequence, the positions of rigid landmark points in these subjects are hard to approximate.

In order to analyze the robustness of our landmark detector in the presence of head pose variations, we conducted two experiments. First, we trained and evaluated the model with subsets of frames having less than 20° head rotation (frontal face). This constraint ensures that all landmarks are visible and surrounded by sufficient facial surface for the features to be computed. In the second experiment, we relaxed the constraint and constructed a forest from the full training set. The trained trees are then evaluated using the unconstrained test set. In this experiment, the evaluation is conducted only on visible landmarks. We study the optimum parameters of the random forest, error thresholds, and a comparison of different combinations of channel features. Moreover, we compare the performance of the landmark detector with another, 2D-based method to gain insight into the advantage of this method.

Parameter Settings

In order to fairly compare the performance of the channel features, we trained multiple forests using identical training images and patches. For training, we fixed the following parameters:

1. Number of image samples on each tree: 1000 (frontal face set), 3000 (full set).
2. Maximum tree depth:
3. Number of positive patch samples extracted from each image:
4. Number of negative patch samples extracted from each image:
5. Patch size: pixels.
6. Minimum number of patches required for a split:
7. Number of binary tests in each non-leaf node: different combinations of R_1, R_2 and f in Equation 2.4, each with 25 different thresholds τ.

In the testing phase, the following parameters are applied:

1. Threshold variance: 300

2. Threshold class probability:
3. Bandwidth parameter for mean shift clustering: 0.2r, where r is the radius of the average face.

Evaluation Measure

We measured the error for each landmark as the Euclidean distance between the predicted location and the ground truth (Equation 4.4). We also measured the ratio of correctly detected points, where a point counts as correctly detected if its error is less than an error threshold. The optimum error threshold is discussed in the Accuracy results of Section 4.2.

\[ \mathrm{error}(y^k, t^k) = \sqrt{(y^k_x - t^k_x)^2 + (y^k_y - t^k_y)^2 + (y^k_z - t^k_z)^2} \tag{4.4} \]

where y^k and t^k are the predicted location and the ground truth location of landmark k in 3D coordinates, respectively.

4.2 Results

Number of Trees

The experiment was conducted with three different channel combinations for both the frontal face and the full dataset. The results of this experiment are presented in Figures 4.2a and 4.2b. The graphs show the mean Euclidean error (in millimeters) as a function of the number of trees when the maximum tree depth is fixed to 20. Both graphs illustrate that adding more trees reduces the error of the landmark detector. The same trend holds for all combinations of channel features in both sets. In Figure 4.2a, the accuracies for the Depth channel and the Depth + Gray channels stabilize at about 7 trees, while the accuracy for Depth + Gray + 4 Gradients converges even faster, after 3 trees. We noted that when 7 trees are used, the Depth + Gray channels and the Depth + Gray + 4 Gradients channels perform equally well. In Figure 4.2b, the combination of depth, gray and 4 gradient channels outperforms the other channels. Using the additional gray and 4 gradient channels reduces the error especially for landmarks that have small variations of depth values, e.g. eye corners. Our following experiments are conducted with the optimal number of trees: 7 for the frontal face dataset and 20 for the full dataset.

Figure 4.2: The influence of the number of trees (tree depth = 20), measured as mean Euclidean error (mm) averaged over all landmarks and all images in the test set, for the Depth, Depth + Gray, and Depth + Gray + 4 Gradients channels. (a) Frontal face test set [-20°, 20°]. (b) Full test set.

Accuracy

In Section 2.2, we defined positive samples for each landmark k by a certain radius. To preserve consistency, we also evaluated the accuracy of the detector with a certain error threshold. Any prediction that produces an error larger than the threshold is considered a missed detection. Figures 4.3a and 4.3b depict the accuracy as a function of the error threshold, evaluated on the frontal face and full test sets, respectively. Stable accuracy is achieved when the threshold is set to 20 mm. Once again, for the frontal face set, the Depth + Gray channels and the Depth + Gray + 4 Gradients channels have similar performance. For the full test set, the combination of Depth + Gray + 4 Gradients channels provides higher accuracy than the other combinations.
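The evaluation measure of Equation 4.4 and the threshold-based accuracy used above can be sketched in a few lines of Python. The function names and the toy predictions are hypothetical and serve only to show the computation; they are not results from the thesis.

```python
import math


def landmark_error(pred, truth):
    """Euclidean distance between two 3D points (Equation 4.4)."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)))


def accuracy_at_threshold(preds, truths, threshold_mm):
    """Fraction of predictions whose error is below threshold_mm."""
    errors = [landmark_error(p, t) for p, t in zip(preds, truths)]
    return sum(e < threshold_mm for e in errors) / len(errors)


# Toy data (hypothetical): errors of 0 mm, 5 mm and 30 mm.
preds = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (0.0, 0.0, 30.0)]
truths = [(0.0, 0.0, 0.0)] * 3

for threshold in (5.0, 10.0, 20.0, 40.0):
    print(threshold, accuracy_at_threshold(preds, truths, threshold))
```

Sweeping the threshold in this way produces an accuracy curve of the kind plotted in Figure 4.3.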

Figure 4.3: Detection accuracy (%) as a function of the error threshold (mm), averaged over all landmarks and all images in the test set, for the Depth, Depth + Gray, and Depth + Gray + 4 Gradients channels. (a) Frontal face test set [-20°, 20°]. (b) Full test set.

Accuracy vs Efficiency

In this experiment we study the effect of the stride parameter on accuracy and efficiency. We measured the average time needed to test a single image after it has been loaded into memory, and compared it with the resulting accuracy. Figures 4.4a and 4.4b show the evaluation results on the frontal face test set. Figures 4.5a and 4.5b show the evaluation results on the full test set. The results illustrate that the value of the stride parameter is negatively correlated with the accuracy. Using a smaller stride value yields higher accuracy (Figures 4.4b and 4.5b). However, this comes at the expense of processing time (Figures 4.4a and 4.5a). Larger stride values speed up the process but decrease the accuracy.

Figure 4.4: (a) Execution time (ms) as a function of the stride parameter (pixels). (b) Accuracy (%) as a function of the stride parameter. Time and accuracy are averaged over all landmarks and all images in the frontal face test set, for the Depth, Depth + Gray, and Depth + Gray + 4 Gradients channels.

By comparing the results, we can conclude that the choice of the stride parameter controls the trade-off between accuracy and efficiency. When execution time is not a constraint, a 5 pixel stride can be used, since it maintains high accuracy with computation time still under 1 second. For real time applications, larger stride values can be considered. Our following experiments are conducted with a 5 pixel stride.
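One way to see why the stride has such a strong effect on run time is to count the patches that have to be pushed through the forest. The sketch below assumes that the stride is the step, in pixels, between consecutive test patches in both image directions; the 640x480 image size and the 32 pixel patch size are placeholder values chosen for illustration, not the thesis settings.

```python
def patch_origins(width, height, patch_size, stride):
    """Top-left corners of all patches that fit in the image for a given stride."""
    return [
        (x, y)
        for y in range(0, height - patch_size + 1, stride)
        for x in range(0, width - patch_size + 1, stride)
    ]


# Placeholder image and patch sizes (assumptions for this example).
for stride in (1, 5, 10, 20):
    n_patches = len(patch_origins(640, 480, 32, stride))
    print(f"stride={stride:2d} -> {n_patches} patches")
```

Because the stride applies along both axes, the number of patches, and with it the number of tree traversals, falls roughly with the square of the stride, while fewer votes are available to localize each landmark.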

Figure 4.5: (a) Execution time (ms) as a function of the stride parameter (pixels). (b) Accuracy (%) as a function of the stride parameter. Time and accuracy are averaged over all landmarks and all images in the full test set, for the Depth, Depth + Gray, and Depth + Gray + 4 Gradients channels.

Channel Features Comparison

Our approach differs from other facial alignment approaches because we do not build a landmark or shape model beforehand to fit the test image. This makes our detection results sensitive to the combination of landmarks. For this reason, we prefer to evaluate each individual landmark separately. We present the performance results of different channel features evaluated on both the frontal face and the full test set. Table 4.1 summarizes the experimental results.

Figure 4.6: Channel features performance comparison on the frontal face dataset (head pose range [-20°, 20°]): accuracy (%) per landmark (Chin, Nose Tip, R Eye Out, R Eye Inn, L Eye Inn, L Eye Out) for Depth, Depth + Gray, Depth + Gray + 1 Gradient, Depth + Gray + 4 Gradients, and Depth + Gray + 9 Gradients. Note that in this dataset all landmarks are visible.

Frontal face dataset

Figure 4.6 illustrates the performance of different combinations of channel features. The nose tip is the most correctly predicted landmark, followed by the inner eye corners. This is not surprising, since for nearly frontal faces the nose area is the most distinctive area of the face. To detect these landmarks, using only the depth channel already achieves 100% or close to 100% accuracy. Adding more channels does not yield a performance improvement.

In contrast, the chin is the landmark that is often misplaced when only the depth channel is used. Since we use relatively small patches, this implies that the depth values around the chin are not distinguishable enough compared to the other regions. Using an additional gray channel is effective in reducing the misdetection rate. Adding gradient channels also increases the accuracy for the chin, but still cannot outperform the combination of depth and gray channels.

Another example of misdetection occurs between the outer eye corners and the mouth corners. For a number of subjects, the detector wrongly predicts the mouth corners as outer eye corners. We identified that this happens when the features of these regions contain small variations. Some examples of the failure cases are presented in Figure 4.7.

Overall, the best performing channels for detecting 3D landmarks on frontal faces are the combination of depth, gray and 4 gradient channels. This combination produces the highest mean accuracy and the lowest error, as shown in Table 4.1.

Figure 4.7: Examples of failure cases in the frontal face test set [-20°, 20°], randomly selected from all channel combinations. The chin and outer eye corners are the most often misplaced landmarks.

Full dataset

Figure 4.8: Channel feature performance comparison for the full test set: accuracy (%) per landmark (Chin, Nose Tip, R Eye Out, R Eye Inn, L Eye Inn, L Eye Out) for Depth, Depth + Gray, Depth + Gray + 1 Gradient, Depth + Gray + 4 Gradients, and Depth + Gray + 9 Gradients. The accuracy is computed using only the visible landmarks.

The performance results for large head pose variations are shown in Figure 4.8. Adding gray and gradient channels provides a significant improvement in the accuracies of the chin and outer eye corners. For the nose tip and inner eye corners, adding gray and gradient channels results in only a small improvement, since the depth channel already produces at least 85% accuracy.

However, the results also imply that the chin is still the most difficult landmark to detect. When only the depth channel is used, our detector achieves only 32% accuracy. Adding gray and 4 gradient channels raises the accuracy to 68%. Even with this improvement of 36 percentage points, its accuracy is still the lowest compared to the others. The other landmarks have at least 80% accuracy when the gray and 4 gradient channels are used. A number of misdetection cases are shown in Figure 4.9, which compares landmarks obtained using only the depth channel with landmarks obtained using the additional gray and gradient channels.

Figure 4.9: Examples of failure cases in the full test set. First row: examples of failure cases when only the depth channel is used. Second row: results from the same images when gray and gradient channels are added.

The comparison results are summarized in Table 4.1. These results lead us to the same conclusion as the previous experiment: the best performing channels for detecting 3D landmarks under large head pose variations are the combination of depth, gray and 4 gradient channels.

Table 4.1: Summary of the performance of different channel features, averaged over all landmarks and all test images in each dataset. Columns: Dataset, Channel Features, Mean Accuracy (%), Mean Error (mm). Rows for each dataset (Frontal face, Full): Depth; Depth + Gray; Depth + Gray + 1 Gradient; Depth + Gray + 4 Gradients; Depth + Gray + 9 Gradients.

Lastly, using the best performing setting, we performed a 4-fold subject-independent cross validation on the entire Biwi Kinect dataset. The result of this evaluation is presented in Table 4.2.

Table 4.2: Evaluation result of the 4-fold subject-independent cross validation performed with the best setting. The numbers represent Euclidean error in mm. Columns: Mean Error, Chin, Nose Tip, R Eye Out, R Eye Inn, L Eye Inn, L Eye Out.

Analysis

Performance under Head Pose Variations

In this section, we further analyze the accuracy of our detector for different poses. To do this, we tested our best performing model on a discretized test set. The test set was discretized into areas according to head pose, and the accuracy was computed for each range separately. Hence, the performance of the detector is known for each discretized head pose. The result of this experiment is presented as a heat map in Figure 4.10.

Figure 4.10: Evaluation result on the test set, discretized into areas depending on head pose angles (yaw and pitch). The colors and numbers represent the success ratio of the detector, averaged over visible landmarks and test images in each area. The optimal settings were applied (20 trees with Depth + Gray + 4 Gradient channels).

The graph shows that the detector achieved its highest accuracy on frontal faces (-20° ≤ head pose ≤ 20°). This is consistent with our previous result (Table 4.1), where for frontal faces the success rate is 1 or close to 1. We can also see from the graph that the success rates decrease as the head pose angles become larger, especially when the angle is larger than 40°.
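The per-pose analysis described above amounts to grouping test frames by their yaw and pitch angles and averaging the success ratio within each group. A minimal sketch follows; the 20° bin width, the half-open bins aligned at 0° and the tuple layout of `samples` are assumptions made for this example, since the exact discretization is not stated in the text.

```python
import math
from collections import defaultdict


def success_ratio_by_pose(samples, bin_size_deg=20.0):
    """Success ratio per (yaw bin, pitch bin).

    samples: iterable of (yaw_deg, pitch_deg, n_correct, n_visible), where
    n_visible counts the visible landmarks in a frame and n_correct counts
    those detected within the error threshold.
    Bin k covers angles in [k * bin_size_deg, (k + 1) * bin_size_deg).
    """
    totals = defaultdict(lambda: [0, 0])
    for yaw, pitch, n_correct, n_visible in samples:
        key = (math.floor(yaw / bin_size_deg), math.floor(pitch / bin_size_deg))
        totals[key][0] += n_correct
        totals[key][1] += n_visible
    return {key: c / v for key, (c, v) in totals.items() if v > 0}


# Hypothetical frames: two near-frontal, one with a large yaw angle.
samples = [
    (5.0, -3.0, 8, 8),
    (12.0, -10.0, 7, 8),
    (55.0, 10.0, 3, 6),
]
ratios = success_ratio_by_pose(samples)
for key in sorted(ratios):
    print(key, ratios[key])
```

Counting only visible landmarks in the denominator mirrors the evaluation rule used throughout this chapter.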


For large poses, the detector often wrongly predicts areas that have a texture or depth similar to the ground truth as landmarks, even areas that do not belong to the face region. For instance, ears and hair are misdetected as eye corners since they have similar textures as well as similar depth values. Another factor that contributes to the performance drop is the lack of training images for large poses. We noted that frontal faces have many more training images than faces with large poses. Adding more training images or oversampling the images for large poses would likely mitigate this.

Figure 4.11: Examples of failure cases for large pitch (top row) and yaw (bottom row) angles. The detector often mistakenly predicts areas such as the ear, hair line, and neck as landmarks.

Comparison with 2D-based State-of-the-Art

For the last experiment, we compared the performance of our detector against a 2D-based method, Ensemble Regression Trees (ERT) [7]. We used the available source code from the DLIB library [36] and ran a 4-fold cross validation on the entire Biwi Kinect dataset. Since this method relies on face detection, we also provided a 100x100 pixel ground truth face boundary as input. The trained model detects landmarks in 2D; these landmarks are then converted into 3D coordinates to be evaluated. The result of this evaluation is presented in Table 4.3.

Table 4.3: Performance comparison with Ensemble Regression Trees [7]. The numbers represent Euclidean distance in mm (lower is better). Columns: Method, Mean Error, Chin, Nose Tip, R Eye Out, R Eye Inn, L Eye Inn, L Eye Out. Rows: ERT [7]; RF (ours).
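Converting a 2D detection into 3D, as done for the ERT landmarks here and for the annotations in Equation 4.1, is commonly implemented as a back-projection through the camera intrinsics. The sketch below assumes an ideal pinhole camera without lens distortion, a depth value measured along the optical axis, and placeholder intrinsic values (fx, fy, cx, cy); none of these details are taken from the thesis.

```python
def backproject(u, v, depth_mm, fx, fy, cx, cy):
    """Map pixel (u, v) with depth depth_mm to 3D camera coordinates.

    Equivalent to depth_mm * M^{-1} [u, v, 1]^T for the pinhole intrinsic
    matrix M = [[fx, 0, cx], [0, fy, cy], [0, 0, 1]].
    """
    x = (u - cx) * depth_mm / fx
    y = (v - cy) * depth_mm / fy
    return (x, y, depth_mm)


# Placeholder intrinsics (hypothetical values, not the Kinect calibration).
FX, FY, CX, CY = 500.0, 500.0, 320.0, 240.0

# A pixel at the principal point lies on the optical axis.
print(backproject(320, 240, 800.0, FX, FY, CX, CY))
# A pixel 100 px to the right, at 1000 mm depth.
print(backproject(420, 240, 1000.0, FX, FY, CX, CY))
```

Once both methods produce 3D points in the same camera frame, the same Euclidean error can be applied to each of them.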
