Intensity-Depth Face Alignment Using Cascade Shape Regression

Yang Cao 1 and Bao-Liang Lu 1,2

1 Center for Brain-like Computing and Machine Intelligence, Department of Computer Science and Engineering, Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai 200240, China
2 Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering, Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai 200240, China

Corresponding author: Bao-Liang Lu (bllu@sjtu.edu.cn)

Abstract. With the rapid development of Kinect, the depth image has become an important channel for assisting color/infrared images in diverse computer vision tasks. Kinect provides a depth image as well as color and infrared images, which makes it well suited to multi-modal vision tasks. This paper presents a framework for intensity-depth face alignment based on cascade shape regression. Information from the intensity and depth images is combined during feature selection in cascade shape regression. Experimental results show that this combination notably improves face alignment accuracy.

Keywords: Face Alignment, Depth Image, Cascade Shape Regression

1 Introduction

Face alignment is the task of detecting key points on the human face, such as the eye corners, mouth corners and nose tip, and is an important step for many face-related vision tasks such as face tracking and face recognition. A variety of models have been proposed to tackle the face alignment problem. Notable ones are the Active Shape Model (ASM) [8], the Active Appearance Model (AAM) [7], the Constrained Local Model (CLM) [9], Explicit Shape Regression [4] and deep convolutional networks [14]. Numerous improvements to these models have also been proposed. Among them, the family of cascade regression models is the leading one. Generally, cascade regression models solve the face alignment problem with many stages of regression applied in a cascade manner. Building on this basic structure, many models have been proposed. Cao et al. [4] use two-level boosted regression, shape-indexed features and fast correlation-based feature selection, and achieve remarkable results. Asthana et al. [1] make the learning parallel and incremental, so that the model can automatically adapt to data. Burgos-Artizzu et al. [3] model occlusion explicitly to locate occluded regions and improve performance on occluded faces. Chen et al. [5] handle face detection and face alignment jointly, which improves performance on both problems.

Most face alignment studies focus on the intensity image. Unlike an intensity image, each pixel in a depth image represents the distance between the corresponding point in the scene and the camera. Intuitively, the information in the intensity image and the depth image is complementary: for example, the eyelids and eyebrows are very distinct in the intensity image because of their darkness, while the tip of the nose is very distinct in the depth image because of its shape. However, only a few researchers have tried to combine the two in face alignment tasks [2] [6]. Both [2] and [6] are based on the CLM. Nevertheless, this approach has some limitations. First, it is not optimal, because intensity information and depth information do not contribute equally to face alignment across the whole face region. Second, it is based on the CLM, while recent studies show that cascade shape regression achieves higher performance. In this paper, we address these two problems in a novel way.

2 Framework

We adapt the cascade shape regression model to depth images, and use a novel approach to combine the intensity image and the depth image. Fig. 1 gives an overview of our framework.

Fig. 1. The workflow of our framework.

2.1 Problem Description

Suppose that I is an image (an intensity-depth image in our case) containing a human face, R is a rectangle giving the face region in I, and Y = [x_1, y_1, ..., x_N, y_N]^T is the ground truth of the facial landmark positions. The task of face alignment is to compute, from only I and R, an estimate Ŷ that minimizes

||Y - Ŷ||_2. (1)
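The cascaded refinement used throughout this paper (detailed in Sec. 2.2) can be sketched as follows; the stage regressors and the feature extractor here are hypothetical placeholders, not the paper's trained components:

```python
import numpy as np

def cascade_align(image, init_shape, stage_regressors, extract_features):
    """Iteratively refine a landmark estimate Y_hat over T stages.

    stage_regressors: list of T callables mapping a feature vector to a
    shape increment; extract_features maps (image, shape) to features.
    Both are placeholders for trained components.
    """
    shape = init_shape.copy()           # Y_hat_1, e.g. mean shape placed in the face box R
    for regressor in stage_regressors:  # stages t = 1 .. T
        features = extract_features(image, shape)
        shape = shape + regressor(features)  # Y_hat_{t+1} = Y_hat_t + increment
    return shape                        # final estimate Y_hat_T

# toy usage: identity "regressors" (increment 0) leave the initial shape unchanged
init = np.zeros((68, 2))
out = cascade_align(None, init, [lambda f: 0.0] * 3, lambda img, s: None)
```

The key property is that each stage regresses an increment conditioned on the current shape estimate, so later stages see features indexed by increasingly accurate landmarks.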

2.2 Framework Structure

Since computing Ŷ in one shot is difficult, almost all face alignment models work in a cascade manner. That is, let Ŷ_1 (usually the average landmark positions translated into R) be the initial estimate, and compute a better estimate Ŷ_2 from I and Ŷ_1. Repeating this several times yields the final result Ŷ. Our framework also works in this manner, using the two-level structure proposed in [4]. There are T stages: we begin with Ŷ_1 and iteratively obtain Ŷ_2, Ŷ_3, ..., Ŷ_t, ..., until Ŷ_T, which is our final result Ŷ.

2.3 Features

Doing regression on I and Ŷ_t directly is hard, so we must extract features from I and Ŷ_t.

Shape Indexed Features. Shape indexed features are features extracted relative to the landmarks, which makes them more robust against pose variation in face alignment. In [4], a feature is the difference between the intensity values of two pixels, where each pixel position is generated by taking an offset from a certain facial landmark. Pixel positions are therefore relative to the facial landmarks, which makes them invariant across poses. In [3], linear interpolation between two landmarks is used as the position of a pixel, which makes the features still more robust. In our framework, we also use shape indexed features. For the intensity image, we directly use the difference between the intensity values of two pixels at randomly generated positions as a feature; we generate many such features and then perform feature selection. For the depth image, we use normalized depth features.

Normalized Depth Features. Shape indexed features effectively handle pose variation for the intensity image, but they are not enough for the depth image, since the depth value of each pixel changes dramatically under pose variation. To make the depth information more robust against pose variation, we propose normalized depth features.
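A minimal sketch of shape-indexed pixel-difference features for the intensity channel; the sampling scheme and names here are illustrative, not the paper's exact implementation:

```python
import numpy as np

def sample_feature_params(num_features, num_landmarks, max_offset, rng):
    """Each feature compares two pixels, each given by a landmark index
    plus a fixed offset relative to that landmark."""
    idx = rng.integers(0, num_landmarks, size=(num_features, 2))
    off = rng.integers(-max_offset, max_offset + 1, size=(num_features, 2, 2))
    return idx, off

def shape_indexed_features(intensity, shape, idx, off):
    """Difference of intensity values at two landmark-relative pixels."""
    h, w = intensity.shape
    pos = shape[idx] + off                       # (F, 2, 2): landmark position + offset
    x = np.clip(pos[..., 0].astype(int), 0, w - 1)
    y = np.clip(pos[..., 1].astype(int), 0, h - 1)
    vals = intensity[y, x]                       # (F, 2) intensity values
    return vals[:, 0] - vals[:, 1]               # pixel-difference features

rng = np.random.default_rng(0)
idx, off = sample_feature_params(400, 68, 20, rng)
img = rng.integers(0, 256, size=(120, 120)).astype(np.float64)
shape = rng.uniform(20.0, 100.0, size=(68, 2))  # 68 (x, y) landmarks
feats = shape_indexed_features(img, shape, idx, off)
```

Because the offsets are stored relative to landmarks, the same (idx, off) parameters produce comparable features for faces at different positions and poses.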
As in the intensity case, we use the difference between the depth values of two pixels as a feature. To make it invariant under pose variation, we compensate the value of each pixel according to the pose. If we approximate the human face by a plane, the plane can be expressed as

Xβ = z, (2)

where X = [1, x, y]^T is the location of the pixel, β is the parameter vector of the plane, and z is the depth of the pixel. Suppose X = [x_1, x_2, ..., x_m]^T, where each x_k = [1, x_k, y_k]^T is a landmark on the depth image, and z is the vector of corresponding depth values. Then their relationship can be expressed as

Xβ = z. (3)
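A sketch of this least-squares plane fit and the depth compensation it enables (using numpy's lstsq, which is numerically equivalent to the explicit normal equation here; α and the toy inputs are illustrative):

```python
import numpy as np

def fit_face_plane(landmarks, depths):
    """Solve X beta = z in the least-squares sense, i.e.
    beta = (X^T X)^{-1} X^T z with rows x_k = [1, x_k, y_k]."""
    X = np.column_stack([np.ones(len(landmarks)), landmarks[:, 0], landmarks[:, 1]])
    beta, *_ = np.linalg.lstsq(X, depths, rcond=None)  # safer than an explicit inverse
    return beta

def compensate_depth(x, y, z, beta, alpha=0.8):
    """Compensated depth z' = z - alpha * [1, x, y] @ beta.
    alpha < 1 attenuates the correction, since the estimated pose
    (and hence the fitted plane) may be inaccurate."""
    return z - alpha * (beta[0] + beta[1] * x + beta[2] * y)

# toy check: landmarks lying exactly on the plane z = 2 + 0.5x - 0.25y
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 3.0]])
z = 2.0 + 0.5 * pts[:, 0] - 0.25 * pts[:, 1]
beta = fit_face_plane(pts, z)
```

With alpha = 1 the compensated depth of any on-plane point is zero, so for a roughly planar face the features measure deviation from the face plane rather than absolute distance to the camera.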

This linear system can be solved by linear regression; using the normal equation, the result is

β = (X^T X)^{-1} X^T z. (4)

Having obtained the parameter β of the face plane, we compensate the depth value of a pixel by

z' = z - αXβ, (5)

where z is the original depth value, z' is the compensated depth value, and X = [1, x, y]^T is the location of the pixel. We include an attenuation parameter α in the compensation because the estimated face pose may not be very accurate. The estimation of β is an iterative process: as the landmark estimates become more accurate during face alignment, the estimate of β also becomes more accurate, which in turn contributes to the estimation of the landmarks.

Feature Selection and Fusion. Using the feature extraction methods described above, we can randomly generate (by randomly selecting a landmark and randomly choosing an offset) a huge feature pool for each training image, containing both intensity features and depth features. We use the correlation-based (Pearson correlation) feature selection method proposed in [4],

j_opt = arg max_j corr(Yv, X_j), (6)

where X is a matrix representing all randomly generated features of all the training data in the current external stage, X_j is a column vector representing the j-th feature across all the training data, Y is a matrix in which each row is the disparity between the current estimate of the landmark positions and the true landmark positions, and v is a vector drawn from a unit Gaussian that projects Y into a vector. Note that v is necessary so that the correlation can be computed.

Fig. 2. Feature fusion. A fern can contain both intensity features and depth features at the same time.

In our model, each stage contains K internal stages [4]. In the training of each internal stage (a fern [10]), we use the feature selection method described above to select F features for that fern. Both intensity features and depth features are considered, so one fern can use intensity features and depth features at the same time, which effectively combines the best parts of the two feature sources. Fig. 2 gives an example of such a fern. Furthermore, we use a parameter ρ to control the ratio between intensity features and depth features, and use cross validation to select the best value of ρ.

2.4 Initial Estimation

An accurate initial estimate of the facial landmark positions can contribute greatly to face alignment. We run k-means on the training data to obtain representative initial estimates of the landmark positions, and use multiple initializations [4] at test time. In a tracking scenario, a reasonable initial estimate is the face alignment result on the previous frame. However, using that estimate directly leads to model drift, since the model is trained with the particular initial estimates obtained by k-means as described above. To solve the drift problem, we use the Procrustes method [8] to align the initial estimates to the face alignment result on the previous frame, and then use those transformed initial estimates.

3 Experiments

We conduct experiments on two datasets to evaluate our model. The shape root-mean-square (RMS) error normalized w.r.t. the inter-ocular distance of the face is used in the evaluation. In each experiment, we train three models: intensity, depth, and intensity-depth. The only difference among the three models is the features: the depth model uses only depth features, the intensity model uses only intensity features, and the intensity-depth model may use both. All parameters are determined by cross validation on the training data. To make the comparison fair, although the intensity-depth model may use both kinds of features, the total number of features each of the three models is allowed to use is equal.
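The evaluation metric above, shape RMS error normalized by the inter-ocular distance, can be sketched as follows (the indexing of the eye centers is an assumption of this sketch, not the paper's landmark scheme):

```python
import numpy as np

def normalized_rms_error(pred, gt, left_eye_idx, right_eye_idx):
    """RMS of per-landmark distances between predicted and ground-truth
    shapes, divided by the ground-truth inter-ocular distance."""
    per_landmark = np.linalg.norm(pred - gt, axis=1)   # (N,) point-to-point distances
    rms = np.sqrt(np.mean(per_landmark ** 2))
    inter_ocular = np.linalg.norm(gt[left_eye_idx] - gt[right_eye_idx])
    return rms / inter_ocular

# toy usage: every landmark off by 1 pixel, eyes 100 pixels apart
gt = np.array([[0.0, 0.0], [100.0, 0.0], [50.0, 50.0]])
pred = gt + np.array([1.0, 0.0])
err = normalized_rms_error(pred, gt, left_eye_idx=0, right_eye_idx=1)
```

Normalizing by the inter-ocular distance makes errors comparable across faces of different sizes, which is why the curves in Figs. 3 and 5 can pool images from different subjects and distances.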
3.1 FRGC

The FRGC (Face Recognition Grand Challenge) Version 2.0 database [11] was originally designed for face recognition, but researchers [12] [13] have annotated its face images with 68 landmarks. It consists of 4,950 face images with color and depth information. We conduct this experiment in a similar manner (with minor differences) to [6], so that the results can be compared.

From Fig. 3, we can see that the intensity model and the intensity-depth model perform on the same level, and both perform much better than the depth model. We also observe that our model performs better than that of [6]. The average landmark errors of our intensity, depth and intensity-depth models are 0.0211, 0.0408 and 0.0209, respectively.

Fig. 3. Landmark error curves of our three models and 3D CLM [6] on FRGC (proportion of images vs. normalized landmark error).

3.2 LIDF

The experiment on FRGC validates our model. However, the improvement from combining intensity and depth features is very small. We believe this is due to performance saturation: faces in FRGC are almost frontal and not challenging enough. To better evaluate the effect of combining intensity and depth features, we created a more difficult dataset.

Our labeled infrared-depth face (LIDF) dataset was acquired in a lab environment. It consists of 17 subjects (10 males and 7 females); each subject has 9 poses, and each pose has 6 expressions. The infrared and depth images are captured simultaneously and are perfectly aligned (using a Microsoft Kinect One), giving 918 infrared-depth images in total. Manually labeled 15-point landmarks are provided. The dataset is publicly available (3). We use the images of the first 7 male subjects and the first 4 female subjects as the training set, and the remaining images as the testing set.

Fig. 4. Some images in LIDF.

(3) Available soon on http://bcmi.sjtu.edu.cn/resource.html

Fig. 5. Landmark error curves of our three models on LIDF (proportion of images vs. normalized landmark error).

From Fig. 5, we can see that the intensity-depth model performs notably better than both the intensity model and the depth model. The average landmark errors of our intensity, depth and intensity-depth models are 0.0329, 0.0380 and 0.0305, respectively. Some alignment results obtained by our method are shown in Fig. 6.

Fig. 6. Some alignment results of our intensity-depth model (first row: FRGC; second row: LIDF).

4 Conclusion

We proposed a face alignment model for intensity-depth images based on the cascade regression model. The depth features were made robust against pose variation, and intensity and depth features were effectively combined during feature selection. Our intensity-depth model achieved a 0.9% error reduction on FRGC and a 7.3% error reduction on LIDF over the intensity model, which indicates that higher performance can be achieved by combining the intensity image and the depth image.
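The relative error reductions quoted in the conclusion follow directly from the average errors reported in Section 3:

```python
# average normalized landmark errors reported for the intensity and
# intensity-depth models on each dataset
frgc = {"intensity": 0.0211, "intensity_depth": 0.0209}
lidf = {"intensity": 0.0329, "intensity_depth": 0.0305}

def reduction(errors):
    """Relative error reduction of the intensity-depth model over the
    intensity-only model."""
    return (errors["intensity"] - errors["intensity_depth"]) / errors["intensity"]

frgc_red = reduction(frgc)  # about 0.009, the quoted 0.9%
lidf_red = reduction(lidf)  # about 0.073, the quoted 7.3%
```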

Acknowledgments. This work was partially supported by the National Natural Science Foundation of China (Grant No. 61272248), the National Basic Research Program of China (Grant No. 2013CB329401), and the Science and Technology Commission of Shanghai Municipality (Grant No. 13511500200).

References

1. Asthana, A., Zafeiriou, S., Cheng, S., Pantic, M.: Incremental face alignment in the wild. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1859-1866. IEEE (2014)
2. Baltrusaitis, T., Robinson, P., Morency, L.: 3D constrained local model for rigid and non-rigid facial tracking. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2610-2617. IEEE (2012)
3. Burgos-Artizzu, X.P., Perona, P., Dollár, P.: Robust face landmark estimation under occlusion. In: IEEE International Conference on Computer Vision (ICCV), pp. 1513-1520. IEEE (2013)
4. Cao, X., Wei, Y., Wen, F., Sun, J.: Face alignment by explicit shape regression. International Journal of Computer Vision (IJCV) 107(2), 177-190 (2014)
5. Chen, D., Ren, S., Wei, Y., Cao, X., Sun, J.: Joint cascade face detection and alignment. In: European Conference on Computer Vision (ECCV), pp. 109-122. Springer (2014)
6. Cheng, S., Zafeiriou, S., Asthana, A., Pantic, M.: 3D facial geometric features for constrained local model. In: IEEE International Conference on Image Processing (ICIP). IEEE (2014)
7. Cootes, T.F., Edwards, G.J., Taylor, C.J.: Active appearance models. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 23(6), 681-685 (2001)
8. Cootes, T.F., Taylor, C.J., Cooper, D.H., Graham, J.: Active shape models - their training and application. Computer Vision and Image Understanding (CVIU) 61(1), 38-59 (1995)
9. Cristinacce, D., Cootes, T.F.: Feature detection and tracking with constrained local models. In: British Machine Vision Conference (BMVC), vol. 2, p. 6 (2006)
10. Dollár, P., Welinder, P., Perona, P.: Cascaded pose regression. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1078-1085. IEEE (2010)
11. Phillips, P.J., Flynn, P.J., Scruggs, T., Bowyer, K.W., Chang, J., Hoffman, K., Marques, J., Min, J., Worek, W.: Overview of the face recognition grand challenge. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, pp. 947-954. IEEE (2005)
12. Sagonas, C., Tzimiropoulos, G., Zafeiriou, S., Pantic, M.: 300 faces in-the-wild challenge: The first facial landmark localization challenge. In: IEEE International Conference on Computer Vision Workshops (ICCVW), pp. 397-403. IEEE (2013)
13. Sagonas, C., Tzimiropoulos, G., Zafeiriou, S., Pantic, M.: A semi-automatic methodology for facial landmark annotation. In: IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 896-903. IEEE (2013)
14. Sun, Y., Wang, X., Tang, X.: Deep convolutional network cascade for facial point detection. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3476-3483. IEEE (2013)