arxiv: v1 [cs.cv] 13 Jul 2015

Unconstrained Facial Landmark Localization with Backbone-Branches Fully-Convolutional Networks

Zhujin Liang, Shengyong Ding, Liang Lin
Sun Yat-Sen University, Guangzhou Higher Education Mega Center, Guangzhou 6, PR China
alfredtofu@gmail.com, marcding@63.com, linliang@ieee.org

Abstract

This paper investigates how to rapidly and accurately localize facial landmarks in unconstrained, cluttered environments rather than in well-segmented face images. We present a novel Backbone-Branches Fully-Convolutional Neural Network (BB-FCN), which produces facial landmark response maps directly from raw images without relying on pre-processing or sliding-window approaches. BB-FCN contains one backbone and a number of network branches, each corresponding to one landmark type, and it operates in a progressive manner. Specifically, the backbone roughly detects the locations of facial landmarks by taking the whole image as input, and the branches further refine the localizations based on a local observation from the backbone's intermediate feature map. Moreover, our backbone-branches architecture does not contain fully connected layers for location regression, leading to efficient learning and inference. Our extensive experiments show that our model outperforms other state-of-the-art methods under both the constrained (i.e., with face regions) and the "in the wild" scenarios.

1. Introduction

Localizing facial landmarks plays a critical role in face recognition and also benefits a range of face-based applications such as face hallucination [2] and person verification [4]. Most existing methods for facial landmark estimation are developed in a controlled context, e.g., the face regions are well segmented as a pre-processing step. Such a setting has drawbacks when dealing with "in the wild" images (e.g., cluttered surveillance scenes), where automated face detection is not always reliable.
This work aims at the task of unconstrained facial landmark estimation, i.e., how to rapidly and accurately localize facial landmarks in real-world, cluttered environments (see Figure 1(a) for an example). Specifically, we consider the following challenges in developing such a system.

Figure 1. Facial landmark predictions from unconstrained environments. (a) Two cluttered images including an unknown number of faces. (b) The dense response maps generated by our approach, where the color represents the different types of landmarks.

First, faces show large appearance and structure variations in an unconstrained scene, caused by diverse views, head poses and expressions as well as facial accessories (e.g., glasses and hats) and aging. Traditional global models may therefore not work well, because the usual assumptions (e.g., certain spatial layouts) do not hold. Second, the search space of facial landmarks is very large when both the number and the scale of the faces are unknown. Partial occlusions and overlaps between nearby faces are two further difficulties that cannot be ignored. It is thus infeasible to handle our task with existing deformable part-based models and exhaustive image pyramid searching.

To overcome the above issues, we present a novel deep model based on a convolutional neural network (CNN),

which produces facial landmark response maps directly from raw images without relying on any pre-processing or feature engineering. Two typical results generated by our approach are shown in Figure 1(b).

Besides achieving outstanding performance for image classification [] [6] [7], deep convolutional neural network models have demonstrated their effectiveness in object detection and localization [6]. These models usually take an image patch as input and output a parameterized object location by regression. For example, Sun et al. [] proposed to detect landmarks in face images with a three-level cascaded CNN framework, in which each landmark's location is gradually predicted via fully connected layers. Despite substantial progress, it is complicated for these models to jointly handle classification (i.e., whether a landmark exists) and localization. Thus, applying the method in [] to an unconstrained scene requires an exhaustive sliding-window search over the image.

Very recently, Long et al. [2] presented a fully convolutional network (FCN), which takes input of arbitrary size and produces a correspondingly sized dense label map, and showed convincing results for semantic image segmentation. Notably, classification and localization can be obtained simultaneously from the dense label map. The FCN also supports efficient inference by sharing convolutions among overlapping image patches. The success of this work inspires us to adopt the FCN in our task, i.e., producing pixelwise facial landmark predictions. Nevertheless, we develop a specialized architecture, as our task requires more accurate prediction than general image labeling. Considering both computational efficiency and localization accuracy, we pose facial landmark estimation as a coarse-to-fine filtering process.
In particular, the locations of facial landmarks are roughly detected in the global context and then refined by observing local regions. To this end, we introduce a novel architecture of fully convolutional networks that directly follows this coarse-to-fine pipeline. Specifically, our architecture contains one backbone and a number of network branches, each corresponding to one landmark type, and it operates in a progressive manner. First, taking the whole image as input, the network backbone, in which several convolutional layers and max-pooling layers are sequentially stacked, handles all of the facial landmarks together and generates a coarse, low-resolution response map. Then, for each type of facial landmark, the network branch takes a local observation from the intermediate feature map of the backbone and produces a fine and accurate prediction using only convolutional layers. We thus call our architecture the Backbone-Branches Fully-Convolutional Network (BB-FCN).

Our BB-FCN has three important properties for handling the challenges of unconstrained facial landmark localization. i) It is trained end-to-end, pixels-to-pixels, without requiring extra supervision. ii) It does not rely on fully connected layers for accurate localization regression, leading to efficient learning and inference. iii) It naturally combines global and local information, in accordance with the human perception process.

We extensively evaluate BB-FCN on several standard benchmarks (e.g., AFW [26], AFLW [9]), and our experiments show that BB-FCN outperforms other state-of-the-art methods under both the constrained (i.e., with face regions) and the "in the wild" scenarios. In particular, our BB-FCN decreases the average mean error of the current state of the art from 8.2% to 7.6% on AFW and from 8% to 7.% on AFLW.

The rest of the paper is organized as follows. Section II presents a brief review of related work.
Section III introduces the main pipeline of our BB-FCN, followed by a discussion of network implementation and optimization in Section IV. The experimental results, comparisons and component analysis are presented in Section V. Section VI concludes the paper.

2. Related Work

Facial landmark localization has long been studied due to its critical role in many applications. Generally speaking, most methods are based on local detectors and global constraints. Local detectors provide evidence that a part exists and are implemented by classifiers with hand-crafted features. For example, Belhumeur et al. [1] use SVMs with SIFT features as local detectors, and Liang et al. [] use AdaBoost on Haar wavelet features. For global constraints, there are several ways to model the relationship between parts. For example, Zhu et al. [26] apply a tree-based deformable part model to encode the constraints, which achieves good performance. Valstar et al. model the constraints as a Markov random field [9] to speed up the process and improve the robustness of the algorithm. Cao et al. [3] use the whole face region as input and random ferns as the regressor, with the shapes to be predicted expressed as linear combinations of training shapes. All these methods have in common that they use hand-crafted features (e.g., HoG features). Compared with learned features (as in our work), hand-crafted features have poor generalization performance and discriminative power.

Recently, significant progress on facial landmark detection has been achieved. Deep models, such as Convolutional Neural Networks (CNNs), Deep Auto-encoders (DAEs) and Restricted Boltzmann Machines (RBMs), play a vital role in this progress, and most of the works are based on regression. In the regression-based methods, the problem of facial landmark detection is formulated as a regression task

and a holistic regressor is used to jointly compute the landmark coordinates. Sun et al. [] propose a cascaded regression method for facial landmark detection which includes three levels of carefully designed convolutional networks. Further, Zhou et al. [2] propose a four-level convolutional network cascade. Zhang et al. [24] try to optimize facial landmark detection jointly with related tasks such as pose estimation and facial expression analysis. Zhang et al. [23] propose a Coarse-to-Fine Auto-Encoder Networks method for facial landmark detection.

[Figure 2: diagram of the backbone network (stacked convolution and 2x2 max-pooling layers, followed by convolution layers) and of a branch network (convolution layers only). The branch input is cropped from a backbone feature map around the maximum response position found in each channel of the coarse response map.]
Figure 2. Landmark localization as a progressive filtering process with two stages. The backbone network first generates low-resolution response maps that identify the rough locations with a large stride. The branch network then produces fine response maps with a small stride for accurate landmark detection. There are five branches corresponding to five facial landmarks, and each branch refines the response map separately.

Our method differs from most existing deep models in how we model the outputs. Previous work models the output as the landmark locations, whereas we model the output as a response map, which does not require fully connected layers. The works most similar to ours are [8] [7]. In [7], Tompson et al. propose to use heat maps to detect body joints with carefully designed dropout; they rely on segmented images and assume there is exactly one target. We generalize the response-map-based model to the face landmark localization problem with the standard SGD algorithm.
In the experiments, we quantitatively compare the performance of a regression-based model with ours to demonstrate the effectiveness of the response map model.

3. The Proposed BB-FCN Architecture

We aim at localizing all the facial landmarks in unconstrained images. Suppose there are $K$ part types (e.g., eyes, nose, mouth, etc.), and let $L_i^k = (x_i^k, y_i^k, s_i^k)$ denote the location of the $i$-th instance of part $k$ in the image $I$, where $x_i^k$, $y_i^k$ and $s_i^k$ represent the coordinates and scale of the detected part respectively. Our task can then be defined as follows, with $k = 1, 2, \ldots, K$:

$$\mathrm{Det}(I) = \{(x_i^k, y_i^k, s_i^k)\} \qquad (1)$$

Note that the number of instances may differ between parts due to pose variation and occlusion. Unlike the existing approaches that predict the landmarks by regressors, we address this problem with a backbone-branches fully convolutional network model: the backbone network generates coarse response maps for rough location inference, and the branch networks produce fine response maps for accurate location refinement. Before going into the details of our architecture, we first explain our model from an intuitive perspective, i.e., a filtering perspective, to reveal the key difference between our

model and the regression-based model. In a nutshell, the response map of our model can be seen as the output of a filtering function, where high values represent high confidence in the presence of a landmark. Let $F_{W^k}(P)$ denote the filtering function, parameterised by $W^k$, for part $k$, defined on a patch $P$ of size $w \times h$. An ideal filtering function should have the following property: patches containing the target part should have strong responses, while patches without that part should have weak responses. We model the response as decreasing exponentially with the distance $r$ between the part and the center of the patch, as below, where $\beta$ controls the decay. In this paper, $\beta$ takes one fixed value in the backbone network and another in the branch network.

$$F_{W^k}(P) = \begin{cases} e^{-\beta r} & \text{if } P \text{ contains part } k; \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$

For an input image $I$, applying this function in a sliding-window manner with stride $\delta$ generates a response map $F_{W^k} \ast I$ whose values are given by:

$$(F_{W^k} \ast I)(x, y) = F_{W^k}\big(I(x\delta,\; y\delta,\; x\delta + w,\; y\delta + h)\big) \qquad (3)$$

Here, $I(x\delta, y\delta, x\delta + w, y\delta + h)$ stands for the patch of size $w \times h$ starting at $(x\delta, y\delta)$ in image $I$. With this response map, a simple landmark localization approach for part $k$ can be formulated as below, where $\theta$ denotes a threshold:

$$\mathrm{Det}(I) = \{(x_i\delta + w/2,\; y_i\delta + h/2,\; 1) \mid (F_{W^k} \ast I)(x_i, y_i) > \theta\} \qquad (4)$$

Of course, to achieve better results, we need to detect the landmarks across a set of scales and suppress the non-maximum values, as typical detection approaches do; this is discussed in later sections.

According to equation (3), there is a trade-off between the magnitude of the localization error and the computational cost. To achieve high accuracy, the stride $\delta$ should be as small as possible. To speed up detection, however, the stride must be enlarged, which results in a low-resolution response map. This inspires us to apply a two-stage filtering process that localizes the landmark progressively.
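To make equations (2) to (4) concrete, the following minimal NumPy sketch evaluates the ideal filtering function on a sliding window and thresholds the result. It is an illustration under stated assumptions, not the authors' implementation: the function names `response_map` and `detect`, the image size, the window size, the stride, β and θ are all hypothetical choices for this example, and the filter is the ideal one defined above rather than a learned network.

```python
import numpy as np

def response_map(image_hw, part_xy, w, h, stride, beta):
    """Ideal response map of equations (2)-(3) for one landmark at part_xy = (px, py).

    All argument values used with this function are illustrative assumptions.
    """
    H, W = image_hw
    px, py = part_xy
    nx = (W - w) // stride + 1          # number of window positions along x
    ny = (H - h) // stride + 1          # number of window positions along y
    resp = np.zeros((ny, nx))
    for y in range(ny):
        for x in range(nx):
            x0, y0 = x * stride, y * stride
            # Equation (2): non-zero only if the window contains the part.
            if x0 <= px < x0 + w and y0 <= py < y0 + h:
                r = np.hypot(px - (x0 + w / 2), py - (y0 + h / 2))
                resp[y, x] = np.exp(-beta * r)
    return resp

def detect(resp, w, h, stride, theta):
    """Equation (4): window centers whose response exceeds the threshold theta."""
    ys, xs = np.nonzero(resp > theta)
    return [(x * stride + w / 2, y * stride + h / 2) for x, y in zip(xs, ys)]

# Hypothetical example: a 64x64 image with one landmark at (32, 32).
resp = response_map((64, 64), (32, 32), w=16, h=16, stride=8, beta=0.1)
detections = detect(resp, w=16, h=16, stride=8, theta=0.9)
```

With a larger stride the map has fewer cells and the recovered position is quantized to multiples of δ, which is the accuracy versus cost trade-off described above.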
In the two-stage process, the first filtering stage generates a coarse response map with a relatively large stride, identifying the rough locations of the landmark. We then apply a second filtering stage to the local patches centered at the estimated landmark to obtain a fine response map for accurate landmark localization. This two-stage strategy enables us to detect landmarks at high speed without giving up localization accuracy.

3.1. Coarse-to-Fine Localization by BB-FCN

In this part, we show that the coarse-to-fine strategy can be implemented by our proposed BB-FCN architecture. The key observation is that a fully convolutional network can model a filtering process equivalently. The output of a fully convolutional network for an input image is just the result of a deep architecture applied to the sliding windows of that input. In the context of fully convolutional networks, the sliding window is called a receptive field. Figure 3 illustrates the relationship between the receptive fields and the outputs.

Figure 3. An example of a receptive field. There are two convolution layers with kernels of x and 7x7 respectively. The receptive field of a 1x1 response in layer 3 is a patch of size 7x7 in layer 2, and the receptive field of the 7x7 patch in layer 2 is a patch of size x in layer 1.

We therefore design a deep architecture with one backbone and a number of branches for efficient and accurate landmark localization. The backbone produces a coarse response map for each part at the top layer. It uses a relatively large receptive field so that it can exploit global texture to ensure the quality of the coarse response map. With the rough locations provided by the coarse map, we then apply another network, i.e., one branch per part, to refine the response map. In order to share features, each branch takes patches from intermediate features of the backbone as input.
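The receptive-field relationship illustrated by Figure 3 can be checked with a few lines of arithmetic. The sketch below uses the standard recurrence for stacked layers (the field grows by (kernel - 1) times the accumulated stride, and the accumulated stride is multiplied by each layer's stride). The function name `receptive_field` and the kernel sizes in the examples are assumptions chosen for illustration, since the exact kernel sizes are not legible in this copy.

```python
def receptive_field(layers):
    """Return (field_size, effective_stride) of one output unit.

    layers: list of (kernel_size, stride) tuples, ordered from input to output.
    Padding is ignored because it shifts the field but does not change its size.
    """
    size, jump = 1, 1
    for kernel, stride in layers:
        size += (kernel - 1) * jump   # each layer widens the field
        jump *= stride                # pooling/striding spreads the output units apart
    return size, jump

# Two stride-1 convolutions (hypothetical 3x3 followed by 7x7):
two_convs = receptive_field([(3, 1), (7, 1)])

# Three hypothetical (3x3 conv, 2x2 max-pool) pairs, as in a backbone with three poolings:
backbone_like = receptive_field([(3, 1), (2, 2)] * 3)
```

In the second example the effective stride is 8, which shows why a backbone with three 2x2 poolings yields a coarse response map, while a branch built only from stride-1 convolutions keeps an effective stride of 1.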
The branch input is then convolved by a set of filters without any pooling to generate a fine-grained response map.

Let $W_c$ denote the parameters of the backbone network and $H_k(R; W_c)$ denote the predicted response map of input $R$ for part $k$. We train the backbone network with the following loss function:

$$L_1(R; W_c) = \sum_{k=1}^{K} \left\| H_k(R; W_c) - H_k^{*}(R) \right\|^2 \qquad (5)$$

where $H_k^{*}(R)$ denotes the ground-truth response map. This ground-truth response map is defined according to equation (2), with the receptive field $\mathrm{rec\_field}(x, y)$ acting as the patch $P$:

$$H_k^{*}(x, y) = \begin{cases} e^{-\beta r} & \text{if } \mathrm{rec\_field}(x, y) \text{ contains part } k; \\ 0 & \text{otherwise} \end{cases} \qquad (6)$$
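As a rough illustration of equations (5) and (6), the sketch below builds ground-truth response maps for K parts and evaluates the summed squared error against a prediction. It is a sketch under assumptions rather than the training code of the paper: the names `ground_truth_maps` and `backbone_loss` are hypothetical, the receptive field of output cell (x, y) is assumed to be the square patch of side `field` starting at (x * stride, y * stride), and the numeric values are arbitrary.

```python
import numpy as np

def ground_truth_maps(parts, map_hw, stride, field, beta):
    """Equation (6) for K parts.

    parts: list of K entries, each (px, py) in image pixels, or None if the part is absent.
    Assumes rec_field(x, y) is the field x field patch starting at (x*stride, y*stride).
    """
    h, w = map_hw
    ys, xs = np.mgrid[0:h, 0:w]
    x0, y0 = xs * stride, ys * stride
    cx, cy = x0 + field / 2.0, y0 + field / 2.0      # receptive-field centers
    maps = np.zeros((len(parts), h, w))
    for k, part in enumerate(parts):
        if part is None:
            continue                                  # absent part: all-zero target
        px, py = part
        inside = (x0 <= px) & (px < x0 + field) & (y0 <= py) & (py < y0 + field)
        r = np.hypot(px - cx, py - cy)
        maps[k] = np.where(inside, np.exp(-beta * r), 0.0)
    return maps

def backbone_loss(pred, target):
    """Equation (5): squared L2 distance summed over all K response maps."""
    return float(np.sum((pred - target) ** 2))

# Hypothetical example: two part types, the second one occluded (absent).
target = ground_truth_maps([(20, 20), None], map_hw=(4, 4), stride=8, field=16, beta=0.1)
loss = backbone_loss(np.zeros_like(target), target)
```

Because absent parts simply produce an all-zero target map, the same loss covers both the existence of a landmark and its location, which is the property that distinguishes the response-map formulation from coordinate regression.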

The branch networks are trained in almost the same way as the backbone, except that each branch takes patches of the backbone's intermediate features as input rather than image patches. Let $W_f^k$ denote the parameters of the branch network for part $k$, and let $H(P; W_f^k)$ and $H_k^{*}(P)$ denote the predicted and ground-truth response maps of patch $P$ respectively. The loss function of this network is again defined as follows, with $H_k^{*}(P)$ set as in equation (2):

$$L_2(P; W_f^k) = \left\| H(P; W_f^k) - H_k^{*}(P) \right\|^2 \qquad (7)$$

Figure 4. Example images from AFW and AFLW and the results of facial landmark detection, where the color represents the different types of landmarks. This figure is best viewed in the electronic edition.

3.2. Non-maximum Suppression over Scales

As the network is trained at a fixed scale, a test image must be rescaled to a set of images so that each landmark can appear at the trained scale. We thus obtain, for each test image and each part, a pyramid of response maps. We first apply non-maximum suppression to localize the maximum point in the pyramid of coarse response maps. Given this maximum point with its scale and location, we obtain its fine-grained response map by feeding the patches centered at the predicted locations to the fine network. We then take the maximum point in this response map as the final detected landmark.

4. Implementation

In this section, we describe the detailed network architecture of our implementation. Figure 2 shows the architecture of our network, which consists of a backbone and a number of branches.

Backbone Network: The backbone is designed to produce coarse response maps for rough location estimation at high speed. As the rough locations are the foundation of the fine locations, this network uses several pooling layers to exploit global context. More specifically, this network is trained by 6x6 inputs ranging from. to.8 times face size. The coarse response maps are produced by a set of stacked convolution and pooling layers, as depicted in Figure 2. The first six layers contain three convolutional layers, each taking the pooled output of the previous layer as input. For simplicity, these six layers are drawn as three layers in Figure 2. The last three layers are all convolutional layers without any pooling. Note that the final convolutional layer outputs our desired response map, with each channel corresponding to one part. In summary, the kernel sizes of the convolutional layers are x, x, x, 9x9, x and x, and the feature map sizes are 32x6x6, 32x8x8, 32x4x4, 28x2x2, 64x2x2 and x2x2 respectively.

[Figure 5: bar charts of mean error (%) per landmark (LE, RE, N, LM, RM) and averaged over landmarks, for TSPM, ESR, CDM, Luxand, RCPR, SDM, TCDCN and our method.]
Figure 5. Comparison of our method with other methods on the AFW and AFLW datasets. The top row shows the results on AFW, with the right column showing the mean error averaged over the different parts, and the bottom row shows the results on AFLW.

Branch Network: The branch network is designed to generate a fine response map for precise landmark prediction. We use one branch per part, so there are five branches in total, and all branches share the same architecture. Since we do not want to introduce any pooling operations, each branch is attached to the first convolutional layer of the backbone and takes the patch of size 64x64 centered at the predicted location as input. This input is then convolved by four convolutional layers without any pooling operation, so that the output has high resolution for accurate localization. Attaching the branch to a backbone convolutional layer also allows the two networks to benefit each other, as they share common feature maps. As depicted in Figure 2, the kernel sizes of these four convolutional layers are x, 7x7, 9x9 and x, and the feature map sizes are 6x64x64, 6x64x64, 6x64x64 and x64x64 respectively.

We implement our model in Caffe [8], an open-source framework for deep learning. Thanks to Caffe's flexible configuration, the two networks are integrated into one model, so we do not need to invoke the forward propagation of the two networks separately. The model is trained and tested on a host equipped with a Titan Black GPU of 6G memory. The training phase takes about 2 hours and.8G memory with 64 samples per mini-batch.
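The multi-scale test procedure of Section 3.2 (take the strongest coarse response over the pyramid, then refine it with the fine response map) can be sketched as follows. This is a simplified illustration for a single landmark type and a single face, under assumptions that are not stated in the paper: the helper names `coarse_peak` and `refine` are hypothetical, each coarse cell is mapped to the center of a receptive field of side `field`, and the fine map is assumed to have stride 1 and to be centered on the coarse estimate.

```python
import numpy as np

def coarse_peak(pyramid, stride, field):
    """Strongest coarse response over all scales.

    pyramid: list of (scale, response_map) pairs for one landmark type.
    Returns (score, x, y, scale) with (x, y) in original-image pixels.
    """
    best = None
    for scale, resp in pyramid:
        iy, ix = np.unravel_index(np.argmax(resp), resp.shape)
        score = float(resp[iy, ix])
        if best is None or score > best[0]:
            # Map the cell to the receptive-field center, then undo the rescaling.
            x = (ix * stride + field / 2.0) / scale
            y = (iy * stride + field / 2.0) / scale
            best = (score, x, y, scale)
    return best

def refine(fine_map, coarse_xy, scale):
    """Shift the coarse estimate by the offset of the fine-map maximum."""
    iy, ix = np.unravel_index(np.argmax(fine_map), fine_map.shape)
    cy, cx = fine_map.shape[0] // 2, fine_map.shape[1] // 2
    return (coarse_xy[0] + (ix - cx) / scale,
            coarse_xy[1] + (iy - cy) / scale)

# Hypothetical two-level pyramid and a hypothetical 5x5 fine response map.
a = np.zeros((3, 3)); a[1, 2] = 0.5
b = np.zeros((2, 2)); b[1, 0] = 0.9
score, x, y, scale = coarse_peak([(1.0, a), (0.5, b)], stride=8, field=16)
fine = np.zeros((5, 5)); fine[1, 3] = 1.0
final_xy = refine(fine, (x, y), scale)
```

A full detector for images with several faces would keep every local maximum above a threshold rather than a single global maximum; the single-peak version is used here only to keep the coordinate bookkeeping visible.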
For testing, each constrained image takes only about 8.8 ms, and each unconstrained image of size 4x4 takes about 7ms for 4 scales, which are 778x778, 482x482, 23x23, 29x29, 88x88, 7x7, 96x96, 497x497, 44x44, 34x34, 288x288, 24x24, 2x2 and 67x67.

5. Experiment

Dataset: We create our dataset for training and validation from three sources: (1) 737 face images (637 for training, for validation) collected from the web and manually labelled with five facial landmarks (left eye, right eye, nose, left lip corner and right lip corner), (2) 99 images (7998 for training, 97 for validation) randomly selected from AFLW [9], and (3) 67 natural scene images (28 for training, 43 for validation) without people from the INRIAPerson database [4] as negative examples. In total, there are 33 samples for training and 26 samples for validation.

For evaluation we use two challenging public datasets, AFW [26] and AFLW (Figure 4 shows samples from these two datasets together with our detection results). There is no overlap among the training, validation and evaluation sets. The images in AFW and AFLW are collected in the wild, which makes for a more challenging scenario than other datasets (e.g., XM2VTS [3]). Moreover, these two datasets differ from others (e.g., LFPW [1]) in that they annotate multiple, non-frontal faces in a single image. The AFW dataset consists of 2 images with 468 faces. The evaluation images of AFLW are the same as in [24], which randomly selects 3 faces from AFLW, 39% of which are non-frontal.

Evaluation Metric: The evaluation metric adopted is the mean error, which is measured by the distances between estimated landmarks and the ground truths, normalized with

respect to the inter-ocular distance. It can be formulated as

$$\mathrm{err} = \frac{\sqrt{(x - x')^2 + (y - y')^2}}{l} \qquad (8)$$

where $(x, y)$ and $(x', y')$ are the ground-truth and predicted locations respectively, and $l$ is the inter-ocular distance. In our experiments, we evaluate the performance on five facial landmarks, i.e., LE (left eye), RE (right eye), N (nose), LM (left mouth corner) and RM (right mouth corner), and also report A (the average mean error over the five facial landmarks).

[Figures 6 and 7: precision-recall plots, one panel per landmark plus the average (A), each with curves labelled fine, coarse and regression.]
Figure 6. The average precision of landmarks on AFW. "fine" and "coarse" stand for the network with and without the branch network, and "regression" stands for the regression model trained by us.
Figure 7. The average precision of landmarks on AFLW.

5.1. Comparison with the State-of-the-art Methods

Our method supports both constrained and unconstrained images. In this part, we compare our method with other published results on constrained image datasets, i.e., AFW and AFLW, for which the bounding boxes are available. In this case, the maximum response locations are taken as the landmarks. The compared methods include academic state-of-the-art methods (their results are taken from [24]) as well as commercial software: (1) Robust Cascaded Pose Regression (RCPR) [2]; (2) Tree Structured Part Model (TSPM) [26]; (3) Luxand face SDK; (4) Explicit Shape Regression (ESR) [3]; (5) A Cascaded Deformable Shape Model (CDM) [22]; (6) Supervised Descent Method (SDM) [2]; (7) Tasks-Constrained Deep Convolutional Network (TCDCN) [24].

Our model consistently achieves better performance than the others on these two datasets. On AFW, our average mean error over the five parts is 7.6 percent, a relative improvement of 7.3 percent over the state-of-the-art TCDCN. On AFLW, we achieve 7.
percent average mean error, a 6.2 percent improvement over TCDCN. Figure 5 shows that our method outperforms all the other methods on both AFW and AFLW.

5.2. Performance Under Unconstrained Scenarios

As far as we know, very few facial landmark localizers have been studied in unconstrained contexts. We thus use the precision-recall curve [5] to evaluate the performance of facial landmark detection under unconstrained scenarios. As in object detection, a detected landmark is counted as correct only if there exists a ground-truth landmark of the same type within % of the inter-ocular distance. For fairness, we also implement a baseline method as a fully convolutional network which has almost the same architecture as our backbone network except for the top layer. In our baseline deep model, the top layer is a fully connected layer in which each part corresponds to three outputs, i.e., two for location regression within the receptive field and one for existence classification. Given a threshold value, we can therefore find all the regions that contain the target part and use the predicted locations as the final detected landmarks. Figure 6 and Figure 7 show the PR curves of the different parts. Our method clearly outperforms the regression-based deep model.

Footnote 3: Luxand Incorporated: Luxand face SDK, http://www.luxand.com/

[Figure 8: two bar charts over LE, RE, N, LM, RM and A (mean error (%) for fine and coarse, and relative improvement (%)), plus example images.]
Figure 8. Comparison of the branches model and the backbone network. The left plot shows the average mean error of the five parts, and the middle plot shows the relative improvement, where Relative Improvement = (reduced average error) / (average error of the method in comparison). The right figure shows localized examples, with the top row for the backbone network and the bottom row for the branch network. Solid circles are ground truths and empty ones are predictions.

5.3. Benefit of Branch Networks

Our method relies on two networks to progressively refine the landmark locations. In this part, we evaluate the effect of the branch network by conducting the same experiments under two settings, one with the branch network and one without it. The results in Figure 8 show that the branch network effectively improves the performance of landmark detection. With the branch network, the performance achieves about 3.77% relative improvement.

6. Conclusion

In this work, we proposed a novel Backbone-Branches Fully-Convolutional Network (BB-FCN) that progressively produces prediction maps of facial landmarks in an end-to-end way. Our extensive experiments suggest that BB-FCN achieves very promising results on both the traditional constrained benchmarks and cluttered, real-world scenes.
In the future, we will integrate our BB-FCN model with object recognition and detection systems, where accurate part-based localization can be very helpful for improving performance.

References

[1] P. N. Belhumeur, D. W. Jacobs, D. Kriegman, and N. Kumar. Localizing parts of faces using a consensus of exemplars. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 545–552. IEEE, 2011.
[2] X. P. Burgos-Artizzu, P. Perona, and P. Dollár. Robust face landmark estimation under occlusion. In Computer Vision (ICCV), 2013 IEEE International Conference on, pages 1513–1520. IEEE, 2013.
[3] X. Cao, Y. Wei, F. Wen, and J. Sun. Face alignment by explicit shape regression, Dec. 27 2012. US Patent App. 13/728,584.
[4] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In C. Schmid, S. Soatto, and C. Tomasi, editors, International Conference on Computer Vision & Pattern Recognition, volume 2, pages 886–893, INRIA Rhône-Alpes, ZIRST-655, av. de l'Europe, Montbonnot-38334, June 2005.
[5] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.
[6] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 580–587. IEEE, 2014.
[7] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. arXiv preprint arXiv:1502.01852, 2015.
[8] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. B. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. CoRR, abs/1408.5093, 2014.
[9] M. Koestinger, P. Wohlhart, P. M. Roth, and H. Bischof. Annotated facial landmarks in the wild: A large-scale, real-world database for facial landmark localization. In First IEEE International Workshop on Benchmarking Facial Image Analysis Technologies, 2011.

[10] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[11] L. Liang, R. Xiao, F. Wen, and J. Sun. Face alignment via component-based discriminative search. In Computer Vision – ECCV 2008, pages 72–85. Springer, 2008.
[12] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. arXiv preprint arXiv:1411.4038, 2014.
[13] K. Messer, J. Matas, J. Kittler, J. Luettin, and G. Maitre. XM2VTSDB: The extended M2VTS database. In Second International Conference on Audio and Video-based Biometric Person Authentication, volume 964, pages 965–966. Citeseer, 1999.
[14] Y. Sun, Y. Chen, X. Wang, and X. Tang. Deep learning face representation by joint identification-verification. In Advances in Neural Information Processing Systems, pages 1988–1996, 2014.
[15] Y. Sun, X. Wang, and X. Tang. Deep convolutional network cascade for facial point detection. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 3476–3483. IEEE, 2013.
[16] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. arXiv preprint arXiv:1409.4842, 2014.
[17] J. Tompson, R. Goroshin, A. Jain, Y. LeCun, and C. Bregler. Efficient object localization using convolutional networks. arXiv preprint arXiv:1411.4280, 2014.
[18] J. J. Tompson, A. Jain, Y. LeCun, and C. Bregler. Joint training of a convolutional network and a graphical model for human pose estimation. In Advances in Neural Information Processing Systems, pages 1799–1807, 2014.
[19] M. Valstar, B. Martinez, X. Binefa, and M. Pantic. Facial point detection using boosted regression and graph models. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 2729–2736. IEEE, 2010.
[20] X. Xiong and F. De la Torre. Supervised descent method and its applications to face alignment. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 532–539. IEEE, 2013.
[21] C.-Y. Yang, S. Liu, and M.-H. Yang. Structured face hallucination. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 1099–1106. IEEE, 2013.
[22] X. Yu, J. Huang, S. Zhang, W. Yan, and D. N. Metaxas. Pose-free facial landmark fitting via optimized part mixtures and cascaded deformable shape model. In Computer Vision (ICCV), 2013 IEEE International Conference on, pages 1944–1951. IEEE, 2013.
[23] J. Zhang, S. Shan, M. Kan, and X. Chen. Coarse-to-fine auto-encoder networks (CFAN) for real-time face alignment. In Computer Vision – ECCV 2014, pages 1–16. Springer, 2014.
[24] Z. Zhang, P. Luo, C. C. Loy, and X. Tang. Facial landmark detection by deep multi-task learning. In Computer Vision – ECCV 2014, pages 94–108. Springer, 2014.
[25] E. Zhou, H. Fan, Z. Cao, Y. Jiang, and Q. Yin. Extensive facial landmark localization with coarse-to-fine convolutional network cascade. In Computer Vision Workshops (ICCVW), 2013 IEEE International Conference on, pages 386–391. IEEE, 2013.
[26] X. Zhu and D. Ramanan. Face detection, pose estimation, and landmark localization in the wild. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2879–2886. IEEE, 2012.
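
As a supplementary illustration of the evaluation quantities used in the experiments, the following minimal NumPy sketch computes the inter-ocular-normalized error of Eq. (8), the relative improvement ratio defined with Figure 8, and the correctness test used for the precision-recall evaluation. It is not the authors' evaluation code. The function names, the toy coordinates, the choice of measuring the inter-ocular distance between the ground-truth eye landmarks, the inclusive comparison, and the threshold value are all assumptions made for illustration.

```python
import numpy as np

# Landmark order used throughout this sketch (assumed, not from the paper's code).
LANDMARKS = ("LE", "RE", "N", "LM", "RM")


def normalized_error(gt, pred, inter_ocular):
    """Eq. (8): Euclidean distance between ground truth and prediction,
    divided by the inter-ocular distance l. Works on (..., 2) arrays."""
    gt = np.asarray(gt, dtype=float)
    pred = np.asarray(pred, dtype=float)
    return np.linalg.norm(gt - pred, axis=-1) / inter_ocular


def relative_improvement(baseline_error, new_error):
    """Reduced average error divided by the average error of the
    method in comparison (the baseline)."""
    return (baseline_error - new_error) / baseline_error


def is_correct_detection(gt, pred, inter_ocular, threshold):
    """A detection counts as correct when its normalized error does not
    exceed `threshold` (a fraction of l; inclusive bound is an assumption)."""
    return normalized_error(gt, pred, inter_ocular) <= threshold


# Toy example with made-up pixel coordinates for one face.
gt = np.array([[30, 40], [70, 40], [50, 60], [35, 80], [65, 80]], dtype=float)
pred = gt + np.array([[3, 4], [0, 0], [0, -4], [4, 0], [0, 0]], dtype=float)

# Assumption: l is the distance between the ground-truth eye landmarks.
l = np.linalg.norm(gt[0] - gt[1])

errs = normalized_error(gt, pred, l)
for name, e in zip(LANDMARKS, errs):
    print(f"{name}: {100 * e:.2f}%")
print(f"A (average): {100 * errs.mean():.2f}%")

# Hypothetical threshold of 12% of l, chosen only for this toy example.
print(is_correct_detection(gt, pred, l, threshold=0.12))
```

Sweeping a threshold on the detection score while applying `is_correct_detection` to each surviving detection would yield the points of a precision-recall curve; that sweep is omitted here to keep the sketch short.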