DeMeshNet: Blind Face Inpainting for Deep MeshFace Verification

Shu Zhang, Student Member, IEEE, Ran He, Senior Member, IEEE, Zhenan Sun, Member, IEEE, and Tieniu Tan, Fellow, IEEE

Shu Zhang is with the Center for Research on Intelligent Perception and Computing (CRIPAC), National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences (CASIA), No. 95 ZhongGuanCun East Street, HaiDian District, Beijing, P.R. China. E-mail: shu.zhang@nlpr.ia.ac.cn. Ran He, Zhenan Sun and Tieniu Tan are with the Center for Research on Intelligent Perception and Computing (CRIPAC), National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences (CASIA) and the Center for Excellence in Brain Science and Intelligence Technology (CEBSIT), Chinese Academy of Sciences (CAS). E-mail: {rhe, znsun, tnt}@nlpr.ia.ac.cn. Ran He is the corresponding author. Manuscript received March 8.

Abstract: MeshFace photos have been widely used in many Chinese business organizations to protect ID face photos from being misused. The occlusions incurred by random meshes severely degenerate the performance of face verification systems, which raises the MeshFace verification problem between MeshFaces and daily photos. Previous methods cast this problem as a typical low-level vision problem, i.e., blind inpainting. They recover perceptually pleasing clear ID photos from MeshFaces by enforcing pixel level similarity between the recovered ID images and the ground-truth clear ID images and then perform face verification on them. Essentially, however, face verification is conducted in a compact feature space rather than the image pixel space. Therefore, this paper argues that pixel level similarity and feature level similarity jointly offer the key to improving verification performance. Based on this insight, we propose a novel feature oriented blind face inpainting framework. Specifically, we implement it by establishing a novel DeMeshNet, which consists of three parts. The first part addresses blind inpainting of the MeshFaces by implicitly exploiting extra supervision from the occlusion positions to enforce pixel level similarity. The second part explicitly enforces feature level similarity in the compact feature space, which explores informative supervision from the feature space to produce better inpainting results for verification. The last part copes with face alignment within the net via a customized spatial transformer module when extracting deep facial features. All three parts are implemented within an end-to-end network that facilitates efficient optimization. Extensive experiments on two MeshFace datasets demonstrate the effectiveness of the proposed DeMeshNet as well as the insight of this paper.

Index Terms: MeshFace, face verification, blind inpainting, deep learning, DeMeshNet, spatial transformer.

1 INTRODUCTION

BEING one of the most significant biometric techniques, face recognition has received continuous attention from the computer vision community since as far back as [1]. Compared to other biometric techniques, such as iris recognition [2] and fingerprint identification [3], face recognition can work in a passive manner and is therefore more readily used in many real-world scenarios, including access control systems, social networks, forensics and public security [4], [5], [6], [7]. However, this ease of use comes at a cost: it presents many challenges for the research community to come up with robust systems that can deal with large variations in pose, expression and illumination. Recently, benefiting from advancements in deep representation learning, there have been remarkable improvements in face recognition (verification in particular) performance [8], [9], [10]. Many works have shown that deep learning based solutions have the ability to surpass humans in the task of face verification under uncontrolled environments.
Nevertheless, this does not signal the end of face recognition research. One specific problem, Face Verification Between photos from Identification documents (ID, Citizen Card or Passport Card) and Daily photos (FVBID) [11], has started to gain traction and continues to present new challenges to researchers. FVBID uses face images from digitalized ID photos as the gallery and photos captured in daily life (with cell phones, surveillance cameras, etc.) as the probe. An individual's identity can be easily certified as long as he or she has an ID, so there is no need for registration in advance, which is a typical routine for all biometric systems. Therefore, this technique has quickly become very popular in commercial banks, customs houses, etc. However, when FVBID is applied to real-world scenarios, such as automated customs control and Very Important People (VIP) recognition in commercial banks, the ID photo in an identity card may potentially be misused or illegally distributed. Therefore, ID photos are often deliberately corrupted with mesh-like lines or watermarks by the government for privacy protection when delivered to business organizations such as banks and hotels. For convenience, we denote this type of corrupted ID photo as a MeshFace. As shown in Fig. 1, MeshFaces have a catastrophic influence on face recognition systems [6], [12]. Directly verifying MeshFaces against daily photos leads to very poor accuracy [13]. Therefore, these corruptions raise a novel and challenging problem called MeshFace verification, which deals with face verification between MeshFaces and daily photos.

Some efforts have been made to address this challenging problem. Zhang et al. [13] propose a multi-task residual learning Convolutional Neural Network (CNN) for this problem. They propose to learn a non-linear transformation with a Super-Resolution Convolutional Neural Network (SRCNN) [14] based architecture to recover clear ID photos from MeshFaces. Then, the recovered clear ID photos are used for face verification. They treat the recovery of clear ID photos from MeshFaces as blind face inpainting because the positions of the corruptions are unknown during the testing phase.

Fig. 1. MeshFaces (first row) refer to ID photos corrupted by randomly generated mesh-like lines or watermarks. The corruptions significantly degenerate the performance of facial landmark detection and facial feature extraction, thus leading to poor verification accuracy.

Fig. 2. A recovered ID with higher PSNR may be further away from the ground-truth clear ID in the compact deep feature space; for instance, the green and yellow samples. Best viewed in color.

In a related vein of research, many contemporary works [15], [16], [17], [18] have shown that CNNs are very effective at solving hole filling (non-blind inpainting) problems [19], because this data-driven learning method can exploit the structure of natural images to predict occluded parts. Improved verification performance is observed after blind face inpainting in [13], because using an occlusion-free image greatly improves the accuracy of face detection and alignment. However, in their work, the performance gap between using their inpainted IDs and the clear IDs is still very large. On one hand, this is because SRCNN is less powerful in modeling corruption distributions and recovering the exact image content. As a result, the difference in image content between the recovered ID photos and the ground-truth clear ID photos is too large (as illustrated by the red and the purple samples in Fig. 2). On the other hand, existing works often assume that recovered ID photos with higher PSNR are more likely to achieve better verification performance [13]. Therefore, they only enforce pixel level similarity when learning a blind face inpainting network. But in fact, face similarity is compared in a compact feature space rather than the image pixel space. Moreover, it has been shown that a CNN can potentially be fooled by adding even a tiny amount of noise to the original input [20], [21], [22]. That is, when facial features are extracted by a CNN, two perceptually indistinguishable face images (e.g., the ground-truth clear ID and its recovered version) may still have very large feature level differences (as shown by the green and yellow samples in Fig. 2). It is a common belief that a large intra-class feature distance will generally deteriorate face verification performance. Therefore, treating blind face inpainting as a typical low-level vision problem by only enforcing pixel level similarity can hardly guarantee an improvement in verification performance.

To address the aforementioned problems, we present DeMeshNet, which takes verification performance into account when dealing with the blind face inpainting problem. DeMeshNet is trained on a large-scale dataset of MeshFace/clear ID photo pairs to learn a non-linear transformation that recovers clear ID photos from MeshFaces. Note that DeMeshNet aims to improve MeshFace verification performance rather than simply to recover perceptually pleasing clear ID photos. Therefore, we refer to DeMeshNet as a feature oriented blind face inpainting framework. The proposed framework is an end-to-end network that facilitates efficient optimization with gradient back propagation. For clarity, we divide the explanation of the whole framework into three parts, and we briefly introduce each part of DeMeshNet and our contributions in the following paragraphs.
In the first part, we enforce pixel level similarity to exploit the structure of face images so as to recover uncorrupted ID photos from MeshFaces. Specifically, we adopt a fully convolutional network (FCN) [23] with a weighted Euclidean loss to minimize the pixel differences between the ground-truth clear ID/recovered ID photo pairs. Extra supervision from the corruption positions is further exploited to accurately model the corruption distribution. In the second part, we enforce feature level similarity between the ground-truth clear ID photos and the recovered ID photos in a compact deep feature space. This feature space is spanned by a pre-trained CNN, and distance in it directly corresponds to a measure of face similarity. This part forces the inpainted image to have a smaller distance to the ground-truth clear ID in the deep feature space, which in turn facilitates accurate verification. Moreover, we propose to measure the feature level similarity with the reverse Huber loss function so that it can be more efficiently optimized when the differences are very small.

Fig. 3. Conceptual diagram of DeMeshNet. The upper part denotes the training phase of the proposed end-to-end DeMeshNet. Note that the corruption mask and the clear ID are only used in the training phase to provide ground-truth for the pixel and feature level losses. The pre-trained feature extraction model has fixed parameters during training and is used to provide feature level supervision during the end-to-end training process. The lower part illustrates the MeshFace verification process. The trained DeMeshNet is first used to restore clean ID photos from the MeshFaces, and then the restored (inpainted) clean ID photos are compared against daily photos for face verification.

In the third part, we employ a customized spatial transformer module [24] to align and crop the face region for accurate feature extraction within the network. It is essential to take alignment into account because the MeshFace, which is the input of DeMeshNet, and the aligned face, which is the input of the feature extraction sub-net, differ in size, scale and orientation (as shown in Fig. 3). By integrating these three parts into one end-to-end training network, the proposed DeMeshNet has the ability to maintain both pixel and feature level similarity while learning to inpaint the MeshFace. Extensive experimental results on two MeshFace datasets demonstrate that DeMeshNet achieves the best verification accuracy and outperforms previous work by a large margin. Overall, as demonstrated by the experimental results, the proposed deep architecture can achieve superior recognition accuracy to previous work at the cost of some visual effects. Furthermore, we thoroughly evaluate different configurations of DeMeshNet to gain insight into the factors behind such significant improvements.

2 RELATED WORK

MeshFace inpainting is a rarely studied problem compared to other related subjects, such as hole filling [19], rain removal [25] and de-fencing [26]. All these tasks fall into the category of inpainting research, and they share a common goal, i.e., to recover a semantically plausible and visually pleasing image from an image with unknown corruption patterns. Specifically, MeshFace inpainting deals with the problem of improving MeshFace verification performance, which has attracted increasingly more attention with the rapid development of FVBID technologies. Recently, Zhang et al. [13] proposed a multi-task ConvNet to inpaint MeshFaces. In the following paragraphs, we briefly go through some recent advances in the related studies of hole filling and rain removal.

Hole Filling (or image completion) is a fundamental research topic in the field of computer vision. In the early years of study, researchers resorted to prior knowledge such as neighborhood resemblance to fill in missing areas with textures around the hole [27], [28]. These methods focus on a local propagation scheme to acquire smooth blending around the holes but fail to take the global structure into consideration; therefore, they cannot guarantee a semantically consistent inpainting result. To compensate for the semantic loss, researchers introduced data-driven learning methods into this field, exploiting a large reference database to find similar image patches to fill the missing holes [29]. Recently, deep learning based methods have also shown great potential in solving this ill-posed low-level vision problem with their power to model both the local and global structure of images.
In particular, Pathak et al. [16] proposed the Context Encoders with a combined L2 norm and adversarial loss [30] for semantic hole filling. Yeh et al. [31] further employ a pre-trained Deep

Convolutional Generative Adversarial Network (DCGAN) [32] to restore images with random or central missing regions.

Rain Removal is a typical application of the blind inpainting problem. Different from the traditional inpainting problem, where the positions of corrupted regions are provided in advance, blind inpainting deals with images for which the positions of corrupted regions are not given. A common idea for solving this specific problem is to exploit the rich temporal information in video sequences to first identify the corrupted regions and then employ an inpainting algorithm for further restoration [33]. Other researchers explore the possibility of single image rain removal with the help of image priors [34], [35]. Kang et al. [36] proposed a dictionary learning based approach for layer decomposition and rain removal. More recently, Eigen et al. [37] proposed a deep learning based method, where a data-driven convolutional auto-encoder model is trained to directly recover clean patches from corrupted ones. The proposed method demonstrates effective removal of rain in outdoor test conditions.

3 APPROACH

3.1 Overview

In this section, we present an overview of the proposed DeMeshNet. We cast the proposed feature oriented blind face inpainting problem as a dense regression problem, which aims to regress a perceptually pleasing and verification favorable clear ID photo from a MeshFace X. For convenience, we refer to the ground-truth clear ID photo as the target, termed Y, and to the recovered ID photo from our blind face inpainting model ψ as the prediction, termed ψ(X). The prediction will be used for face verification against clear daily photos.

We model the highly non-linear function for dense regression as an FCN, as illustrated in part A of Fig. 3. Part B shows the customized spatial transformer module. The following part is a pre-trained CNN that is utilized to compute the feature representation of the aligned face region. It should be noted that the parameters in both the spatial transformer module and the pre-trained CNN are fixed during training. The learnable parameters in the FCN are optimized by minimizing a unified loss function that jointly models the pixel and feature level similarities between the prediction and the target pairs. No identity information is needed to train such a blind inpainting network. The pixel level loss helps to obtain perceptually pleasing inpainting results and serves as a means to capture the distribution difference between actual face texture and mesh-like corruptions. The feature level loss explores supervision in a compact feature space to regularize the network training. Thus, the network's prediction will not only have a similar appearance but also a similar feature representation to that of the target. Specifically, we develop a weighted Euclidean loss to model the pixel level similarity and employ a reverse Huber function [38] to characterize the feature level loss on the spatially transformed face region.
Combining the pixel level loss and the spatially transformed feature level loss, we define the unified loss function for DeMeshNet as follows:

L_i = l_{pixel} + l_{feature} = \|\psi(X_i) - Y_i\|_F^2 + \lambda_1 \|M_i \odot (\psi(X_i) - Y_i)\|_F^2 + \lambda_2 \sum_{j=1}^{2} RH\big(\|\phi_j(\psi(ST(X_i))) - \phi_j(ST(Y_i))\|\big)    (1)

where ST denotes the spatial transformation that samples an aligned face region from the original input based solely on the positions of facial landmarks, RH is the reverse Huber function, and \lambda_1 and \lambda_2 are balance parameters. In general, combining more than one loss is usually no trivial task. However, in our experiments we have observed that our network is robust to the exact weights on the feature and pixel level losses, as long as it can simultaneously decrease both. Therefore, in our final experiments we empirically set those two parameters to 1. We postpone explanations of the remaining symbols to the sections where they first appear. For simplicity, we omit the regularization term on the parameters of the FCN (weight decay) in Equation 1, which is used to reduce overfitting when optimizing our network. The objective function can be efficiently optimized by gradient back propagation in an end-to-end manner. We elaborate on each of the three parts in the following subsections.

3.2 Pixel Level Regression Network

Pixel Level Loss

Blind face inpainting is naturally characterized as a pixel-wise regression problem. The first part of DeMeshNet learns a highly non-linear transformation by optimizing a well-designed pixel level loss. Although the positions of the corruptions are not provided during the testing phase of blind face inpainting, we can still make use of this information when training DeMeshNet. Specifically, we propose to implicitly exploit this extra supervision by introducing a weighted Euclidean loss function:

l_{pixel} = \|\psi(X_i) - Y_i\|_F^2 + \lambda \|M_i \odot (\psi(X_i) - Y_i)\|_F^2    (2)

where X_i, Y_i and \psi(X_i) are a corrupted input, the target and the prediction, respectively. M_i is the binary mask, with a value of 1 indicating that a pixel is corrupted and 0 otherwise. \odot is the element-wise product operation. The second term in the loss function therefore measures the Euclidean loss only on corrupted areas, emphasizing losses on those areas with a weighting parameter \lambda. This loss function helps the network learn the distribution of corrupted pixels better by exploiting extra supervision from the corruption positions. Experimental results demonstrate that it also helps to generate predictions with higher PSNR.
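To make Eq. (2) concrete, below is a minimal PyTorch-style sketch of the weighted pixel level loss. The paper's implementation is in Caffe, so the framework, the tensor shapes and the summation-based reduction here are illustrative assumptions only.

import torch

def weighted_pixel_loss(pred: torch.Tensor,
                        target: torch.Tensor,
                        mask: torch.Tensor,
                        lam: float = 1.0) -> torch.Tensor:
    # pred/target: N x 1 x H x W batches; mask: binary corruption mask M_i
    diff = pred - target
    plain = (diff ** 2).sum()            # ||psi(X) - Y||_F^2
    masked = ((mask * diff) ** 2).sum()  # masked term, corrupted pixels only
    return plain + lam * masked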
Network Architecture

Since FCNs have achieved outstanding performance in dense prediction tasks such as depth prediction [39] and semantic segmentation [23], we are motivated to use an FCN as the non-linear function for blind face inpainting. The main difference between the FCN and the architecture in [13] is the introduction of down-sampling and up-sampling layers. This simple adjustment enables the FCN to admit much deeper layers and to expand the receptive fields at the same computational cost. The expanded receptive fields are critical to blind inpainting, as they enclose more contextual information for identifying the corrupted areas. In this work, we use the network architecture of SegNet as proposed in [40]. Specifically, a VGG-16 Net [41] is used at the front end of our network, and unpooling and convolutional layers with Rectified Linear Unit (ReLU) non-linearities are symmetrically concatenated to the bottleneck layer (the output of conv53) to progressively upsample the feature maps to the original size. A ReLU activation is used after the last convolution layer to ensure that the predicted pixel values are all positive. It is feasible to adopt other architectures such as ResidualNet [42] and the Deconvolution Network [43], but this is beyond the scope of this paper. The input and output of the network are MeshFaces and clear ID photos, respectively. Gray-scale images of a fixed size are used for input and output throughout the paper.
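The following is a small PyTorch sketch of this encoder-decoder pattern, shrunk to two stages instead of the full VGG-16 depth; the channel widths are invented for illustration and do not reproduce the paper's exact configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MiniSegNet(nn.Module):
    """Toy SegNet-style FCN: conv blocks with max pooling, mirrored by
    unpooling layers that reuse the pooling indices of the encoder."""

    def __init__(self):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(True))
        self.enc2 = nn.Sequential(nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(True))
        self.dec2 = nn.Sequential(nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(True))
        self.dec1 = nn.Conv2d(64, 1, 3, padding=1)

    def forward(self, x):                    # x: N x 1 x H x W gray-scale MeshFace
        x = self.enc1(x)
        x, idx1 = F.max_pool2d(x, 2, return_indices=True)
        x = self.enc2(x)
        x, idx2 = F.max_pool2d(x, 2, return_indices=True)
        x = F.max_unpool2d(x, idx2, 2)       # upsample with the stored indices
        x = self.dec2(x)
        x = F.max_unpool2d(x, idx1, 2)
        return F.relu(self.dec1(x))          # final ReLU keeps pixel values non-negative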

Fig. 4. Schematic architecture for face feature extraction. It uses Max-Feature-Map non-linearities instead of ReLU.

3.3 Feature Level Regression Network

Feature Level Loss

As mentioned in Section 1, images with a very small Euclidean distance may have a large feature distance. This problem was raised in [22] and has been shown to severely deteriorate classification performance because of the enlarged intra-class feature distance. In fact, the enlarged intra-class feature distance can also influence the verification task at hand. To improve the verification performance, it is not enough to only enforce pixel level similarity. Therefore, in the second part, we explicitly enforce the target and the prediction to have a small distance in the compact feature space computed by the pre-trained CNN φ. Let φ_j(ψ(X_i)) be the activations of the jth layer of the pre-trained face model φ. We intend to improve the verification performance by minimizing the residual r = φ(ψ(X_i)) − φ(Y_i) at each position. To efficiently back-propagate the errors when the residual is very small, we employ the reverse Huber loss function [38] to measure the feature level difference; its formulation is:

RH(r) = \begin{cases} |r|, & |r| \le c \\ \frac{r^2 + c^2}{2c}, & |r| > c \end{cases}    (3)

The reverse Huber loss is equivalent to the L1 norm when the residual r is in the interval [−c, c] and equals a transformed L2 norm otherwise. This loss experimentally works better than the L2 norm. Note that when c is smaller than 1, the derivative of the L1 norm is greater than that of the L2 norm, which speeds up error back-propagation when the residual is very tiny. As in [38], we use a dynamic threshold for c: in each batch-minimization step, c is set to 20% of the maximum residual in that batch. Since our objective is to improve verification performance, we impose the reverse Huber loss on the output of eltwise 6 in the pre-trained model φ, which is a 256-dim feature vector used for similarity comparison. Drawing inspiration from [44], where a feature loss is used for super-resolution, we also impose the reverse Huber loss on an early layer (conv2 in our case) of the pre-trained face model φ to include deeper supervision [45]. Therefore, our final loss function for feature level regression is:

l_{feature} = \sum_{j=1}^{2} RH\big(\|\phi_j(\psi(X_i)) - \phi_j(Y_i)\|\big)    (4)

Feature losses have recently been considered in the literature on super-resolution and sketch inversion for better visual performance [44], [46], [47]. It should be noted, however, that in this paper the feature level loss is proposed from a totally different perspective. Instead of pursuing perceptually pleasing image transformation results, we want to deal with the large intra-class feature distance problem and improve verification performance [48].
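A PyTorch sketch of the berHu loss of Eq. (3) with the dynamic threshold, and of the two-layer feature loss of Eq. (4), is given below. The interface of the frozen feature extractor phi, called here with the name of the desired intermediate layer, is a hypothetical stand-in for however the activations are actually exposed.

import torch

def berhu(residual: torch.Tensor) -> torch.Tensor:
    # reverse Huber: L1 inside [-c, c], shifted L2 outside; c is set
    # dynamically to 20% of the largest absolute residual in the batch
    r = residual.abs()
    c = (0.2 * r.max()).detach().clamp(min=1e-8)
    return torch.where(r <= c, r, (r ** 2 + c ** 2) / (2 * c)).sum()

def feature_loss(phi, pred_aligned: torch.Tensor,
                 target_aligned: torch.Tensor) -> torch.Tensor:
    loss = pred_aligned.new_zeros(())
    for layer in ("conv2", "eltwise6"):   # hypothetical layer handles
        loss = loss + berhu(phi(pred_aligned, layer) - phi(target_aligned, layer))
    return loss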
Pre-trained CNN for Face Feature Extraction

Both training DeMeshNet and evaluating the verification performance require a pre-trained CNN to compute facial features for face images. We use the architecture proposed in [49] (code name CRIPAC-9, as it has nine layers) for facial feature extraction because it is computationally efficient in both time and space. Fig. 4 briefly illustrates the architecture of the CRIPAC-9 model, which uses Max-Feature-Map non-linearities instead of ReLU and thus can generate dense features at its output. The network takes aligned gray-scale face images as input and returns a 256-dim feature (the output of eltwise 6). Alignment is conducted by transforming two facial landmarks (i.e., the centers of the two eyes) to (32, 32) and (96, 32) with a similarity transformation. We train a base model on the purified MS-Celeb-1M dataset [50] (the original data is very noisy, so we purify it before training). A single base model achieves a verification accuracy of 98.80% on the LFW [51] benchmark, which is very competitive among published works [52], [53]. We further finetune the base model with a triplet loss on a large-scale ID/daily photo dataset collected from the web and use the finetuned model φ to extract compact facial features in DeMeshNet. Note that this model's parameters stay fixed during the training of DeMeshNet: we only use it for feature computation, and no identity related supervision is adopted to finetune φ while learning the blind inpainting FCN. The obtained feature extraction model is also used for the subsequent face verification after the inpainting process.
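The Max-Feature-Map (MFM) operation can be written in a few lines. The following generic sketch (not the authors' released code) assumes the channel dimension is split into two halves whose element-wise maximum is taken, so MFM halves the channel count instead of zeroing activations like ReLU.

import torch

def max_feature_map(x: torch.Tensor) -> torch.Tensor:
    # x: N x 2C x H x W -> N x C x H x W
    a, b = torch.chunk(x, 2, dim=1)
    return torch.max(a, b)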

3.4 Spatial Transformer for Face Alignment

Fig. 5. The customized spatial transformer module used in our model. It uses the locations of facial landmarks to calculate a transformation matrix for face alignment.

Fig. 6. An image triplet sample from the training dataset [13].

Face alignment is essential for feature extraction. It helps to improve verification performance by providing a normalized input. The pre-trained model φ requires an aligned face region to compute the 256-dim feature, but DeMeshNet takes an un-aligned MeshFace as input and outputs a prediction of the same size. Therefore, in order to compute the 256-dim feature in the fully connected layer eltwise 6, we must implement face alignment within the network. Moreover, unlike image classification models that are trained with multi-scale natural images from ImageNet [54], the pre-trained facial feature extraction model φ only takes single-scale, well-aligned face regions for training. This means that even if we only compute the features from conv layers as in [44], [46], [47], we still need to align the MeshFaces first to acquire an accurate feature representation.

We incorporate a customized spatial transformer module [24] between the pixel level regression sub-net and the feature level regression sub-net to sample an aligned face region from the prediction according to the facial landmarks. This procedure is illustrated in Fig. 3 and detailed in Fig. 5. The spatial transformer module comprises a localization network, a grid generator and a sampler. Since the similarity transformation τ_θ for face alignment is uniquely determined by the coordinates of the two eye centers, we do not need to learn it through the localization network as in [24]. τ_θ is parameterized by θ = [a, b, t_1; b, a, t_2] and can be determined from the eye-center correspondences:

\begin{pmatrix} x_l & x_r \\ y_l & y_r \end{pmatrix} = \begin{bmatrix} a & b & t_1 \\ b & a & t_2 \end{bmatrix} \begin{pmatrix} x'_l & x'_r \\ y'_l & y'_r \\ 1 & 1 \end{pmatrix}    (5)

where (x_l, y_l) and (x_r, y_r) are the normalized coordinates (normalized to [−1, 1]) of the two eye centers in the original image, and (x'_l, y'_l) = (−0.5, −0.5) and (x'_r, y'_r) = (0.5, −0.5) are the fixed eye positions in the aligned face image.

Given the similarity transformation matrix τ_θ, we can define the forward and backward passes of the spatial transformer module, which employ a bilinear sampling kernel to sample the original image into its aligned version and to back-propagate the feature loss from the aligned image to the original input, respectively. In the forward pass, a sampling grid is first determined from the given τ_θ. A sampling grid is a set of points with continuous coordinates; sampling an input image according to this grid generates the transformed output. Defining the output pixels to lie on a regular grid G = {G_i} of pixels G_i = (x_i^t, y_i^t), the sampling grid τ_θ(G) is given by the point-wise transformation:

\begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix} = \tau_\theta(G_i) = \begin{bmatrix} a & b & t_1 \\ b & a & t_2 \end{bmatrix} \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix}    (6)

where (x_i^s, y_i^s) are the source coordinates in the input feature map and (x_i^t, y_i^t) are the target coordinates on the regular grid. Since the coordinates in the sampling grid are continuous numbers, a bilinear kernel is applied at those positions to produce the corresponding pixel values in the output:

Q = \max(0, 1 - |x_i^s - m|)\,\max(0, 1 - |y_i^s - n|)    (7)

P_{(x_i^t, y_i^t)} = \sum_n^H \sum_m^W P_{(x_m^s, y_n^s)}\, Q    (8)

where H and W are the height and width of the input image, respectively, and P_{(x_i^t, y_i^t)} represents the pixel value at (x_i^t, y_i^t).
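For intuition, here is a hedged PyTorch re-implementation of this customized spatial transformer built on affine_grid/grid_sample, which realize the grid generation of Eq. (6) and the bilinear kernel of Eqs. (7)-(8). The standard rotation-scale form of the matrix and the aligned eye targets (-0.5, -0.5) and (0.5, -0.5) are our reading of Eq. (5), not code from the paper.

import torch
import torch.nn.functional as F

def align_face(img: torch.Tensor, eye_l, eye_r, out_hw=(128, 128)):
    # img: N x C x H x W; eye_l/eye_r: eye centers normalized to [-1, 1]
    (xl, yl), (xr, yr) = eye_l, eye_r
    # solve source = [a -b tx; b a ty] * (x_t, y_t, 1)^T from the two
    # correspondences (-0.5, -0.5) -> left eye and (0.5, -0.5) -> right eye
    a, b = xr - xl, yr - yl
    tx = xl + 0.5 * a - 0.5 * b
    ty = yl + 0.5 * b + 0.5 * a
    theta = torch.tensor([[a, -b, tx],
                          [b,  a, ty]], dtype=img.dtype,
                         device=img.device).unsqueeze(0)
    n = img.size(0)
    grid = F.affine_grid(theta.expand(n, -1, -1),
                         size=(n, img.size(1), *out_hw), align_corners=False)
    # bilinear sampling, differentiable with respect to the input image
    return F.grid_sample(img, grid, mode='bilinear', align_corners=False)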
To allow the feature loss defined in the last subsection to be back-propagated from the output of the spatial transformer module to the input image, we give the gradients with respect to the input image as follows:

\frac{\partial P_{(x_i^t, y_i^t)}}{\partial P_{(x_m^s, y_n^s)}} = \sum_n^H \sum_m^W Q    (9)

The gradients of the feature loss defined earlier can easily be flowed back to the input image using the chain rule with Equation 9. Note that the gradients with respect to the sampling grid coordinates (x_i^s, y_i^s) are not derived, because the transformation parameters are not learned in the customized spatial transformer module.

4 EXPERIMENTS

In this section, we experimentally evaluate the proposed framework. We begin by introducing the datasets for training and testing. Then we specify the baseline methods and implementation details. Finally, we present a detailed algorithmic evaluation as well as comparisons with other methods.

4.1 Datasets

All the compared models are trained on the dataset from [13], which contains over 500,000 data triplets of 11,648 individuals. It should be noted that, to increase the diversity of the patterns, the random corruption is simulated with sine and cosine waves of different magnitudes and phases. Random affine transformations are also applied to the corruption patterns. Furthermore, the percentage of corruption is controlled by manipulating the line width of the generated wavy lines.
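A rough NumPy illustration of this synthesis procedure is given below. Every parameter range here is invented for illustration (the paper does not publish them), and the random affine transformations applied to the patterns are omitted for brevity.

import numpy as np

def add_mesh(img: np.ndarray, n_lines: int = 4, width: int = 3, rng=None):
    # img: H x W gray-scale image in [0, 1]; returns (MeshFace, binary mask)
    rng = rng or np.random.default_rng()
    h, w = img.shape
    mask = np.zeros((h, w), dtype=bool)
    xs = np.arange(w)
    for _ in range(n_lines):
        amp = rng.uniform(2.0, 8.0)           # magnitude of the wave
        freq = rng.uniform(0.02, 0.08)        # spatial frequency
        phase = rng.uniform(0.0, 2.0 * np.pi)
        y0 = rng.uniform(0.1, 0.9) * h
        ys = y0 + amp * np.sin(freq * xs + phase)
        for t in range(width):                # crude control of line width
            rows = np.clip(np.round(ys) + t, 0, h - 1).astype(int)
            mask[rows, xs] = True
    meshface = img.copy()
    meshface[mask] = 1.0                      # paint the corruption white
    return meshface, mask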

Each data triplet consists of an m × n MeshFace, its clear version and a corruption mask (as illustrated in Fig. 6). Facial landmarks (the two eye centers) are detected using Intraface [55] to aid the spatial transformer module. Data triplets of 400 individuals are sampled for validation and testing (200 each), and all the other individuals are used for training.

Fig. 7. Sample image pairs from SV1000. Note that this dataset is very hard, as the variations in illumination, pose and hair style are very significant.

Fig. 8. Sample image pairs from SYN500. Images of some Chinese celebrities are collected from the Internet.

Besides the SYN500 dataset used in [13], we collect another dataset, named SV1000, with 1000 MeshFace/daily photo pairs from 1000 individuals to evaluate MeshFace verification performance. Daily photos in SV1000 are captured by surveillance cameras. As shown in Fig. 7, this dataset not only contains more individuals but also presents more variation in the daily photos, which makes it more challenging than SYN500 (shown in Fig. 8). We develop a protocol for evaluating verification performance on these two datasets. Specifically, face comparison is conducted between all possible recovered clear ID/daily photo pairs in the compact feature space (spanned by model φ) with the cosine distance. For a dataset with N data pairs, N^2 comparisons are conducted in total. To exclude influences from metric learning methods, no supervised learning methods, e.g., Joint Bayesian [56], are employed on the extracted features.
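To make the protocol concrete, the following NumPy sketch (our own, not the authors' evaluation code) scores feature pairs with cosine similarity and reads FNMR off at a target FMR by thresholding the impostor score distribution.

import numpy as np

def cosine(f1: np.ndarray, f2: np.ndarray) -> float:
    # cosine similarity between two 256-dim feature vectors
    return float(f1 @ f2 / (np.linalg.norm(f1) * np.linalg.norm(f2) + 1e-12))

def fnmr_at_fmr(genuine: np.ndarray, impostor: np.ndarray, fmr: float = 0.01) -> float:
    # threshold such that the given fraction of impostor scores exceeds it,
    # then report the fraction of genuine scores that fall below it
    thr = np.quantile(impostor, 1.0 - fmr)
    return float(np.mean(genuine < thr))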
4.2 Baselines and Implementation Details

Although many algorithms have been proposed for non-blind inpainting, few have been developed to address the blind inpainting problem, and even fewer for the blind face inpainting problem addressed in this paper. We implement the multi-task CNN (MtNet) [13] as a baseline. This method employs an architecture that resembles SRCNN [14] and uses multi-task learning to exploit the information of the corruption positions in the training phase. To give a detailed evaluation of each part of the proposed DeMeshNet, we also compare it with various configurations. The compared configurations include the FCN with the Euclidean pixel level loss (FCNE), the FCN with the weighted pixel level loss (FCNW) and the FCN with only the feature level loss (FCNF, without the spatial transformer module). All three configurations use the FCN as the backbone for the blind face inpainting task. Both FCNE and FCNW adopt only the pixel level loss during training, but FCNW introduces implicit supervision from the corruption mask with a weighted loss in addition to the Euclidean loss. FCNF takes both the weighted pixel loss and a whole-image feature level loss into consideration, but the feature level differences are only computed at the output of conv2, using the whole image as input to the pre-trained feature extraction network φ. As in [44], we do not implement face alignment within the network for FCNF. To further reveal the importance of the pixel level loss, we also implement an alternative DeMeshNet structure without the pixel level loss, named DeMeshNet F.

All the compared models are trained on the training set of photo pairs; gray-scale images normalized to [0, 1] are used in all the experiments. For all the compared network structures, [57] is used to initialize the weights, and training is carried out using Adam [58] with a batch size of 30. The learning rate is set to 10^-4 initially and decreased by a factor of 10 every 40k iterations. The training process takes approximately 160k iterations to converge. For the proposed approach, the feature level loss is computed at layers conv2 and eltwise 6 of the pre-trained face model φ. For FCNF, only conv2 is used to compute the feature loss, as the computation of the fc layer eltwise 6 requires the input to be of a fixed size. All the MeshFace verification experiments use the pre-trained face model φ for facial feature extraction. All the experiments are conducted with the Caffe framework [59] on a single GTX Titan X GPU. Code for implementing DeMeshNet is available at cripac/demeshnet.

4.3 Evaluation of Verification Results

In this section, we conduct MeshFace verification experiments on two datasets, i.e., SYN500 and SV1000. Recovered ID photos are used for FVBID according to the aforementioned protocol. We report ROC curves in Fig. 9. FNMR@FMR=1% (the false non-match rate when the false match rate is 1%), FNMR@FMR=0.1% and FNMR@FMR=0.01% are reported in Table 1 for closer inspection. We also present the face verification performance with ground-truth clear ID photos and with MeshFaces (denoted as Clear and Corrupted) for fair comparison.

Fig. 9. ROC curves for SYN500 and SV1000. Note that verification performance on the challenging SV1000 dataset is significantly worse than on SYN500.

As expected, when the MeshFaces are used for verification, the accuracy suffers a severe drop on both datasets due to large detection and alignment errors. After processing the MeshFaces with blind face inpainting models, face verification performance with the recovered ID photos sees a great improvement. Owing to the deeper FCN architecture and expanded receptive fields, all the FCN based models outperform the baseline model MtNet [13] by a large margin on both datasets. As shown in Fig. 9, the feature loss based models (DeMeshNet, FCNF, DeMeshNet F) perform better than the models that only seek visually pleasing inpainting results (FCNE, FCNW). This suggests that by enforcing feature level similarity during training, predictions from DeMeshNet lie in a low-dimensional feature space that is closer to the ground-truth clear ID photos than predictions from pixel-level-only networks. This is validated in the next section, where we calculate the RMSE (root mean square error) between the features of the ground-truth clear IDs and the recovered IDs.

TABLE 1
Verification performance on SYN500/SV1000 and inpainting results on the testing set. RMSE illustrates the feature distance between the recovered ID photos and the ground-truth clear ID photos, while PSNR indicates the pixel distance. Note that a smaller RMSE consistently indicates better verification performance, but a higher PSNR does not guarantee that.

Method       FNMR@FMR=1%      FNMR@FMR=0.1%    FNMR@FMR=0.01%   PSNR  RMSE
MtNet        16.40% / 21.20%  37.20% / 42.50%  63.20% / 64.60%
Clear         1.20% / 11.90%  10.60% / 25.70%  32.60% / 46.40%  -     0
Corrupted    52.60% / 56.80%  66.20% / 71.50%  81.60% / 81.80%
FCNE          4.80% / 16.80%  20.20% / 36.10%  46.20% / 57.00%
FCNW          4.60% / 14.30%  21.60% / 35.10%  47.20% / 55.60%
FCNF          4.80% / 14.20%  17.20% / 33.60%  45.60% / 53.70%
DeMeshNet F   3.20% / 13.80%  17.80% / 30.60%  45.20% / 53.70%
DeMeshNet E   3.40% / 13.70%  16.10% / 30.00%  45.20% / 53.50%
DeMeshNet     2.60% / 13.30%  15.20% / 29.30%  44.80% / 53.00%

However, enforcing only the feature level similarity leads to inferior verification results, as shown by the comparison between DeMeshNet and DeMeshNet F. This is possibly due to increased artifacts introduced by the feature level loss, as it cannot maintain image details without enforcing the pixel level similarity.

We further investigate the role of the spatial transformer module in our model by comparing DeMeshNet with FCNF, which uses the image at its original size as input to the feature loss component. We find that DeMeshNet consistently performs better than FCNF. This is because FCNF takes the whole image, which differs from the aligned face region in both scale and orientation, for feature extraction. Unlike the ImageNet [54] models that are trained with multi-scale natural images, the pre-trained face model φ only takes single-scale, well-aligned face regions for training. The scale and orientation differences lead to the performance degeneration. Therefore, it is important to take face alignment into account when optimizing the feature level loss, as done in DeMeshNet. It should be noted that for blind inpainting models there is an upper limit on verification performance, namely the performance with ground-truth clear ID photos. From Table 1, we can observe that the gap between DeMeshNet and the clear IDs at FNMR@FMR=1% is very small (1.4% for both datasets, although the overall performance is significantly worse on SV1000 than on SYN500), validating the outstanding performance of DeMeshNet.

4.4 Evaluation of Inpainting Results

Fig. 10. Visual inspection of the inpainting results. Although these inpainting results all look very good and are quite similar, they lead to entirely different verification rates because of different RMSE in the feature space.

In this section, we qualitatively and quantitatively evaluate the inpainting results on the testing set (200 individuals). First, we qualitatively evaluate the compared models by visual inspection of the inpainting results in Fig. 10. It is observed that MtNet fails to identify and recover some portions of the corruptions in these cases (cherry-picked to illustrate the point). In contrast, the FCN based models handle all the corrupted areas very well because they can enclose more contextual information with expanded receptive fields. This demonstrates the improved capacity of FCN over SRCNN based architectures. Regarding the details of the recovered ID photos, the models trained with only the pixel level loss (FCNE, FCNW, MtNet) better preserve the consistency of pixels and thus provide a smooth and clear photo that is more similar to the ground-truth.
But the images recovered by the models trained with a feature level loss (FCNF in particular) contain many artifacts, making them visually less appealing. This is because the high-level features are robust to pixel level changes in texture, shape and even color. However, images recovered by DeMeshNet look much better than the ones from FCNF due to the introduction of the spatial transformer module within DeMeshNet.

Next, we quantitatively evaluate the models by measuring both the pixel level and the feature level (eltwise 6) differences. The ground-truth clear ID photo is used as the baseline. Specifically, the average PSNR and the Euclidean distance between features (root mean square error, RMSE) are shown in Table 1. The pixel level loss based models yield better PSNR, as they explicitly optimize PSNR in their loss function. We also observe that FCNW performs slightly better than FCNF. This implies that exploiting extra supervision from the corruption positions in the training phase is beneficial for acquiring visually pleasing inpainting results. To validate the choice of the reverse Huber loss over the Euclidean loss for the feature level difference, we also implement a DeMeshNet E that uses the Euclidean loss as the feature level loss. From Table 1, we can see that the RMSE for DeMeshNet E is slightly larger than for DeMeshNet. This indicates that the reverse Huber loss is more effective at minimizing the feature level distance, thanks to the imposed L1 norm when residuals are small. Furthermore, from Table 1 we observe that a smaller RMSE usually means better verification performance, but a higher PSNR does not guarantee a smaller RMSE. This reveals that the models trained with only the pixel level loss do suffer from the easily fooled nature of CNNs [20], [21], [22]. Moreover, visually appealing inpainting results do not necessarily produce better verification results. By exploring supervision from the deep feature space, DeMeshNet can capture a distribution that is more robust to the transformation in φ and thus provides a stable high-level representation in the compact deep feature space.

5 CONCLUSIONS

This paper addresses the MeshFace verification problem, which verifies corrupted ID photos against clear daily photos. Specifically, we have proposed DeMeshNet, which consists of three parts, to blindly inpaint the MeshFace before conducting verification. The proposed DeMeshNet distinguishes itself from previous works by explicitly taking verification performance into consideration while recovering a clear ID photo. The training objective of DeMeshNet is motivated by the fact that minimizing pixel level differences alone cannot guarantee a small intra-class feature distance in the compact deep feature space, which is crucial for accurate face verification. By further incorporating a spatial transformer module, DeMeshNet can implement face alignment within the network, resulting in an end-to-end network. For optimizing DeMeshNet, a well-performing facial feature extraction network has been trained in advance.

Experimental results on two MeshFace datasets demonstrate that the proposed DeMeshNet outperforms previous work in verification performance. Although we have achieved good performance on synthetic datasets, our current framework still has trouble generalizing to unseen real-world MeshFaces. It would be interesting to develop new methods that deal with real-world MeshFaces in future work.

ACKNOWLEDGEMENT

This work is funded by the National Natural Science Foundation of China, the Beijing Municipal Science and Technology Commission and the State Key Development Program (Grant No. 2016YFB).

REFERENCES

[1] M. A. Turk and A. P. Pentland, "Face recognition using eigenfaces," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1991.
[2] Z. Sun and T. Tan, "Ordinal measures for iris recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 12, Dec.
[3] A. K. Jain, A. Ross, and S. Pankanti, "Biometrics: a tool for information security," IEEE Transactions on Information Forensics and Security, vol. 1, no. 2.
[4] C. Ding and D. Tao, "Pose-invariant face recognition with homography-based normalization," Pattern Recognition.
[5] Y. Sun, K. Nasrollahi, Z. Sun, and T. Tan, "Complementary cohort strategy for multimodal face pair matching," IEEE Transactions on Information Forensics and Security, vol. 11, no. 5, May.
[6] R. He, W. S. Zheng, and B. G. Hu, "Maximum correntropy criterion for robust face recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 8, Aug.
[7] D. F. Smith, A. Wiliem, and B. C. Lovell, "Face recognition on consumer devices: Reflections on replay attacks," IEEE Transactions on Information Forensics and Security, vol. 10, no. 4, April.
[8] F. Schroff, D. Kalenichenko, and J. Philbin, "FaceNet: A unified embedding for face recognition and clustering," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[9] Y. Sun, X. Wang, and X. Tang, "Deep learning face representation from predicting 10,000 classes," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014.
[10] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, "DeepFace: Closing the gap to human-level performance in face verification," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014.
[11] E. Zhou, Z. Cao, and Q. Yin, "Naive-deep face recognition: Touching the limit of LFW benchmark or not?" arXiv preprint.
[12] X. P. Burgos-Artizzu, P. Perona, and P. Dollár, "Robust face landmark estimation under occlusion," in Proceedings of the IEEE International Conference on Computer Vision, 2013.
[13] S. Zhang, R. He, Z. Sun, and T. Tan, "Multi-task ConvNet for blind face inpainting with application to face verification," in Proceedings of the IEEE International Conference on Biometrics, June 2016.
[14] C. Dong, C. C. Loy, K. He, and X. Tang, "Learning a deep convolutional network for image super-resolution," in Proceedings of the European Conference on Computer Vision.
[15] X.-J. Mao, C. Shen, and Y.-B. Yang, "Image restoration using convolutional auto-encoders with symmetric skip connections," arXiv preprint.
[16] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros, "Context encoders: Feature learning by inpainting," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, June.
[17] J. S. Ren, L. Xu, Q. Yan, and W. Sun, "Shepard convolutional neural networks," in Advances in Neural Information Processing Systems, 2015.

[18] J. Xie, L. Xu, and E. Chen, "Image denoising and inpainting with deep neural networks," in Advances in Neural Information Processing Systems, 2012.
[19] A. Criminisi, P. Pérez, and K. Toyama, "Region filling and object removal by exemplar-based image inpainting," IEEE Transactions on Image Processing, vol. 13, no. 9.
[20] I. J. Goodfellow, J. Shlens, and C. Szegedy, "Explaining and harnessing adversarial examples," arXiv preprint.
[21] A. Nguyen, J. Yosinski, and J. Clune, "Deep neural networks are easily fooled: High confidence predictions for unrecognizable images," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[22] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, "Intriguing properties of neural networks," arXiv preprint.
[23] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[24] M. Jaderberg, K. Simonyan, A. Zisserman et al., "Spatial transformer networks," in Advances in Neural Information Processing Systems, 2015.
[25] Y. Li, R. T. Tan, X. Guo, J. Lu, and M. S. Brown, "Rain streak removal using layer priors," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, June.
[26] R. Yi, J. Wang, and P. Tan, "Automatic fence segmentation in videos of dynamic scenes," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June.
[27] V. Kwatra, I. Essa, A. Bobick, and N. Kwatra, "Texture optimization for example-based synthesis," ACM Transactions on Graphics (TOG), vol. 24, no. 3.
[28] A. A. Efros and T. K. Leung, "Texture synthesis by non-parametric sampling," in Proceedings of the IEEE International Conference on Computer Vision, vol. 2, 1999.
[29] J. Hays and A. A. Efros, "Scene completion using millions of photographs," ACM Transactions on Graphics (TOG), vol. 26, no. 3, 2007.
[30] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems, 2014.
[31] R. Yeh, C. Chen, T. Y. Lim, M. Hasegawa-Johnson, and M. N. Do, "Semantic image inpainting with perceptual and contextual losses," arXiv preprint.
[32] A. Radford, L. Metz, and S. Chintala, "Unsupervised representation learning with deep convolutional generative adversarial networks," CoRR.
[33] K. Garg and S. K. Nayar, "Detection and removal of rain from videos," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[34] S. Roth and M. J. Black, "Fields of experts: A framework for learning image priors," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, 2005.
[35] D. Zoran and Y. Weiss, "From learning models of natural image patches to whole image restoration," in Proceedings of the IEEE International Conference on Computer Vision, 2011.
[36] L.-W. Kang, C.-W. Lin, and Y.-H. Fu, "Automatic single-image-based rain streaks removal via image decomposition," IEEE Transactions on Image Processing, vol. 21, no. 4.
[37] D. Eigen, D. Krishnan, and R. Fergus, "Restoring an image taken through a window covered with dirt or rain," in Proceedings of the IEEE International Conference on Computer Vision, 2013.
[38] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab, "Deeper depth prediction with fully convolutional residual networks," arXiv preprint.
[39] D. Eigen, C. Puhrsch, and R. Fergus, "Depth map prediction from a single image using a multi-scale deep network," in Advances in Neural Information Processing Systems, 2014.
[40] V. Badrinarayanan, A. Handa, and R. Cipolla, "SegNet: A deep convolutional encoder-decoder architecture for robust semantic pixel-wise labelling," arXiv preprint.
[41] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint.
[42] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, June.
[43] H. Noh, S. Hong, and B. Han, "Learning deconvolution network for semantic segmentation," in Proceedings of the IEEE International Conference on Computer Vision, December.
[44] J. Johnson, A. Alahi, and L. Fei-Fei, "Perceptual losses for real-time style transfer and super-resolution," arXiv preprint.
[45] C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu, "Deeply-supervised nets," arXiv preprint.
[46] Y. Güçlütürk, U. Güçlü, R. van Lier, and M. A. van Gerven, "Convolutional sketch inversion," arXiv preprint.
[47] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. P. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi, "Photo-realistic single image super-resolution using a generative adversarial network," CoRR.
[48] R. Huang, S. Zhang, T. Li, and R. He, "Beyond face rotation: Global and local perception GAN for photorealistic and identity preserving frontal view synthesis," arXiv preprint.
[49] X. Wu, R. He, and Z. Sun, "A lightened CNN for deep face representation," arXiv preprint.
[50] Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao, "MS-Celeb-1M: A dataset and benchmark for large-scale face recognition," arXiv preprint.
[51] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller, "Labeled faces in the wild: A database for studying face recognition in unconstrained environments," University of Massachusetts, Amherst, Tech. Rep., October.
[52] C. Ding and D. Tao, "Robust face recognition via multimodal deep face representation," IEEE Transactions on Multimedia, vol. 17, no. 11.
[53] D. Wang, C. Otto, and A. K. Jain, "Face search at scale," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PP, no. 99, pp. 1-1.
[54] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., "ImageNet large scale visual recognition challenge," International Journal of Computer Vision, vol. 115, no. 3.
[55] X. Xiong and F. De la Torre, "Supervised descent method and its applications to face alignment," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[56] D. Chen, X. Cao, L. Wang, F. Wen, and J. Sun, "Bayesian face revisited: A joint formulation," in Proceedings of the European Conference on Computer Vision. Springer, 2012.
[57] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[58] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint.
[59] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," arXiv preprint.
Shu Zhang received the B.E. degree in Automation from Northwestern Polytechnical University. He is currently pursuing a Ph.D. degree in Computer Application Technology at the Center for Research on Intelligent Perception and Computing, National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, China. His research interests focus on deep learning, pattern recognition, and computer vision.

Ran He received the B.E. and M.S. degrees in computer science from the Dalian University of Technology in 2001 and 2004, respectively, and the Ph.D. degree in pattern recognition and intelligent systems from the Institute of Automation, Chinese Academy of Sciences. Since 2010, he has been with NLPR, where he is currently a Professor. His research interests focus on information theoretic learning, pattern recognition, and computer vision. He currently serves as an Associate Editor of Neurocomputing (Elsevier) and also serves on the program committees of several conferences.

Zhenan Sun received the B.E. degree in industrial automation from the Dalian University of Technology, China, in 1999, the M.S. degree in system engineering from the Huazhong University of Science and Technology, China, in 2002, and the Ph.D. degree in pattern recognition and intelligent systems from the Institute of Automation, Chinese Academy of Sciences (CASIA), China. Since 2006, he has been with NLPR, CASIA, as a Faculty Member. He is currently a Professor with the Center for Research on Intelligent Perception and Computing, National Laboratory of Pattern Recognition, CASIA. He is a member of the IEEE Computer Society and the IEEE Signal Processing Society. His current research interests include biometrics, pattern recognition, and computer vision. He has authored or coauthored over 100 technical papers. He is an Associate Editor of the IEEE Transactions on Information Forensics and Security and the IEEE Biometrics Compendium.

Tieniu Tan received his B.Sc. degree in electronic engineering from Xi'an Jiaotong University, China, in 1984, and his M.Sc. and Ph.D. degrees in electronic engineering from Imperial College London, U.K., in 1986 and 1989, respectively. In October 1989, he joined the Department of Computer Science, The University of Reading, U.K., where he worked as a Research Fellow, Senior Research Fellow and Lecturer. In January 1998, he returned to China to join the National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences (CAS) as a full professor. He served as the Director General of the CAS Institute of Automation and as the Director of the NLPR. He is currently Director of the Center for Research on Intelligent Perception and Computing at the Institute of Automation and also serves as Deputy Secretary-General of the CAS and Director General of the CAS Bureau of International Cooperation. He has published more than 450 research papers in refereed international journals and conferences in the areas of image processing, computer vision and pattern recognition, and has authored or edited 11 books. He holds more than 70 patents. His current research interests include biometrics, image and video understanding, and information forensics and security. Dr. Tan is a Member (Academician) of the Chinese Academy of Sciences, a Fellow of The World Academy of Sciences (TWAS), an International Fellow of the UK Royal Academy of Engineering, and a Fellow of the IEEE and the IAPR (the International Association for Pattern Recognition). He is Editor-in-Chief of the International Journal of Automation and Computing. He has given invited talks and keynotes at many universities and international conferences, and has received numerous national and international awards and recognitions.
