JOURNAL OF LATEX CLASS FILES, VOL. X, NO. X, XX. Composition-Aided Face Photo-Sketch Synthesis. Jun Yu, Senior Member, IEEE, Shengjie Shi, Fei Gao, Dacheng Tao, Fellow, IEEE, and Qingming Huang, Fellow, IEEE. arXiv v2 [cs.CV] 10 Jul 2018.

Abstract—Face photo-sketch synthesis aims at generating a facial sketch (or photo) conditioned on a given photo (or sketch). It has wide applications, including digital entertainment and law enforcement. Despite the great progress achieved by existing methods, they mostly yield blurring and great deformation over various facial components. To tackle this challenge, we propose to use facial composition information to aid the synthesis of face sketches/photos. Specifically, we propose a novel composition-aided generative adversarial network (CA-GAN) for face photo-sketch synthesis. First, we utilize paired inputs, including a face photo/sketch and the corresponding pixel-wise face labels, for generating the sketch/photo. Second, we propose an improved pixel loss, termed compositional loss, to focus training on hard-to-generate components and delicate facial structures. Moreover, we use stacked CA-GANs (SCA-GAN) to further rectify defects and add compelling details. Experimental results show that our method is capable of generating identity-preserving and visually comfortable sketches and photos over a wide range of challenging data. Besides, cross-dataset photo-sketch synthesis evaluations demonstrate that the proposed method has considerable generalization ability.

Index Terms—Face photo-sketch synthesis, face hallucination, image translation, generative adversarial network, compositional loss.

I. INTRODUCTION

Face photo-sketch synthesis refers to synthesizing a face sketch (or photo) given one input face photo (or sketch). It has a wide range of applications, such as digital entertainment and law enforcement.
Ideally, the synthesized photo or sketch portrait should be appearance-preserving and photo/sketch-realistic, so that it yields both high sketch identification accuracy and excellent perceptual quality. Despite the great success achieved in this area, existing photo-sketch synthesis methods [1], even the most advanced deep learning based method [2], yield serious blurring and deformation in synthesized sketches and photos [3] (see Fig. 1). Recently, generative adversarial networks (GANs) [5] have achieved great success in image transformation, e.g. image style transfer [4], image super-resolution [6], and image-to-image translation [7].

Jun Yu and Shengjie Shi are with the Key Laboratory of Complex Systems Modeling and Simulation, School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, China. E-mail: jacobshi777@hotmail.com, yujun@hdu.edu.cn. Fei Gao is with the Key Laboratory of Complex Systems Modeling and Simulation, School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, China, and the State Key Laboratory of Integrated Services Networks, Xidian University, Xi'an, China. E-mail: gaofei@hdu.edu.cn. Dacheng Tao is with the UBTech Sydney Artificial Intelligence Institute, and the School of Information Technologies, in the Faculty of Engineering and Information Technologies, The University of Sydney, J12 Cleveland St, Darlington, NSW 2008, Australia. E-mail: dacheng.tao@sydney.edu.au. Qingming Huang is with the School of Computer and Control Engineering, University of Chinese Academy of Sciences, Beijing, China. E-mail: qmhuang@ucas.ac.cn. Corresponding author: Fei Gao, gaofei@hdu.edu.cn.

Fig. 1. Illustration of results of existing methods and the proposed methods. (a) Input, (b) MrFSPS [1], (c) cgan [4], (d) our CA-GAN, (e) our SCA-GAN, (f) Ground truth. Our results show more natural textures and details.
The face photo-sketch synthesis process can be naturally formulated as a photo-to-sketch and sketch-to-photo translation problem, which can be handled by a conditional generative adversarial network (cgan) model [4]. Wang et al. [8] therefore tested the vanilla cgan for facial sketch generation. The results show that cgan is promising for yielding sketch-like textures. However, as the vanilla cgan only takes the face photo as input, it is difficult for the model to learn the structural relationship among the facial components given no composition information, thus resulting in deformation of some facial parts (see Fig. 1). Since faces are under strong geometric constraints with complicated structural details, it is promising to use facial composition information to help the generation of sketch portraits. In this paper, we propose to use pixel-wise face labelling masks to characterize the facial composition. This is motivated by the following two observations. First, the facial structure can be well represented by pixel-wise face labelling masks. In particular, the pixel-wise labels can be mapped to a face photo/sketch one-by-one, thus preserving the personal

information in faces. Second, it is easy to access pixel-wise facial labels thanks to recent developments in face parsing techniques [9], thus avoiding heavy human annotation and remaining practical at test time. Moreover, we propose an improved pixel loss, termed compositional loss, for learning the photo/sketch generator. In typical image generation methods, the pixel loss (i.e. reconstruction error) is uniformly calculated across the whole image as (part of) the objective [4]. Thus large components that comprise a vast number of pixels dominate the training procedure, preventing the model from generating delicate facial structures. However, for face photos/sketches, large components are typically unimportant for recognition (e.g. background) or easy to generate (e.g. facial skin). In contrast, small components (e.g. eyes) are critical for recognition and difficult to generate, because they comprise complicated structures. To eliminate this barrier, we introduce a weighting factor for the distinct pixel loss of each component, which down-weights the loss assigned to large components. In other words, our compositional loss focuses training on hard components and prevents the large components from overwhelming the generator during training.

In this paper, we propose a Composition-Aided Generative Adversarial Network (CA-GAN) for face photo-sketch synthesis. Our model is based on the cgan infrastructure. First, we utilize paired inputs, including a face photo and the corresponding pixel-wise face labelling masks, for generating the portrait. Second, we use the proposed novel compositional loss for training the GAN. Moreover, we use stacked CA-GANs (SCA-GAN) for refinement, which proves capable of rectifying defects and adding compelling details [6]. As the proposed framework jointly exploits the image appearance space and the structural composition space, it is capable of generating natural face photos and sketches.
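The effect of such an inverse-frequency weighting factor can be illustrated with a toy Python sketch; the component sizes and per-pixel errors below are invented purely for illustration:

```python
# Toy illustration: contribution of each facial component to a uniform
# (global) pixel loss versus an inverse-frequency-weighted loss.
# All sizes (in pixels) and mean per-pixel L1 errors are made-up numbers.
total = 200 * 250  # hypothetical image size
sizes = {"background": 20000, "skin": 24000, "hair": 5000, "eyes": 600, "mouth": 400}
errors = {"background": 0.02, "skin": 0.03, "hair": 0.06, "eyes": 0.12, "mouth": 0.10}

# Uniform pixel loss: each component contributes (size * mean error) / total,
# so large, easy regions dominate the objective.
uniform = {c: sizes[c] * errors[c] / total for c in sizes}

# Weighting by gamma_c = total / size_c rescales each contribution back to
# the component's mean per-pixel error, letting small parts matter.
weighted = {c: (total / sizes[c]) * uniform[c] for c in sizes}

print(uniform["background"] > uniform["eyes"])    # True: background dominates
print(weighted["eyes"] > weighted["background"])  # True: eyes dominate now
```

After weighting, each component contributes its mean per-pixel error regardless of its size, so small components are no longer drowned out.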
Experimental results show that our methods outperform existing methods in terms of perceptual quality, and obtain highly comparable quantitative evaluation results. We also verify the excellent generalization ability of our new model across different datasets. The contributions of this paper are mainly three-fold. First, to the best of our knowledge, this is the first work to employ facial composition information in the loop of learning a face photo-sketch synthesis model. Second, we propose an improved pixel loss, termed compositional loss, to focus training on hard-to-generate components and delicate facial structures, which is demonstrated to be highly effective. This both speeds the training up and greatly stabilizes it. Third, the proposed method yields identity-preserving, realistic, and visually comfortable photos and sketches over a wide range of challenging data. Besides, our methods show considerable generalization ability.

The rest of this paper is organized as follows. Section II introduces related work. Section III details the proposed sketch portrait generation framework. Experimental results and analysis are presented in Section IV. Section V concludes this paper.

II. RELATED WORK

A. Face Photo-Sketch Synthesis

Tremendous efforts have been made to develop facial photo-sketch synthesis methods, which can be broadly classified into two groups: data-driven methods and model-driven methods [10]. Data-driven refers to methods that try to synthesize a photo/sketch by using a linear combination of similar training photo/sketch patches [11], [12], [13], [14], [15], [16]. These methods have two main parts: similar photo/sketch patch searching and linear combination weight computation. The similar photo/sketch searching process heavily increases test time and makes it difficult to use a large-scale training dataset.
Model-driven refers to methods that learn a mathematical function offline to map a photo to a sketch or a sketch to a photo [1], [17], [18], [19]. Traditionally, researchers have made great efforts to explore handcrafted features, neighbour searching strategies, and learning techniques. However, these methods typically yield serious blurring and great deformation in synthesized face photos and sketches. Inspired by the great success achieved by deep learning techniques [20], [5] in various image-to-image translation tasks [7], some attempts have been made to learn deep learning based face sketch synthesis models. To name a few, Zhang et al. [21] propose to use a branched fully convolutional network (FCN) for generating structural and textural representations, respectively, and then use face parsing results to fuse them together. However, the resulting sketches have blurring and ringing effects. Recently, Wang et al. [8] proposed to first use the vanilla cgan to generate a sketch and then refine it by using a post-processing approach termed back projection. Experimental results show that cgan can produce sketch-like structures in the synthesized portrait. However, there is also great deformation in various facial parts. More recently, Wang et al. [22] use CycleGAN [23] as the prototype, and propose to use multi-scale discriminators [24] for generating high-resolution sketches/photos. This method shows distinctly improved performance and yields sketch-realistic textures. However, there are still slight blurring defects and degradations in the color components. A few existing methods use composition information to guide the generation of the face sketch [21], [25]. In particular, they try to learn a specific generator for each component and then combine the components to form the entire face. Similar ideas have also been proposed for face image hallucination [26], [27].
In contrast, we propose to employ facial composition information in the loop of learning the generator to boost performance.

B. Image-to-Image Translation

Our work is highly related to image-to-image translation, which has achieved significant progress with the development of generative adversarial networks (GANs) [5], [28] and variational auto-encoders (VAEs) [29]. Among them, the conditional generative adversarial network (cgan) [4] has attracted growing attention, with many interesting works based on it, including conditional face generation [30] and text-to-image

synthesis [6], and image style transfer [31]. All of these have obtained impressive results. Inspired by these observations, we are interested in generating sketch-realistic portraits by using cgan. However, we found the vanilla cgan insufficient for this task, and thus propose to boost performance by both developing the network architecture and modifying the objective.

III. METHOD

A. Preliminaries

The proposed method is capable of handling both sketch synthesis and photo synthesis, because these two procedures are symmetric. In this section, we take face sketch synthesis as an example to introduce our method. Our problem is defined as follows. Given a face photo X, we would like to generate a sketch portrait Y that shares the same identity and has a sketch-realistic appearance. Our key idea is to use the face composition information to help the generation of the sketch portrait. The first step is to obtain the structural composition of a face. As face parsing can well represent the facial composition, we employ pixel-wise face labelling masks M as prior knowledge of the facial composition. The remaining problem is to generate the sketch portrait based on the face photo and composition masks: $\{X, M\} \rightarrow Y$. Here, we propose a composition-aided GAN (CA-GAN) for this purpose. We further employ stacked CA-GANs (SCA-GAN) to refine the generated sketch portraits. Details are given in the remainder of this section.

B. Face Decomposition

Assume that the given face photo is $X \in \mathbb{R}^{m \times n \times d}$, where $m$, $n$, and $d$ are the height, width, and number of channels, respectively. We decompose the input photo into $C$ components (e.g. hair, nose, mouth, etc.) by employing the face parsing method proposed by Liu et al. [9] due to its excellent performance. For notational convenience, we refer to this model as P-Net. By using P-Net, we get the pixel-wise labels related to 8 components, i.e.
two eyes, two eyebrows, nose, upper and lower lips, inner mouth, facial skin, hair, and background [9]. We propose to use soft labels (probabilistic outputs) in this paper. Let $M = \{M^{(1)}, \ldots, M^{(C)}\} \in \mathbb{R}^{m \times n \times C}$ denote the pixel-wise face labelling masks. Here, $M^{(c)}_{i,j} \in [0, 1]$, s.t. $\sum_c M^{(c)}_{i,j} = 1$, denotes the probability that pixel $X_{i,j}$ belongs to the $c$-th component, as predicted by P-Net, $c = 1, \ldots, C$ with $C = 8$. In a preliminary implementation, we also tested the performance of hard labels (binary outputs), i.e. each value $M^{(c)}_{i,j}$ denotes whether $X_{i,j}$ belongs to the $c$-th component. Because it is almost impossible to get absolutely precise pixel-wise face labels, using hard labels occasionally yields deformation in the border area between two nearby components.

C. Composition-Aided GAN (CA-GAN)

In the proposed framework, we first utilize paired inputs, including a face photo and the corresponding pixel-wise face labels, for generating the portrait. Second, we propose an improved pixel loss, termed the compositional loss, to focus training on hard-to-generate components and delicate facial structures. Moreover, we use stacked CA-GANs to further rectify defects and add compelling details. Details are introduced in the following subsections.

Fig. 2. Generator architecture of the proposed composition-aided generative adversarial network (CA-GAN). The facial labels produced by the (fixed) P-Net enter a Composition Encoder, the input photo enters an Appearance Encoder, and a shared Decoder produces the generated sketch.

1) Generator Architecture: The architecture of the generator in CA-GAN is presented in Fig. 2. In our case, the generator needs to translate two inputs (i.e., the face photo X and the face labelling masks M) into a single output Y. Because X and M are of different modalities, we propose to use distinct encoders to model them, referred to as the Appearance Encoder and the Composition Encoder, respectively. The features of these two encoders are concatenated at the bottleneck layer for the decoder [32].
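A shape-level numpy sketch of this two-encoder fusion follows; the layer sizes and downsampling factor are invented for illustration, and the real encoders are learned convolutional stacks:

```python
import numpy as np

def encoder(x, channels_out):
    """Stand-in for a conv encoder: 8x spatial downsampling plus a channel
    projection. CA-GAN uses learned U-Net conv stacks; this mimics shapes only."""
    pooled = x[::8, ::8, :]                        # fake 8x downsampling
    w = np.ones((pooled.shape[-1], channels_out))  # fake 1x1 channel projection
    return pooled @ w

photo = np.zeros((64, 64, 3))   # appearance input X (toy size)
masks = np.zeros((64, 64, 8))   # composition input M, with C = 8 components

f_app = encoder(photo, 128)     # Appearance Encoder features
f_cmp = encoder(masks, 128)     # Composition Encoder features

# Bottleneck fusion: concatenate the two feature maps along the channel axis;
# a decoder (omitted here) would then upsample the fused representation.
bottleneck = np.concatenate([f_app, f_cmp], axis=-1)
assert bottleneck.shape == (8, 8, 256)
```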
In this way, the information of both the face photo and the facial composition can be well modeled. The architectures of the encoder, decoder, and discriminator are exactly the same as those used in [4] but without dropout, following the shape of a U-Net. Specifically, we concatenate all channels at layer $i$ in both encoders with those at layer $n - i$ in the decoder. Details of the network can be found in the appendix of [4]. In addition, we tested a network with one single encoder that takes the concatenation of X and M, i.e. $[X, M^{(1)}, \ldots, M^{(C)}] \in \mathbb{R}^{m \times n \times (d + C)}$, as the input. This network is the most straightforward solution for simultaneously encoding the face photo and the composition masks. Experimental results show that using this structure decreases the face sketch recognition accuracy by about 2 percent and yields slightly blurred effects in the hair area.

2) Compositional Loss: Previous approaches to cgans have found it beneficial to mix the GAN objective with a pixel loss (i.e. reconstruction error) for various tasks, e.g. image translation [4] and super-resolution reconstruction [7]. Besides, using the normalized $L_1$ distance encourages less blurring than the $L_2$ distance. We therefore use the normalized $L_1$ distance between the generated sketch $\hat{Y}$ and the target $Y$ in the computation of the pixel loss. We introduce the compositional loss starting from the standard pixel loss for image generation. In previous works on cgans, the pixel loss is calculated over the whole image. For distinction, we refer to it as the global pixel loss in this paper.

Global pixel loss. Suppose both $\hat{Y}$ and $Y$ have shape $m \times n$. Let $\mathbf{1}$ be an $m \times n$ matrix of ones. The global pixel loss is

expressed as:
$$L_{L_1,\mathrm{global}}(Y, \hat{Y}) = \frac{1}{mn} \| Y - \hat{Y} \|_1. \quad (1)$$
In the global pixel loss, the $L_1$ loss related to the $c$-th component, $c = 1, 2, \ldots, C$, can be expressed as:
$$L^{(c)}_{L_1,\mathrm{global}} = \frac{1}{mn} \| Y \odot M^{(c)} - \hat{Y} \odot M^{(c)} \|_1, \quad (2)$$
with $L_{L_1,\mathrm{global}} = \sum_c L^{(c)}_{L_1,\mathrm{global}}$. Here, $\odot$ denotes the pixel-wise product operation. As all the pixels are treated equally in the global pixel loss, large components (e.g. background and facial skin) contribute more to learning the generator than small components (e.g. eyes and mouth).

Compositional loss. To eliminate this barrier, we introduce a weighting factor, $\gamma_c$, to balance the distinct pixel loss of each component. Specifically, inspired by the idea of the balanced cross-entropy loss [33], we set $\gamma_c$ by inverse component frequency. When we adopt the soft facial labels, $M^{(c)} \ast \mathbf{1}$ is the sum of the probabilities of every pixel belonging to the $c$-th component. Here, $\ast$ denotes the convolution operation. If we adopt the hard facial labels, it becomes the number of pixels belonging to the $c$-th component. The component frequency is thus $\frac{M^{(c)} \ast \mathbf{1}}{mn}$. So we set $\gamma_c = \frac{mn}{M^{(c)} \ast \mathbf{1}}$ and multiply it with $L^{(c)}_{L_1,\mathrm{global}}$, resulting in the balanced $L_1$ loss:
$$L^{(c)}_{L_1,\mathrm{cmp}} = \frac{1}{M^{(c)} \ast \mathbf{1}} \| Y \odot M^{(c)} - \hat{Y} \odot M^{(c)} \|_1. \quad (3)$$
Obviously, the balanced $L_1$ loss is exactly the normalized $L_1$ loss over the related componential region. The compositional loss is defined as
$$L_{L_1,\mathrm{cmp}}(Y, \hat{Y}) = \sum_{c=1}^{C} L^{(c)}_{L_1,\mathrm{cmp}}. \quad (4)$$
As $\gamma_c$ is broadly in inverse proportion to the component size, it reduces the loss contribution from large components. From the other perspective, it up-weights the losses assigned to small and hard-to-generate components. In practice we use a weighted average of the global pixel loss and the compositional loss:
$$L_{L_1}(Y, \hat{Y}) = \alpha L_{L_1,\mathrm{cmp}} + (1 - \alpha) L_{L_1,\mathrm{global}}, \quad (5)$$
where $\alpha \in [0, 1]$ is used to balance the global pixel loss and the compositional pixel loss.
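Under the definitions above, Eqs. (1)-(5) can be sketched in numpy for single-channel images; the random masks below merely stand in for real parsing output:

```python
import numpy as np

def pixel_losses(Y, Y_hat, M, alpha=0.7):
    """Global (Eq. 1), compositional (Eqs. 3-4), and combined (Eq. 5) L1 losses.

    Y, Y_hat : (m, n) target and generated sketches
    M        : (m, n, C) soft component masks summing to 1 at every pixel
    """
    m, n, C = M.shape
    diff = np.abs(Y - Y_hat)                 # per-pixel L1 error
    L_global = diff.sum() / (m * n)          # Eq. (1)
    # Eq. (3): each component's loss is normalized by its soft size ||M^(c)||_1
    L_cmp = sum((diff * M[..., c]).sum() / M[..., c].sum() for c in range(C))
    return L_global, L_cmp, alpha * L_cmp + (1 - alpha) * L_global  # Eq. (5)

rng = np.random.default_rng(1)
Y, Y_hat = rng.random((16, 16)), rng.random((16, 16))
M = rng.random((16, 16, 8))
M /= M.sum(axis=-1, keepdims=True)           # make the soft masks valid
L_global, L_cmp, L_total = pixel_losses(Y, Y_hat, M)
```

Because the per-pixel mask values sum to 1, the component-wise terms of Eq. (2) recover Eq. (1) exactly.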
We adopt this form in our experiments and set $\alpha = 0.7$, as it yields slightly improved perceptual quality over the compositional loss alone.

3) Objective: Following the objective of the vanilla cgan, we express the adversarial loss of CA-GAN as:
$$L_{\mathrm{adv}}(G, D) = \mathbb{E}_{X,M,Y \sim p_{\mathrm{data}}(X,M,Y)}[\log D(X, M, Y)] + \mathbb{E}_{X,M \sim p_{\mathrm{data}}(X,M)}[\log(1 - D(X, M, G(X, M)))]. \quad (6)$$
Similar to the settings in [4], we do not add a Gaussian noise $z$ as an input. Besides, we do not use dropout in the generator. Finally, we use a combination of the adversarial loss and the weighted pixel loss to learn the generator. We aim to solve:
$$G^* = \arg\min_G \max_D L_{\mathrm{adv}} + \lambda L_{L_1}, \quad (7)$$
where $\lambda$ is a weighting factor.

Fig. 3. Pipeline of the proposed stacked composition-aided generative adversarial network (SCA-GAN). The Stage-I generator $G^{(1)}$ takes the input photo and facial labels; the Stage-II generator $G^{(2)}$ additionally takes the Stage-I output. Each stage has a reconstruction error against the target sketch and a discriminator ($D^{(1)}$, $D^{(2)}$) judging real/fake pairs.

D. Stacked Refinement Network

Finally, we use stacked CA-GANs (SCA-GAN) to further boost the quality of the generated sketch portrait [6]. The architecture of SCA-GAN is illustrated in Fig. 3. SCA-GAN includes two-stage GANs, each comprising a generator and a discriminator, denoted sequentially by $G^{(1)}$, $D^{(1)}$, $G^{(2)}$, $D^{(2)}$. In SCA-GAN, the Stage-I GAN yields an initial portrait, $\hat{Y}^{(1)}$, based on the given face photo X and pixel-wise label masks M. Afterwards, the Stage-II GAN takes $\{X, M, \hat{Y}^{(1)}\}$ as inputs to rectify defects and add compelling details, yielding a refined sketch portrait, $\hat{Y}^{(2)}$. The network architectures of these two GANs are almost the same, except that the inputs of $G^{(2)}$ and $D^{(2)}$ have one more channel (i.e. the initial sketch) than those of $G^{(1)}$ and $D^{(1)}$, correspondingly. Here, the given photo and the initial sketch are concatenated and input into the appearance encoder. In the implementation, we also tested an SCA-GAN network with one single discriminator shared by the two GANs.
However, it cannot yield vivid hair.

E. Optimization and Implementation

In the proposed method, the input image should be of a fixed size. In the default setting of cgan [4], an input image of arbitrary size is resized to the target size. However, we observed that resizing the input face photo yields serious blurring and great deformation in the generated sketch [8], [22]. In contrast, by padding the input image to the target size, we obtain considerable performance improvement. We therefore use padding across all the experiments. To optimize our networks, following [4], we alternate between one gradient descent step on D and one step on G. We use minibatch SGD and apply the Adam solver. For clarity, we illustrate the optimization procedure of SCA-GAN in Algorithm 1. In our experiments, we use a batch size of 1 and run for 700 epochs in all the experiments. Besides, we apply instance normalization, which has shown great superiority over batch normalization in the task of image generation [4]. We trained our models on a single Pascal Titan X GPU. With a training set of 500 samples, it took about 3 hours to train the CA-GAN model and 6 hours to train the SCA-GAN model. At test time, all models run in well under one second on this GPU.
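The pad-rather-than-resize preprocessing can be sketched in numpy; the 256 × 256 target used here is an arbitrary example, not a size taken from the paper:

```python
import numpy as np

def pad_to(img, target_h, target_w, fill=0):
    """Center-pad an image to (target_h, target_w) without resampling it,
    so no interpolation artefacts are introduced."""
    h, w = img.shape[:2]
    top, left = (target_h - h) // 2, (target_w - w) // 2
    widths = ((top, target_h - h - top), (left, target_w - w - left))
    widths += ((0, 0),) * (img.ndim - 2)      # leave channel axes untouched
    return np.pad(img, widths, mode="constant", constant_values=fill)

photo = np.ones((250, 200, 3), dtype=np.uint8)  # a toy input photo
padded = pad_to(photo, 256, 256)
assert padded.shape == (256, 256, 3)
assert padded.sum() == photo.sum()              # original pixels are untouched
```

Unlike resizing, padding preserves every original pixel value, which is consistent with the improvement reported above.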

Algorithm 1 Optimization procedure of SCA-GAN (for sketch synthesis).
Input: a set of training instances, each in the form of a triplet {a face photo X, pixel-wise label masks M, a target sketch Y}; iteration counter t = 0; max iteration T.
Output: optimal $G^{(1)}, D^{(1)}, G^{(2)}, D^{(2)}$.
Initialize $G^{(1)}, D^{(1)}, G^{(2)}, D^{(2)}$;
for t = 1 to T do
1. Randomly select one training instance {X, M, Y}.
2. Estimate the initial sketch portrait: $\hat{Y}^{(1)} = G^{(1)}(X, M)$.
3. Estimate the refined sketch portrait: $\hat{Y}^{(2)} = G^{(2)}(X, M, \hat{Y}^{(1)})$.
4. Update $D^{(1)}$: $D^{(1)} = \arg\max_{D^{(1)}} L_{\mathrm{adv}}(G^{(1)}, D^{(1)})$.
5. Update $D^{(2)}$: $D^{(2)} = \arg\max_{D^{(2)}} L_{\mathrm{adv}}(G^{(2)}, D^{(2)})$.
6. Update $G^{(1)}$: $G^{(1)} = \arg\min_{G^{(1)}} L_{\mathrm{adv}}(G^{(1)}, D^{(1)}) + \lambda L_{L_1}(Y, \hat{Y}^{(1)})$.
7. Update $G^{(2)}$: $G^{(2)} = \arg\min_{G^{(2)}} L_{\mathrm{adv}}(G^{(2)}, D^{(2)}) + \lambda L_{L_1}(Y, \hat{Y}^{(2)})$.
end for

IV. EXPERIMENTS

In this section, we first introduce the experimental settings and then present a series of empirical results to verify the effectiveness of the proposed method.

A. Settings

1) Datasets: We conducted experiments on three publicly available databases: the CUHK Face Sketch database (CUHK) [34], the CUFSF database [35], and the VIPSL-FS database [19], [36]. The CUHK database consists of 606 face photos from three databases: the CUHK student database [37] (188 persons), the AR database [38] (123 persons), and the XM2VTS database [39] (295 persons). The CUFSF database includes 1194 persons [40]. In the CUFSF database, there are lighting variations in the face photos and shape exaggeration in the sketches, making CUFSF very challenging. For each person, there is one face photo and one face sketch drawn by an artist in both the CUHK database and the CUFSF database. The VIPSL-FS database includes 200 persons; for each person, there are 5 sketches drawn by different artists.
Because the original sketches in the VIPSL-FS database are of high resolution, we use it to test the performance of the proposed method for generating high-resolution face photos/sketches. Following existing methods [3], all the face images (photos and sketches) are geometrically aligned relying on three points: the two eye centers and the mouth center. For the CUHK and CUFSF databases, the aligned images are cropped to a fixed size. For the VIPSL-FS database, the aligned face photos are first cropped and then resized; the corresponding pixel-wise label masks are estimated from the photo and then resized accordingly, and the aligned sketches are cropped to the same size. In the following, we present a series of experiments. First, we perform face photo-sketch synthesis on the CUHK, CUFSF, and VIPSL-FS databases, respectively, to evaluate the performance of the proposed methods (see Part IV-B and Part IV-C). Second, we conduct cross-dataset experiments to verify whether the proposed method is independent of the training data (see Part IV-D). Third, we discuss the network configurations of our proposed method on the CUHK database and CUFSF database (see Part IV-E). We use the proposed architecture for both sketch synthesis and photo synthesis, and release all the synthesized sketches and photos online. It is well known that a large training dataset is necessary for learning a GAN-based model. In the experiments, unless otherwise specified, we randomly split each dataset into a training set (80%) and a testing set (20%), with no overlap between them. Besides, we ran the training-testing process 10 times and calculated the average values of the following criteria as the performance measure.

2) Criteria: We adopt the Peak Signal-to-Noise Ratio (PSNR) and Feature Similarity Index Metric (FSIM) [41] between the synthesized image and the ground-truth image to objectively assess the quality of the synthesized image.
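PSNR follows directly from the mean squared error between the two images; a minimal numpy version (the 8 × 8 test images are arbitrary):

```python
import numpy as np

def psnr(ref, test, peak=255.0):
    """Peak Signal-to-Noise Ratio in dB between two equal-shape images."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")              # identical images
    return 10.0 * np.log10(peak ** 2 / mse)

a = np.full((8, 8), 100.0)
b = a + 10.0                             # uniform error of 10, so MSE = 100
print(round(psnr(a, b), 2))              # 10 * log10(255^2 / 100) ≈ 28.13
```

FSIM additionally compares phase congruency and gradient magnitude features, so it is usually computed with a dedicated implementation rather than a few lines.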
It is worth mentioning that, although these metrics work well for evaluating the quality of natural images and have become prevalent in the face photo-sketch synthesis community, their performance on synthesized images is indicative but not infallible [42]. In addition, sketch-based face recognition is often used to assist law enforcement, so it is necessary to verify whether the synthesized images can be used for identity recognition. We therefore statistically evaluate the face recognition accuracy while using the ground-truth image (the photo or the sketch drawn by the artist) as the probe image and the synthesized images (photos or sketches) as the images in the gallery. Null-space linear discriminant analysis (NLDA) [43] is employed to conduct the face recognition experiments. We repeat each face recognition experiment 20 times by randomly partitioning the data and report the average accuracy.

B. Face Sketch Synthesis

Comparison with existing methods: There is great divergence in the experimental settings among existing face sketch synthesis methods. Besides, existing methods are typically tested on the CUHK database and CUFSF database. In this paper, we follow the work presented in [3] and split the datasets in the following ways. For the CUHK student database, 88 photo-sketch pairs are taken for training and the rest for testing. For the AR database, we randomly choose 80 pairs for training and the remaining 43 pairs for testing. For the XM2VTS database, we randomly choose 100 pairs for training and the remaining 195 pairs for testing. Fig. 4 presents some synthesized face sketches from different methods on the CUHK database and the CUFSF database. Four advanced methods are compared: MrFSPS [1], RSLCR [3], FCN [2], and cgan [4]. All the synthesized sketches by RSLCR and FCN are those released by Wang et al. at: All the synthesized sketches by MrFSPS are those released by the author Peng at: MrFSPS.html.

Fig. 4. Examples of synthesized face sketches on the CUHK database and the CUFSF database. (a) Photo, (b) MrFSPS [1], (c) RSLCR [3], (d) FCN [2], (e) BP-GAN [8], (f) cgan [4], (g) CA-GAN, (h) SCA-GAN, and (i) Sketch drawn by artist. From top to bottom, the examples are selected from the CUHK student database [37], the AR database [38], the XM2VTS database [39], and the CUFSF database [40], sequentially.

As shown in Fig. 4, the cgan, CA-GAN, and SCA-GAN methods can generate sketch-like textures (e.g. the hair region) and shadows. In contrast, BP-GAN yields over-smoothed sketch portraits, while MrFSPS, RSLCR, and FCN yield serious blurring and great deformation in various facial parts. Besides, there are deformations in the sketches synthesized by cgan, especially in the mouth area. In contrast, CA-GAN alleviates such defects, and SCA-GAN almost eliminates them. This illustrates the effectiveness of the proposed methods.

Table I presents the average PSNR, FSIM, and face sketch recognition accuracy (Acc.) of the most advanced face sketch synthesis methods and the proposed ones, on the CUHK database and CUFSF database. The evaluation method is exactly the same as that presented in [3]. Specifically, in the face sketch recognition experiment, we randomly split the CUHK database into a training set (150 synthesized sketches and corresponding ground-truths) and a testing set (188 sketches) that constitutes the gallery. For the CUFSF database, we randomly choose 300 synthesized sketches and corresponding ground-truths for training and 644 synthesized sketches as the gallery. We repeat each face recognition experiment 20 times by randomly partitioning the data. As shown in Table I, the PSNR values of all these methods are highly comparable.
According to FSIM, cgan, CA-GAN, and SCA-GAN outperform existing methods, except MrFSPS, on both the CUHK database and the CUFSF database. According to recognition accuracy, cgan, CA-GAN, and SCA-GAN are 2-3 percent inferior to MrFSPS and RSLCR on the CUHK database, but 4-5 percent superior to them on the CUFSF database. Note that the CUFSF database is much larger than the CUHK database; besides, the lighting variation in face photos and the shape exaggeration in sketches both increase the difficulty of face sketch-photo synthesis and recognition. We therefore conclude that cgan, CA-GAN, and SCA-GAN outperform existing methods in terms of face sketch recognition accuracy. There is no considerable difference among cgan, CA-GAN, and SCA-GAN, in terms of these three criteria, across both the CUHK database and the CUFSF database. In addition, PS2-MAN [22] achieves an FSIM value on the CUHK database which is slightly better

than both CA-GAN and SCA-GAN.

TABLE I. COMPARISON WITH EXISTING FACE SKETCH SYNTHESIS METHODS IN TERMS OF THE AVERAGE PSNR, FSIM (%), AND FACE RECOGNITION ACCURACY (ACC.) (%), ON THE CUHK AND CUFSF DATABASES. THE EXPERIMENTAL SETTINGS FOLLOW RSLCR [3]. Columns: MrFSPS [1], RSLCR [3], FCN [2], BP-GAN [8], cgan [4], CA-GAN, and SCA-GAN; rows: PSNR, FSIM, and Acc. on the CUHK and CUFSF databases.

From Fig. 4 and Table I, we can safely conclude that both CA-GAN and SCA-GAN generate much better sketches and achieve highly comparable quantitative evaluations, in comparison with existing face sketch synthesis methods.

High-resolution sketch synthesis: We add one convolutional layer to the encoders and one deconvolutional layer to the decoders of cgan, CA-GAN, and SCA-GAN, for the purpose of generating high-resolution sketches. We use 1000 photo-sketch pairs from the VIPSL-FS database here, randomly split into a training set and a testing set by 80%:20%. Fig. 5 shows the sketch portraits generated by cgan [4] and SCA-GAN. Obviously, cgan yields checkerboard-like textures and blurring in the hair area. Besides, cgan yields deformation in small facial components (see the left eye of the first person in Fig. 5). In contrast, SCA-GAN generates very high-quality, sketch-realistic portraits, alleviating such defects.

Quantitative evaluation: Since GANs typically need a large amount of training data, we further conduct the sketch synthesis experiment on the CUHK, CUFSF, and VIPSL-FS databases by randomly splitting each database into a training set (80%) and a testing set (20%). In face sketch recognition, we randomly split the CUHK database into a training set (70 synthesized sketches and corresponding ground-truths) and a testing set (188 sketches) that constitutes the gallery.
For the CUFSF database, we randomly choose 120 synthesized sketches and the corresponding ground truths for training, and use 250 synthesized sketches as the gallery. For the VIPSL-FS database, we randomly choose 20 synthesized sketches and the corresponding ground truths for training, and use 40 synthesized sketches as the gallery. We repeat each face sketch recognition experiment 20 times by randomly partitioning the data. We run the training-testing process 10 times, and calculate the average PSNR, FSIM, and face recognition accuracy (Acc.) of the synthesized sketches. The corresponding results are shown in Table II. As shown in Table II, there is no distinct difference among cGAN, CA-GAN, and SCA-GAN in terms of PSNR. According to FSIM, CA-GAN is highly comparable with cGAN, and SCA-GAN shows slight superiority over both of them. In addition, both CA-GAN and SCA-GAN achieve higher face sketch recognition accuracy on the CUHK database, and remain comparable with cGAN on both the CUFSF and VIPSL-FS databases. Recall that the sketches generated by SCA-GAN look most like the input face (as illustrated in Figs. 1, 4, and 5). We can safely draw the conclusion that both CA-GAN and SCA-GAN are capable of generating identity-preserving and sketch-realistic sketch portraits.

Fig. 5. Examples of high-resolution synthesized face sketches on the VIPSL-FS database. (a) Photo, (b) cGAN [4], (c) CA-GAN, (d) SCA-GAN, and (e) sketch drawn by the artist.

TABLE II. Average PSNR, FSIM (%), and face recognition accuracy (Acc.) (%) of the synthesized sketches (cGAN, CA-GAN, and SCA-GAN) on the CUHK, CUFSF, and VIPSL-FS databases. Each database is randomly split into a training set (80%) and a testing set (20%).

C. Face Photo Synthesis

We exchange the roles of the sketch and photo in the proposed model, and evaluate the face photo synthesis performance on the aforementioned datasets separately.
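The random 80%/20% partitioning with repeated runs, used for both the sketch synthesis and photo synthesis evaluations, can be sketched as follows. Here `score_fn` is a hypothetical stand-in for computing any one of the three metrics (PSNR, FSIM, or recognition accuracy) on a single partition; this is an illustrative protocol sketch, not the authors' code.

```python
import random

def repeated_split_eval(items, score_fn, train_frac=0.8, repeats=10, seed=0):
    """Average a metric over repeated random train/test partitions."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    n_train = int(len(items) * train_frac)
    scores = []
    for _ in range(repeats):
        shuffled = items[:]
        rng.shuffle(shuffled)
        train, test = shuffled[:n_train], shuffled[n_train:]
        scores.append(score_fn(train, test))
    return sum(scores) / len(scores)

# Toy usage: the "score" is just the test-set size, constant across repeats.
pairs = list(range(1000))  # stand-ins for 1000 photo-sketch pairs
avg = repeated_split_eval(pairs, lambda tr, te: len(te))
print(avg)  # → 200.0 for a 1000-item set with an 80/20 split
```

Averaging over many random partitions reduces the variance introduced by any single lucky or unlucky split, which matters on the smaller databases used here.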
Fig. 6 illustrates the synthesized face photos of MrFSPS [1], cGAN, CA-GAN, and SCA-GAN. All the synthesized photos of MrFSPS are those released by the author Peng at: MrFSPS.html. Obviously, the face photos synthesized by MrFSPS are heavily blurred. Besides, there are serious degradations in the photos synthesized by cGAN. In contrast, the photos generated by both CA-GAN and SCA-GAN consistently show considerable improvement in perceptual quality.

Fig. 6. Examples of synthesized face photos. (a) Sketch drawn by the artist, (b) MrFSPS [1], (c) cGAN, (d) CA-GAN, (e) SCA-GAN, and (f) ground-truth photo. From top to bottom, the examples are selected from the CUHK student database [37], the AR database [38], the XM2VTS database [39], and the CUFSF database [40].

Table III presents the average PSNR, FSIM, and face recognition accuracy (Acc.) on the CUHK, CUFSF, and VIPSL-FS databases. Obviously, SCA-GAN obtains the best FSIM values across all three databases. Besides, both CA-GAN and SCA-GAN outperform cGAN in terms of face recognition on the CUHK and VIPSL-FS databases, but are inferior to cGAN on the CUFSF database. From Fig. 6 and Table III, we can see that both CA-GAN and SCA-GAN generate better face photos and achieve highly comparable quantitative results as compared with cGAN. We can safely draw the conclusion that both CA-GAN and SCA-GAN are capable of generating identity-preserving and natural face photos.

TABLE III. Average PSNR, FSIM (%), and face recognition accuracy (Acc.) (%) of the synthesized photos (cGAN, CA-GAN, and SCA-GAN) on the CUHK, CUFSF, and VIPSL-FS databases. Each database is randomly split into a training set (80%) and a testing set (20%).

Comparison with existing methods: To date, only a few methods have been proposed for face photo synthesis. Here we compare the proposed method with two advanced methods: MrFSPS [1] and PS²-MAN [22]. MrFSPS achieves an FSIM of 80.31% and a face recognition accuracy of 96.7% with the synthesized photos on the CUHK database; on the CUFSF database, it achieves a face recognition accuracy of 59.37%. As reported in [22], PS²-MAN achieves an FSIM value of 80.62% on the CUHK database. In general, the performance of CA-GAN and SCA-GAN is highly comparable with that of MrFSPS and PS²-MAN, while the synthesized photos are perceptually better. Specifically, there are serious blurred effects in the photos synthesized by MrFSPS, and visible degradations in the color components of those synthesized by PS²-MAN [22]. In contrast, the results of CA-GAN and SCA-GAN express more natural colors and details.

Fig. 7. Examples of synthesized high-resolution face photos on the VIPSL-FS database. (a) Sketch drawn by the artist, (b) cGAN [4], (c) CA-GAN, (d) SCA-GAN, and (e) ground-truth photo.

High-resolution photo synthesis: In addition, we evaluate the performance of cGAN, CA-GAN, and SCA-GAN in synthesizing high-resolution photos on the VIPSL-FS database. The experimental settings are exactly the same as those presented in Section IV-B, except that the roles of the sketch and photo are exchanged. Fig. 7 illustrates the synthesized photos. Obviously, cGAN yields checkerboard-like textures and blurred effects in the hair area, and deforms small facial components (e.g., the left eye of the second person in Fig. 7(b)). In contrast, both CA-GAN and SCA-GAN generate very high-quality and natural face photos that alleviate such defects, and the photos synthesized by SCA-GAN are of the best perceptual quality.

D. Dataset Independence

To verify the generalization ability of the learned model, we conduct two cross-dataset experiments.

Cross-database experiment: First, we apply the model learned from the CUHK training set to the whole VIPSL-FS database. There is great divergence in person identity, background, and sketch style between these two datasets. Fig. 8 illustrates the synthesized sketches on the VIPSL-FS
database, and Fig. 9 the synthesized photos. Obviously, both CA-GAN and SCA-GAN generate much better sketches and photos than cGAN, and the results of SCA-GAN express the best appearance. Table IV lists the average PSNR, FSIM (%), and face recognition accuracy (Acc.) (%) of the synthesized photos/sketches on the VIPSL-FS database. For the face sketch recognition task, we randomly choose 100 synthesized sketches and the corresponding ground truths for training, and use 200 synthesized sketches as the gallery; each face sketch recognition experiment is repeated 20 times by randomly partitioning the data. CA-GAN and SCA-GAN outperform cGAN according to PSNR and FSIM, but are inferior to cGAN according to the face recognition accuracy.

TABLE IV. Average PSNR, FSIM (%), and face recognition accuracy (Acc.) (%) of the synthesized photos/sketches (cGAN, CA-GAN, and SCA-GAN) on the whole VIPSL-FS database, with the model learned from the CUHK training set.

Fig. 8. Synthesized sketches on the VIPSL-FS database while the model is trained on the CUHK database. (a) Photo, (b) cGAN, (c) CA-GAN, (d) SCA-GAN, (e) ground-truth sketch drawn by the artist.

Fig. 9. Synthesized photos on the VIPSL-FS database while the model is trained on the CUHK database. (a) Sketch drawn by the artist, (b) cGAN, (c) CA-GAN, (d) SCA-GAN, (e) ground-truth photo.

Face photo-sketch synthesis of Chinese celebrities: In addition, we test the CA-GAN and SCA-GAN models, trained on the CUHK database, on photos and sketches of Chinese celebrities. These photos and sketches are downloaded from the web, and involve different lighting conditions and backgrounds compared with the images in the training set. Fig. 10 shows the synthesized sketches, and Fig. 11 the synthesized photos. Obviously, our results express more natural textures and details than those of cGAN.

Fig. 10. Synthesized sketches of Chinese celebrities. (a) Photo, (b) cGAN, (c) CA-GAN, (d) SCA-GAN.

Limitations: It is inspiring that both CA-GAN and SCA-GAN show outstanding generalization ability in the sketch synthesis task. However, as shown in Fig. 8, the proposed method cannot handle black margins well, and yields ink marks in the corresponding areas. This might be caused by the fact that there are few black margins in the CUHK dataset, so the generator learns little about how to process them. In addition, the synthesized photos in the cross-dataset experiment are unsatisfactory. This might be due to the great divergence between the input sketches in terms of textures and styles. It is necessary to further improve the generalization ability of the photo synthesis models.

E. Discussions on the Network Configurations

1) Ablation Study: There are mainly three components in CA-GAN, i.e., (i) using face labels in G; (ii) using face labels in D; and (iii) the compositional loss. To illustrate the contribution of each component, we accordingly evaluate the performance of the following settings: cGAN, cGAN+i, cGAN+ii, cGAN+iii, CA-GAN (i.e., cGAN+i+ii+iii), and SCA-GAN. We separately conduct the photo synthesis and sketch synthesis experiments on the CUHK database and the
CUFSF database. We randomly split each database into two parts: 80% for training and the rest for testing, with no overlap between the training set and the testing set. We show the corresponding PSNR, FSIM, and recognition accuracy (Acc.) of the synthesized sketches and photos in Table V and Table VI, respectively. There is no distinct difference between the different settings. Fig. 12 illustrates the synthesized sketches and photos. Compared to (b), (c)-(e) express fewer deformations and sharper margins in the areas of the nose, mouth, and eyes. In other words, all three proposed components improve the quality of the generated sketches.

Fig. 11. Synthesized photos of Chinese celebrities. (a) Sketch, (b) cGAN, (c) CA-GAN, (d) SCA-GAN.

2) Stability of the Training Procedure: We discover that our proposed approaches considerably stabilize the training procedure of the network. Fig. 13 shows the (smoothed) training loss curves of cGAN [4], CA-GAN, and SCA-GAN on the CUHK database. Specifically, (a) and (b) show the reconstruction error (global L1 loss) and the adversarial loss in the sketch synthesis task; (c) and (d) show the reconstruction error and the adversarial loss in the photo synthesis task, respectively. For clarity, we smooth the initial loss curves by averaging adjacent 40 loss values. Obviously, there are large impulses in the adversarial loss of cGAN. In contrast, the corresponding curves of CA-GAN and SCA-GAN are much smoother. The reconstruction errors of both CA-GAN and SCA-GAN are smaller than that of cGAN. Besides, SCA-GAN achieves the smallest reconstruction errors and the smoothest loss curves. This observation explains why stacked generators are capable of refining the generation performance [44].

V. CONCLUSION

In this paper, we propose a novel composition-aided generative adversarial network for face photo-sketch synthesis. Our approach produces high-quality face photos and sketches over a wide range of challenging data. We hope that the presented approach can support other image generation problems. Besides, it is essential to develop models that can handle photos/sketches with great variations in head poses, lighting conditions, and styles. Finally, exciting work remains to be done to qualitatively evaluate the quality of the synthesized sketches and photos.

REFERENCES

[1] C. Peng, X. Gao, N. Wang, D. Tao, X. Li, and J. Li, "Multiple representations-based face sketch-photo synthesis," IEEE Transactions on Neural Networks and Learning Systems, vol. 27, no. 11.
[2] L. Zhang, L. Lin, X. Wu, S. Ding, and L. Zhang, "End-to-end photo-sketch generation via fully convolutional representation learning," in Proceedings of the 5th ACM International Conference on Multimedia Retrieval, 2015.
[3] N. Wang, X. Gao, and J. Li, "Random sampling for fast face sketch synthesis," Pattern Recognition.
[4] P. Isola, J. Zhu, T. Zhou, and A. Efros, "Image-to-image translation with conditional adversarial networks," arXiv preprint.
[5] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in International Conference on Neural Information Processing Systems, 2014.
[6] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. Metaxas, "StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks," Tech. Rep.
[7] J. Johnson, A. Alahi, and F. F. Li, "Perceptual losses for real-time style transfer and super-resolution," in European Conference on Computer Vision, 2016.
[8] N. Wang, W. Zha, J. Li, and X. Gao, "Back projection: An effective postprocessing method for GAN-based face sketch synthesis," Pattern Recognition Letters, pp. 1-7.
[9] S. Liu, J. Yang, C. Huang, and M. H. Yang, "Multi-objective convolutional learning for face labeling," in Computer Vision and Pattern Recognition, 2015.
[10] N. Wang, M. Zhu, J. Li, B. Song, and Z. Li, "Data-driven vs. model-driven: Fast face sketch synthesis," Neurocomputing.
[11] Y. Song, J. Zhang, L. Bao, and Q. Yang, "Fast preprocessing for robust face sketch synthesis," in Proceedings of the International Joint Conference on Artificial Intelligence, 2017.
[12] Y. Song, L. Bao, S. He, Q. Yang, and M. H. Yang, "Stylizing face images via multiple exemplars," Computer Vision and Image Understanding.
[13] X. Gao, N. Wang, D. Tao, and X. Li, "Face sketch-photo synthesis and retrieval using sparse representation," IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 8.
[14] Y. Song, L. Bao, Q. Yang, and M. H. Yang, "Real-time exemplar-based face sketch synthesis," in European Conference on Computer Vision, 2014.
[15] Q. Pan, Y. Liang, L. Zhang, and S. Wang, "Semi-coupled dictionary learning with applications to image super-resolution and photo-sketch synthesis," in Computer Vision and Pattern Recognition, 2012.
[16] X. Wang and X. Tang, "Face photo-sketch synthesis and recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 11.
[17] S. Zhang, X. Gao, N. Wang, J. Li, and M. Zhang, "Face sketch synthesis via sparse representation-based greedy search," IEEE Transactions on Image Processing, vol. 24, no. 8.
[18] S. Zhang, X. Gao, N. Wang, and J. Li, "Robust face sketch style synthesis," IEEE Transactions on Image Processing, vol. 25, no. 1.
[19] N. Wang, D. Tao, X. Gao, X. Li, and J. Li, "Transductive face sketch-photo synthesis," IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 9, 2013.

Fig. 12. Illustration of synthesized face sketches and photos under different configurations. (a) Input, (b) cGAN, (c) cGAN with face labels in G (cGAN+i), (d) cGAN with face labels in D (cGAN+ii), (e) cGAN with the compositional loss (cGAN+iii), (f) CA-GAN, (g) SCA-GAN, (h) ground truth.

TABLE V. Average PSNR, FSIM, and face recognition accuracy (Acc.) of the synthesized sketches on the CUHK and CUFSF databases, under the settings cGAN, cGAN+i, cGAN+ii, cGAN+iii, CA-GAN, and SCA-GAN. (i) Using face labels in G; (ii) using face labels in D; and (iii) the compositional loss.

TABLE VI. Average PSNR, FSIM, and face recognition accuracy (Acc.) of the synthesized photos on the CUHK and CUFSF databases, under the settings cGAN, cGAN+i, cGAN+ii, cGAN+iii, CA-GAN, and SCA-GAN. (i) Using face labels in G; (ii) using face labels in D; and (iii) the compositional loss.

Fig. 13. Training loss curves of cGAN, CA-GAN, and SCA-GAN on the CUHK database. (a) Reconstruction error in the sketch synthesis task, (b) adversarial loss in the sketch synthesis task, (c) reconstruction error in the photo synthesis task, and (d) adversarial loss in the photo synthesis task.
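The smoothing applied to the curves in Fig. 13 (averaging adjacent 40 loss values) is a plain moving average. A minimal sketch, assuming a simple "valid"-mode window so each output point averages exactly 40 recorded losses:

```python
import numpy as np

def smooth(values, window=40):
    """Moving average over `window` adjacent loss values ('valid' mode)."""
    v = np.asarray(values, dtype=np.float64)
    kernel = np.ones(window) / window  # uniform averaging weights
    return np.convolve(v, kernel, mode="valid")

# Toy check: a constant loss sequence is unchanged by averaging,
# and 'valid' mode yields len(values) - window + 1 output points.
curve = [2.5] * 100
s = smooth(curve)
print(len(s), float(s[0]))  # → 61 2.5
```

Such smoothing suppresses per-iteration noise so that the relative stability of the cGAN, CA-GAN, and SCA-GAN training curves can be compared visually.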


More information

A Novel Multi-Frame Color Images Super-Resolution Framework based on Deep Convolutional Neural Network. Zhe Li, Shu Li, Jianmin Wang and Hongyang Wang

A Novel Multi-Frame Color Images Super-Resolution Framework based on Deep Convolutional Neural Network. Zhe Li, Shu Li, Jianmin Wang and Hongyang Wang 5th International Conference on Measurement, Instrumentation and Automation (ICMIA 2016) A Novel Multi-Frame Color Images Super-Resolution Framewor based on Deep Convolutional Neural Networ Zhe Li, Shu

More information

arxiv: v1 [cs.cv] 31 Mar 2016

arxiv: v1 [cs.cv] 31 Mar 2016 Object Boundary Guided Semantic Segmentation Qin Huang, Chunyang Xia, Wenchao Zheng, Yuhang Song, Hao Xu and C.-C. Jay Kuo arxiv:1603.09742v1 [cs.cv] 31 Mar 2016 University of Southern California Abstract.

More information

A Novel Technique for Sketch to Photo Synthesis

A Novel Technique for Sketch to Photo Synthesis A Novel Technique for Sketch to Photo Synthesis Pulak Purkait, Bhabatosh Chanda (a) and Shrikant Kulkarni (b) (a) Indian Statistical Institute, Kolkata (b) National Institute of Technology Karnataka, Surathkal

More information

S+U Learning through ANs - Pranjit Kalita

S+U Learning through ANs - Pranjit Kalita S+U Learning through ANs - Pranjit Kalita - (from paper) Learning from Simulated and Unsupervised Images through Adversarial Training - Ashish Shrivastava, Tomas Pfister, Oncel Tuzel, Josh Susskind, Wenda

More information

Improving Latent Fingerprint Matching Performance by Orientation Field Estimation using Localized Dictionaries

Improving Latent Fingerprint Matching Performance by Orientation Field Estimation using Localized Dictionaries Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 11, November 2014,

More information

Unsupervised Learning

Unsupervised Learning Deep Learning for Graphics Unsupervised Learning Niloy Mitra Iasonas Kokkinos Paul Guerrero Vladimir Kim Kostas Rematas Tobias Ritschel UCL UCL/Facebook UCL Adobe Research U Washington UCL Timetable Niloy

More information

Intensity-Depth Face Alignment Using Cascade Shape Regression

Intensity-Depth Face Alignment Using Cascade Shape Regression Intensity-Depth Face Alignment Using Cascade Shape Regression Yang Cao 1 and Bao-Liang Lu 1,2 1 Center for Brain-like Computing and Machine Intelligence Department of Computer Science and Engineering Shanghai

More information

MULTI-POSE FACE HALLUCINATION VIA NEIGHBOR EMBEDDING FOR FACIAL COMPONENTS. Yanghao Li, Jiaying Liu, Wenhan Yang, Zongming Guo

MULTI-POSE FACE HALLUCINATION VIA NEIGHBOR EMBEDDING FOR FACIAL COMPONENTS. Yanghao Li, Jiaying Liu, Wenhan Yang, Zongming Guo MULTI-POSE FACE HALLUCINATION VIA NEIGHBOR EMBEDDING FOR FACIAL COMPONENTS Yanghao Li, Jiaying Liu, Wenhan Yang, Zongg Guo Institute of Computer Science and Technology, Peking University, Beijing, P.R.China,

More information

arxiv: v1 [cs.cv] 6 Sep 2018

arxiv: v1 [cs.cv] 6 Sep 2018 arxiv:1809.01890v1 [cs.cv] 6 Sep 2018 Full-body High-resolution Anime Generation with Progressive Structure-conditional Generative Adversarial Networks Koichi Hamada, Kentaro Tachibana, Tianqi Li, Hiroto

More information

2.1 Optimized Importance Map

2.1 Optimized Importance Map 3rd International Conference on Multimedia Technology(ICMT 2013) Improved Image Resizing using Seam Carving and scaling Yan Zhang 1, Jonathan Z. Sun, Jingliang Peng Abstract. Seam Carving, the popular

More information

Face Transfer with Generative Adversarial Network

Face Transfer with Generative Adversarial Network Face Transfer with Generative Adversarial Network Runze Xu, Zhiming Zhou, Weinan Zhang, Yong Yu Shanghai Jiao Tong University We explore the impact of discriminators with different receptive field sizes

More information

arxiv: v1 [cs.cv] 16 Jul 2017

arxiv: v1 [cs.cv] 16 Jul 2017 enerative adversarial network based on resnet for conditional image restoration Paper: jc*-**-**-****: enerative Adversarial Network based on Resnet for Conditional Image Restoration Meng Wang, Huafeng

More information

Color Local Texture Features Based Face Recognition

Color Local Texture Features Based Face Recognition Color Local Texture Features Based Face Recognition Priyanka V. Bankar Department of Electronics and Communication Engineering SKN Sinhgad College of Engineering, Korti, Pandharpur, Maharashtra, India

More information

Alternatives to Direct Supervision

Alternatives to Direct Supervision CreativeAI: Deep Learning for Graphics Alternatives to Direct Supervision Niloy Mitra Iasonas Kokkinos Paul Guerrero Nils Thuerey Tobias Ritschel UCL UCL UCL TUM UCL Timetable Theory and Basics State of

More information

DeepIM: Deep Iterative Matching for 6D Pose Estimation - Supplementary Material

DeepIM: Deep Iterative Matching for 6D Pose Estimation - Supplementary Material DeepIM: Deep Iterative Matching for 6D Pose Estimation - Supplementary Material Yi Li 1, Gu Wang 1, Xiangyang Ji 1, Yu Xiang 2, and Dieter Fox 2 1 Tsinghua University, BNRist 2 University of Washington

More information

arxiv: v2 [cs.cv] 14 Jul 2018

arxiv: v2 [cs.cv] 14 Jul 2018 Constrained Neural Style Transfer for Decorated Logo Generation arxiv:1803.00686v2 [cs.cv] 14 Jul 2018 Gantugs Atarsaikhan, Brian Kenji Iwana, Seiichi Uchida Graduate School of Information Science and

More information

SYNTHESIS OF IMAGES BY TWO-STAGE GENERATIVE ADVERSARIAL NETWORKS. Qiang Huang, Philip J.B. Jackson, Mark D. Plumbley, Wenwu Wang

SYNTHESIS OF IMAGES BY TWO-STAGE GENERATIVE ADVERSARIAL NETWORKS. Qiang Huang, Philip J.B. Jackson, Mark D. Plumbley, Wenwu Wang SYNTHESIS OF IMAGES BY TWO-STAGE GENERATIVE ADVERSARIAL NETWORKS Qiang Huang, Philip J.B. Jackson, Mark D. Plumbley, Wenwu Wang Centre for Vision, Speech and Signal Processing University of Surrey, Guildford,

More information

Improving Image Segmentation Quality Via Graph Theory

Improving Image Segmentation Quality Via Graph Theory International Symposium on Computers & Informatics (ISCI 05) Improving Image Segmentation Quality Via Graph Theory Xiangxiang Li, Songhao Zhu School of Automatic, Nanjing University of Post and Telecommunications,

More information

Robust Face Sketch Synthesis via Generative Adversarial Fusion of Priors and Parametric Sigmoid

Robust Face Sketch Synthesis via Generative Adversarial Fusion of Priors and Parametric Sigmoid Robust Face Sketch Synthesis via Generative Adversarial Fusion of Priors and Parametric Sigmoid Shengchuan Zhang 1,2, Rongrong Ji 1,2, Jie Hu 1,2, Yue Gao 3, Chia-Wen Lin 4 1 Fujian Key Laboratory of Sensing

More information

arxiv: v1 [cs.cv] 17 Nov 2016

arxiv: v1 [cs.cv] 17 Nov 2016 Inverting The Generator Of A Generative Adversarial Network arxiv:1611.05644v1 [cs.cv] 17 Nov 2016 Antonia Creswell BICV Group Bioengineering Imperial College London ac2211@ic.ac.uk Abstract Anil Anthony

More information

3D model classification using convolutional neural network

3D model classification using convolutional neural network 3D model classification using convolutional neural network JunYoung Gwak Stanford jgwak@cs.stanford.edu Abstract Our goal is to classify 3D models directly using convolutional neural network. Most of existing

More information

Show, Discriminate, and Tell: A Discriminatory Image Captioning Model with Deep Neural Networks

Show, Discriminate, and Tell: A Discriminatory Image Captioning Model with Deep Neural Networks Show, Discriminate, and Tell: A Discriminatory Image Captioning Model with Deep Neural Networks Zelun Luo Department of Computer Science Stanford University zelunluo@stanford.edu Te-Lin Wu Department of

More information

Semi-supervised Data Representation via Affinity Graph Learning

Semi-supervised Data Representation via Affinity Graph Learning 1 Semi-supervised Data Representation via Affinity Graph Learning Weiya Ren 1 1 College of Information System and Management, National University of Defense Technology, Changsha, Hunan, P.R China, 410073

More information

A FRAMEWORK OF EXTRACTING MULTI-SCALE FEATURES USING MULTIPLE CONVOLUTIONAL NEURAL NETWORKS. Kuan-Chuan Peng and Tsuhan Chen

A FRAMEWORK OF EXTRACTING MULTI-SCALE FEATURES USING MULTIPLE CONVOLUTIONAL NEURAL NETWORKS. Kuan-Chuan Peng and Tsuhan Chen A FRAMEWORK OF EXTRACTING MULTI-SCALE FEATURES USING MULTIPLE CONVOLUTIONAL NEURAL NETWORKS Kuan-Chuan Peng and Tsuhan Chen School of Electrical and Computer Engineering, Cornell University, Ithaca, NY

More information

Facial Expression Classification with Random Filters Feature Extraction

Facial Expression Classification with Random Filters Feature Extraction Facial Expression Classification with Random Filters Feature Extraction Mengye Ren Facial Monkey mren@cs.toronto.edu Zhi Hao Luo It s Me lzh@cs.toronto.edu I. ABSTRACT In our work, we attempted to tackle

More information

Extend the shallow part of Single Shot MultiBox Detector via Convolutional Neural Network

Extend the shallow part of Single Shot MultiBox Detector via Convolutional Neural Network Extend the shallow part of Single Shot MultiBox Detector via Convolutional Neural Network Liwen Zheng, Canmiao Fu, Yong Zhao * School of Electronic and Computer Engineering, Shenzhen Graduate School of

More information

An Adaptive Threshold LBP Algorithm for Face Recognition

An Adaptive Threshold LBP Algorithm for Face Recognition An Adaptive Threshold LBP Algorithm for Face Recognition Xiaoping Jiang 1, Chuyu Guo 1,*, Hua Zhang 1, and Chenghua Li 1 1 College of Electronics and Information Engineering, Hubei Key Laboratory of Intelligent

More information

Video annotation based on adaptive annular spatial partition scheme

Video annotation based on adaptive annular spatial partition scheme Video annotation based on adaptive annular spatial partition scheme Guiguang Ding a), Lu Zhang, and Xiaoxu Li Key Laboratory for Information System Security, Ministry of Education, Tsinghua National Laboratory

More information

Image Restoration with Deep Generative Models

Image Restoration with Deep Generative Models Image Restoration with Deep Generative Models Raymond A. Yeh *, Teck-Yian Lim *, Chen Chen, Alexander G. Schwing, Mark Hasegawa-Johnson, Minh N. Do Department of Electrical and Computer Engineering, University

More information

Sparse Shape Registration for Occluded Facial Feature Localization

Sparse Shape Registration for Occluded Facial Feature Localization Shape Registration for Occluded Facial Feature Localization Fei Yang, Junzhou Huang and Dimitris Metaxas Abstract This paper proposes a sparsity driven shape registration method for occluded facial feature

More information

arxiv: v1 [cs.cv] 19 Apr 2017

arxiv: v1 [cs.cv] 19 Apr 2017 Generative Face Completion Yijun Li 1, Sifei Liu 1, Jimei Yang 2, and Ming-Hsuan Yang 1 1 University of California, Merced 2 Adobe Research {yli62,sliu32,mhyang}@ucmerced.edu jimyang@adobe.com arxiv:1704.05838v1

More information

CS231N Project Final Report - Fast Mixed Style Transfer

CS231N Project Final Report - Fast Mixed Style Transfer CS231N Project Final Report - Fast Mixed Style Transfer Xueyuan Mei Stanford University Computer Science xmei9@stanford.edu Fabian Chan Stanford University Computer Science fabianc@stanford.edu Tianchang

More information

Introduction to Generative Adversarial Networks

Introduction to Generative Adversarial Networks Introduction to Generative Adversarial Networks Luke de Oliveira Vai Technologies Lawrence Berkeley National Laboratory @lukede0 @lukedeo lukedeo@vaitech.io https://ldo.io 1 Outline Why Generative Modeling?

More information

Texture Sensitive Image Inpainting after Object Morphing

Texture Sensitive Image Inpainting after Object Morphing Texture Sensitive Image Inpainting after Object Morphing Yin Chieh Liu and Yi-Leh Wu Department of Computer Science and Information Engineering National Taiwan University of Science and Technology, Taiwan

More information

Convolution Neural Networks for Chinese Handwriting Recognition

Convolution Neural Networks for Chinese Handwriting Recognition Convolution Neural Networks for Chinese Handwriting Recognition Xu Chen Stanford University 450 Serra Mall, Stanford, CA 94305 xchen91@stanford.edu Abstract Convolutional neural networks have been proven

More information

Single Image Super Resolution of Textures via CNNs. Andrew Palmer

Single Image Super Resolution of Textures via CNNs. Andrew Palmer Single Image Super Resolution of Textures via CNNs Andrew Palmer What is Super Resolution (SR)? Simple: Obtain one or more high-resolution images from one or more low-resolution ones Many, many applications

More information

Robust Face Recognition Based on Convolutional Neural Network

Robust Face Recognition Based on Convolutional Neural Network 2017 2nd International Conference on Manufacturing Science and Information Engineering (ICMSIE 2017) ISBN: 978-1-60595-516-2 Robust Face Recognition Based on Convolutional Neural Network Ying Xu, Hui Ma,

More information

Face Recognition Technology Based On Image Processing Chen Xin, Yajuan Li, Zhimin Tian

Face Recognition Technology Based On Image Processing Chen Xin, Yajuan Li, Zhimin Tian 4th International Conference on Machinery, Materials and Computing Technology (ICMMCT 2016) Face Recognition Technology Based On Image Processing Chen Xin, Yajuan Li, Zhimin Tian Hebei Engineering and

More information

arxiv: v1 [cs.ne] 11 Jun 2018

arxiv: v1 [cs.ne] 11 Jun 2018 Generative Adversarial Network Architectures For Image Synthesis Using Capsule Networks arxiv:1806.03796v1 [cs.ne] 11 Jun 2018 Yash Upadhyay University of Minnesota, Twin Cities Minneapolis, MN, 55414

More information

arxiv: v1 [cs.cv] 16 Nov 2015

arxiv: v1 [cs.cv] 16 Nov 2015 Coarse-to-fine Face Alignment with Multi-Scale Local Patch Regression Zhiao Huang hza@megvii.com Erjin Zhou zej@megvii.com Zhimin Cao czm@megvii.com arxiv:1511.04901v1 [cs.cv] 16 Nov 2015 Abstract Facial

More information

Direct Matrix Factorization and Alignment Refinement: Application to Defect Detection

Direct Matrix Factorization and Alignment Refinement: Application to Defect Detection Direct Matrix Factorization and Alignment Refinement: Application to Defect Detection Zhen Qin (University of California, Riverside) Peter van Beek & Xu Chen (SHARP Labs of America, Camas, WA) 2015/8/30

More information

Face Alignment Under Various Poses and Expressions

Face Alignment Under Various Poses and Expressions Face Alignment Under Various Poses and Expressions Shengjun Xin and Haizhou Ai Computer Science and Technology Department, Tsinghua University, Beijing 100084, China ahz@mail.tsinghua.edu.cn Abstract.

More information

arxiv: v1 [cs.cv] 22 Feb 2017

arxiv: v1 [cs.cv] 22 Feb 2017 Synthesising Dynamic Textures using Convolutional Neural Networks arxiv:1702.07006v1 [cs.cv] 22 Feb 2017 Christina M. Funke, 1, 2, 3, Leon A. Gatys, 1, 2, 4, Alexander S. Ecker 1, 2, 5 1, 2, 3, 6 and Matthias

More information

Stacked Denoising Autoencoders for Face Pose Normalization

Stacked Denoising Autoencoders for Face Pose Normalization Stacked Denoising Autoencoders for Face Pose Normalization Yoonseop Kang 1, Kang-Tae Lee 2,JihyunEun 2, Sung Eun Park 2 and Seungjin Choi 1 1 Department of Computer Science and Engineering Pohang University

More information

Face Recognition Using Vector Quantization Histogram and Support Vector Machine Classifier Rong-sheng LI, Fei-fei LEE *, Yan YAN and Qiu CHEN

Face Recognition Using Vector Quantization Histogram and Support Vector Machine Classifier Rong-sheng LI, Fei-fei LEE *, Yan YAN and Qiu CHEN 2016 International Conference on Artificial Intelligence: Techniques and Applications (AITA 2016) ISBN: 978-1-60595-389-2 Face Recognition Using Vector Quantization Histogram and Support Vector Machine

More information

Deep Manga Colorization with Color Style Extraction by Conditional Adversarially Learned Inference

Deep Manga Colorization with Color Style Extraction by Conditional Adversarially Learned Inference Information Engineering Express International Institute of Applied Informatics 2017, Vol.3, No.4, P.55-66 Deep Manga Colorization with Color Style Extraction by Conditional Adversarially Learned Inference

More information

arxiv: v1 [eess.sp] 23 Oct 2018

arxiv: v1 [eess.sp] 23 Oct 2018 Reproducing AmbientGAN: Generative models from lossy measurements arxiv:1810.10108v1 [eess.sp] 23 Oct 2018 Mehdi Ahmadi Polytechnique Montreal mehdi.ahmadi@polymtl.ca Mostafa Abdelnaim University de Montreal

More information

Structured Light II. Thanks to Ronen Gvili, Szymon Rusinkiewicz and Maks Ovsjanikov

Structured Light II. Thanks to Ronen Gvili, Szymon Rusinkiewicz and Maks Ovsjanikov Structured Light II Johannes Köhler Johannes.koehler@dfki.de Thanks to Ronen Gvili, Szymon Rusinkiewicz and Maks Ovsjanikov Introduction Previous lecture: Structured Light I Active Scanning Camera/emitter

More information

Facial Animation System Design based on Image Processing DU Xueyan1, a

Facial Animation System Design based on Image Processing DU Xueyan1, a 4th International Conference on Machinery, Materials and Computing Technology (ICMMCT 206) Facial Animation System Design based on Image Processing DU Xueyan, a Foreign Language School, Wuhan Polytechnic,

More information

Generic Face Alignment Using an Improved Active Shape Model

Generic Face Alignment Using an Improved Active Shape Model Generic Face Alignment Using an Improved Active Shape Model Liting Wang, Xiaoqing Ding, Chi Fang Electronic Engineering Department, Tsinghua University, Beijing, China {wanglt, dxq, fangchi} @ocrserv.ee.tsinghua.edu.cn

More information

Leaf Image Recognition Based on Wavelet and Fractal Dimension

Leaf Image Recognition Based on Wavelet and Fractal Dimension Journal of Computational Information Systems 11: 1 (2015) 141 148 Available at http://www.jofcis.com Leaf Image Recognition Based on Wavelet and Fractal Dimension Haiyan ZHANG, Xingke TAO School of Information,

More information