Disentangling Latent Space for VAE by Label Relevant/Irrelevant Dimensions


Disentangling Latent Space for VAE by Label Relevant/Irrelevant Dimensions
Zhilin Zheng 1,2, Li Sun 1,2
1 Shanghai Key Laboratory of Multidimensional Information Processing, 2 East China Normal University
@stu.ecnu.edu.cn, sunli@ee.ecnu.edu.cn

Abstract
VAE requires the standard Gaussian distribution as a prior in the latent space. Since all codes tend to follow the same prior, it often suffers from the so-called posterior collapse. To avoid this, this paper introduces a class specific distribution for the latent code. But different from CVAE, we present a method for disentangling the latent space into the label relevant and irrelevant dimensions, z_s and z_u, for a single input. We apply two separated encoders to map the input into z_s and z_u respectively, and then give the concatenated code to the decoder to reconstruct the input. The label irrelevant code z_u represents the common characteristics of all inputs, hence it is constrained by the standard Gaussian, and its encoder is trained in the amortized variational inference way, like VAE. Meanwhile, z_s is assumed to follow a Gaussian mixture distribution in which each component corresponds to a particular class. The parameters of the Gaussian components in the z_s encoder are optimized by the label supervision in a global stochastic way. In theory, we show that our method is actually equivalent to adding a KL divergence term on the joint distribution of z_s and the class label, and that it can directly increase the mutual information between z_s and the label. Our model can also be extended to GAN by adding a discriminator in the pixel domain so that it produces high quality and diverse images.

1. Introduction
Learning a deep generative model for structured image data is difficult because this task is not simply modeling a many-to-one mapping function such as classification; instead, it is often required to generate diverse outputs for similar codes sampled from a simple distribution. Furthermore, an image x in the high dimensional space often lies on a complex manifold, thus the generative model should capture the underlying data distribution p(x). Basically, the Variational Auto-Encoder (VAE) [34, 20] and the Generative Adversarial Network (GAN) [13, 25] are two strategies for structured data generation. In VAE, the encoder p_φ(z|x) maps the data x into the code z in the latent space. The decoder, represented by p_θ(x|z), is given a latent code z sampled from a distribution specified by the encoder and tries to reconstruct x. The encoder and decoder in VAE are trained together mainly based on the data reconstruction loss. At the same time, it is required to regularize the distribution p_φ(z|x) to be simple (e.g. Gaussian) based on the Kullback-Leibler (KL) divergence between p_φ(z|x) and q(z) = N(0, I), so that sampling in the latent space is easy. Optimization for VAE is quite stable, but results from it are blurry, mainly because the posterior defined by p_φ(z|x) is not complex enough to capture the true posterior, which is also known as posterior collapse. On the other hand, GAN treats the data generation task as a min/max game between a generator G(z) and a discriminator D(x).
The adversarial loss computed from the discriminator makes the generated image more realistic, but the training becomes more unstable. In [10, 22, 28], VAE and GAN are integrated so that they can benefit from each other. Both VAE and GAN work in an unsupervised way without giving any condition of the label on the generated image. Instead, conditional VAE (CVAE) [39, 3] extends VAE by showing the label to both the encoder and the decoder. It learns the data distribution conditioned on the given label. Hence, the encoder and decoder become p_φ(z|x, c) and p_θ(x|z, c). Similarly, in conditional GAN (cGAN) [9, 18, 33, 30] the label c is given to both the generator G(z, c) and the discriminator D(x, c). Theoretically, feeding the label to either the encoder in CVAE or the decoder in CVAE or cGAN helps increase the mutual information between the generated x and the label. Thus, it can improve the quality of the generated image.

This paper deals with the image generation problem in VAE with two separate encoders. For a single input x, our goal is to disentangle the latent space code z, computed by the encoders, into the label relevant dimensions z_s and the irrelevant ones z_u. We emphasize the difference between z_s and z_u and their corresponding encoders. For z_s, since the label is known during training, it should be more accurate and specific, while without any label constraint z_u should be general.

Specifically, the two encoders are constrained with different priors on their posterior distributions p_{φ_s}(z_s|x) and p_{φ_u}(z_u|x). Similar to VAE or CVAE, in which the full code z is label irrelevant, the prior for z_u is also chosen as N(0, I). But different from previous works, the prior q(z_s|x) becomes complex so as to capture the label relevant distribution. From the decoder's perspective, it takes the concatenation of z_s and z_u to reconstruct the input x. Here the distinction from CVAE and cGAN is that they use a fixed, one-hot encoded label, while our work applies z_s, which is considered a variational, soft label. Note that there are two stages for training our model. First, the encoder for z_s is trained for the classification task under the supervision of the label. Here, instead of the softmax cross entropy loss, the Gaussian mixture cross entropy loss proposed in [44] is adopted, since it accumulates the mean µ_c and variance σ_c of samples with the same label c and models them as the Gaussian N(µ_c, σ_c), hence z_s ∼ N(µ_c, σ_c). The first stage specifies the label relevant distribution. In the second stage, the two encoders and the decoder are trained jointly in an end-to-end manner based on the reconstruction loss. Meanwhile, the priors z_s ∼ N(µ_c, σ_c) and z_u ∼ N(0, I) are also considered. The main contribution of this paper lies in the following aspects: (1) for a single input x to the encoder, we provide an algorithm to disentangle the latent space into label relevant and irrelevant dimensions in VAE. Previous works like [15, 4, 37] disentangle the latent space in AE, not VAE, so it is impossible to make inference from their models. Moreover, [27, 4, 23] require at least two inputs for training. (2) We find that the Gaussian mixture loss function is a suitable way to estimate the parameters of the prior distribution, and it can be optimized within the VAE framework. (3) We give both a theoretical derivation and a variety of detailed experiments to explain the effectiveness of our work.

2. Related works
Two types of methods for structured image generation are VAE and GAN. VAE [20] is a type of parametric model defined by p_θ(x|z) and p_φ(z|x), which employs the idea of variational inference to maximize the evidence lower bound (ELBO), as shown in Eq. 1:

\log p_\theta(x) \ge \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}(q_\phi(z|x)\,\|\,q(z)) \quad (1)

The right-hand side is the ELBO, which is a lower bound of the log-likelihood. In VAE, a differentiable encoder-decoder pair is connected, parameterized by φ and θ, respectively. E_{q_φ(z|x)}[log p_θ(x|z)] represents the end-to-end reconstruction loss, and D_KL(q_φ(z|x) ‖ q(z)) is the KL divergence between the encoder's output distribution q_φ(z|x) and the prior q(z), which is usually modeled by the standard normal distribution N(0, I). Note that VAE assumes the posterior q_φ(z|x) is Gaussian, and µ and σ are estimated for every single input x by the encoder. This strategy is named amortized variational inference (AVI), and it is more efficient than stochastic variational inference (SVI) [17]. VAE's advantage is that its loss is easy to optimize, but the simple prior in the latent space may not capture complex data patterns, which often leads to mode collapse in the latent space. Moreover, VAE's code is hard to interpret. Thus, many works focus on improving VAE in these two aspects. CVAE [39] adds the label vector as an input for both the encoder and the decoder, so that the latent code and the generated image are conditioned on the label, which potentially prevents the latent collapse.
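As a point of reference for Eq. 1, the following is a brief, hedged PyTorch-style sketch of the standard VAE objective with a diagonal-Gaussian posterior and an N(0, I) prior; the framework choice and the module names `encoder` and `decoder` are assumptions for illustration, not part of the paper.

```python
# Minimal sketch of the negative ELBO of Eq. 1: l2 reconstruction + closed-form KL.
import torch

def elbo_loss(encoder, decoder, x):
    mu, logvar = encoder(x)                                   # q_phi(z|x) = N(mu, diag(exp(logvar)))
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()      # reparameterization trick
    x_hat = decoder(z)
    recon = torch.sum((x - x_hat) ** 2, dim=list(range(1, x.dim())))   # reconstruction term
    kl = 0.5 * torch.sum(logvar.exp() + mu ** 2 - 1.0 - logvar, dim=1) # D_KL(q(z|x) || N(0,I))
    return (recon + kl).mean()                                # minimize the negative ELBO
```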
On the other hand, β-VAE [16, 7] is an unsupervised approach to latent space disentanglement. It introduces a simple hyper-parameter β to balance the two loss terms in Eq. 1. A scheme named infinite mixture of VAEs is proposed and applied to semi-supervised generation [1]. It uses multiple VAEs and combines them as a non-parametric mixture model. In [19], the semi-amortized VAE is proposed. It combines AVI with SVI in VAE; here SVI estimates the distribution parameters on the whole training set, while the AVI in traditional VAE gives this estimation for a single input. GAN [13] is another technique to model the data distribution p_D(x). It starts from a random z ∼ p(z), where p(z) is simple, e.g. Gaussian, and trains a transform network g_θ(z) with the help of a discriminator D_φ(·) so that p_θ(z) approximates p_D(x). Later works [32, 26, 2, 14, 29] try to stabilize GAN's training. The traditional GAN works in a fully unsupervised manner, while cGAN [18, 33, 30, 6] aims to generate images conditioned on labels. In cGAN, the label is given as an input to both the generator and the discriminator as a condition for the distribution. The encoder-decoder architecture of AE or VAE can also be used in GAN. In ALI [11] and BiGAN [10], the encoder maps x to z, while the decoder reverses it. The discriminator takes the pair of z and x, and is trained to determine whether it comes from the encoder or the decoder in an adversarial manner. In VAE-GAN [22, 24], VAE's generated data are improved by a discriminator. A similar idea is also applied to CVAE in [3]. VAE-GAN is also applied in some specific applications like [4, 12]. Since the code z potentially affects the generated data, some works try to model its effect and disentangle the dimensions of z. InfoGAN [9] reveals the effect of the latent space code c by maximizing the mutual information between c and the synthetic data g_θ(z, c). Its generator outputs g_θ(z, c), which is inspected by the discriminator D_φ(·); D_φ(·) also tries to reconstruct the code c. In [27], the latent dimension is disentangled in VAE based on specified factors and unspecified ones, which is similar to our work. But its encoder takes multiple inputs, and the decoder combines codes from different inputs for reconstruction.

The work in [15] modifies [27] by taking a single input. To stabilize training, its model is built on AE, not VAE, hence it cannot perform variational inference. Other works [37, 4, 23] are also built on AE and take more than two inputs. Moreover, they only apply to a particular domain like faces [37, 4] or image-to-image translation [23], while our work is built on VAE and takes only a single input for a more general case.

3. Proposed method
We propose an image generation algorithm based on VAE which divides the encoder into two separate ones, one encoding the label relevant representation z_s and the other encoding the label irrelevant information z_u. z_s is learned with the supervision of the categorical class label and is required to follow a Gaussian mixture distribution, while z_u is expected to contain the remaining common information irrelevant to the label and is pushed close to the standard Gaussian N(0, I).

Figure 1. The network architecture. We disentangle class relevant dimensions z_s and class irrelevant dimensions z_u in the latent space. The encoder_s maps the input image x to z_s, and forces z_s to be well classified while following a Gaussian mixture distribution with learned mean µ_c and covariance Σ_c. Meanwhile, the encoder_u extracts z_u from x and pushes it to match the standard Gaussian N(0, I). An adversarial classifier is added on top of z_u to distinguish the class of z_u, while encoder_u tries to fool it. Then z_s and z_u are concatenated and fed into the decoder to obtain x̂ for reconstruction. Adversarial training in the pixel domain is also adopted, with a discriminator added on the images. The forward pass is drawn in solid lines and dashed lines represent back propagation.

3.1. Problem formulation
Given a labeled dataset D_s = {(x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(N), y^(N))}, where x^(i) is the i-th image and y^(i) ∈ {0, 1, ..., C−1} is the corresponding label, C and N are the number of classes and the size of the dataset, respectively. The goal of VAE is to maximize the ELBO defined in Eq. 1, so that the data log-likelihood log p(x) is also maximized. The key idea is to split the full latent code z into the label relevant dimensions z_s and the irrelevant dimensions z_u, which means z_s fully reflects the class c but z_u does not. Thus the objective can be rewritten as (derived in detail in the Appendix):

\log p(x) = \log \int\!\!\int \sum_c p(x, z_s, z_u, c)\, dz_s\, dz_u \ge \mathbb{E}_{q_\psi(z_s|x),\, q_\phi(z_u|x)}[\log p_\theta(x|z_s, z_u)] - D_{KL}(q_\phi(z_u|x)\,\|\,p(z_u)) - D_{KL}(q_\psi(z_s, c|x)\,\|\,p(z_s, c)) \quad (2)

In Eq. 2, the ELBO becomes 3 terms in our setting. The first term is the negative reconstruction error, where p_θ is the decoder parameterized by θ. It measures whether the latent codes z_s and z_u are informative enough to recover the original data. In practice, the reconstruction error L_re can be defined as the l_2 loss between x and its reconstruction x̂. The second term acts as a regularization term of the label irrelevant branch that pushes q_φ(z_u|x) to match the prior distribution p(z_u), which is illustrated in detail in Section 3.2. The third term matches q_ψ(z_s|x) to a class-specific Gaussian distribution whose mean and covariance are learned with supervision, and it will be further introduced in Section 3.3.

3.2. Label irrelevant branch
Intuitively, we want to disentangle the latent code z into z_s and z_u, and expect z_u to follow a fixed prior distribution which is irrelevant to the label. This regularization is realized by minimizing the KL divergence between q_φ(z_u|x) and the prior p(z_u), as illustrated in Eq. 3.
More specifically, q_φ(z_u|x) is a Gaussian distribution whose mean µ and diagonal covariance Σ are the outputs of encoder_u parameterized by φ, and p(z_u) is simply set to N(0, I). Hence the KL regularization term is:

L_{kl} = D_{KL}[\mathcal{N}(\mu, \Sigma)\,\|\,\mathcal{N}(0, I)] \quad (3)

Note that Eq. 3 has a closed form, which is easy to compute. To ensure good disentanglement between z_u and z_s, we introduce adversarial learning in the latent space, as in AAE [25], to drive the label relevant information out of z_u. To do this, an adversarial classifier is added on top of z_u, which is trained to classify the category of z_u with the cross entropy loss shown in Eq. 4:

L_C^{adv} = \mathbb{E}_{q_\phi(z_u|x)}\Big[-\sum_c \mathbb{I}(c = y)\, \log q_\omega(c|z_u)\Big] \quad (4)

where I(c = y) is the indicator function, and q_ω(c|z_u) is the softmax probability output by the adversarial classifier parameterized by ω.
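Below is a hedged PyTorch-style sketch of the label-irrelevant regularizers: the closed-form KL of Eq. 3 and the adversarial classifier/encoder cross-entropy losses of Eqs. 4-5. The classifier module `adv_clf`, tensor layouts, and the use of `detach()` when updating only the classifier are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def kl_to_standard_normal(mu, logvar):
    # Eq. 3: D_KL[ N(mu, diag(exp(logvar))) || N(0, I) ], summed over latent dims
    return 0.5 * torch.sum(logvar.exp() + mu.pow(2) - 1.0 - logvar, dim=1).mean()

def adversarial_classifier_loss(adv_clf, z_u, y):
    # Eq. 4: train the classifier to predict the class from z_u (encoder_u frozen here)
    logits = adv_clf(z_u.detach())
    return F.cross_entropy(logits, y)

def adversarial_encoder_loss(adv_clf, z_u):
    # Eq. 5: train encoder_u to fool the classifier, i.e. push its prediction
    # towards the uniform distribution over the C categories
    log_prob = F.log_softmax(adv_clf(z_u), dim=1)
    return -log_prob.mean(dim=1).mean()     # -(1/C) * sum_c log q_w(c|z_u)
```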

Meanwhile, encoder_u is trained to fool the classifier, hence the target distribution becomes uniform over all categories, i.e. 1/C. The corresponding cross entropy loss is defined in Eq. 5:

L_E^{adv} = \mathbb{E}_{q_\phi(z_u|x)}\Big[-\sum_c \frac{1}{C}\, \log q_\omega(c|z_u)\Big] \quad (5)

3.3. Label relevant branch
Inspired by the GM loss [44], we expect z_s to follow a Gaussian mixture distribution, expressed in Eq. 6, where µ_c and Σ_c are the mean and covariance of the Gaussian distribution for class c, and p(c) is the prior probability, which is simply set to 1/C for all categories. For simplicity, we ignore the correlation among different dimensions of z_s, hence Σ_c is assumed to be diagonal.

p(z_s) = \sum_c p(z_s|c)\, p(c) = \sum_c \mathcal{N}(z_s; \mu_c, \Sigma_c)\, p(c) \quad (6)

Recall that in Eq. 2 the KL divergence between q_ψ(z_s, c|x) and p(z_s, c) is minimized. If q_ψ(z_s|x) is formulated as a Gaussian distribution with Σ → 0 and mean ẑ_s output by encoder_s, which is actually a Dirac delta function δ(z_s − ẑ_s), the KL divergence turns out to be the likelihood regularization term L_lkd in Eq. 7, which is proved in the Appendix. Here µ_y and Σ_y are the mean and covariance specified by the label y.

L_{lkd} = -\log \mathcal{N}(\hat{z}_s; \mu_y, \Sigma_y) \quad (7)

Furthermore, we want z_s to contain as much label information as possible, so the mutual information between z_s and the class c is added to the maximization objective. We prove in the Appendix that this is equal to minimizing the cross-entropy loss between the posterior probability q(c|z_s) and the label, which is exactly the classification loss L_cls in the GM loss, as shown in Eq. 8:

L_{cls} = \mathbb{E}_{q_\psi(z_s|x)}\Big[-\sum_c \mathbb{I}(c = y)\, \log q(c|z_s)\Big] = -\log \frac{\mathcal{N}(\hat{z}_s; \mu_y, \Sigma_y)\, p(y)}{\sum_k \mathcal{N}(\hat{z}_s; \mu_k, \Sigma_k)\, p(k)} \quad (8)

These two terms are added up to form the GM loss in Eq. 9; L_GM is finally used to train encoder_s.

L_{GM} = L_{cls} + \lambda_{lkd}\, L_{lkd} \quad (9)

3.4. The decoder and the adversarial discriminator
The latent codes z_s and z_u output by encoder_s and encoder_u are first concatenated together and then given to the decoder to reconstruct the input x as x̂. Here the decoder is denoted by p_θ(x|z), with its parameter θ learned from the l_2 reconstruction error L_re. To synthesize a high quality x̂, we also employ adversarial training in the pixel domain. Specifically, a discriminator D_θd(x, c), with adversarial training on its parameter θ_d, is used to improve x̂. Here the label c is utilized in D_θd as in [30]. The adversarial training loss for the discriminator can be formulated as in Eq. 10,

L_D^{adv} = -\mathbb{E}_{x \sim P_r}[\log D_{\theta_d}(x, c)] - \mathbb{E}_{z_u \sim \mathcal{N}(0, I),\, z_s \sim p(z_s|c)}[\log(1 - D_{\theta_d}(G(z_s, z_u), c))] \quad (10)

while this loss becomes L_{GD}^{adv} = -\mathbb{E}_{z_u \sim \mathcal{N}(0, I),\, z_s \sim p(z_s|c)}[\log D_{\theta_d}(G(z_s, z_u), c)] for the generator. Note that here G(z_s, z_u) is the decoder and p(z_s|c) is defined in Eq. 6.

3.5. Training algorithm
The training details are illustrated in Algorithm 1. Encoder_s, modeled by q_ψ, extracts the label relevant code z_s. Encoder_s is trained with L_GM and L_re, encouraging z_s to be label dependent and to follow a learned Gaussian mixture distribution. Meanwhile, Encoder_u, represented by q_φ, is intended to extract the class irrelevant code z_u. It is trained with L_kl, L_E^adv and L_re to make z_u irrelevant to the label and close to N(0, I). The adversarial classifier parameterized by ω is learned to classify z_u using L_C^adv. Then the decoder p_θ generates the reconstructed image from the combined feature of z_s and z_u with the loss L_re. In the training process, a 2-stage alternating training algorithm is adopted. First, encoder_s is updated using L_GM to learn the mean µ_c and covariance Σ_c of the prior p(z_s|c).
Then, the two encoders and the decoder are trained jointly to reconstruct images while the distributions of z_s and z_u are considered.

3.6. Application in semi-supervised generation
Given L unlabelled extra data D_u = {x^(N+1), x^(N+2), ..., x^(N+L)}, we now use our architecture for semi-supervised generation, in which the labels y^(N+i) of x^(N+i) in D_u are not presented. Here we hold the assumption that D_u is in the same domain as the fully supervised D_s, but the absent y^(N+i) may either satisfy y^(N+i) ∈ {0, 1, ..., C−1} or fall out of the predefined range. In other words, if the absent y^(N+i) is in the predefined range, its z_s follows the same Gaussian mixture distribution as in Eq. 6. Otherwise, z_s should follow an ambiguous Gaussian distribution defined in Eq. 11:

\mu_t = \sum_c p(c)\, \mu_c, \qquad \sigma_t^2 = \sum_c p(c)\, \sigma_c^2 + \sum_c p(c)\, \mu_c^2 - \Big(\sum_c p(c)\, \mu_c\Big)^2 \quad (11)
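The following hedged PyTorch-style sketch illustrates the label-relevant losses of Eqs. 7-9 and the "ambiguous" Gaussian moments of Eq. 11; the tensors `mu` (C×D class means) and `logvar` (C×D diagonal log-variances), and the uniform prior p(c) = 1/C, are illustrative assumptions rather than the authors' implementation.

```python
import math
import torch
import torch.nn.functional as F

def class_log_likelihood(z_s, mu, logvar):
    # log N(z_s; mu_c, Sigma_c) for every class c, diagonal covariance.
    # z_s: (B, D), mu/logvar: (C, D)  ->  (B, C)
    diff = z_s.unsqueeze(1) - mu.unsqueeze(0)
    return -0.5 * (logvar + diff.pow(2) / logvar.exp() + math.log(2 * math.pi)).sum(dim=2)

def gm_loss(z_s, y, mu, logvar, lambda_lkd=0.1):
    log_lik = class_log_likelihood(z_s, mu, logvar)            # (B, C)
    # Eq. 8: with a uniform prior p(c) = 1/C, the posterior q(c|z_s) is a softmax
    # over the per-class log-likelihoods, so L_cls is an ordinary cross entropy.
    l_cls = F.cross_entropy(log_lik, y)
    # Eq. 7: negative log-likelihood under the Gaussian of the true class y.
    l_lkd = -log_lik.gather(1, y.unsqueeze(1)).mean()
    return l_cls + lambda_lkd * l_lkd                          # Eq. 9

def ambiguous_gaussian(mu, logvar):
    # Eq. 11: total mean/variance of the mixture with p(c) = 1/C, used for
    # unlabelled samples whose class may lie outside the known categories.
    mu_t = mu.mean(dim=0)
    var_t = logvar.exp().mean(dim=0) + mu.pow(2).mean(dim=0) - mu_t.pow(2)
    return mu_t, var_t
```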

Algorithm 1. The training process of our proposed architecture.
Require: ψ, φ, θ, ω, θ_d: initial parameters of Encoder_s, Encoder_u, the Decoder, the adversarial classifier on z_u, and the discriminator on x; µ_c and Σ_c: initial mean and covariance of the Gaussian distribution of z_s for each class c; n_gm: the number of iterations of L_GM per end-to-end iteration; λ_re and λ_kl: the weights of L_re and L_kl.
1: while not converged do
2:   for i = 0 to n_gm do
3:     Sample a batch {x, y} from the dataset.
4:     ẑ_s ← Encoder_s(x)
5:     L_GM ← −log q(y|ẑ_s) − λ_lkd log p(ẑ_s|y)
6:     ψ ← ψ − ∇_ψ L_GM
7:     µ_c ← µ_c − ∇_{µ_c} L_GM, c ∈ [0, C−1]
8:     Σ_c ← Σ_c − ∇_{Σ_c} L_GM, c ∈ [0, C−1]
9:   end for
10:  Sample a batch {x, y} from the dataset.
11:  µ, Σ ← Encoder_u(x)
12:  L_kl ← D_KL[N(µ, Σ) ‖ N(0, I)]
13:  Sample ε ∼ N(0, I)
14:  z_u ← Σ^{1/2} ε + µ
15:  L_E^adv ← −(1/C) Σ_c log q_ω(c|z_u)
16:  L_C^adv ← −log q_ω(y|z_u)
17:  ẑ_s ← Encoder_s(x)
18:  L_lkd ← −log N(ẑ_s; µ_y, Σ_y)
19:  x_f ← Decoder(ẑ_s, z_u)
20:  L_re ← (1/2) ‖x − x_f‖²
21:  Sample z_s^p ∼ p(z_s|y), z_u^p ∼ N(0, I)
22:  x_p ← Decoder(z_s^p, z_u^p)
23:  L_D^adv ← −log D_θd(x, y) − log(1 − D_θd(x_f, y)) − log(1 − D_θd(x_p, y))
24:  L_GD^adv ← −log D_θd(x_f, y) − log D_θd(x_p, y)
25:  ψ ← ψ − ∇_ψ(L_re + λ_lkd L_lkd)
26:  φ ← φ − ∇_φ(L_E^adv + λ_kl L_kl + λ_re L_re)
27:  ω ← ω − ∇_ω L_C^adv
28:  θ ← θ − ∇_θ(L_re + L_GD^adv)
29:  θ_d ← θ_d − ∇_θd L_D^adv
30: end while

More specifically, z_s is expected to follow N(µ_t, Σ_t), where µ_t and Σ_t are the total mean and covariance of all the class-specific Gaussian distributions N(µ_c, Σ_c) given in Eq. 6. Here Σ_t is a diagonal matrix with σ_t² as its variance vector, and σ_c² is the variance vector of Σ_c. Hence, the likelihood regularization term becomes L_lkd = −log N(ẑ_s; µ_t, Σ_t). The whole network is trained in an end-to-end manner using the total losses. Note that in this setting the label y is not provided, so L_GM, L_E^adv and L_C^adv are ignored in the training process.

4. Experiments
In this section, experiments are carried out to validate the effectiveness of the proposed method. A toy example is first designed to show that, by disentangling the label relevant and irrelevant codes, our model has the ability to generate more diverse data samples than CVAE-GAN [3]. We then compare the quality of generated images on real image datasets. The latent space is also analyzed. Finally, the experiments on semi-supervised generation and image inpainting show the flexibility of our model, hence it may have many potential applications.

4.1. Toy examples
This section demonstrates our method on a toy example, in which the real data distribution lies in 2D with one dimension (the x axis) being label relevant and the other (the y axis) being irrelevant. The distribution is assumed to be known. There are 3 types of data points, indicated by green, red and blue, belonging to 3 classes. The 2D data points and their corresponding labels are given to our model for variational inference and new sample generation. For comparison, we also give the same training data to CVAE-GAN for the same purpose. The two compared models share similar network settings. In our model, the two encoders are both MLPs with 3 hidden layers of 32, 64, and 64 units. In CVAE-GAN, the encoder has the same structure, but there is only one encoder. The discriminators are exactly the same, also an MLP of 3 hidden layers with 32, 64, and 64 units. Adam is used as the optimization method, with a fixed learning rate applied to both. Each model is trained for 50 epochs, until both converge. The generated samples of each model are plotted in Figure 2.
From Figure 2 we can observe that both models capture the underlying data distribution, and our model converges at a similar rate. The advantage of our model is that it tends to generate more diverse samples, while CVAE-GAN generates samples in a conservative way, in which the label irrelevant dimension stays within a limited value range.

4.2. Analysis on generated image quality
In this section, we compare our method with other generative models in terms of image generation quality. The experiments are conducted on two datasets: FaceScrub [31] and CIFAR-10 [21]. FaceScrub contains 92k training images from 530 different identities. For FaceScrub, the cascaded object detector proposed in [42] is first used to detect faces, and then face alignment is conducted based on SDM proposed in [46]. The detected and cropped faces are resized to a fixed size. In the training process, the Adam optimizer is used. The hyper-parameters λ_lkd, λ_kl, and λ_re are set to 0.1, 10/N_pixel, and 1/N_zu,

respectively. Here, N_pixel is the number of image pixels, and N_zu is the dimension of z_u.

Figure 2. Results on a toy example for our model and CVAE-GAN: (a) the real data distribution; (b) the distributions generated by CVAE-GAN and ours at initialization and after 10, 20, 30, 40, and 50 epochs.

Figure 3. Visualization of generated images of different models on FaceScrub and CIFAR-10: (a) real samples, (b) CVAE, (c) cGAN, (d) CVAE-GAN, (e) ours.

Since our method incorporates the label for training, popular generative networks conditioned on the label, i.e. CVAE [39], CVAE-GAN [3], and cGAN [30], are chosen for comparison. For CVAE, CVAE-GAN and cGAN, we randomly generate samples of class c by first sampling z ∼ N(0, I) and then concatenating z with the one-hot vector of c as the input of the decoder/generator. As for ours, z_s ∼ N(µ_c, σ_c) and z_u ∼ N(0, I) are sampled and combined for the decoder to generate samples. Some of the generated images are visualized in Figure 3. It shows that samples generated by CVAE are highly blurred, and cGAN suffers from mode collapse. Samples generated by CVAE-GAN and our method seem to have similar quality, so we refer to two metrics, the Inception Score [36] and intra-class diversity [5], to compare them.

We adopt the Inception Score to evaluate the realism and inter-class diversity of images. Generated images that are close to real images of class y should have a posterior probability p(y|x) with low entropy. Meanwhile, images of diverse classes should have a marginal probability p(y) with high entropy. Hence, the Inception Score, formulated as exp(E_x KL(p(y|x) ‖ p(y))), gets a high value when images are realistic and diverse. To get the conditional class probability p(y|x), we first train a classifier with the Inception-ResNet-v1 [40] architecture on real data. Then we randomly generate 53k samples (100 for each class) of FaceScrub and 5k samples (500 for each class) of CIFAR-10, and feed them to the pre-trained classifier. The marginal p(y) is obtained by averaging all p(y|x). The results are listed in Table 1. We emphasize that our method generates more diverse samples within one class. Since the Inception Score only measures inter-class diversity, the intra-class diversity of samples should also be taken into account. We adopt the metric proposed in [5], which measures the average negative MS-SSIM [45] between all pairs in the generated image set X.
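A small hedged sketch of the Inception Score computation described above, exp(E_x KL(p(y|x) ‖ p(y))); `probs` is assumed to be an (N, num_classes) array of classifier posteriors p(y|x) for the generated samples.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    p_y = probs.mean(axis=0, keepdims=True)                          # marginal p(y)
    kl = (probs * (np.log(probs + eps) - np.log(p_y + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))                                  # higher is better
```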

Table 1. Inception Score of different methods (CVAE [39], cGAN [30], CVAE-GAN [3], and ours) on FaceScrub and CIFAR-10. Please refer to Section 4.2 for more details.

Table 2 shows the intra-class diversity of CVAE-GAN and our method on FaceScrub and CIFAR-10:

d_{intra}(X) = 1 - \frac{1}{|X|^2} \sum_{(x, x') \in X \times X} \mathrm{MS\text{-}SSIM}(x, x') \quad (12)

Table 2. Intra-class diversity of CVAE-GAN [3] and our method on FaceScrub and CIFAR-10. Please refer to Section 4.2 for more details.

4.3. Analysis on disentangled latent space
We now evaluate our proposal on the disentangled latent space, which is represented by the label relevant dimensions z_s and the irrelevant ones z_u. z_s for class c is supposed to capture the variation unique to training images within the label, while z_u should contain the variation in common characteristics shared by all classes. This is validated in the following ways: (1) fixing z_u and varying z_s. In this setting, we directly sample a z_u ∼ N(0, I) and keep it fixed. Then a set of z_s for class c is obtained by first getting a series of random codes sampled from N(0, I) and then mapping them to class c. Specifically, we first sample z_1 ∼ N(0, I) and z_2 ∼ N(0, I). A set of random codes z^(i) is then obtained by linear interpolation, i.e., z^(i) = αz_1 + (1 − α)z_2 with α ∈ [0, 1], and we map each z^(i) to class c with z_s^(i) = z^(i) σ_c + µ_c. Finally, each z_s^(i) is concatenated with the fixed z_u and given to the decoder to get a generated image. (2) Fixing z_s and varying z_u. Similar to (1), we first sample a z_s ∼ N(µ_c, σ_c) from a learned distribution and keep it fixed. Then a set of label irrelevant codes z_u is obtained by linearly interpolating between z_1 and z_2, where z_1 and z_2 are sampled from N(0, I). We conduct experiments on FaceScrub, and the generated images are shown in Figure 4. In Figure 4 (a), each row presents samples generated by linearly transformed z_s of a certain class c and a fixed z_u. All three rows share the same z_u, and each column shares the same random code z^(i); z_s^(i) = z^(i) σ_c + µ_c just maps it to a different class c. It shows that as z_s varies, faces may change differently for different identities, e.g., grow a beard, wrinkle, or take off the make-up. In Figure 4 (b), each row presents samples with linearly transformed z_u and a fixed z_s of class c, and each column shares the same z_u. We can see that images within each row change consistently in poses, expressions and illuminations. These two experiments suggest that z_s is relevant to c, while z_u reflects the more common, label irrelevant characteristics.

Figure 4. The generated images obtained by fixing one code and varying the other. In (a), each row shows samples for linearly transformed z_s of a certain class c with a fixed z_u. In (b), each row corresponds to samples for linearly transformed z_u with a fixed z_s of class c.

We are also interested in each dimension of z_u and conduct an experiment by varying a single element in it. We find three dimensions in z_u which reflect meaningful common characteristics, namely the expression, elevation and azimuth.
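The interpolation protocol of Section 4.3 can be sketched as follows; `decoder`, the learned per-class parameters `mu[c]`, `sigma[c]`, and the fixed `z_u` are assumed to come from a trained model, and all names are illustrative rather than the authors' exact code.

```python
import torch

def sweep_label_code(decoder, mu, sigma, c, z_u, steps=8):
    # Fix z_u and sweep z_s within class c, as in one row of Figure 4 (a).
    z1, z2 = torch.randn_like(mu[c]), torch.randn_like(mu[c])
    images = []
    for alpha in torch.linspace(0.0, 1.0, steps):
        z = alpha * z1 + (1.0 - alpha) * z2        # random code on the segment
        z_s = z * sigma[c] + mu[c]                 # map it to class c
        images.append(decoder(torch.cat([z_s, z_u], dim=-1).unsqueeze(0)))
    return torch.cat(images, dim=0)
```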
4.4. Semi-supervised image generation
Following the details in Section 3.6, experiments on semi-supervised image generation are conducted. We find that our method can learn a well disentangled latent representation when unlabelled extra data are available. To validate this, we randomly select 200 identities with about 21k images from the CASIA [47] dataset and remove their labels to form the unlabelled dataset D_u. Note that the identities in D_u are totally different from those in FaceScrub. After training the whole network on the labelled dataset D_s, we finetune it on D_u using the training algorithm illustrated in Section 3.6. To demonstrate the semi-supervised generation results, two different images are given to Encoder_s and Encoder_u to generate the codes z_s and z_u, respectively. Then the decoder is required to synthesize a new image based on the concatenated code from z_s and z_u. Figure 6 shows face synthesis results using images whose identities have not appeared in D_s.
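A minimal hedged sketch of the recombination used for Figure 6: z_s is taken from one image and z_u from another, then the two are decoded together. The trained modules `encoder_s`, `encoder_u` and `decoder`, and the use of the posterior mean of z_u at test time, are assumptions made for illustration.

```python
import torch

@torch.no_grad()
def recombine(encoder_s, encoder_u, decoder, x_identity, x_attribute):
    z_s = encoder_s(x_identity)            # label relevant code (identity)
    mu, logvar = encoder_u(x_attribute)    # label irrelevant posterior
    z_u = mu                               # use the posterior mean at test time
    return decoder(torch.cat([z_s, z_u], dim=1))
```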

The first row and first column show a set of original images providing z_u and z_s respectively, while images in the middle are generated ones using the z_s of the corresponding row and the z_u of the corresponding column. It is obvious that the identity depends on z_s, while other characteristics like the pose, illumination and expressions are reflected in z_u. This semi-supervised generation shows that z_s and z_u can also be disentangled for identities outside the labeled training data D_s, which provides great flexibility for image generation.

Figure 5. The generated images obtained by fixing z_s for each row and varying single dimensions in z_u. We find three different dimensions in z_u which directly cause the variations in expression in (a), elevation in (b), and azimuth in (c).

Figure 6. Face synthesis using images whose identities have not appeared in D_s. Original images providing z_u and z_s are given in the first row and the first column. The images synthesized from each combination of z_u and z_s are shown in the corresponding positions.

4.5. Image inpainting
Our method can also be applied to image inpainting: given a partly corrupted image, we can extract a meaningful latent code to reconstruct the original image. Note that in CVAE-GAN [3] an extra class label must be provided for reconstruction, while it is not needed in our method. In practice, we first corrupt some patches of an image x, namely the right-half, eyes, nose and mouth, and bottom-half regions, then input those corrupted images into the two encoders to get z_s and z_u, and then the reconstructed image x̂ is generated from the combined z_s and z_u. The image inpainting result is obtained by x_inp = M ⊙ x + (1 − M) ⊙ x̂, where M is the binary mask for the corrupted patch. Figure 7 shows the results of image inpainting. CVAE-GAN struggles to complete the images when it comes to a large part of missing regions (e.g., the right-half and bottom-half parts) or pivotal regions of the face (e.g., the eyes), while our method provides visually pleasing results.

Figure 7. Image inpainting results on (a) the right-half face, (b) the eyes, (c) the nose and mouth, and (d) the bottom-half face. Each group shows the original image, the corrupted image, the CVAE-GAN result, and ours.
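A hedged sketch of the inpainting compositing step, x_inp = M ⊙ x + (1 − M) ⊙ x̂. Here the mask is assumed to be 1 on the known (uncorrupted) pixels, and the trained encoders and decoder are illustrative placeholders.

```python
import torch

@torch.no_grad()
def inpaint(encoder_s, encoder_u, decoder, x_corrupted, mask):
    z_s = encoder_s(x_corrupted)
    mu, _ = encoder_u(x_corrupted)
    x_hat = decoder(torch.cat([z_s, mu], dim=1))      # full reconstruction
    return mask * x_corrupted + (1.0 - mask) * x_hat  # keep known pixels, fill the rest
```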

5. Conclusion
We propose a latent space disentangling algorithm on the VAE baseline. Our model learns two separate encoders and divides the latent code into label relevant and irrelevant dimensions. Together with a discriminator in the pixel domain, we show that our model can generate high quality and diverse images, and it can also be applied to semi-supervised image generation, in which unlabeled data with unseen classes are given to the encoders. Future research includes building more interpretable latent dimensions with the help of more labels, and reducing the correlation between the label relevant and irrelevant codes in our framework.

References
[1] M. E. Abbasnejad, A. Dick, and A. van den Hengel. Infinite variational autoencoder for semi-supervised learning. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE.
[2] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN. stat, 1050:9.
[3] J. Bao, D. Chen, F. Wen, H. Li, and G. Hua. CVAE-GAN: Fine-grained image generation through asymmetric training. In 2017 IEEE International Conference on Computer Vision (ICCV). IEEE.
[4] J. Bao, D. Chen, F. Wen, H. Li, and G. Hua. Towards open-set identity preserving face synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[5] M. Ben-Yosef and D. Weinshall. Gaussian mixture generative adversarial networks for diverse datasets, and the unsupervised clustering of images. arXiv preprint.
[6] K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan. Unsupervised pixel-level domain adaptation with generative adversarial networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1.
[7] C. P. Burgess, I. Higgins, A. Pal, L. Matthey, N. Watters, G. Desjardins, and A. Lerchner. Understanding disentangling in β-VAE. arXiv preprint.
[8] Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman. VGGFace2: A dataset for recognising faces across pose and age. In Automatic Face & Gesture Recognition (FG 2018), IEEE International Conference on. IEEE.
[9] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems.
[10] J. Donahue, P. Krähenbühl, and T. Darrell. Adversarial feature learning. arXiv preprint.
[11] V. Dumoulin, I. Belghazi, B. Poole, O. Mastropietro, A. Lamb, M. Arjovsky, and A. Courville. Adversarially learned inference. In ICLR.
[12] Y. Ge, Z. Li, H. Zhao, G. Yin, X. Wang, and H. Li. FD-GAN: Pose-guided feature distilling GAN for robust person re-identification. In Advances in Neural Information Processing Systems.
[13] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems.
[14] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems.
[15] N. Hadad, L. Wolf, and M. Shahar. A two-step disentanglement method. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[16] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner. β-VAE: Learning basic visual concepts with a constrained variational framework. In ICLR.
[17] M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley. Stochastic variational inference. The Journal of Machine Learning Research, 14(1).
[18] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. arXiv preprint.
[19] Y. Kim, S. Wiseman, A. C. Miller, D. Sontag, and A. M. Rush. Semi-amortized variational autoencoders. arXiv preprint.
[20] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. stat, 1050:1.
[21] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer.
[22] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther.
Autoencoding beyond pixels using a learned similarity metric. In International Conference on Machine Learning.
[23] H.-Y. Lee, H.-Y. Tseng, J.-B. Huang, M. K. Singh, and M.-H. Yang. Diverse image-to-image translation via disentangled representations. In European Conference on Computer Vision.
[24] M.-Y. Liu, T. Breuel, and J. Kautz. Unsupervised image-to-image translation networks. In Advances in Neural Information Processing Systems.
[25] A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey. Adversarial autoencoders. arXiv preprint.
[26] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. P. Smolley. Least squares generative adversarial networks. In Computer Vision (ICCV), 2017 IEEE International Conference on. IEEE.
[27] M. F. Mathieu, J. J. Zhao, J. Zhao, A. Ramesh, P. Sprechmann, and Y. LeCun. Disentangling factors of variation in deep representation using adversarial training. In Advances in Neural Information Processing Systems.
[28] L. Mescheder, S. Nowozin, and A. Geiger. Adversarial variational Bayes: Unifying variational autoencoders and generative adversarial networks. In International Conference on Machine Learning (ICML). PMLR.
[29] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida. Spectral normalization for generative adversarial networks. In ICLR.
[30] T. Miyato and M. Koyama. cGANs with projection discriminator. arXiv preprint.
[31] H.-W. Ng and S. Winkler. A data-driven approach to cleaning large face datasets. In Image Processing (ICIP), 2014 IEEE International Conference on. IEEE.
[32] S. Nowozin, B. Cseke, and R. Tomioka. f-GAN: Training generative neural samplers using variational divergence

minimization. In Advances in Neural Information Processing Systems.
[33] A. Odena, C. Olah, and J. Shlens. Conditional image synthesis with auxiliary classifier GANs. In International Conference on Machine Learning.
[34] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning.
[35] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3).
[36] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems.
[37] Z. Shu, E. Yumer, S. Hadap, K. Sunkavalli, E. Shechtman, and D. Samaras. Neural face editing with intrinsic image disentangling. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on. IEEE.
[38] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint.
[39] K. Sohn, H. Lee, and X. Yan. Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems.
[40] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI, volume 4.
[41] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the Inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[42] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In Computer Vision and Pattern Recognition (CVPR), Proceedings of the 2001 IEEE Computer Society Conference on, volume 1. IEEE.
[43] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds Dataset. Technical Report CNS-TR, California Institute of Technology.
[44] W. Wan, Y. Zhong, T. Li, and J. Chen. Rethinking feature distribution for loss functions in image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[45] Z. Wang, E. P. Simoncelli, and A. C. Bovik. Multiscale structural similarity for image quality assessment. In The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, 2003, volume 2. IEEE.
[46] X. Xiong and F. De la Torre. Supervised descent method and its applications to face alignment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[47] D. Yi, Z. Lei, S. Liao, and S. Z. Li. Learning face representation from scratch. arXiv preprint, 2014.

Appendices

A. Mathematical Proofs

A.1. The ELBO of the log-likelihood objective
We declare in Equation 2 that, after dividing the latent space into the label relevant dimensions z_s and the label irrelevant dimensions z_u, the ELBO of the log-likelihood objective log p(x) becomes 3 terms in our setting:

\log p(x) = \log \int\!\!\int \sum_c p(x, z_s, z_u, c)\, dz_s\, dz_u \ge \mathbb{E}_{q_\psi(z_s|x),\, q_\phi(z_u|x)}[\log p_\theta(x|z_s, z_u)] - D_{KL}(q_\phi(z_u|x)\,\|\,p(z_u)) - D_{KL}(q_\psi(z_s, c|x)\,\|\,p(z_s, c))

Proof. Our generative process is described as follows. First, sample a label relevant code z_s together with its class c from p(z_s, c), and a label irrelevant code z_u from p(z_u). Then a decoder p_θ(x|z_s, z_u), taking the combination of z_s and z_u as input, maps the latent codes to images. Hence, we factorize the joint distribution p(x, z_s, z_u, c) as

p(x, z_s, z_u, c) = p_\theta(x|z_s, z_u)\, p(z_s, c)\, p(z_u)

By using Jensen's inequality, the log-likelihood log p(x) can be written as:

\begin{aligned}
\log p(x) &= \log \int\!\!\int \sum_c p(x, z_s, z_u, c)\, dz_s\, dz_u \\
&= \log \int\!\!\int \sum_c p_\theta(x|z_s, z_u)\, p(z_s, c)\, p(z_u)\, dz_s\, dz_u \\
&= \log \mathbb{E}_{q_\psi(z_s, c|x),\, q_\phi(z_u|x)}\Big[\frac{p_\theta(x|z_s, z_u)\, p(z_u)\, p(z_s, c)}{q_\psi(z_s, c|x)\, q_\phi(z_u|x)}\Big] \\
&\ge \mathbb{E}_{q_\psi(z_s, c|x),\, q_\phi(z_u|x)}\Big[\log \frac{p_\theta(x|z_s, z_u)\, p(z_u)\, p(z_s, c)}{q_\psi(z_s, c|x)\, q_\phi(z_u|x)}\Big] \\
&= \mathbb{E}_{q_\psi(z_s, c|x),\, q_\phi(z_u|x)}[\log p_\theta(x|z_s, z_u)] + \mathbb{E}\Big[\log \frac{p(z_u)}{q_\phi(z_u|x)}\Big] + \mathbb{E}\Big[\log \frac{p(z_s, c)}{q_\psi(z_s, c|x)}\Big] \\
&= \mathbb{E}_{q_\psi(z_s|x),\, q_\phi(z_u|x)}[\log p_\theta(x|z_s, z_u)] - D_{KL}(q_\phi(z_u|x)\,\|\,p(z_u)) - D_{KL}(q_\psi(z_s, c|x)\,\|\,p(z_s, c))
\end{aligned}

A.2. Log-likelihood regularization term in the label relevant branch
Note that the KL divergence D_KL(q_ψ(z_s, c|x) ‖ p(z_s, c)), the third term of the ELBO in Equation 2, is minimized. If we assume conditional independence between z_s and the class c given x, then we have

q_\psi(z_s, c|x) = q_\psi(z_s|x)\, p(c|x)

where p(c|x) is the one-hot encoding of the label y. If q_ψ(z_s|x) is formulated as a Gaussian distribution with Σ → 0 and mean ẑ_s output by encoder_s, it is actually a Dirac delta function:

q_\psi(z_s|x) = \delta(z_s - \hat{z}_s)

The KL regularization term then becomes

\begin{aligned}
D_{KL}(q_\psi(z_s, c|x)\,\|\,p(z_s, c)) &= \int \sum_c \delta(z_s - \hat{z}_s)\, p(c|x) \log \frac{\delta(z_s - \hat{z}_s)\, p(c|x)}{p(z_s|c)\, p(c)}\, dz_s \\
&= -\int \sum_c \delta(z_s - \hat{z}_s)\, p(c|x) \log p(z_s|c)\, dz_s + \int \sum_c \delta(z_s - \hat{z}_s)\, p(c|x) \log \frac{p(c|x)}{p(c)}\, dz_s \\
&\quad + \int \sum_c \delta(z_s - \hat{z}_s)\, p(c|x) \log \delta(z_s - \hat{z}_s)\, dz_s \\
&= -\sum_c p(c|x) \log p(\hat{z}_s|c) + \sum_c p(c|x) \log \frac{p(c|x)}{p(c)} + \int \delta(z_s - \hat{z}_s) \log \delta(z_s - \hat{z}_s)\, dz_s
\end{aligned}

The second term relates only to the prior distribution, so it can be regarded as a constant. The third term is the negative entropy of the delta function and has nothing to do with ẑ_s, hence we consider it a constant too. Therefore, we have

D_{KL}(q_\psi(z_s, c|x)\,\|\,p(z_s, c)) = -\sum_c p(c|x) \log p(\hat{z}_s|c) + \mathrm{Const.} = -\sum_c \mathbb{I}(c = y) \log p(\hat{z}_s|c) + \mathrm{Const.}

where the prior distribution p(ẑ_s|c) is set to N(ẑ_s; µ_c, Σ_c). Ignoring the constant term, it turns out to be the likelihood regularization term L_lkd in Equation 7:

L_{lkd} = -\sum_c \mathbb{I}(c = y) \log \mathcal{N}(\hat{z}_s; \mu_c, \Sigma_c) = -\log \mathcal{N}(\hat{z}_s; \mu_y, \Sigma_y)

A.3. Cross-entropy objective in the label relevant branch
To encourage z_s to become as label relevant as possible, the mutual information I(z_s; c) is maximized,

where z_s ∼ q_ψ(z_s|x). In practice, I(z_s; c) is hard to optimize directly because it requires access to p(c|z_s). We can instead optimize its lower bound by introducing an auxiliary distribution q(c|z_s) to approximate p(c|z_s), as in InfoGAN [9]:

\begin{aligned}
I(z_s; c) &= H(c) - H(c|z_s) \\
&= H(c) + \mathbb{E}_{p(x)}\, \mathbb{E}_{q_\psi(z_s|x)}\, \mathbb{E}_{p(c|z_s)}[\log p(c|z_s)] \\
&= H(c) + \mathbb{E}_{p(x)}\, \mathbb{E}_{q_\psi(z_s|x)}\, \mathbb{E}_{p(c|z_s)}\Big[\log \frac{p(c|z_s)}{q(c|z_s)}\Big] + \mathbb{E}_{p(x)}\, \mathbb{E}_{q_\psi(z_s|x)}\, \mathbb{E}_{p(c|z_s)}[\log q(c|z_s)] \\
&\ge H(c) + \mathbb{E}_{p(x)}\, \mathbb{E}_{q_\psi(z_s|x)}\, \mathbb{E}_{p(c|z_s)}[\log q(c|z_s)]
\end{aligned}

Since we still need to sample from p(c|z_s) in the inner expectation, we adopt Lemma 5.1 in InfoGAN to further remove the need for p(c|z_s). The first term of the lower bound is a constant, so we ignore it. The second term becomes

\begin{aligned}
\mathbb{E}_{p(x)}\, \mathbb{E}_{q_\psi(z_s|x)}\, \mathbb{E}_{p(c|z_s)}[\log q(c|z_s)] &= \mathbb{E}_{p(c)}\, \mathbb{E}_{p(x|c)}\, \mathbb{E}_{q_\psi(z_s|x)}\, \mathbb{E}_{p(c'|z_s)}[\log q(c'|z_s)] \\
&= \int\!\!\int \sum_c \sum_{c'} p(c)\, p(x|c)\, q_\psi(z_s|x)\, p(c'|z_s) \log q(c'|z_s)\, dx\, dz_s \\
&= \int \sum_c \sum_{c'} p(c)\, p(c'|z_s) \log q(c'|z_s) \Big[\int p(x|c)\, q_\psi(z_s|x)\, dx\Big] dz_s
\end{aligned}

We hold the assumption that the process of sampling z_s given x is independent of c, thus

\int p(x|c)\, q_\psi(z_s|x)\, dx = \int p(z_s, x|c)\, dx = p(z_s|c)

According to Lemma 5.1 in InfoGAN, we have

\int \sum_c \sum_{c'} p(c)\, p(z_s|c)\, p(c'|z_s) \log q(c'|z_s)\, dz_s = \int \sum_c p(c)\, p(z_s|c) \log q(c|z_s)\, dz_s

Hence

\int \sum_c \sum_{c'} p(c)\, p(c'|z_s) \log q(c'|z_s) \Big[\int p(x|c)\, q_\psi(z_s|x)\, dx\Big] dz_s = \int \sum_c p(c)\, p(z_s|c) \log q(c|z_s)\, dz_s

We further factorize p(z_s|c) as \int p(x|c)\, q_\psi(z_s|x)\, dx, so the equation above becomes

\begin{aligned}
\int \sum_c p(c)\, p(z_s|c) \log q(c|z_s)\, dz_s &= \int\!\!\int \sum_c p(c)\, p(x|c)\, q_\psi(z_s|x) \log q(c|z_s)\, dz_s\, dx \\
&= \int\!\!\int \sum_c p(x)\, q_\psi(z_s|x)\, p(c|x) \log q(c|z_s)\, dz_s\, dx \\
&= \mathbb{E}_{p(x)}\, \mathbb{E}_{q_\psi(z_s|x)}\Big[\sum_c p(c|x) \log q(c|z_s)\Big]
\end{aligned}

where p(c|x) is the one-hot encoding of the label y, i.e. p(c|x) = I(c = y). Maximizing it is equivalent to minimizing its opposite, which is exactly the classification loss in Section 3.3:

L_{cls} = \mathbb{E}_{p(x)}\, \mathbb{E}_{q_\psi(z_s|x)}\Big[-\sum_c \mathbb{I}(c = y) \log q(c|z_s)\Big]

B. Experimental Details

B.1. Dataset synthesis for the toy example
Our synthetic dataset for the toy example is a modification of the two-moon dataset, containing three half circles instead of two. The generative process is as follows. First, sample data points from three half unit circles with a horizontal interval of 2.2. Then, add Gaussian noise with std = 0.15 to all of them.

B.2. Network architecture for FaceScrub
For the two encoders, Encoder_s and Encoder_u, we use the VGG [38] architecture with batch normalization layers added to each layer, and replace the last three fc layers with two fc layers of 1024 and 512 units. For the decoder, an inverse structure of the encoders is applied. The adversarial classifier in Section 3.2 consists of two fc layers of 256 and 530 units, and the discriminator contains 7 convolution layers and two fc layers (details are shown in Table 4). Note that spectral normalization [29] is applied to all of the weights in the discriminator, and the label embedding is incorporated in the first fc layer as in [30].

B.3. Network architecture for CIFAR-10
The network structures of the two encoders, the decoder and the discriminator for CIFAR-10 are shown in Table 3. The adversarial classifier in the latent space is similar to the one used for FaceScrub, with two fc layers of 256 and 10 units. Also, spectral normalization and the label embedding are applied in the discriminator.

B.4. Optimization
We use the Adam optimizer with β_1 = 0 and β_2 = 0.9.

Table 3. The network structure for CIFAR-10.
Encoder: input x; 3×3 conv, 32, stride 2, batchnorm, relu; 5×5 conv, 64, stride 2, batchnorm, relu; 3×3 conv, 128, stride 2, batchnorm, relu; 3×3 conv, 256, stride 2, batchnorm, relu; fc, 1024, batchnorm, relu; fc, 100 (for z_s) / 200 (for z_u).
Decoder: input z_s ∈ R^100, z_u ∈ R^200; concat; fc, 1024, batchnorm, relu; 5×5 conv, 256, stride 2, batchnorm, relu; 5×5 conv, 256, stride 1, batchnorm, relu; 5×5 conv, 128, stride 2, batchnorm, relu; 5×5 conv, 64, stride 2, batchnorm, relu; 5×5 conv, 32, stride 2, batchnorm, relu; 5×5 conv, 3, stride 1, tanh.
Discriminator: input x; 5×5 conv, 32, stride 1, lrelu; 5×5 conv, 128, stride 2, lrelu; 5×5 conv, 256, stride 2, lrelu; 5×5 conv, 256, stride 2, lrelu; fc, 512, lrelu; fc.

Table 4. The network structure of the discriminator for FaceScrub: input x; 3×3 conv, 64, stride 2, lrelu; 3×3 conv, 128, stride 2, lrelu; 3×3 conv, 256, stride 1, lrelu; 3×3 conv, 256, stride 2, lrelu; 3×3 conv, 512, stride 1, lrelu; 3×3 conv, 512, stride 2, lrelu; 3×3 conv, 512, stride 2, lrelu; global average pooling; fc, 1024, lrelu; fc, 1.

Since, in the training process, the first stage using L_GM is trained 3 times per second-stage iteration, L_GM converges fast. Continuing to train after it converges will cause instability of L_GM, because Σ_c gradually goes down. In practice, we decay the learning rate of Σ_c by 0.01 after 2 epochs.

B.5. Inception Score
Recall that the Inception Score requires access to the conditional class probability p(y|x). We use a classification model with the Inception-ResNet-v1 [40] architecture, trained on VGGFace2 [8], to evaluate generative models trained on FaceScrub. For generative models trained on CIFAR-10, a classification model with the Inception-v3 [41] architecture trained on ImageNet [35] is used.

C. Additional Experiment Results

C.1. More generated samples on FaceScrub and CIFAR-10
Figure 8 shows generated samples of our method on FaceScrub and CIFAR-10, with each row corresponding to a certain class.

C.2. Additional experiments on CUB and CIFAR-100
We additionally apply our method to the CUB [43] and CIFAR-100 [21] datasets. CUB contains 200 categories of birds with 11,788 images in total. For CUB, we crop the images according to the bounding boxes provided by the dataset and resize the cropped images. The network structure is the same as the one used for FaceScrub. For CIFAR-100, we use the same network as for CIFAR-10. Generated images are shown in Figure 9. Results of Inception Score and intra-class diversity are listed in Table 5 and Table 6, respectively.

Table 5. Inception Score of different methods (CVAE [39], cGAN [30], CVAE-GAN [3], and ours) on CUB and CIFAR-100.
Table 6. Intra-class diversity of different methods (CVAE-GAN [3] and ours) on CUB and CIFAR-100.
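As an illustration of Table 4 and the notes in Appendix B.2, below is a hedged PyTorch sketch of the FaceScrub discriminator (conv stack, global average pooling, fc 1024, fc 1) with spectral normalization on its weights. The projection-style use of the label embedding is one plausible reading of "incorporated as in [30]", and the embedding size and class count are assumptions, not the authors' exact code.

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

def sn_conv(cin, cout, stride):
    return spectral_norm(nn.Conv2d(cin, cout, 3, stride=stride, padding=1))

class FaceDiscriminator(nn.Module):
    def __init__(self, num_classes=530):
        super().__init__()
        # 7 conv layers of Table 4: 64/s2, 128/s2, 256/s1, 256/s2, 512/s1, 512/s2, 512/s2
        chans = [(3, 64, 2), (64, 128, 2), (128, 256, 1), (256, 256, 2),
                 (256, 512, 1), (512, 512, 2), (512, 512, 2)]
        layers = []
        for cin, cout, s in chans:
            layers += [sn_conv(cin, cout, s), nn.LeakyReLU(0.2)]
        self.features = nn.Sequential(*layers)
        self.fc = spectral_norm(nn.Linear(512, 1024))
        self.act = nn.LeakyReLU(0.2)
        self.out = spectral_norm(nn.Linear(1024, 1))
        self.embed = spectral_norm(nn.Embedding(num_classes, 1024))

    def forward(self, x, y):
        h = self.features(x).mean(dim=[2, 3])            # global average pooling
        h = self.act(self.fc(h))
        # projection term: inner product between the label embedding and h
        return self.out(h) + (self.embed(y) * h).sum(dim=1, keepdim=True)
```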


More information

Inverting The Generator Of A Generative Adversarial Network

Inverting The Generator Of A Generative Adversarial Network 1 Inverting The Generator Of A Generative Adversarial Network Antonia Creswell and Anil A Bharath, Imperial College London arxiv:1802.05701v1 [cs.cv] 15 Feb 2018 Abstract Generative adversarial networks

More information

SYNTHESIS OF IMAGES BY TWO-STAGE GENERATIVE ADVERSARIAL NETWORKS. Qiang Huang, Philip J.B. Jackson, Mark D. Plumbley, Wenwu Wang

SYNTHESIS OF IMAGES BY TWO-STAGE GENERATIVE ADVERSARIAL NETWORKS. Qiang Huang, Philip J.B. Jackson, Mark D. Plumbley, Wenwu Wang SYNTHESIS OF IMAGES BY TWO-STAGE GENERATIVE ADVERSARIAL NETWORKS Qiang Huang, Philip J.B. Jackson, Mark D. Plumbley, Wenwu Wang Centre for Vision, Speech and Signal Processing University of Surrey, Guildford,

More information

Improved Vehicle Classification in Long Traffic Video by Cooperating Tracker and Classifier Modules

Improved Vehicle Classification in Long Traffic Video by Cooperating Tracker and Classifier Modules Improved Vehile Classifiation in Long Traffi Video by Cooperating Traker and Classifier Modules Brendan Morris and Mohan Trivedi University of California, San Diego San Diego, CA 92093 {b1morris, trivedi}@usd.edu

More information

Video Data and Sonar Data: Real World Data Fusion Example

Video Data and Sonar Data: Real World Data Fusion Example 14th International Conferene on Information Fusion Chiago, Illinois, USA, July 5-8, 2011 Video Data and Sonar Data: Real World Data Fusion Example David W. Krout Applied Physis Lab dkrout@apl.washington.edu

More information

Generalized Loss-Sensitive Adversarial Learning with Manifold Margins

Generalized Loss-Sensitive Adversarial Learning with Manifold Margins Generalized Loss-Sensitive Adversarial Learning with Manifold Margins Marzieh Edraki and Guo-Jun Qi Laboratory for MAchine Perception and LEarning (MAPLE) http://maple.cs.ucf.edu/ University of Central

More information

arxiv: v2 [stat.ml] 21 Oct 2017

arxiv: v2 [stat.ml] 21 Oct 2017 Variational Approaches for Auto-Encoding Generative Adversarial Networks arxiv:1706.04987v2 stat.ml] 21 Oct 2017 Mihaela Rosca Balaji Lakshminarayanan David Warde-Farley Shakir Mohamed DeepMind {mihaelacr,balajiln,dwf,shakir}@google.com

More information

Rotation Invariant Spherical Harmonic Representation of 3D Shape Descriptors

Rotation Invariant Spherical Harmonic Representation of 3D Shape Descriptors Eurographis Symposium on Geometry Proessing (003) L. Kobbelt, P. Shröder, H. Hoppe (Editors) Rotation Invariant Spherial Harmoni Representation of 3D Shape Desriptors Mihael Kazhdan, Thomas Funkhouser,

More information

A scheme for racquet sports video analysis with the combination of audio-visual information

A scheme for racquet sports video analysis with the combination of audio-visual information A sheme for raquet sports video analysis with the ombination of audio-visual information Liyuan Xing a*, Qixiang Ye b, Weigang Zhang, Qingming Huang a and Hua Yu a a Graduate Shool of the Chinese Aadamy

More information

Relevance for Computer Vision

Relevance for Computer Vision The Geometry of ROC Spae: Understanding Mahine Learning Metris through ROC Isometris, by Peter A. Flah International Conferene on Mahine Learning (ICML-23) http://www.s.bris.a.uk/publiations/papers/74.pdf

More information

ALICE: Towards Understanding Adversarial Learning for Joint Distribution Matching

ALICE: Towards Understanding Adversarial Learning for Joint Distribution Matching ALICE: Towards Understanding Adversarial Learning for Joint Distribution Matching Chunyuan Li 1, Hao Liu 2, Changyou Chen 3, Yunchen Pu 1, Liqun Chen 1, Ricardo Henao 1 and Lawrence Carin 1 1 Duke University

More information

Pipelined Multipliers for Reconfigurable Hardware

Pipelined Multipliers for Reconfigurable Hardware Pipelined Multipliers for Reonfigurable Hardware Mithell J. Myjak and José G. Delgado-Frias Shool of Eletrial Engineering and Computer Siene, Washington State University Pullman, WA 99164-2752 USA {mmyjak,

More information

A Unified Feature Disentangler for Multi-Domain Image Translation and Manipulation

A Unified Feature Disentangler for Multi-Domain Image Translation and Manipulation A Unified Feature Disentangler for Multi-Domain Image Translation and Manipulation Alexander H. Liu 1 Yen-Cheng Liu 2 Yu-Ying Yeh 3 Yu-Chiang Frank Wang 1,4 1 National Taiwan University, Taiwan 2 Georgia

More information

arxiv: v1 [cs.cv] 16 Jul 2017

arxiv: v1 [cs.cv] 16 Jul 2017 enerative adversarial network based on resnet for conditional image restoration Paper: jc*-**-**-****: enerative Adversarial Network based on Resnet for Conditional Image Restoration Meng Wang, Huafeng

More information

8 : Learning Fully Observed Undirected Graphical Models

8 : Learning Fully Observed Undirected Graphical Models 10-708: Probabilisti Graphial Models 10-708, Spring 2018 8 : Learning Fully Observed Undireted Graphial Models Leturer: Kayhan Batmanghelih Sribes: Chenghui Zhou, Cheng Ran (Harvey) Zhang When learning

More information

Dr.Hazeem Al-Khafaji Dept. of Computer Science, Thi-Qar University, College of Science, Iraq

Dr.Hazeem Al-Khafaji Dept. of Computer Science, Thi-Qar University, College of Science, Iraq Volume 4 Issue 6 June 014 ISSN: 77 18X International Journal of Advaned Researh in Computer Siene and Software Engineering Researh Paper Available online at: www.ijarsse.om Medial Image Compression using

More information

Accommodations of QoS DiffServ Over IP and MPLS Networks

Accommodations of QoS DiffServ Over IP and MPLS Networks Aommodations of QoS DiffServ Over IP and MPLS Networks Abdullah AlWehaibi, Anjali Agarwal, Mihael Kadoh and Ahmed ElHakeem Department of Eletrial and Computer Department de Genie Eletrique Engineering

More information

Performance of Histogram-Based Skin Colour Segmentation for Arms Detection in Human Motion Analysis Application

Performance of Histogram-Based Skin Colour Segmentation for Arms Detection in Human Motion Analysis Application World Aademy of Siene, Engineering and Tehnology 8 009 Performane of Histogram-Based Skin Colour Segmentation for Arms Detetion in Human Motion Analysis Appliation Rosalyn R. Porle, Ali Chekima, Farrah

More information

CleanUp: Improving Quadrilateral Finite Element Meshes

CleanUp: Improving Quadrilateral Finite Element Meshes CleanUp: Improving Quadrilateral Finite Element Meshes Paul Kinney MD-10 ECC P.O. Box 203 Ford Motor Company Dearborn, MI. 8121 (313) 28-1228 pkinney@ford.om Abstrat: Unless an all quadrilateral (quad)

More information

19: Inference and learning in Deep Learning

19: Inference and learning in Deep Learning 10-708: Probabilistic Graphical Models 10-708, Spring 2017 19: Inference and learning in Deep Learning Lecturer: Zhiting Hu Scribes: Akash Umakantha, Ryan Williamson 1 Classes of Deep Generative Models

More information

IMPLICIT AUTOENCODERS

IMPLICIT AUTOENCODERS IMPLICIT AUTOENCODERS Anonymous authors Paper under double-blind review ABSTRACT In this paper, we describe the implicit autoencoder (IAE), a generative autoencoder in which both the generative path and

More information

Auto-encoder with Adversarially Regularized Latent Variables

Auto-encoder with Adversarially Regularized Latent Variables Information Engineering Express International Institute of Applied Informatics 2017, Vol.3, No.3, P.11 20 Auto-encoder with Adversarially Regularized Latent Variables for Semi-Supervised Learning Ryosuke

More information

Exploring the Commonality in Feature Modeling Notations

Exploring the Commonality in Feature Modeling Notations Exploring the Commonality in Feature Modeling Notations Miloslav ŠÍPKA Slovak University of Tehnology Faulty of Informatis and Information Tehnologies Ilkovičova 3, 842 16 Bratislava, Slovakia miloslav.sipka@gmail.om

More information

Semi-Amortized Variational Autoencoders

Semi-Amortized Variational Autoencoders Semi-Amortized Variational Autoencoders Yoon Kim Sam Wiseman Andrew Miller David Sontag Alexander Rush Code: https://github.com/harvardnlp/sa-vae Background: Variational Autoencoders (VAE) (Kingma et al.

More information

Unsupervised Stereoscopic Video Object Segmentation Based on Active Contours and Retrainable Neural Networks

Unsupervised Stereoscopic Video Object Segmentation Based on Active Contours and Retrainable Neural Networks Unsupervised Stereosopi Video Objet Segmentation Based on Ative Contours and Retrainable Neural Networks KLIMIS NTALIANIS, ANASTASIOS DOULAMIS, and NIKOLAOS DOULAMIS National Tehnial University of Athens

More information

Using Augmented Measurements to Improve the Convergence of ICP

Using Augmented Measurements to Improve the Convergence of ICP Using Augmented Measurements to Improve the onvergene of IP Jaopo Serafin, Giorgio Grisetti Dept. of omputer, ontrol and Management Engineering, Sapienza University of Rome, Via Ariosto 25, I-0085, Rome,

More information

Graph-Based vs Depth-Based Data Representation for Multiview Images

Graph-Based vs Depth-Based Data Representation for Multiview Images Graph-Based vs Depth-Based Data Representation for Multiview Images Thomas Maugey, Antonio Ortega, Pasal Frossard Signal Proessing Laboratory (LTS), Eole Polytehnique Fédérale de Lausanne (EPFL) Email:

More information

arxiv: v1 [cs.cv] 4 Nov 2018

arxiv: v1 [cs.cv] 4 Nov 2018 Improving GAN with neighbors embedding and gradient matching Ngoc-Trung Tran, Tuan-Anh Bui, Ngai-Man Cheung ST Electronics - SUTD Cyber Security Laboratory Singapore University of Technology and Design

More information

Deep generative models of natural images

Deep generative models of natural images Spring 2016 1 Motivation 2 3 Variational autoencoders Generative adversarial networks Generative moment matching networks Evaluating generative models 4 Outline 1 Motivation 2 3 Variational autoencoders

More information

One Against One or One Against All : Which One is Better for Handwriting Recognition with SVMs?

One Against One or One Against All : Which One is Better for Handwriting Recognition with SVMs? One Against One or One Against All : Whih One is Better for Handwriting Reognition with SVMs? Jonathan Milgram, Mohamed Cheriet, Robert Sabourin To ite this version: Jonathan Milgram, Mohamed Cheriet,

More information

Learning to generate with adversarial networks

Learning to generate with adversarial networks Learning to generate with adversarial networks Gilles Louppe June 27, 2016 Problem statement Assume training samples D = {x x p data, x X } ; We want a generative model p model that can draw new samples

More information

3-D IMAGE MODELS AND COMPRESSION - SYNTHETIC HYBRID OR NATURAL FIT?

3-D IMAGE MODELS AND COMPRESSION - SYNTHETIC HYBRID OR NATURAL FIT? 3-D IMAGE MODELS AND COMPRESSION - SYNTHETIC HYBRID OR NATURAL FIT? Bernd Girod, Peter Eisert, Marus Magnor, Ekehard Steinbah, Thomas Wiegand Te {girod eommuniations Laboratory, University of Erlangen-Nuremberg

More information

Contour Box: Rejecting Object Proposals Without Explicit Closed Contours

Contour Box: Rejecting Object Proposals Without Explicit Closed Contours Contour Box: Rejeting Objet Proposals Without Expliit Closed Contours Cewu Lu, Shu Liu Jiaya Jia Chi-Keung Tang The Hong Kong University of Siene and Tehnology Stanford University The Chinese University

More information

A Coarse-to-Fine Classification Scheme for Facial Expression Recognition

A Coarse-to-Fine Classification Scheme for Facial Expression Recognition A Coarse-to-Fine Classifiation Sheme for Faial Expression Reognition Xiaoyi Feng 1,, Abdenour Hadid 1 and Matti Pietikäinen 1 1 Mahine Vision Group Infoteh Oulu and Dept. of Eletrial and Information Engineering

More information

Weak Dependence on Initialization in Mixture of Linear Regressions

Weak Dependence on Initialization in Mixture of Linear Regressions Proeedings of the International MultiConferene of Engineers and Computer Sientists 8 Vol I IMECS 8, Marh -6, 8, Hong Kong Weak Dependene on Initialization in Mixture of Linear Regressions Ryohei Nakano

More information

Segmentation of brain MR image using fuzzy local Gaussian mixture model with bias field correction

Segmentation of brain MR image using fuzzy local Gaussian mixture model with bias field correction IOSR Journal of VLSI and Signal Proessing (IOSR-JVSP) Volume 2, Issue 2 (Mar. Apr. 2013), PP 35-41 e-issn: 2319 4200, p-issn No. : 2319 4197 Segmentation of brain MR image using fuzzy loal Gaussian mixture

More information

arxiv: v1 [cs.cv] 8 Jan 2019

arxiv: v1 [cs.cv] 8 Jan 2019 GILT: Generating Images from Long Text Ori Bar El, Ori Licht, Netanel Yosephian Tel-Aviv University {oribarel, oril, yosephian}@mail.tau.ac.il arxiv:1901.02404v1 [cs.cv] 8 Jan 2019 Abstract Creating an

More information

Plot-to-track correlation in A-SMGCS using the target images from a Surface Movement Radar

Plot-to-track correlation in A-SMGCS using the target images from a Surface Movement Radar Plot-to-trak orrelation in A-SMGCS using the target images from a Surfae Movement Radar G. Golino Radar & ehnology Division AMS, Italy ggolino@amsjv.it Abstrat he main topi of this paper is the formulation

More information

We don t need no generation - a practical approach to sliding window RLNC

We don t need no generation - a practical approach to sliding window RLNC We don t need no generation - a pratial approah to sliding window RLNC Simon Wunderlih, Frank Gabriel, Sreekrishna Pandi, Frank H.P. Fitzek Deutshe Telekom Chair of Communiation Networks, TU Dresden, Dresden,

More information

Volume 3, Issue 9, September 2013 International Journal of Advanced Research in Computer Science and Software Engineering

Volume 3, Issue 9, September 2013 International Journal of Advanced Research in Computer Science and Software Engineering Volume 3, Issue 9, September 2013 ISSN: 2277 128X International Journal of Advaned Researh in Computer Siene and Software Engineering Researh Paper Available online at: www.ijarsse.om A New-Fangled Algorithm

More information

Automatic Physical Design Tuning: Workload as a Sequence Sanjay Agrawal Microsoft Research One Microsoft Way Redmond, WA, USA +1-(425)

Automatic Physical Design Tuning: Workload as a Sequence Sanjay Agrawal Microsoft Research One Microsoft Way Redmond, WA, USA +1-(425) Automati Physial Design Tuning: Workload as a Sequene Sanjay Agrawal Mirosoft Researh One Mirosoft Way Redmond, WA, USA +1-(425) 75-357 sagrawal@mirosoft.om Eri Chu * Computer Sienes Department University

More information

What are Cycle-Stealing Systems Good For? A Detailed Performance Model Case Study

What are Cycle-Stealing Systems Good For? A Detailed Performance Model Case Study What are Cyle-Stealing Systems Good For? A Detailed Performane Model Case Study Wayne Kelly and Jiro Sumitomo Queensland University of Tehnology, Australia {w.kelly, j.sumitomo}@qut.edu.au Abstrat The

More information

Part Localization by Exploiting Deep Convolutional Networks

Part Localization by Exploiting Deep Convolutional Networks Part Localization by Exploiting Deep Convolutional Networks Marcel Simon, Erik Rodner, and Joachim Denzler Computer Vision Group, Friedrich Schiller University of Jena, Germany www.inf-cv.uni-jena.de Abstract.

More information

Adversarial Symmetric Variational Autoencoder

Adversarial Symmetric Variational Autoencoder Adversarial Symmetric Variational Autoencoder Yunchen Pu, Weiyao Wang, Ricardo Henao, Liqun Chen, Zhe Gan, Chunyuan Li and Lawrence Carin Department of Electrical and Computer Engineering, Duke University

More information

Multi-Piece Mold Design Based on Linear Mixed-Integer Program Toward Guaranteed Optimality

Multi-Piece Mold Design Based on Linear Mixed-Integer Program Toward Guaranteed Optimality INTERNATIONAL CONFERENCE ON MANUFACTURING AUTOMATION (ICMA200) Multi-Piee Mold Design Based on Linear Mixed-Integer Program Toward Guaranteed Optimality Stephen Stoyan, Yong Chen* Epstein Department of

More information

Smooth Trajectory Planning Along Bezier Curve for Mobile Robots with Velocity Constraints

Smooth Trajectory Planning Along Bezier Curve for Mobile Robots with Velocity Constraints Smooth Trajetory Planning Along Bezier Curve for Mobile Robots with Veloity Constraints Gil Jin Yang and Byoung Wook Choi Department of Eletrial and Information Engineering Seoul National University of

More information

Semi-Supervised Affinity Propagation with Instance-Level Constraints

Semi-Supervised Affinity Propagation with Instance-Level Constraints Semi-Supervised Affinity Propagation with Instane-Level Constraints Inmar E. Givoni, Brendan J. Frey Probabilisti and Statistial Inferene Group University of Toronto 10 King s College Road, Toronto, Ontario,

More information

Acoustic Links. Maximizing Channel Utilization for Underwater

Acoustic Links. Maximizing Channel Utilization for Underwater Maximizing Channel Utilization for Underwater Aousti Links Albert F Hairris III Davide G. B. Meneghetti Adihele Zorzi Department of Information Engineering University of Padova, Italy Email: {harris,davide.meneghetti,zorzi}@dei.unipd.it

More information

TUMOR DETECTION IN MRI BRAIN IMAGE SEGMENTATION USING PHASE CONGRUENCY MODIFIED FUZZY C MEAN ALGORITHM

TUMOR DETECTION IN MRI BRAIN IMAGE SEGMENTATION USING PHASE CONGRUENCY MODIFIED FUZZY C MEAN ALGORITHM TUMOR DETECTION IN MRI BRAIN IMAGE SEGMENTATION USING PHASE CONGRUENCY MODIFIED FUZZY C MEAN ALGORITHM M. Murugeswari 1, M.Gayathri 2 1 Assoiate Professor, 2 PG Sholar 1,2 K.L.N College of Information

More information

Time delay estimation of reverberant meeting speech: on the use of multichannel linear prediction

Time delay estimation of reverberant meeting speech: on the use of multichannel linear prediction University of Wollongong Researh Online Faulty of Informatis - apers (Arhive) Faulty of Engineering and Information Sienes 7 Time delay estimation of reverberant meeting speeh: on the use of multihannel

More information

Outline: Software Design

Outline: Software Design Outline: Software Design. Goals History of software design ideas Design priniples Design methods Life belt or leg iron? (Budgen) Copyright Nany Leveson, Sept. 1999 A Little History... At first, struggling

More information

Partial Character Decoding for Improved Regular Expression Matching in FPGAs

Partial Character Decoding for Improved Regular Expression Matching in FPGAs Partial Charater Deoding for Improved Regular Expression Mathing in FPGAs Peter Sutton Shool of Information Tehnology and Eletrial Engineering The University of Queensland Brisbane, Queensland, 4072, Australia

More information

System-Level Parallelism and Throughput Optimization in Designing Reconfigurable Computing Applications

System-Level Parallelism and Throughput Optimization in Designing Reconfigurable Computing Applications System-Level Parallelism and hroughput Optimization in Designing Reonfigurable Computing Appliations Esam El-Araby 1, Mohamed aher 1, Kris Gaj 2, arek El-Ghazawi 1, David Caliga 3, and Nikitas Alexandridis

More information

EASY TRANSFER LEARNING BY EXPLOITING INTRA-DOMAIN STRUCTURES

EASY TRANSFER LEARNING BY EXPLOITING INTRA-DOMAIN STRUCTURES EASY TRANSFER LEARNING BY EXPLOITING INTRA-DOMAIN STRUCTURES Jindong Wang 1, Yiqiang Chen 1,, Han Yu 2, Meiyu Huang 3, Qiang Yang 4 1 Beiing Key Lab. of Mobile Computing and Pervasive Devie, Inst. of Computing

More information

GAN Frontiers/Related Methods

GAN Frontiers/Related Methods GAN Frontiers/Related Methods Improving GAN Training Improved Techniques for Training GANs (Salimans, et. al 2016) CSC 2541 (07/10/2016) Robin Swanson (robin@cs.toronto.edu) Training GANs is Difficult

More information

Introduction to Seismology Spring 2008

Introduction to Seismology Spring 2008 MIT OpenCourseWare http://ow.mit.edu 1.510 Introdution to Seismology Spring 008 For information about iting these materials or our Terms of Use, visit: http://ow.mit.edu/terms. 1.510 Leture Notes 3.3.007

More information

arxiv: v1 [cs.cv] 6 Sep 2018

arxiv: v1 [cs.cv] 6 Sep 2018 arxiv:1809.01890v1 [cs.cv] 6 Sep 2018 Full-body High-resolution Anime Generation with Progressive Structure-conditional Generative Adversarial Networks Koichi Hamada, Kentaro Tachibana, Tianqi Li, Hiroto

More information

DOMAIN-ADAPTIVE GENERATIVE ADVERSARIAL NETWORKS FOR SKETCH-TO-PHOTO INVERSION

DOMAIN-ADAPTIVE GENERATIVE ADVERSARIAL NETWORKS FOR SKETCH-TO-PHOTO INVERSION DOMAIN-ADAPTIVE GENERATIVE ADVERSARIAL NETWORKS FOR SKETCH-TO-PHOTO INVERSION Yen-Cheng Liu 1, Wei-Chen Chiu 2, Sheng-De Wang 1, and Yu-Chiang Frank Wang 1 1 Graduate Institute of Electrical Engineering,

More information

Naïve Bayes Slides are adapted from Sebastian Thrun (Udacity ), Ke Chen Jonathan Huang and H. Witten s and E. Frank s Data Mining and Jeremy Wyatt,

Naïve Bayes Slides are adapted from Sebastian Thrun (Udacity ), Ke Chen Jonathan Huang and H. Witten s and E. Frank s Data Mining and Jeremy Wyatt, Naïve Bayes Slides are adapted from Sebastian Thrun (Udaity ), Ke Chen Jonathan Huang and H. Witten s and E. Frank s Data Mining and Jeremy Wyatt, Bakground There are three methods to establish a lassifier

More information

Alternatives to Direct Supervision

Alternatives to Direct Supervision CreativeAI: Deep Learning for Graphics Alternatives to Direct Supervision Niloy Mitra Iasonas Kokkinos Paul Guerrero Nils Thuerey Tobias Ritschel UCL UCL UCL TUM UCL Timetable Theory and Basics State of

More information

Auto-Encoding Variational Bayes

Auto-Encoding Variational Bayes Auto-Encoding Variational Bayes Diederik P (Durk) Kingma, Max Welling University of Amsterdam Ph.D. Candidate, advised by Max Durk Kingma D.P. Kingma Max Welling Problem class Directed graphical model:

More information

A Load-Balanced Clustering Protocol for Hierarchical Wireless Sensor Networks

A Load-Balanced Clustering Protocol for Hierarchical Wireless Sensor Networks International Journal of Advanes in Computer Networks and Its Seurity IJCNS A Load-Balaned Clustering Protool for Hierarhial Wireless Sensor Networks Mehdi Tarhani, Yousef S. Kavian, Saman Siavoshi, Ali

More information

Preliminary investigation of multi-wavelet denoising in partial discharge detection

Preliminary investigation of multi-wavelet denoising in partial discharge detection Preliminary investigation of multi-wavelet denoising in partial disharge detetion QIAN YONG Department of Eletrial Engineering Shanghai Jiaotong University 8#, Donghuan Road, Minhang distrit, Shanghai

More information

GAN and Feature Representation. Hung-yi Lee

GAN and Feature Representation. Hung-yi Lee GAN and Feature Representation Hung-yi Lee Outline Generator (Decoder) Discrimi nator + Encoder GAN+Autoencoder x InfoGAN Encoder z Generator Discrimi (Decoder) x nator scalar Discrimi z Generator x scalar

More information

Gradient based progressive probabilistic Hough transform

Gradient based progressive probabilistic Hough transform Gradient based progressive probabilisti Hough transform C.Galambos, J.Kittler and J.Matas Abstrat: The authors look at the benefits of exploiting gradient information to enhane the progressive probabilisti

More information

arxiv: v2 [cs.lg] 7 Feb 2019

arxiv: v2 [cs.lg] 7 Feb 2019 Implicit Autoencoders Alireza Makhzani Vector Institute for Artificial Intelligence University of Toronto makhzani@vectorinstitute.ai arxiv:1805.09804v2 [cs.lg 7 Feb 2019 Abstract In this paper, we describe

More information

Naïve Bayesian Rough Sets Under Fuzziness

Naïve Bayesian Rough Sets Under Fuzziness IJMSA: Vol. 6, No. 1-2, January-June 2012, pp. 19 25 Serials Publiations ISSN: 0973-6786 Naïve ayesian Rough Sets Under Fuzziness G. GANSAN 1,. KRISHNAVNI 2 T. HYMAVATHI 3 1,2,3 Department of Mathematis,

More information

On - Line Path Delay Fault Testing of Omega MINs M. Bellos 1, E. Kalligeros 1, D. Nikolos 1,2 & H. T. Vergos 1,2

On - Line Path Delay Fault Testing of Omega MINs M. Bellos 1, E. Kalligeros 1, D. Nikolos 1,2 & H. T. Vergos 1,2 On - Line Path Delay Fault Testing of Omega MINs M. Bellos, E. Kalligeros, D. Nikolos,2 & H. T. Vergos,2 Dept. of Computer Engineering and Informatis 2 Computer Tehnology Institute University of Patras,

More information

DOMAIN-ADAPTIVE GENERATIVE ADVERSARIAL NETWORKS FOR SKETCH-TO-PHOTO INVERSION

DOMAIN-ADAPTIVE GENERATIVE ADVERSARIAL NETWORKS FOR SKETCH-TO-PHOTO INVERSION 2017 IEEE INTERNATIONAL WORKSHOP ON MACHINE LEARNING FOR SIGNAL PROCESSING, SEPT. 25 28, 2017, TOKYO, JAPAN DOMAIN-ADAPTIVE GENERATIVE ADVERSARIAL NETWORKS FOR SKETCH-TO-PHOTO INVERSION Yen-Cheng Liu 1,

More information