Lab meeting (paper review session): Stacked Generative Adversarial Networks
2017-02-01, Saehoon Kim (Ph.D. candidate), Machine Learning Group
Papers to be covered
- Stacked Generative Adversarial Networks. X. Huang (Cornell), Y. Li (Cornell), O. Poursaeed (Cornell), J. Hopcroft (Cornell), S. Belongie (Cornell). arXiv:1612.04357v1 [cs.CV], 13 Dec 2016
- StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks. H. Zhang (Rutgers Univ.) et al. arXiv:1612.03242v1 [cs.CV], 10 Dec 2016
Generative Adversarial Networks
The generator and discriminator play the following two-player minimax game with value function
\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]
The generator G maps a latent space to the data space
The discriminator D(x) represents the probability that x came from the data rather than from the generator
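A toy Monte-Carlo estimate of the value function V(D, G). The 1-D densities, the crude "generator", and the sample sizes are all invented for illustration; the hand-coded discriminator happens to be the analytically optimal one for this pair of densities.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D setting: p_data = N(0, 1); the "generator" maps z ~ U(0, 1)
# into the data space, inducing p_g = U(-2, 2).
x_real = rng.normal(0.0, 1.0, size=1000)      # samples from p_data
z = rng.uniform(0.0, 1.0, size=1000)          # latent samples from p_z
x_fake = 4.0 * z - 2.0                        # G(z): a crude generator

def discriminator(x):
    """D(x): probability that x came from the data rather than the generator.
    Here it is the analytic optimum p_data / (p_data + p_g), not a trained net."""
    p_data = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)
    p_g = np.where(np.abs(x) <= 2.0, 0.25, 1e-8)   # density of U(-2, 2)
    return p_data / (p_data + p_g)

# Monte-Carlo estimate of V(D, G) = E_x[log D(x)] + E_z[log(1 - D(G(z)))]
value = np.mean(np.log(discriminator(x_real))) + \
        np.mean(np.log(1.0 - discriminator(x_fake)))
print(value)  # >= -log 4 ~ -1.386 in expectation; equality iff p_g = p_data
```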
A Theoretical Analysis of GANs [1]
[1] Generative Adversarial Nets, I. J. Goodfellow et al., NIPS 2014
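For reference, the main theoretical results of [1] in a short LaTeX sketch (p_g denotes the distribution the generator induces over the data space):

```latex
% For a fixed generator G, maximizing the value function pointwise in x gives
% the optimal discriminator
D^*_G(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)}
% Substituting D^*_G back into the value function yields
C(G) = \max_D V(D, G) = -\log 4 + 2\,\mathrm{JSD}\!\left(p_{data} \,\|\, p_g\right)
% Since the Jensen-Shannon divergence is nonnegative and zero iff its two
% arguments coincide, the global minimum C(G^*) = -\log 4 is attained
% exactly when p_g = p_{data}.
```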
Practical Learning Techniques [1]
To train the generator, the objective function is slightly modified in practice: instead of minimizing \log(1 - D(G(z))), which saturates early in training, the generator maximizes \log D(G(z)) (for this heuristic no theoretical guarantee holds)
[1] Improved Techniques for Training GANs, T. Salimans et al., NIPS 2016
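One common "twist" is the non-saturating generator loss: maximize log D(G(z)) instead of minimizing log(1 - D(G(z))). A quick numeric comparison of the gradient magnitudes with respect to the discriminator output d = D(G(z)) shows why; the sample d values are invented.

```python
import numpy as np

# Early in training the discriminator easily rejects samples, so D(G(z)) ~ 0.
d = np.array([0.999, 0.5, 0.01, 1e-4])       # D(G(z)) at various stages

# |d/dd log(1 - d)| = 1 / (1 - d): the original (saturating) objective
grad_saturating = np.abs(-1.0 / (1.0 - d))
# |d/dd log d| = 1 / d: the non-saturating heuristic
grad_heuristic = np.abs(1.0 / d)

print(grad_saturating)  # stays ~1 when d is small: the generator barely learns
print(grad_heuristic)   # grows as d -> 0: strong signal exactly when G is weak
```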
Inception Score
We apply the Inception model (GoogLeNet) to each generated image x to obtain the conditional label distribution p(y|x)
We expect a well-generated image to have a conditional label distribution p(y|x) with low entropy
We also expect the model to generate varied images, so the marginal p(y) should have high entropy
The following score is a natural way to assess the quality of generative models:
\mathrm{IS} = \exp\left(\mathbb{E}_x\left[\mathrm{KL}(p(y|x) \,\|\, p(y))\right]\right)
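A minimal sketch of the Inception-score computation, assuming the softmax outputs p(y|x) are already available as a matrix (the 4-class toy inputs are invented):

```python
import numpy as np

def inception_score(p_yx):
    """Inception score from an (N, K) matrix of conditional label
    distributions p(y|x), one row per generated image:
    IS = exp( E_x [ KL( p(y|x) || p(y) ) ] )."""
    p_y = p_yx.mean(axis=0)                  # marginal label distribution
    kl = np.sum(p_yx * (np.log(p_yx + 1e-12) - np.log(p_y + 1e-12)), axis=1)
    return np.exp(kl.mean())

# Sharp and varied: each image confidently hits a different class.
sharp = np.eye(4)
# Uniform: every conditional equals the marginal, so the KL term vanishes.
blurry = np.full((4, 4), 0.25)

print(inception_score(sharp))   # -> close to 4 (the number of classes)
print(inception_score(blurry))  # -> close to 1
```

High scores require both low-entropy conditionals and a high-entropy marginal, matching the two expectations above.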
Deep Convolutional GANs (DCGANs) [1]
The 100-dim noise vector is projected and reshaped into a 4x4x1024 feature map, then upsampled by transposed convolutions (a.k.a. deconvolutions)
[1] Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks, A. Radford et al., ICLR 2016
Transposed convolution [1] https://github.com/vdumoulin/conv_arithmetic
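A naive NumPy sketch of a single-channel, zero-padding transposed convolution, in the spirit of the animations in [1]: each input pixel scatters a scaled copy of the kernel into the output, which is how DCGAN generators upsample.

```python
import numpy as np

def conv_transpose2d(x, k, stride=1):
    """Naive 2-D transposed convolution (no padding): each input pixel
    scatters a copy of the kernel, scaled by its value, into the output."""
    ih, iw = x.shape
    kh, kw = k.shape
    oh = (ih - 1) * stride + kh      # output size formula for padding = 0
    ow = (iw - 1) * stride + kw
    out = np.zeros((oh, ow))
    for i in range(ih):
        for j in range(iw):
            out[i*stride:i*stride+kh, j*stride:j*stride+kw] += x[i, j] * k
    return out

x = np.arange(4.0).reshape(2, 2)     # a 2x2 input feature map
k = np.ones((3, 3))                  # 3x3 kernel of ones
y = conv_transpose2d(x, k, stride=2)
print(y.shape)  # (5, 5): (2 - 1) * 2 + 3 = 5, i.e. the map is upsampled
```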
Stacked Generative Adversarial Networks X. Huang, Y. Li, O. Poursaeed, J. Hopcroft, S. Belongie (Cornell University) In this paper we aim to leverage the powerful bottom-up discriminative representations to guide a top-down generative model. We propose a novel generative model named Stacked Generative Adversarial Networks (SGAN), which is trained to invert the hierarchical representations of a discriminative bottom-up deep network. Our model consists of a top-down stack of GANs, each trained to generate plausible lower-level representations, conditioned on higher-level representations. A representation discriminator is introduced at each feature hierarchy to encourage the representation manifold of the generator to align with that of the bottom-up discriminative network, providing intermediate supervision. In addition, we introduce a conditional loss that encourages the use of conditional information from the layer above, and a novel entropy loss that maximizes a variational lower bound on the conditional entropy of generator outputs. To the best of our knowledge, the entropy loss is the first attempt to tackle the conditional model collapse problem that is common in conditional GANs. We first train each GAN of the stack independently, and then we train the stack end-to-end. Unlike the original GAN that uses a single noise vector to represent all the variations, our SGAN decomposes variations into multiple levels and gradually resolves uncertainties in the top-down generative process.
Hierarchical image generation
Each lower-level representation is generated conditioned on the higher-level representation above it
Stacked Generative Adversarial Network (SGAN) [Pre-trained Encoder]
input → conv1 → pool1 → conv2 → pool2 → fc3 → fc4 (convolution, pooling, convolution, pooling, fully-connected layers)
Stacked Generative Adversarial Network (SGAN) [Stacked Generators]
Our goal is to train a top-down generator G that inverts E
G consists of a top-down stack of generators G_i, each trained to invert a bottom-up mapping E_i
Each generator is defined as \hat{h}_i = G_i(\hat{h}_{i+1}, z_i), where z_i is the noise injected at level i
An overview of image generation
New images can be sampled from SGAN by feeding random noise to each generator
This differs from DCGAN: instead of a single noise vector, multiple noise variables are used to generate a single image
Each generator can be built from transposed convolution operators
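A toy sketch of SGAN-style top-down sampling with per-level noise; the linear "generators", the layer sizes, and the noise dimension are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Representation sizes from top to bottom; purely illustrative numbers,
# ending in a flattened 28x28 "image".
dims = [10, 64, 256, 784]
noise_dim = 16

def make_generator(d_in, d_noise, d_out):
    """A linear stand-in for G_i: maps [h_{i+1}; z_i] to h_i."""
    W = rng.normal(0, 0.1, size=(d_out, d_in + d_noise))
    return lambda h, z: np.tanh(W @ np.concatenate([h, z]))

gens = [make_generator(dims[i], noise_dim, dims[i + 1]) for i in range(3)]

h = rng.normal(size=dims[0])         # top-level code
for G in gens:                       # each level injects its own noise z_i,
    z = rng.normal(size=noise_dim)   # unlike DCGAN's single noise vector
    h = G(h, z)

print(h.shape)  # (784,): the generated "image"
```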
An overview of SGAN
Training Discriminator [Standard Loss]
Each discriminator D_i distinguishes generated representations \hat{h}_i from real representations h_i
The loss for the discriminator is defined as
L_{D_i} = \mathbb{E}_{h_i \sim P_{data,E}}[-\log D_i(h_i)] + \mathbb{E}_{z_i \sim P_{z_i},\, h_{i+1} \sim P_{data,E}}[-\log(1 - D_i(G_i(h_{i+1}, z_i)))]
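A minimal sketch of this standard discriminator loss, assuming the discriminator's scalar outputs on real and generated representations are already given; the example probabilities are invented.

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    """L_{D_i} = E[-log D_i(h_i)] + E[-log(1 - D_i(G_i(h_{i+1}, z_i)))],
    estimated from discriminator outputs on real and generated batches."""
    eps = 1e-12
    return -np.mean(np.log(d_real + eps)) - np.mean(np.log(1.0 - d_fake + eps))

# A confident, correct discriminator pays a small loss ...
low = discriminator_loss(np.array([0.9, 0.95]), np.array([0.05, 0.1]))
# ... a fooled one pays a large loss.
high = discriminator_loss(np.array([0.4, 0.5]), np.array([0.6, 0.7]))
print(low, high)
```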
Training Generator (1/3) [Adversarial Loss]
Each GAN is first trained independently using L_{G_i}^{adv,indep}, conditioning G_i on real representations h_{i+1} from the encoder:
L_{G_i}^{adv} = \mathbb{E}_{z_i,\, h_{i+1}}[-\log D_i(G_i(h_{i+1}, z_i))]
The stack is then trained jointly in an end-to-end manner using L_{G_i}^{adv,joint}, where G_i is conditioned on the generated representations \hat{h}_{i+1} from the level above
Training Generator (2/3) [Conditional Loss]
They regularize the generator by feeding the generated lower-level representations back to the encoder
The recovered representations are enforced to be similar to the original conditioning representations:
L_{G_i}^{cond} = \mathbb{E}_{z_i,\, h_{i+1}}\left[f\big(E_i(G_i(h_{i+1}, z_i)),\, h_{i+1}\big)\right], where f is a distance measure (e.g. Euclidean distance)
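A toy sketch of the conditional loss, with a random linear map standing in for the encoder E_i and squared Euclidean distance as the assumed distance measure; all shapes and weights are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

W_enc = rng.normal(0, 0.1, size=(32, 128))    # linear stand-in for E_i

def conditional_loss(h_hat_i, h_ip1):
    """Feed the generated lower-level representation back through the
    encoder and penalize its distance to the conditioning representation:
    || E_i(hat h_i) - h_{i+1} ||^2."""
    recovered = W_enc @ h_hat_i               # E_i(hat h_i)
    return np.mean((recovered - h_ip1) ** 2)

h_ip1 = rng.normal(size=32)                   # conditioning representation
h_hat = rng.normal(size=128)                  # a generated lower-level rep.
print(conditional_loss(h_hat, h_ip1))         # nonzero unless E_i recovers h_{i+1}
```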
Training Generator (3/3) [Entropy Loss]
The generated representation \hat{h}_i should be sufficiently diverse when conditioned on h_{i+1}
The conditional entropy H(\hat{h}_i \mid h_{i+1}) should be as high as possible
Since this entropy is intractable, they propose to maximize a variational lower bound on it, using an auxiliary distribution Q_i that recovers the noise z_i from the generated \hat{h}_i
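A sketch of the resulting loss term E[-log Q_i(z_i | \hat{h}_i)], under the assumption of a fixed-variance Gaussian Q_i, in which case the negative log-likelihood reduces to squared error plus a constant; the noise dimension and values are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

def entropy_loss(z_true, z_pred, sigma=1.0):
    """-log N(z_true; z_pred, sigma^2 I): the variational entropy term for a
    Gaussian auxiliary distribution Q_i whose mean z_pred is Q's
    reconstruction of the injected noise from the generated representation."""
    d = z_true.size
    return 0.5 * np.sum((z_true - z_pred) ** 2) / sigma**2 \
           + 0.5 * d * np.log(2 * np.pi * sigma**2)

z = rng.normal(size=8)                 # the noise actually injected
print(entropy_loss(z, np.zeros(8)))    # poor reconstruction: large penalty
print(entropy_loss(z, z))              # perfect reconstruction: constant only
```

Minimizing this term forces \hat{h}_i to retain information about z_i, which prevents the generator from collapsing to a single output per h_{i+1}.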
Experiments
[Encoder] They use a small CNN as the encoder: conv1-pool1-conv2-pool2-fc3-fc4-softmax
[Generator] The top GAN G_1 generates fc3 features from random noise z_1, conditioned on the label y
The bottom GAN G_0 generates images from random noise z_0, conditioned on the fc3 features produced by G_1
SVHN Results
CIFAR Results
Inception Scores (CIFAR-10)
StackGAN: Text to Photo-realistic Image Synthesis H. Zhang (Rutgers), T. Xu (Lehigh Univ.), H. Li (CUHK), S. Zhang (UNC), X. Huang (Lehigh Univ.), X. Wang (CUHK), D. Metaxas (Rutgers) In this paper, we propose stacked Generative Adversarial Networks (StackGAN) to generate photo-realistic images conditioned on text descriptions. The Stage-I GAN sketches the primitive shape and basic colors of the object based on the given text description, yielding Stage-I low resolution images. The Stage-II GAN takes Stage-I results and text descriptions as inputs, and generates high resolution images with photorealistic details. The Stage-II GAN is able to rectify defects and add compelling details with the refinement process. Samples generated by StackGAN are more plausible than those generated by existing approaches. Importantly, our StackGAN for the first time generates realistic 256x256 images conditioned on only text descriptions, while state-of-the-art methods can generate at most 128x128 images. To demonstrate the effectiveness of the proposed StackGAN, extensive experiments are conducted on CUB and Oxford-102 datasets.
Motivating Examples
The architecture of StackGAN
Stage-I GAN [Model Architecture]
(Figure labels: text encoder is an LSTM or CNN with word embeddings; the embedding undergoes dimension reduction and is reshaped; the generator uses transposed convolutions; the discriminator is a feedforward NN)
Stage-II GAN [Model Architecture]
(Figure labels: text encoding same as in the Stage-I generator; a standard CNN downsamples the Stage-I result; transposed convolutions upsample; the discriminator has the same structure as in Stage-I)
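A NumPy shape walkthrough of the Stage-II pipeline (64x64 Stage-I result, 256x256 output, as reported in the paper); the average pooling and nearest-neighbour upsampling below are crude stand-ins for the actual strided and transposed convolutions.

```python
import numpy as np

rng = np.random.default_rng(0)

def avg_pool2(x):
    """2x spatial downsampling by averaging non-overlapping 2x2 blocks."""
    return x.reshape(x.shape[0] // 2, 2, x.shape[1] // 2, 2).mean(axis=(1, 3))

def upsample(x, f):
    """f-times nearest-neighbour spatial upsampling."""
    return x.repeat(f, axis=0).repeat(f, axis=1)

img = rng.random((64, 64))             # a (single-channel) Stage-I result
feat = avg_pool2(avg_pool2(img))       # Stage-II encoder: down to 16x16,
print(feat.shape)                      # where text features are concatenated
out = upsample(feat, 16)               # refinement + upsampling to 256x256
print(out.shape)
```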
Examples
Comparison between Stage I and II