An Empirical Study of Generative Adversarial Networks for Computer Vision Tasks


Report for Undergraduate Project - CS396A
Vinayak Tantia (Roll No: 14805)
Guide: Prof. Gaurav Sharma
CSE, IIT Kanpur, India

1 Introduction and Objective

Generative adversarial networks (GANs) are generative models that have achieved great success in unsupervised learning in various fields, including image generation, image styling, inpainting and 3D object generation. My aim in this project was to understand how GANs work, the techniques used to improve them, and their applications in 3D modeling and style transfer.

1.1 Generative Models

Generative models produce data that looks as if it had been sampled from the dataset on which they were trained, yet has novelties of its own. The objective of generative modeling is to help understand the world around us and, in the process, to build a representation of objects in the world that enables us to use data better. This is possible because these models are built from machine learning models such as neural networks, which have significantly fewer parameters than the data we train them on, so they must capture the structure of the data compactly in order to generate it. We can later use these parameters for various purposes, such as unsupervised classification of images.

2 Generative Adversarial Networks

2.1 Introduction

Generative adversarial networks were introduced by Goodfellow et al. [3]. They introduce a new process for training generative models: a generative model (G) synthesizes data, and a discriminative model (D) tries to identify properties of the data that reveal fake samples (samples created by the generator). The generator is expected to produce novel and different samples each time, so it is supplied with random noise as input. G tries to generate fake data that looks like the original, and D, given both original and fake data, tries to identify which samples are fake and which are original (from the dataset).
When D identifies the fake data correctly, G incurs a loss corresponding to the mistakes it made, and it trains itself using that loss. Similarly, when G is able to fool D, D incurs a loss and uses it to train itself.

2.2 Algorithm

Figure 1: Algorithm, taken from [3]

2.3 Theory and Theorems

Training G and D can be formulated as a minimax game between G and D with the value function V(G, D):

$$\min_G \max_D V(G, D) = \mathbb{E}_{x \sim P_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim P_z(z)}[\log(1 - D(G(z)))]$$

D tries to assign the value 1 to D(x), since x is sampled from the original data, and the value 0 to D(G(z)), since G(z) is a fake generated sample.

This minimax game satisfies the following properties:

- The global minimum of V(G, D*), where D* is the optimal discriminator for a given G, is attained when the generated data distribution and the original data distribution are the same.

- This global minimum is obtained when G is trained against an optimal discriminator at each step.

A few hacks were needed to make this work in practice:

- G is trained to maximise log(D(G(z))) instead of minimising log(1 - D(G(z))), to avoid saturation.
- D is not trained to completion in the inner loop. Instead, D is trained for k steps, followed by training G for 1 step.

3 Improving GANs

3.1 Deep Convolutional GANs

3.1.1 Introduction

GANs needed a lot of hyperparameter tuning in order to produce good results. This paper [8] largely addresses that problem. CNNs obtain great results on supervised learning tasks, so the authors incorporate CNNs into GANs to reduce the extensive hyperparameter tuning that GANs otherwise require.

3.1.2 Methods Suggested

The authors suggest that the following methods be used while training GANs:

- Use deep convolutional neural networks for both the generator and the discriminator
- Remove fully connected (FC) and pooling layers and substitute them with convolutional layers
- Use ReLU activations in the generator and LeakyReLU activations in the discriminator instead of other activations
- Use batch normalization in both the generator and the discriminator

3.1.3 Features of Method

- Moving along the latent space leads to items appearing and disappearing in the generated images
- Latent vectors support vector arithmetic
- Discriminator features can be used for other tasks such as classification

3.1.4 Results

Results are shown in Figures 2, 3 and 4.
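Returning to the first heuristic of Section 2.3, the saturation problem can be seen by comparing the gradients of the two generator objectives with respect to the discriminator's pre-sigmoid output (its logit). The sketch below is only an illustration: it assumes D ends in a sigmoid, and all function names are hypothetical. When D confidently rejects a fake sample (a very negative logit), the gradient of log(1 - D(G(z))) almost vanishes, while the gradient of -log(D(G(z))) stays close to 1 in magnitude.

```python
import math


def sigmoid(a):
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-a))


def discriminator_loss(d_real, d_fake):
    """Negative of V(G, D) for one real and one fake sample; D minimises this."""
    return -(math.log(d_real) + math.log(1.0 - d_fake))


def generator_grad_saturating(logit):
    """Derivative of log(1 - sigmoid(logit)) w.r.t. the logit (original minimax loss)."""
    return -sigmoid(logit)


def generator_grad_non_saturating(logit):
    """Derivative of -log(sigmoid(logit)) w.r.t. the logit (heuristic loss)."""
    return sigmoid(logit) - 1.0


# Logit of D on a fake sample: very negative means "confidently fake".
for logit in (-6.0, 0.0, 6.0):
    print(logit,
          generator_grad_saturating(logit),
          generator_grad_non_saturating(logit))
```

Both losses push the logit in the same direction (both gradients are negative, so gradient descent raises the logit), but only the heuristic loss gives a strong signal early in training, when the generator's samples are easy to reject.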

Figure 2: Sample images generated on LSUN dataset, taken from [8]

Figure 3: Discriminator accuracy for classification, taken from [8]

Figure 4: Vector arithmetic on latent vectors, taken from [8]
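The latent-space properties listed in Section 3.1.3 and illustrated in Figure 4 amount to simple vector operations on the generator input. The following is a minimal sketch of those operations only, not of the DCGAN itself: the uniform prior, the vector size and all function names are assumptions made for illustration, and each resulting vector would still have to be passed through a trained generator to obtain an image.

```python
import random


def sample_latent(dim, rng):
    """Draw one latent vector with entries uniform in [-1, 1] (an assumed prior)."""
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def interpolate(z_start, z_end, steps):
    """Return `steps` latent vectors on the straight line from z_start to z_end."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    path = []
    for i in range(steps):
        t = i / (steps - 1)
        path.append([(1.0 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return path


def mean_latent(vectors):
    """Average several latent vectors that share a visual concept."""
    count = len(vectors)
    return [sum(components) / count for components in zip(*vectors)]


def analogy(z_a, z_b, z_c):
    """Compute z_a - z_b + z_c component-wise."""
    return [a - b + c for a, b, c in zip(z_a, z_b, z_c)]


rng = random.Random(0)
z1, z2 = sample_latent(4, rng), sample_latent(4, rng)
path = interpolate(z1, z2, steps=5)
print(len(path), path[0] == z1, path[-1] == z2)
```

Averaging a few vectors per concept before doing arithmetic, as `mean_latent` allows, is a design choice intended to make the result less sensitive to any single sample.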

3.2 Wasserstein GAN [1]

3.2.1 Requirement for the paper

GANs have achieved great success in generating images, but they still require carefully tuned hyperparameters and a careful balance between training the discriminator and the generator. The reason is that when they minimize/maximize the log loss, they are actually minimizing the Jensen-Shannon divergence between the generated distribution and the data distribution. When the two distributions are too far apart and the discriminator is trained well, the loss feedback given by the discriminator suffers from vanishing gradients and the generator stops updating itself. To avoid this, Arjovsky et al. suggest a new way to train GANs.

3.2.2 Proposed Method

They propose the following training algorithm:

Figure 5: Wasserstein-1 algorithm, taken from [1]

They show that this minimises the Wasserstein-1 (Earth Mover's) distance, and prove through several theorems that minimizing the Wasserstein-1 distance is better behaved than minimizing the Jensen-Shannon divergence.

3.2.3 Benefits

- Meaningful loss metric
- Improved stability of the training process
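The difference between the two distances can be illustrated on a toy case: the "real" data is a point mass at 0 and the "generated" data is a point mass at theta. The sketch below is an illustration under these assumptions (1D samples, equally sized sample sets, hypothetical function names) and is not the training algorithm of Figure 5. As theta approaches 0, the Wasserstein-1 distance shrinks in proportion to theta, whereas the Jensen-Shannon divergence stays at log 2 until the two distributions coincide, so it gives the generator no useful gradient.

```python
import math
from collections import Counter


def wasserstein_1(xs, ys):
    """Wasserstein-1 distance between two equal-size 1D empirical distributions."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equally sized sample sets")
    # In 1D the optimal transport plan matches the sorted samples in order.
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def js_divergence(xs, ys):
    """Jensen-Shannon divergence (in nats) between two discrete empirical distributions."""
    p, q = Counter(xs), Counter(ys)
    total = 0.0
    for value in set(p) | set(q):
        pv = p[value] / len(xs)
        qv = q[value] / len(ys)
        mv = 0.5 * (pv + qv)
        if pv > 0:
            total += 0.5 * pv * math.log(pv / mv)
        if qv > 0:
            total += 0.5 * qv * math.log(qv / mv)
    return total


real = [0.0] * 4
for theta in (2.0, 1.0, 0.5, 0.0):
    fake = [theta] * 4
    print(theta, wasserstein_1(real, fake), js_divergence(real, fake))
```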

4 Conditional GANs

This work [6] is an extension of GANs that generates data conditioned on an input. An input y is fed to both G and D. This can be something like a label for MNIST data. For example, when the label 2 is fed as input in the form of a one-hot vector (or in another representation), images similar to the MNIST images of the digit 2 are generated. This GAN requires labeled data in order to learn to generate the images.

5 3D Modeling Given 2D Inputs

5.1 Task

The input is one (or multiple) 2D images. The objective is to generate a 3D model of the object shown in the 2D input.

5.2 Non-GAN approach - Multi-View 3D Models from Single Images with a Convolutional Network [10]

This approach feeds the input image into a CNN, along with the angle from which the 3D model is to be viewed. The output is a 2D image that shows the object from the requested viewpoint. Multiple such images can be obtained by supplying different angles as input. The resulting images are stitched into a 3D point cloud, which can later be transformed into a mesh. The losses used are a Euclidean loss for the RGB image and an L1 loss for the depth image. 3D models are required for training. The model is:

Figure 6: Network architecture, taken from [10]
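The combined training loss described in Section 5.2 can be sketched as follows. This is only an illustration under stated assumptions: images are given as flat lists of floats, the Euclidean term is taken in its squared form, and the `depth_weight` factor and all function names are hypothetical rather than taken from [10].

```python
def squared_euclidean(pred, target):
    """Sum of squared differences, used here for the RGB output."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))


def l1(pred, target):
    """Sum of absolute differences, used here for the depth output."""
    return sum(abs(p - t) for p, t in zip(pred, target))


def rgbd_loss(pred_rgb, true_rgb, pred_depth, true_depth, depth_weight=1.0):
    """RGB term plus (hypothetically weighted) depth term for one predicted view."""
    return (squared_euclidean(pred_rgb, true_rgb)
            + depth_weight * l1(pred_depth, true_depth))


# One RGB pixel and one depth value, flattened.
print(rgbd_loss([0.5, 0.5, 0.5], [1.0, 0.5, 0.0], [2.0], [1.5]))
```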

5.3 Learning a Probabilistic Latent Space of Object Shapes via 3D Generative-Adversarial Modeling [11]

5.3.1 Motivation Behind Solving

Previous methods leave holes in the object. Also, the images are rather blurred due to the use of the L1 loss. To avoid these problems, the authors generate the 3D model directly instead of stitching 2D images.

5.3.2 Initial Experiment

The authors train a GAN on a dataset of 3D models. The motivation for using a GAN is that it forces the generated 3D model to look like an original one (without holes and blur). Initially, they train a GAN that generates a voxel representation of a 3D model directly from a latent vector. They succeed using the following generator:

Figure 7: Generator as used in [11]. The discriminator mostly mirrors it.

5.3.3 Method

A 3D model is taken and projected onto a 2D image, which is given as input. This image is encoded into a latent vector so that the 3D model can be generated from that latent vector. An encoder E is added for this purpose. The latent vector is then used as input to the generator.

5.3.4 Losses Used

The loss used for training is

$$L = L_{3D\text{-}GAN} + \alpha_1 L_{KL} + \alpha_2 L_{recon}$$

with the following terms:

- $L_{3D\text{-}GAN} = \log D(x) + \log(1 - D(G(z)))$. This loss ensures that the generated 3D model looks realistic.

- $L_{KL} = D_{KL}(q(z \mid y) \,\|\, p(z))$, where q(z|y) is the variational distribution. This loss ensures that this distribution stays close to the prior distribution p(z) of the latent vector.
- $L_{recon} = \| G(E(y)) - x \|_2$. This ensures that the generated 3D model represents the 2D image and not a random 3D model.

5.3.5 Benefits

This is the first method to combine the benefits of GANs and volumetric CNNs for 3D modeling. The discriminator, learnt in an unsupervised manner, can be used for applications in 3D object recognition.

5.4 3D Modeling Using 2D Images

5.4.1 Objective

Gadelha et al. [2] propose a method to learn 3D distributions using only 2D images. They want to induce a distribution of 3D structures given 2D views from multiple viewpoints. They are motivated by the observation that humans can predict what a chair would look like after viewing multiple 2D images of it, so why not a GAN? They train a GAN using multiple 2D images and force it to generate 3D models that capture the underlying distribution.

5.4.2 Method

The generator, given an input vector, produces a 32x32x32 voxel object. The 3D model obtained is passed into a projection module, which outputs a 2D image. The discriminator is asked to infer whether this 2D image is real or fake. The projection module is differentiable, making the entire model differentiable. Thus, the feedback given on the output 2D image is used to update the parameters that generate the 3D model. This is their model:

Figure 8: Network architecture, taken from [2]
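To make the idea of a differentiable projection module concrete, the sketch below projects a voxel grid along one axis using a smooth function of the accumulated occupancy. The specific formula (one minus the exponential of the negative sum along each ray), the fixed axis-aligned viewpoint and the function names are assumptions chosen for illustration; the report does not state which projection is used. Because the pixel value is a smooth function of every voxel on its ray, gradients from the discriminator can flow back to the 3D occupancy values.

```python
import math


def project(voxels):
    """Project voxels[i][j][k] (occupancy in [0, 1]) along k to a 2D image.

    image[i][j] = 1 - exp(-sum_k voxels[i][j][k]): 0 for an empty ray,
    approaching 1 as more occupied voxels lie on the ray.
    """
    return [[1.0 - math.exp(-sum(ray)) for ray in row] for row in voxels]


def project_grad(voxels, i, j):
    """Derivative of image[i][j] w.r.t. any single voxel on ray (i, j)."""
    return math.exp(-sum(voxels[i][j]))


# A 2x2x2 grid: one empty ray, one partly filled ray, two filled rays.
grid = [
    [[0.0, 0.0], [0.5, 0.0]],
    [[1.0, 1.0], [1.0, 0.0]],
]
for row in project(grid):
    print(row)
print(project_grad(grid, 0, 1))
```

The gradient is largest for nearly empty rays and decays as a ray fills up, so under this particular choice the silhouette is most sensitive where the object boundary is still undecided.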

6 3D CNNs

A drawback of the above methods is that they can generate only 32x32x32 models, due to computation and memory limitations. To overcome these limitations, we searched for sparse 3D CNNs. We found 3D CNNs being used for 3D classification.

6.1 PointNet [7]

6.1.1 Motivation

Building a model for 3D classification and segmentation that takes only point clouds as input.

6.1.2 Method

The network takes a set of points, applies a function h to each of them, and then applies a symmetric function g (such as max pooling or average pooling) to h(x_1), h(x_2), .... This yields a feature vector, to which FC layers are applied for classification. This is not suitable for our case, as we cannot use a CNN-based GAN with it.

6.2 OctNet [9]

6.2.1 Motivation

Voxel representations contain a lot of empty space. The authors exploit this sparsity to speed up classification of 3D voxel grids.

6.2.2 Method

Each 3D voxel grid is divided into 8 octants; each carries 0 or 1, depending on whether the entire octant is empty. A cell is not subdivided more than 3 times, as access time increases beyond that.

7 Style Transfer

We now move to the task of style transfer.

7.1 Task

The objective is to take an image in one style and transform it into another, related style. For example, take an image captured during the day and transform it so that it appears to have been captured at night.

Figure 9: Style Transfer Example, taken from [12]

7.2 Perceptual Losses

This work [5] targets image-to-image translation and style transfer problems. Typically, L1 losses are used between input and output images to maintain similarity. This paper suggests a new way of calculating the loss: pass the two images (between which the loss is to be calculated) through a CNN, take the higher-level features (those in later layers), and use the difference between those features as the loss.

7.3 Supervised Style Transfer

7.3.1 Objective

This work [4] uses conditional GANs for the image translation task, to avoid hand-designing a loss function for each pair of categories. For example, the same loss can be used when translating edges to photographs and when translating night images to day images.

7.3.2 Model

They modify a conventional encoder-decoder model, which encodes the 2D image into a latent vector and then decodes it into another style, by introducing skip connections from the encoder to the generator to avoid information loss across layers. The idea is that the core information must pass through the latent vector, while other information can pass through these skip connections (which give the network its U shape) to keep the output image crisp. They also use a discriminator that operates on 70x70 patches instead of on the whole image.

7.3.3 Results

Results are shown in Figure 10.
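The 70x70 patch size mentioned in Section 7.3.2 is the receptive field of one discriminator output, which follows from the kernel sizes and strides of the convolutional stack. The sketch below shows that arithmetic. The layer list is an assumption (five 4x4 convolutions with strides 2, 2, 2, 1, 1 and padding 1, a configuration commonly used for such a patch discriminator); it is not given in the report, and the function names are hypothetical.

```python
def receptive_field(layers):
    """Input width seen by one output unit of a stack of convolutions.

    `layers` lists (kernel_size, stride) pairs from input to output.
    """
    field = 1
    # Walk backwards from the output: each layer widens the field.
    for kernel, stride in reversed(layers):
        field = (field - 1) * stride + kernel
    return field


def output_size(input_size, layers, padding=1):
    """Spatial size of the output map for a square input."""
    size = input_size
    for kernel, stride in layers:
        size = (size + 2 * padding - kernel) // stride + 1
    return size


# Assumed patch-discriminator stack: (kernel_size, stride) per layer.
PATCH_DISCRIMINATOR = [(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]

print(receptive_field(PATCH_DISCRIMINATOR))
print(output_size(256, PATCH_DISCRIMINATOR))
```

Under these assumptions, each value in the discriminator's output map judges one 70x70 region of the input, and the per-patch decisions are combined into the final real/fake signal.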

Figure 10: Results, taken from [4]

7.4 Unsupervised Style Transfer

7.4.1 Motivation

Paired data is difficult to find, so the authors of [12] suggest an unsupervised method that works without it.

7.4.2 Method

For each pair of domains, a pair of functions f and g is learnt: f maps an image from domain 1 to domain 2, and g does the reverse. The main changes are as follows. They constrain f(g(x)) to be equal to x, so that the stylised image stays similar to the input. They also use a least-squares loss instead of the log loss for the GANs, along the lines of the Wasserstein GAN and the Least-Squares GAN. They update their discriminators using a history of generated images.

References

[1] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN. arXiv preprint.
[2] M. Gadelha, S. Maji, and R. Wang. 3D shape induction from 2D views of multiple objects. arXiv preprint.
[3] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems.
[4] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. arXiv preprint.

[5] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision. Springer International Publishing.
[6] M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv preprint.
[7] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. arXiv preprint.
[8] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint.
[9] G. Riegler, A. O. Ulusoy, and A. Geiger. OctNet: Learning deep 3D representations at high resolutions. arXiv preprint.
[10] M. Tatarchenko, A. Dosovitskiy, and T. Brox. Multi-view 3D models from single images with a convolutional network. In European Conference on Computer Vision. Springer International Publishing.
[11] J. Wu, C. Zhang, T. Xue, B. Freeman, and J. Tenenbaum. Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling. In Advances in Neural Information Processing Systems, pages 82-90.
[12] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint.