Naturalistic Image Synthesis Using Variational Auto-Encoder


Submitted as a project report for EECS 294 in Fall 2016 by Raaz Dwivedi (raaz.rsk@eecs.berkeley.edu) and Orhan Ocal (ocal@eecs.berkeley.edu).

Abstract

We develop a deep generative model for naturalistic image synthesis using variational auto-encoders (VAE). Our model uses convolutional and fully-connected layers and includes an l2 loss on features extracted from a VGGNet that was pre-trained for classification on the ImageNet dataset. The feature loss is used to enhance the naturalness of the generated images. These choices deviate from the traditional fully-connected models that use only pixel and latent loss for training VAEs. We show that using convolutional layers in the model improves the performance for reconstruction and generation of images from the trained network. Although we obtain good results on the MNIST handwritten digits dataset, we were unable to generate realistic images using the diverse CIFAR-10 dataset. Furthermore, we could not conclude whether incorporating feature consistency in the loss function led to better results. Hence, our results deviate from the findings presented in a recent paper by Hou et al. [7], where the authors used the CelebFaces Attributes (CelebA) dataset and showed that incorporating feature loss from a pre-trained VGGNet helped their VAE generate more realistic images compared to the existing models in the literature.

Contents

1 Introduction
  1.1 Problem Statement
  1.2 Datasets
  1.3 Evaluating the Generative Model
  1.4 Organization
2 Theory behind VAE
3 Baseline Models
4 Our Network and Implementation
  4.1 Implementation and Tools
5 Our Results
  5.1 Remarks
6 State of the Art
  6.1 Remarks
7 Discussion

1 Introduction

Understanding the nature around us has been an interesting problem for centuries. Teaching the same to a machine, albeit a recent problem, has led to many interesting research questions. The scientific community has long been trying to model nature in many fields. The philosophy behind trying to understand the nature around us can be rightly attributed to Feynman's quote "What I cannot create, I do not understand." Learning a generative model is one of the many approaches that try to accomplish this goal. Given a dataset, a generative model is a model for randomly generating observable values similar to that dataset. The goal is beyond reproducing the dataset; that is, the model should be able to generate images that are not replicas of members of the training data, yet similar to the data. Such a model is helpful for many purposes. On one hand, if the model is compact, one has successfully compressed the data, which makes saving and transferring the data for many machine learning tasks much easier. To name a few applications, it can help obtain larger training sets for neural networks, de-noise images, perform in-painting, and so on. Besides some domain-specific concrete examples, generative models are also very useful in an abstract sense for the field of Artificial Intelligence. Human beings are intelligent because, with time and experience, they become very good at predicting the outcomes of many actions, and decide their actions by taking into account the anticipated outcome. This is because, with time, they learn the generative model for various natural processes. Thus, learning a generative model is a necessity for a robot if it wants to become as intelligent as a human and make decisions about which action to perform.

1.1 Problem Statement

We want to learn a generative model for natural images. For this problem, various approaches have been taken in the deep learning community, including Variational Auto-Encoders, Generative Adversarial Networks (GAN) and Pixel Recurrent Neural Networks. In this project we target synthesizing natural-looking images using Variational Auto-Encoders (VAE). This approach was introduced by Kingma and Welling [9] and has been widely researched since then. We start with that first work and build up to a very recent work [7].

1.2 Datasets

We use two datasets that are well known in the machine learning community. First is the MNIST dataset of handwritten digits [13], with a training set of 60,000 samples and a test set of 10,000 samples. The handwritten digits have been size-normalized and centered in a 28 x 28 monochromatic pixel image. Second is the CIFAR-10 dataset consisting of natural images [11]. It consists of 60,000 32 x 32 RGB color images in 10 classes, with 6,000 images per class. There are 50,000 training images and 10,000 test images. The classes are airplane, automobile, bird, cat, deer, dog, frog, horse, ship and truck. The classes are mutually exclusive (for example, there is no overlap between automobiles and trucks). Both datasets can be loaded as shown in the sketch below.
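To make the data setup concrete, here is a minimal loading sketch in Python. It assumes a TensorFlow version that bundles the tf.keras dataset helpers (not necessarily the version we used in 2016); the variable names are illustrative.

```python
import tensorflow as tf

# Load MNIST: 60,000 training and 10,000 test grayscale 28 x 28 images.
(x_train, _), (x_test, _) = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 28 * 28).astype('float32') / 255.0  # scale to [0, 1]

# Load CIFAR-10: 50,000 training and 10,000 test 32 x 32 RGB images.
(c_train, _), (c_test, _) = tf.keras.datasets.cifar10.load_data()
c_train = c_train.astype('float32') / 255.0
```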

1.3 Evaluating the Generative Model

In the literature [9, 6, 17, 7, 19], we were able to find three key ways to quantify the quality of the VAE built and compare it across various models: (1) likelihood of the training and test data under the trained VAE; (2) visual appeal of the reconstructed and generated images; (3) classification accuracy for the unlabelled images, when the network is trained with partially labelled data. In this project, because our goal is to generate natural-looking images, and not specifically to capture the likelihood function well, we choose to evaluate our models using the visual appeal of the reconstructed and generated images.

We build a model that has two parts: an encoder and a decoder. The encoder takes an input image and outputs useful features of the image. The decoder, on the other hand, constructs an image given the features of the image. Informally (to be made precise in Section 2), the decoder can be seen as the generative model, and the encoder can be seen as providing the codebook for the model. The two operations that a VAE targets are: (a) Reconstruction: if we input an image to the encoder, and pass its output (the code of the input image) through the decoder, it should output an image that matches the input image; (b) Generation: if we input a random signal (noise) to the decoder, it should use it to generate a random code from the codebook and use that code to output a natural-looking image. Fig. 1 illustrates these two operations; the sketch after this section makes them concrete. To be consistent with the literature, we shall refer to the code as the latent variable.

Fig. 1: (top) Reconstruction of an image when passed through both encoder and decoder. (bottom) When fed with noise, the decoder outputs a natural-looking image.
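The following minimal Python sketch pins down the two operations. The names encode, decode and latent_dim are hypothetical stand-ins for the trained encoder network, decoder network and code dimension:

```python
import numpy as np

def reconstruct(x, encode, decode):
    """Operation (a): pass an image through encoder then decoder."""
    mu_z, sigma_z = encode(x)                          # code distribution for x
    z = mu_z + sigma_z * np.random.randn(*mu_z.shape)  # sample a code
    return decode(z)                                   # should match x

def generate(decode, latent_dim=100):
    """Operation (b): feed noise from the prior N(0, I) to the decoder."""
    z = np.random.randn(latent_dim)
    return decode(z)
```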

1.4 Organization

The organization of this report is as follows. We briefly discuss the theoretical foundations of VAEs in Section 2. We then present, in Section 3, the baseline model from the work that introduced VAEs to the literature. We discuss the details of the network that we implemented, and the results, in Sections 4 and 5 respectively. In Section 6, we report the performance of various state-of-the-art models and contrast them with our results. We end, in Section 7, with an ongoing discussion in the deep learning community about how to approach learning a generative model.

2 Theory behind VAE

Suppose we are given a dataset {x_i, i = 1, ..., N} consisting of N independent and identically distributed samples of some random variable x, whose distribution is unknown. We assume that the generation of x is a two-step process: first, a value z is generated from some prior distribution p_θ(z), and then a value x is drawn from a conditional distribution p_θ(x|z). The issues are: θ is unknown, and we only observe samples of x. We usually refer to x as the data, while z (the unobserved random variable) is called the hidden/latent variable (the code). We further assume that the distributions involved belong to a parametric class indexed by a parameter θ ∈ Θ, and that the distributions have smooth densities associated with them. A popular scheme to estimate θ is to approximate it with the maximum-likelihood estimate for the observed data:

    \hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta \in \Theta} p_\theta(x_1, \ldots, x_N).

When Θ is complex, maximizing the likelihood can become a hard problem. Often, adopting the viewpoint of two-step generation with a hidden variable comes in handy. One can write a lower bound on the log-likelihood function as follows (using p := p_θ to de-clutter notation):

    \log p(x) = \log \int p(x, z)\, dz
              = \log \int \frac{p(x)\, p(z \mid x)}{q(z \mid x)}\, q(z \mid x)\, dz
              = \log \mathbb{E}_{z \sim q(\cdot \mid x)}\left[ \frac{p(x)\, p(z \mid x)}{q(z \mid x)} \right]
              \geq \mathbb{E}_{z \sim q(\cdot \mid x)}\left[ \log \frac{p(x)\, p(z \mid x)}{q(z \mid x)} \right],

where the inequality follows from Jensen's inequality. In fact, it turns out that if we optimize over q, this becomes an equality; that is, we can express log p(x) as follows:

    \log p(x) = \max_q \mathbb{E}_{z \sim q(\cdot \mid x)}\left[ \log \frac{p(x)\, p(z \mid x)}{q(z \mid x)} \right]   (1)
              = \max_q \left[ \log p(x) - D(q(z \mid x) \,\|\, p(z \mid x)) \right],   (2)

where D(· || ·) is the Kullback-Leibler divergence, and the optimizer (argmax) is given by q(z|x) = p(z|x). Converting the objective to a maximization (or minimization) problem like the above is known as the variational principle. It is a technique to convert a hard problem to a simpler one; for example, the Expectation-Maximization algorithm optimizes equation (2) over p(·) and q(·|x) iteratively, in turns, to try to maximize the likelihood function.

Using similar techniques and Bayes' rule, p(x) = p(z)p(x|z)/p(z|x), one can derive an alternate equality for log p(x) as follows:

    \log p(x) = \mathbb{E}_{q(z \mid x)}[\log p(x)]
              = \mathbb{E}_{q(z \mid x)}\left[ \log \left( \frac{p(z)\, p(x \mid z)}{p(z \mid x)} \cdot \frac{q(z \mid x)}{q(z \mid x)} \right) \right]
              = \mathbb{E}_{q(z \mid x)}[\log p(x \mid z)] + \int q(z \mid x) \log \frac{q(z \mid x)}{p(z \mid x)}\, dz - \int q(z \mid x) \log \frac{q(z \mid x)}{p(z)}\, dz
              = \mathbb{E}_{q(z \mid x)}[\log p(x \mid z)] + D(q(z \mid x) \,\|\, p(z \mid x)) - D(q(z \mid x) \,\|\, p(z)).

Moving D(q(z|x) || p(z|x)) to the left-hand side, we get

    \log p(x) - D(q(z \mid x) \,\|\, p(z \mid x)) = \mathbb{E}_{q(z \mid x)}[\log p(x \mid z)] - D(q(z \mid x) \,\|\, p(z)).   (3)

Using equations (2) and (3), we get another variational representation of the log-likelihood:

    \log p(x) = \max_q \underbrace{\mathbb{E}_{q(z \mid x)}[\log p(x \mid z)]}_{(A)} - \underbrace{D(q(z \mid x) \,\|\, p(z))}_{(B)}.   (4)

Computing the term (A) and its gradient with respect to q is tricky. Kingma and Welling [9] suggest using i.i.d. samples from q(z|x) to compute a Monte Carlo estimate of (A), and a clever re-parameterization trick (for variance reduction) to enable backpropagation of gradients with respect to q through the samples. Also, a convenient (but rich enough) choice of Gaussian forms for p(z) and q(z|x) gives a closed form for the term (B). To make our statements precise, assume the following: p(z) = N(0, I); p(z|x) ≈ q(z|x) = N(μ_z(x), Σ_z(x)), where Σ_z(x) = diag(σ_z²(x)) is a diagonal matrix with σ_z²(x) = (σ_z^(1)(x)², ..., σ_z^(h)(x)²) and h the latent dimension; and p(x|z) = N(μ_x(z), I).

For image synthesis, the observed data x is the image, while z (the hidden variable) can be considered as the code for the image that captures all the meaningful information of that image. This allows us to consider q(z|x) as the encoder, while p(x|z) can be understood as the decoder. Note that μ_x(z) denotes the mean of the distribution of x for a fixed z, and is a map from the code space to a vector in the data space. The mappings μ_z(x), σ_z(x) and μ_x(z) are built using neural networks. Let θ denote the parameters of the encoder (the networks computing μ_z(·) and σ_z(·)), and φ denote the parameters of the decoder (the network computing μ_x(·)). Let r(x_i) = μ_x(z_i) and z_i denote the reconstructed image and code, respectively, for input x_i. Putting together the pieces, equation (4) reduces to the following optimization problem (a sketch of its implementation follows):

    \min_{\theta, \phi} \frac{1}{N} \sum_{i=1}^{N} \Bigg( \underbrace{\| x_i - r(x_i) \|^2}_{\text{pixel loss or decoder loss}} + \frac{1}{2} \underbrace{\Bigg( \sum_{j=1}^{h} \Big( \sigma_z^{(j)}(x_i)^2 - \log \sigma_z^{(j)}(x_i)^2 - 1 \Big) + \| \mu_z(x_i) \|^2 \Bigg)}_{\text{latent loss or encoder loss}} \Bigg).
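A minimal TensorFlow sketch of this objective, including the re-parameterization trick, is given below. It assumes flattened image batches and an encoder that outputs the mean and log-variance of q(z|x); the names are illustrative rather than our exact implementation.

```python
import tensorflow as tf

def vae_loss(x, mu_z, log_sigma2_z, decoder):
    """ELBO-based loss of equation (4) under the Gaussian assumptions above."""
    # Re-parameterization trick: z = mu + sigma * eps with eps ~ N(0, I),
    # so gradients flow through the sampling step.
    eps = tf.random_normal(tf.shape(mu_z))
    z = mu_z + tf.exp(0.5 * log_sigma2_z) * eps
    x_recon = decoder(z)  # r(x) = mu_x(z)
    # Monte Carlo estimate of term (A) up to constants: the pixel loss.
    pixel_loss = tf.reduce_sum(tf.square(x - x_recon), axis=1)
    # Term (B), D(q(z|x) || p(z)), in closed form for diagonal Gaussians:
    # the latent loss.
    latent_loss = 0.5 * tf.reduce_sum(
        tf.exp(log_sigma2_z) + tf.square(mu_z) - 1.0 - log_sigma2_z, axis=1)
    return tf.reduce_mean(pixel_loss + latent_loss)
```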

3 Baseline Models

Kingma and Welling used two fully-connected layers for the encoder and the decoder in their first paper [9] on VAEs. Mathematically, the encoder is

    h_z(x) = \tanh(W_1 x + b_1), \quad \mu_z(x) = W_2 h_z(x) + b_2, \quad \log \sigma_z^2(x) = W_3 h_z(x) + b_3.

Then z is generated according to N(z; μ_z(x), diag(σ_z²(x))). Afterwards, the decoder is

    h_x(z) = \tanh(W_4 z + b_4), \quad \mu_x(z) = W_5 h_x(z) + b_5, \quad \log \sigma_x^2(z) = W_6 h_x(z) + b_6.

In the paper, they report results using VAEs trained on the MNIST and Frey Face datasets. Sample images generated by their network can be seen in Fig. 2. A minimal sketch of this baseline appears below.

Fig. 2: Generated digits (left) and faces (right) by the baseline model as reported in [9].

We were able to implement their model and reproduce the results on the MNIST dataset. However, the reported model had poor reconstruction and generation quality for CIFAR-10 (as will be shown in Fig. 5 in Section 5). Similarly poor results for CIFAR-10 have been reported in the literature even with complex networks [4]. We discuss the possible reasons behind this in Section 6.
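A minimal sketch of this baseline in TensorFlow follows; the layer widths are illustrative (the paper's exact sizes differ), and the sampling step reuses the re-parameterization trick of Section 2.

```python
import tensorflow as tf

d, n_hidden, h = 784, 500, 20  # input dim (MNIST), hidden units, latent dim

def dense(t, n_in, n_out, name):
    W = tf.get_variable(name + '_W', [n_in, n_out],
                        initializer=tf.random_normal_initializer(stddev=0.01))
    b = tf.get_variable(name + '_b', [n_out], initializer=tf.zeros_initializer())
    return tf.matmul(t, W) + b

x = tf.placeholder(tf.float32, [None, d])

# Encoder: one tanh layer, then linear maps to mean and log-variance.
h_z = tf.tanh(dense(x, d, n_hidden, 'enc'))
mu_z = dense(h_z, n_hidden, h, 'mu_z')
log_sigma2_z = dense(h_z, n_hidden, h, 'log_sigma2_z')

# Sample z ~ N(mu_z, diag(sigma_z^2)) via re-parameterization.
z = mu_z + tf.exp(0.5 * log_sigma2_z) * tf.random_normal(tf.shape(mu_z))

# Decoder: mirror structure.
h_x = tf.tanh(dense(z, h, n_hidden, 'dec'))
mu_x = dense(h_x, n_hidden, d, 'mu_x')
log_sigma2_x = dense(h_x, n_hidden, d, 'log_sigma2_x')
```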

4 Our Network and Implementation

When randomness in images is modeled at the pixel level, the implied distance between the reconstructed image and the original image is also measured pixel-by-pixel. For example, when the pixel values of an image are modeled to come from a normal distribution around a mean, the distance becomes an l2 distance, as in equation (4). It is well known that models trained using such a pixel-by-pixel l2 loss suffer from a fundamental problem: the loss is incapable of capturing perceptual difference and spatial correlation between images [12]. A slight translation of pixels creates no perceptual difference to human eyes, but the l2 loss between the original image and the translated image will be large. This is bound to affect the visual quality of the images generated by the fitted model.

It is well known that early layers of pre-trained CNNs tend to capture spatial information of input images. It has also been observed that many filters resemble the classical Gabor filters, which are known to capture many shapes and spatial properties of an image (see, e.g., the paper by Yosinski et al. [20]). We believe that putting a penalty on the difference between the activations of the original and reconstructed images when passed through such filters helps impose better spatial properties on the reconstructed images. Thus we adopt two changes: (1) use convolutional layers in the encoder and decoder; and (2) use an l2 loss on the activations/features (referred to as the feature loss) from the first layer of a VGGNet [16] pre-trained on ImageNet [2], computed for the original and reconstructed images (a sketch follows Section 4.1). The details of the final network we use are outlined in Fig. 3. We refer to this network as the CNN model/network.

4.1 Implementation and Tools

We implemented our network in TensorFlow [1] using Python. We chose TensorFlow because it has a simple interface to build and train networks, has pre-trained networks available for extracting image features, and the TensorBoard tool helps visualize and debug networks in a convenient way. Our network was trained on Amazon Web Services and Google Cloud Platform servers. The availability of two cloud computing services helped us explore hyper-parameters and network configurations in parallel.
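A minimal sketch of the feature loss follows. It assumes the first-layer kernel and bias of VGG16 have been loaded from a pre-trained checkpoint into NumPy arrays (the loading helper is not shown and depends on the setup); the 3 x 3 x 3 x 64 shape is that of VGG16's conv1_1 layer.

```python
import tensorflow as tf

def feature_loss(x, x_recon, vgg_conv1_kernel, vgg_conv1_bias):
    """l2 distance between first-layer VGG activations of the original and
    reconstructed images. The kernel/bias arguments are NumPy arrays taken
    from a pre-trained VGG16 checkpoint and are held fixed during training."""
    k = tf.constant(vgg_conv1_kernel)  # shape [3, 3, 3, 64] for conv1_1
    b = tf.constant(vgg_conv1_bias)    # shape [64]

    def features(img):  # img: [batch, height, width, 3]
        conv = tf.nn.conv2d(img, k, strides=[1, 1, 1, 1], padding='SAME')
        return tf.nn.relu(tf.nn.bias_add(conv, b))

    return tf.reduce_sum(tf.square(features(x) - features(x_recon)))
```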

Fig. 3: Our network consists of an encoder and a decoder with convolutional and fully-connected layers. In addition, we use the output of the first convolutional layer of a VGGNet pre-trained on the ImageNet dataset as a feature extractor on the input image and the decoder output. The features are used in the loss function together with the latent and pixel loss. When the encoder is fed with an image, the decoder should reconstruct it; when the decoder is fed with noise, it should generate a natural image. The layers are summarized below (a code sketch follows):

Encoder:
  CONV: 3 x 3 x 3 x 16
  CONV: 3 x 3 x 16 x 16
  MAX POOL: 2 x 2
  CONV: 3 x 3 x 16 x 32
  CONV: 3 x 3 x 32 x 32
  MAX POOL: 2 x 2
  2 x FC: 2048 x 100 (one for the mean, one for the variance)

Decoder:
  FC: 100 x 2048
  UPSAMPLING by 2
  CONV: 3 x 3 x 32 x 16
  CONV: 3 x 3 x 16 x 16
  UPSAMPLING by 2
  CONV: 3 x 3 x 16 x 3
  CONV: 3 x 3 x 3 x 3
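Below is a sketch of the Fig. 3 architecture using the tf.layers API (a newer API than our original implementation used, so this is a sketch under that assumption, not our exact code); nearest-neighbour upsampling is one plausible choice for the "UPSAMPLING by 2" steps.

```python
import tensorflow as tf

def conv(t, filters):
    return tf.layers.conv2d(t, filters, 3, padding='same', activation=tf.nn.relu)

def encoder(x):  # x: [batch, 32, 32, 3] CIFAR-10 images
    t = conv(conv(x, 16), 16)
    t = tf.layers.max_pooling2d(t, 2, 2)      # -> 16 x 16 x 16
    t = conv(conv(t, 32), 32)
    t = tf.layers.max_pooling2d(t, 2, 2)      # -> 8 x 8 x 32 = 2048 features
    t = tf.reshape(t, [-1, 2048])
    mu_z = tf.layers.dense(t, 100)            # FC 2048 x 100 (mean head)
    log_sigma2_z = tf.layers.dense(t, 100)    # FC 2048 x 100 (variance head)
    return mu_z, log_sigma2_z

def upsample(t):  # nearest-neighbour upsampling by 2
    s = tf.shape(t)
    return tf.image.resize_nearest_neighbor(t, [2 * s[1], 2 * s[2]])

def decoder(z):
    t = tf.reshape(tf.layers.dense(z, 2048), [-1, 8, 8, 32])
    t = conv(conv(upsample(t), 16), 16)       # -> 16 x 16 x 16
    t = conv(upsample(t), 3)                  # -> 32 x 32 x 3
    return tf.layers.conv2d(t, 3, 3, padding='same')  # final 3 x 3 x 3 x 3 conv
```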

5 Our Results

In this section, we present and discuss our final results obtained by training the CNN network with and without the feature loss discussed in Section 4; the results are illustrated in Figs. 4-6. We get good results on the easier MNIST dataset: Fig. 4 shows images that the trained network generates randomly and, as can be seen, they look quite natural. For CIFAR-10, however, the results are not as good. We compare four settings, using the baseline and CNN models, with pixel loss and with pixel plus feature loss. We make the following observations:

Reconstruction: Fig. 5 shows that the reconstructed images using the CNN model (right column) are much better than those of the fully-connected baseline model. However, it is not clear whether including the feature loss is helpful. Row 2 shows reconstructed images that do not use the feature loss, while row 3 displays results with pixel and feature loss (the latent loss is always used). We remark that reconstruction is an easier task than generation.

Generation: Results with the CNN model are presented in Fig. 6. When we over-fit the network on a small dataset, we get decent-looking images from generation, but they look like replicas of the input images. Training the model on the whole dataset, however, turns out to be quite hard: the generated images look very blurry, which agrees with observations noted in other works in the literature.

Fig. 4: When trained on MNIST, our VAE can generate realistic handwritten digits.

Fig. 5: Using convolutional layers in the encoder/decoder we get better reconstruction on CIFAR-10 (columns: input images, baseline model, our model; rows: pixel loss, pixel + feature loss). However, using the loss on features did not yield significant improvements.

Fig. 6: Images generated by our VAE when trained on CIFAR-10 (small dataset vs. large dataset) are not realistic.

5.1 Remarks

We list some possible changes to the network that may lead to better results; we do not contrast with other techniques or models here, and only present tweaks suited to our model. As described in Section 4, we used the first layer of a pre-trained VGGNet for the feature loss. We tried a variety of weighting schemes for the three terms in the loss, namely the latent, pixel and feature losses, and used the Adam method [8] for training the network. In our experiments, we saw that the latent loss was sensitive to the chosen weights, and it was relatively easy to get an unbounded latent loss. It would be interesting to try optimization methods that stabilize the learning process, such as gradient clipping [15], to see if we can train our network for weight combinations that would otherwise fail with a vanilla Adam method; a sketch follows.
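Here is a minimal sketch of Adam with global-norm gradient clipping in TensorFlow; loss below is a stand-in scalar for the weighted sum of the latent, pixel and feature losses, and the clipping norm would need tuning.

```python
import tensorflow as tf

w = tf.Variable(1.0)
loss = tf.square(w)  # stand-in for the weighted latent + pixel + feature loss

optimizer = tf.train.AdamOptimizer(learning_rate=1e-3)
grads_and_vars = optimizer.compute_gradients(loss)
grads, variables = zip(*grads_and_vars)
clipped, _ = tf.clip_by_global_norm(grads, clip_norm=5.0)  # cap gradient norm
train_op = optimizer.apply_gradients(list(zip(clipped, variables)))
```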

Next, the choice of using only the first layer was motivated by the well-known observation that the early layers of CNNs tend to resemble Gabor filters, which are also used extensively in classical computer vision to extract image features. We believed that modeling randomness at the feature level, and not simply at the pixel level, would make the model learn the spatial structure of natural images better. As further experiments, different pre-trained networks (such as ResNet [5] or GoogLeNet [18]) and different layers of those networks could be tried. Because building the current model and experimenting with training it took considerable time, we decided to conclude the project with our current results.

6 State of the Art

We discuss two state-of-the-art VAE models: 1) the Deep Recurrent Attentive Writer (DRAW) [4], and 2) the VAE with Inverse Auto-Regressive Flow (IAF) [10]. DRAW uses an attention model for iterative construction of complex images. The paper reports good reconstruction of handwritten digits on the MNIST dataset by tracing lines much like a person with a pen. However, the results on CIFAR-10 are not as good: some randomly generated images are presented in Fig. 7a, and they appear blurry. In the VAE with IAF, Kingma et al. [10] use multiple invertible parameterized transformations on the hidden variable, besides using a ResNet for the encoder and decoder. This enables them to approximate the intractable posterior better, thereby improving the lower bound on the log-likelihood. Their trained network generates sharper images (Fig. 7b) which, however, look unrealistic on closer inspection.

Fig. 7: Generated images from the state-of-the-art models for CIFAR-10: (a) DRAW, (b) IAF.

The closest model in the literature to ours is a recent work by Hou et al. [7]. There, VGGNet features are used in the loss, much like in our loss function. Good results are reported for reconstruction and generation of images from the CelebFaces Attributes dataset (CelebA) [14], which has more than 200,000 images. They report an improvement when including the feature loss, in contrast to using just the pixel loss (Fig. 8).

6.1 Remarks

The images generated by DRAW for CIFAR-10 are blurry, and the VAE with IAF generates sharp but unrealistic images. The results in Fig. 8 look promising, and that motivated us to try harder. However, the authors made no comment on the performance of the feature loss on the CIFAR-10 dataset. We believe that, in contrast to CIFAR-10, the CelebA dataset (for which they report results) can be said to have convenient structure, as the images are quite homogeneous (faces of human beings).

Fig. 8: The feature-consistent VAE [7] shows improvements when the loss incorporates features extracted using a pre-trained deep network. (top) Images generated using only the pixel loss appear blurry; (bottom) when the feature loss is incorporated, the images look sharper.

CIFAR-10 has been a tough dataset for VAEs; in fact, Gregor et al. remark after their results using DRAW: "CIFAR-10 is very diverse, and with only 50,000 training examples it is very difficult to generate realistic-looking objects without over-fitting (in other words, without copying from the training set)." We too observe a similar phenomenon.

7 Discussion

Often, the objective of training a model is to use the model for some specific purpose. Since learning the model is usually cast as an optimization problem, the choice of the objective function and the constraints should match what the learned model will be used for. For many tasks, learning a generative model can be converted to finding the maximum-likelihood model for the data. However, the natural question is: if the goal is to generate natural-looking images, should we try to learn the maximum-likelihood model? Classical statistics results show that, in the limit of infinite data and a well-specified model class (one where the true model belongs to the search space), the maximum likelihood estimate (MLE) of the model is consistent and recovers the true model. But in most applications, the data is finite and the model is mis-specified. Consequently, one needs to be careful about whether the MLE is the right approach for the task at hand.

Theis et al. [19] argue that if the goal is to generate natural-looking images, then the MLE is not a perfect match. Let P denote the unknown distribution and let Q denote the approximate distribution that we learn using the dataset at hand. The authors argue that maximizing the likelihood is approximately the same as solving

    \min_Q D(P \,\|\, Q).   (5)

On the other hand, using ideas from computational cognitive science, they claim that

    \min_Q D(Q \,\|\, P)   (6)

can be an idealized objective for training the model. In the finite-data, mis-specified case, the model learned by solving (5) tends to overgeneralize and put mass on areas where P has zero mass, leading to samples that look unnatural. On the other hand, in the same scenario, models learned from (6) tend to focus on the good modes. Put simply, the MLE tends to overgeneralize, while the solution of (6) under-generalizes. Such a claim puts a question mark over whether the VAE approach, which focuses on finding the MLE, is the right way to learn a generative model for natural-looking images. Fig. 9 (taken from [19]) illustrates this using a simple toy example. Here P is a mixture of Gaussians while Q is a fit from among the isotropic Gaussian distributions with equal variance along the two axes. While the model learned by minimizing the KL divergence as in (5) puts a lot of mass on non-data regions, minimizing other divergences like maximum mean discrepancy (MMD) or Jensen-Shannon divergence (JSD) gives a model that fits one of the modes well but ignores other parts of the data. A small numerical sketch of this trade-off follows.

Fig. 9: Illustration of the trade-off across various fits of an isotropic Gaussian to a dataset drawn from a mixture of Gaussians (panels: data, KLD, MMD, JSD).
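To make the trade-off concrete, here is a small self-contained numerical sketch in one dimension (our own illustrative construction, not the experiment of [19]): fit a single Gaussian Q to a two-mode mixture P by grid search under each objective.

```python
import numpy as np

x = np.linspace(-10, 10, 4001)
dx = x[1] - x[0]

def normal(x, mu, s):
    return np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))

p = 0.5 * normal(x, -3, 0.7) + 0.5 * normal(x, 3, 0.7)  # the data density P

def kl(a, b, eps=1e-12):  # D(a || b) approximated by a Riemann sum
    return np.sum(a * (np.log(a + eps) - np.log(b + eps))) * dx

candidates = [(mu, s) for mu in np.linspace(-5, 5, 101)
              for s in np.linspace(0.2, 5, 97)]
fit_fwd = min(candidates, key=lambda ms: kl(p, normal(x, *ms)))  # objective (5)
fit_rev = min(candidates, key=lambda ms: kl(normal(x, *ms), p))  # objective (6)

print('min D(P||Q):', fit_fwd)  # spreads over both modes: mu ~ 0, large s
print('min D(Q||P):', fit_rev)  # locks onto one mode: mu ~ +/-3, s ~ 0.7
```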

Lessons Learned

In this project, we tried to generate natural-looking images using VAEs, training them on the MNIST and CIFAR-10 datasets. We have seen that using deep CNNs yields better performance for reconstruction and generation of complex images compared to using only shallow fully-connected networks. We tried improving the naturalness of the generated images by incorporating a loss based on image features in the objective function. Although there is prior work [7] showing improvements on other datasets (CelebA), we could not observe significant improvements on the CIFAR-10 dataset. There is significant literature on the hardness of generating natural-looking images from CIFAR-10, and our results align with those statements. On the implementation side, we have seen that debugging and training a big network takes a lot of coding effort and time, even with compute clusters. For that, visualization and debugging tools such as TensorBoard are very useful. Furthermore, we have seen that good initialization of variables is important for convergence of training, and techniques such as Xavier initialization help a lot [3].

Team Contributions

Both team members contributed similar amounts of effort. They shared efforts in both the theoretical understanding and the implementation of the various models. TensorFlow was new to the team, and initially time was devoted to learning how to use it and then experimenting with the tool. Orhan, having more interest in the implementation, had an edge in exploring various networks, finding good tools and implementing them efficiently. Raaz had useful discussions with Orhan on good coding practices and TensorBoard. Raaz, having more interest in the theory side, had an edge in learning the different directions and discussions in the literature. Orhan had useful discussions with Raaz on various works on the theory behind VAEs and on other generative models. The team members learned how to install and run TensorFlow on compute servers; Raaz learned about using Amazon Web Services, Orhan learned about using Google Cloud Platform, and they taught each other how to use the platform they had learned. Both spent equal time trying different networks and parameters in order to improve the preliminary results. The marginal difference in efforts, if any, was compensated by time devoted to preparing the slides/poster/report/GitHub page for the project. To conclude, the team members think that Orhan's contribution is 50% and Raaz's contribution is 50%. The interesting way the team members came up with this contribution breakdown was not by providing supportive evidence to increase their own shares, but by arguing how valuable the other member's contributions were to this project, and that without their efforts, this project would not have been the same.

References

[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.

[2] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition. IEEE.

[3] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, volume 9.

[4] Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Jimenez Rezende, and Daan Wierstra. DRAW: A recurrent neural network for image generation. arXiv preprint.

[5] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. arXiv preprint.

[6] Geoffrey E. Hinton, Peter Dayan, Brendan J. Frey, and Radford M. Neal. The wake-sleep algorithm for unsupervised neural networks. Science, 268(5214):1158.

[7] Xianxu Hou, Linlin Shen, Ke Sun, and Guoping Qiu. Deep feature consistent variational autoencoder. arXiv preprint, October 2016.

[8] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint.

[9] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In Proceedings of the 2nd International Conference on Learning Representations (ICLR), 2014.

[10] Diederik P. Kingma, Tim Salimans, and Max Welling. Improving variational inference with inverse autoregressive flow. arXiv preprint.

[11] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images.

[12] Jon C. Leachtenauer, William Malila, John Irvine, Linda Colburn, and Nanette Salvaggio. General image-quality equation: GIQE. Applied Optics, 36(32).

[13] Yann LeCun, Corinna Cortes, and Christopher J. C. Burges. The MNIST database of handwritten digits.

[14] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision (ICCV).

[15] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In ICML.

[16] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint.

[17] Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby, and Ole Winther. How to train deep variational autoencoders and probabilistic ladder networks. arXiv preprint.

[18] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

[19] Lucas Theis, Aäron van den Oord, and Matthias Bethge. A note on the evaluation of generative models. arXiv preprint.

[20] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems.


More information

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li Learning to Match Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li 1. Introduction The main tasks in many applications can be formalized as matching between heterogeneous objects, including search, recommendation,

More information

Real-time Object Detection CS 229 Course Project

Real-time Object Detection CS 229 Course Project Real-time Object Detection CS 229 Course Project Zibo Gong 1, Tianchang He 1, and Ziyi Yang 1 1 Department of Electrical Engineering, Stanford University December 17, 2016 Abstract Objection detection

More information

Background-Foreground Frame Classification

Background-Foreground Frame Classification Background-Foreground Frame Classification CS771A: Machine Learning Techniques Project Report Advisor: Prof. Harish Karnick Akhilesh Maurya Deepak Kumar Jay Pandya Rahul Mehra (12066) (12228) (12319) (12537)

More information

arxiv: v2 [cs.cv] 26 Jan 2018

arxiv: v2 [cs.cv] 26 Jan 2018 DIRACNETS: TRAINING VERY DEEP NEURAL NET- WORKS WITHOUT SKIP-CONNECTIONS Sergey Zagoruyko, Nikos Komodakis Université Paris-Est, École des Ponts ParisTech Paris, France {sergey.zagoruyko,nikos.komodakis}@enpc.fr

More information

Research on Pruning Convolutional Neural Network, Autoencoder and Capsule Network

Research on Pruning Convolutional Neural Network, Autoencoder and Capsule Network Research on Pruning Convolutional Neural Network, Autoencoder and Capsule Network Tianyu Wang Australia National University, Colledge of Engineering and Computer Science u@anu.edu.au Abstract. Some tasks,

More information

Gradient of the lower bound

Gradient of the lower bound Weakly Supervised with Latent PhD advisor: Dr. Ambedkar Dukkipati Department of Computer Science and Automation gaurav.pandey@csa.iisc.ernet.in Objective Given a training set that comprises image and image-level

More information

Auxiliary Guided Autoregressive Variational Autoencoders

Auxiliary Guided Autoregressive Variational Autoencoders Auxiliary Guided Autoregressive Variational Autoencoders Thomas Lucas, Jakob Verbeek To cite this version: Thomas Lucas, Jakob Verbeek. Auxiliary Guided Autoregressive Variational Autoencoders. 2017.

More information

CS839: Probabilistic Graphical Models. Lecture 10: Learning with Partially Observed Data. Theo Rekatsinas

CS839: Probabilistic Graphical Models. Lecture 10: Learning with Partially Observed Data. Theo Rekatsinas CS839: Probabilistic Graphical Models Lecture 10: Learning with Partially Observed Data Theo Rekatsinas 1 Partially Observed GMs Speech recognition 2 Partially Observed GMs Evolution 3 Partially Observed

More information

An Empirical Study of Generative Adversarial Networks for Computer Vision Tasks

An Empirical Study of Generative Adversarial Networks for Computer Vision Tasks An Empirical Study of Generative Adversarial Networks for Computer Vision Tasks Report for Undergraduate Project - CS396A Vinayak Tantia (Roll No: 14805) Guide: Prof Gaurav Sharma CSE, IIT Kanpur, India

More information

Adversarial Symmetric Variational Autoencoder

Adversarial Symmetric Variational Autoencoder Adversarial Symmetric Variational Autoencoder Yunchen Pu, Weiyao Wang, Ricardo Henao, Liqun Chen, Zhe Gan, Chunyuan Li and Lawrence Carin Department of Electrical and Computer Engineering, Duke University

More information

Capsule Networks. Eric Mintun

Capsule Networks. Eric Mintun Capsule Networks Eric Mintun Motivation An improvement* to regular Convolutional Neural Networks. Two goals: Replace max-pooling operation with something more intuitive. Keep more info about an activated

More information

Unsupervised Learning. Clustering and the EM Algorithm. Unsupervised Learning is Model Learning

Unsupervised Learning. Clustering and the EM Algorithm. Unsupervised Learning is Model Learning Unsupervised Learning Clustering and the EM Algorithm Susanna Ricco Supervised Learning Given data in the form < x, y >, y is the target to learn. Good news: Easy to tell if our algorithm is giving the

More information

Autoencoders. Stephen Scott. Introduction. Basic Idea. Stacked AE. Denoising AE. Sparse AE. Contractive AE. Variational AE GAN.

Autoencoders. Stephen Scott. Introduction. Basic Idea. Stacked AE. Denoising AE. Sparse AE. Contractive AE. Variational AE GAN. Stacked Denoising Sparse Variational (Adapted from Paul Quint and Ian Goodfellow) Stacked Denoising Sparse Variational Autoencoding is training a network to replicate its input to its output Applications:

More information