Sliced Wasserstein Generative Models


Jiqing Wu*, Zhiwu Huang*, Wen Li, Janine Thoma, Luc Van Gool

*Equal contribution. Computer Vision Lab, ETH Zurich, Switzerland; VISICS, KU Leuven, Belgium. Correspondence to: Jiqing Wu <jiqing.wu@vision.ee.ethz.ch>, Zhiwu Huang <zhiwu.huang@vision.ee.ethz.ch>.

Abstract

In this paper, we introduce a model of sliced optimal transport (SOT), which measures distribution affinity with the sliced Wasserstein distance (SWD). Since SWD enjoys the property of factorizing a high-dimensional joint distribution into multiple one-dimensional marginal distributions, its dual and primal forms can be approximated more easily than the Wasserstein distance (WD). We therefore propose two types of differentiable SOT blocks that equip the modern generative frameworks, Auto-Encoders (AEs) and Generative Adversarial Networks (GANs), with the primal and dual forms of SWD. The superiority of our SWAE and SWGAN over state-of-the-art generative models is studied both qualitatively and quantitatively on standard benchmarks.

1. Introduction

The domain of unsupervised learning has experienced tremendous advances due to its potential to capitalize on large pools of unlabeled data. One of the most promising approaches is generative modeling, which typically estimates the real data distribution either by variational inference using Auto-Encoders (VAEs) (Kingma & Welling, 2013) or by an adversarial process with Generative Adversarial Networks (GANs) (Goodfellow et al., 2014). In other words, rather than estimating the real data distribution directly, these models learn a map from a parameterized distribution to the high-dimensional real data distribution. During the map learning process, a statistical divergence such as the Kullback-Leibler (KL) or Jensen-Shannon (JS) divergence between the model distribution and the real data distribution is minimized on a low-dimensional manifold or in the latent space, such that visually pleasing samples can be generated.

State-of-the-art generative models, Wasserstein GANs (Arjovsky et al., 2017; Gulrajani et al., 2017; Salimans et al., 2018; Wei et al., 2018; Miyato et al., 2018) and Wasserstein AEs (Tolstikhin et al., 2018), make use of optimal transport (OT) theory to measure the distribution distance with the Wasserstein distance (WD), which has been shown to have better properties than the KL and JS divergences employed in earlier generative models. Nevertheless, WD has some bottlenecks. For example, in high-dimensional space the primal form of WD is known to be intractable, although some works (Tolstikhin et al., 2018; Genevay et al., 2017; Salimans et al., 2018) have attempted to propose relaxed unconstrained versions of the primal form. While the dual form of WD can be derived more easily, it still suffers from the difficulty of approximating the k-Lipschitz constraint required by the Wasserstein metric in (Arjovsky et al., 2017; Gulrajani et al., 2017). In particular, the weight clipping technique used in (Arjovsky et al., 2017) merely covers a subset of the k-Lipschitz functions for some k, as studied by (Gulrajani et al., 2017), while the 1-Lipschitz gradient penalty adopted in (Gulrajani et al., 2017) only relies on a limited number of samples to approximate the k-Lipschitz constraint on a high-dimensional domain (Wei et al., 2018).
To overcome this weakness of the original OT in generative models, we exploit a new generative modeling scheme from the viewpoint of sliced OT (SOT) for better distribution transfer. Since the resulting sliced Wasserstein distance (SWD) is obtained by decomposing a single n-dimensional distribution into its n one-dimensional marginal distributions, all of which can be approximated well and independently as studied in (Bonneel et al., 2015; Kolouri et al., 2016; 2017), the SOT theory is able to offer a more favorable model than the state-of-the-art generative models employing WD. In particular, we introduce SOT-based network blocks for the primal and dual forms of SWD, and incorporate them into modern generative frameworks (i.e., AEs and GANs). Our contributions can be summarized as follows:

- We propose a novel SWAE model, which equips classic AEs with differentiable SOT blocks. These blocks approximate the SWD primal form between the encoder distribution and the prior distribution progressively, in an end-to-end network learning fashion.

- Following the standard paradigm of GANs, we introduce a new SWGAN model by applying a dual SOT block to the discriminator, such that it can easily approximate the dual form of SWD.

- In order to train the Radon transform matrices embedded in the SOT blocks, we generalize the standard network optimization algorithm to Stiefel manifolds.

- We evaluate our SWAE and SWGAN models on standard benchmarks and report visual results and quantitative scores with the Fréchet Inception Distance (FID), both of which show superior performance compared to the state-of-the-art generative models.

2. From Wasserstein Distance to Sliced Wasserstein Distance

2.1. Wasserstein Distance (WD) and Related Models

In the generative modeling literature, we often need to measure the agreement between two probability distributions P_X and P_Y. There are many ways to do so, among which the classical Kullback-Leibler (KL) and Jensen-Shannon (JS) divergences are widely used. Recently, (Arjovsky et al., 2017) introduced a more powerful family of statistical distances, namely the Wasserstein distance (WD), to address mode collapse and instability issues. WD was originally studied in the context of optimal transport (OT) theory (Villani, 2008). Its primal formulation is

W_p(P_X, P_Y) = \inf_{\gamma \in \Pi(P_X, P_Y)} E_{(X, Y) \sim \gamma}[c(X, Y)],   (1)

where \Pi(P_X, P_Y) denotes the set of all joint distributions \gamma(X, Y) whose marginals are P_X and P_Y respectively, and c : X \times Y \to R_+ can be any general cost function. It is normally assumed that (X, d) is a metric space with metric d and X = Y. The cost function is set to c(X, Y) = d^p(X, Y), where p > 0. In this case, the p-th root of Eq. 1 with respect to d^p is the so-called p-WD. For p = 1, one can show that the Kantorovich dual of WD is

W_1(P_X, P_Y) = \sup_{f \in \mathrm{Lip}_1} E_{X \sim P_X}[f(X)] - E_{Y \sim P_Y}[f(Y)],   (2)

where \mathrm{Lip}_1 is the set of all 1-Lipschitz functions. As a side note, the dual 1-WD becomes k W_1 if we replace \mathrm{Lip}_1 with \mathrm{Lip}_k for k > 0.

Wasserstein AEs (WAEs). The paradigm of classic AE-based generative models such as VAEs employs variational inference to minimize an upper bound on the discrepancy between the model distribution of a decoder G : Z → X and the real image distribution, while imposing a regularization to match the model distribution of an encoder Q : X → Z with the prior distribution. Unfortunately, the constraint on Q typically makes the variational problem hard to solve. As studied in (Tolstikhin et al., 2018), the latent codes of different samples are generally forced to stay close to each other, leading to worse reconstruction and making the generative models produce blurrier samples. To solve this problem, WAEs (Tolstikhin et al., 2018) apply an OT cost to the latent variable model P_G, reaching state-of-the-art results among AE-based generative models (Kingma & Welling, 2013; Rezende et al., 2014; Salimans et al., 2015; Makhzani et al., 2016; Mescheder et al., 2017). Specifically, the WAEs proposed a relaxed OT formulation to estimate the primal form of WD for decoding, and introduced an additional divergence D_Z to regularize the encoding map:

\min_G \inf_{Q(Z|X) \in \mathcal{Q}} E_{P_X} E_{Q(Z|X)}[c(X, G(Z))] + \lambda D_Z(Q_Z, P_Z),   (3)

where Z is random noise, \mathcal{Q} is any nonparametric set of encoders, c(X, Y) = d^p(X, Y) is the OT cost function (in practice set to c(X, Y) = \|X - Y\|_2^2), \lambda > 0 is a hyperparameter, and D_Z is a divergence between the marginal distribution Q_Z and the prior distribution P_Z of Z.
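To illustrate Eq. 3, here is a minimal PyTorch-style sketch of one WAE training step that instantiates D_Z with a simple RBF-kernel MMD estimator; the encoder Q, decoder G, bandwidth sigma and weight lam are hypothetical placeholders rather than the configuration used in (Tolstikhin et al., 2018).

```python
import torch

def rbf_mmd(z_q, z_p, sigma=1.0):
    # Biased MMD^2 estimate between encoded codes z_q ~ Q_Z and prior samples z_p ~ P_Z.
    def kernel(a, b):
        d2 = torch.cdist(a, b) ** 2
        return torch.exp(-d2 / (2 * sigma ** 2))
    return kernel(z_q, z_q).mean() + kernel(z_p, z_p).mean() - 2 * kernel(z_q, z_p).mean()

def wae_loss(Q, G, x, lam=10.0):
    # Eq. 3: reconstruction cost c(X, G(Q(X))) = ||X - G(Q(X))||_2^2 plus lambda * D_Z(Q_Z, P_Z).
    z_q = Q(x)                        # encoded latent codes, shape (n, K)
    x_rec = G(z_q)                    # decoded reconstructions
    rec = ((x - x_rec) ** 2).flatten(1).sum(dim=1).mean()
    z_p = torch.randn_like(z_q)       # samples from the prior P_Z = N(0, I)
    return rec + lam * rbf_mmd(z_q, z_p)
```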
In (Tolstikhin et al., 2018), D_Z is finally instantiated by maximum mean discrepancy (MMD) or a GAN, both of which can be regarded as distribution matching strategies. This differs from the OT plan employed for the decoding, since it is indeed non-trivial to apply the primal form of WD to the constraint on the encoding map, especially when the latent codes have more than a few hundred dimensions.

Wasserstein GANs (WGANs). Following Generative Adversarial Networks (GANs) (Goodfellow et al., 2014; Radford et al., 2015; Zhao et al., 2016; Berthelot et al., 2017; Mao et al., 2017), WGANs also establish a min-max adversarial game between two competing networks, where a generator network G maps a source of noise to the input space, while a discriminator network D receives either a generated sample or a true data sample and must distinguish between them. To stabilize GAN training for better image generation, the original WGAN (Arjovsky et al., 2017) proposed to approximate the dual form of the 1-WD by adopting a weight clipping strategy, which however satisfies the k-Lipschitz constraint poorly. To alleviate this problem in the approximation of the dual form of 1-WD, the improved training of Wasserstein GAN (WGAN-GP) (Gulrajani et al., 2017) penalizes the norm of the gradient of the critic with respect to a limited number of input samples. In particular, this gradient penalty is simply added to the basic WGAN loss (i.e., the dual form of 1-WD), yielding the following full objective:

\min_G \max_D E_{X \sim P_X}[D(X)] - E_{G(Z) \sim P_G}[D(G(Z))] + \lambda E_{\hat{X} \sim P_{\hat{X}}}[(\|\nabla_{\hat{X}} D(\hat{X})\|_2 - 1)^2],   (4)

where G(\cdot), D(\cdot) denote the generator and discriminator respectively, Z is random noise, \hat{X} is a random sample from the distribution P_{\hat{X}} obtained by sampling uniformly along straight lines between pairs of points sampled from P_X and P_G, and \nabla_{\hat{X}} D(\hat{X}) is the gradient with respect to \hat{X}.
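For reference, a minimal PyTorch-style sketch of the gradient-penalty term in Eq. 4 (the standard WGAN-GP penalty, not this paper's SWGAN objective) is given below; it assumes image batches of shape (n, C, H, W), and the critic D and batch names are placeholders.

```python
import torch

def gradient_penalty(D, x_real, x_fake, lam=10.0):
    # Interpolate uniformly along straight lines between real and generated samples (P_X_hat).
    eps = torch.rand(x_real.size(0), 1, 1, 1, device=x_real.device)
    x_hat = (eps * x_real + (1 - eps) * x_fake).requires_grad_(True)
    d_hat = D(x_hat)
    # Gradient of the critic with respect to the interpolated samples.
    grads = torch.autograd.grad(d_hat.sum(), x_hat, create_graph=True)[0]
    grad_norm = grads.flatten(1).norm(2, dim=1)
    # Penalize deviation of the gradient norm from 1 (the 1-Lipschitz target in Eq. 4).
    return lam * ((grad_norm - 1) ** 2).mean()
```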

As pointed out in (Liu et al., 2018), it is not sufficient to approximate the k-Lipschitz constraint on a high-dimensional domain with a limited number of samples.

2.2. Sliced Wasserstein Distance (SWD) and Related Models

The idea underlying the sliced Wasserstein distance (SWD) is to slice the space with lines passing through the origin, to project the measures onto these lines where the distances are computed, and to integrate those distances over all possible lines. Formally, the SWD is defined as

SW_p(P_X, P_Y) = \left( \int_{S^{N-1}} W_p^p(\mathcal{R}P_X(\cdot, \theta), \mathcal{R}P_Y(\cdot, \theta)) \, d\theta \right)^{1/p},   (5)

where S^{N-1} is the unit sphere in R^N, \theta \in S^{N-1}, and \mathcal{R} is the Radon transform, which maps P_X to the set of its integrals over the hyperplanes of R^N with respect to the angle \theta:

\mathcal{R}P_X(t, \theta) = \int_{R^N} p_X(x) \, \delta(t - x \cdot \theta) \, dx,   (6)

where \delta(\cdot) is a Dirac measure. Eq. 5 yields the explicit benefit of simpler one-dimensional density estimation compared to the direct high-dimensional density estimation required by WD. Meanwhile, several works (Bonneel et al., 2015; Kolouri et al., 2016; 2017) exploit the fact that the one-dimensional p-WD has a closed-form solution for the OT plan. More specifically, there exists a unique monotonically increasing transport map

\tau(x) = F_Y^{-1}(F_X(x)),   (7)

where F_X(x) and F_Y(y) are the cumulative distribution functions (CDFs) corresponding to the probability density functions (PDFs) p_X and p_Y, computed by F_X(x) = \int_{-\infty}^{x} p_X(t) dt and F_Y(y) = \int_{-\infty}^{y} p_Y(t) dt. With this transport map, the one-dimensional p-WD between two distributions P_X and P_Y can consequently be computed as

W_p(P_X, P_Y) = \left( \int_X d^p(F_Y^{-1}(F_X(x)), x) \, dP_X(x) \right)^{1/p}.   (8)

Furthermore, as proven in (Bonnotte, 2013), the SWD is a valid distance and is equivalent to the WD. Accordingly, in contrast to the original WD, the SWD has a more promising potential to enhance modern generative modeling, especially when one only has access to samples of high-dimensional PDFs. In practice, the SWD can be approximated by a finite summation over projection angles, as done in most existing works (Pitié et al., 2007; Kolouri et al., 2016; Karras et al., 2018)¹. A better SWD approximation is achieved by the iterative distribution transfer (IDT) technique (Pitié et al., 2007), which has been proved to converge to an optimal SWD if the number of iterations is large enough. Specifically, at each iteration IDT first randomly samples an orthogonal Radon transform matrix \theta = [\theta_1, \ldots, \theta_N] \in R^{N \times N} for the approximation of the Radon transform \mathcal{R}, which yields a series of one-dimensional marginal PDFs, and then transfers the current distribution of the source data to the target distribution by matching their one-dimensional marginals with the map in Eq. 7. To minimize the SWD progressively, such an iteration is typically repeated many times.

3. Proposed Method

3.1. Overview

Having discussed the advantages of SWD over WD, in this section we make the first attempt to introduce the SOT plan into generative modeling. As mentioned before, existing methods including IDT typically demand a large number of iterations to achieve a favorable solution, and it is non-trivial to apply them directly in the context of neural network optimization. To overcome these limitations, we propose a novel SOT model, which adapts the IDT technique to a network setting.
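To make the one-dimensional machinery of Eqs. 5-8 concrete, the following NumPy sketch (an illustrative Monte-Carlo approximation, not the authors' code) estimates SW_p between two equally sized sample sets by drawing random orthogonal projection matrices, as IDT does, and matching sorted projections, which realizes the monotone map of Eq. 7 on samples.

```python
import numpy as np

def sliced_wasserstein(x, y, n_matrices=16, p=2, rng=np.random.default_rng(0)):
    """Monte-Carlo estimate of SW_p between samples x, y of shape (n, N), same n."""
    _, dim = x.shape
    total = 0.0
    for _ in range(n_matrices):
        # Random orthogonal matrix (QR of a Gaussian matrix), as sampled by IDT.
        theta, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
        for t in theta.T:                      # each column is one projection angle
            px, py = np.sort(x @ t), np.sort(y @ t)
            # Sorting realizes the monotone 1D map tau = F_Y^{-1}(F_X(.)) on samples (Eq. 7).
            total += np.mean(np.abs(px - py) ** p)
    return (total / (n_matrices * dim)) ** (1.0 / p)
```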
More specifically, this paper proposes to use a limited number of differentiable SOT blocks to optimize the primal form of SWD in the context of AEs. The improved approximation of the primal form of SWD over the latent space using the proposed primal SOT blocks leads to better generation results. We also propose a variant of our SOT blocks for the dual SWD case in the context of GANs. Benefiting from the Radon transform, which effectively factorizes high-dimensional joint distributions into one-dimensional marginal distributions, the dual form of SWD can be estimated more easily, leading to a better metric for generation.

3.2. Sliced Wasserstein AE (SWAE)

As the state-of-the-art AE-based generative model, (Tolstikhin et al., 2018) proposed an MMD- or GAN-based constraint for encoding and a relaxed OT plan for decoding. While this moderates the problem of latent codes being forced to stay close to each other, it does not impose a consistent OT constraint on the encoding, leading to a sub-optimal solution. Since it is highly non-trivial to approximate the primal form of the original WD on the latent codes, we instead propose to express the probabilistic encoder as a SOT plan, which aims to match the encoder distribution and the prior distribution in the latent space with the primal form of SWD.

¹ It is worth mentioning that (Karras et al., 2018) employed the approximation of SWD merely for quantitative comparison of different GANs, and did not use the SWD for the GAN loss.

Figure 1. The proposed sliced optimal transport (SOT) block for the SWD primal form (a)+(b) and dual form (a)+(c): (a) 1D Radon transform, (b) 1D PDF matching for the primal SWD, (c) k-Lipschitz mapping for the dual SWD.

Algorithm 1 The proposed SWAE algorithm
Require: the number of primal SOT blocks m, the batch size n, the encoder Q = S_p ∘ E with E : X → Y and S_p : Y → Z, the decoder G, and training hyperparameters.
repeat
  Sample real data x from P_X
  Sample Gaussian noise z from N(0, 1)
  Update Q = S_p ∘ E and G by descending 1/n Σ_{i=1}^{n} ||x_i - G(Q(x_i))||_2^2
until convergence

In particular, we introduce an implicit approximation of the primal form of SWD for the encoder, such that the full objective of the whole AE model avoids any explicit regularization. Formally, the objective of our SWAE model is

\min_G \inf_{Q(Z|X) \in \mathcal{Q}} E_{P_X} E_{Q(Z|X)}[\|X - G(Q(X))\|_2^2],   (9)

where Q and G indicate the encoder and decoder respectively, and Q is implicitly constrained by our proposed SOT model, which aims to optimize the primal form of SWD (Eq. 5). For the optimization of the primal form of SWD between the prior and encoder distributions, we design a type of differentiable SOT block consisting of a Radon transform and 1D PDF matching, as shown in Fig. 1 (a)+(b). Inspired by (Pitié et al., 2007), who show that a carefully selected sequence of Radon transform matrices leads to faster convergence to the optimal SWD (Eq. 5), we stack a limited number of differentiable SOT blocks in the encoder to learn a favorable sequence of Radon transform matrices, realizing a SOT plan in a deep learning manner.

To begin with, let us denote the input data as x = [x_1, \ldots, x_n] \in R^{N \times n} (N is the data dimension, n is the number of samples) for the encoder Q = S_p ∘ E, which first applies a common encoder E and then our designed primal SOT blocks S_p. The output of E is denoted by y = [y_1, \ldots, y_n] \in R^{K \times n}, with K and n being the latent dimension and sample number respectively. Then y is fed into the primal SOT block S_p, which is implemented with the following steps.

Step (a) of Fig. 1: First, we project the latent codes y with the Radon transform matrix \theta = [\theta_1, \ldots, \theta_K] \in R^{K \times K} via the map y \mapsto \theta^T y, which corresponds to the Radon transform projecting the K-dimensional distribution onto K one-dimensional marginal distributions.

Step (b) of Fig. 1: Then, we perform 1D PDF matching for the data [\theta_1^T y, \ldots, \theta_K^T y] using the map of Eq. 7. In the end, we remap the samples according to the 1D Radon transforms. Accordingly, the whole mapping function of each primal SOT block can be defined as

S_p(y) = y + \theta \begin{bmatrix} \tau_1(\theta_1^T y) - \theta_1^T y \\ \vdots \\ \tau_K(\theta_K^T y) - \theta_K^T y \end{bmatrix},   (10)

where \tau_i is the SOT map of Eq. 7 (i.e., \tau(y) = F_Z^{-1}(F_Y(y)), where F_Y and F_Z are the CDFs of the input data y and the input noise z respectively), which can be solved by using discrete look-up tables. In practice, to make the process differentiable, we employ piece-wise interpolation. More specifically, to approximate the CDFs in Eq. 7, we first estimate their PDFs. Technically, the PDFs can be estimated by hard histogram assignment of the target data y_i to bin centers c_k. However, to make this operation differentiable in the context of backpropagation, we propose a soft assignment version instead:

\hat{a}(y_i) = \frac{e^{-\alpha \|y_i - c_k\|^2}}{\sum_{k'} e^{-\alpha \|y_i - c_{k'}\|^2}},   (11)

which assigns the weight of the target data y_i to the bin center c_k proportionally to their proximity, relative to the proximities to the other bin centers.
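The soft assignment of Eq. 11 and the resulting differentiable CDF estimate can be sketched as follows. This is a simplified PyTorch illustration under our own assumptions about a fixed bin grid, not the exact SOT-block implementation; the transport map of Eq. 7 would then be obtained by piece-wise interpolation between the two estimated CDFs.

```python
import torch

def soft_histogram(v, centers, alpha=1.0):
    """Eq. 11: differentiable soft assignment of 1D samples v (shape (n,)) to bin centers."""
    d2 = (v.unsqueeze(1) - centers.unsqueeze(0)) ** 2   # squared distances, shape (n, n_bins)
    return torch.softmax(-alpha * d2, dim=1)            # each row sums to 1

def soft_cdf(v, centers, alpha=1.0):
    """Piecewise CDF estimate over the bin grid, usable inside backpropagation."""
    pdf = soft_histogram(v, centers, alpha).mean(dim=0)  # estimated bin masses
    return torch.cumsum(pdf, dim=0)                      # CDF evaluated at the bin centers
```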
The weight \hat{a}(y_i) ranges between 0 and 1, with the highest weight assigned to the closest bin center. \alpha is a parameter that controls the decay of the response with the magnitude of the distance. We remark that for \alpha \to \infty this setting recovers the original hard histogram assignment, with weight 1 for the closest bin center and 0 for all other bins. In practice, we set \alpha = 1. Note that the PDF estimation is performed on one-dimensional data, and thus a moderate number of samples suffices to estimate the distribution.

3.3. Sliced Wasserstein GANs (SWGAN)

Though stacking the primal SOT blocks enables our SWAE model to better match the encoder distribution with the prior distribution, applying it directly to the design of modern GAN models is not desirable.

Algorithm 2 The proposed SWGAN algorithm
Require: the number of dual SOT blocks m, the batch size n, the generator G, the discriminator D = S_d ∘ E, and training hyperparameters.
repeat
  Sample real data x from P_X
  Sample Gaussian noise z from N(0, 1)
  Sample two vectors \mu_1, \mu_2 from the uniform distribution U[0, 1] such that for each i
    \hat{x}_i = (1 - \mu_{1,i}) x_i + \mu_{1,i} G(z_i)
    \hat{y}_i = (1 - \mu_{2,i}) E(x_i) + \mu_{2,i} E(G(z_i))
  Update G by descending -1/n Σ_{i=1}^{n} D(G(z_i))
  Update D = S_d ∘ E by descending -1/n Σ_{i=1}^{n} (D(x_i) - D(G(z_i))) + 1/n Σ_{i=1}^{n} λ_1 ||∇_{\hat{x}_i} E(\hat{x}_i)||_2^2 + 1/n Σ_{i=1}^{n} λ_2 (||∇_{\hat{y}_i} S_d(\hat{y}_i)||_2 - k)^2
until convergence

This is because the standard GAN framework relies on the adversarial training of a generator and a discriminator. The latter is typically required to explicitly compute a useful distance between the fake and real data distributions, while the proposed primal SOT blocks transfer a source distribution to a target distribution by only implicitly measuring the SWD. To address this issue, we resort to the dual form of SWD by extending the design of the SOT blocks to a dual version. Though a k-Lipschitz gradient penalty may not be sufficient in high-dimensional space, it is relatively easy to satisfy the k-Lipschitz constraint on a one-dimensional space, which is a potential advantage of applying the dual form of 1-SWD, i.e.,

\int_{S^{N-1}} \left( \sup_{f \in \mathrm{Lip}_1} E_{X_\theta \sim P_{X_\theta}}[f(X_\theta)] - E_{Y_\theta \sim P_{Y_\theta}}[f(Y_\theta)] \right) d\theta,   (12)

where P_{X_\theta}, P_{Y_\theta} are the marginal distributions obtained by the Radon transform \mathcal{R} (Eq. 6). Thus, instead of approximating the dual of the N-dimensional WD (Arjovsky et al., 2017), we propose a sliced version of Wasserstein GANs (SWGAN) to estimate the duals of the N one-dimensional marginal distributions required by the 1-SWD. Since the real data distribution is supported on a low-dimensional manifold, we follow the setting of the classic GAN generator and first encode the n input data points x = [x_1, \ldots, x_n] \in R^{N \times n} into lower-dimensional latent codes y = [y_1, \ldots, y_n] \in R^{K \times n} by E(x) = y, where E indicates the encoder. Then, we apply the dual SOT block to approximate the optimal f \in \mathrm{Lip}_1 in Eq. 12.

Step (a) of Fig. 1: We first project the latent codes y with the Radon transform matrix \theta = [\theta_1, \ldots, \theta_K] \in R^{K \times K}, that is, y \mapsto \theta^T y, which corresponds to the Radon transform projecting the K-dimensional distribution onto K one-dimensional marginal distributions.

Step (c) of Fig. 1: Then we compute the k-Lipschitz mapping function of the dual SOT block as follows:

S_d(y) = \begin{bmatrix} \phi(\lambda_1(\theta_1^T y) + b_1) \\ \vdots \\ \phi(\lambda_K(\theta_K^T y) + b_K) \end{bmatrix},   (13)

where \theta = [\theta_1, \ldots, \theta_K] \in R^{K \times K} is the Radon transform matrix, \phi is an activation function, and \lambda_i, b_i are scalars. Supported by the universal approximation theorem for neural networks (Hornik, 1991), it is expected that with the sum of a few dual SOT blocks we can well approximate the one-dimensional dual in Eq. 12 with respect to an arbitrary angle. This also relates to the fact that Eq. 13 can easily satisfy the k-Lipschitz constraint under a k-Lipschitz gradient penalty, as long as we use a reasonable activation function. Eventually, by computing S_d(y) = \frac{1}{K} \sum_{i=1}^{K} \phi(\lambda_i(\theta_i^T y) + b_i) to approximate the integral of Eq. 12, we obtain our complete discriminator D = S_d ∘ E. To avoid gradient explosion and vanishing for S_d, we implicitly constrain the gradients of S_d by imposing the gradient regularizer on the entire D.
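Before turning to the final objective, here is a minimal PyTorch-style sketch of one dual SOT block (Eq. 13) together with the averaging that approximates the integral of Eq. 12. This is our own simplified reading; the orthogonality of theta is assumed to be maintained by the Stiefel update of Sec. 3.4, and all names are illustrative.

```python
import torch
import torch.nn as nn

class DualSOTBlock(nn.Module):
    def __init__(self, k_dim):
        super().__init__()
        # Radon transform matrix theta, initialized orthogonal (kept so by the Stiefel update).
        self.theta = nn.Parameter(torch.linalg.qr(torch.randn(k_dim, k_dim))[0])
        self.lam = nn.Parameter(torch.ones(k_dim))    # per-direction scales lambda_i
        self.bias = nn.Parameter(torch.zeros(k_dim))  # per-direction offsets b_i
        self.act = nn.LeakyReLU(0.2)

    def forward(self, y):
        # y: latent codes of shape (n, K). Project onto K one-dimensional marginals,
        # apply phi(lambda_i * theta_i^T y + b_i), then average over directions (Eq. 13).
        proj = y @ self.theta                         # column i holds theta_i^T y
        return self.act(self.lam * proj + self.bias).mean(dim=1)
```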
The final objective is thus as follows:

\min_G \max_D E_{X \sim P_X}[D(X)] - E_{Z \sim P_Z}[D(G(Z))] + \lambda_1 E_{\hat{X} \sim P_{\hat{X}}}[\|\nabla_{\hat{X}} E(\hat{X})\|_2^2] + \lambda_2 E_{\hat{Y} \sim P_{\hat{Y}}}[(\|\nabla_{\hat{Y}} S_d(\hat{Y})\|_2 - k)^2],   (14)

where we sample \hat{X} and \hat{Y} following (Gulrajani et al., 2017) (see Alg. 2), \lambda_1, \lambda_2 are coefficients balancing the penalty terms, and \lambda_2 also absorbs the scale k arising from the k-Lipschitz constraint.

3.4. Training for SWAE and SWGAN

Given that the Radon transform matrix \theta should remain orthogonal throughout training, we cannot directly apply the standard optimization algorithm. Meanwhile, it is widely known that the space of orthogonal matrices is a Stiefel manifold². Hence, we need to update \theta on the curved manifold instead of in flat Euclidean space. Building upon the manifold-valued weight update rule studied in (Huang & Van Gool, 2017), we generalize the optimization algorithm to Stiefel manifolds. Following standard optimization on Riemannian manifolds (Absil et al., 2009), we first employ parallel transport to transport the Euclidean gradient in the tangent space at the anchored orthogonal matrix \theta^t to the tangent space at the orthogonal matrix \theta^{t+1}. The normal component of the Euclidean gradient \nabla L^{(k)}_{\theta^t} is then subtracted from the Euclidean gradient itself, where L is the loss of the k-th layer (for simplicity, we drop the index k in the following). Subsequently, searching along the tangential direction leads to the update in the tangent space of the Stiefel manifold. In the end, the resulting update is projected back onto the Stiefel manifold with a retraction operation \Gamma. For more details about the Riemannian geometry of Stiefel manifolds and the retraction operation on Riemannian manifolds, we refer the readers to (Edelman et al., 1998; Absil et al., 2009). Accordingly, the update of the current orthogonal matrix \theta^t on the Stiefel manifold takes the following form:

\tilde{\nabla} L_{\theta^t} = \nabla L_{\theta^t} - \nabla L_{\theta^t} (\theta^t)^T \theta^t,   (15)

\theta^{t+1} = \Gamma(\theta^t - \lambda \, \Omega(\tilde{\nabla} L_{\theta^t})),   (16)

where \Gamma denotes the retraction operation, which corresponds to a QR decomposition, \lambda is the learning rate, \Omega(\cdot) denotes the standard optimizer update, and \nabla L_{\theta^t} (\theta^t)^T \theta^t is the normal component of the Euclidean gradient \nabla L_{\theta^t}, which can be computed by conventional backpropagation. By experimental study, we favor updating the Radon transform matrices with the Adam optimizer (Kingma & Ba, 2014) generalized to the Stiefel manifold as discussed above, while the rest of the weights are updated by standard Adam optimization.

² A compact Stiefel manifold St(d, D) is the set of d-dimensional orthogonal matrices in R^D.
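A minimal NumPy sketch of one such update (Eqs. 15-16) is given below; it uses plain gradient descent in place of the generalized Adam rule, and the function and its arguments are illustrative only.

```python
import numpy as np

def stiefel_step(theta, euclid_grad, lr=1e-4):
    """One update of an orthogonal Radon matrix theta (K x K) on the Stiefel manifold."""
    # Eq. 15: remove the normal component to obtain the tangential (Riemannian) gradient.
    riem_grad = euclid_grad - euclid_grad @ theta.T @ theta
    # Eq. 16: descend in the tangent space, then retract back via QR decomposition.
    q, r = np.linalg.qr(theta - lr * riem_grad)
    # Fix the column-sign ambiguity of QR so the retraction stays continuous.
    return q * np.sign(np.diag(r))
```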

4. Experiments

We conduct various experiments on three standard benchmarks, CIFAR-10, CelebA (Liu et al., 2015) and LSUN (Yu et al., 2015), to evaluate the proposed SWAE and SWGAN models. Recently, (Heusel et al., 2017) introduced the Fréchet inception distance (FID) to measure the difference between the fake and real data distributions, and verified that the FID measurement agrees better with human judgment than the inception score (IS) (Salimans et al., 2016). Later, (Lucic et al., 2017) conducted a thorough large-scale investigation of the original GAN and its variants by evaluating their FID scores. Therefore, we not only present visual results but also evaluate FID scores on all datasets to further justify the superiority of our models.

4.1. Model Architectures and Hyperparameters

We compare our SWAE to VAE (Kingma & Welling, 2013) and to WAE-MMD and WAE-GAN (Tolstikhin et al., 2018); the latter is equivalent to AAE (Makhzani et al., 2016) when the OT cost function is c(X, Y) = \|X - Y\|_2^2 (Mescheder et al., 2017). We also compare our SWGAN to DCGAN (Radford et al., 2015), WGAN (Arjovsky et al., 2017), and WGAN-GP (Gulrajani et al., 2017). For the compared methods, we follow the default hyperparameters recommended by the authors. All the AE-based generative models, including our SWAE, use the common convolutional architecture suggested by (Berthelot et al., 2017) for the decoder, with the difference that we use a shallow encoder containing downscaling and a linear transform layer instead of several convolutional blocks (Tab. 1).
As for all GAN models, including our SWGAN, we follow the official implementation of (Gulrajani et al., 2017) and employ the standard ResNet structure (Gulrajani et al., 2017) for the generator and discriminator (Tab. 2). We apply the LeakyReLU activation in the dual SOT block based on experimental tuning. The learning rate of SWGAN and SWAE is determined to be , the number of critic iterations of SWGAN is set to 4 for LSUN and CelebA and 5 for CIFAR-10, and λ_1, λ_2 are set to 20 and 10 by cross validation.

Table 1. Network architecture for SWAE. Encoder: nearest-neighbor downsampling, a linear layer, and primal SOT blocks (latent dimension 128). Decoder: 128-dimensional noise, a linear layer, (Conv, ELU) blocks interleaved with nearest-neighbor upsampling, and a final Conv + tanh layer.

Table 2. Network architecture for SWGAN. Generator: 128-dimensional noise, a linear layer, three [3x3]x2 residual blocks with upsampling, and a final Conv + tanh layer. Discriminator: two [3x3]x2 residual blocks with downsampling, two further [3x3] residual blocks, a linear layer, and dual SOT blocks (dimension 128).

Table 3. FID comparison of the AE-based models (VAE, WAE-MMD, AAE (WAE-GAN), SWAE) and GAN models (DCGAN, WGAN, WGAN-GP, SWGAN) on CIFAR-10, CelebA and LSUN. As studied in (Gulrajani et al., 2017), the original WGAN does not achieve good performance for various architectures. For WGAN results marked with a *, we are unable to reach scores comparable to those reported in (Lucic et al., 2017); however, this does not influence our final conclusion.
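Since FID is the main quantitative metric used in the comparisons above, the following is a small NumPy/SciPy sketch of the Fréchet distance between two sets of precomputed Inception features; extracting the features with the Inception network is assumed to happen elsewhere, and the function name is ours.

```python
import numpy as np
from scipy import linalg

def fid(feat_real, feat_fake):
    """Frechet Inception Distance between two feature matrices of shape (n, d)."""
    mu_r, mu_f = feat_real.mean(axis=0), feat_fake.mean(axis=0)
    cov_r = np.cov(feat_real, rowvar=False)
    cov_f = np.cov(feat_fake, rowvar=False)
    # ||mu_r - mu_f||^2 + Tr(cov_r + cov_f - 2 (cov_r cov_f)^{1/2})
    covmean = linalg.sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):
        covmean = covmean.real   # discard tiny imaginary parts from numerical error
    return float(np.sum((mu_r - mu_f) ** 2) + np.trace(cov_r + cov_f - 2 * covmean))
```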

Figure 2. Curves of FID vs. iteration (left), cost vs. iteration (middle), and number of SOT blocks vs. FID (right) for SWAE.

Figure 3. Curves of FID vs. iteration (top left), cost vs. iteration (top right), FID vs. number of SOT blocks (bottom left), and FID vs. k-Lipschitz constraint (bottom right) for SWGAN.

Figure 4. Interpolation results of the proposed SWAE (left) and SWGAN (right) models on CelebA.

4.2. Evaluation

Comparing against the state-of-the-art AE-based models, Tab. 3 demonstrates that our proposed SWAE outperforms the pure VAE models by a clear margin, while our FID score is also competitive with the AAE (WAE-GAN) model, which additionally leverages adversarial training from GANs and takes advantage of its better generalization ability. Nevertheless, due to adversarial training, AAE (WAE-GAN) is generally unstable, as studied in (Heusel et al., 2017), while our model trains very stably thanks to a simple l2 reconstruction loss without any regularization (Fig. 2, middle). Comparing with the recently published WAE-MMD method, we observe that our SWAE shows a clear advantage in terms of both visual results and FID scores. This verifies that the primal SOT blocks on the encoding work better than other divergence constraints (e.g., MMD) employed by WAE. As none of the evaluated AE-based models are successful on the LSUN dataset, we do not include those results in the paper.

Compared to the state-of-the-art GAN models, the proposed SWGAN achieves top performance as well (Tab. 3), which quantitatively exhibits the advantages of our models. Lately, (Miyato et al., 2018) reported a competitive FID score of 17.5 on CIFAR-10, while relying on extra label information. To the best of our knowledge, our SWGAN has reached the lowest FID on CIFAR-10 among all existing generative models, whereas on CelebA and LSUN it is mildly outperformed by the GAN model of (Heusel et al., 2017), which employs a sophisticated two time-scale update rule. Furthermore, the visual results reported in Fig. 5 are consistent with the FID scores. In particular, our SWGAN obtains more visually pleasing images than WGAN and WGAN-GP in terms of better facial semantics and sufficient diversity. The same conclusion can be drawn for CIFAR-10 and LSUN as well. This empirically shows that our dual SOT model works better than the original OT model employed in the state-of-the-art WGANs. We believe this is mainly because the independent approximation of the SWD on multiple one-dimensional marginal distributions of the training data is easier than the estimation of the original WD, which works directly on samples of much higher dimension.

In addition, we also study some key properties of our SWAE and SWGAN. First, we show the FID curve and the objective curve during training to verify their effectiveness in terms of both quantitative and qualitative measurement. The first plots of Fig. 2 and Fig. 3 demonstrate that the training of our SWAE and SWGAN is more stable than that of other models in terms of FID. Meanwhile, we can also see that our proposed objective functions faithfully reflect the image quality of the generated samples as the training iterations increase. Second, we evaluate the impact of the number of our designed SOT blocks in terms of the FID metric. As it turns out, merely 3 primal and 1 dual SOT block(s) are sufficient to achieve the top performance (Fig. 2 and
Fig. 3), which confirms our intuition that instead of randomly sampling a long sequence of Radon transform matrices, it is possible to learn a short sequence of them (i.e., by stacking a limited number of SOT blocks) that better matches two distributions. Additionally, we study the impact of the k-Lipschitz constraint for SWGAN. Fig. 3 shows that SWGAN favors a relatively small k. Finally, Fig. 4 shows interpolation results of SWAE and SWGAN, justifying that they are capable of generating a reasonable geometry of the latent manifold.

Figure 5. Visual results of the AE-based models (VAE, WAE-MMD, AAE (WAE-GAN), SWAE; top 2 rows) and GAN models (DCGAN, WGAN, WGAN-GP, SWGAN; bottom 3 rows) on CIFAR-10, CelebA and LSUN.

5. Conclusion

In this paper, we introduced a novel sliced optimal transport model for generative modeling. In particular, we endowed modern AE-based and GAN models with the proposed SOT blocks for a better approximation of either the primal or the dual form of the sliced Wasserstein distance, which serves as a measure of the discrepancy between the model distribution and the data distribution. Both FID scores and qualitative results demonstrated clear advantages over existing models. Future work includes a theoretical analysis of the approximation ratio of our proposed model for the primal and dual forms of SWD, and the extension of our model to the context of progressively growing networks for better generation.

Acknowledgements

We would like to thank Dr. Xianfeng David Gu for his insightful blog about optimal transport theory, and NVIDIA for donating the GPUs used in this work.

References

Absil, P-A, Mahony, R., and Sepulchre, R. Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2009.

Arjovsky, Martin, Chintala, Soumith, and Bottou, Léon. Wasserstein generative adversarial networks. In ICML, 2017.

Berthelot, David, Schumm, Tom, and Metz, Luke. BEGAN: Boundary equilibrium generative adversarial networks. arXiv preprint, 2017.

Bonneel, Nicolas, Rabin, Julien, Peyré, Gabriel, and Pfister, Hanspeter. Sliced and Radon Wasserstein barycenters of measures. Journal of Mathematical Imaging and Vision, 51(1):22-45, 2015.

Bonnotte, Nicolas. Unidimensional and evolution methods for optimal transportation. PhD thesis, Paris, 2013.

Edelman, Alan, Arias, Tomás A, and Smith, Steven T. The geometry of algorithms with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications, 20(2), 1998.

Genevay, Aude, Peyré, Gabriel, and Cuturi, Marco. Learning generative models with Sinkhorn divergences. arXiv preprint, 2017.

Goodfellow, Ian, Pouget-Abadie, Jean, Mirza, Mehdi, Xu, Bing, Warde-Farley, David, Ozair, Sherjil, Courville, Aaron, and Bengio, Yoshua. Generative adversarial nets. In NIPS, 2014.

Gulrajani, Ishaan, Ahmed, Faruk, Arjovsky, Martin, Dumoulin, Vincent, and Courville, Aaron. Improved training of Wasserstein GANs. In NIPS, 2017.

Heusel, Martin, Ramsauer, Hubert, Unterthiner, Thomas, Nessler, Bernhard, and Hochreiter, Sepp. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NIPS, 2017.

Hornik, Kurt. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2), 1991.

Huang, Zhiwu and Van Gool, Luc. A Riemannian network for SPD matrix learning. In AAAI, 2017.

Karras, Tero, Aila, Timo, Laine, Samuli, and Lehtinen, Jaakko. Progressive growing of GANs for improved quality, stability, and variation. In ICLR, 2018.

Kingma, Diederik and Ba, Jimmy. Adam: A method for stochastic optimization. arXiv preprint, 2014.

Kingma, Diederik P and Welling, Max. Auto-encoding variational Bayes. arXiv preprint, 2013.

Kolouri, Soheil, Zou, Yang, and Rohde, Gustavo K. Sliced Wasserstein kernels for probability distributions. In CVPR, 2016.

Kolouri, Soheil, Park, Se Rim, Thorpe, Matthew, Slepcev, Dejan, and Rohde, Gustavo K. Optimal mass transport: Signal processing and machine-learning applications. IEEE Signal Processing Magazine, 34(4):43-59, 2017.

Liu, Ziwei, Luo, Ping, Wang, Xiaogang, and Tang, Xiaoou. Deep learning face attributes in the wild. In ICCV, 2015.

Liu, Zixia, Wang, Liqiang, and Gong, Boqing. Improving the improved training of Wasserstein GANs. In ICLR, 2018.

Lucic, Mario, Kurach, Karol, Michalski, Marcin, Gelly, Sylvain, and Bousquet, Olivier. Are GANs created equal? A large-scale study. arXiv preprint, 2017.

Makhzani, Alireza, Shlens, Jonathon, Jaitly, Navdeep, Goodfellow, Ian, and Frey, Brendan. Adversarial autoencoders. In ICLR, 2016.

Mao, Xudong, Li, Qing, Xie, Haoran, Lau, Raymond YK, Wang, Zhen, and Smolley, Stephen Paul. Least squares generative adversarial networks. In ICCV, 2017.

Mescheder, Lars, Nowozin, Sebastian, and Geiger, Andreas. Adversarial variational Bayes: Unifying variational autoencoders and generative adversarial networks. In ICML, 2017.

Miyato, Takeru, Kataoka, Toshiki, Koyama, Masanori, and Yoshida, Yuichi. Spectral normalization for generative adversarial networks. In ICLR, 2018.
Pitié, François, Kokaram, Anil C, and Dahyot, Rozenn. Automated colour grading using colour distribution transfer. CVIU, 107(1), 2007.

Radford, Alec, Metz, Luke, and Chintala, Soumith. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint, 2015.

Rezende, Danilo Jimenez, Mohamed, Shakir, and Wierstra, Daan. Stochastic backpropagation and approximate inference in deep generative models. In ICML, 2014.

Salimans, Tim, Kingma, Diederik P, Welling, Max, et al. Markov chain Monte Carlo and variational inference: Bridging the gap. In ICML, 2015.

Salimans, Tim, Goodfellow, Ian, Zaremba, Wojciech, Cheung, Vicki, Radford, Alec, and Chen, Xi. Improved techniques for training GANs. In NIPS, 2016.

Salimans, Tim, Zhang, Han, Radford, Alec, and Metaxas, Dimitris. Improving GANs using optimal transport. In ICLR, 2018.

Tolstikhin, Ilya, Bousquet, Olivier, Gelly, Sylvain, and Schoelkopf, Bernhard. Wasserstein auto-encoders. In ICLR, 2018.

Villani, Cédric. Optimal Transport: Old and New, volume 338. Springer Science & Business Media, 2008.

Wei, Xiang, Liu, Zixia, Wang, Liqiang, and Gong, Boqing. Improving the improved training of Wasserstein GANs. In ICLR, 2018.

Yu, Fisher, Seff, Ari, Zhang, Yinda, Song, Shuran, Funkhouser, Thomas, and Xiao, Jianxiong. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint, 2015.

Zhao, Junbo, Mathieu, Michael, and LeCun, Yann. Energy-based generative adversarial network. arXiv preprint, 2016.


Controllable Generative Adversarial Network Controllable Generative Adversarial Network arxiv:1708.00598v2 [cs.lg] 12 Sep 2017 Minhyeok Lee 1 and Junhee Seok 1 1 School of Electrical Engineering, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul,

More information

arxiv: v1 [cs.cv] 25 Jan 2018

arxiv: v1 [cs.cv] 25 Jan 2018 GENERATIVE ADVERSARIAL NETWORKS USING ADAPTIVE CONVOLUTION Nhat M. Nguyen, Nilanjan Ray Department of Computing Science University of Alberta Edmonton, Alberta T6G 2R3 Canada {nmnguyen,nray1}@ualberta.ca

More information

Unpaired Multi-Domain Image Generation via Regularized Conditional GANs

Unpaired Multi-Domain Image Generation via Regularized Conditional GANs Unpaired Multi-Domain Image Generation via Regularized Conditional GANs Xudong Mao and Qing Li Department of Computer Science, City University of Hong Kong xudong.xdmao@gmail.com, itqli@cityu.edu.hk Abstract

More information

One Network to Solve Them All Solving Linear Inverse Problems using Deep Projection Models

One Network to Solve Them All Solving Linear Inverse Problems using Deep Projection Models One Network to Solve Them All Solving Linear Inverse Problems using Deep Projection Models [Supplemental Materials] 1. Network Architecture b ref b ref +1 We now describe the architecture of the networks

More information

Towards Principled Methods for Training Generative Adversarial Networks. Martin Arjovsky & Léon Bottou

Towards Principled Methods for Training Generative Adversarial Networks. Martin Arjovsky & Léon Bottou Towards Principled Methods for Training Generative Adversarial Networks Martin Arjovsky & Léon Bottou Unsupervised learning - We have samples from an unknown distribution Unsupervised learning - We have

More information

DEEP LEARNING PART THREE - DEEP GENERATIVE MODELS CS/CNS/EE MACHINE LEARNING & DATA MINING - LECTURE 17

DEEP LEARNING PART THREE - DEEP GENERATIVE MODELS CS/CNS/EE MACHINE LEARNING & DATA MINING - LECTURE 17 DEEP LEARNING PART THREE - DEEP GENERATIVE MODELS CS/CNS/EE 155 - MACHINE LEARNING & DATA MINING - LECTURE 17 GENERATIVE MODELS DATA 3 DATA 4 example 1 DATA 5 example 2 DATA 6 example 3 DATA 7 number of

More information

19: Inference and learning in Deep Learning

19: Inference and learning in Deep Learning 10-708: Probabilistic Graphical Models 10-708, Spring 2017 19: Inference and learning in Deep Learning Lecturer: Zhiting Hu Scribes: Akash Umakantha, Ryan Williamson 1 Classes of Deep Generative Models

More information

Supplementary Material: Unsupervised Domain Adaptation for Face Recognition in Unlabeled Videos

Supplementary Material: Unsupervised Domain Adaptation for Face Recognition in Unlabeled Videos Supplementary Material: Unsupervised Domain Adaptation for Face Recognition in Unlabeled Videos Kihyuk Sohn 1 Sifei Liu 2 Guangyu Zhong 3 Xiang Yu 1 Ming-Hsuan Yang 2 Manmohan Chandraker 1,4 1 NEC Labs

More information

Deep Hybrid Discriminative-Generative Models for Semi-Supervised Learning

Deep Hybrid Discriminative-Generative Models for Semi-Supervised Learning Volodymyr Kuleshov 1 Stefano Ermon 1 Abstract We propose a framework for training deep probabilistic models that interpolate between discriminative and generative approaches. Unlike previously proposed

More information

Bridging Theory and Practice of GANs

Bridging Theory and Practice of GANs MedGAN ID-CGAN Progressive GAN LR-GAN CGAN IcGAN b-gan LS-GAN AffGAN LAPGAN LSGAN InfoGAN CatGAN SN-GAN DiscoGANMPM-GAN AdaGAN AMGAN igan IAN CoGAN Bridging Theory and Practice of GANs McGAN Ian Goodfellow,

More information

arxiv: v4 [cs.lg] 1 May 2018

arxiv: v4 [cs.lg] 1 May 2018 Controllable Generative Adversarial Network arxiv:1708.00598v4 [cs.lg] 1 May 2018 Minhyeok Lee School of Electrical Engineering Korea University Seoul, Korea 02841 suam6409@korea.ac.kr Abstract Junhee

More information

When Variational Auto-encoders meet Generative Adversarial Networks

When Variational Auto-encoders meet Generative Adversarial Networks When Variational Auto-encoders meet Generative Adversarial Networks Jianbo Chen Billy Fang Cheng Ju 14 December 2016 Abstract Variational auto-encoders are a promising class of generative models. In this

More information

arxiv: v2 [cs.cv] 6 Dec 2017

arxiv: v2 [cs.cv] 6 Dec 2017 Arbitrary Facial Attribute Editing: Only Change What You Want arxiv:1711.10678v2 [cs.cv] 6 Dec 2017 Zhenliang He 1,2 Wangmeng Zuo 4 Meina Kan 1 Shiguang Shan 1,3 Xilin Chen 1 1 Key Lab of Intelligent Information

More information

Multi-Modal Generative Adversarial Networks

Multi-Modal Generative Adversarial Networks Multi-Modal Generative Adversarial Networks By MATAN BEN-YOSEF Under the supervision of PROF. DAPHNA WEINSHALL Faculty of Computer Science and Engineering THE HEBREW UNIVERSITY OF JERUSALEM A thesis submitted

More information

ANY image data set only covers a fixed domain. This

ANY image data set only covers a fixed domain. This Extra Domain Data Generation with Generative Adversarial Nets Luuk Boulogne Bernoulli Institute Department of Artificial Intelligence University of Groningen Groningen, The Netherlands lhboulogne@gmail.com

More information

IMPROVING SAMPLING FROM GENERATIVE AUTOENCODERS WITH MARKOV CHAINS

IMPROVING SAMPLING FROM GENERATIVE AUTOENCODERS WITH MARKOV CHAINS IMPROVING SAMPLING FROM GENERATIVE AUTOENCODERS WITH MARKOV CHAINS Antonia Creswell, Kai Arulkumaran & Anil A. Bharath Department of Bioengineering Imperial College London London SW7 2BP, UK {ac2211,ka709,aab01}@ic.ac.uk

More information

DCGANs for image super-resolution, denoising and debluring

DCGANs for image super-resolution, denoising and debluring DCGANs for image super-resolution, denoising and debluring Qiaojing Yan Stanford University Electrical Engineering qiaojing@stanford.edu Wei Wang Stanford University Electrical Engineering wwang23@stanford.edu

More information

arxiv: v1 [cs.gr] 27 Dec 2018

arxiv: v1 [cs.gr] 27 Dec 2018 Sampling using Neural Networks for colorizing the grayscale images arxiv:1812.10650v1 [cs.gr] 27 Dec 2018 Wonbong Jang Department of Statistics London School of Economics London, WC2A 2AE w.jang@lse.ac.uk

More information

GENERATIVE ADVERSARIAL NETWORKS FOR IMAGE STEGANOGRAPHY

GENERATIVE ADVERSARIAL NETWORKS FOR IMAGE STEGANOGRAPHY GENERATIVE ADVERSARIAL NETWORKS FOR IMAGE STEGANOGRAPHY Denis Volkhonskiy 2,3, Boris Borisenko 3 and Evgeny Burnaev 1,2,3 1 Skolkovo Institute of Science and Technology 2 The Institute for Information

More information

The Amortized Bootstrap

The Amortized Bootstrap Eric Nalisnick 1 Padhraic Smyth 1 Abstract We use amortized inference in conjunction with implicit models to approximate the bootstrap distribution over model parameters. We call this the amortized bootstrap,

More information

Image Restoration with Deep Generative Models

Image Restoration with Deep Generative Models Image Restoration with Deep Generative Models Raymond A. Yeh *, Teck-Yian Lim *, Chen Chen, Alexander G. Schwing, Mark Hasegawa-Johnson, Minh N. Do Department of Electrical and Computer Engineering, University

More information

Symmetric Variational Autoencoder and Connections to Adversarial Learning

Symmetric Variational Autoencoder and Connections to Adversarial Learning Symmetric Variational Autoencoder and Connections to Adversarial Learning Liqun Chen 1 Shuyang Dai 1 Yunchen Pu 1 Erjin Zhou 4 Chunyuan Li 1 Qinliang Su 2 Changyou Chen 3 Lawrence Carin 1 1 Duke University,

More information

Deep Fakes using Generative Adversarial Networks (GAN)

Deep Fakes using Generative Adversarial Networks (GAN) Deep Fakes using Generative Adversarial Networks (GAN) Tianxiang Shen UCSD La Jolla, USA tis038@eng.ucsd.edu Ruixian Liu UCSD La Jolla, USA rul188@eng.ucsd.edu Ju Bai UCSD La Jolla, USA jub010@eng.ucsd.edu

More information

IMPLICIT AUTOENCODERS

IMPLICIT AUTOENCODERS IMPLICIT AUTOENCODERS Anonymous authors Paper under double-blind review ABSTRACT In this paper, we describe the implicit autoencoder (IAE), a generative autoencoder in which both the generative path and

More information

Geometric Enclosing Networks

Geometric Enclosing Networks Geometric Enclosing Networks Trung Le, Hung Vu, Tu Dinh Nguyen and Dinh Phung Faculty of Information Technology, Monash University Center for Pattern Recognition and Data Analytics, Deakin University,

More information

Structured GANs. Irad Peleg 1 and Lior Wolf 1,2. Abstract. 1. Introduction. 2. Symmetric GANs Related Work

Structured GANs. Irad Peleg 1 and Lior Wolf 1,2. Abstract. 1. Introduction. 2. Symmetric GANs Related Work Structured GANs Irad Peleg 1 and Lior Wolf 1,2 1 Tel Aviv University 2 Facebook AI Research Abstract We present Generative Adversarial Networks (GANs), in which the symmetric property of the generated

More information

arxiv: v12 [cs.cv] 10 Jun 2018

arxiv: v12 [cs.cv] 10 Jun 2018 arxiv:1711.06491v12 [cs.cv] 10 Jun 2018 High-Resolution Deep Convolutional Generative Adversarial Networks J. D. Curtó,1,2,3,4, I. C. Zarza,1,2,3,4, F. De La Torre 2, I. King 1, and M. R. Lyu 1 1 Dept.

More information

Deep Learning With Noise

Deep Learning With Noise Deep Learning With Noise Yixin Luo Computer Science Department Carnegie Mellon University yixinluo@cs.cmu.edu Fan Yang Department of Mathematical Sciences Carnegie Mellon University fanyang1@andrew.cmu.edu

More information

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li Learning to Match Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li 1. Introduction The main tasks in many applications can be formalized as matching between heterogeneous objects, including search, recommendation,

More information

Flow-GAN: Combining Maximum Likelihood and Adversarial Learning in Generative Models

Flow-GAN: Combining Maximum Likelihood and Adversarial Learning in Generative Models Flow-GAN: Combining Maximum Likelihood and Adversarial Learning in Generative Models Aditya Grover, Manik Dhar, Stefano Ermon Computer Science Department Stanford University {adityag, dmanik, ermon}@cs.stanford.edu

More information

Akarsh Pokkunuru EECS Department Contractive Auto-Encoders: Explicit Invariance During Feature Extraction

Akarsh Pokkunuru EECS Department Contractive Auto-Encoders: Explicit Invariance During Feature Extraction Akarsh Pokkunuru EECS Department 03-16-2017 Contractive Auto-Encoders: Explicit Invariance During Feature Extraction 1 AGENDA Introduction to Auto-encoders Types of Auto-encoders Analysis of different

More information

Semantic Segmentation. Zhongang Qi

Semantic Segmentation. Zhongang Qi Semantic Segmentation Zhongang Qi qiz@oregonstate.edu Semantic Segmentation "Two men riding on a bike in front of a building on the road. And there is a car." Idea: recognizing, understanding what's in

More information