EE-559 Deep learning / 9.4. Non-volume preserving networks
François Fleuret, École Polytechnique Fédérale de Lausanne

A standard result of probability theory is that if $f$ is continuous, invertible, and [almost everywhere] differentiable, then

$$\forall x,\quad p_{f(Z)}(x) = p_Z\big(f^{-1}(x)\big)\,\big|J_{f^{-1}}(x)\big|.$$

[Figure: a latent density $p_Z$ and the resulting density $p_{f(Z)}$ of its image under $f$.]
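As a quick sanity check of this identity (not from the slides), the sketch below applies it to the hypothetical affine map $f(z) = 2z + 1$ with $Z \sim \mathcal{N}(0, 1)$, for which $f(Z) \sim \mathcal{N}(1, 4)$ in closed form:

    import torch
    from torch.distributions import Normal

    # f(z) = scale * z + shift, so f^{-1}(x) = (x - shift) / scale
    # and |J_{f^{-1}}(x)| = 1 / scale.
    scale, shift = 2.0, 1.0
    Z = Normal(0.0, 1.0)

    x = torch.linspace(-3.0, 5.0, 9)

    # Change of variables: p_{f(Z)}(x) = p_Z(f^{-1}(x)) |J_{f^{-1}}(x)|.
    p_change_of_var = Z.log_prob((x - shift) / scale).exp() / scale

    # Closed form: f(Z) ~ N(shift, scale^2).
    p_closed_form = Normal(shift, scale).log_prob(x).exp()

    print(torch.allclose(p_change_of_var, p_closed_form))  # True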
From this equality, if $f$ is a parametric function such that we can compute [and differentiate] $p_Z(f^{-1}(x))$ and $J_{f^{-1}}(x)$, then we can make the distribution of $f(Z)$ fit the data by optimizing

$$\sum_n \log p_{f(Z)}(x_n) = \sum_n \log p_Z\big(f^{-1}(x_n)\big) + \log \big|J_{f^{-1}}(x_n)\big|.$$

If we are able to do so, then we can synthesize a new $X$ by sampling $Z \sim \mathcal{N}(0, I)$ and computing $f(Z)$.

If $Z \sim \mathcal{N}(0, I)$,

$$\log p_Z\big(f^{-1}(x_n)\big) = -\frac{1}{2}\Big(\big\|f^{-1}(x_n)\big\|^2 + d \log 2\pi\Big).$$

And remember that if $f$ is a composition of functions $f = f^{(K)} \circ \cdots \circ f^{(1)}$, we have

$$J_f(x) = \prod_{k=1}^{K} J_{f^{(k)}}\big(f^{(k-1)} \circ \cdots \circ f^{(1)}(x)\big),$$

so

$$\log \big|J_f(x)\big| = \sum_{k=1}^{K} \log \Big|J_{f^{(k)}}\big(f^{(k-1)} \circ \cdots \circ f^{(1)}(x)\big)\Big|.$$
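The additivity of the log-Jacobian-determinants under composition is easy to verify with autograd. This is a sketch of mine, with two arbitrary invertible 2d maps standing in for the $f^{(k)}$:

    import torch
    from torch.autograd.functional import jacobian

    # Two arbitrary invertible maps on R^2.
    A1, b1 = torch.randn(2, 2) + 2 * torch.eye(2), torch.randn(2)
    A2, b2 = torch.randn(2, 2) + 2 * torch.eye(2), torch.randn(2)
    f1 = lambda x: torch.tanh(x @ A1.t() + b1)
    f2 = lambda x: x @ A2.t() + b2

    # log |det J_g(u)| computed by brute force.
    logdet = lambda g, u: torch.linalg.slogdet(jacobian(g, u))[1]

    x = torch.randn(2)
    lhs = logdet(lambda u: f2(f1(u)), x)
    rhs = logdet(f1, x) + logdet(f2, f1(x))
    print(torch.allclose(lhs, rhs))  # True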
If the $f^{(k)}$ are standard layers, computing $f^{-1}(z)$ is impossible, and computing $|J_f(x)|$ is intractable.

Dinh et al. (2014) introduced the coupling layers to address both issues. The resulting Non-Volume Preserving network (NVP) is an example of a normalizing flow (Rezende and Mohamed, 2015).

We use here the formalism from Dinh et al. (2016). Given a dimension $d$, a Boolean vector $b \in \{0, 1\}^d$, and two mappings

$$s : \mathbb{R}^d \to \mathbb{R}^d, \qquad t : \mathbb{R}^d \to \mathbb{R}^d,$$

we define a [fully connected] coupling layer as the transformation $c : \mathbb{R}^d \to \mathbb{R}^d$,

$$c(x) = b \odot x + (1 - b) \odot \big(x \odot \exp(s(b \odot x)) + t(b \odot x)\big),$$

where $\exp$ is component-wise, and $\odot$ is the Hadamard component-wise product.
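This definition is a one-liner in PyTorch. In the sketch below (my addition), $s$ and $t$ are hypothetical stand-ins for the learned mappings, and the check confirms that the coordinates selected by $b$ pass through unchanged:

    import torch

    def coupling(x, b, s, t):
        # c(x) = b ⊙ x + (1 - b) ⊙ (x ⊙ exp(s(b ⊙ x)) + t(b ⊙ x))
        return b * x + (1 - b) * (x * torch.exp(s(b * x)) + t(b * x))

    b = torch.tensor([1.0, 1.0, 0.0, 0.0])
    s = lambda u: torch.tanh(u)   # stand-ins for the learned mappings
    t = lambda u: 0.5 * u

    x = torch.randn(4)
    y = coupling(x, b, s, t)
    print(y[:2].equal(x[:2]))  # True: the coordinates with b_i = 1 are unchanged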
The expression

$$c(x) = b \odot x + (1 - b) \odot \big(x \odot \exp(s(b \odot x)) + t(b \odot x)\big)$$

can be understood as: forward $b \odot x$ unchanged, and apply to $(1 - b) \odot x$ an invertible transformation parametrized by $b \odot x$.

[Figure: block diagram of the coupling layer, with $b \odot x$ feeding $s$ (through $\exp$) and $t$, which scale and shift the other half of $x$ to produce $c(x)$.]

The consequence is that $c$ is invertible, and if $y = c(x)$,

$$x = b \odot y + (1 - b) \odot \big(y - t(b \odot y)\big) \odot \exp\big(-s(b \odot y)\big).$$

[Figure: block diagram of the inverse, undoing the shift by $t$ and the scaling by $\exp(s)$.]
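A round-trip check of this inversion formula, with the same toy $s$ and $t$ (my sketch):

    import torch

    b = torch.tensor([1.0, 1.0, 0.0, 0.0])
    s = lambda u: torch.tanh(u)
    t = lambda u: 0.5 * u

    def coupling(x):
        return b * x + (1 - b) * (x * torch.exp(s(b * x)) + t(b * x))

    def coupling_inv(y):
        # b ⊙ y = b ⊙ x, hence s(b ⊙ y) and t(b ⊙ y) match the forward pass.
        return b * y + (1 - b) * (y - t(b * y)) * torch.exp(-s(b * y))

    x = torch.randn(4)
    print(torch.allclose(coupling_inv(coupling(x)), x))  # True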
The second property of this mapping is the simplicity of its Jacobian determinant. Since

$$c_i(x) = b_i x_i + (1 - b_i)\big(x_i \exp(s_i(b \odot x)) + t_i(b \odot x)\big),$$

we have, $\forall i, j, x$:

$$b_i = 1 \;\Rightarrow\; c_i(x) = x_i \;\text{ and }\; \frac{\partial c_i}{\partial x_j}(x) = 1_{\{i=j\}},$$

$$b_i = 0 \;\Rightarrow\; c_i(x) = x_i \exp\big(s_i(b \odot x)\big) + t_i(b \odot x)$$

and

$$\frac{\partial c_i}{\partial x_j}(x) = 1_{\{i=j\}} \exp\big(s_i(b \odot x)\big) + x_i \exp\big(s_i(b \odot x)\big)\,\frac{\partial s_i(b \odot x)}{\partial x_j} + \frac{\partial t_i(b \odot x)}{\partial x_j},$$

where the last two partial derivatives are zero if $b_j = 0$. Hence $\frac{\partial c_i}{\partial x_j}$ can be non-zero only if $i = j$, or $(1 - b_i) b_j = 1$.

If we re-order both the rows and the columns of the Jacobian to put first the non-zero entries of $b$, and then the zeros, it becomes lower triangular,

$$J_c(x) = \begin{pmatrix} I & 0 \\ \ast & \mathrm{diag}\big(\exp(s_i(x \odot b))\big) \end{pmatrix}.$$

Its determinant remains unchanged under this re-ordering, and we have

$$\log |J_c(x)| = \sum_{i : b_i = 0} s_i(x \odot b) = \sum_i \big((1 - b) \odot s(x \odot b)\big)_i.$$
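The closed form can be checked against a brute-force Jacobian computed with autograd (my sketch, with the same toy coupling layer as above):

    import torch
    from torch.autograd.functional import jacobian

    b = torch.tensor([1.0, 1.0, 0.0, 0.0])
    s = lambda u: torch.tanh(u)
    t = lambda u: 0.5 * u

    def coupling(x):
        return b * x + (1 - b) * (x * torch.exp(s(b * x)) + t(b * x))

    x = torch.randn(4)

    # Brute force: log |det J_c(x)| from the full Jacobian.
    logdet_autograd = torch.linalg.slogdet(jacobian(coupling, x))[1]

    # Closed form: sum of the s_i(x ⊙ b) over the coordinates with b_i = 0.
    logdet_formula = ((1 - b) * s(b * x)).sum()

    print(torch.allclose(logdet_autograd, logdet_formula))  # True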
This structure can be observed directly with autograd:

    import torch
    from torch import nn

    dim = 6

    x = torch.empty(1, dim).normal_().requires_grad_()

    b = torch.zeros(1, dim)
    b[:, :dim//2] = 1.0

    s = nn.Sequential(nn.Linear(dim, dim), nn.Tanh())
    t = nn.Sequential(nn.Linear(dim, dim), nn.Tanh())

    c = b * x + (1 - b) * (x * s(b * x).exp() + t(b * x))

    j = torch.cat([torch.autograd.grad(c_k, x, retain_graph=True)[0] for c_k in c[0]])

    print(j)

prints a tensor of the form

    tensor([[ 1.,  0.,  0.,  0.,  0.,  0.],
            [ 0.,  1.,  0.,  0.,  0.,  0.],
            [ 0.,  0.,  1.,  0.,  0.,  0.],
            [  *,   *,   *,   *,  0.,  0.],
            [  *,   *,   *,  0.,   *,  0.],
            [  *,   *,   *,  0.,  0.,   *]])

where each $\ast$ stands for a non-zero value that depends on the random initialization.

To recap, with $f^{(k)}$, $k = 1, \ldots, K$ coupling layers, $f = f^{(K)} \circ \cdots \circ f^{(1)}$, $x_n^{(0)} = x_n$, and $x_n^{(k)} = f^{(k)}\big(x_n^{(k-1)}\big)$, we train by maximizing

$$\mathscr{L}(f) = \sum_n \Bigg( -\frac{1}{2}\Big(\big\|x_n^{(K)}\big\|^2 + d \log 2\pi\Big) + \sum_{k=1}^{K} \log \Big|J_{f^{(k)}}\big(x_n^{(k-1)}\big)\Big| \Bigg),$$

with

$$\log \big|J_{f^{(k)}}(x)\big| = \sum_i \Big(\big(1 - b^{(k)}\big) \odot s^{(k)}\big(x \odot b^{(k)}\big)\Big)_i.$$

And to sample we just need to generate $Z \sim \mathcal{N}(0, I)$ and compute $f^{-1}(Z)$.
A coupling layer can be implemented with

    class NVPCouplingLayer(nn.Module):
        def __init__(self, map_s, map_t, b):
            super(NVPCouplingLayer, self).__init__()
            self.map_s = map_s
            self.map_t = map_t
            self.b = b.clone().unsqueeze(0)

        def forward(self, x_and_logdetjac):
            x, logdetjac = x_and_logdetjac
            s, t = self.map_s(self.b * x), self.map_t(self.b * x)
            logdetjac += ((1 - self.b) * s).sum(1)
            y = self.b * x + (1 - self.b) * (torch.exp(s) * x + t)
            return (y, logdetjac)

        def invert(self, x):
            s, t = self.map_s(self.b * x), self.map_t(self.b * x)
            return self.b * x + (1 - self.b) * (torch.exp(-s) * (x - t))

The forward here computes both the image of $x$ and the update on the accumulated log-determinant of the Jacobian, i.e.

$$(x, u) \mapsto \big(f(x),\, u + \log |J_f(x)|\big).$$

We can then define a complete network, with one-hidden-layer tanh MLPs for the $s$ and $t$ mappings:

    class NVPNet(nn.Module):
        def __init__(self, dim, hidden_dim, depth):
            super(NVPNet, self).__init__()
            b = torch.empty(dim)
            self.layers = nn.ModuleList()
            for d in range(depth):
                if d % 2 == 0:
                    i = torch.randperm(b.numel())[0:b.numel() // 2]
                    b.zero_()[i] = 1.0
                else:
                    b = 1 - b
                map_s = nn.Sequential(nn.Linear(dim, hidden_dim), nn.Tanh(),
                                      nn.Linear(hidden_dim, dim))
                map_t = nn.Sequential(nn.Linear(dim, hidden_dim), nn.Tanh(),
                                      nn.Linear(hidden_dim, dim))
                self.layers.append(NVPCouplingLayer(map_s, map_t, b))

        def forward(self, x_and_logdetjac):
            for m in self.layers:
                x_and_logdetjac = m(x_and_logdetjac)
            return x_and_logdetjac

        def invert(self, x):
            for m in reversed(self.layers):
                x = m.invert(x)
            return x
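A quick smoke test of these two classes (my addition; the dimensions are arbitrary):

    import torch

    model = NVPNet(dim = 2, hidden_dim = 8, depth = 4)

    x = torch.randn(5, 2)
    z, logdetjac = model((x, torch.zeros(5)))
    print(z.size(), logdetjac.size())  # torch.Size([5, 2]) torch.Size([5])

    # invert undoes forward, up to numerical precision.
    print(torch.allclose(model.invert(z), x, atol = 1e-5))  # True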
And the log-probability of the individual samples of a batch:

    import math

    def LogProba(x_and_logdetjac):
        x, logdetjac = x_and_logdetjac
        log_p = logdetjac - 0.5 * (x.pow(2) + math.log(2 * math.pi)).sum(1)
        return log_p

Training is achieved by maximizing the mean log-probability:

    from torch import optim

    batch_size = 100

    model = NVPNet(dim = 2, hidden_dim = 2, depth = 4)
    optimizer = optim.Adam(model.parameters(), lr = 1e-2)

    for e in range(args.nb_epochs):
        acc_loss = 0
        for b in range(0, nb_train_samples, batch_size):
            output = model((input[b:b+batch_size], 0))
            loss = - LogProba(output).mean()
            acc_loss = acc_loss + loss.item()
            model.zero_grad()
            loss.backward()
            optimizer.step()

Finally, we can sample according to $p_{f^{-1}(Z)}$ with

    z = torch.empty(nb_train_samples, 2).normal_()
    x = model.invert(z)
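The training loop assumes a tensor input of nb_train_samples 2d points and an args.nb_epochs value, which are not defined in the slides; a hypothetical setup consistent with the toy experiments could be:

    import torch

    nb_train_samples = 1000

    # Toy data: a balanced mixture of two 2d Gaussians (an assumption,
    # the slides do not specify the training distribution).
    centers = torch.tensor([[-1.0, -1.0], [1.0, 1.0]])
    labels = torch.randint(2, (nb_train_samples,))
    input = centers[labels] + 0.25 * torch.randn(nb_train_samples, 2)

    class Args: nb_epochs = 50
    args = Args()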
[Figure: 2d toy results, four pairs of "Real" and "Synth" scatter plots.]

Dinh et al. (2016) apply this approach to convolutional layers by using masks $b$ consistent with the activation map structure, and reducing the map size while increasing the number of channels:

[Figure 3 of Dinh et al. (2016): masking schemes for affine coupling layers. On the left, a spatial checkerboard pattern mask. On the right, a channel-wise masking. The squeezing operation reduces the $4 \times 4 \times 1$ tensor (on the left) into a $2 \times 2 \times 4$ tensor (on the right). Before the squeezing operation, a checkerboard pattern is used for coupling layers, while a channel-wise masking pattern is used afterward.]

With $x_{1:d}$ the coordinates forwarded unchanged, the coupling layer and its inverse are (Dinh et al., 2016)

$$\begin{cases} y_{1:d} = x_{1:d} \\ y_{d+1:D} = x_{d+1:D} \odot \exp\big(s(x_{1:d})\big) + t(x_{1:d}) \end{cases}$$

$$\begin{cases} x_{1:d} = y_{1:d} \\ x_{d+1:D} = \big(y_{d+1:D} - t(y_{1:d})\big) \odot \exp\big(-s(y_{1:d})\big), \end{cases}$$

meaning that sampling is as efficient as inference for this model. Note again that computing the inverse of the coupling layer does not require computing the inverse of $s$ or $t$, so these functions can be arbitrarily complex.
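A possible construction of the two mask types (my sketch; the exact layout is a design choice, not code from the paper):

    import torch

    def checkerboard_mask(h, w):
        # b[i, j] = 1 when i + j is even, 0 otherwise.
        i = torch.arange(h).view(-1, 1)
        j = torch.arange(w).view(1, -1)
        return ((i + j) % 2 == 0).float()

    def channel_mask(c):
        # The first half of the channels pass through unchanged.
        b = torch.zeros(c, 1, 1)
        b[:c // 2] = 1.0
        return b

    print(checkerboard_mask(4, 4))
    print(channel_mask(4).view(-1))  # tensor([1., 1., 0., 0.])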
They combine these layers by alternating masks, and branching out half of the channels at certain points to forward them unchanged.

[Figure 4 of Dinh et al. (2016): composition schemes for affine coupling layers. (a) In this alternating pattern, units which remain identical in one transformation are modified in the next. (b) Factoring out variables: at each step, half the variables are directly modeled as Gaussians, while the other half undergo further transformation.]

From the multi-scale architecture of Dinh et al. (2016): the squeezing operation divides, for each channel, the image into subsquares of shape $2 \times 2 \times c$, and reshapes them into subsquares of shape $1 \times 1 \times 4c$. It thus transforms an $s \times s \times c$ tensor into an $\frac{s}{2} \times \frac{s}{2} \times 4c$ tensor (see Figure 3), effectively trading spatial size for number of channels.

At each scale, several operations are combined into a sequence: first three coupling layers with alternating checkerboard masks, then a squeezing operation, and finally three more coupling layers with alternating channel-wise masking. The channel-wise masking is chosen so that the resulting partitioning is not redundant with the previous checkerboard masking (see Figure 3). For the final scale, only four coupling layers with alternating checkerboard masks are applied.

Propagating a $D$-dimensional vector through all the coupling layers would be cumbersome, in terms of computational and memory cost, and in terms of the number of parameters to train. For this reason, half of the dimensions are factored out at regular intervals. This operation can be defined recursively (see Figure 4(b)):

$$h^{(0)} = x, \qquad \big(z^{(i+1)}, h^{(i+1)}\big) = f^{(i+1)}\big(h^{(i)}\big), \qquad z^{(L)} = f^{(L)}\big(h^{(L-1)}\big), \qquad z = \big(z^{(1)}, \ldots, z^{(L)}\big).$$

The sequence of coupling-squeezing-coupling operations described above is performed per scale when computing each $f^{(i)}$. As the spatial resolution is reduced, the number of hidden-layer features in $s$ and $t$ is doubled, and all the variables factored out at the different scales are concatenated to obtain the final transformed output. As a consequence, the model must Gaussianize units which are factored out at a finer scale (in an earlier layer) before those which are factored out at a coarser scale (in a later layer). This defines intermediary levels of representation corresponding to more local, fine-grained features, distributes the loss function throughout the network, in a philosophy similar to guiding intermediate layers with intermediate classifiers, and significantly reduces the amount of computation and memory used by the model, allowing larger models to be trained.

The structure for generating images consists of two stages:

- first stage:
  - 3 checkerboard coupling layers,
  - a squeezing layer,
  - 3 channel-wise coupling layers,
  - a factor-out layer;
- second stage:
  - 4 checkerboard coupling layers,
  - a factor-out layer.

The $s$ and $t$ mappings get more complex in the later layers.
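Since the squeezing operation is a fixed permutation of the coordinates, it is trivially invertible and contributes nothing to the log-Jacobian-determinant. A possible sketch (mine; pixel_unshuffle's channel ordering may differ from the paper's subsquare convention, which is immaterial for a flow):

    import torch
    import torch.nn.functional as F

    x = torch.randn(1, 3, 4, 4)       # N × c × s × s

    y = F.pixel_unshuffle(x, 2)       # N × 4c × s/2 × s/2
    print(y.size())                   # torch.Size([1, 12, 2, 2])

    # The operation is exactly invertible, so log |det J| = 0.
    print(torch.equal(F.pixel_shuffle(y, 2), x))  # True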
[Figure 7 of Dinh et al. (2016): samples from a model trained on Imagenet (64 × 64).]

[Figure 8 of Dinh et al. (2016): samples from a model trained on CelebA.]
[Figure 9 of Dinh et al. (2016): samples from a model trained on the LSUN bedroom category.]

[Figure 10 of Dinh et al. (2016): samples from a model trained on the LSUN church outdoor category.]
[Figure 6 of Dinh et al. (2016): manifold generated from four examples in the dataset. Clockwise from top left: CelebA, Imagenet ($64 \times 64$), LSUN tower, LSUN bedroom.]

From the discussion of Dinh et al. (2016): maximum likelihood is a principle that values diversity over sample quality in a limited capacity setting, so the model sometimes outputs highly improbable samples, as can be noticed especially on CelebA. As opposed to variational autoencoders, the generated samples look not only globally coherent but also sharp; the authors' hypothesis is that, unlike those models, Real NVP does not rely on a fixed-form reconstruction cost such as an $L_2$ norm, which tends to reward capturing low-frequency components more heavily than high-frequency ones. Unlike autoregressive models, sampling is very efficient, as it is parallelized over the input dimensions. On Imagenet and LSUN, the model seems to have captured well the notion of background/foreground and lighting interactions, such as luminosity and a consistent light-source direction for reflectance and shadows.

The paper also illustrates the smooth, semantically consistent meaning of the latent variables. In the latent space, a manifold is defined from four validation examples $z^{(1)}, z^{(2)}, z^{(3)}, z^{(4)}$, parametrized by two angles $\varphi$ and $\varphi'$:

$$z = \cos\varphi \big(\cos\varphi'\, z^{(1)} + \sin\varphi'\, z^{(2)}\big) + \sin\varphi \big(\cos\varphi'\, z^{(3)} + \sin\varphi'\, z^{(4)}\big).$$

The resulting manifold is projected back into the data space by computing $g(z)$; the results of Figure 6 suggest that the model has organized the latent space with a notion of meaning that goes well beyond pixel-space interpolation. A sketch of this interpolation is given after the references.

From the conclusion: the paper defines a class of invertible functions with tractable Jacobian determinant, enabling exact and tractable log-likelihood evaluation, inference, and sampling, with competitive performance both in terms of sample quality and of log-likelihood. Like autoregressive models, it allows tractable and exact log-likelihood evaluation for training, while allowing a much more flexible functional form, similar to that of the generative model of variational autoencoders, together with fast and exact sampling. Like GANs, and unlike variational autoencoders, it does not require a fixed-form reconstruction cost, and generates sharper images. Finally, unlike both variational autoencoders and GANs, it learns a semantically meaningful latent space that is as high-dimensional as the input space, which may make it particularly well suited to semi-supervised learning tasks.

References

L. Dinh, D. Krueger, and Y. Bengio. NICE: Non-linear independent components estimation. CoRR, abs/1410.8516, 2014.

L. Dinh, J. Sohl-Dickstein, and S. Bengio. Density estimation using Real NVP. CoRR, abs/1605.08803, 2016.

D. Rezende and S. Mohamed. Variational inference with normalizing flows. CoRR, abs/1505.05770, 2015.
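A sketch of this two-parameter interpolation (my addition; the four latent points are random stand-ins for encoded validation examples, and the toy model.invert defined earlier plays the role of $g$):

    import math
    import torch

    # Four hypothetical latent points; in practice, f(x) for validation examples.
    z1, z2, z3, z4 = torch.randn(4, 2)

    grid = []
    for phi in torch.linspace(0, math.pi / 2, 5):
        for phi_p in torch.linspace(0, math.pi / 2, 5):
            z = torch.cos(phi) * (torch.cos(phi_p) * z1 + torch.sin(phi_p) * z2) \
              + torch.sin(phi) * (torch.cos(phi_p) * z3 + torch.sin(phi_p) * z4)
            grid.append(z)

    # Map the latent manifold back to the data space.
    x = model.invert(torch.stack(grid))
    print(x.size())  # torch.Size([25, 2])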