Fast Patch-based Style Transfer of Arbitrary Style


Tian Qi Chen
Department of Computer Science
University of British Columbia

Mark Schmidt
Department of Computer Science
University of British Columbia

Abstract

Artistic style transfer is an image synthesis problem where the content of an image is reproduced with the style of another. Recent works show that a visually appealing style transfer can be achieved by using the hidden activations of a pretrained convolutional neural network. However, existing methods either apply (i) an optimization procedure that works for any style image but is very expensive, or (ii) an efficient feedforward network that only allows a limited number of trained styles. In this work we propose a simpler optimization objective based on locality matching that combines the content structure and style textures in a single layer of the pretrained network. We show that our objective has desirable properties such as a simpler optimization landscape and consistent frame-by-frame performance on video. Furthermore, we use 80,000 natural images and 80,000 paintings to train an inverse network that approximates the result of the optimization. This results in a procedure for artistic style transfer that is efficient but also allows arbitrary content and style images.

1 Introduction

Famous artists are typically renowned for a particular artistic style, which takes years to develop. Even once perfected, a single piece of art can take days or even months to create. This motivates us to explore efficient computational strategies for creating artistic images. While there is a large classical literature on texture synthesis methods that create artwork from a blank canvas [6, 16, 18, 26], several recent approaches study the problem of transferring the desired style from one image onto the structural content of another image. This approach is known as artistic style transfer.

Artistic style transfer based on convolutional neural networks (CNNs) has recently shown impressive results [7, 8, 9, 17], and has even created a market for mobile applications that can stylize user-provided images on demand. Despite this renewed interest, the actual process of style transfer is based on solving a complex optimization problem, which can take minutes on today's hardware. This may be too slow for applications where we want to stylize videos, and has motivated recent approaches that train another neural network to efficiently approximate the optimum of the optimization problem [14, 24, 25]. While much faster, these approaches sacrifice the versatility of being able to perform style transfer with an arbitrary style image, as the feed-forward networks are trained to mimic a certain style or a small set of styles.

In this work we propose a method that addresses these limitations: a new method for artistic style transfer that is efficient but is not limited to a finite set of styles. We tackle this problem by defining a new optimization objective for style transfer that notably depends on only one layer of the CNN (as opposed to existing methods that use multiple layers). The new objective still leads to a visually appealing style transfer, while this simple restriction allows us to use an "inverse network" to deterministically invert the activations from that layer to yield the stylized image.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

2 Our Optimization Formulation

The main component of our style transfer method is a patch-based operation for constructing the target activations in a single layer, given the style and content images. We refer to this procedure as swapping the style of an image, as the content image is replaced patch-by-patch by the style image. We only present this operation at a high level here due to space restrictions, but it is possible to formulate this operation as a convolution followed by a simple argmax and then a transposed convolution.

Let $C$ and $S$ denote the RGB representations of the content and style images (respectively), and let $\Phi(\cdot)$ be the function represented by a fully convolutional part of the pretrained CNN that maps an image from the RGB space to some intermediate activation space. We extract overlapping activation patches after mapping the content and style images to their activations, $\Phi(C)$ and $\Phi(S)$. Let $\{\phi_i(C)\}_{i \in N_c}$ and $\{\phi_j(S)\}_{j \in N_s}$ denote the sets of extracted activation patches for content and style respectively. The activation patches can be extracted with arbitrary size and overlap, although the style and content activation patches must have the same patch size.

We then perform a patch-wise similarity matching between the content activation patches and style activation patches, using normalized cross-correlation. In particular, for each content patch we find the style patch maximizing

$$\phi_i^{ss}(C, S) := \underset{\phi_j(S),\ j \in N_s}{\arg\max} \; \frac{\langle \phi_i(C), \phi_j(S) \rangle}{\|\phi_i(C)\| \cdot \|\phi_j(S)\|}. \tag{1}$$

We replace every content activation patch $\phi_i(C)$ with its best matching style activation patch $\phi_i^{ss}(C, S)$. The complete activations for the style-swapped image, which we denote by $\Phi^{ss}(C, S)$, are then formed by recombining the patches $\{\phi_i^{ss}(C, S)\}_{i \in N_c}$. We average the activation values between overlapping patches, producing a linear interpolation effect in activation space. Thus, the hidden activations can be viewed as coming from a single original image.
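The style swap described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: it uses explicit loops and a matrix product in place of the convolution and transposed-convolution formulation mentioned in the text, and the names `extract_patches`, `style_swap`, and the small constant `eps` are assumptions introduced here. Activations are assumed to be arrays of shape (channels, height, width) taken from the same layer.

```python
import numpy as np

def extract_patches(act, size=3, stride=1):
    """Flatten all size x size patches of act, which has shape (C, H, W)."""
    _, h, w = act.shape
    patches, positions = [], []
    for y in range(0, h - size + 1, stride):
        for x in range(0, w - size + 1, stride):
            patches.append(act[:, y:y + size, x:x + size].ravel())
            positions.append((y, x))
    return np.stack(patches), positions

def style_swap(content_act, style_act, size=3, stride=1, eps=1e-8):
    """Replace each content patch by its best-matching style patch, as in (1)."""
    c_patches, c_pos = extract_patches(content_act, size, stride)
    s_patches, _ = extract_patches(style_act, size, stride)

    # Normalized cross-correlation. The content-patch norm is constant over j,
    # so it does not affect the argmax; only style patches are normalized.
    s_unit = s_patches / (np.linalg.norm(s_patches, axis=1, keepdims=True) + eps)
    best = np.argmax(c_patches @ s_unit.T, axis=1)

    # Recombine: sum the selected style patches, then average where they overlap.
    channels = content_act.shape[0]
    out = np.zeros(content_act.shape, dtype=float)
    count = np.zeros(content_act.shape[1:], dtype=float)
    for (y, x), j in zip(c_pos, best):
        out[:, y:y + size, x:x + size] += s_patches[j].reshape(channels, size, size)
        count[y:y + size, x:x + size] += 1.0
    return out / np.maximum(count, 1.0)
```

A quick sanity check of the sketch: swapping an activation map with itself should return the map unchanged, since each patch is its own best match under normalized cross-correlation and averaging identical overlaps changes nothing.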
The stylized image can be computed by placing a loss function on the activation space with target activations $\Phi^{ss}(C, S)$. Similar to prior works on style transfer [9, 17], we use the squared-error loss and define our optimization objective as

$$I_{stylized}(C, S) = \underset{I \in \mathbb{R}^{3 \times H \times W}}{\arg\min} \; \|\Phi(I) - \Phi^{ss}(C, S)\|^2 + \lambda \ell_{TV}(I), \tag{2}$$

where $\ell_{TV}(I)$ is a total variation regularization term widely used in image generation methods [1, 14, 21]. Because $\Phi(\cdot)$ may contain multiple max-pooling operations that downsample the image, we use this regularization as a natural image prior, obtaining spatially smooth results for the re-upsampled image. Since the function $\Phi(\cdot)$ is part of a pretrained CNN and is at least once subdifferentiable, (2) can be minimized using standard subgradient-based optimization methods.

3 Approximating the Optimum With an Inverse Network

An alternative to using optimization methods is to train an inverse network that approximates the optimum of the loss function (2). Instead of placing a loss on the RGB space and trying to optimize in RGB space, our inverse network is trained using the loss (2) on the activations. In particular, we train a network approximating $\Phi^{-1}$ using the loss function

$$\min_{\Phi^{-1}} \; \frac{1}{n} \sum_{j=1}^{n} \|\Phi(\Phi^{-1}(\Phi_j)) - \Phi_j\|^2 + \lambda \ell_{TV}(\Phi^{-1}(\Phi_j)). \tag{3}$$

3.1 Training Method

The function $\Phi(\cdot)$ is non-surjective. That is, not all hidden activations correspond to real images. This causes a problem if we only train on real images. In that case, the inverse network $\Phi^{-1}$ obtained from (3) using only real images would, at test time, be inverting activations that are outside the trained domain (as these activations would be the result of style swapping). To ensure the network can invert style-swapped activations, we simply augment the training set to include these types of activations. More precisely, given a set of content and style images (and their


corresponding activations), we augment the training set with style-swapped activations based on pairs of images. This augmented set of activations is then used to train the inverse network using a stochastic gradient method applied to (3).

Figure 1: The effect of style swapping in different layers of VGG-19 [23], and also in RGB space. Due to the naming convention of VGG-19, reluX_1 refers to the first ReLU layer after the $(X-1)$-th max-pooling layer. The style swap operation uses patches of size $3 \times 3$ and stride 1, and then the RGB image is constructed using optimization. (Panels: content image, style image, RGB, relu1_1, relu2_1, relu3_1, relu4_1, relu5_1.)

Figure 2: (a) Standard deviation of the RGB pixels over the course of optimization, shown for 40 random initializations (x-axis: optimization iteration; y-axis: standard deviation of pixels; curves: Gatys et al., Li and Wand, Style Swap). The lines show the mean value and the shaded regions are within one standard deviation of the mean. The vertical dashed lines indicate the end of optimization. (b) Samples using random initializations (panels: content image, style image, Gatys et al. with random init, our method with random init).

4 Experiments

Target Layer. The effects of style swapping in different layers of the VGG-19 network are shown in Figure 1 (the images in this figure were produced using the optimization described in Section 2). We see that while we can style swap directly in RGB space, the result is nothing more than a recolor. As we choose a target layer that is deeper in the network, textures of the style image are more pronounced. We find that style swapping on the relu3_1 layer provides the most visually pleasing results, while staying structurally consistent with the content. We restrict our method to the relu3_1 layer in the following experiments and in the inverse network training. Qualitative results are shown in Figure 4, where our results are placed side-by-side with images stylized using Gatys et al.'s method.

Consistency.
Style swapping concatenates the content and style information into a single target feature vector. The optimization procedure is then much easier compared to other approaches. Figure 2 shows the difference in optimization between our formulation and existing works. Random initializations have almost no effect on the stylized result, indicating that we have far fewer local

optima than other style transfer objectives. This consistency property is advantageous when stylizing videos frame by frame, as our method is able to adapt to video without any explicit gluing procedure, such as using optical flow [22].

Figure 3: Validation loss of inverse networks on 2000 content images and 6 style images, using patches of size $3 \times 3$ (x-axis: iteration; y-axis: loss function; curves: Optimization, InvNet-NoAug, InvNet-Aug).

Method                 N. Iters.   Time/Iter. (s)   Total (s)
Gatys et al. [9]
Li and Wand [17]
Style Swap (Optim)     100         0.0466           4.66
Style Swap (InvNet)    1           1.2483

Table 1: Mean computation times of style transfer methods that can handle arbitrary style images. Times are taken for images of resolution $300 \times 500$ on a GeForce GTX 980 Ti. Note that the number of iterations for optimization-based approaches should only be viewed as a very rough estimate.

4.1 Inverse Network

Dataset and Training. We train the inverse network using the Microsoft COCO (MSCOCO) dataset [19] and a dataset of paintings sourced from wikiart.org and hosted by Kaggle [4]. The datasets contain roughly 80,000 natural images and 80,000 paintings, respectively. We train using Adam [15] for approximately 2 epochs on each dataset. We construct each minibatch using 2 natural images, 2 paintings, and 4 style-swapped activations using the images in the minibatch.

Result. Figure 3 shows the approximation results for inverting style-swapped activations with $3 \times 3$ patches. Though only trained on images of size $256 \times 256$, we achieve reasonable results for arbitrary full-sized images. We additionally compare against an inverse network that has the same architecture but was not trained with the augmentation of style-swapped activations. As expected, the network that never sees style-swapped activations during training performs worse than the network with the augmented training set.

Computation Time. Computation times for methods that can handle arbitrary style images are shown in Table 1.
Both our optimization-based and feedforward variants beat existing methods on speed while maintaining the same level of versatility. To the best of our knowledge, this is the first CNN-based feedforward approach that can generalize to any style image.

Figure 4: Qualitative examples of our method compared with Gatys et al.'s formulation [9]. (Panels: style images; columns: content, ours, Gatys et al.)
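Returning to the objective (2) of Section 2, the two terms being balanced can be made concrete with a small NumPy sketch. The paper does not state which form of total variation it uses, so the anisotropic (absolute-difference) form below is an assumption, as is the default value of `lam`. The function `avg_pool2` is a hypothetical stand-in for the pretrained network $\Phi(\cdot)$, used only so the example runs without a CNN. All names are illustrative.

```python
import numpy as np

def total_variation(img):
    """Anisotropic TV (an assumed form): sum of absolute neighbour differences."""
    dy = np.abs(img[:, 1:, :] - img[:, :-1, :]).sum()
    dx = np.abs(img[:, :, 1:] - img[:, :, :-1]).sum()
    return float(dx + dy)

def stylization_loss(img, target_act, phi, lam=1e-3):
    """Evaluate the objective in (2) for a candidate image of shape (3, H, W)."""
    residual = phi(img) - target_act
    return float(np.sum(residual ** 2)) + lam * total_variation(img)

def avg_pool2(img):
    """Hypothetical stand-in for the network: 2x2 average pooling (even H, W)."""
    c, h, w = img.shape
    return img.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))
```

In this sketch, an image whose activations already match the target incurs only the smoothness penalty, which is the role the regularizer plays after max-pooling has discarded fine spatial detail. The inverse-network loss (3) uses the same two terms, with the candidate image replaced by the network output $\Phi^{-1}(\Phi_j)$ and the sum averaged over the training activations.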

References

[1] Hussein A. Aly and Eric Dubois. Image up-sampling using total-variation regularization with a new observation model. IEEE Transactions on Image Processing, 14(10).
[2] R. Collobert, K. Kavukcuoglu, and C. Farabet. Torch7: A matlab-like environment for machine learning. In BigLearn, NIPS Workshop.
[3] Alexey Dosovitskiy and Thomas Brox. Inverting convolutional networks with convolutional networks. CoRR.
[4] Small Yellow Duck. Painter by numbers, wikiart.org. painter-by-numbers.
[5] Alexei A. Efros and William T. Freeman. Image quilting for texture synthesis and transfer. In Proceedings of the 28th annual conference on Computer graphics and interactive techniques. ACM.
[6] Alexei A. Efros and Thomas K. Leung. Texture synthesis by non-parametric sampling. In Computer Vision, The Proceedings of the Seventh IEEE International Conference on, volume 2. IEEE.
[7] Michael Elad and Peyman Milanfar. Style-transfer via texture-synthesis. arXiv preprint.
[8] Oriel Frigo, Neus Sabater, Julie Delon, and Pierre Hellier. Split and match: Example-based adaptive patch sampling for unsupervised style transfer.
[9] Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. A neural algorithm of artistic style. CoRR.
[10] Kun He, Yan Wang, and John E. Hopcroft. A powerful generative model using random weights for the deep image representation. CoRR.
[11] Aaron Hertzmann. Paint By Relaxation. Proceedings Computer Graphics International (CGI), pages 47-54.
[12] Aaron Hertzmann, Charles E. Jacobs, Nuria Oliver, Brian Curless, and David H. Salesin. Image analogies. In Proceedings of the 28th annual conference on Computer graphics and interactive techniques. ACM.
[13] Justin Johnson. neural-style.
[14] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual Losses for Real-Time Style Transfer and Super-Resolution. arXiv.
[15] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint.
[16] Vivek Kwatra, Irfan Essa, Aaron Bobick, and Nipun Kwatra. Texture optimization for example-based synthesis. ACM Transactions on Graphics (ToG), 24(3).
[17] Chuan Li and Michael Wand. Combining Markov Random Fields and Convolutional Neural Networks for Image Synthesis. CVPR 2016, page 9.
[18] Lin Liang, Ce Liu, Ying-Qing Xu, Baining Guo, and Heung-Yeung Shum. Real-time texture synthesis by patch-based sampling. ACM Transactions on Graphics (ToG), 20(3).
[19] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: common objects in context. CoRR.
[20] Peter Litwinowicz. Processing Images and Video for an Impressionist Effect. Proc. SIGGRAPH.
[21] Aravindh Mahendran and Andrea Vedaldi. Understanding deep image representations by inverting them. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE.
[22] Manuel Ruder, Alexey Dosovitskiy, and Thomas Brox. Artistic style transfer for videos. Pages 1-14.
[23] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint.
[24] Dmitry Ulyanov, Vadim Lebedev, Andrea Vedaldi, and Victor Lempitsky. Texture Networks: Feed-forward Synthesis of Textures and Stylized Images. CoRR.
[25] Dmitry Ulyanov, Andrea Vedaldi, and Victor S. Lempitsky. Instance normalization: The missing ingredient for fast stylization. CoRR.
[26] Li-Yi Wei and Marc Levoy. Fast texture synthesis using tree-structured vector quantization. In Proceedings of the 27th annual conference on Computer graphics and interactive techniques. ACM Press/Addison-Wesley Publishing Co.
