Recovering Realistic Texture in Image Super-resolution by Deep Spatial Feature Transform Xintao Wang Ke Yu Chao Dong Chen Change Loy
Problem: single image super-resolution, enlarging a low-resolution image 4 times to recover a high-resolution image.
Previous work Contemporary SR algorithms are mostly CNN-based methods [1]. Most CNN-based methods use a pixel-wise loss function (MSE-based models): good at recovering edges and smooth areas, but not good at texture recovery. Adversarial loss is introduced in SRGAN [2] and EnhanceNet [3] (GAN-based models): it encourages the network to favor solutions that look more like natural images, and the visual quality of reconstruction is significantly improved. (Figure: SRCNN vs. SRGAN vs. ground truth.) [1] C. Dong, C. C. Loy, K. He, and X. Tang. Learning a deep convolutional network for image super-resolution. In ECCV, 2014. [2] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, et al. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, 2017. [3] M. S. Sajjadi, B. Schölkopf, and M. Hirsch. EnhanceNet: Single image super-resolution through automated texture synthesis. In ICCV, 2017.
Motivation: visually similar LR patches can correspond to very different HR textures. (Figure: building x4 and plant x4 patches; swapping the semantic prior swaps the recovered texture.)
Semantic categorical prior: building, water, animal, sky, grass, plant, mountain.
Issues 1. How to represent the semantic categorical prior? Our approach: explore semantic segmentation probability maps as the categorical prior, down to pixel level. 2. How can the categorical prior be incorporated into the reconstruction process effectively? Our approach: propose a novel Spatial Feature Transform that is capable of altering the network behavior conditioned on other information.
Representing the categorical prior: a contemporary CNN segmentation network (ResNet-101 based [1]) fine-tuned on LR images outputs K-category probability maps; these maps (or their argmax) serve as the semantic categorical prior. [1] Z. Liu, X. Li, P. Luo, C.-C. Loy, and X. Tang. Semantic image segmentation via deep parsing network. In ICCV, 2015.
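As a minimal sketch (not the paper's segmentation network), the step from per-pixel category scores to K probability maps is a softmax over the category axis; `probability_maps` and the toy logits below are illustrative names:

```python
import numpy as np

def probability_maps(logits):
    """Convert per-pixel category logits of shape (K, H, W) into K
    probability maps with a softmax over the category axis."""
    z = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=0, keepdims=True)

# toy example: K = 3 categories on a 2x2 image, category 1 dominating
logits = np.zeros((3, 2, 2))
logits[1] += 2.0
P = probability_maps(logits)
labels = P.argmax(axis=0)  # hard segments via argmax, as on the slide
```

The soft maps P, rather than the hard argmax labels, are what the prior Ψ carries forward, so ambiguous pixels keep their uncertainty.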
Examples on segmentation. (Figure: input LR images, ground truth, segments on HR images vs. segments on LR images; categories: sky, grass, building, mountain, plant, water, animal, background.)
Incorporating conditions. The categorical prior is a set of K probability maps, Ψ = (P_1, P_2, …, P_K). A plain CNN for SR maps an input LR image x to a restored image y through a network G parameterized by θ: y = G_θ(x). The conditioned version becomes y = G_θ(x | Ψ).
Spatial Feature Transform. A learned mapping function M models Ψ as a pair of affine transformation parameters (γ, β): M: Ψ → (γ, β). The modulation is then carried out by an affine transformation on the feature maps F: SFT(F | γ, β) = γ ⊙ F + β. The conditioned network y = G_θ(x | Ψ) thus becomes y = G_θ(x | γ, β).
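A minimal numpy sketch of the transform above; the real mapping M is a small conv network, so the 1x1-conv-like linear map `mapping_M` here is only an illustrative stand-in:

```python
import numpy as np

def mapping_M(psi, W_gamma, W_beta):
    """Toy stand-in for the learned mapping M: Psi -> (gamma, beta).
    A 1x1-convolution-like linear map from K probability maps to C
    modulation maps per parameter."""
    K, H, W = psi.shape
    flat = psi.reshape(K, H * W)
    gamma = (W_gamma @ flat).reshape(-1, H, W)
    beta = (W_beta @ flat).reshape(-1, H, W)
    return gamma, beta

def sft(F, gamma, beta):
    """Spatial Feature Transform: SFT(F | gamma, beta) = gamma * F + beta,
    an element-wise, spatially varying affine modulation of features."""
    return gamma * F + beta

rng = np.random.default_rng(0)
K, C, H, W = 4, 8, 6, 6
psi = rng.random((K, H, W))
psi /= psi.sum(axis=0, keepdims=True)   # probability maps sum to 1
W_gamma = rng.standard_normal((C, K))
W_beta = rng.standard_normal((C, K))
gamma, beta = mapping_M(psi, W_gamma, W_beta)
F = rng.standard_normal((C, H, W))
out = sft(F, gamma, beta)               # modulated feature maps
```

Because (γ, β) vary per pixel, the same network can behave differently on building pixels and grass pixels of one image, which is exactly what a global conditioning vector cannot do.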
(Architecture figure: segmentation probability maps feed a shared Condition Network that produces SFT conditions; SFT layers, each predicting (γ_i, β_i), sit before convolutions inside the residual blocks of the SR branch, followed by upsampling and final convolutions.)
Loss function. Adversarial loss [1]: encourages the generator to produce images that reside on the manifold of natural images; generator and discriminator compete: min_θ max_η E_{y~p_HR}[log D_η(y)] + E_{x~p_LR}[log(1 − D_η(G_θ(x)))]. Perceptual loss [2]: use a pre-trained 19-layer VGG network (features before conv5_4) and optimize the super-resolution model in feature space: ||φ_VGG(ŷ) − φ_VGG(y)||_2^2. [1] I. Goodfellow et al. Generative adversarial nets. In NIPS, 2014. [2] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV, 2016.
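The two loss terms can be sketched in a few lines of numpy; this is a toy illustration (the function names and scalar feature tensors are assumptions, and real training uses a VGG network and a trained discriminator):

```python
import numpy as np

def adversarial_g_term(d_fake):
    """Generator term of the minimax objective, log(1 - D(G(x))),
    averaged over a batch of discriminator outputs in (0, 1)."""
    return float(np.mean(np.log(1.0 - d_fake)))

def perceptual_loss(feat_sr, feat_gt):
    """Squared L2 distance between VGG-style feature maps of the
    restored image and the ground truth: ||phi(y_hat) - phi(y)||_2^2."""
    return float(np.sum((feat_sr - feat_gt) ** 2))

d_fake = np.array([0.5, 0.5])        # an undecided discriminator
g_term = adversarial_g_term(d_fake)  # log(0.5), about -0.693
f = np.ones((4, 3, 3))
zero_loss = perceptual_loss(f, f)    # identical features give 0
```

Measuring distance in VGG feature space rather than pixel space is what lets the model trade exact pixel fidelity for perceptually convincing texture.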
Spatial condition. The modulation parameters (γ, β) have a close relationship with the probability maps P and contain spatial information. (Figure: input, P building map, P grass map, γ map of C6, β map of C7, LR patch, restored result.)
Delicate modulation. (Figure: LR patch; P plant map with γ map of C51 and β map of C1; P grass map with γ map of C14 and β map of C5; restored results.)
Results. (Figure: SRCNN, PSNR 24.83 dB; SRGAN, PSNR 23.36 dB; EnhanceNet, PSNR 22.71 dB; SFT-Net (ours), PSNR 22.90 dB; ground truth.)
Results. (Figure: MSE-based methods: bicubic, SRCNN, VDSR, LapSRN, DRRN, MemNet; GAN-based methods: EnhanceNet, SRGAN, SFT-Net (ours); ground truth.)
User study part I. Pairwise preference: ours 85% vs. EnhanceNet 15%; ours 67% vs. SRGAN 33%. Per-category preference for ours (%): sky 54.5, building 76.4, grass 68, animal 75, plant 56.4, water 68.7, mountain 65.7.
User study part II. (Figure: rank distribution, Rank-1 through Rank-4, for GT, ours, MemNet, and SRCNN.)
Impact of different priors. (Figure: the same LR input restored with the bicubic baseline and with each categorical prior: building, sky, grass, mountain, water, plant, animal; building prior highlighted.)
Impact of different priors. (Figure: another example with the bicubic baseline and each categorical prior; mountain prior highlighted.)
Other conditioning methods: input concatenation, compositional mapping [1], FiLM [2]. [1] S. Zhu, S. Fidler, R. Urtasun, D. Lin, and C. C. Loy. Be your own Prada: Fashion synthesis with structural coherence. In ICCV, 2017. [2] E. Perez, F. Strub, H. de Vries, V. Dumoulin, and A. Courville. FiLM: Visual reasoning with a general conditioning layer. arXiv preprint arXiv:1709.07871, 2017.
Comparison with other conditioning methods. (Figure: SFT-Net (ours) vs. input concatenation, compositional mapping, and FiLM.)
Robustness to out-of-category examples. (Figure: SRGAN vs. ours on two examples.)
Conclusion. Explore semantic segmentation maps as categorical priors for realistic texture recovery. Propose a novel Spatial Feature Transform layer to efficiently incorporate the categorical conditions into a CNN-based SR network. Extensive comparisons and a user study demonstrate the capability of SFT-Net in generating realistic and visually pleasing textures.
Crafting a Toolchain for Image Restoration by Deep Reinforcement Learning Ke Yu Chao Dong Liang Lin Chen Change Loy
Image Restoration. There are many individual tasks: denoising, deblurring, JPEG deblocking, super-resolution. Towards more complicated distortions: address multiple levels of degradation in one task, and address multiple individual tasks [1, 2, 3].
Image Restoration: A New Setting. Consider multiple distortions simultaneously. Real-world scenario: distortions introduced during image capture and storage. Synthetic setting: Gaussian blur, Gaussian noise, and JPEG compression combined into our new task.
Motivation. Can we use a single CNN to address multiple distortions? Inefficient: it would require a huge network to handle all the possibilities. Inflexible: all kinds of distorted images are processed with the same structure. Goal: find a more efficient and flexible approach that processes different distortions in different ways.
Method: Decision Making. Progressively restore the image quality; treat image restoration as a decision-making process: Blurry! → try a deblurring tool; Noisy! → try a denoising tool; Artifacts! → try a deblocking tool; Good enough :) → stop.
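The decision-making loop can be sketched in a few lines; the scalar "noise level", the hand-coded `agent`, and the single halving tool are toy stand-ins (the real agent is a learned policy over 12 CNN tools):

```python
def restore(image, agent, tools, max_steps=3):
    """Restoration as sequential decision making: the agent inspects the
    current image, picks a tool (or the stopping action), applies it,
    and repeats until it decides the result is good enough."""
    stop_action = len(tools)
    chain = []
    for _ in range(max_steps):
        action = agent(image)
        if action == stop_action:
            break
        image = tools[action](image)
        chain.append(action)
    return image, chain

# toy setup: one "denoising" tool halves a scalar noise level, and the
# agent stops once the noise falls to 0.1 or below
tools = [lambda x: x * 0.5]
agent = lambda x: 0 if x > 0.1 else 1   # action 1 == stop
out, chain = restore(0.4, agent, tools)
```

The toolchain (here `[0, 0]`) is formed dynamically per image, which is the flexibility a single fixed CNN lacks.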
Method: Overview. Our framework requires two components: a toolbox and an agent.
Method: Toolbox. We design 12 tools, each of which addresses a simple task; the tools are lightweight CNNs (e.g., a 3-layer CNN [4] or an 8-layer CNN).
Method: Agent. Use reinforcement learning to address tool selection. State: the current distorted image and the action at the last step. Action: one of the 12 tools, or stopping. Reward: PSNR gain at each step. Structure: a feature extractor on the input image and a one-hot encoder of the previous action feed an LSTM that outputs the action values.
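The reward signal is concrete enough to sketch directly; this minimal numpy version (function names are illustrative) computes the per-step reward as the PSNR gain over the previous image:

```python
import numpy as np

def psnr(x, ref, peak=1.0):
    """Peak signal-to-noise ratio in dB between image x and reference."""
    mse = np.mean((x - ref) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

def step_reward(prev_img, cur_img, ref):
    """Reward for one agent step: the PSNR gain achieved by the tool."""
    return psnr(cur_img, ref) - psnr(prev_img, ref)

ref = np.zeros(100)
noisy = np.full(100, 0.1)      # uniform error 0.1 -> PSNR = 20 dB
cleaner = np.full(100, 0.01)   # uniform error 0.01 -> PSNR = 40 dB
reward = step_reward(noisy, cleaner, ref)   # +20 dB gain
```

Because per-step PSNR gains telescope, the sum of rewards along a toolchain equals the total PSNR improvement of the final result, so maximizing return maximizes final quality.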
Method: Joint Training. Challenge of the middle state: after several steps of processing, none of the tools has seen the intermediate results it must operate on. Joint training: run forward and backward passes through entire toolchains (toolchain 1, toolchain 2, …), applying an MSE loss at the end of each toolchain, so that the tools adapt to intermediate results.
Experimental Results. Dataset: DIV2K [5]. Comparison with generic models for image restoration: VDSR [1] and DnCNN [3].
Experimental Results. Quantitative results on DIV2K: competitive performance and better generality. Runtime analyses: more efficient.
Experimental Results. Qualitative results on DIV2K. (Figure: mild (unseen), moderate, and severe (unseen) distortions; input, results after the 1st, 2nd, and 3rd steps, VDSR-s, and VDSR [1].)
Experimental Results. Qualitative results on real-world images. (Figure: input, results after the 1st, 2nd, and 3rd steps, and VDSR [1].)
Experimental Results Ablation Study Joint training Stopping action
Conclusion. Contributions: address image restoration in a reinforcement learning framework; propose joint training to cope with the middle processing state; dynamically formed toolchains perform competitively against human-designed networks with lower computational complexity. Future work: incorporate more tools (e.g., trained with a GAN loss); handle spatially variant distortions.
Thanks! Q & A
Reference [1] J. Kim, J. K. Lee, and K. M. Lee. Accurate image super-resolution using very deep convolutional networks. In CVPR, 2016. [2] Y. Tai, J. Yang, X. Liu, and C. Xu. MemNet: A persistent memory network for image restoration. In ICCV, 2017. [3] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang. Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising. TIP, 2017. [4] C. Dong, C. C. Loy, K. He, and X. Tang. Image super-resolution using deep convolutional networks. TPAMI, 38(2):295-307, 2016. [5] E. Agustsson and R. Timofte. NTIRE 2017 challenge on single image super-resolution: Dataset and study. In CVPR Workshop, 2017.