One Network to Solve Them All: Solving Linear Inverse Problems using Deep Projection Models [Supplemental Materials]

1. Network Architecture

We now describe the architecture of the networks used in the paper. We use the exponential linear unit (elu) [1] as the activation function. We also use virtual batch normalization (vbn) [6], where the reference batch size b_ref is equal to the batch size used for stochastic gradient descent. We weight the reference batch with b_ref / (b_ref + 1). We define some shorthands for the basic components used in the networks.

- conv(w, c, s): convolution with a w × w window, c output channels, and stride s.
- dconv(w, c, s): deconvolution (transpose of the convolution operation) with a w × w window, c output channels, and stride s.
- vbn: virtual batch normalization.
- bottleneck(same/half/quarter): bottleneck residual units [5] having the same, half, or one-fourth of the dimensionality of the input. Their block diagrams are shown in Figure 1.
- cfc: a channel-wise fully connected layer, whose output dimension is the same as its input dimension.
- fc(s): a fully connected layer with output size s.

To simplify the notation, we use the subscript "ve" on a component to indicate that it is followed by vbn and elu.

Projection network P. The projection network P is composed of one encoder network E and one decoder network, like a typical autoencoder. The encoder E projects an input image to a 1024-dimensional latent space, and the decoder projects the latent representation back to the image space. The architecture of E is as follows:

Input → conv(4, 64, 1)_ve → conv(4, 128, 1)_ve → conv(4, 256, 2)_ve → conv(4, 512, 2)_ve → conv(4, 1024, 2)_ve → cfc → conv(2, 1024, 1)_ve → (latent).   (1)

The decoder is a symmetric counterpart of the encoder:

latent → dconv(4, 512, 2)_ve → dconv(4, 256, 2)_ve → dconv(4, 128, 1)_ve → dconv(4, 64, 1)_ve → dconv(4, 3, 1) → (Output).   (2)

Image-space classifier D. As shown in Figure 3 of the paper, we use two classifiers: one operates in the image space R^d and discriminates natural images from the projection outputs; the other operates in the latent space of P, based on the hypothesis that after being encoded by E, a perturbed image and a natural image should already lie in the same set. For the image-space classifier D, we use the 50-layer architecture of [4] but with the bottleneck blocks suggested in [5]. The detailed architecture is as follows:

Input → conv(4, 64, 1) → bottleneck(half) → {bottleneck(same)}×3 → bottleneck(half) → {bottleneck(same)}×4 → bottleneck(half) → {bottleneck(same)}×6 → bottleneck(half) → {bottleneck(same)}×3 → fc(1) → (output),   (3)

where {·}×n means the building block is repeated n times.

Latent-space classifier D_l. The latent-space classifier D_l operates on the output of the encoder E. Since its input dimension is smaller than that of D, we use fewer bottleneck blocks than in D:

Input → {bottleneck(same)}×3 → bottleneck(quarter) → {bottleneck(same)}×2 → fc(1) → (output).   (4)
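For concreteness, the following PyTorch sketch instantiates the projection network of equations (1) and (2). It is an illustration under stated assumptions, not the authors' implementation: ordinary BatchNorm2d stands in for virtual batch normalization [6], 'same' padding is assumed for stride-1 layers and padding 1 for stride-2 layers, cfc is realized as an independent fully connected map per channel, and the third decoder layer is given stride 2 so that a 64 × 64 input round-trips to 64 × 64 (the printed strides alone do not invert the encoder).

import torch
import torch.nn as nn

def conv_ve(c_in, c_out, w, s):
    # conv(w, c, s) followed by vbn and elu (the "_ve" subscript in the text).
    # BatchNorm2d is only a stand-in for virtual batch normalization [6].
    pad = 'same' if s == 1 else 1
    return nn.Sequential(nn.Conv2d(c_in, c_out, w, stride=s, padding=pad),
                         nn.BatchNorm2d(c_out), nn.ELU())

def dconv_ve(c_in, c_out, w, s):
    # dconv(w, c, s) followed by vbn and elu. A stride-1 transposed convolution
    # is an ordinary convolution, so Conv2d is used there to keep shapes exact.
    core = (nn.Conv2d(c_in, c_out, w, padding='same') if s == 1 else
            nn.ConvTranspose2d(c_in, c_out, w, stride=s, padding=1))
    return nn.Sequential(core, nn.BatchNorm2d(c_out), nn.ELU())

class ChannelwiseFC(nn.Module):
    # cfc: an independent fully connected map per channel; output size = input size.
    def __init__(self, channels, spatial):
        super().__init__()
        d = spatial * spatial
        self.weight = nn.Parameter(torch.randn(channels, d, d) / d ** 0.5)
    def forward(self, x):
        n, c, h, w = x.shape
        out = torch.einsum('ncd,cde->nce', x.reshape(n, c, h * w), self.weight)
        return out.reshape(n, c, h, w)

class ProjectionNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(             # eq. (1), for 64 x 64 x 3 inputs
            conv_ve(3, 64, 4, 1), conv_ve(64, 128, 4, 1),
            conv_ve(128, 256, 4, 2), conv_ve(256, 512, 4, 2),
            conv_ve(512, 1024, 4, 2),             # -> 8 x 8 x 1024
            ChannelwiseFC(1024, 8), conv_ve(1024, 1024, 2, 1))
        self.decoder = nn.Sequential(             # eq. (2)
            dconv_ve(1024, 512, 4, 2), dconv_ve(512, 256, 4, 2),
            dconv_ve(256, 128, 4, 2),             # stride 2 assumed, see above
            dconv_ve(128, 64, 4, 1),
            nn.Conv2d(64, 3, 4, padding='same'))  # final dconv(4, 3, 1): no vbn/elu
    def forward(self, x):
        return self.decoder(self.encoder(x))

print(ProjectionNetwork()(torch.randn(2, 3, 64, 64)).shape)  # -> (2, 3, 64, 64)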

[The block diagrams of Figure 1 and the convergence plots of Figure 2 are omitted; only the captions are reproduced here. Each bottleneck unit is a three-layer 1 × 1 → 3 × 3 → 1 × 1 residual branch plus a shortcut connection.]

Figure 1: Block diagrams of the bottleneck components used in the paper. (a) bottleneck(same) preserves the dimensionality of its input by maintaining the same spatial dimensions and number of channels. (b) bottleneck(half) reduces the dimensionality by 2 by halving each spatial dimension and doubling the number of channels. (c) bottleneck(quarter) reduces the dimensionality by 4 by halving each spatial dimension while keeping the number of channels.

Figure 2: Comparison of ADMM convergence between (a) a projection network trained with non-differentiable D and D_l and (b) one in which the gradients of D and D_l are Lipschitz continuous. We perform inpainting with box size equal to 6 and a total of 10 boxes on 100 random images from the ImageNet dataset. For both cases, we set ρ = 0.5. We use transparent lines (α = 0.1) in order to show densities.
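To make the three variants concrete, here is a minimal sketch following the 1 × 1 → 3 × 3 → 1 × 1 residual layout of [5]. The internal channel widths, the activation placement, and the 1 × 1 shortcut convolutions on the downsampling paths are assumptions inferred from the caption, not a verbatim reading of the diagrams.

import torch
import torch.nn as nn

def _branch(c_in, c_mid, c_out, stride):
    # The 1x1 -> 3x3 -> 1x1 residual branch of a bottleneck unit [5].
    return nn.Sequential(
        nn.Conv2d(c_in, c_mid, 1, stride=stride), nn.ELU(),
        nn.Conv2d(c_mid, c_mid, 3, padding=1), nn.ELU(),
        nn.Conv2d(c_mid, c_out, 1))

class Bottleneck(nn.Module):
    # 'same':    N x N x C -> N   x N   x C
    # 'half':    N x N x C -> N/2 x N/2 x 2C  (dimensionality / 2)
    # 'quarter': N x N x C -> N/2 x N/2 x C   (dimensionality / 4)
    def __init__(self, c, mode='same'):
        super().__init__()
        if mode == 'same':
            self.branch, self.shortcut = _branch(c, c // 4, c, 1), nn.Identity()
        elif mode == 'half':
            self.branch = _branch(c, c // 2, 2 * c, 2)         # widths assumed
            self.shortcut = nn.Conv2d(c, 2 * c, 1, stride=2)
        else:  # 'quarter'
            self.branch = _branch(c, c // 4, c, 2)             # widths assumed
            self.shortcut = nn.Conv2d(c, c, 1, stride=2)
    def forward(self, x):
        return self.branch(x) + self.shortcut(x)

print(Bottleneck(64, 'half')(torch.randn(1, 64, 32, 32)).shape)  # -> (1, 128, 16, 16)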

[The result images of Figure 3 are omitted; only the caption is reproduced here.]

Figure 3: Results of 4× (first three rows) and 8× (last row) super-resolution, comparing bicubic interpolation, the method of Freedman and Fattal [3] (third column), and the proposed method (last column). All the images are from [3]. Note that none of the images, except for the one in the first row, has a ground truth. For the proposed method, we use the projection network trained on the ImageNet dataset and set ρ = 1.0.

2. More Implementation Details

We now describe the details of the training procedure on each dataset.

(i) MNIST dataset. The images in the dataset are 28 × 28 and grayscale. We train the projector and the classifier networks on the training set and test the results on the test set. Since the dataset is relatively simple, we remove the upper three layers from both D and D_l, and we do not perturb the images by smoothing. We use batches with 32 instances to train the networks.

(ii) MS-Celeb-1M dataset. The dataset contains a total of 8 million aligned and cropped face images of 100 thousand people from different viewing angles. We randomly select the images of 73,678 people as the training set and those of 25,923 people as the test set. We resize the images to 64 × 64. We use batches with 25 instances and train the network for 100,000 iterations.

(iii) ImageNet dataset. ImageNet contains 1.2 million training images and 100 thousand test images collected from the Internet. We randomly crop a square region based on the shorter side of each image and resize the crop to 64 × 64. We use batches with 25 instances and train the network for 68,000 iterations.

(iv) LabelMe dataset. The dataset also contains images from the Internet. We do not train a projection network on this dataset; we use its test set to quantitatively evaluate the projection network trained on the ImageNet dataset. Since the images in the dataset have very high resolution, we first resize them to 469 × 387, which is the average resolution of the images in the ImageNet dataset. We then follow the same procedure as that used with ImageNet to generate the test images.

Image perturbation. We generate perturbed images with two methods: adding Gaussian noise with spatially varying standard deviations, and smoothing the images. We generate the noise by multiplying randomly sampled standard Gaussian noise with a weight mask upsampled, with the bicubic algorithm, from a low-resolution mask whose entries are sampled from a uniform distribution over [−0.5, 0.5]. Note that the image intensities range over [−1, 1]. To smooth the images, we first downsample them and then use the nearest-neighbor method to upsample the results. The downsampling ratio is uniformly sampled from [0.2, 0.95]. After smoothing the images, we add the noise described above. We only use the smoothed images on the ImageNet and MS-Celeb-1M datasets.
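The perturbation procedure above can be summarized in a short sketch. The mask range [−0.5, 0.5], the bicubic mask upsampling, the nearest-neighbor upsampling after smoothing, and the ratio range [0.2, 0.95] come from the text; the low-resolution mask size, the bilinear downsampling filter, and the function name perturb are assumptions.

import torch
import torch.nn.functional as F

def perturb(images, smooth=True, mask_res=8):
    # images: (N, C, 64, 64) with intensities in [-1, 1].
    n, c, h, w = images.shape
    out = images
    if smooth:  # smoothing is used only on ImageNet and MS-Celeb-1M
        ratio = torch.empty(1).uniform_(0.2, 0.95).item()
        small = F.interpolate(out, scale_factor=ratio, mode='bilinear',
                              align_corners=False)  # downsampling filter assumed
        out = F.interpolate(small, size=(h, w), mode='nearest')
    # Spatially varying noise: a bicubic-upsampled weight mask in [-0.5, 0.5]
    # multiplies standard Gaussian noise. mask_res = 8 is an assumption.
    mask = torch.empty(n, 1, mask_res, mask_res).uniform_(-0.5, 0.5)
    mask = F.interpolate(mask, size=(h, w), mode='bicubic', align_corners=False)
    return out + mask * torch.randn(n, c, h, w)

perturbed = perturb(torch.rand(4, 3, 64, 64) * 2 - 1)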
3. Convergence of ADMM

Theorem 1 states a sufficient condition for the nonconvex ADMM to converge. Based on Theorem 1, we use exponential linear units as the activation functions in D and D_l and truncate their weights after each training iteration, in order for the gradients of D and D_l to be Lipschitz continuous. Even though Theorem 1 provides only a sufficient condition, in practice we observe an improvement in convergence. We conduct inpainting experiments on the ImageNet dataset using two projection networks: one trained with D and D_l using the smooth exponential linear units, and the other trained with D and D_l using the non-smooth leaky rectified linear units. Note that leaky rectified linear units are not differentiable everywhere and thus violate the sufficient condition provided by Theorem 1. Figure 2 shows the root mean square error of x − z, which is a good indicator of the convergence of ADMM, for the two networks. As can be seen, using leaky rectified linear units results in a higher and spikier root mean square error of x − z than using exponential linear units, which indicates a less stable ADMM process. This shows that following Theorem 1 can help the convergence of ADMM.

4. Super-resolution results

We compare the proposed method with that of Freedman and Fattal [3] on 4× and 8× super-resolution. The results are shown in Figure 3.

5. Denoising results

We compare the proposed method with the state-of-the-art denoising algorithm BM3D [2]. We add Gaussian random noise with different standard deviations σ (out of 255) to the test images, which were taken by the author with a cell phone camera. The value of σ for each image is provided to BM3D. For the proposed method, we let A = I, the identity matrix, and set ρ = 3σ/255. To perform the projection operation on the 384 × 512 images, we use the same projection network learned from the ImageNet dataset and apply it to 64 × 64 patches. As shown in Figure 4 and Figure 5, when σ is larger than 40, the proposed method consistently outperforms BM3D.

6. More examples

More results on the MS-Celeb-1M and ImageNet datasets are shown in Figure 6 and Figure 7, respectively.
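For reference, the x − z residual monitored in Figure 2 arises from an ADMM loop of roughly the following shape, which seeks min_x ||Ax − y||² with x constrained to the set that the learned projector P approximates. This is a schematic scaled-form ADMM assembled from the supplement's notation (A, y, ρ, x, z, P); the authors' exact splitting, initialization, and stopping rule may differ.

import torch

def admm_solve(A, y, P, rho, n_iters=100):
    # Schematic ADMM for min_x ||A x - y||^2 s.t. x in S, with the learned
    # network P standing in for the projection onto S (scaled dual form).
    d = A.shape[1]
    x = A.T @ y                            # initialization is an assumption
    z, u = x.clone(), torch.zeros(d)
    M = 2 * A.T @ A + rho * torch.eye(d)   # x-update system matrix
    for _ in range(n_iters):
        # x-update: least squares, (2 A^T A + rho I) x = 2 A^T y + rho (z - u)
        x = torch.linalg.solve(M, 2 * A.T @ y + rho * (z - u))
        z = P(x + u)                       # z-update: learned projection
        u = u + x - z                      # dual update
    rmse = torch.sqrt(torch.mean((x - z) ** 2))  # the quantity plotted in Figure 2
    return z, rmse

In the denoising setting of Section 5, A = I and the x-update reduces to the pointwise average x = (2y + ρ(z − u)) / (2 + ρ), with ρ = 3σ/255.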

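Since the projector is trained on 64 × 64 inputs, the 384 × 512 denoising images of Section 5 have to be processed in tiles. A minimal sketch is below; non-overlapping tiles and the helper name project_patchwise are assumptions, as the supplement does not say whether patches overlap or how seams are blended.

import torch

def project_patchwise(P, image, patch=64):
    # Apply a 64 x 64 projection network P to a larger (C, H, W) image,
    # tile by tile; H and W must be divisible by `patch` (384 x 512 qualifies).
    c, h, w = image.shape
    out = torch.empty_like(image)
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            tile = image[:, i:i + patch, j:j + patch].unsqueeze(0)  # (1, C, p, p)
            out[:, i:i + patch, j:j + patch] = P(tile).squeeze(0)
    return out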
[Per-image PSNR overlays and PSNR-vs-σ plots omitted.]

Figure 4: Comparison to BM3D on image denoising.

[Per-image PSNR overlays and PSNR-vs-σ plots omitted.]

Figure 5: Comparison to BM3D on image denoising.

[Per-image PSNR overlays and column labels omitted; each figure compares the ground truth / input, an ℓ1 prior, a specially trained network, and the proposed method on compressive sensing, pixelwise inpainting, and 2× / 4× super-resolution.]

Figure 6: More results on the MS-Celeb-1M dataset. The PSNR values are shown in the lower-right corner of each image. For compressive sensing, we test m/d = 0.1. For pixelwise inpainting, we drop 50% of the pixels and add Gaussian noise with σ = 0.1. We use ρ = 1.0 on both super-resolution tasks.

Figure 7: More results on the ImageNet dataset. Compressive sensing uses m/d = 0.1. For pixelwise inpainting, we drop 50% of the pixels and add Gaussian noise with σ = 0.1. We use ρ = 0.5 on inpainting and ρ = 0.5 on super-resolution.

References

[1] D.-A. Clevert, T. Unterthiner, and S. Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289, 2015.
[2] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian. BM3D image denoising with shape-adaptive principal component analysis. In Signal Processing with Adaptive Sparse Structured Representations (SPARS), 2009.
[3] G. Freedman and R. Fattal. Image and video upscaling from local self-examples. ACM TOG, 28(3):1–10, 2010.
[4] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[5] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In ECCV, 2016.
[6] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In NIPS, 2016.