Supplemental Material for End-to-End Learning of Video Super-Resolution with Motion Compensation

Osama Makansi, Eddy Ilg, and Thomas Brox
Department of Computer Science, University of Freiburg

1 Computation of PSNR values

For all our evaluation results, the reported PSNR values are computed using only the Y channel of the estimated YCbCr image. For RGB images, we first convert to the YCbCr color space and then compute the PSNR on the Y channel. For all experiments using the SRCNN [1] or VSR [5] architecture, we follow [5] and, for technical reasons, crop 12 pixels from the boundary of the estimated high-resolution images before computing the PSNR.
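
As a reference for this evaluation protocol, below is a minimal sketch that computes PSNR on the Y channel with the 12-pixel boundary crop. The helper names and the use of ITU-R BT.601 conversion coefficients (the convention of MATLAB's rgb2ycbcr) are our assumptions, not details taken from the paper.

```python
import numpy as np

def rgb_to_y(img):
    """Luma (Y) channel of an RGB image with values in [0, 255], using
    ITU-R BT.601 coefficients (the convention of MATLAB's rgb2ycbcr)."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return 16.0 + (65.481 * r + 128.553 * g + 24.966 * b) / 255.0

def psnr_y(estimate, ground_truth, border=12):
    """PSNR on the Y channel, cropping `border` pixels from each side
    as described above (border=12 for SRCNN/VSR experiments)."""
    y_est = rgb_to_y(estimate.astype(np.float64))
    y_gt = rgb_to_y(ground_truth.astype(np.float64))
    if border > 0:
        y_est = y_est[border:-border, border:-border]
        y_gt = y_gt[border:-border, border:-border]
    mse = np.mean((y_est - y_gt) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)
```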

2 Displacement magnitudes

We noted that the improvements from motion compensation are generally smaller on Myanmar than on Videoset4. In Table 1, we report the average motion magnitudes of the datasets; the displacements are indeed smallest in the Myanmar validation set.

Table 1: Average motion magnitudes, computed using FlowNet2 [4]. The numbers show that the Myanmar validation set has the smallest displacements.

Dataset              Avg. magnitude
Myanmar training     1.50 px
Myanmar validation   0.43 px
Videoset4            1.29 px

3 Video super-resolution with patch-based training

Using patch-based training, we retrain and evaluate SRCNN [1] and VSR [5] with different kinds of motion compensation. However, the resulting PSNR scores in Table 2 are all similar, and we conclude that motion compensation has no effect on Myanmar. We also evaluate on Videoset4 (Table 3), where we see a small improvement of 0.18 dB for FlowNet2 [4] and FlowNet2-SD [4].

Table 2: PSNR scores (dB) for patch-based video super-resolution on the Myanmar validation set. We retrained the architecture of [5] using only the center frames (replicated three times), the original images, and motion-compensated frames. All scores are nearly the same: on the Myanmar validation set, motion compensation has no effect over providing the original images or even only the center frame.

                               Tested on
Arch.  Trained on         only center  no warp  Drulea [2]  FlowNet2-SD [4]  FlowNet2 [4]
SRCNN  only center        31.62        -        -           -                -
VSR    only center        31.76        -        -           -                -
VSR    no warp            31.80        31.83    -           -                -
VSR    Drulea [2]         31.77        31.74    31.81       -                -
VSR    FlowNet2-SD [4]    31.75        31.75    31.79       31.77            -
VSR    FlowNet2 [4]       31.76        31.76    31.80       31.78            31.79

Table 3: Evaluation of the different retrained VSR models from Table 2 on Videoset4. Motion compensation shows a small performance improvement.

Motion compensation
during training and testing   Videoset4 PSNR (dB)
only center                   24.60
no warp                       24.59
Drulea [2]                    24.69
FlowNet2-SD [4]               24.77
FlowNet2 [4]                  24.77

Table 4: Settings for patch-based and image-based training. The settings are very similar, except that the number of pixels and the training time are larger for image-based training. Note that the number of pixels is further boosted by sliding the convolutions of the VSR architecture over the entire images with a stride of one.

Setting                 Patch-based   Image-based
Learning rate           1e-05         1e-05
Learning rate policy    fixed         multistep (multiplied by 0.5 every 50k iterations)
Momentum                0.9           0.9
Weight decay            0.0005        0.0004
Batch size              240           2
Input resolution        36×36         960×540
Image pixels in batch   311k          1M
Training iterations     200k          300k
Training time           7 hours       32 hours
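
The motion-compensated inputs referred to throughout Sections 3 and 4 are neighbor frames warped toward the center frame with a dense optical flow field (from Drulea [2] or FlowNet2 [4]). Below is a minimal sketch of such a backward-warping operation; the function name, the flow convention, and the border handling are our assumptions.

```python
import torch
import torch.nn.functional as F

def backward_warp(neighbor, flow):
    """Warp `neighbor` (N, C, H, W) toward the reference frame using the
    dense flow `flow` (N, 2, H, W) from reference to neighbor: the output
    samples the neighbor at x + flow(x), with bilinear interpolation and
    clamped borders."""
    n, _, h, w = flow.shape
    gy, gx = torch.meshgrid(torch.arange(h, dtype=flow.dtype),
                            torch.arange(w, dtype=flow.dtype), indexing="ij")
    sx = gx.unsqueeze(0) + flow[:, 0]  # absolute sampling positions, x
    sy = gy.unsqueeze(0) + flow[:, 1]  # absolute sampling positions, y
    # grid_sample expects (x, y) coordinates normalized to [-1, 1]
    grid = torch.stack((2 * sx / (w - 1) - 1, 2 * sy / (h - 1) - 1), dim=-1)
    return F.grid_sample(neighbor, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)
```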

4 Video super-resolution with image-based training

We perform the same set of experiments as in the previous section, but with image-based training. Comparing Table 2 to Table 5, we find that the PSNR values are generally about 1 dB higher. All training settings are given in Table 4. Image-based training processes more training data in total and sees a lot of similar data during training, because the convolutions slide over entire images. Motion compensation on Myanmar (Table 5) still seems to have little effect, while motion compensation on Videoset4 does yield better PSNR values (Table 6).

Table 5: PSNR scores (dB) on the Myanmar validation set (ours), architecture VSR. We now train the architecture of [5] by applying it as a convolution over complete images, and again evaluate using only the center frame, the original images, and differently motion-compensated frames. Scores are significantly better than with patch-based training, but motion compensation on the Myanmar validation set still has a negligible effect compared to training on the original frames.

                  Tested on
Trained on        only center  no warp  Drulea [2]  FlowNet2-SD [4]  FlowNet2 [4]
only center       32.41        -        -           -                -
no warp           32.38        32.55    -           -                -
Drulea [2]        32.37        32.26    32.60       -                -
FlowNet2-SD [4]   32.35        32.37    32.58       32.62            -
FlowNet2 [4]      32.37        32.36    32.61       32.61            32.63

Table 6: Evaluation of the different image-based models on Videoset4. Motion compensation also shows a performance improvement in this case.

Training input    Videoset4 PSNR (dB)
only center       24.66
no warp           24.79
Drulea [2]        24.91
FlowNet2-SD [4]   25.12
FlowNet2 [4]      25.13
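
To make the patch-based/image-based distinction concrete: the networks are fully convolutional, so weights trained on 36×36 patches (Table 4) can be slid over a full 960×540 image unchanged. The sketch below assumes a 9-5-5 SRCNN [1] layer configuration (whose single-frame variant has roughly the 57K parameters listed in Table 8) with the three input frames stacked along the channel dimension as in [5]; the exact hyper-parameters are an assumption, not the paper's definitive ones.

```python
import torch
import torch.nn as nn

class VSRNet(nn.Module):
    """SRCNN-style network [1] with three stacked input frames as in [5].
    Assumed 9-5-5 configuration; with frames=1 this has ~57K parameters,
    consistent with the SRCNN count in Table 8."""
    def __init__(self, frames=3):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(frames, 64, 9, padding=4), nn.ReLU(inplace=True),
            nn.Conv2d(64, 32, 5, padding=2), nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, 5, padding=2),
        )

    def forward(self, x):
        # x: (N, frames, H, W) bicubically upsampled, motion-compensated Y channels
        return self.body(x)

net = VSRNet()
patches = torch.randn(240, 3, 36, 36)  # patch-based batch (Table 4)
images = torch.randn(2, 3, 540, 960)   # image-based batch: same weights apply
print(net(patches).shape, net(images).shape)
```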

5 Joint Training

Since FlowNet2-SD [4] is completely trainable, we can refine the optical flow for the task of video super-resolution by training the whole network end-to-end with the super-resolution loss. This potentially allows the optical flow estimation to focus on the aspects that are most relevant for the super-resolution task. As initialization we took the VSR network trained on FlowNet2-SD [4] from the previous section and used the same settings as in Table 4, but cropped the images to a resolution of 256×256 to enable a batch size of 8. We then trained for 100k more iterations. The results are given in Table 7 and Figures 1(b) and 1(e). We cannot see the flow itself improve, but we see a small improvement of the PSNR value on Videoset4, and from the images one can observe that the ringing artifacts disappear.

Fig. 1: Example super-resolved images after training FlowNet2-SD [4] with VSR: (a, d) initialization, (b, e) after joint training, and (c, f) after joint training with a smoothness constraint.

Table 7: Evaluation of refining FlowNet2-SD [4] on the super-resolution task (PSNR in dB).

Test set                              Myanmar validation  VideoSet4
After initialization                  32.62               25.12
After joint training                  32.63               25.21
After joint training with smoothness  32.61               25.19

In Figure 1(e), one can observe that many image details turn into flow artifacts. This is due to the nature of the gradient through the warping operation: it corrects a flow vector only toward the best directly neighboring pixel, which in most cases is a local minimum. Following [3], we add a regularization loss that penalizes deviations from smoothness in the optical flow field, weighted by an image edge-aware term:

    L_R = \sum_{i,j} e^{-|\partial_x I_{i,j}|} \left( |\partial_x u_{i,j}| + |\partial_x v_{i,j}| \right) + e^{-|\partial_y I_{i,j}|} \left( |\partial_y u_{i,j}| + |\partial_y v_{i,j}| \right)    (1)

where I is the first image, (i, j) is a pixel location, and u, v are the x and y components of the flow field.
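
A minimal sketch of Eq. (1) as a training loss, assuming forward-difference image and flow gradients and a mean reduction over pixels; the function name and these discretization choices are ours, not the paper's.

```python
import torch

def edge_aware_smoothness(flow, image):
    """Edge-aware first-order smoothness of Eq. (1), in the spirit of [3].
    flow:  (N, 2, H, W) with channels (u, v); image: (N, 1, H, W) grayscale.
    Uses forward differences and averages over all pixels."""
    def dx(t): return t[..., :, 1:] - t[..., :, :-1]
    def dy(t): return t[..., 1:, :] - t[..., :-1, :]

    wx = torch.exp(-dx(image).abs())            # e^{-|d_x I|}, (N, 1, H, W-1)
    wy = torch.exp(-dy(image).abs())            # e^{-|d_y I|}, (N, 1, H-1, W)
    loss_x = (wx * dx(flow).abs()).sum(dim=1)   # weighted |d_x u| + |d_x v|
    loss_y = (wy * dy(flow).abs()).sum(dim=1)   # weighted |d_y u| + |d_y v|
    return loss_x.mean() + loss_y.mean()
```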

The results of training with this additional smoothness term are given in Table 7 and Figures 1(c) and 1(f). The flow shows fewer artifacts than in Figure 1(e), but compared to Figure 1(b) some very slight ringing artifacts still remain.

6 Evaluating Architectures and Datasets

In the first part of the paper, we used the architecture from Dong et al. [1], as adapted to video super-resolution by Kappeler et al. [5], together with the Myanmar training dataset. Here, we investigate the effect of the architecture and the training dataset. We extended the Myanmar training set with additional high-resolution videos downloaded from YouTube; the resulting dataset has 162k frames at a resolution of 960×540, and we name it MYT. We evaluate and compare SRCNN [1], FlowNet2-SD [4] (here used for super-resolution, not flow estimation), and the encoder-decoder part of the architecture from Tao et al. [6] (SPMC-ED) for single-image super-resolution on the old and the new dataset. The results are given in Table 8.

Table 8: PSNR values (dB) for different architectures and training datasets, tested for single-image super-resolution.

                            SRCNN [1]           SRCNN [1]   FlowNet2-SD [4]  SPMC-ED [6]
Trained on                  Myanmar (ours)      MYT         MYT              MYT
Myanmar validation (ours)   32.42               31.98       31.47            32.63
Videoset4                   24.63               24.70       24.93            25.07
Number of parameters        57K                 57K         14M              491K

One can observe that SRCNN [1] tends to overfit on the Myanmar dataset. The much deeper FlowNet2-SD [4] architecture performs worse on Myanmar, but generalizes better to Videoset4. The size of SPMC-ED [6] lies between the former two; it performs best on Myanmar and also generalizes best to Videoset4. Since it clearly gives better results than SRCNN [1], we also use it for the final network in the paper.

References

1. Dong, C., Loy, C.C., He, K., Tang, X.: Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 38(2), 295-307 (Feb 2016)
2. Drulea, M., Nedevschi, S.: Total variation regularization of local-global optical flow. In: IEEE Conference on Intelligent Transportation Systems (ITSC). pp. 318-323 (Oct 2011)
3. Godard, C., Mac Aodha, O., Brostow, G.J.: Unsupervised monocular depth estimation with left-right consistency. CoRR abs/1609.03677 (2016), http://arxiv.org/abs/1609.03677
4. Ilg, E., Mayer, N., Saikia, T., Keuper, M., Dosovitskiy, A., Brox, T.: FlowNet 2.0: Evolution of optical flow estimation with deep networks. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (July 2017)
5. Kappeler, A., Yoo, S., Dai, Q., Katsaggelos, A.K.: Video super-resolution with convolutional neural networks. IEEE Transactions on Computational Imaging (TCI) 2(2), 109-122 (June 2016)
6. Tao, X., Gao, H., Liao, R., Wang, J., Jia, J.: Detail-revealing deep video super-resolution. CoRR abs/1704.02738 (2017), http://arxiv.org/abs/1704.02738