arxiv: v3 [cs.lg] 30 Dec 2016

Similar documents
Video Pixel Networks. Abstract. 1. Introduction

Neural Machine Translation In Linear Time

PredCNN: Predictive Learning with Cascade Convolutions

Clustering and Unsupervised Anomaly Detection with l 2 Normalized Deep Auto-Encoder Representations

Machine Learning 13. week

Inception and Residual Networks. Hantao Zhang. Deep Learning with Python.

Fast scene understanding and prediction for autonomous platforms. Bert De Brabandere, KU Leuven, October 2017

LSTM and its variants for visual recognition. Xiaodan Liang Sun Yat-sen University

Channel Locality Block: A Variant of Squeeze-and-Excitation

Encoder-Decoder Networks for Semantic Segmentation. Sachin Mehta

Deep Learning in Visual Recognition. Thanks Da Zhang for the slides

Paper Motivation. Fixed geometric structures of CNN models. CNNs are inherently limited to model geometric transformations

COMP9444 Neural Networks and Deep Learning 7. Image Processing. COMP9444 c Alan Blair, 2017

arxiv: v1 [cs.cv] 14 Jul 2017

COMP 551 Applied Machine Learning Lecture 16: Deep Learning

One Network to Solve Them All Solving Linear Inverse Problems using Deep Projection Models

Deep Learning for Computer Vision II

Supplementary Material for: Video Prediction with Appearance and Motion Conditions

Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. By Joa õ Carreira and Andrew Zisserman Presenter: Zhisheng Huang 03/02/2018

Multi-Task Self-Supervised Visual Learning

In-Place Activated BatchNorm for Memory- Optimized Training of DNNs

Outline GF-RNN ReNet. Outline

arxiv: v2 [cs.cv] 14 May 2018

Know your data - many types of networks

Dynamic Filter Networks

Reduced-Gate Convolutional LSTM Using Predictive Coding for Spatiotemporal Prediction

Pixel-level Generative Model

arxiv: v1 [cs.cv] 17 Nov 2016

CNNS FROM THE BASICS TO RECENT ADVANCES. Dmytro Mishkin Center for Machine Perception Czech Technical University in Prague

A spatiotemporal model with visual attention for video classification

LSTM: An Image Classification Model Based on Fashion-MNIST Dataset

PIXELCNN++: IMPROVING THE PIXELCNN WITH DISCRETIZED LOGISTIC MIXTURE LIKELIHOOD AND OTHER MODIFICATIONS

ais Rheinische Friedrich-Wilhelms-Universität Bonn Rheinische Master thesis

Proceedings of the International MultiConference of Engineers and Computer Scientists 2018 Vol I IMECS 2018, March 14-16, 2018, Hong Kong

arxiv: v1 [stat.ml] 10 Dec 2018

Deep Tracking: Biologically Inspired Tracking with Deep Convolutional Networks

Content-Based Image Recovery

Depth from Stereo. Dominic Cheng February 7, 2018

Deep Learning with Tensorflow AlexNet

Learning visual odometry with a convolutional network

Supplementary: Cross-modal Deep Variational Hand Pose Estimation

arxiv: v1 [cs.cv] 20 Dec 2016

Hierarchical Video Frame Sequence Representation with Deep Convolutional Graph Network

Supplementary Material: Unsupervised Domain Adaptation for Face Recognition in Unlabeled Videos

arxiv: v2 [cs.cv] 16 Mar 2018

Deep Residual Learning

Classifying a specific image region using convolutional nets with an ROI mask as input

Convolutional Neural Networks. Computer Vision Jia-Bin Huang, Virginia Tech

DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution and Fully Connected CRFs

AAR-CNNs: Auto Adaptive Regularized Convolutional Neural Networks

arxiv: v2 [cs.ne] 10 Nov 2018

CS489/698: Intro to ML

MIXED PRECISION TRAINING: THEORY AND PRACTICE Paulius Micikevicius

Index. Umberto Michelucci 2018 U. Michelucci, Applied Deep Learning,

MoonRiver: Deep Neural Network in C++

AdaDepth: Unsupervised Content Congruent Adaptation for Depth Estimation

Show, Discriminate, and Tell: A Discriminatory Image Captioning Model with Deep Neural Networks

arxiv: v1 [cs.cv] 31 Mar 2016

Image Captioning with Object Detection and Localization

Thinking Outside the Box: Spatial Anticipation of Semantic Categories

SEMANTIC COMPUTING. Lecture 8: Introduction to Deep Learning. TU Dresden, 7 December Dagmar Gromann International Center For Computational Logic

Amortised MAP Inference for Image Super-resolution. Casper Kaae Sønderby, Jose Caballero, Lucas Theis, Wenzhe Shi & Ferenc Huszár ICLR 2017

Learning Spatio-Temporal Features with 3D Residual Networks for Action Recognition

Machine Learning. Deep Learning. Eric Xing (and Pengtao Xie) , Fall Lecture 8, October 6, Eric CMU,

Real-time Hand Tracking under Occlusion from an Egocentric RGB-D Sensor Supplemental Document

HENet: A Highly Efficient Convolutional Neural. Networks Optimized for Accuracy, Speed and Storage

Learning Deep Representations for Visual Recognition

Convolution Neural Network for Traditional Chinese Calligraphy Recognition

FULLY CONVOLUTIONAL STRUCTURED LSTM NETWORKS FOR JOINT 4D MEDICAL IMAGE SEGMENTATION

Image Question Answering using Convolutional Neural Network with Dynamic Parameter Prediction

Efficient Segmentation-Aided Text Detection For Intelligent Robots

Empirical Evaluation of RNN Architectures on Sentence Classification Task

Single Object Tracking with Organic Optic Attenuation

Convolutional Neural Networks: Applications and a short timeline. 7th Deep Learning Meetup Kornel Kis Vienna,

SEMI-SUPERVISED LEARNING WITH GANS:

arxiv: v1 [cs.cv] 16 Jul 2017

Gate-Variants of Gated Recurrent Unit (GRU) Neural Networks

Convolutional Sequence to Sequence Learning. Denis Yarats with Jonas Gehring, Michael Auli, David Grangier, Yann Dauphin Facebook AI Research

arxiv: v2 [cs.lg] 24 Apr 2017

ActiveStereoNet: End-to-End Self-Supervised Learning for Active Stereo Systems (Supplementary Materials)

Report: Privacy-Preserving Classification on Deep Neural Network

Towards End-to-End Audio-Sheet-Music Retrieval

Hello Edge: Keyword Spotting on Microcontrollers

Object Detection. CS698N Final Project Presentation AKSHAT AGARWAL SIDDHARTH TANWAR

3D Densely Convolutional Networks for Volumetric Segmentation. Toan Duc Bui, Jitae Shin, and Taesup Moon

Recurrent Neural Networks and Transfer Learning for Action Recognition

Weighted Convolutional Neural Network. Ensemble.

Image Captioning with Attention

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Arbitrary Style Transfer in Real-Time with Adaptive Instance Normalization. Presented by: Karen Lucknavalai and Alexandr Kuznetsov

Deep Learning For Video Classification. Presented by Natalie Carlebach & Gil Sharon

Study of Residual Networks for Image Recognition

End-To-End Spam Classification With Neural Networks

SYNTHETIC GRADIENT METHODS

R-FCN: Object Detection with Really - Friggin Convolutional Networks

Learning Deep Features for Visual Recognition

Supplementary material for Analyzing Filters Toward Efficient ConvNet

3D Randomized Connection Network with Graph-Based Inference

On the Importance of Normalisation Layers in Deep Learning with Piecewise Linear Activation Units

Bayesian model ensembling using meta-trained recurrent neural networks

Transcription:

Video Ladder Networks Francesco Cricri Nokia Technologies francesco.cricri@nokia.com Xingyang Ni Tampere University of Technology xingyang.ni@tut.fi arxiv:1612.01756v3 [cs.lg] 30 Dec 2016 Mikko Honkala Nokia Technologies Espoo, Finland mikko.honkala@nokia.com Emre Aksu Nokia Technologies emre.aksu@nokia.com Abstract Moncef Gabbouj Tampere University of Technology moncef.gabbouj@tut.fi We present the Video Ladder Network (VLN) for efficiently generating future video frames. VLN is a neural encoder-decoder model augmented at all layers by both recurrent and feedforward lateral connections. At each layer, these connections form a lateral recurrent residual block, where the feedforward connection represents a skip connection and the recurrent connection represents the residual. Thanks to the recurrent connections, the decoder can exploit temporal summaries generated from all layers of the encoder. This way, the top layer is relieved from the pressure of modeling lower-level spatial and temporal details. Furthermore, we extend the basic version of VLN to incorporate ResNet-style residual blocks in the encoder and decoder, which help improving the prediction results. VLN is trained in selfsupervised regime on the Moving MNIST dataset, achieving competitive results while having very simple structure and providing fast inference. 1 Introduction In recent years, several research groups in the deep learning and computer vision communities have targeted video prediction (Srivastava et al. [8], Brabandere et al. [1], Kalchbrenner et al. [5]). The task consists of providing a model with a sequence of past frames, and asking it to generate the next frames in the sequence (also referred to as the future frames). This is a very challenging task, as the model needs to embed a rich internal representation of the world and its physical rules, which are highly-structured. Machine learning -based models for video prediction are typically trained in self-supervised regime, where the ground-truth future frames are provided as targets. This is sometimes referred to also as unsupervised training, as no manually-labelled data are needed. Training a model to predict future video frames is beneficial for a number of applications. First of all, the learned internal representations can be used for extracting rich semantic features in both space and time, which can then be utilized for different supervised discriminative tasks such as action or activity recognition (as in Srivastava et al. [8]), semantic segmentation, etc. Also, the ability to predict (or imagine) the future is an important skill of humans, that it used for anticipating the consequence of actions in the real-world, thus allowing us to make decisions about which action to perform. Hence, these internal video representations may support robots in their action decision process. We propose the Video Ladder Network (VLN), a neural network architecture for generating future video frames efficiently. The proposed network is evaluated on a benchmark dataset, the Moving MNIST dataset, and is compared to most relevant prior works. We show that our model achieves competitive results, while having simple structure and providing fast inference. Equal contribution

The rest of the paper is organized as follows. Section 2 describes the proposed model. Section 3 provides experimental results. Finally, Section 4 draws some concluding remarks. 2 Proposed Model The Video Ladder Network is a fully convolutional neural encoder-decoder network augmented at all layers by specifically-designed lateral connections (see Figure 1). Figure 1: Overview of VLN. x t is a frame at time t. ˆx t+1 is a predicted frame at time t + 1 2.1 Encoder and Decoder The encoder and decoder of VLN are common fully-convolutional feedforward neural networks. In particular, the encoder consists of dilated convolutional layers (Yu and Koltun [10]), whereas the decoder uses normal convolutions. Both encoder and decoder use batch-normalization (BN) (Ioffe and Szegedy [4]) and leaky-relu activation function (except for the last decoder s layer, which uses a sigmoid activation). We have also experimented with a simple extension of the basic VLN model, by incorporating ResNet-style residual blocks (similar to those used in He et al. [2]) into the encoder and decoder. This extended model, named VLN-ResNet, is shown in Figure 2. VLN-ResNet has more layers than VLN but similar number of parameters. As can be seen in the figure, forward and backward signals can flow using multiple paths both in the encoder-decoder sides and in the lateral connections. 2.2 Recurrent Residual Blocks Valpola [9] showed that supplying the decoder with information from all encoder layers improves the learned image features and makes them more invariant to small changes in the input. VLN transfers this concept to unsupervised learning in the temporal domain, using both recurrent lateral connections and feedforward lateral connections. At each layer, a recurrent connection and a feedforward connection form a recurrent residual block, where the recurrent connection represents the residual, and the feedforward connection represents the skip connection. Having a separate recurrent residual block at different layers in the hierarchy relieves the higher layers from the pressure of modeling lower-level spatial and temporal details. The feedforward lateral connections supply the decoder only with information about the latest input sample, and we expect them to be especially useful for modeling static parts. Each recurrent connection consists of a convolutional Long Short-Term Memory (conv-lstm) (Shi et al. [7]), which is an extension of the fully-connected LSTM (Hochreiter and Schmidhuber [3]). In particular, conv-lstms replace the matrix multiplications with convolution operations and thus allow 2

Figure 2: Overview of VLN-ResNet, which incorporates ResNet-style residual blocks in the encoder and decoder. for preserving the spatial information. At each time-step t and for each encoder layer l, the input gate i l t, forget gate f l t, output gate o l t and the candidate for the cell state c l t of the convolutional-lstm are computed as follows: i l t = σ ( z l t W l zi + h l t 1 W l hi + b l i), (1) f l t = σ ( z l t W l zf + h l t 1 W l hf + b l f ), (2) o l t = σ ( z l t W l zo + h l t 1 W l ho + b l o), (3) c l t = tanh ( z l t W l z c + h l t 1 W l h c + b l c), (4) where zt l are the input feature maps from the l-th convolution layer, h l t 1 is the previous hidden state, Wzi l, W zf l, W zo, l Wz c l, W hi l, W hf l, W ho l, W h c l are convolutional-kernel tensors, bl i, bl f, bl o, bl c are bias terms, σ and tanh are the sigmoid and hyperbolic tangent non-linearities, respectively, and is the convolution operator. Then, the hidden state h l t, output by a recurrent lateral connection at layer l, is computed as follows: h l t = o l t tanh ( c l t i l t + c l t 1 f l t), (5) where represents element-wise multiplication. The output of the recurrent connection h l t, of the feedforward connection zt l and of the upper decoder layer z t+1 l+1 are merged as follows: ẑt+1 l = LReLU (( l+1 LReLU (( z t+1, ) ) ) t) hl W l h, z l t W l z, (6) where LReLU is the leaky-relu non-linearity, (, ) denotes channel-wise concatenation, W l h and W l z are (1, 1) convolutional-kernel tensors. No batch-normalization is applied to the conv-lstm. 3

3 Experiments We evaluate the Video Ladder Networks on the Moving MNIST dataset, that we generate using the code provided by the authors 1. This dataset is derived from the popular MNIST dataset, which contains 60000 training samples and 10000 testing samples. Validation samples are selected by picking out 20% training samples, while preserving the percentage of samples for each class. We produce the Moving MNIST samples from corresponding MNIST partitions on the fly, generating 10k train samples and 1k validation samples per epoch. Furthermore, evaluation is performed on the provided test-set of 10k samples. Each video sequence is generated by moving two digits within the video frames. Each digit is randomly chosen and is initially positioned randomly inside a patch. It is then moved along a random direction with constant speed, bouncing off when hitting the frame s boundaries. Digits in the same frame move independently and are allowed to overlap. Each video sequence consists of 20 frames, of which the first 10 represent past frames and are used as input frames to the model, whereas the following 10 frames represent the future ground-truth frames to be predicted. Each frame has size 64 64. 3.1 Model Architectures In VLN, the encoder consists of 3 dilated convolutional layers with stride (2, 2), dilations 1, 2, 4, and number of filters 32, 64, 96. The decoder has 3 convolutional layers preceded by up-sampling. Empirically, we observed that using dilation at the decoder does not bring any benefit. All layers use kernel size (3, 3). The convolutional LSTMs use same number of channels as in their input. In VLN-ResNet (shown in Figure 2), both the encoder and decoder consist of 3 residual blocks, where each block contains two convolutional layers with no stride, kernel size (3, 3), and number of filters 28, 28, 58, 58, 90, 90. The encoder uses dilated convolutions, with dilation rates 1, 2, 2, 4, 4, 8, followed by element-wise sum of the skip connection and by a strided convolution for halving the resolution. At the decoder side, each residual block starts by up-sampling. We compare our models to most relevant prior art and to two baselines, VLN-BL and VLN-BL-FF. Specifically, VLN-BL has only one convolutional-lstm at the top layer (with 128 channels to match the number of parameters of VLN) and no feedforward lateral connections. VLN-BL-FF includes also feedforward connections at all layers. In order to improve the computational efficiency of the model, we decided to extend the effective receptive field of the model by reducing the resolution via striding when going up in the model hierarchy. If computational efficiency is not an issue, it is possible to use larger dilations and/or more depth and preserve the resolution at all layers as in Kalchbrenner et al. [5]. This is left as future work. 3.2 Training and Evaluation We train and evaluate VLN using the binary cross-entropy loss, by interpreting the grayscale targets as probabilities as in Kalchbrenner et al. [5]. Ideally, the loss used for training should be averaged over 10 predictions (as in prior works) but, due to limited computational resources, we average it over 5 predictions. This is a disadvantage with respect to prior works. However, the losses used for model evaluation are averaged over 10 frames, in order to fairly compare to the prior art. We train all our models for 1700 epochs, using RMSprop with initial learning rate 0.0001 (except for VLN-ResNet, which is trained only for 1300 epochs, with initial learning rate 0.0005). We use a windowing approach for training and prediction. The input window has length 10. The latest prediction ˆx t+1 is fed back to input and the window is shifted, so that the used ground-truth is only x nt+1:10 (n t is the number of fed back predictions at time t). When shifting the window, at training phase the conv-lstm hidden state is preserved, whereas at evaluation phase the state is reset. See Figure 3 for a plot of the test-set loss for each time-step. As can be seen, losses for the last 5 predictions have more variance and are more far apart from each other, compared to the first 5 predictions. This causes the average test loss to have high variance too. 4

600 P 1 P 6 Mean P 2 P 7 P 3 P 8 P 4 P 9 P 5 P 10 500 400 Loss 300 200 100 0 200 400 600 800 1000 1200 1400 1600 1800 Epoch Figure 3: Test-set loss for each time-step. Model Test loss Depth # Parameters conv-lstm (Shi et al. [7]) 367.1 Unknown 7.6M fclstm (Srivastava et al. [8]) 341.2 2 143M DFN (Brabandere et al. [1]) 285.2 Unknown 0.6M VPN-BL (Kalchbrenner et al. [5]) 110.1 81 30M VLN-BL 222.3 7 1.2M VLN-BL-FF 220.1 8 1.2M VLN 207.0 9 1.2M VLN-ResNet 187.7 15 1.3M Table 1: Results on test-set ( means estimated from available information) 3.3 Results The results in Table 1 show that VLN and VLN-ResNet outperform all other models except for the baseline of Video Pixel Networks (VPN-BL), which has much more parameters and depth. Our baselines, VLN-BL and VLN-BL-FF, perform well too and we attribute their competitive results to the use of convolutional-lstm, batch-normalization, and our windowing training and prediction approach. By comparing VLN-BL and VLN-BL-FF, it is worth noticing that feedforward connections bring only a marginal improvement on the Moving MNIST dataset, providing evidence that the main benefit of VLN comes from the recurrent lateral connections. Figure 4 shows four randomly drawn test-set samples, where the predictions are generated by VLN-ResNet. We left out from the comparison two other works which are not comparable to VLN. Specifically, in the VPN model described in Kalchbrenner et al. [5], each RGB frame is predicted pixel by pixel, which is very slow as it uses about 80 layers and needs N N 3 iterations (N N is the spatial resolution). The model described in Patraucean et al. [6] encodes motion explicitly. Our model has simple structure, faster inference (10 iterations) and does not encode motion explicitly. Furthermore, VLN s recurrent residual blocks can be easily included in other encoder-decoder models, such as the VPN baseline. 1 http://www.cs.toronto.edu/~nitish/unsupervised_video/ 5

Figure 4: Four randomly selected test-set samples. First row of each sample: the first 10 frames are the input past frames and the remaining 10 frames are the ground-truth future frames. Second row of each sample: future frames predicted by VLN-ResNet. 4 Conclusions We proposed the Video Ladder Network, a novel neural architecture for generating future video frames, conditioned on past frames. VLN is characterized by its lateral recurrent residual blocks. These are lateral connections connecting the encoder and the decoder, and which summarize both spatial and temporal information at all layers. This way, the top layer is relieved from the pressure of modeling all spatial and temporal details, and the decoder can generate future frames by using information from different semantic levels. Furthermore, we proposed an extension of the VLN which uses ResNet-style encoder and decoder. The proposed models were evaluated on the Moving MNIST dataset, comparing them to prior works, including the current state-of-the-art model (Kalchbrenner et al. [5]). We showed that VLN and VLN-ResNet achieve very competitive results, while having simple structure and providing fast inference. References [1] Bert De Brabandere, Xu Jia, Tinne Tuytelaars, and Luc Van Gool. Dynamic filter networks. In Advances in Neural Information Processing Systems, 2016. [2] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, 2016. [3] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8): 1735 1780, 1997. [4] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, 2015. [5] Nal Kalchbrenner, Aaron van den Oord, Karen Simonyan, Ivo Danihelka, Oriol Vinyals, Alex Graves, and Koray Kavukcuoglu. Video Pixel Networks. arxiv preprint arxiv:1610.00527, 2016. 6

[6] Viorica Patraucean, Ankur Handa, and Roberto Cipolla. Spatio-temporal video autoencoder with differentiable memory. In International Conference on Learning Representations (ICLR) Workshop, 2016. [7] Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai kin Wong, and Wang chun Woo. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Advances in Neural Information Processing Systems, pages 802 810, 2015. [8] Nitish Srivastava, Elman Mansimov, and Ruslan Salakhutdinov. Unsupervised learning of video representations using LSTMs. In International Conference in Machine Learning, 2015. [9] Harri Valpola. From neural PCA to deep unsupervised learning. Advances in Independent Component Analysis and Learning Machines, pages 143 171, 2015. [10] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. In International Conference on Learning Representations, 2016. 7