
MOTION ESTIMATION USING CONVOLUTIONAL NEURAL NETWORKS

Mustafa Ozan Tezcan
Boston University
Department of Electrical and Computer Engineering
8 Saint Mary's Street, Boston, MA 02215
www.bu.edu/ece

Dec. 19, 2017

Technical Report No. ECE-2017-04

Contents

1 Introduction
2 Literature Review
3 Problem Definition
4 Proposed Solution: FlowNet
  4.1 Dataset
  4.2 Network Architectures
5 Experimental Results and Discussion
6 Challenges and Future Work

List of Figures

1  Convolution + ReLU
2  Convolution + ReLU + Pooling
3  Deconvolution. Note that all deconvolution layers use 2×2 interpolation and a 3×3 convolution kernel.
4  Optical Flow Prediction
5  Concatenation
6  Motion estimation via supervised learning
7  Flying Chairs Dataset. First two columns show frame pairs and the last column shows the color coded optical flows (taken from [1])
8  Color coding of the motion vectors (taken from [1])
9  Overview of FNS
10 Details of FNS
11 FNS with detailed refinement network (RN)
12 Details of FNC
13 Motion estimation results for an example test image
14 MCPE results for an example test image
15 Motion estimation results for an example test image
16 MCPE results for an example test image
17 Motion estimation results for Miss America - neighboring frames
18 MCPE results for Miss America - neighboring frames
19 Motion estimation results for Miss America - 4 frames apart
20 MCPE results for Miss America - 4 frames apart
21 MCPE results for an example video - neighboring frames
22 MCPE results for an example video - 5 frames apart
23 MCPE results for an example video - 10 frames apart
24 MCPE results for an example video - 30 frames apart
25 MCPE results for an example video - frames apart

List of Tables

1 Comparison of EPE results using my network with those from the original paper [1]

Diagram Shortcuts

Since convolutional neural networks contain many repeated operations, I used shortcuts for the common blocks in the diagrams. Figures 1-4 show these shortcuts.

Figure 1: Convolution + ReLU
Figure 2: Convolution + ReLU + Pooling
Figure 3: Deconvolution. Note that all deconvolution layers use 2×2 interpolation and a 3×3 convolution kernel.
Figure 4: Optical Flow Prediction.

Figure 5: Concatenation

1 Introduction

In recent years, deep learning has been applied to many image processing problems, such as image classification, object recognition and semantic segmentation, and has achieved state-of-the-art results. However, most of those methods are single-image based. When it comes to video processing, deep learning based methods are not so popular yet. In this project, I used deep learning to design a motion estimation network between two frames. This problem has been addressed in several deep learning papers. One of the best ones is FlowNet [1, 2]. FlowNet is a convolutional neural network which takes an image pair as input and tries to predict the optical flow, which is given as ground truth. However, there is no ground truth optical flow for real images, and the amount of synthetic training data is not sufficient. So, the authors designed a new dataset, called Flying Chairs [1], which consists of 22,872 image pairs with their optical flows. In the dataset, they used real images as backgrounds and synthetic chairs as moving objects. The network was trained on this dataset and then fine-tuned on the Sintel [3] and KITTI [4] datasets.

2 Literature Review

Model-based approaches dominated the motion estimation problem for years. Some of the most popular motion estimation algorithms are block matching, phase correlation, Lucas-Kanade and Horn-Schunck (HS) methods [5]. These methods use different mathematical formulations to model the motion. Even though they are very successful in some cases, their performance is limited by the assumptions of the model. For example, most model-based approaches start by modeling motion in the continuous domain, so they assume small motion between the input frames. To overcome these model limitations, deep learning based algorithms have been proposed [1, 6, 7]. Even though these methods do not have model limitations, they suffer from data limitations. In this project, I investigated FlowNet [1] and compared it with HS.

3 Problem Definition

I defined motion estimation as a supervised learning problem. Given a set of frame pairs and their optical flows, we can design a learning-based method for motion estimation. Figure 6 illustrates this approach. Let $\{F_1^k, F_2^k \in \mathbb{R}^{w \times h \times 3},\ OF^k \in \mathbb{R}^{w \times h \times 2}\}_{k=1}^{N}$ represent our dataset, where $F_1^k$ and $F_2^k$ represent the frame pair and $OF^k$ is the optical flow from $F_1^k$ to $F_2^k$. We aim to design an algorithm $H(F_1^k, F_2^k) = \hat{OF}^k$ to predict the optical flow of an unseen frame pair. To measure the performance of the algorithm, we can use the end point error (EPE) or the motion compensated prediction error (MCPE), which are defined below.

Figure 6: Motion estimation via supervised learning

$$EPE = \sum_{k=1}^{N} \left\| \hat{OF}^k - OF^k \right\|_F^2 \qquad (1)$$

where $\|X\|_F$ is the Frobenius norm, defined as the square root of the sum of the squared elements of $X$.

$$MCPE = \sum_{k=1}^{N} \left\| \mathrm{warp}(F_1^k, \hat{OF}^k) - F_2^k \right\|_F^2 \qquad (2)$$

Here $\mathrm{warp}(F_1^k, \hat{OF}^k)$ returns the warped version of $F_1^k$ with respect to the motion vectors in $\hat{OF}^k$, using linear interpolation.
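To make Eqs. (1) and (2) concrete, here is a minimal NumPy/SciPy sketch of the two error measures for a single frame pair. The function names, the flow-component convention and the backward-warping approximation are my own illustration and are not taken from the report's actual code.

import numpy as np
from scipy.ndimage import map_coordinates

def epe(of_pred, of_gt):
    # Eq. (1) for one pair: squared Frobenius norm of the flow difference.
    return np.sum((of_pred - of_gt) ** 2)

def warp(frame1, flow):
    # Warp frame1 (h, w, 3) according to flow (h, w, 2) with bilinear (order-1)
    # interpolation. flow[..., 0] is assumed horizontal and flow[..., 1] vertical.
    # The flow is treated as a backward map defined on the target grid, which is
    # a common approximation when only the flow from frame1 to frame2 is given.
    h, w = frame1.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    coords = [ys - flow[..., 1], xs - flow[..., 0]]
    channels = [map_coordinates(frame1[..., c], coords, order=1, mode='nearest')
                for c in range(frame1.shape[2])]
    return np.stack(channels, axis=-1)

def mcpe(frame1, frame2, of_pred):
    # Eq. (2) for one pair: error of the motion-compensated prediction of frame2.
    return np.sum((warp(frame1, of_pred) - frame2) ** 2)

Summing epe(...) and mcpe(...) over all frame pairs gives the dataset-level quantities in Eqs. (1) and (2).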

4 Proposed Solution: FlowNet

In this project, I replicated a simpler version of the work described in [1]. The authors made two contributions. Their first contribution is the design of a new dataset called Flying Chairs, and their second is the implementation of two different convolutional neural networks (CNNs) for motion estimation.

4.1 Dataset

There are several existing datasets for the evaluation of motion estimation algorithms [3, 4]; however, they do not contain many frame pairs. For optimizing the weights of a CNN, one needs a large dataset. One option is to create a dataset of real images with optical flow ground truth, but this is a very complicated problem. So the authors decided to design a synthetic dataset called Flying Chairs [1]. They used real background images and added synthetic chair images on top of those backgrounds. Then they applied randomly generated affine motion to the backgrounds and 3D affine motion to the individual chair objects to create the optical flow field and the frame pairs. Using this approach, they created 22,872 frame pairs with their optical flow ground truths. The size of the frames is 384×512. Figure 7 shows some examples from the dataset.

Figure 7: Flying Chairs Dataset. First two columns show frame pairs and the last column shows the color coded optical flows (taken from [1])

For visualization of the optical flow, I used the color coding used in [3]. The direction of the motion is represented by different colors, and the magnitude is represented by the intensity of that color. Figure 8 shows the color coding.

4.2 Network Architectures

FlowNet has two different variants. The first one is called FlowNet-Simple (FNS). It takes the concatenation of two frames as an input and predicts the optical flow matrix. So for 3-channel frames (i.e. RGB), it takes a W × L × 6 dimensional array as input and produces a W × L × 2 dimensional output for the vertical and horizontal components of the optical flow. Figure 9 shows the overview of the FNS network. The second variant is called FlowNet-Correlation (FNC). It takes the two frames as two separate inputs of size W × L × 3 and combines their embeddings in the middle layers using a correlation function. In my project, I did not use the original networks, since they were too deep and would have taken too long to train. Instead, I decimated the input frames to 192 × 256 and took 48 × 48 random crops from the decimated frames. So my networks are trained with 48 × 48 × 3 frame pairs and their 48 × 48 × 2 optical flow matrices.
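As an illustration of the input and output shapes described above, the following sketch assembles one FNS training sample. The helper name make_fns_sample and the use of NumPy are my own choices, not the report's actual preprocessing code.

import numpy as np

def make_fns_sample(frame1, frame2, flow, crop=48, rng=None):
    # frame1, frame2: (H, W, 3) decimated frames; flow: (H, W, 2) ground truth flow.
    rng = rng or np.random.default_rng()
    h, w = frame1.shape[:2]
    top = rng.integers(0, h - crop + 1)
    left = rng.integers(0, w - crop + 1)
    window = (slice(top, top + crop), slice(left, left + crop))
    x = np.concatenate([frame1[window], frame2[window]], axis=-1)  # (48, 48, 6) network input
    y = flow[window]                                               # (48, 48, 2) target flow
    return x, y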

Figure 8: Color coding of the motion vectors (taken from [1])

Figure 9: Overview of FNS

Figure 10: Details of FNS

4.2.1 FlowNet-Simple (FNS)

Figure 10 shows the details of FNS. The architecture is very similar to an autoencoder. The idea is to first extract low dimensional features from the image pair using a fully convolutional network (no fully connected layers) and then predict the optical flow fields from these features using a fully deconvolutional network. I will call the convolutional part the Feature Extractor Network (FEN) and the deconvolutional part the Refinement Network (RN). FEN is very similar to classification networks, as it uses convolution, pooling, batch normalization and nonlinear activation layers. The size of the convolution filters starts at 7×7 and decreases to 3×3. RN uses the compressed information from FEN to produce a motion prediction. Since the spatial dimensions are decreased in FEN, RN uses deconvolution layers to increase the spatial dimensions. Figure 11 shows a more detailed version of FNS including the refinement network.

In early deep learning research, it was observed that the initial layers of a network learn very fundamental image features, such as edge representations, and those kinds of features are used frequently in many computer vision tasks, including motion estimation. So, it is a good idea to reuse these feature representations in the RN part of the network to get a better motion estimate. Note that this approach is not applicable to classification CNNs, since the sizes of the last layers are smaller than the first ones. Every layer of RN is concatenated with a layer of the same size from FEN and also with the optical flow estimate from the previous layer. For example, let's look at the layer after the Deconv2 operation. Deconv2 produces an output with 64 channels, and this output is concatenated with the output of Conv2 1, which has 128 channels, and also with the output of Pred3, which has 2 channels. So, overall, this layer has 64 + 128 + 2 = 194 channels. As a result, RN produces several predictions of the optical flow at different sizes. Predictions are computed at $(48/2^n) \times (48/2^n)$ resolutions for $n = 0, 1, 2, 3$. So, in a way, RN uses an approach similar to pyramid networks.
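The skip-connection scheme described above can be sketched as one refinement stage in PyTorch. This is only a sketch: the class and layer names are hypothetical, the upsampling of the coarser flow estimate is my assumption, and the channel counts in the comments follow the Deconv2 / Conv2 1 / Pred3 example from the text (64 + 128 + 2 = 194).

import torch
import torch.nn as nn

class RefinementStage(nn.Module):
    # One RN stage: upsample, fuse with an FEN feature map of the same size
    # and the previous flow estimate, then predict a flow field.
    def __init__(self, in_ch, deconv_ch, skip_ch):
        super().__init__()
        # "Deconvolution" as in Figure 3: 2x2 interpolation followed by a 3x3 convolution.
        self.deconv = nn.Sequential(
            nn.Upsample(scale_factor=2, mode='nearest'),
            nn.Conv2d(in_ch, deconv_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True))
        self.up_flow = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)
        # Optical flow prediction (Figure 4's shortcut) from the concatenated features.
        self.pred = nn.Conv2d(deconv_ch + skip_ch + 2, 2, kernel_size=3, padding=1)

    def forward(self, x, skip, prev_flow):
        up = self.deconv(x)                          # e.g. 64 channels (Deconv2)
        flow_up = self.up_flow(prev_flow)            # 2 channels (upsampled coarser prediction)
        cat = torch.cat([up, skip, flow_up], dim=1)  # 64 + 128 + 2 = 194 channels
        return cat, self.pred(cat)                   # features for the next stage and a flow estimate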

Figure 11: FNS with detailed refinement network (RN)

4.2.2 FlowNet-Correlation (FNC)

In FNC, the authors split the FEN part into two siamese networks (sharing the same weights), one per frame, and used a correlation layer to combine them. This procedure is shown in Figure 12. The correlation layer compares different patches of the feature representations of the input frames. Given two feature maps $f_1, f_2 : \mathbb{R}^2 \to \mathbb{R}^c$, where $c$ is the number of channels and $f_i(x)$ is the $c$-dimensional feature vector of the $i$-th frame's feature representation at location $x = [x_1, x_2]^T$ (in MATLAB notation, $f_i(x_1, x_2) = F_i(x_1, x_2, :)$ where $F_i \in \mathbb{R}^{w \times h \times c}$ is the feature representation of the $i$-th frame),

$$\mathrm{corr}[x, d] = \sum_{o \in [-k,k] \times [-k,k]} \langle f_1(x + o),\, f_2(x + o + d) \rangle \qquad (3)$$

where $d = [d_1, d_2]^T$ is the displacement, $K = 2k + 1$ is the window size, and $\langle \cdot, \cdot \rangle$ denotes the inner product. In this project, I used $k = 0$ (single-pixel correlation). Note that this computation needs to be done for every $x$ and $d$ pair, which results in $w^2 h^2$ elements in total, where $w$ and $h$ are the spatial dimensions of the feature representation. However, it is possible to reduce the number of elements by limiting the amount of motion. So, I calculated Eq. (3) only for $|d_1|, |d_2| \le 7$. Then I reshaped the output into a 3D matrix with $((2 \cdot 7) + 1)^2 = 225$ channels ($\mathbb{R}^{w \times h \times 15 \times 15} \to \mathbb{R}^{w \times h \times 225}$).
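A minimal NumPy sketch of the correlation layer of Eq. (3), assuming k = 0 (single-pixel correlation) and |d1|, |d2| ≤ 7 as in the text; the function name and the zero-padding at the borders are my own choices.

import numpy as np

def correlation(f1, f2, max_disp=7):
    # f1, f2: (h, w, c) feature maps; returns an (h, w, (2*max_disp+1)**2) volume,
    # i.e. 225 channels for max_disp = 7.
    h, w, _ = f1.shape
    d = max_disp
    f2p = np.pad(f2, ((d, d), (d, d), (0, 0)))  # zero-pad so shifted windows stay valid
    out = np.empty((h, w, (2 * d + 1) ** 2), dtype=f1.dtype)
    idx = 0
    for d1 in range(-d, d + 1):
        for d2 in range(-d, d + 1):
            shifted = f2p[d + d1:d + d1 + h, d + d2:d + d2 + w, :]
            # Inner product of the two c-dimensional feature vectors at every location x.
            out[..., idx] = np.sum(f1 * shifted, axis=-1)
            idx += 1
    return out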

Figure 12: Details of FNC

4.2.3 Loss Function

As explained in Section 4.2.1, the refinement network computes several motion predictions at different spatial dimensions. Of course, only the last one can be used as the overall motion prediction, since it is the only one with the same size as the input frames. However, when defining the loss function we have to use all of the predictions to make sure that they all give a reasonable motion prediction (i.e. account for motion features at different scales). So the overall loss function of the network is formulated as:

$$L = \sum_{k=1}^{N} \left( \left\| \hat{OF}^k - OF^k \right\|_F^2 + \left\| \hat{OF}_2^k - S_2(OF^k) \right\|_F^2 + \left\| \hat{OF}_3^k - S_4(OF^k) \right\|_F^2 + \left\| \hat{OF}_4^k - S_8(OF^k) \right\|_F^2 \right) \qquad (4)$$

where $S_m(\cdot)$ denotes $m \times m$ decimation.
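A short PyTorch sketch of the multi-scale loss in Eq. (4). It assumes the refinement network returns its four flow predictions as a list ordered from full to 1/8 resolution, and it approximates the decimation S_m by average pooling; both of these are assumptions on my part rather than the report's exact implementation.

import torch
import torch.nn.functional as F

def multiscale_loss(preds, flow_gt):
    # preds: [OF_hat, OF_hat_2, OF_hat_3, OF_hat_4] at 48/2**n resolution for n = 0..3;
    # flow_gt: (N, 2, 48, 48) ground-truth flow.
    loss = 0.0
    for n, pred in enumerate(preds):
        target = flow_gt if n == 0 else F.avg_pool2d(flow_gt, kernel_size=2 ** n)
        loss = loss + torch.sum((pred - target) ** 2)  # one squared-Frobenius-norm term of Eq. (4)
    return loss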

5 Experimental Results and Discussion

In my experiments, I used 80% of the resized frame pairs as the training set and the remainder as the test set. Both networks were trained with the ADAM optimizer with a learning rate of 0.1, momentum of 0.9, weight decay of 0.4 and a mini-batch size of 32. I applied random data transformations composed of cropping, flipping and rotating the images.

Table 1: Comparison of EPE results using my networks with those from the original paper [1].

Method        Test EPE
FNS           3.72
FNC           3.28
FNS-original  2.71
FNC-original  2.19

Table 1 compares the EPE results of the original FNS and FNC (taken from the FlowNet paper) and my shallower implementations. Even though I used a much smaller input size, I achieved performance close to the original implementations.

In Figures 13-25, I compared the results of my implementation with the Horn-Schunck (HS) algorithm. Figures 13-16 show MCPE and EPE results for two different frame pairs from the test set. In terms of EPE, both FNS and FNC outperform HS, which was expected since the HS algorithm does not know anything about the ground truth. However, in terms of MCPE, HS performed better than FNS and slightly worse than FNC. This result shows that there is no direct relationship between MCPE and EPE. Even though HS performed poorly in terms of EPE, it still performs reasonably well according to MCPE. Note that even the ground truth flow yields a comparable MCPE. The main reason for this large ground truth MCPE is the significant amount of occluded regions caused by the large motion between frames. In some sense, the ground truth MCPEs are lower bounds for FNS and FNC, since they are trying to reproduce the ground truth. Finally, in the color-coded motion visualizations, it can be observed that FNS and FNC predict a more patch-wise constant motion field than HS. So, FNS and FNC predictions may be useful for motion-based image segmentation.

To examine the generalization performance of FNS and FNC, I also tested the algorithms on the Miss America frames. Figures 17-20 show the comparisons for neighboring frames and for frames 4 apart. For neighboring frames, FNC performed poorly and HS outperformed both FNS and FNC, which is very surprising. One can observe that in the Flying Chairs examples the initial MCPE between the two frames is very large, whereas for Miss America with a one-frame difference the initial MCPE is only 3. I investigated most of the examples in Flying Chairs, and it seems that most of them contain a significant amount of motion (much bigger than that between neighboring frames at 30 FPS). So, FNS and FNC do not know how to predict small motion and fail at it. Another interesting fact is that FNS performed better than FNC for small motion. This can be explained by memorization: probably, in some of the layers, FNC learned to predict motion vectors with large magnitudes. When I increased the frame spacing from 1 to 4, FNS and FNC outperformed HS. After seeing these results, I decided to compare the algorithms' MCPE performance with respect to increasing motion.

I downloaded a 360 × 640, 30 FPS video from a talk show. I chose this video because it is somewhat similar to the Flying Chairs dataset: it has a moving background, a moving person in the middle of the screen and a moving audience. Figures 21-25 show the performance comparison for differently spaced frames. It can be observed that the performance of FNS and FNC improved with increasing frame spacing, and FNC outperformed HS when the frame spacing is larger than 5; however, HS outperformed FNS in nearly all of these examples. This shows the importance of adding model-based information (the correlation layer) to a learning-based method. One interesting region in these frames is the channel logo in the upper left (and the channel name in the upper right). Even though the background is moving, the channel logo remains in the same position. HS estimates the motion in the logo perfectly (since it is zero motion), whereas FNC fails to predict it. I suspect FNC tries to predict a patch-wise constant motion field and therefore fails for small details.

6 Challenges and Future Work

For me, the biggest challenge of this project was the network design. In the beginning it seemed like a simple CNN architecture; however, it has lots of details which make it harder to implement with deep learning libraries. Some ideas for future work are listed below.

- FlowNet architectures optimize the weights for EPE. However, one could optimize for the prediction error as well. This would result in an unsupervised learning problem where we don't have any ground truth.
- Recurrent neural networks can be used to estimate the motion in a video.
- FNS and FNC can be combined to get a better motion prediction.
- The networks can be fine-tuned on real optical flow datasets like KITTI [4].
- We can add more frame pairs with small motion to the training data to increase the performance of the algorithm for small motion.
- We can try to use RNNs to find an iterative solution. For example, in the first step, we can give 3 inputs to the network: the 2 frames and an estimate of the optical flow, which can be calculated by classical methods such as block matching or Horn-Schunck. In the following steps, the network would take the first frame, the warped version of the second frame and the motion estimate from the previous iteration. I was planning to try this improvement, but replicating the original paper took more time than I expected.

Figure 13: Motion estimation results for an example test image. [Panels: First Frame; Ground Truth; FlowNetS, EPE = 3.6; Second Frame; Horn Schunck, EPE = 5.14; FlowNetC, EPE = 3.48]

Figure 14: MCPE results for an example test image. [Panels: First Frame; Ground truth, Prediction MSE = 96.6; FlowNetS, Prediction MSE = 1143.4; Initial Prediction MSE = 1545.9; Horn Schunck, Prediction MSE = 944.3; FlowNetC, Prediction MSE = 926.7]

Figure 15: Motion estimation results for an example test image. [Panels: First Frame; Ground Truth; FlowNetS, EPE = 4.14; Second Frame; Horn Schunck, EPE = 4.75; FlowNetC, EPE = 3.76]

Figure 16: MCPE results for an example test image. [Panels: First Frame; Ground truth, Prediction MSE = 936.5; FlowNetS, Prediction MSE = 1453.3; Initial Prediction MSE = 17.7; Horn Schunck, Prediction MSE = 1262.3; FlowNetC, Prediction MSE = 1238.9]

Figure 17: Motion estimation results for Miss America - neighboring frames. [Panels: First Frame; FlowNetS; Second Frame; Horn Schunck; FlowNetC]

Figure 18: MCPE results for Miss America - neighboring frames. [Panels: First Frame; Initial Prediction MSE = 3.5; FlowNetS, Prediction MSE = 4.1; Second Frame; Horn Schunck, Prediction MSE = 5.9; FlowNetC, Prediction MSE = 38.1]

Figure 19: Motion estimation results for Miss America - 4 frames apart. [Panels: First Frame; FlowNetS; Second Frame; Horn Schunck; FlowNetC]

Figure 20: MCPE results for Miss America - 4 frames apart. [Panels: First Frame; Initial Prediction MSE = 183.4; FlowNetS, Prediction MSE = 39.1; Second Frame; Horn Schunck, Prediction MSE = 51.2; FlowNetC, Prediction MSE = 39.2]

Figure 21: MCPE results for an example video - neighboring frames. [Panels: First Frame; Initial Prediction MSE = 25.7; FlowNetS, Prediction MSE = 87.3; Second Frame; Horn Schunck, Prediction MSE = 75.1; FlowNetC, Prediction MSE = 487.]

Figure 22: MCPE results for an example video - 5 frames apart. [Panels: First Frame; Initial Prediction MSE = 726.2; FlowNetS, Prediction MSE = 348.9; Second Frame; Horn Schunck, Prediction MSE = 314.1; FlowNetC, Prediction MSE = 361.5]

Figure 23: MCPE results for an example video - 10 frames apart. [Panels: First Frame; Initial Prediction MSE = 1173.2; FlowNetS, Prediction MSE = 622.9; Second Frame; Horn Schunck, Prediction MSE = 546.6; FlowNetC, Prediction MSE = 472.4]

Figure 24: MCPE results for an example video - 30 frames apart. [Panels: First Frame; Initial Prediction MSE = 1.1; FlowNetS, Prediction MSE = 728.9; Second Frame; Horn Schunck, Prediction MSE = 792.2; FlowNetC, Prediction MSE = 69.]

Figure 25: MCPE results for an example video - frames apart. [Panels: First Frame; Initial Prediction MSE = 2288.; FlowNetS, Prediction MSE = 1567.1; Second Frame; Horn Schunck, Prediction MSE = 1.2; FlowNetC, Prediction MSE = 1258.8]

References

[1] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. van der Smagt, D. Cremers, and T. Brox, "FlowNet: Learning optical flow with convolutional networks," in Proceedings of the IEEE International Conference on Computer Vision, pp. 2758-2766, 2015.

[2] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox, "FlowNet 2.0: Evolution of optical flow estimation with deep networks," arXiv preprint arXiv:1612.01925, 2016.

[3] D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black, "A naturalistic open source movie for optical flow evaluation," in European Conf. on Computer Vision (ECCV) (A. Fitzgibbon et al., eds.), Part IV, LNCS 7577, pp. 611-625, Springer-Verlag, Oct. 2012.

[4] M. Menze and A. Geiger, "Object scene flow for autonomous vehicles," in Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

[5] A. M. Tekalp, Digital Video Processing. Prentice Hall Press, 2015.

[6] P. Weinzaepfel, J. Revaud, Z. Harchaoui, and C. Schmid, "DeepFlow: Large displacement optical flow with deep matching," in Proceedings of the IEEE International Conference on Computer Vision, pp. 1385-1392, 2013.

[7] J. Revaud, P. Weinzaepfel, Z. Harchaoui, and C. Schmid, "EpicFlow: Edge-preserving interpolation of correspondences for optical flow," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1164-1172, 2015.