2018 IEEE Fourth International Conference on Multimedia Big Data (BigMM)

Video Frame Interpolation Using Recurrent Convolutional Layers

Zhifeng Zhang 1, Li Song 1,2, Rong Xie 2, Li Chen 1
1 Institute of Image Communication and Network Engineering, Shanghai Jiao Tong University
2 Cooperative Medianet Innovation Center, Shanghai, China
{maplezzf, song li, xierong, hilichen}@sjtu.edu.cn

Abstract — Frame interpolation attempts to generate intermediate frames given existing ones, which is challenging because of complex video scenes and motion. Standard methods first estimate the motion between two consecutive frames and then synthesize new ones. In this paper, we propose a novel frame interpolation method based on the video synthesis approach deep voxel flow (DVF). In DVF, a deep convolutional encoder-decoder predicts 3D voxel flow, and a volume sampling layer then synthesizes the intermediate frame guided by that flow. To improve the accuracy of the voxel flow, we employ recurrent convolutional layers (RCL) in the encoder-decoder module to refine the flow step by step; we call the resulting model DVF-RCL. We also incorporate a perceptual loss to increase visual quality. Experiments demonstrate that our method greatly improves on the original DVF and produces results that compare favorably to state-of-the-art methods both quantitatively and qualitatively.

Index Terms — frame interpolation, recurrent convolutional layers, deep voxel flow, video processing

I. INTRODUCTION

Video frame interpolation is a classic problem in computer vision and video processing. It attempts to synthesize one or more intermediate frames from existing ones and is widely used in many applications. For example, increasing the video frame rate is required in video transcoding systems to improve visual quality [1]. In addition, frames are often dropped during video transmission due to limited bandwidth at the sending end, and they must be recovered via interpolation at the receiving end [2].

Traditional frame interpolation methods usually take two steps: motion estimation between adjacent frames, followed by pixel synthesis guided by the motion. However, the performance of these methods relies heavily on the accuracy of optical flow, which is hard to estimate in regions with occlusion, motion blur, large displacement, and abrupt lighting changes [3].

In recent years, deep convolutional neural networks (CNNs) have shown remarkable performance on many computer vision problems. CNN-based methods have set new state-of-the-art results on high-level vision tasks such as image classification [4] and object detection [5], and deep learning approaches also produce impressive results in image super-resolution [6] and other low-level vision problems [7]. More recently, optical flow estimation has been addressed as a supervised learning problem with ground-truth flow [8], [9]. CNN-based approaches are also promising for frame synthesis with end-to-end models, although improvement is still needed to better handle large displacement and to generate visually more pleasing results.

[Fig. 1. Visual example of frame interpolation: (a) ground truth, (b) DVF [10], (c) Ours-L1, (d) Ours-LF. Compared to the original DVF [10] (b), our proposed method produces visually more pleasing results, especially the variant with perceptual loss LF (d).]

In this paper, we present a novel method for the video frame interpolation problem.
Our approach is an end-to-end network based on the frame synthesis method Deep Voxel Flow (DVF) [10]. The model has two parts: a convolutional encoder-decoder predicts 3D voxel flow, and a volume sampling layer then synthesizes the intermediate frame guided by that flow. In particular, in the decoder, recurrent convolutional layers (RCL) [11] progressively leverage feature maps from lower-level convolutional layers and refine the voxel flow step by step: the network recurrently generates a refined voxel flow at 2x the resolution until it matches the original input resolution. In this way we obtain a more accurate flow estimate than the original DVF. In addition, perceptual loss functions are incorporated to further improve the visual quality of the synthesized frames, as illustrated in Fig. 1. To the best of our knowledge, we are the first to utilize RCL for video frame interpolation, and we achieve state-of-the-art results on the UCF-101 test set [12].

The rest of this paper is organized as follows: Section II introduces related work on video frame interpolation. Section III describes the details of our proposed method. Experimental results are given in Section IV. Finally, Section V draws a conclusion.

[Fig. 2. Architecture of our proposed method DVF-RCL. Given input frames I0 and I1, voxel flow is predicted by a convolutional encoder-decoder network and then used to synthesize the interpolated frame via a volume sampling layer. Recurrent convolutional layers are used in the decoder to refine the voxel flow step by step. The diagram legend distinguishes convolution, average pooling, bilinear upsampling, volume sampling, and recurrent convolutional layers, plus recurrent and concatenation connections.]

II. RELATED WORK

Video frame interpolation is one of the basic video processing technologies; it attempts to synthesize intermediate frames given existing ones. Common frame interpolation approaches first estimate dense motion, especially optical flow, between consecutive input frames and then generate one or more middle frames based on the estimated motion [13]. Multiple methods can be used for motion estimation, including traditional motion-compensated methods [14] and recent neural-network-based methods [8]. However, flow amplitudes vary greatly, from slow motion to large displacement, which makes accurate prediction challenging. Given the estimated optical flow between two consecutive frames, an intermediate frame can be synthesized by projecting pixel values bidirectionally from those frames; the details of a standard interpolation algorithm can be found in [13]. The quality of the synthesized frames therefore depends on the accuracy of both the optical flow and the interpolation algorithm.

Different from these flow-based methods, Mahajan et al. developed a path framework that copies pixel gradients from the input images to the interpolated frame via a Poisson reconstruction [15]. Meyer et al. presented a phase-based technique for video frame interpolation [16]. Although this method often generates impressive results, further improvement is still required to handle larger motions and maintain high-frequency details.

Recently, convolutional neural networks have achieved state-of-the-art results in many computer vision tasks, and they can also be applied to optical flow estimation [8], [9]. However, these approaches require supervision, i.e., optical flow ground truth, which is difficult to obtain. Long et al. applied a convolutional neural network to frame interpolation and then inverted the learned CNN; however, their method generates the interpolated frame only as an intermediate step, and their end goal is optical flow estimation [17]. Zhou et al. trained a CNN to predict appearance flow and then reconstructed novel views from this estimate [18]; their method can produce a frame between the inputs by warping the individual input views with the appearance flows.

A number of papers directly generate images or videos using CNNs. Mathieu et al. presented a multi-scale network for video prediction using a gradient difference loss function and adversarial training, but artifacts and blurriness remain a problem for this method [19]. Liu et al. combined the strengths of optical-flow-based and neural-network-based methods [10]: they predicted 3D voxel flow using a deep, fully differentiable network and then synthesized new frames by flowing pixel values from existing ones, but the resulting frames are still not visually satisfying.
Niklaus et al. cast pixel synthesis for frame interpolation as local convolution over the input frames and employed a deep CNN to estimate a spatially-adaptive 2D convolution kernel for each pixel, capturing both local motion and resampling coefficients [3]. A new frame can then be generated by convolving the kernels with the input frames, but the memory demand increases quadratically with the kernel size. Their extended work approximates each 2D kernel with a pair of 1D kernels, which requires far fewer parameters [20]. When handling large motion, however, memory remains a problem for this adaptive separable convolution method.

III. PROPOSED METHOD

Our method is based on an end-to-end, fully differentiable CNN framework that generates intermediate frames directly from the input frames, referred to as Deep Voxel Flow (DVF) [10]. Starting from this baseline model, we describe our proposed method, called DVF-RCL, which utilizes recurrent convolutional layers (RCL) [11] to predict the voxel flow step by step and improve the quality of the synthesized frames.

A. Network Architecture

1) Deep Voxel Flow: We first briefly describe our baseline model DVF and define notation. DVF predicts 3D voxel flow using a convolutional encoder-decoder and then synthesizes the desired frame with a volume sampling layer. Denote the 3D voxel flow field by $F = \{\Delta x, \Delta y, \Delta t\}$, which we separate into $F_{motion} = \{\Delta x, \Delta y\}$ and $F_{mask} = \{\Delta t\}$. The spatial component $F_{motion}$ represents the optical flow from the intermediate frame to the next frame, and the temporal component $F_{mask}$ serves as the sampling weights for trilinear interpolation. The interpolated frame is synthesized from the voxel flow $F$ as

$\hat{I} = \Delta t \odot \tilde{I}_0 + (1 - \Delta t) \odot \tilde{I}_1$   (1)

where $\tilde{I}_0$ and $\tilde{I}_1$ are the two input frames resampled according to the voxel flow $F$, and $\odot$ denotes the Hadamard product. This formulation was introduced in [10] for video frame synthesis; it combines the advantages of flow-based and CNN-based methods and merges the two steps of flow-based methods into a single process. The volume sampling layer only performs a spatial transformation and contains no learnable parameters, so the performance of DVF hinges on the encoder-decoder, which directly determines the accuracy of the voxel flow.
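To make Eq. (1) concrete, the following is a minimal PyTorch sketch of the volume sampling step. It is an illustration under our assumptions (flow stored in pixel units, $I_0$ warped backward and $I_1$ forward along the flow), not the authors' released implementation.

```python
import torch
import torch.nn.functional as F


def volume_sample(i0, i1, voxel_flow):
    """Synthesize the intermediate frame from 3D voxel flow (Eq. 1).

    i0, i1:      input frames, shape (B, C, H, W)
    voxel_flow:  (B, 3, H, W) -- channels (dx, dy, dt), with the spatial
                 flow in pixels and dt in [0, 1].
    """
    b, _, h, w = voxel_flow.shape
    dx, dy = voxel_flow[:, 0], voxel_flow[:, 1]
    dt = voxel_flow[:, 2:3]  # keep the channel dim for broadcasting

    # Base sampling grid in the normalized [-1, 1] coordinates grid_sample expects.
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
    base = torch.stack((xs, ys), dim=-1).to(voxel_flow).expand(b, h, w, 2)

    # Convert the pixel-space flow into the normalized coordinate system.
    flow = torch.stack((2.0 * dx / (w - 1), 2.0 * dy / (h - 1)), dim=-1)

    # Resample I0 backward along the flow and I1 forward along it.
    i0_warp = F.grid_sample(i0, base - flow, align_corners=True)
    i1_warp = F.grid_sample(i1, base + flow, align_corners=True)

    # The temporal mask dt blends the two warped frames (Hadamard product).
    return dt * i0_warp + (1.0 - dt) * i1_warp
```

Because every step is differentiable, gradients flow through the sampling into the encoder-decoder, which is what makes the model trainable end to end.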
2) Recurrent Convolutional Layer: The recurrent convolutional layer (RCL) was proposed in [11] for object recognition and has also been used for video saliency estimation [21] and object segmentation [22]. The basic idea is to add recurrent connections along the time axis within every convolutional layer of a feed-forward CNN. This structure lets units be modulated by other units in the same layer, so the effective receptive field of a layer grows as the time steps increase. Because weights are shared across time steps, the recurrent connections increase the depth of the original CNN without adding parameters [11]. Stacks of RCLs can therefore leverage local information to refine the details of the voxel flow step by step.

Fig. 2 illustrates the framework of our proposed DVF-RCL. As in the original DVF, an encoder-decoder network predicts the 3D voxel flow and a volume sampling layer synthesizes the intermediate frame. For the encoder, we adopt the encoder module from SepConv [20] for its strong ability to extract motion features from the input frames; specifically, we use stacks of three 3x3 convolution layers with ReLU activation and downsample the feature maps 5 times. In the decoder, recurrent convolutional layers fuse the voxel flow with encoder feature maps: the coarse initial voxel flow is upsampled 2x and concatenated with the corresponding lower-level feature maps from the encoder as the input of an RCL, and the refined voxel flow is obtained from the last convolutional layer. In this way the voxel flow is recovered progressively from coarse to fine until it reaches the full image resolution. We use 3x3 convolution kernels in both the feed-forward and recurrent connections, and feature maps are generated at each time step of the RCL. We replace local response normalization (LRN) with batch normalization [23] for better interpolation results; all other parameters are the same as in [11].

[Fig. 3. The detailed framework of the RCL, unfolded over 3 time steps, with the feed-forward and recurrent connections shown separately. n denotes the number of lower-level feature maps from the encoder, so the RCL input has n + 3 channels.]

Fig. 3 presents the RCL unfolded over 3 time steps. The upsampled initial voxel flow is concatenated with the lower-level feature maps from the corresponding encoder stage to form the initial input of the RCL. At each time step, the RCL receives inputs from both the feed-forward connection and the recurrent connection, and its outputs are passed to the next time step. In [11], 3 time steps were found to work best for object recognition, but this has not been verified for other problems; we therefore try RCLs with different numbers of time steps in our experiments to find the best setting for our model.
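As a concrete illustration of this refinement scheme, here is a minimal PyTorch sketch of a weight-shared RCL and one coarse-to-fine decoder step. The channel sizes, the per-step batch-norm layers, and the `to_flow` projection are our assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RCL(nn.Module):
    """Recurrent convolutional layer in the spirit of Liang & Hu [11]:
    a feed-forward 3x3 conv plus a recurrent 3x3 conv whose weights are
    shared across time steps, so unrolling adds depth but no parameters."""

    def __init__(self, in_ch, out_ch, steps=3):
        super().__init__()
        self.steps = steps
        self.ff = nn.Conv2d(in_ch, out_ch, 3, padding=1)    # feed-forward path
        self.rec = nn.Conv2d(out_ch, out_ch, 3, padding=1)  # recurrent path
        # Batch norm replaces the LRN of [11]; one BN per unrolling stage.
        self.bn = nn.ModuleList(nn.BatchNorm2d(out_ch) for _ in range(steps + 1))

    def forward(self, x):
        ff = self.ff(x)                      # t = 0: feed-forward only
        state = F.relu(self.bn[0](ff))
        for t in range(1, self.steps + 1):   # t >= 1: add the recurrent input
            state = F.relu(self.bn[t](ff + self.rec(state)))
        return state


def refine_flow(coarse_flow, skip_feats, rcl, to_flow):
    """One decoder step: upsample the voxel flow 2x, concatenate the
    encoder skip features (n + 3 input channels), refine with an RCL,
    and project back to the 3 voxel-flow channels."""
    up = F.interpolate(coarse_flow, scale_factor=2.0, mode="bilinear",
                       align_corners=False)
    return to_flow(rcl(torch.cat((up, skip_feats), dim=1)))
```

A full decoder would call `refine_flow` once per scale, with `to_flow` being a small convolution, until the flow reaches the input resolution.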

B. Network Training

Following [20], we consider two types of loss functions for training our model. The first is an l1-norm reconstruction loss based on the pixel-wise color difference between the synthesized frame $\hat{I}$ and the ground truth $I_{gt}$:

$L_1 = \|\hat{I} - I_{gt}\|_1$   (2)

An l2-norm loss could be used instead, but we found it leads to blurrier results than the l1 norm, in line with the findings of [20]. However, the ability of the l1 loss to capture perceptual differences, such as high-frequency details, is limited because it is defined purely on pixel-wise color differences.

The second type is a perceptual loss, which has proved effective in artistic style transfer [24] and image super-resolution [25] and can generate visually more appealing results than pixel-wise distance metrics. We therefore employ high-level features from the relu5_4 layer of a pre-trained VGG-19 network as a perceptual loss. The combined distance metric is defined as

$L_F = \|\hat{I} - I_{gt}\|_1 + \lambda_{vgg} \|\phi(\hat{I}) - \phi(I_{gt})\|_2^2$   (3)

where $\phi(\hat{I})$ and $\phi(I_{gt})$ are the high-level VGG features of the two images and $\lambda_{vgg}$ is an empirically chosen weight.
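The combined objective can be written compactly in PyTorch, as in the sketch below. The $\lambda_{vgg}$ value, the mean-reduced norms, and the exact index of the relu5_4 slice in torchvision's VGG-19 are assumptions we fill in, since the paper does not spell them out.

```python
import torch
import torch.nn as nn
import torchvision.models as models


class CombinedLoss(nn.Module):
    """L_F of Eq. (3): a pixel-wise l1 term plus a VGG-19 feature term."""

    def __init__(self, lambda_vgg=0.01):  # lambda_vgg: assumed value
        super().__init__()
        # Layers 0..35 of torchvision's VGG-19 end at relu5_4 (assumed
        # index); the loss network is frozen and used only for features.
        vgg = models.vgg19(pretrained=True).features[:36].eval()
        for p in vgg.parameters():
            p.requires_grad = False
        self.vgg = vgg
        self.lambda_vgg = lambda_vgg

    def forward(self, pred, target):
        # Mean-reduced l1, a common normalization of Eq. (2); inputs are
        # assumed to be pre-normalized to the VGG training statistics.
        l1 = torch.abs(pred - target).mean()
        feat = ((self.vgg(pred) - self.vgg(target)) ** 2).mean()
        return l1 + self.lambda_vgg * feat
```

Training with `lambda_vgg = 0` recovers the plain $L_1$ model, so both variants discussed below can share one implementation.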
IV. EXPERIMENTS

A. Datasets

Our training data comes from the UCF-101 training set [12], which contains videos from 101 action categories; it is a standard action recognition dataset with varied scenes and motion. Training samples were extracted by randomly cropping patches from three consecutive frames. From the extracted triplets we selected samples with obvious motion, discarding samples whose frames are temporally too close to each other; to measure the motion, DeepFlow2 [26] was used to predict the optical flow between frames. Overall, we randomly selected 100,000 triplets to compose our training dataset. Following [10], the UCF-101 test set is chosen as the benchmark for evaluation. We report both PSNR and SSIM, computed on the luminance channel of the images.
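For completeness, here is a small sketch of how the two metrics can be computed on the luminance channel with scikit-image; the BT.601 luma coefficients and the 8-bit data range are our assumptions, as the paper does not specify its exact conversion.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity


def luminance(rgb):
    """ITU-R BT.601 luma from an 8-bit RGB image of shape (H, W, 3)."""
    rgb = rgb.astype(np.float64)
    return 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]


def evaluate(pred_rgb, gt_rgb):
    """PSNR and SSIM on the luminance channel, as reported in Table I."""
    y_pred, y_gt = luminance(pred_rgb), luminance(gt_rgb)
    psnr = peak_signal_noise_ratio(y_gt, y_pred, data_range=255)
    ssim = structural_similarity(y_gt, y_pred, data_range=255)
    return psnr, ssim
```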

B. Parameter Settings

All convolutional layers are initialized with the Xavier method [28]. We trained the network with the Adam optimizer [29], with β1 = 0.9 and a fixed, empirically chosen learning rate. As in [20], a small mini-batch size of 16 samples was selected, since we found that larger mini-batches led to quality degradation. The framework was implemented in PyTorch [30], with both training and testing run on an NVIDIA GeForce GTX 1080 GPU.

C. Analysis of the Recurrent Convolutional Layer

To find the best hyper-parameter setting, we trained our model with RCLs of different numbers of time steps. Fig. 4 shows the PSNR results for time steps from 0 to 5; note that at t = 0 only feed-forward computation takes place and the model degrades into a plain version without recurrent connections.

[Fig. 4. PSNR results for different numbers of time steps in the RCL. At t = 0 only feed-forward computation takes place.]

We find that DVF-RCL with 3 time steps performs better than versions with fewer time steps. The effective receptive field of an RCL unit in the feature maps expands as the time step increases [22], so employing RCLs progressively refines the voxel flow and improves its final accuracy. However, using 4 or more time steps brings no further gain and even degrades performance slightly. We therefore set the number of time steps to 3 in our final model.

[Fig. 5. Visual example of frame interpolation results, comparing our method with different loss functions to the original DVF [10] and SepConv [20].]

Fig. 5 compares visual examples from the original DVF and our DVF-RCL. The results show that DVF-RCL greatly improves the visual quality of the original model. The example depicts a scene with large motion that is hard to estimate: the frame interpolated by DVF tends to be blurry due to inaccurate voxel flow, whereas our method predicts a more accurate voxel flow and produces visually pleasing results. Compared to SepConv with either of its two loss functions, our method also performs better in regions with large displacement.

D. Perceptual Loss

Two loss functions are considered in our model: the pixel-wise loss L1 and the perceptual loss LF. It is well known that pixel-wise distance metrics alone often produce over-smoothed results and that adding a perceptual loss can improve visual quality. We therefore trained two versions of our DVF-RCL model: one with the L1 loss alone and one that additionally uses the LF loss. As listed in Table I, the PSNR of the LF model is somewhat lower than that of the purely pixel-wise L1 model; however, the LF loss leads to sharper frames, as shown in Fig. 1 and Fig. 5, in line with the findings of recent work [20]. We also compare against the two SepConv variants, L1 and LF, in Fig. 5: consistent with our findings, SepConv-LF with the perceptual loss produces sharper results than SepConv-L1, but our DVF-RCL-LF generates visually more pleasing results than SepConv-LF. The example in Fig. 5 depicts a horse-racing scene with large motion; our method produces a perceptually more satisfying frame, especially around the horses' legs, demonstrating that our model handles large displacement properly.

E. Comparison with the State of the Art

In this section, we compare several state-of-the-art frame interpolation methods with our proposed model. We first consider two flow-based techniques, FlowNet2 [9] and MDP-Flow2 [27], with the interpolation algorithm described in [13] used to generate intermediate frames from the estimated optical flow. We also compare against the phase-based interpolation method [16]. For neural-network-based approaches, we include our baseline model DVF [10] and SepConv [20], reporting SepConv with both the L1 and LF losses for a complete comparison.

[TABLE I. Comparison of state-of-the-art methods in PSNR and SSIM on the UCF-101 test set: FlowNet2 [9], MDP-Flow2 [27], Meyer et al. [16], DVF [10], SepConv-L1 [20], SepConv-LF [20], Ours-L1, and Ours-LF.]

As shown in Table I, our DVF-RCL-L1 is the best-performing method in terms of PSNR. Fig. 6 shows visual examples from the different methods: the original DVF often generates blurry results with artifacts, whereas our DVF-RCL-LF produces visually pleasing frames competitive with SepConv-LF, which is also trained with a perceptual loss.

[Fig. 6. Visual comparisons among the different methods: ground truth, Ours-LF, FlowNet2 [9], MDP-Flow2 [27], Meyer et al. [16], DVF [10], and SepConv-LF [20].]

With respect to computational complexity, our model outperforms SepConv while producing comparable results. On an NVIDIA GeForce GTX 1080 GPU, SepConv takes 0.98 and 0.51 seconds to interpolate a frame at its two test resolutions, whereas our model synthesizes frames at the same resolutions in 0.52 and 0.23 seconds, about 2 times faster. Memory demand is another relevant difference between SepConv and our DVF-RCL: our method only needs about 23 MB to store the estimated 3D voxel flow for frame interpolation, whereas SepConv requires 1.27 GB [20] to hold the per-pixel kernels for a 1080p video frame. Handling larger motion consumes even more memory as the kernel size increases, which may become a limitation in practical applications.

F. Discussion

Our method currently generates an intermediate frame at t = 0.5 between two consecutive input frames. We could also train a multi-step model, like DVF, to interpolate multiple frames simultaneously, but that is not flexible enough to produce a frame at an arbitrary time. One possible solution is to feed the desired temporal step in with the inputs, as also mentioned in [20]. Furthermore, we could extract training samples with different time intervals to increase the variety of frame rates, which would enable our model to handle videos with a larger range of motion.

V. CONCLUSION

In this paper, we have presented a novel frame interpolation approach called DVF-RCL. Building on the video synthesis method DVF, recurrent convolutional layers (RCL) are employed in the encoder-decoder module to estimate the voxel flow step by step and increase its accuracy. We also use a perceptual loss to further improve the visual quality of the synthesized frames. Our experiments show that this approach greatly improves on the original DVF both quantitatively and qualitatively, and produces high-quality results comparable to state-of-the-art interpolation methods.

ACKNOWLEDGMENT

This work was supported by NSFC, the 111 Project (B07022), the Natural Science Foundation of Shanghai, and the Shanghai Key Laboratory of Digital Media Processing and Transmissions.

REFERENCES

[1] U. S. Kim and M. H. Sunwoo, "New frame rate up-conversion algorithms with low computational complexity," IEEE Transactions on Circuits and Systems for Video Technology, vol. 24, no. 3, 2014.
[2] S. Sekiguchi, Y. Idehara, K. Sugimoto, and K. Asai, "A low-cost video frame-rate up conversion using compressed-domain information," in IEEE International Conference on Image Processing (ICIP), vol. 2, 2005.
[3] S. Niklaus, L. Mai, and F. Liu, "Video frame interpolation via adaptive convolution," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[4] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems (NIPS), 2012.
[5] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[6] C. Dong, C. C. Loy, K. He, and X. Tang, "Learning a deep convolutional network for image super-resolution," in European Conference on Computer Vision (ECCV), 2014.
[7] J. Xie, L. Xu, and E. Chen, "Image denoising and inpainting with deep neural networks," in Advances in Neural Information Processing Systems (NIPS), 2012.
[8] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. van der Smagt, D. Cremers, and T. Brox, "FlowNet: Learning optical flow with convolutional networks," in IEEE International Conference on Computer Vision (ICCV), 2015.
[9] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox, "FlowNet 2.0: Evolution of optical flow estimation with deep networks," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[10] Z. Liu, R. Yeh, X. Tang, Y. Liu, and A. Agarwala, "Video frame synthesis using deep voxel flow," in IEEE International Conference on Computer Vision (ICCV), 2017.
[11] M. Liang and X. Hu, "Recurrent convolutional neural network for object recognition," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[12] K. Soomro, A. R. Zamir, and M. Shah, "UCF101: A dataset of 101 human actions classes from videos in the wild," CRCV-TR-12-01, 2012.
[13] S. Baker, D. Scharstein, J. Lewis, S. Roth, M. J. Black, and R. Szeliski, "A database and evaluation methodology for optical flow," International Journal of Computer Vision (IJCV), vol. 92, no. 1, pp. 1-31, 2011.
[14] T. Brox, A. Bruhn, N. Papenberg, and J. Weickert, "High accuracy optical flow estimation based on a theory for warping," in European Conference on Computer Vision (ECCV), 2004.
[15] D. Mahajan, F.-C. Huang, W. Matusik, R. Ramamoorthi, and P. Belhumeur, "Moving gradients: a path-based method for plausible image interpolation," ACM Transactions on Graphics (TOG), vol. 28, no. 3, 2009.
[16] S. Meyer, O. Wang, H. Zimmer, M. Grosse, and A. Sorkine-Hornung, "Phase-based frame interpolation for video," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[17] G. Long, L. Kneip, J. M. Alvarez, H. Li, X. Zhang, and Q. Yu, "Learning image matching by simply watching video," in European Conference on Computer Vision (ECCV), 2016.
[18] T. Zhou, S. Tulsiani, W. Sun, J. Malik, and A. A. Efros, "View synthesis by appearance flow," in European Conference on Computer Vision (ECCV), 2016.
[19] M. Mathieu, C. Couprie, and Y. LeCun, "Deep multi-scale video prediction beyond mean square error," arXiv preprint arXiv:1511.05440, 2015.
[20] S. Niklaus, L. Mai, and F. Liu, "Video frame interpolation via adaptive separable convolution," in IEEE International Conference on Computer Vision (ICCV), 2017.
[21] X. Wei, L. Song, R. Xie, and W. Zhang, "Two-stream recurrent convolutional neural networks for video saliency estimation," in IEEE International Symposium on Broadband Multimedia Systems and Broadcasting (BMSB), 2017.
[22] J. Xu, L. Song, and R. Xie, "Two-stream deep encoder-decoder architecture for fully automatic video object segmentation," in IEEE Visual Communications and Image Processing (VCIP), 2017.
[23] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv preprint arXiv:1502.03167, 2015.
[24] J. Johnson, A. Alahi, and L. Fei-Fei, "Perceptual losses for real-time style transfer and super-resolution," in European Conference on Computer Vision (ECCV), 2016.
[25] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang et al., "Photo-realistic single image super-resolution using a generative adversarial network," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[26] P. Weinzaepfel, J. Revaud, Z. Harchaoui, and C. Schmid, "DeepFlow: Large displacement optical flow with deep matching," in IEEE International Conference on Computer Vision (ICCV), 2013.
[27] L. Xu, J. Jia, and Y. Matsushita, "Motion detail preserving optical flow estimation," IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 34, no. 9, 2012.
[28] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS), 2010.
[29] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[30] A. Paszke, S. Gross, S. Chintala, and G. Chanan, "PyTorch: Tensors and dynamic neural networks in Python with strong GPU acceleration," 2017.


More information

Single Image Depth Estimation via Deep Learning

Single Image Depth Estimation via Deep Learning Single Image Depth Estimation via Deep Learning Wei Song Stanford University Stanford, CA Abstract The goal of the project is to apply direct supervised deep learning to the problem of monocular depth

More information

EE795: Computer Vision and Intelligent Systems

EE795: Computer Vision and Intelligent Systems EE795: Computer Vision and Intelligent Systems Spring 2012 TTh 17:30-18:45 FDH 204 Lecture 14 130307 http://www.ee.unlv.edu/~b1morris/ecg795/ 2 Outline Review Stereo Dense Motion Estimation Translational

More information

arxiv: v2 [cs.cv] 14 May 2018

arxiv: v2 [cs.cv] 14 May 2018 ContextVP: Fully Context-Aware Video Prediction Wonmin Byeon 1234, Qin Wang 1, Rupesh Kumar Srivastava 3, and Petros Koumoutsakos 1 arxiv:1710.08518v2 [cs.cv] 14 May 2018 Abstract Video prediction models

More information

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun Presented by Tushar Bansal Objective 1. Get bounding box for all objects

More information

Direct Multi-Scale Dual-Stream Network for Pedestrian Detection Sang-Il Jung and Ki-Sang Hong Image Information Processing Lab.

Direct Multi-Scale Dual-Stream Network for Pedestrian Detection Sang-Il Jung and Ki-Sang Hong Image Information Processing Lab. [ICIP 2017] Direct Multi-Scale Dual-Stream Network for Pedestrian Detection Sang-Il Jung and Ki-Sang Hong Image Information Processing Lab., POSTECH Pedestrian Detection Goal To draw bounding boxes that

More information

Deep Tracking: Biologically Inspired Tracking with Deep Convolutional Networks

Deep Tracking: Biologically Inspired Tracking with Deep Convolutional Networks Deep Tracking: Biologically Inspired Tracking with Deep Convolutional Networks Si Chen The George Washington University sichen@gwmail.gwu.edu Meera Hahn Emory University mhahn7@emory.edu Mentor: Afshin

More information

Exploring Style Transfer: Extensions to Neural Style Transfer

Exploring Style Transfer: Extensions to Neural Style Transfer Exploring Style Transfer: Extensions to Neural Style Transfer Noah Makow Stanford University nmakow@stanford.edu Pablo Hernandez Stanford University pabloh2@stanford.edu Abstract Recent work by Gatys et

More information

An Empirical Study of Generative Adversarial Networks for Computer Vision Tasks

An Empirical Study of Generative Adversarial Networks for Computer Vision Tasks An Empirical Study of Generative Adversarial Networks for Computer Vision Tasks Report for Undergraduate Project - CS396A Vinayak Tantia (Roll No: 14805) Guide: Prof Gaurav Sharma CSE, IIT Kanpur, India

More information

Image Inpainting via Generative Multi-column Convolutional Neural Networks

Image Inpainting via Generative Multi-column Convolutional Neural Networks Image Inpainting via Generative Multi-column Convolutional Neural Networks Yi Wang 1 Xin Tao 1,2 Xiaojuan Qi 1 Xiaoyong Shen 2 Jiaya Jia 1,2 1 The Chinese University of Hong Kong 2 YouTu Lab, Tencent {yiwang,

More information