Deep Learning For Video Classification. Presented by Natalie Carlebach & Gil Sharon

Overview of Presentation: Motivation; challenges of video classification; common datasets; four different methods presented in three papers: 1. 3D convolutions, 2. spatial + optical-flow fusion, 3. temporal pooling, 4. LSTM; recap of a different, elegant method; conclusions.

Motivation: 500 hours of video are uploaded to YouTube every minute. Analyzing these videos is needed for search, recommendation, ranking, etc., and for action recognition, abnormal event detection, and activity understanding.

Challenges in Video Classification: Current ConvNets are not able to take full advantage of temporal information. Videos contain several orders of magnitude more data than photos. Variations in motion and viewpoint. Datasets are noisy or small, since videos are difficult to collect, annotate, and store. Complex context compared to photos.

Motivation for Temporal Information: For example, what is happening in this video? A CNN would probably classify it as crying or shouting; temporal information is needed.

First Paper: 3D Convolution (C3D), October 2015

Motivation: To combine temporal information with spatial information; 2D ConvNets are not enough. Proposal: spatiotemporal feature learning using deep 3D ConvNets.

2D Conv vs. 3D Conv: 2D conv on an image: input is an image, output is an image. 2D conv on a video: input is a volume (multiple frames treated as multiple channels), output is an image. 3D conv on a video: input is a volume, output is a volume, which preserves the temporal information of the input.
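
A minimal sketch (assuming PyTorch, purely for illustration) contrasting the output shapes of 2D and 3D convolution on a short clip:

    import torch
    import torch.nn as nn

    clip = torch.randn(1, 3, 16, 112, 112)                  # (batch, channels, frames, height, width)

    # 2D conv on a video: frames are folded into channels, so the temporal dimension collapses.
    conv2d = nn.Conv2d(in_channels=3 * 16, out_channels=64, kernel_size=3, padding=1)
    print(conv2d(clip.reshape(1, 3 * 16, 112, 112)).shape)  # torch.Size([1, 64, 112, 112]) - an image

    # 3D conv on a video: the output is still a volume, so temporal information is preserved.
    conv3d = nn.Conv3d(in_channels=3, out_channels=64, kernel_size=3, padding=1)
    print(conv3d(clip).shape)                                # torch.Size([1, 64, 16, 112, 112]) - a volume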

3D Conv Kernels: The spatial size of the kernels is fixed at 3x3; only the temporal depth of the 3D convolution kernels is varied.

Network Settings: Input: non-overlapping 16-frame clips split from each video and resized to 112x112, i.e. 3x16x112x112. Output: class labels belonging to 101 different actions. All convolution kernels are dx3x3 (d = temporal depth). Max-pooling layers 2-5: kernels are 2x2x2. Max-pooling layer 1: kernel is 1x2x2.
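
A hedged sketch of a C3D-style network matching these settings (assuming PyTorch; layer widths follow the C3D paper, but this is an illustration, not the authors' code):

    import torch
    import torch.nn as nn

    def conv(c_in, c_out, d=3):
        # every convolution kernel is d x 3 x 3 (the homogeneous d = 3 is used here)
        return nn.Sequential(nn.Conv3d(c_in, c_out, (d, 3, 3), padding=(d // 2, 1, 1)),
                             nn.ReLU(inplace=True))

    c3d = nn.Sequential(
        conv(3, 64),    nn.MaxPool3d((1, 2, 2)),                  # pool1: 1x2x2 keeps early temporal info
        conv(64, 128),  nn.MaxPool3d((2, 2, 2)),                  # pool2-pool5: 2x2x2
        conv(128, 256), conv(256, 256), nn.MaxPool3d((2, 2, 2)),
        conv(256, 512), conv(512, 512), nn.MaxPool3d((2, 2, 2)),
        conv(512, 512), conv(512, 512), nn.MaxPool3d((2, 2, 2)),
        nn.Flatten(),
        nn.Linear(512 * 1 * 3 * 3, 4096), nn.ReLU(inplace=True),  # fc6
        nn.Linear(4096, 4096), nn.ReLU(inplace=True),             # fc7
        nn.Linear(4096, 101),                                     # 101 action classes
    )

    logits = c3d(torch.randn(1, 3, 16, 112, 112))                 # input: 3 x 16 x 112 x 112
    print(logits.shape)                                           # torch.Size([1, 101])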

Training Technical Details: Dataset: UCF101. The networks are trained from scratch using mini-batches of 30 clips. Learning rate: initial learning rate of 0.003, divided by 10 after every 4 epochs. Stopping criterion: after 16 epochs.
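
A hedged sketch of this schedule (assuming PyTorch, the c3d sketch above, and a hypothetical train_loader yielding mini-batches of 30 clips with their labels):

    import torch

    optimizer = torch.optim.SGD(c3d.parameters(), lr=0.003)                        # initial LR 0.003
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=4, gamma=0.1)  # /10 every 4 epochs
    criterion = torch.nn.CrossEntropyLoss()

    for epoch in range(16):                   # stopping criterion: 16 epochs
        for clips, labels in train_loader:    # mini-batches of 30 clips (hypothetical loader)
            optimizer.zero_grad()
            loss = criterion(c3d(clips), labels)
            loss.backward()
            optimizer.step()
        scheduler.step()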

Datasets - Action Recognition: UCF101: 13,000 videos of 101 human action categories; short clips, steady camera, less natural.

Varying Network Architectures: (a) homogeneous temporal depth: d = 1, 3, 5, 7; (b) varying temporal depth: increasing d = 3-3-5-5-7, decreasing d = 7-5-5-3-3.

Varying Network Architectures - Results: Comparing homogeneous and varying temporal depths, a homogeneous temporal depth of 3 performed best and was chosen.

Learning Spatiotemporal Features: Dataset: Sport1M (long videos). Training: randomly extract five 2-second clips from every training video; C3D trained from scratch or pre-trained. Testing: for video predictions, the clip predictions of 10 clips are averaged.
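
A minimal sketch of the video-level prediction rule (assuming PyTorch and the c3d model above): the softmax scores of 10 clips sampled from the video are averaged and the highest-scoring class is returned.

    import torch

    def predict_video(model, clips):                       # clips: (10, 3, 16, 112, 112)
        with torch.no_grad():
            probs = torch.softmax(model(clips), dim=1)     # per-clip class probabilities
        return probs.mean(dim=0).argmax().item()           # average over clips, pick the best class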

Datasets - Sports Video Classification: Sport1M: 1 million YouTube videos in 487 sport categories; videos a few minutes long, captured in the wild, with a less steady camera and noisier labels.

Sport1M Results: The method is not state of the art. Note that the method of [29] uses long clips, so its clip-level accuracy is not directly comparable to that of C3D and DeepVideo.

C3D Video Descriptor: A model is trained on Sport1M and then kept fixed. The 4096-dimensional video descriptor is obtained by averaging the FC6 activations of this model over 16-frame clips sampled with a stride of 8, followed by L2 normalization. A multiclass linear SVM is trained on the descriptor, which is compared to other descriptors on several datasets.
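
A hedged sketch of this descriptor pipeline (assuming NumPy and scikit-learn, plus a hypothetical extract_fc6(clip) that returns the 4096-d FC6 activation of the fixed, Sport1M-trained C3D model):

    import numpy as np
    from sklearn.svm import LinearSVC

    def video_descriptor(frames, extract_fc6, clip_len=16, stride=8):
        # average FC6 over 16-frame clips sampled with stride 8, then L2-normalize
        feats = [extract_fc6(frames[t:t + clip_len])
                 for t in range(0, len(frames) - clip_len + 1, stride)]
        desc = np.mean(feats, axis=0)
        return desc / np.linalg.norm(desc)

    # descriptors: (num_videos, 4096) array, labels: (num_videos,) array (hypothetical data)
    # svm = LinearSVC().fit(descriptors, labels)    # multiclass linear SVM on the descriptors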

Visualization of the C3D Descriptor: C3D starts by focusing on appearance in the first few frames and then tracks the salient motion in the subsequent frames.

Results of Action Recognition on UCF101: Using the C3D descriptor, results were state of the art only when combined with the hand-crafted iDT video descriptor. Among CNN methods using RGB input only, C3D was the best.

C3D Descriptor Characteristics: Compact: on UCF101 it remains better than other descriptors when dimensions are reduced with PCA. More generic: visualized with t-SNE and compared against features extracted with 2D convolutions. Fast: 313 fps on a GPU, about 100 times faster than iDT.

Deconvolution Examples A feature map from Conv2 is learning moving edges and blobs

Deconvolution Examples A feature map from Conv3 is learning moving body parts

Deconvolution Examples A feature map from Conv5 is learning more complex movements like biking

Disadvantages of C3D: C3D has a limited temporal support of 16 consecutive frames. No optical flow is fed to the CNN, although other works showed that adding it improves results. Other methods had better performance on each dataset.

Second Paper: Temporal Feature Pooling and LSTM, April 2015, presented at CVPR 2015

Motivation: combining information over longer videos than previous methods. Proposals: a convolutional temporal feature-pooling architecture, and LSTM cells connected to the CNN's convolutional output.

Optical Flow: d_t(u, v) is the displacement vector at point (u, v) in frame t, which moves the point to the corresponding point in the following frame t+1. The horizontal and vertical components of the vector field, d_t^x and d_t^y, can be seen as image channels.

Optical Flow Stacking: To represent the motion across a sequence of frames, the flow channels d_t^x, d_t^y of L consecutive frames are stacked to form a total of 2L input channels.
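
A minimal sketch of flow stacking (assuming OpenCV and NumPy; Farneback flow stands in for whichever optical-flow algorithm the paper actually used):

    import cv2
    import numpy as np

    def stacked_flow(gray_frames, L=10):
        # gray_frames: L + 1 consecutive grayscale frames; returns an (H, W, 2L) flow stack
        channels = []
        for t in range(L):
            flow = cv2.calcOpticalFlowFarneback(gray_frames[t], gray_frames[t + 1], None,
                                                0.5, 3, 15, 3, 5, 1.2, 0)   # dense flow d_t
            channels.append(flow[..., 0])       # horizontal component d_t^x
            channels.append(flow[..., 1])       # vertical component d_t^y
        return np.stack(channels, axis=-1)      # 2L channels for the temporal stream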

Motivation for the Combination: RGB shows a brush and hair, or a brush and teeth; optical flow shows a hand moving periodically at some spatial location. Only the combination discriminates between the actions.

First Method - Convolutional Temporal Feature Pooling: temporal feature pooling performed on the last convolutional layer of GoogLeNet.

First Method - Implementation Details: Dataset: Sport1M. Clips of 120 frames are cropped at 1 fps. Each frame is fed forward through GoogLeNet. Several methods are compared for pooling features across the frames of a clip. For prediction, clips with different starting points are averaged.
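
A hedged sketch of one pooling variant, max pooling over time (assuming PyTorch; the per-frame maps stand in for the last convolutional layer of GoogLeNet):

    import torch

    def temporal_max_pool(frame_features):
        # frame_features: (T, C, H, W) conv feature maps for the T frames of a clip
        pooled, _ = frame_features.max(dim=0)    # element-wise max over time -> (C, H, W)
        return pooled

    clip_feats = torch.randn(120, 1024, 7, 7)    # 120 frames at 1 fps through the CNN
    print(temporal_max_pool(clip_feats).shape)   # torch.Size([1024, 7, 7]), fed to FC layers + softmax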

Results - Convolutional Temporal Feature Pooling: Dataset: Sport1M. Late pooling is the worst, as it doesn't preserve spatial information.

Second Method - LSTM Architecture: Pooling is order-invariant; an LSTM is more natural for sequences.

Video Classification LSTM Architecture: 5 layers of LSTM with 512 cells each. The input is the last convolutional layer of GoogLeNet from each frame. A prediction is made after each frame. During training, the weight of the loss grows linearly from 0 to 1 through the video.
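
A hedged sketch of this architecture (assuming PyTorch; feature and class dimensions are illustrative):

    import torch
    import torch.nn as nn

    class VideoLSTM(nn.Module):
        def __init__(self, feat_dim=1024, hidden=512, num_classes=487):
            super().__init__()
            self.lstm = nn.LSTM(feat_dim, hidden, num_layers=5, batch_first=True)  # 5 layers, 512 cells
            self.classifier = nn.Linear(hidden, num_classes)

        def forward(self, frame_feats):              # (B, T, feat_dim) per-frame CNN features
            h, _ = self.lstm(frame_feats)            # (B, T, hidden)
            return self.classifier(h)                # a prediction (logits) after every frame

    def weighted_loss(logits, labels):
        # per-frame loss weights grow linearly from 0 to 1 over the video
        B, T, C = logits.shape
        weights = torch.linspace(0, 1, T)
        ce = nn.functional.cross_entropy(logits.reshape(B * T, C),
                                         labels.repeat_interleave(T), reduction='none')
        return (ce.reshape(B, T) * weights).mean()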

Optical Flow Fusion Implementation: Processing videos at 1 fps loses local motion information, so an optical-flow stream is added. Each stream is fed forward through the same two methods mentioned above; fusion is performed only at the softmax layer.

Results: Sport1M results of the different methods. Optical-flow fusion does not improve performance, due to shaky videos. State-of-the-art performance on Sport1M.

Results: UCF101 results based on raw frames only; state-of-the-art results on UCF101 (at the time) when fusing optical flow.

Disadvantages of the Two Methods: Feature pooling is less generic for videos of arbitrary length. The LSTM is tested on only 30 frames, giving less global context. Optical-flow computation is slow and prevents elegant end-to-end training. Fusion with optical flow happens only at the softmax, which is not ideal for information sharing.

Third Paper: Two-Stream Fusion, April 2016, presented at CVPR 2016

Motivation: Spatial + optical-flow fusion: registering appearance recognition (spatial cue) with optical-flow recognition (temporal cue) at the pixel level. Temporal fusion: how these cues evolve over time.

Two Networks to Fuse: spatial fusion is performed per frame; temporal fusion is performed between frames.

Challenges in the Fusion Process: How to spatially fuse? Which channel in one network corresponds to which channel in the other network? Where to fuse the networks spatially? How to perform temporal fusion between frames?

Declarations: A fusion function f combines a feature map x^a from the spatial network and a feature map x^b from the temporal network into an output map y. W = width, H = height, D = number of channels of the respective feature maps; for simplicity, assume both feature maps share the same W, H, and D.

Possible Spatial Fusion Methods: Sum fusion: element-wise sum of the two feature maps. Max fusion: element-wise maximum. Concatenation fusion: stacking the two maps along the channel dimension. Conv fusion: concatenation followed by a bank of 1x1 convolution filters that learn the correspondence between channels. Bilinear fusion: outer product of the two feature vectors at each pixel, summed over all locations.
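
A hedged sketch of these operations (assuming PyTorch; x_a and x_b are feature maps of shape (B, D, H, W) from the spatial and temporal streams; bilinear fusion is omitted for brevity):

    import torch
    import torch.nn as nn

    def sum_fusion(x_a, x_b):            # element-wise sum
        return x_a + x_b

    def max_fusion(x_a, x_b):            # element-wise maximum
        return torch.maximum(x_a, x_b)

    def cat_fusion(x_a, x_b):            # stack along the channel dimension -> (B, 2D, H, W)
        return torch.cat([x_a, x_b], dim=1)

    class ConvFusion(nn.Module):         # concatenate, then let 1x1 convs learn channel correspondences
        def __init__(self, channels):
            super().__init__()
            self.reduce = nn.Conv2d(2 * channels, channels, kernel_size=1)

        def forward(self, x_a, x_b):
            return self.reduce(torch.cat([x_a, x_b], dim=1))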

Where to Spatially Fuse? Two examples (based on VGG): (1) fusion after the 4th conv layer, with only a single network tower used from the point of fusion onward; (2) fusion at two layers (after conv5 and after fc8), with both network towers kept. The choice influences the number of parameters.

Spatial Fusion Methods - Comparison: Dataset: UCF101; model: 8-layer VGG-M.

Spatial Fusion Methods - Comparison: Dataset: UCF101; model: 8-layer VGG-M. Answer to our question of how to spatially fuse: conv fusion.

Where to Spatially Fuse? Dataset: UCF101; model: 8-layer VGG-M.

Where to Spatially Fuse? Dataset: UCF101; model: 8-layer VGG-M. Answer to our question: fuse at ReLU5, or at ReLU5 + FC8 (which nearly doubles the number of parameters involved).

Temporal Fusion at Pool5: 2D pooling ignores time (spatial pooling only), and the network predictions are averaged over time. 3D pooling stacks the feature maps across frames and pools from a local spatiotemporal neighborhood, with no pooling across channels. 3D conv + 3D pooling additionally performs a convolution with a fusion kernel that spans the feature channels of both streams, space, and time before the 3D pooling; this replaces the single-frame conv fusion.
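
A minimal sketch of the 3D-pooling option (assuming PyTorch): fused per-frame feature maps are stacked over time and max-pooled over a local space-time neighborhood.

    import torch
    import torch.nn as nn

    fused = torch.randn(1, 512, 5, 14, 14)           # (B, D, T = 5 frames, H, W) stacked fused conv5 maps
    pool3d = nn.MaxPool3d(kernel_size=(3, 3, 3), stride=(1, 2, 2), padding=(1, 1, 1))
    print(pool3d(fused).shape)                        # torch.Size([1, 512, 5, 7, 7])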

Combining It All Together: Model: VGG-16. Fusion at ReLU5 and again after the softmax (both towers are kept). Spatiotemporal fusion + 3D pooling in the fused tower; 3D pooling in the temporal tower. L = 10: number of optical-flow images around each frame. T = 5: number of frames per video clip (for training and testing). τ: frame distance between sampled frames, selected randomly from [3, 10]. The prediction is averaged over both towers.

Results

Disadvantages of This Method: Tested on only 5 frames per video, which is not enough context for longer videos. Not tested on the bigger and more general Sport1M dataset. Relies heavily on optical flow, so it may not work on many real-life, unstabilized videos. As before, using optical flow is slow and does not allow end-to-end training.

Summary & Conclusions: Four different approaches to video classification were shown. Each method performed better on different tasks or datasets. Optical flow improves performance. Adding hand-crafted features improved the results, hinting that there is still room for improvement in CNN approaches.

Disadvantages Of LSTM Methods [slide source: cs231n course, Fei-Fei Li & Andrej Karpathy & Justin Johnson lecture 14, slide 36]

Another Elegant Method

Brief Overview [slide source: cs231n course, Fei-Fei Li & Andrej Karpathy & Justin Johnson lecture 14, slide 31]

Brief Overview [slide source: cs231n course, Fei-Fei Li & Andrej Karpathy & Justin Johnson lecture 14, slide 33]

Brief Overview [slide source: cs231n course, Fei-Fei Li & Andrej Karpathy & Justin Johnson lecture 14, slide 37]

Questions?