Deep Learning For Video Classification Presented by Natalie Carlebach & Gil Sharon
Overview of Presentation: Motivation; challenges of video classification; common datasets; four different methods presented in three papers: 1. 3D convolutions, 2. Spatial + optical flow fusion, 3. Temporal pooling, 4. LSTM; recap of a different, elegant method; conclusions.
Motivation: 500 hours of video are uploaded to YouTube every minute. Analyzing these videos is needed for search, recommendation, ranking, etc. Tasks include action recognition, abnormal event detection, and activity understanding.
Challenges in Video Classification: Current ConvNets are not able to take full advantage of temporal information. Several orders of magnitude more data compared with photos. Variations in motion and viewpoint. Datasets are noisy or small, since videos are difficult to collect, annotate, and store. Complex context compared to photos.
Motivation for Temporal Information: For example, what is happening in this video? A CNN would probably classify this as crying or shouting. Temporal information is needed.
First Paper: 3D Convolution (C3D), October 2015.
Motivation: To combine temporal information with spatial information; 2D ConvNets are not enough. Proposal: spatiotemporal feature learning using deep 3D ConvNets.
2D Conv vs 3D Conv: A 2D conv on an image takes an image as input and outputs an image. A 2D conv on a video takes a volume as input (multiple frames as multiple channels) but still outputs an image. A 3D conv on a video takes a volume as input and outputs a volume, preserving the temporal information of the input.
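A minimal PyTorch sketch of this difference (the tensor sizes are illustrative, not from the paper): a 2D convolution collapses its input to a 2D feature map, while a 3D convolution keeps a temporal axis in its output.

    import torch
    import torch.nn as nn

    frames = torch.randn(1, 3, 16, 112, 112)   # batch x channels x time x height x width

    # 2D conv on a single frame: the output has no temporal axis.
    conv2d = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1)
    print(conv2d(frames[:, :, 0]).shape)        # torch.Size([1, 64, 112, 112])

    # 3D conv on the whole clip: the output is still a volume, so the temporal
    # information is preserved for the following layers.
    conv3d = nn.Conv3d(in_channels=3, out_channels=64, kernel_size=3, padding=1)
    print(conv3d(frames).shape)                 # torch.Size([1, 64, 16, 112, 112])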
3D Conv Kernels: The spatial kernel size is fixed at 3x3; only the temporal depth of the 3D convolution kernels is varied.
Network Settings: Input: non-overlapping 16-frame clips split from each video, with frames resized to 112x112 (input tensor: 3x16x112x112). Output: class labels belonging to 101 different actions. All convolution kernels are d x 3 x 3 (d = temporal depth). Max pooling layers 2-5: kernels are 2x2x2. Max pooling layer 1: kernel is 1x2x2.
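A sketch of a C3D-style network with these settings, in PyTorch. The kernel and pooling sizes follow the slide; the number of filters per layer (64-128-256-512) and the pool5 padding are assumptions made so that the shapes work out.

    import torch
    import torch.nn as nn

    def c3d(num_classes=101):
        # All conv kernels are 3x3x3; pool1 is 1x2x2 and pools 2-5 are 2x2x2.
        return nn.Sequential(
            nn.Conv3d(3, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),                      # keep the temporal length early
            nn.Conv3d(64, 128, 3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(128, 256, 3, padding=1), nn.ReLU(),
            nn.Conv3d(256, 256, 3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(256, 512, 3, padding=1), nn.ReLU(),
            nn.Conv3d(512, 512, 3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(512, 512, 3, padding=1), nn.ReLU(),
            nn.Conv3d(512, 512, 3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2, padding=(0, 1, 1)),           # spatial padding so pool5 outputs 4x4
            nn.Flatten(),
            nn.Linear(512 * 1 * 4 * 4, 4096), nn.ReLU(),  # fc6
            nn.Linear(4096, 4096), nn.ReLU(),             # fc7
            nn.Linear(4096, num_classes),                 # fc8
        )

    logits = c3d()(torch.randn(2, 3, 16, 112, 112))       # -> shape (2, 101)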
Training Technical Details: Dataset: UCF101. The networks are trained from scratch using mini-batches of 30 clips. Learning rate: initial learning rate of 0.003, divided by 10 after every 4 epochs. Stopping criterion: training stops after 16 epochs.
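A sketch of this schedule; SGD, the loss function, and the stand-in data loader are assumptions, and only the batch size, learning-rate schedule, and number of epochs come from the slide.

    import torch

    model = c3d()                               # the C3D sketch above
    optimizer = torch.optim.SGD(model.parameters(), lr=0.003)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=4, gamma=0.1)  # lr / 10 every 4 epochs
    loss_fn = torch.nn.CrossEntropyLoss()

    # Stand-in for a UCF101 loader yielding mini-batches of 30 clips.
    train_loader = [(torch.randn(30, 3, 16, 112, 112), torch.randint(0, 101, (30,)))]

    for epoch in range(16):                     # stop after 16 epochs
        for clips, labels in train_loader:
            optimizer.zero_grad()
            loss_fn(model(clips), labels).backward()
            optimizer.step()
        scheduler.step()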
Datasets, Action Recognition: UCF101: 13,000 videos of 101 human action categories. Short clips, steady camera, less natural.
Varying Network Architectures: a) homogeneous temporal depth: d = 1, 3, 5, 7; b) varying temporal depth: increasing d = 3-3-5-5-7, decreasing d = 7-5-5-3-3.
Varying Network Architectures: comparing the homogeneous and varying temporal depths, a homogeneous temporal depth of 3 was chosen.
Learning Spatiotemporal Features: Dataset: Sport1M (long videos). Training: randomly extract five 2-second-long clips from every training video; C3D trained from scratch vs. C3D pre-trained. Testing: for video-level predictions, the clip predictions of 10 clips are averaged.
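A small sketch of this protocol; the clip length in frames and the clip-level `model` interface are assumptions, while the five random training clips and the 10-clip test-time average come from the slide.

    import numpy as np

    def sample_training_clips(video, num_clips=5, clip_len=32):
        # Randomly extract five 2-second clips (clip_len frames at the video's frame rate).
        starts = np.random.randint(0, len(video) - clip_len + 1, size=num_clips)
        return [video[s:s + clip_len] for s in starts]

    def predict_video(model, clips):
        # Average the class probabilities of 10 clips to get the video-level prediction.
        probs = np.stack([model(clip) for clip in clips[:10]])
        return probs.mean(axis=0).argmax()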
Datasets - Sports Video Classification: Sport1M: 1 million YouTube videos, 487 sport categories. Videos are a few minutes long, in the wild, with a less steady camera and noisier labels.
Sport1M Results: The method is not state of the art. Note that the method of [29] uses long clips, so its clip-level accuracy is not directly comparable to that of C3D and DeepVideo.
C3D Video Descriptor: A model is trained on Sport1M and then kept fixed. A 4096-dimensional video descriptor is computed by averaging the FC6 activations of this model over 16-frame clips with stride 8, followed by L2 normalization. A multiclass linear SVM is trained on the descriptor. The descriptor is compared with other descriptors on several datasets.
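A hedged sketch of this pipeline; `c3d_fc6` (returning the 4096-d FC6 activations of the fixed Sport1M model for one clip) is a hypothetical helper.

    import numpy as np
    from sklearn.svm import LinearSVC

    def video_descriptor(video_frames, c3d_fc6, clip_len=16, stride=8):
        # Average FC6 over 16-frame clips taken with stride 8, then L2-normalize.
        feats = [c3d_fc6(video_frames[t:t + clip_len])
                 for t in range(0, len(video_frames) - clip_len + 1, stride)]
        desc = np.mean(feats, axis=0)
        return desc / np.linalg.norm(desc)

    def train_classifier(descriptors, labels):
        # Multiclass linear SVM on the fixed 4096-d descriptors.
        return LinearSVC().fit(descriptors, labels)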
Visualization of C3D Descriptor We observe that C3D starts by focusing on appearance in the first few frames and tracks the salient motion in the subsequent frames
Results of Action Recognition on UCF101: Using the C3D descriptor, results were state of the art only when combined with the hand-crafted video descriptor iDT. C3D is the best among methods that use RGB input only (the comparison covers CNN methods, RGB-only methods, and all possible feature combinations).
C3D Descriptor Characteristics: Compact: on UCF101, accuracy when reducing dimensions with PCA is better than with other descriptors. More generic: visualized with t-SNE and compared with features extracted by 2D convolutions. Fast: 313 fps on a GPU, about 100 times faster than iDT.
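A sketch of the compactness experiment: project the 4096-d descriptors to a lower dimension with PCA before the linear SVM (the particular target dimension is illustrative).

    from sklearn.decomposition import PCA
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    def low_dim_classifier(n_components=10):
        # Reduce the 4096-d C3D descriptor to n_components dimensions, then classify.
        return make_pipeline(PCA(n_components=n_components), LinearSVC())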
Deconvolution Examples A feature map from Conv2 is learning moving edges and blobs
Deconvolution Examples A feature map from Conv3 is learning moving body parts
Deconvolution Examples A feature map from Conv5 is learning more complex movements like biking
Disadvantages of C3D: C3D is limited to a temporal support of 16 consecutive frames. No optical flow is added to the CNN; other works showed that adding it improves results. Other methods had better performance on each dataset.
Second Paper: April 2015, presented at CVPR 2015.
Motivation: combining information over longer videos than previous methods. Proposals: a convolutional temporal feature pooling architecture, and LSTM cells connected to the CNN convolutional output.
Optical Flow: d_t(u, v) is the displacement vector at the point (u, v) in frame t, which moves the point to the corresponding point in the following frame t+1. In a close-up of dense optical flow, the horizontal and vertical components of the vector field, d_t^x and d_t^y, can be seen as image channels.
Optical Flow Stacking: To represent the motion across a sequence of frames, the flow channels d_t^x, d_t^y of L consecutive frames are stacked to form a total of 2L input channels.
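A minimal sketch of the stacking step, assuming the per-frame flow fields d_t^x, d_t^y have already been computed (e.g. with an off-the-shelf optical flow method).

    import numpy as np

    def stack_flow(flow_x, flow_y, t, L=10):
        # flow_x, flow_y: lists of HxW arrays (horizontal / vertical displacement per frame).
        # Returns a (2L, H, W) input volume for the temporal stream starting at frame t.
        channels = []
        for k in range(L):
            channels.append(flow_x[t + k])    # d_{t+k}^x
            channels.append(flow_y[t + k])    # d_{t+k}^y
        return np.stack(channels, axis=0)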
Motivation for the Combination: RGB alone sees a brush and hair, or a brush and teeth; optical flow alone sees a hand moving periodically at some spatial location. The combination discriminates the action.
First Method: Convolutional Temporal Feature Pooling. Temporal feature pooling is performed on the last convolutional layer of GoogLeNet.
First Method, Implementation Details: Dataset: Sport1M. Clips of 120 frames are cropped at 1 fps. Each frame is fed forward through GoogLeNet. Several methods are tried for feature pooling across the frames in a clip. For prediction, clips with different starting points are averaged.
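A sketch of one pooling variant, max pooling over time on the last convolutional feature maps (the feature map shape of the GoogLeNet-like backbone is illustrative).

    import torch

    def conv_temporal_max_pool(frame_features):
        # frame_features: (T, C, H, W) last-conv-layer features of the T frames in a clip.
        # Max over the time axis gives a single (C, H, W) clip-level feature map.
        return frame_features.max(dim=0).values

    clip_features = torch.randn(120, 1024, 7, 7)       # 120 frames through the backbone
    pooled = conv_temporal_max_pool(clip_features)     # -> (1024, 7, 7), fed to the classifier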
Results: Convolutional Temporal Feature Pooling. Dataset: Sport1M. Late pooling is the worst; it doesn't preserve spatial information.
Second Method - LSTM Architecture: Pooling is order-invariant; an LSTM is more natural for sequences.
Video Classification LSTM Architecture: 5 LSTM layers with 512 cells each. The input is the last convolutional layer of GoogLeNet for each frame. A prediction is made after each frame. In training, the weight of the loss grows linearly from 0 to 1 through the video.
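A sketch of the LSTM head and the growing loss weights. The 5 layers of 512 cells, the per-frame predictions, and the 0-to-1 weighting come from the slide; the feature dimension and number of classes are illustrative.

    import torch
    import torch.nn as nn

    class VideoLSTM(nn.Module):
        def __init__(self, feat_dim=1024, num_classes=487):
            super().__init__()
            self.lstm = nn.LSTM(feat_dim, 512, num_layers=5, batch_first=True)  # 5 layers, 512 cells
            self.classifier = nn.Linear(512, num_classes)

        def forward(self, frame_feats):            # (B, T, feat_dim) per-frame CNN features
            out, _ = self.lstm(frame_feats)
            return self.classifier(out)            # a prediction after every frame: (B, T, classes)

    def sequence_loss(logits, labels):
        # Per-frame cross-entropy, weighted linearly from 0 to 1 across the video.
        B, T, C = logits.shape
        weights = torch.linspace(0, 1, T)
        per_frame = nn.functional.cross_entropy(
            logits.reshape(B * T, C), labels.repeat_interleave(T), reduction='none').reshape(B, T)
        return (per_frame * weights).mean()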
Optical Flow Fusion Implementation: Processing videos at 1 fps loses local motion information, so an optical flow stream is added. Each stream is fed forward with the same two methods mentioned before. Fusion is performed only at the softmax layer.
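A sketch of the softmax-level fusion; the equal weighting of the two streams is an assumption, since the slide only says the fusion happens at the softmax layer.

    import torch

    def late_fusion(rgb_logits, flow_logits, w_rgb=0.5):
        # Average the softmax scores of the appearance and optical-flow streams.
        rgb = torch.softmax(rgb_logits, dim=-1)
        flow = torch.softmax(flow_logits, dim=-1)
        return w_rgb * rgb + (1 - w_rgb) * flow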
Results: Sport1M results of the different methods. Optical flow fusion does not improve results, due to shaky videos. State-of-the-art performance on Sport1M.
Results: UCF101 results based on raw frames only; state-of-the-art results on UCF101 when fusing optical flow (not anymore).
Disadvantages of the Two Methods: Feature pooling is less generic for videos of arbitrary length. The LSTM is tested on only 30 frames, giving less global context. Optical flow computation is slow and makes elegant end-to-end training harder. Fusion with optical flow happens only at the softmax, which is not ideal for information sharing.
Third Paper: Two-Stream Fusion. April 2016, presented at CVPR 2016.
Motivation: Spatial + optical flow fusion: registering appearance recognition (spatial cue) with optical flow recognition (temporal cue) at the pixel level. Temporal fusion: how these cues evolve over time.
Two Networks to Fuse Spatial fusion: Per frame Temporal fusion: Between frames
Challenges in the Fusion Process How to spatially fuse? Which channel in one network corresponds to a channel of the other network? Where to fuse the networks spatially? How to perform temporal fusion between frames?
Declarations: Fusion function: y = f(x^a, x^b). Feature maps: x^a, x^b of the two streams. Output map: y. For simplicity, assume that W (width), H (height), and D (number of channels) are the same for the respective feature maps.
Possible Spatial Fusion Methods: sum fusion, max fusion, concatenation fusion, conv fusion, bilinear fusion.
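A hedged sketch of these fusion functions on two (B, D, H, W) feature maps x_a and x_b; the exact definitions used here (e.g. the 1x1 convolution in conv fusion, the sum over locations in bilinear fusion) follow the common formulation and should be treated as assumptions.

    import torch
    import torch.nn as nn

    def sum_fusion(x_a, x_b):
        return x_a + x_b                           # add the two maps channel by channel

    def max_fusion(x_a, x_b):
        return torch.maximum(x_a, x_b)             # element-wise maximum

    def cat_fusion(x_a, x_b):
        return torch.cat([x_a, x_b], dim=1)        # stack channels: output has 2D channels

    class ConvFusion(nn.Module):
        # Concatenation followed by a 1x1 convolution that learns which channel of one
        # stream to combine with which channel of the other.
        def __init__(self, channels):
            super().__init__()
            self.proj = nn.Conv2d(2 * channels, channels, kernel_size=1)

        def forward(self, x_a, x_b):
            return self.proj(torch.cat([x_a, x_b], dim=1))

    def bilinear_fusion(x_a, x_b):
        # Outer product of the two feature vectors at every location, summed over locations.
        B, D, H, W = x_a.shape
        a, b = x_a.flatten(2), x_b.flatten(2)                    # (B, D, H*W)
        return torch.einsum('bdn,ben->bde', a, b).flatten(1)     # (B, D*D)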
Where to Spatially Fuse? Two examples (based on VGG): 1. Fusion after the 4th conv layer: only a single network tower is used from the point of fusion onward. 2. Fusion at two layers (after conv5 and after fc8): both network towers are kept. The choice influences the number of parameters.
Spatial Fusion Methods - Comparison: Dataset: UCF101. Model: 8-layer VGG-M.
Spatial Fusion Methods - Comparison: Dataset: UCF101. Model: 8-layer VGG-M. Answer to our question of how to spatially fuse: conv fusion.
Where to Spatially Fuse? Dataset: UCF101. Model: 8-layer VGG-M.
Where to Spatially Fuse? Dataset: UCF101. Model: 8-layer VGG-M. Answer to our question of where to spatially fuse: at ReLU5, or at ReLU5 + FC8 (but this nearly doubles the number of parameters).
Temporal Fusion at Pool5, spatiotemporal options: 2D pooling ignores time and performs spatial pooling only; the network predictions are averaged over time. 3D pooling stacks feature maps across frames and pools from a local spatiotemporal neighborhood, with no pooling across channels. 3D conv + 3D pooling additionally performs a convolution with a fusion kernel that spans the feature channels from both streams, space, and time before the 3D pooling; this replaces the single-frame conv fusion.
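A sketch of the 3D conv + 3D pooling variant; the 3x3x3 fusion kernel and 2x2x2 pooling follow the description above, while the channel count and feature-map shapes are illustrative.

    import torch
    import torch.nn as nn

    class SpatioTemporalFusion(nn.Module):
        # Stack the two streams' feature maps over time, fuse them with a 3D conv whose
        # kernel spans the channels of both streams, space and time, then 3D max-pool.
        def __init__(self, channels):
            super().__init__()
            self.fuse = nn.Conv3d(2 * channels, channels, kernel_size=3, padding=1)
            self.pool = nn.MaxPool3d(kernel_size=2)   # pools a local spatiotemporal neighborhood

        def forward(self, spatial_feats, temporal_feats):
            # each input: (B, C, T, H, W) conv5 feature maps stacked across T frames
            x = torch.cat([spatial_feats, temporal_feats], dim=1)
            return self.pool(self.fuse(x))

    y = SpatioTemporalFusion(512)(torch.randn(1, 512, 4, 14, 14),
                                  torch.randn(1, 512, 4, 14, 14))   # -> (1, 512, 2, 7, 7)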
Combining It All Together: Model: VGG-16. Fusion at ReLU5 + after the softmax (2 towers are kept). Spatiotemporal fusion + 3D pooling in the fused tower; 3D pooling in the temporal tower. L = 10: number of optical flow images around each frame. T = 5: number of frames per video clip (for testing and training). τ: frame distance between sampled frames, selected randomly from [3, 10]. The prediction is averaged over both towers.
Results
Disadvantages of This Method: Tested on only 5 frames per video, which is not enough context for longer videos. Not tested on the bigger and more general Sport1M dataset. Relies heavily on optical flow, so it may not work on many real-life, non-stabilized videos. As before, using optical flow is slow and does not allow end-to-end training.
Summary & Conclusions: Four different approaches to video classification were shown. Each method performed better on different tasks or datasets. Optical flow improves performance. Adding hand-crafted features improved the results, hinting that there is still room for improvement in CNN approaches.
Disadvantages Of LSTM Methods [slide source: cs231n course, Fei-Fei Li & Andrej Karpathy & Justin Johnson lecture 14, slide 36]
Another Elegant Method
Brief Overview [slide source: cs231n course, Fei-Fei Li & Andrej Karpathy & Justin Johnson lecture 14, slide 31]
Brief Overview [slide source: cs231n course, Fei-Fei Li & Andrej Karpathy & Justin Johnson lecture 14, slide 33]
Brief Overview [slide source: cs231n course, Fei-Fei Li & Andrej Karpathy & Justin Johnson lecture 14, slide 37]
Questions?