Multilayer and Multimodal Fusion of Deep Neural Networks for Video Classification
Xiaodong Yang, Pavlo Molchanov, Jan Kautz
INTELLIGENT VIDEO ANALYTICS: Surveillance event detection; human-computer interaction; multimedia search and indexing. Video classification.
INTELLIGENT VIDEO ANALYTICS: Related Work. Local feature extraction: dense trajectories (H. Wang et al., ICCV 2013). Global feature representation: bag-of-visual-words (J. Gemert et al., TPAMI 2009); Fisher vector (F. Perronnin et al., ECCV 2010). Temporal modeling: spatio-temporal pyramid (X. Yang et al., ECCV 2014).
INTELLIGENT VIDEO ANALYTICS: Related Work. 2D-CNN (A. Karpathy et al., CVPR 2014); C3D (D. Tran et al., ICCV 2015); two-stream networks (K. Simonyan et al., NIPS 2014); LSTM (J. Ng et al., CVPR 2015).
OUR CONTRIBUTIONS: Local feature extraction: multilayer representations from CNN. Global feature representation: multimodal representations; fusion by boosting. Temporal modeling: structure of FC-RNN. Figure: Overview of multilayer and multimodal fusion for video classification.
MULTILAYER REPRESENTATIONS: Dense image prediction. FCN by Long et al.; FlowNet by Fischer et al.
MULTILAYER REPRESENTATIONS: Features of conv layers: poses, parts, articulations, objects, etc. (visualization by Zeiler et al.).
MULTILAYER REPRESENTATIONS: Convert feature maps to feature descriptors. Feature maps of dimension 28 × 28 × 5 become 28 × 28 feature descriptors of dimension 5.
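The conversion above can be sketched in a few lines. This is a minimal illustration, assuming the feature maps are stored channel-first as a NumPy array; the function name `maps_to_descriptors` and the random toy input are hypothetical and not taken from the slides.

```python
import numpy as np

def maps_to_descriptors(feature_maps):
    """Turn (C, H, W) feature maps into H*W descriptors of dimension C."""
    c, h, w = feature_maps.shape
    # Move channels last, then flatten the spatial grid: (H, W, C) -> (H*W, C).
    return feature_maps.transpose(1, 2, 0).reshape(h * w, c)

# Toy input with the sizes from the slide: 5 feature maps of size 28 x 28.
maps = np.random.default_rng(0).standard_normal((5, 28, 28))
descriptors = maps_to_descriptors(maps)
print(descriptors.shape)  # (784, 5)
```

Each row of `descriptors` is the vector of activations across all channels at one spatial location, so location (3, 7) ends up in row 3 * 28 + 7.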
MULTILAYER REPRESENTATIONS: Learn spatial discriminative weights of conv layers. Spatial information of conv layers is used to enhance representations. Figure: video frames; feature maps of a conv layer over time; spatial weights of a conv layer (importance).
MULTILAYER REPRESENTATIONS: Aggregate feature descriptors by Fisher vector (FV). Figure: feature maps of a conv layer over time; Gaussian mixture model.
MULTILAYER REPRESENTATIONS: Represent conv layers by improved Fisher vector (iFV). Figure: feature maps of a conv layer over time; spatial weights of a conv layer (importance); Gaussian mixture model.
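A rough sketch of Fisher vector aggregation with optional per-descriptor weights follows. The FV part uses the standard gradients with respect to the means and standard deviations of a diagonal GMM, followed by power and L2 normalization. How the spatial weights enter the iFV is not spelled out in the slides, so scaling each descriptor's contribution by a normalized weight `a` is an assumption made here for illustration only. The function name and the toy GMM parameters are likewise hypothetical.

```python
import numpy as np

def fisher_vector(x, pi, mu, sigma, a=None):
    """x: (T, D) descriptors; pi: (K,) mixture weights; mu, sigma: (K, D)
    means and standard deviations of a diagonal GMM; a: optional (T,)
    per-descriptor weights (assumed weighting scheme, see lead-in)."""
    t = x.shape[0]
    # Uniform weights 1/T reproduce the plain Fisher vector.
    a = np.full(t, 1.0 / t) if a is None else np.asarray(a, float) / np.sum(a)
    z = (x[:, None, :] - mu[None]) / sigma[None]            # (T, K, D)
    # Log posterior up to a constant shared by all components.
    log_p = np.log(pi) - 0.5 * (z ** 2).sum(-1) - np.log(sigma).sum(-1)
    log_p -= log_p.max(axis=1, keepdims=True)
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)               # (T, K)
    g = (gamma * a[:, None])[:, :, None]                    # (T, K, 1)
    g_mu = (g * z).sum(0) / np.sqrt(pi)[:, None]            # (K, D)
    g_sigma = (g * (z ** 2 - 1)).sum(0) / np.sqrt(2 * pi)[:, None]
    fv = np.concatenate([g_mu.ravel(), g_sigma.ravel()])    # (2 * K * D,)
    fv = np.sign(fv) * np.sqrt(np.abs(fv))                  # power normalization
    return fv / (np.linalg.norm(fv) + 1e-12)                # L2 normalization

rng = np.random.default_rng(0)
x = rng.standard_normal((784, 5))      # e.g. 28 x 28 descriptors of dimension 5
pi = np.full(4, 0.25)                  # toy GMM with K = 4 components
mu = rng.standard_normal((4, 5))
sigma = np.ones((4, 5))
spatial_w = rng.random(784)            # stand-in for learned spatial weights

fv_plain = fisher_vector(x, pi, mu, sigma)
fv_weighted = fisher_vector(x, pi, mu, sigma, a=spatial_w)
print(fv_plain.shape)  # (40,)
```

In practice the GMM would be fitted on training descriptors and the spatial weights would come from the learned importance map; both are replaced by random toy values here.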
MULTILAYER REPRESENTATIONS: Represent conv layers by improved Fisher vector (iFV); represent fc layers by temporal max pooling. Figure: Overview of multilayer representation.
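Temporal max pooling of fc-layer activations is simple enough to show directly. The sketch below assumes the activations for one video are stacked as a (T, D) array, one row per frame or snippet; the function name and toy values are illustrative only.

```python
import numpy as np

def temporal_max_pool(fc_activations):
    """fc_activations: (T, D) fc-layer outputs over T time steps -> (D,)."""
    return fc_activations.max(axis=0)

acts = np.array([[0.1, 2.0, -1.0],
                 [0.5, 1.0,  0.0],
                 [0.3, 3.0, -2.0]])
video_repr = temporal_max_pool(acts)
print(video_repr)  # [0.5 3.  0. ]
```

The result keeps, for every fc unit, its strongest response over the whole video, which yields a fixed-length representation regardless of video length.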
FC-RNN STRUCTURE: Modeling Temporal Dynamics. Don't be a hero: use pre-trained models. Many pre-trained models are available from ImageNet and Sports1M (VGG/C3D); they are trained on images/snippets and are transferred here to videos. Figure: standard RNN (VGG/C3D, fc layer, RNN) versus FC-RNN (VGG/C3D, FC-RNN).
FC-RNN STRUCTURE: Modeling Temporal Dynamics. RNN versus FC-RNN. Pre-trained CNN, fc layer: transfer to recurrent layers. Figure: Comparison of standard RNN and FC-RNN.
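The idea of transferring a pre-trained fc layer to a recurrent layer can be sketched as follows. The slide equations did not survive extraction, so this is an assumed formulation: the pre-trained fc weights `w_fc` and bias `b_fc` are reused for the input-to-hidden path, and only a new hidden-to-hidden matrix `w_hh` is added. The ReLU activation, the zero initial state, and all names are assumptions for illustration.

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def fc_rnn(inputs, w_fc, b_fc, w_hh):
    """inputs: (T, D_in) features entering the fc layer at each time step.
    w_fc: (D_h, D_in) and b_fc: (D_h,) are the pre-trained fc parameters.
    w_hh: (D_h, D_h) is the only newly introduced recurrent matrix."""
    h = np.zeros(w_fc.shape[0])
    states = []
    for x_t in inputs:
        # Pre-trained fc path plus the recurrent contribution.
        h = relu(w_fc @ x_t + w_hh @ h + b_fc)
        states.append(h)
    return np.stack(states)                                 # (T, D_h)

rng = np.random.default_rng(0)
inputs = rng.standard_normal((5, 8))                        # T = 5, D_in = 8
w_fc = rng.standard_normal((4, 8))
b_fc = rng.standard_normal(4)

# With a zero recurrent matrix the layer reduces to the original fc layer.
no_recurrence = fc_rnn(inputs, w_fc, b_fc, np.zeros((4, 4)))
plain_fc = relu(inputs @ w_fc.T + b_fc)

with_recurrence = fc_rnn(inputs, w_fc, b_fc, 0.1 * rng.standard_normal((4, 4)))
```

Under this assumption the network starts from the behavior of the pre-trained fc layer, and fine-tuning only has to learn the recurrent matrix from scratch, whereas a standard RNN stacked on top of the fc layer introduces both input-to-hidden and hidden-to-hidden weights that are untrained.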
MULTIMODAL REPRESENTATIONS: Static and dynamic information. 2D-CNN/3D-CNN with video frames/optical flow maps. Inputs: a single frame; a buffer of frames; a single flow map; a buffer of flow maps.
FUSION BY BOOSTING: Optimize a linear combination of predictions of multiple layers from multiple modalities using LPBoost. boost-u: learn uniform weights for all classes. boost-c: learn class-specific weights. With 4 layers and 4 modalities, M = 16.
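Once the weights are learned, applying the linear combination is straightforward. The sketch below only shows the fusion step for the two variants and does not solve the LPBoost linear program; the array layout (M predictors, N videos, C classes), the function name `fuse`, and the uniform toy weights are assumptions for illustration.

```python
import numpy as np

def fuse(scores, weights):
    """scores: (M, N, C) class scores from M layer/modality predictors.
    weights: (M,) for boost-u (shared by all classes) or (M, C) for boost-c
    (one weight per predictor and class). Returns fused scores (N, C)."""
    weights = np.asarray(weights, dtype=float)
    if weights.ndim == 1:
        return np.einsum("m,mnc->nc", weights, scores)      # boost-u
    return np.einsum("mc,mnc->nc", weights, scores)         # boost-c

rng = np.random.default_rng(0)
m, n, c = 16, 3, 101                   # 4 layers x 4 modalities, 101 classes
scores = rng.random((m, n, c))

w_u = np.full(m, 1.0 / m)              # toy boost-u weights
w_c = np.tile(w_u[:, None], (1, c))    # toy boost-c weights, same per class

fused_u = fuse(scores, w_u)
fused_c = fuse(scores, w_c)
predictions = fused_u.argmax(axis=1)   # predicted class index per video
```

With these toy weights both variants reduce to plain averaging; learned boost-c weights would differ per class, letting each class rely on the layers and modalities that discriminate it best.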
EXPERIMENTS: Benchmark datasets. UCF101: 13,320 videos in 101 classes (example: Skiing). HMDB51: 6,766 videos in 51 classes (example: Kissing).
EXPERIMENTS: FC-RNN. Up to 3% improvement: FC-RNN outperforms RNN and LSTM by 3.0% and 2.9%. Figure (error rate vs. epochs): Comparison of standard RNN and FC-RNN in training and testing of 3D-CNN-SF on UCF101.
EXPERIMENTS: Feature Aggregation. Up to 2.5% improvement. Figure: Comparison of FV and iFV to represent conv layers of different modalities (panels: a single frame; a single flow map; a buffer of frames; a buffer of flow maps).
EXPERIMENTS: Multilayer Fusion. Up to 8% improvement. Table: Classification accuracy of single layers over different modalities and multilayer fusion results.
EXPERIMENTS: Multimodal Fusion. Up to 6% improvement. Tables: Classification accuracy of different modalities and various combinations; comparison to the state-of-the-art results.
EXPERIMENTS: LPBoost. Figure: two pie charts, one over modalities and one over layers (conv4, conv5, fc6, fc7). Slice values: 29%, 17%, 0%, 38%, 50%, 31%, 23%, 12%.
EXPERIMENTS: Effect of Multimodal Fusion. Example classes: SKIING and SKIJET. 2D-CNN-SF predicts skijet (incorrect); multimodal fusion predicts skiing (correct).
EXPERIMENTS: Effect of Multimodal Fusion. Example classes: BOXING PUNCHING BAG and BOXING SPEEDING BAG. 2D-CNN-OF predicts boxing speeding bag (incorrect); multimodal fusion predicts boxing punching bag (correct).