Multilayer and Multimodal Fusion of Deep Neural Networks for Video Classification

Xiaodong Yang, Pavlo Molchanov, Jan Kautz

INTELLIGENT VIDEO ANALYTICS: Surveillance event detection; human-computer interaction; multimedia search and indexing; video classification.

INTELLIGENT VIDEO ANALYTICS: Related Work. Local feature extraction; global feature representation; temporal modeling. Dense trajectories (H. Wang et al., ICCV 2013); bag-of-visual-words (J. Gemert et al., TPAMI 2009); Fisher vector (F. Perronnin et al., ECCV 2010).

INTELLIGENT VIDEO ANALYTICS: Related Work. 2D-CNN (A. Karpathy et al., CVPR 2014); C3D (D. Tran et al., ICCV 2015); two-stream networks (K. Simonyan et al., NIPS 2014); LSTM (J. Ng, CVPR 2015).

OUR CONTRIBUTIONS: Local feature extraction: multilayer representations from CNN. Global feature representation: multimodal representations; fusion by boosting. Temporal modeling: structure of FC-RNN. [Figure: Overview of multilayer and multimodal fusion for video classification]

MULTILAYER REPRESENTATIONS: Dense image prediction. FCN by Long et al.; FlowNet by Fischer et al.

MULTILAYER REPRESENTATIONS: Features of conv layers. Poses, parts, articulations, objects, etc. Visualization by Zeiler et al.

MULTILAYER REPRESENTATIONS: Convert feature maps to feature descriptors. Feature maps of dimension 28 × 28 × 5 → 28 × 28 feature descriptors of dimension 5.
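
The conversion above can be sketched as a plain re-indexing: each spatial location of the conv layer yields one descriptor whose entries are the values of all feature maps at that location. This is a minimal sketch, not the authors' code; the channel-first nested-list layout and the function name `maps_to_descriptors` are assumptions made for illustration.

```python
def maps_to_descriptors(feature_maps):
    """Turn feature_maps[c][y][x] into H*W descriptors of length C.

    The channel-first layout is an assumption of this sketch.
    """
    channels = len(feature_maps)
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])
    return [
        [feature_maps[c][y][x] for c in range(channels)]
        for y in range(height)
        for x in range(width)
    ]


# Toy input with the sizes from the slide: 5 feature maps of 28 x 28.
# Map c is filled with the constant value c so the result is easy to check.
maps = [[[float(c) for _ in range(28)] for _ in range(28)] for c in range(5)]
descriptors = maps_to_descriptors(maps)
print(len(descriptors), len(descriptors[0]))  # 784 5
```

For a video, the descriptors of all frames (or snippets) would be pooled into one set before aggregation.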

MULTILAYER REPRESENTATIONS: Learn spatial discriminative weights of conv layers. Spatial information of conv layers is used to enhance representations. [Figure: video frames; feature maps of a conv layer over time; spatial weights (importance) of a conv layer]

MULTILAYER REPRESENTATIONS: Aggregate feature descriptors by Fisher vector (FV). [Figure: feature maps of a conv layer over time; Gaussian mixture model]

MULTILAYER REPRESENTATIONS: Represent conv layers by improved Fisher vector (iFV). [Figure: feature maps of a conv layer over time; spatial weights (importance) of a conv layer; Gaussian mixture model]
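
As a rough illustration of how spatial weights could enter a Fisher vector, the sketch below computes only the mean-gradient part of the standard FV for a diagonal-covariance GMM and scales each descriptor's contribution by a per-descriptor weight. The exact way the slides' iFV uses the weights is not recoverable from this transcript, so the weighting scheme, the function name `fisher_vector_means`, and the toy GMM parameters are all assumptions.

```python
import math


def fisher_vector_means(descriptors, weights, priors, means, sigmas):
    """Mean-gradient part of a Fisher vector for a diagonal GMM.

    weights[i] scales descriptor i (an assumed form of spatial weighting);
    all-ones weights give the plain FV mean gradient.
    """
    n, k_count, dim = len(descriptors), len(priors), len(means[0])
    fv = [[0.0] * dim for _ in range(k_count)]
    for x, w in zip(descriptors, weights):
        # Log of (prior * Gaussian density) for every mixture component.
        logp = []
        for k in range(k_count):
            s = math.log(priors[k])
            for d in range(dim):
                z = (x[d] - means[k][d]) / sigmas[k][d]
                s += -0.5 * z * z - math.log(sigmas[k][d])
                s -= 0.5 * math.log(2.0 * math.pi)
            logp.append(s)
        # Posterior (soft assignment) via a numerically stable softmax.
        top = max(logp)
        post = [math.exp(v - top) for v in logp]
        total = sum(post)
        for k in range(k_count):
            gamma = post[k] / total
            for d in range(dim):
                fv[k][d] += w * gamma * (x[d] - means[k][d]) / sigmas[k][d]
    return [
        fv[k][d] / (n * math.sqrt(priors[k]))
        for k in range(k_count)
        for d in range(dim)
    ]


# Hypothetical 2-component GMM over 2-dimensional descriptors.
priors = [0.5, 0.5]
means = [[0.0, 0.0], [4.0, 4.0]]
sigmas = [[1.0, 1.0], [1.0, 1.0]]
descs = [[0.5, 0.0], [4.0, 3.5], [1.0, 1.0]]

fv_plain = fisher_vector_means(descs, [1.0, 1.0, 1.0], priors, means, sigmas)
fv_weighted = fisher_vector_means(descs, [2.0, 0.5, 1.0], priors, means, sigmas)
print(len(fv_plain))  # 4
```

A full FV also includes gradients with respect to the variances, followed by normalization; those steps are omitted here to keep the sketch short.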

MULTILAYER REPRESENTATIONS: Represent conv layers by improved Fisher vector (iFV). Represent fc layers by temporal max pooling. [Figure: Overview of multilayer representation]
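
Temporal max pooling of fc activations can be sketched as an element-wise maximum over time. The function name `temporal_max_pool` and the toy activations are illustrative assumptions, not values from the talk.

```python
def temporal_max_pool(frame_features):
    """Element-wise max over time of per-frame (or per-snippet) fc activations."""
    return [max(values) for values in zip(*frame_features)]


# Three time steps, each with a 3-dimensional fc activation (toy values).
fc_over_time = [
    [0.1, 0.9, 0.0],
    [0.4, 0.2, 0.3],
    [0.2, 0.5, 0.8],
]
video_feature = temporal_max_pool(fc_over_time)
print(video_feature)  # [0.4, 0.9, 0.8]
```

The result has the same dimension as a single fc activation, regardless of the number of frames.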

FC-RNN STRUCTURE: Modeling Temporal Dynamics. Don't be a hero: use pre-trained models. Many pre-trained models from ImageNet and Sports1M. Images/snippets to videos. [Figure: standard RNN (VGG/C3D, fc layer, RNN) versus FC-RNN (VGG/C3D, FC-RNN)]

FC-RNN STRUCTURE: Modeling Temporal Dynamics. Comparison of standard RNN and FC-RNN. FC-RNN: the fc layer of a pre-trained CNN is transferred to the recurrent layer.
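
One way to read "transfer the fc layer to a recurrent layer" is to keep the pre-trained input-to-output weights of the fc layer and add a newly initialized hidden-to-hidden term. The sketch below follows that reading; the recurrence form, the ReLU activation, and all names (`fc_layer`, `fc_rnn`, `w_io`, `w_hh`) are assumptions, since the slide's equations did not survive extraction.

```python
def relu(v):
    return [max(0.0, a) for a in v]


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def fc_layer(w_io, b, x):
    """Pre-trained fully connected layer: relu(W_io x + b)."""
    return relu([s + bi for s, bi in zip(matvec(w_io, x), b)])


def fc_rnn(w_io, w_hh, b, inputs):
    """Assumed FC-RNN recurrence: h_t = relu(W_io x_t + W_hh h_{t-1} + b).

    w_io and b would come from the pre-trained fc layer; w_hh is new.
    """
    h = [0.0] * len(b)
    outputs = []
    for x in inputs:
        pre = [
            a + r + bi
            for a, r, bi in zip(matvec(w_io, x), matvec(w_hh, h), b)
        ]
        h = relu(pre)
        outputs.append(h)
    return outputs


# Toy parameters (hypothetical): identity fc weights, small recurrent weights.
w_io = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
w_hh = [[0.5, 0.0], [0.0, 0.5]]
sequence = [[1.0, 2.0], [3.0, -4.0]]

states = fc_rnn(w_io, w_hh, b, sequence)
print(states)  # [[1.0, 2.0], [3.5, 0.0]]
```

With `w_hh` set to zero, each output equals the plain fc layer output, so under this reading the pre-trained behavior is the starting point and only the recurrent weights are learned from scratch.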

MULTIMODAL REPRESENTATIONS: Static and dynamic information. 2D-CNN/3D-CNN with video frames/optical flow maps. Inputs: a single frame; a buffer of frames; a single flow map; a buffer of flow maps.

FUSION BY BOOSTING: Optimize a linear combination of predictions of multiple layers from multiple modalities. LPBoost: boost-u learns uniform weights for all classes; boost-c learns class-specific weights.
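
The difference between the two weighting schemes can be sketched as follows. This shows only how fused scores are formed from given weights; it does not implement LPBoost, and the weights below are hand-picked for illustration rather than learned. The function names and toy scores are assumptions.

```python
def fuse_uniform(predictions, weights):
    """boost-u style: one weight per predictor, shared by all classes.

    predictions[m][c] is the score of class c from predictor m
    (one predictor per layer and modality).
    """
    n_classes = len(predictions[0])
    return [
        sum(w * p[c] for w, p in zip(weights, predictions))
        for c in range(n_classes)
    ]


def fuse_class_specific(predictions, weights):
    """boost-c style: weights[m][c] is the weight of predictor m for class c."""
    n_classes = len(predictions[0])
    return [
        sum(w[c] * p[c] for w, p in zip(weights, predictions))
        for c in range(n_classes)
    ]


# Two predictors, two classes (toy scores).
preds = [[0.75, 0.25], [0.25, 0.75]]

fused_u = fuse_uniform(preds, [0.5, 0.5])
fused_c = fuse_class_specific(preds, [[1.0, 0.25], [0.0, 0.75]])
print(fused_u)  # [0.5, 0.5]
print(fused_c)  # [0.75, 0.625]
```

In the example, uniform weights leave the two classes tied, while class-specific weights let the first predictor dominate the first class.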

EXPERIMENTS: Benchmark datasets. UCF101: 13,320 videos in 101 classes (example class: Skiing). HMDB51: 6,766 videos in 51 classes (example class: Kissing).

EXPERIMENTS: FC-RNN. Up to 3% improvement: outperforms RNN and LSTM by 3.0% and 2.9%, respectively. [Figure: error rate versus epochs; comparison of standard RNN and FC-RNN in training and testing of 3D-CNN-SF on UCF101]

EXPERIMENTS: Feature Aggregation. Up to 2.5% improvement. Comparison of FV and iFV to represent conv layers of different modalities. [Figure: spatial weights (importance) of a conv layer for a single frame, a single flow map, a buffer of frames, and a buffer of flow maps]

EXPERIMENTS: Multilayer Fusion. Up to 8% improvement. Classification accuracy of single layers over different modalities and multilayer fusion results.

EXPERIMENTS: Multimodal Fusion. Up to 6% improvement. Classification accuracy of different modalities and various combinations. Comparison to the state-of-the-art results.

EXPERIMENTS: LPBoost. [Figure: charts over modalities and over layers (conv4, conv5, fc6, fc7); values shown, in source order: 29%, 17%, 0%, 38%, 50%, 31%, 23%, 12%]

EXPERIMENTS: Effect of Multimodal Fusion. 2D-CNN-SF: skijet (incorrect). Multimodal fusion: skiing (correct). [Figure: example videos of SKIING and SKIJET]

EXPERIMENTS: Effect of Multimodal Fusion. 2D-CNN-OF: boxing speed bag (incorrect). Multimodal fusion: boxing punching bag (correct). [Figure: example videos of BOXING PUNCHING BAG and BOXING SPEED BAG]