Multilayer and Multimodal Fusion of Deep Neural Networks for Video Classification

1 Multilayer and Multimodal Fusion of Deep Neural Networks for Video Classification Xiaodong Yang, Pavlo Molchanov, Jan Kautz

2 INTELLIGENT VIDEO ANALYTICS. Video: surveillance event detection; human-computer interaction; multimedia search and indexing.

6 INTELLIGENT VIDEO ANALYTICS: Related Work. Local feature extraction: dense trajectories (H. Wang et al., ICCV 2013). Global feature representation: bag-of-visual-words (J. Gemert et al., TPAMI 2009); Fisher vector (F. Perronnin et al., ECCV). Temporal modeling: spatio-temporal pyramid (X. Yang et al., ECCV 2014).

7 INTELLIGENT VIDEO ANALYTICS: Related Work. 2D-CNN (A. Karpathy et al., CVPR 2014); C3D (D. Tran et al., ICCV 2015); two-stream networks (K. Simonyan et al., NIPS 2014); LSTM (J. Ng, CVPR).

8 OUR CONTRIBUTIONS. Local feature extraction: multilayer representations from CNN. Global feature representation: multimodal representations, fusion by boosting. Temporal modeling: structure of FC-RNN. Figure: Overview of multilayer and multimodal fusion for video classification.

9 MULTILAYER REPRESENTATIONS. Dense image prediction: FCN by Long et al.; FlowNet by Fischer et al.

10 MULTILAYER REPRESENTATIONS. Features of conv layers: poses, parts, articulations, objects, etc. Visualization by Zeiler et al.

11 MULTILAYER REPRESENTATIONS. Convert feature maps to feature descriptors. Figure: feature maps of a conv layer and the feature descriptors derived from them.
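
The conversion on this slide can be illustrated with a small sketch. The slide's dimensions did not survive, so the layout here is an assumption: a conv layer is taken to produce C feature maps of size H x W for one frame, and each spatial location is read out as one descriptor of length C. The function name and the toy values are hypothetical.

```python
def maps_to_descriptors(feature_maps):
    """Turn C feature maps of size H x W into H*W descriptors of length C.

    Assumed layout: feature_maps[c][y][x] is the response of channel c
    at spatial location (y, x).
    """
    channels = len(feature_maps)
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])
    return [
        [feature_maps[c][y][x] for c in range(channels)]
        for y in range(height)
        for x in range(width)
    ]


# Toy input: 2 channels, each a 2 x 3 map (values are made up).
toy_maps = [
    [[1, 2, 3], [4, 5, 6]],
    [[10, 20, 30], [40, 50, 60]],
]
descriptors = maps_to_descriptors(toy_maps)
```

Under this reading, every frame contributes H*W descriptors, and descriptors from all frames of a video can be pooled into one set before aggregation.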

12 MULTILAYER REPRESENTATIONS. Learn spatial discriminative weights of conv layers; use the spatial information of conv layers to enhance representations. Figure panels: video frames; feature maps of a conv layer over time; spatial weights (importance) of a conv layer.

13 MULTILAYER REPRESENTATIONS. Aggregate feature descriptors by Fisher vector (FV). Figure panels: feature maps of a conv layer over time; Gaussian mixture model.

14 MULTILAYER REPRESENTATIONS. Represent conv layers by improved Fisher vector (iFV). Figure panels: feature maps of a conv layer over time; spatial weights (importance) of a conv layer; Gaussian mixture model.
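
To make the aggregation step concrete, here is a minimal sketch of a first-order Fisher vector (the gradient with respect to the GMM means) in which each descriptor carries a weight. The slides do not give the iFV formula, so the way the spatial weights enter (scaling each descriptor's contribution and normalizing by the total weight) is an assumption, and the GMM parameters are taken as already fitted. All names are hypothetical.

```python
import math


def gaussian_diag(x, mean, std):
    """Density of a diagonal-covariance Gaussian evaluated at x."""
    density = 1.0
    for xi, mi, si in zip(x, mean, std):
        z = (xi - mi) / si
        density *= math.exp(-0.5 * z * z) / (si * math.sqrt(2.0 * math.pi))
    return density


def weighted_fisher_vector(descriptors, weights, priors, means, stds):
    """First-order Fisher vector with per-descriptor weights.

    Assumption: weights[n] scales the contribution of descriptors[n].
    With all weights equal to 1 this reduces to the usual mean-gradient FV.
    """
    num_components = len(priors)
    dim = len(means[0])
    grads = [[0.0] * dim for _ in range(num_components)]
    for x, w in zip(descriptors, weights):
        likes = [p * gaussian_diag(x, m, s) for p, m, s in zip(priors, means, stds)]
        total = sum(likes)
        for k in range(num_components):
            gamma = likes[k] / total  # soft assignment of x to component k
            for d in range(dim):
                grads[k][d] += w * gamma * (x[d] - means[k][d]) / stds[k][d]
    total_weight = sum(weights)
    fisher = []
    for k in range(num_components):
        scale = 1.0 / (total_weight * math.sqrt(priors[k]))
        fisher.extend(g * scale for g in grads[k])
    return fisher


# Toy 1-D example with two symmetric components (made-up parameters).
toy_priors = [0.5, 0.5]
toy_means = [[-1.0], [1.0]]
toy_stds = [[1.0], [1.0]]
fv = weighted_fisher_vector([[-1.0], [1.0]], [1.0, 1.0], toy_priors, toy_means, toy_stds)
```

The output has one block of length D per mixture component, so K components over D-dimensional descriptors give a vector of length K*D. Setting a weight to zero removes that descriptor's contribution entirely.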

15 MULTILAYER REPRESENTATIONS. Represent conv layers by improved Fisher vector (iFV); represent fc layers by temporal max pooling. Figure: Overview of multilayer representation.
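
Temporal max pooling of fc activations can be sketched in a few lines. The assumption is that each frame (or snippet) yields one fc feature vector and that pooling takes the element-wise maximum over time. The function name and numbers are illustrative only.

```python
def temporal_max_pool(frame_features):
    """Element-wise maximum over time.

    frame_features[t][d] is assumed to be dimension d of the fc
    activation at time step t; the result has one value per dimension.
    """
    return [max(values) for values in zip(*frame_features)]


# Three time steps, three feature dimensions (made-up values).
toy_fc = [
    [0.1, 0.9, 0.3],
    [0.5, 0.2, 0.4],
    [0.3, 0.6, 0.8],
]
video_feature = temporal_max_pool(toy_fc)
```

The pooled vector has the same length as a single fc activation, so videos of different durations map to representations of equal size.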

20 FC-RNN STRUCTURE: Modeling Temporal Dynamics. Don't be a hero: use pre-trained models. Many pre-trained models from ImageNet and Sports1M. Figure labels: images/snippets; videos; VGG/C3D; fc layer; RNN; standard RNN versus FC-RNN.

21 FC-RNN STRUCTURE: Modeling Temporal Dynamics. FC-RNN: transfer the fc layer of a pre-trained CNN to a recurrent layer. Figure: Comparison of standard RNN and FC-RNN.
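
One way to read the FC-RNN idea is sketched below: the pre-trained fc weights keep their role on the input, and a single new recurrent matrix feeds the previous output of the same layer back in. The slides do not spell out the equations, so the update rule, the ReLU activation, and all names here are assumptions for illustration.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def relu(vector):
    return [max(0.0, v) for v in vector]


def fc_rnn_forward(inputs, w_fc, b_fc, w_rec):
    """Run an fc layer turned recurrent over a sequence.

    w_fc, b_fc: weights and bias assumed to come from a pre-trained fc layer.
    w_rec: the only new parameter, mapping the previous output to the
           current pre-activation (hypothetical name).
    """
    state = [0.0] * len(b_fc)
    outputs = []
    for x in inputs:
        feed = matvec(w_fc, x)        # pre-trained path
        recur = matvec(w_rec, state)  # newly added recurrent path
        state = relu([f + r + b for f, r, b in zip(feed, recur, b_fc)])
        outputs.append(state)
    return outputs


# Toy setup: 2-D inputs, identity fc weights, zero bias (made-up values).
toy_w_fc = [[1.0, 0.0], [0.0, 1.0]]
toy_b_fc = [0.0, 0.0]
toy_inputs = [[1.0, 2.0], [3.0, -4.0]]

no_memory = fc_rnn_forward(toy_inputs, toy_w_fc, toy_b_fc, [[0.0, 0.0], [0.0, 0.0]])
with_memory = fc_rnn_forward(toy_inputs, toy_w_fc, toy_b_fc, [[0.5, 0.0], [0.0, 0.5]])
```

With a zero recurrent matrix the layer behaves exactly like the original fc layer applied frame by frame, which is why this design can start from the pre-trained weights. A standard RNN stacked on top of the fc layer would instead add both an input-to-hidden and a hidden-to-hidden matrix to be trained from scratch.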

22 MULTIMODAL REPRESENTATIONS. Static and dynamic information: 2D-CNN/3D-CNN with video frames/optical flow maps. Inputs: a single frame; a buffer of frames; a single flow map; a buffer of flow maps.

24 FUSION BY BOOSTING. Optimize a linear combination of predictions of multiple layers from multiple modalities. LPBoost: boost-u learns uniform weights for all classes; boost-c learns class-specific weights. 4 layers and 4 modalities: M = 16.
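
The difference between the two weighting schemes can be shown with a small sketch. It only applies given weights to given scores; it does not solve the LPBoost linear program, and the weights and scores below are made up. The layer names follow the slides, while the modality labels and function names are hypothetical.

```python
def fuse_uniform(scores, weights):
    """boost-u style: one weight per predictor, shared by all classes.

    scores[m][c] is the score of predictor m for class c.
    """
    num_classes = len(scores[0])
    return [sum(w * s[c] for w, s in zip(weights, scores)) for c in range(num_classes)]


def fuse_class_specific(scores, weights):
    """boost-c style: weights[m][c] is a separate weight per predictor and class."""
    num_classes = len(scores[0])
    return [sum(w[c] * s[c] for w, s in zip(weights, scores)) for c in range(num_classes)]


def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])


# One predictor per (layer, modality) pair gives M = 4 * 4 = 16.
layers = ["conv4", "conv5", "fc6", "fc7"]
modalities = ["single frame", "frame buffer", "single flow map", "flow map buffer"]
num_predictors = len(layers) * len(modalities)

# Toy case with 2 predictors and 3 classes to keep the numbers readable.
toy_scores = [[0.2, 0.5, 0.3], [0.6, 0.1, 0.3]]
fused_u = fuse_uniform(toy_scores, [0.25, 0.75])
fused_c = fuse_class_specific(toy_scores, [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]])
```

In this toy case the two schemes pick different classes, which illustrates why class-specific weights can change the decision: a predictor may be trusted for some classes and ignored for others.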

25 EXPERIMENTS: Benchmark datasets. UCF101: 13,320 videos in 101 classes (example shown: Skiing). HMDB51: 6,766 videos in 51 classes (example shown: Kissing).

27 EXPERIMENTS: FC-RNN. Up to 3% improvement: FC-RNN outperforms RNN and LSTM by 3.0% and 2.9%. Figure: Comparison of standard RNN and FC-RNN in training and testing of 3D-CNN-SF on UCF101 (axes: error rate versus epochs).

29 EXPERIMENTS: Feature Aggregation. Up to 2.5% improvement. Figure: Comparison of FV and iFV to represent conv layers of different modalities (a single frame; a single flow map; a buffer of frames; a buffer of flow maps), shown with the spatial weights (importance) of a conv layer.

33 EXPERIMENTS: Multilayer Fusion. Up to 8% improvement. Table: Classification accuracy of single layers over different modalities and multilayer fusion results.

34 EXPERIMENTS: Multimodal Fusion. Up to 6% improvement. Tables: Classification accuracy of different modalities and various combinations; comparison to the state-of-the-art results.

35 EXPERIMENTS: LPBoost. Figure: two charts, one over Layers (conv4, conv5, fc6, fc7) and one over Modalities. Values as extracted, in original order: 29%, 17%, 0%, 38%, 50%, 31%, 23%, 12%.

36 EXPERIMENTS: Effect of Multimodal Fusion. Class pair: SKIING versus SKIJET. 2D-CNN-SF predicts skijet (incorrect); multimodal fusion predicts skiing (correct).

37 EXPERIMENTS: Effect of Multimodal Fusion. Class pair: BOXING PUNCHING BAG versus BOXING SPEEDING BAG. 2D-CNN-OF predicts boxing speeding bag (incorrect); multimodal fusion predicts boxing punching bag (correct).

38 OUR CONTRIBUTIONS. Local feature extraction: multilayer representations from CNN. Global feature representation: multimodal representations, fusion by boosting. Temporal modeling: structure of FC-RNN. Figure: Overview of multilayer and multimodal fusion for video classification.

Multilayer and Multimodal Fusion of Deep Neural Networks for Video Classification Xiaodong Yang Pavlo Molchanov Jan Kautz NVIDIA {xiaodongy, pmolchanov, jkautz}@nvidia.com ABSTRACT This paper presents