Multilayer and Multimodal Fusion of Deep Neural Networks for Video Classification

Transcription:

Multilayer and Multimodal Fusion of Deep Neural Networks for Video Classification
Xiaodong Yang, Pavlo Molchanov, Jan Kautz

INTELLIGENT VIDEO ANALYTICS
Surveillance event detection
Human-computer interaction
Multimedia search and indexing
Video classification

INTELLIGENT VIDEO ANALYTICS: Related Work
Local feature extraction: dense trajectories (H. Wang et al., ICCV 2013)
Global feature representation: bag-of-visual-words (J. Gemert et al., TPAMI 2009); Fisher vector (F. Perronnin et al., ECCV 2010)
Temporal modeling: spatio-temporal pyramid (X. Yang et al., ECCV 2014)

INTELLIGENT VIDEO ANALYTICS: Related Work
2D-CNN (A. Karpathy et al., CVPR 2014)
C3D (D. Tran et al., ICCV 2015)
Two-stream networks (K. Simonyan et al., NIPS 2014)
LSTM (J. Ng et al., CVPR 2015)

OUR CONTRIBUTIONS
Local feature extraction: multilayer representations from CNN
Global feature representation: multimodal representations; fusion by boosting
Temporal modeling: structure of FC-RNN
Figure: Overview of multilayer and multimodal fusion for video classification.

MULTILAYER REPRESENTATIONS
Dense image prediction: FCN by Long et al.; FlowNet by Fischer et al.

MULTILAYER REPRESENTATIONS
Features of conv layers: poses, parts, articulations, objects, etc.
Figure: Visualization by Zeiler et al.

MULTILAYER REPRESENTATIONS
Convert feature maps to feature descriptors.
Feature maps of dimension 28 × 28 × 5 → 28 × 28 feature descriptors of dimension 5.
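
The conversion on this slide can be sketched in a few lines of Python. This is only an illustration under assumptions: the channel-first nested-list layout `feature_maps[c][y][x]` and the function name `maps_to_descriptors` are hypothetical, not taken from the slides. Each spatial location of the conv layer yields one descriptor whose length is the number of feature maps.

```python
def maps_to_descriptors(feature_maps):
    """Turn C feature maps of size H x W into H*W descriptors of length C.

    Assumed layout (hypothetical): feature_maps[c][y][x].
    """
    channels = len(feature_maps)
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])
    return [
        [feature_maps[c][y][x] for c in range(channels)]
        for y in range(height)
        for x in range(width)
    ]


# Toy input with the sizes from the slide: 5 feature maps of size 28 x 28.
toy_maps = [
    [[c * 1000 + y * 28 + x for x in range(28)] for y in range(28)]
    for c in range(5)
]
descriptors = maps_to_descriptors(toy_maps)
print(len(descriptors), len(descriptors[0]))  # 784 5
```

With the slide's sizes this gives 28 × 28 = 784 descriptors, each of dimension 5.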

MULTILAYER REPRESENTATIONS
Learn spatial discriminative weights of conv layers.
Spatial information of conv layers is used to enhance representations.
Figure: video frames, feature maps of a conv layer over time, and the spatial weights (importance) of a conv layer.

MULTILAYER REPRESENTATIONS
Aggregate feature descriptors by Fisher vector (FV).
Figure: feature maps of a conv layer over time, aggregated with a Gaussian mixture model.

MULTILAYER REPRESENTATIONS
Represent conv layers by improved Fisher vector (iFV).
Figure: feature maps of a conv layer over time, combined with the spatial weights (importance) of a conv layer and a Gaussian mixture model.
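
The slides do not give the iFV formula, so the following is a rough sketch under a stated assumption: each descriptor's spatial weight simply scales its contribution to a first-order Fisher vector computed against a diagonal-covariance Gaussian mixture model. The function names, the weighting scheme, and the toy GMM parameters are all hypothetical. Fitting the GMM and learning the spatial weights are not shown.

```python
import math


def gmm_posteriors(x, priors, means, sigmas):
    """Soft-assign descriptor x to each diagonal-covariance Gaussian."""
    log_p = []
    for pi_k, mu_k, sd_k in zip(priors, means, sigmas):
        ll = math.log(pi_k)
        for xd, md, sd in zip(x, mu_k, sd_k):
            ll += -0.5 * ((xd - md) / sd) ** 2 - math.log(sd)
            ll -= 0.5 * math.log(2.0 * math.pi)
        log_p.append(ll)
    top = max(log_p)  # subtract the max for numerical stability
    unnorm = [math.exp(v - top) for v in log_p]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def weighted_fisher_vector(descriptors, weights, priors, means, sigmas):
    """First-order Fisher vector with one (assumed) spatial weight per descriptor."""
    num_k, dim = len(means), len(means[0])
    w_sum = sum(weights)
    if w_sum <= 0:
        raise ValueError("weights must have a positive sum")
    acc = [[0.0] * dim for _ in range(num_k)]
    for x, w in zip(descriptors, weights):
        gamma = gmm_posteriors(x, priors, means, sigmas)
        for k in range(num_k):
            for d in range(dim):
                acc[k][d] += w * gamma[k] * (x[d] - means[k][d]) / sigmas[k][d]
    # Concatenate the K gradient blocks into one vector of length K * D.
    return [
        acc[k][d] / (w_sum * math.sqrt(priors[k]))
        for k in range(num_k)
        for d in range(dim)
    ]


# Toy GMM with K = 2 components in D = 2 dimensions (made-up values).
priors = [0.5, 0.5]
means = [[0.0, 0.0], [4.0, 4.0]]
sigmas = [[1.0, 1.0], [1.0, 1.0]]
descriptors = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
spatial_weights = [0.2, 1.0, 0.5]
fv = weighted_fisher_vector(descriptors, spatial_weights, priors, means, sigmas)
print(len(fv))  # 4
```

Under this assumption, uniform weights reduce the result to a plain unweighted first-order FV, and a location with weight zero is ignored entirely.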

MULTILAYER REPRESENTATIONS
Represent conv layers by improved Fisher vector (iFV).
Represent fc layers by temporal max pooling.
Figure: Overview of multilayer representation.
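
Temporal max pooling of fc-layer activations can be illustrated as an element-wise maximum over time. The sketch below assumes the fc features arrive as one list of floats per frame or snippet. The function name and the toy values are hypothetical.

```python
def temporal_max_pool(frame_features):
    """Element-wise max over time: T vectors of length D become one vector of length D."""
    return [max(values) for values in zip(*frame_features)]


# Three time steps of a toy 4-dimensional fc activation.
fc_over_time = [
    [0.1, 0.9, 0.0, 0.3],
    [0.4, 0.2, 0.0, 0.8],
    [0.3, 0.5, 0.7, 0.1],
]
video_feature = temporal_max_pool(fc_over_time)
print(video_feature)  # [0.4, 0.9, 0.7, 0.8]
```

The pooled vector has the same dimension as a single fc activation, regardless of how many time steps the video contains.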

FC-RNN STRUCTURE: Modeling Temporal Dynamics
Don't be a hero: use pre-trained models.
Many pre-trained models are available from ImageNet and Sports1M.
Figure: pre-trained VGG/C3D with its fc layer (images/snippets), compared with two video models built on VGG/C3D: a standard RNN and FC-RNN.

FC-RNN STRUCTURE: Modeling Temporal Dynamics
FC-RNN: the fc layer of the pre-trained CNN is transferred to recurrent layers.
Figure: Comparison of standard RNN and FC-RNN.
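
One way to read "transfer to recurrent layers" is sketched below. It assumes the pre-trained fc weights are reused unchanged as the input-to-hidden weights and that only a hidden-to-hidden matrix is newly added. The exact equations are not in this transcript, so the names `W_io`, `W_hh`, and `b`, the ReLU activation, and the toy numbers are all assumptions.

```python
def relu(vector):
    return [max(0.0, v) for v in vector]


def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def fc_layer(x, W_io, b):
    """Pre-trained fully connected layer: y = relu(W_io x + b)."""
    return relu([a + c for a, c in zip(matvec(W_io, x), b)])


def fc_rnn(inputs, W_io, b, W_hh):
    """Assumed FC-RNN step: h_t = relu(W_io x_t + W_hh h_{t-1} + b).

    W_io and b come from the pre-trained fc layer; only W_hh is new.
    """
    hidden = [0.0] * len(b)
    outputs = []
    for x in inputs:
        pre = [
            a + r + c
            for a, r, c in zip(matvec(W_io, x), matvec(W_hh, hidden), b)
        ]
        hidden = relu(pre)
        outputs.append(hidden)
    return outputs


# Toy "pre-trained" weights and a two-step input sequence (made-up values).
W_io = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
W_hh = [[0.5, 0.0], [0.0, 0.5]]
sequence = [[1.0, 2.0], [3.0, 4.0]]
print(fc_rnn(sequence, W_io, b, W_hh))  # [[1.0, 2.0], [3.5, 5.0]]
```

With `W_hh` set to zero the recurrence reproduces the original fc layer at every time step. This suggests why such a layer could start from the behaviour of the pre-trained network instead of training every recurrent weight from scratch.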

MULTIMODAL REPRESENTATIONS
Static and dynamic information.
2D-CNN/3D-CNN with video frames/optical flow maps.
Inputs: a single frame; a buffer of frames; a single flow map; a buffer of flow maps.

FUSION BY BOOSTING
Optimize a linear combination of the predictions of multiple layers from multiple modalities.
LPBoost:
boost-u: learn uniform weights for all classes
boost-c: learn class-specific weights
4 layers and 4 modalities: M = 16
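
The difference between the two weighting schemes can be sketched once the weights are known. The example below does not implement LPBoost. It assumes the weights have already been learned and only shows how M score vectors would be combined, with one weight per stream (boost-u) or one weight per stream and class (boost-c). The function names and all numbers are hypothetical, and M is reduced from 16 to 2 to keep the example short.

```python
def fuse_uniform(scores, alpha):
    """boost-u style: scores[m][c] weighted by alpha[m], shared across classes."""
    num_classes = len(scores[0])
    return [
        sum(alpha[m] * scores[m][c] for m in range(len(scores)))
        for c in range(num_classes)
    ]


def fuse_class_specific(scores, alpha):
    """boost-c style: scores[m][c] weighted by alpha[m][c]."""
    num_classes = len(scores[0])
    return [
        sum(alpha[m][c] * scores[m][c] for m in range(len(scores)))
        for c in range(num_classes)
    ]


def predict(fused):
    """Index of the class with the highest fused score."""
    return max(range(len(fused)), key=fused.__getitem__)


# Two streams (layer/modality pairs) scoring two classes; made-up values.
scores = [[0.6, 0.4], [0.2, 0.8]]
print(predict(fuse_uniform(scores, [1.0, 0.0])))  # 0
print(predict(fuse_uniform(scores, [0.5, 0.5])))  # 1
print(predict(fuse_class_specific(scores, [[1.0, 0.0], [0.0, 1.0]])))  # 1
```

With 4 layers and 4 modalities, `scores` would hold M = 16 rows, one per layer/modality pair.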

EXPERIMENTS: Benchmark Datasets
UCF101: 13,320 videos in 101 classes (example shown: Skiing)
HMDB51: 6,766 videos in 51 classes (example shown: Kissing)

EXPERIMENTS: FC-RNN
FC-RNN outperforms RNN and LSTM by 3.0% and 2.9% (up to 3% improvement).
Figure: Comparison of standard RNN and FC-RNN in training and testing of 3D-CNN-SF on UCF101 (error rate versus epochs).

EXPERIMENTS: Feature Aggregation
Up to 2.5% improvement.
Comparison of FV and iFV to represent conv layers of different modalities (a single frame, a single flow map, a buffer of frames, a buffer of flow maps).
Figure: spatial weights (importance) of a conv layer.

EXPERIMENTS: Multilayer Fusion
Up to 8% improvement.
Classification accuracy of single layers over different modalities and multilayer fusion results.

EXPERIMENTS: Multimodal Fusion
Up to 6% improvement.
Classification accuracy of different modalities and various combinations.
Comparison to the state-of-the-art results.

EXPERIMENTS: LPBoost
Figure: two charts of percentages, one over layers (conv4, conv5, fc6, fc7) and one over modalities. Values in transcript order: 29%, 17%, 0%, 38%, 50%, 31%, 23%, 12%.

EXPERIMENTS: Effect of Multimodal Fusion
2D-CNN-SF predicts "skijet" (incorrect); multimodal fusion predicts "skiing" (correct).
Figure: example frames of the classes Skiing and Skijet.

EXPERIMENTS: Effect of Multimodal Fusion
2D-CNN-OF predicts "boxing speeding bag" (incorrect); multimodal fusion predicts "boxing punching bag" (correct).
Figure: example frames of the classes Boxing Punching Bag and Boxing Speeding Bag.

OUR CONTRIBUTIONS
Local feature extraction: multilayer representations from CNN
Global feature representation: multimodal representations; fusion by boosting
Temporal modeling: structure of FC-RNN
Figure: Overview of multilayer and multimodal fusion for video classification.