Large-scale gesture recognition based on Multimodal data with C3D and TSN


July 6, 2017

1 Team details

Team name: ASU

Team leader name: Yunan Li

Team leader address, phone number and email address: Xidian University, No. 2 South Taibai Road, Xi'an, Shaanxi, China; phone: 18710849937, 13109501120; email: xdfzliyunan@163.com

Rest of the team members: Weikang Shi, Xin Xu, Zhenxin Ma, Qiguang Miao (supervisor)

Team website URL (if any):

Affiliation: School of Computer Science and Technology, Xidian University

2 Contribution details

Title of the contribution: Large-scale gesture recognition based on multimodal data with C3D and TSN

Final score: 67.71% (on the testing set)

General method description:
Our method consists of four parts. First, we enhance the RGB and depth data: single-scale retinex unifies the illumination of the RGB videos, and a median filter removes noise from the depth videos. Meanwhile, optical flow videos are generated as a third modality of data that captures the motion path. Second, two different sampling strategies are adopted: uniform sampling and sectional weighted sampling. Third, the sampled videos are fed to a C3D model for feature extraction; to obtain more comprehensive features, we also employ a Temporal Segment Network (TSN). Finally, the extracted features are blended to boost performance: features derived from the same modality of data are fused via canonical correlation analysis, and features of different modalities are fused by stacking.

References

[1] J. Wan, Y. Zhao, S. Zhou, I. Guyon, S. Escalera, and S. Z. Li. ChaLearn Looking at People RGB-D Isolated and Continuous Datasets for Gesture Recognition. CVPR Workshops, 2016.
[2] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning Spatiotemporal Features with 3D Convolutional Networks. ICCV, 2015.
[3] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional Architecture for Fast Feature Embedding. arXiv, 2014.
[4] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale Video Classification with Convolutional Neural Networks. CVPR, 2014.
[5] L. Wang et al. Temporal Segment Networks: Towards Good Practices for Deep Action Recognition. ECCV, 2016.
[6] T. Brox, A. Bruhn, N. Papenberg, and J. Weickert. High Accuracy Optical Flow Estimation Based on a Theory for Warping. ECCV, 2004.
[7] M. Haghighat, M. Abdel-Mottaleb, and W. Alhalabi. Fully Automatic Face Normalization and Single Sample Face Recognition in Unconstrained Environments. Expert Systems with Applications, 47(5):23-34, 2016.

Representative image / diagram of the method:
Figure 1: The diagram of our method.

Describe data preprocessing techniques applied (if any):

1. Data enhancement. We enhance the RGB and depth data according to their respective characteristics. Because RGB data easily suffer from varying illumination, we employ single-scale retinex, which builds on the theory that the chromaticity of objects remains relatively constant under varying illumination conditions, to remove the influence of illumination. The depth data, meanwhile, contain obvious noise, so we apply a median filter to alleviate it (both steps are sketched after this list).

2. Frame-number unification and data augmentation. An analysis of the distribution of the training data shows that most videos have 32 frames, so we sample every video to unify its frame count. To augment the amount of data, we adopt two sampling strategies. One is uniform sampling, in which frames are sampled according to the ratio of the original frame count to 32. The other is sectional weighted sampling: we first cut the video into several sections and calculate the average optical flow value of each section; the number of frames drawn from each section is then determined by its motion (both strategies are sketched after this list).
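Where the two enhancement steps are concrete enough to sketch, a minimal OpenCV version is given below. The fact sheet does not specify the retinex scale or the median kernel size, so sigma and ksize are placeholder values, and the normalization of the retinex output back to 8-bit is an assumption.

```python
# Sketch of the preprocessing step: single-scale retinex for RGB frames
# and median filtering for depth frames. Parameter values are assumptions.
import cv2
import numpy as np

def single_scale_retinex(rgb_frame, sigma=80.0):
    """Single-scale retinex: log(image) - log(Gaussian-estimated illumination)."""
    img = rgb_frame.astype(np.float32) + 1.0              # avoid log(0)
    illumination = cv2.GaussianBlur(img, (0, 0), sigma)   # smooth illumination estimate
    reflectance = np.log(img) - np.log(illumination)
    # rescale the reflectance estimate to the 8-bit range (assumed convention)
    reflectance -= reflectance.min()
    reflectance /= reflectance.max() + 1e-8
    return (reflectance * 255).astype(np.uint8)

def denoise_depth(depth_frame, ksize=5):
    """Median-filter a single-channel depth frame to suppress sensor noise."""
    return cv2.medianBlur(depth_frame, ksize)
```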
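The two sampling strategies can be sketched in the same spirit. The fact sheet does not state how many sections a video is cut into, so n_sections is a placeholder, and OpenCV's Farneback flow stands in here for the Brox et al. flow cited as [6].

```python
# Sketch of the two sampling strategies; frames are assumed to be a list
# of single-channel uint8 images from one video.
import cv2
import numpy as np

TARGET_FRAMES = 32  # most training videos have 32 frames (see above)

def uniform_sample(frames):
    """Sample 32 frames spaced by the ratio of the original length to 32."""
    idx = np.linspace(0, len(frames) - 1, TARGET_FRAMES).astype(int)
    return [frames[i] for i in idx]

def sectional_weighted_sample(frames, n_sections=4):
    """Give sections with stronger average optical flow a larger frame quota."""
    sections = np.array_split(np.arange(len(frames)), n_sections)
    motion = []
    for sec in sections:
        mags = []
        for a, b in zip(sec[:-1], sec[1:]):
            flow = cv2.calcOpticalFlowFarneback(
                frames[a], frames[b], None, 0.5, 3, 15, 3, 5, 1.2, 0)
            mags.append(float(np.linalg.norm(flow, axis=2).mean()))
        motion.append(np.mean(mags) if mags else 0.0)
    weights = np.asarray(motion) / (np.sum(motion) + 1e-8)
    quotas = np.round(weights * TARGET_FRAMES).astype(int)
    sampled = []
    for sec, quota in zip(sections, quotas):
        idx = np.linspace(0, len(sec) - 1, max(quota, 1)).astype(int)
        sampled.extend(frames[sec[i]] for i in idx)
    return sampled[:TARGET_FRAMES]  # rounding may leave a frame or two off
```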

3 Visual Analysis

3.1 Gesture Recognition (or/and Spotting) Stage

3.1.1 Features / Data representation

Describe features used or data representation model for the gesture recognition (or/and spotting) stage (if any):
The features we use for gesture recognition are extracted by the C3D model and TSN.

3.1.2 Dimensionality reduction

Dimensionality reduction technique applied for the gesture recognition (or/and spotting) stage (if any):
We use PCA for feature dimensionality reduction, as sketched below.
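The fact sheet does not give the retained dimensionality, so the numbers in this minimal scikit-learn sketch are placeholders.

```python
# Sketch of the PCA step on extracted features (shapes are hypothetical).
import numpy as np
from sklearn.decomposition import PCA

features = np.random.rand(500, 4096)   # e.g. one C3D fc-layer vector per video
pca = PCA(n_components=256)            # assumed target dimension
reduced = pca.fit_transform(features)  # -> shape (500, 256)
```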

3.1.3 Compositional model

Compositional model used, i.e. pictorial structure, for the gesture recognition (or/and spotting) stage (if any):

3.1.4 Learning strategy

Learning strategy applied for the gesture recognition (or/and spotting) stage (if any):
We adopt a learning-extracting-fusing strategy. After the preprocessing step, we obtain retinex-filtered RGB data, median-filtered depth data, and optical flow data. We then fine-tune a C3D model to extract features from these data. The TSN model is also trained on the depth and flow data (for TSN we use the uniformly sampled data only). After that, we fuse the features: for the data of one modality sampled with different strategies, we use canonical correlation analysis to calculate the weights of the features and fuse them by weighted addition; features of different modalities are then stacked together.

3.1.5 Other techniques

Other technique/strategy used not included in previous items for the gesture recognition (or/and spotting) stage (if any):

3.1.6 Method complexity

Method complexity for the gesture recognition (or/and spotting) stage:
The most expensive parts of our method are fine-tuning the C3D model and training TSN: 19.4 hours for C3D and 14.2 hours for TSN, with about 10 GB of GPU memory for C3D and about 4-6 GB for TSN. Classification is performed by a linear SVM, so its complexity is comparatively low.

3.2 Data Fusion Strategies

List data fusion strategies (how different feature descriptions are combined) for learning the model / network: single frame, early, slow, late (if any):
As mentioned above, we enhance the RGB and depth videos with the retinex algorithm and a median filter, respectively, and adopt different sampling schemes (uniform sampling and sectional weighted sampling) to obtain data at different sampling rates. Based on the statistical characteristics of the data of one modality, we use canonical correlation analysis to calculate the correlation coefficients between the features, obtain the fusion weights, and perform weighted fusion. For data of different modalities, fusion is carried out by stacking. Sketches of the fusion scheme and of the SVM classifier follow this section.
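One plausible reading of the fusion scheme, sketched with scikit-learn's CCA: the canonical correlations between the features of the two sampling variants yield a scalar weight for weighted addition, and different modalities are then stacked. The exact mapping from correlation coefficients to fusion weights is not given in the fact sheet, so the formula below is an assumption.

```python
# Sketch of the fusion scheme: CCA-derived weighted addition within a
# modality, stacking across modalities. The weight formula is an assumption.
import numpy as np
from sklearn.cross_decomposition import CCA

def cca_fusion_weight(feat_a, feat_b, n_components=16):
    """Mean canonical correlation between two same-modality feature sets."""
    cca = CCA(n_components=n_components)
    a_c, b_c = cca.fit_transform(feat_a, feat_b)
    corrs = [abs(np.corrcoef(a_c[:, i], b_c[:, i])[0, 1])
             for i in range(n_components)]
    return float(np.mean(corrs))

def fuse_same_modality(feat_uniform, feat_sectional):
    """Weighted addition of the two sampling variants of one modality."""
    w = cca_fusion_weight(feat_uniform, feat_sectional)
    return w * feat_uniform + (1.0 - w) * feat_sectional

def fuse_modalities(*modality_feats):
    """Different modalities (RGB, depth, flow) are fused by stacking."""
    return np.hstack(modality_feats)

# toy usage with random features of matching shape
rgb_u, rgb_s = np.random.rand(200, 64), np.random.rand(200, 64)
depth = np.random.rand(200, 64)
fused = fuse_modalities(fuse_same_modality(rgb_u, rgb_s), depth)
```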
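The final classifier is a linear SVM; a minimal scikit-learn sketch on the fused features follows. Hyperparameters and shapes are placeholders, and the 249-class label range assumes the ChaLearn IsoGD gesture vocabulary from [1].

```python
# Sketch of the linear-SVM classification stage on fused features.
import numpy as np
from sklearn.svm import LinearSVC

X_train = np.random.rand(800, 128)        # fused training features (hypothetical)
y_train = np.random.randint(0, 249, 800)  # 249 gesture classes, per IsoGD [1]
X_test = np.random.rand(200, 128)

clf = LinearSVC(C=1.0)                    # C is a placeholder setting
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
```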

3.3 Global Method Description

Which pre-trained or external methods have been used (for any stage, if any):
The C3D model is pre-trained on the Sports-1M dataset.

Which additional data has been used in addition to the provided ChaLearn training and validation data (at any stage, if any):
The optical flow data generated from the RGB data are used to concentrate on the gestures, and the filtered data are used to enhance the RGB and depth videos.

Qualitative advantages of the proposed solution:
1) Data-driven frame-number unification. 2) Video enhancement: filtering the RGB data according to retinex theory and the depth data with a median filter. 3) Data augmentation through different sampling schemes, i.e., uniform sampling and sectional weighted sampling. 4) Multimodal features extracted by the C3D model and TSN. 5) Canonical correlation analysis for calculating the weights of the features.

Results of the comparison to other approaches (if any):

Novelty degree of the solution and if it has been previously published:
The data enhancement, the augmentation strategy, and the fusion scheme are newly proposed and have not been published before.

4 Other details

Language and implementation details (including platform, memory, parallelization requirements):
Our experiments run on a PC with an Intel Core i7-6700 CPU @ 3.40 GHz × 8, 16 GB of RAM, and an Nvidia GeForce GTX TITAN X GPU. C3D and TSN training and feature extraction use the Caffe framework on Ubuntu 14.04 LTS; the remaining steps, including data enhancement, optical flow generation, the two sampling strategies, feature fusion, and SVM classification, are implemented in MATLAB R2015b on 64-bit Windows 7.

Human effort required for implementation, training and validation?
Setting the destination paths of the data generated in the preprocessing step may require manual effort, especially for the frame images for TSN (they need to be placed in the data folder of the TSN sub-project); the other steps can be executed by shell/MATLAB scripts if the settings are correct.

Training/testing expended time?
Training takes 19.4 hours for the C3D model and about 14.2 hours for TSN.

The SVM classifier takes about 2-3 hours to classify all of the testing data.

General comments and impressions of the challenge? What do you expect from a new challenge in face and looking at people analysis?
The task is interesting and still challenging: motion velocity, background, and other factors can still handicap accuracy and need further study. For a new challenge, classifying the social relationship between two or more people based on their gestures or actions could be interesting.