Video Gesture Recognition with RGB-D-S Data Based on 3D Convolutional Networks


August 16, 2016

1 Team details

Team name: FLiXT

Team leader name: Yunan Li

Team leader address, phone number and email address: Xidian University, No. 2 South Taibai Road, Xi'an, Shaanxi, China; phone: 18710849937; email: xdfzliyunan@163.com

Rest of the team members: Kuan Tian, Yingying Fan, Xin Xu, Rui Li

Team website URL (if any):

Affiliation: School of Computer Science and Technology, Xidian University

2 Contribution details

Title of the contribution: Video Gesture Recognition with RGB-D-S Data Based on 3D Convolutional Networks

Final score: 48.62% (on validation)

General method description: Our method recognizes gestures by employing both RGB and depth videos and learning from features extracted by a 3D CNN model. To capture more detail of the motions, we pre-process the inputs and convert them into 32-frame videos. Since variations in background, clothing, skin color, and other external factors may disturb recognition, we additionally employ saliency videos to concentrate on the gestures. The video features are learned by the C3D model, a 3D convolutional network that learns spatiotemporal features. We then blend the RGB, depth, and saliency features to boost performance, and the final classification is performed by an SVM classifier.

References

[1] Jun Wan, Yibing Zhao, Shuai Zhou, Isabelle Guyon, Sergio Escalera, and Stan Z. Li. ChaLearn Looking at People: RGB-D Isolated and Continuous Datasets for Gesture Recognition. CVPR Workshops, 2016.

[2] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning Spatiotemporal Features with 3D Convolutional Networks. ICCV, 2015.

[3] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional Architecture for Fast Feature Embedding. arXiv, 2014.

[4] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale Video Classification with Convolutional Neural Networks. CVPR, 2014.

[5] R. Achanta, S. Hemami, F. Estrada, and S. Susstrunk. Frequency-tuned Salient Region Detection. CVPR, 2009.

[6] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A Library for Support Vector Machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1-27:27, 2011.

Representative image / diagram of the method:

Figure 1: The diagram of our method.

Describe data preprocessing techniques applied (if any): We pre-process the videos to convert them into 32 frames. Statistics over all 35878 training videos show that most have 29-39 frames, with the peak at 33 frames. For easier processing in C3D learning, we therefore choose 32 frames as the benchmark and sample or extend each input video according to its length.
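The report does not spell out the exact sampling/extension rule, so the following is only a minimal Python sketch of one plausible scheme, assuming uniform index resampling (indices are repeated when a clip is shorter than 32 frames and subsampled when it is longer); the function name is ours:

```python
import numpy as np

def resample_to_32_frames(frames, target_len=32):
    """Return a fixed-length clip by uniform temporal resampling.

    Clips longer than `target_len` are subsampled; shorter clips get
    repeated frames, since linspace then yields duplicate indices.
    """
    n = len(frames)
    indices = np.linspace(0, n - 1, num=target_len).round().astype(int)
    return [frames[i] for i in indices]
```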

3 Visual Analysis

3.1 Gesture Recognition (or/and Spotting) Stage

3.1.1 Features / Data representation

Describe features used or data representation model FOR GESTURE RECOGNITION (OR/AND SPOTTING) STAGE (if any): The features we use for gesture recognition are extracted by the C3D model, whose architecture is illustrated in Fig. 2. After 8 convolutions and 5 poolings, the input video is converted into a 1 x 4096-dimensional feature vector, which is exactly what we use for classification.

Figure 2: The architecture of the C3D model (from Tran et al.'s paper). It consists of 8 convolution layers, 5 pooling layers, 2 fully-connected layers, and a softmax loss layer. The feature we extract is from the fc6 layer, i.e., the first fully-connected layer.

3.1.2 Dimensionality reduction

Dimensionality reduction technique applied FOR GESTURE RECOGNITION (OR/AND SPOTTING) STAGE (if any):

3.1.3 Compositional model

Compositional model used, i.e. pictorial structure FOR GESTURE RECOGNITION (OR/AND SPOTTING) STAGE (if any):

3.1.4 Learning strategy

Learning strategy applied FOR GESTURE RECOGNITION (OR/AND SPOTTING) STAGE (if any): We adopt a learning-extracting-fusing strategy. Since blending the three kinds of input videos directly may produce unreasonable data (the objects in the RGB and depth videos are not presented at the same size), we first use each of the three kinds of video to fine-tune the model and extract features separately; we then blend the three features together and use the integrated feature to train an SVM, which yields the final classification result.
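As a rough illustration of this extract-fuse-classify stage, here is a minimal Python sketch. The feature file names are hypothetical, it assumes the per-modality fc6 features have already been dumped to disk as 4096-dimensional vectors, and scikit-learn's LinearSVC stands in for the LIBSVM classifier [6] actually used:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Hypothetical files: fc6 features per modality, shape (num_videos, 4096),
# extracted beforehand with the three fine-tuned C3D networks.
rgb = np.load("fc6_rgb.npy")
depth = np.load("fc6_depth.npy")
saliency = np.load("fc6_saliency.npy")
labels = np.load("labels.npy")

# Late fusion: average the three 1 x 4096 vectors (see Section 3.2).
fused = (rgb + depth + saliency) / 3.0

# Linear SVM on the fused features; validation features would be
# fused the same way before calling clf.predict().
clf = LinearSVC(C=1.0)
clf.fit(fused, labels)
```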

3.1.5 Other techniques

Other technique/strategy used not included in previous items FOR GESTURE RECOGNITION (OR/AND SPOTTING) STAGE (if any):

3.1.6 Method complexity

Method complexity FOR GESTURE RECOGNITION (OR/AND SPOTTING) STAGE: The part of our method most likely to have high complexity is the fine-tuning of the C3D model, which takes about 50.9 hours and updates nearly all of the network's parameters. It also takes about 8 GB of GPU memory. Classification is implemented with a linear SVM, so its complexity is not very high.

3.2 Data Fusion Strategies

List data fusion strategies (how different feature descriptions are combined) for learning the model / network: single frame, early, slow, late (if any): The data we use for fusion are RGB, depth, and saliency data. The saliency data is generated from the RGB data to alleviate the disturbing influence of background, clothing, skin color, etc. (see the saliency sketch at the end of this section). Since the RGB and depth videos are not well matched (objects in the RGB video appear slightly larger), we choose to fuse at a late stage: after the features of the RGB, depth, and saliency data are extracted by the C3D model, we blend the three 1 x 4096 feature vectors and compute their average.

3.3 Global Method Description

Which pre-trained or external methods have been used (for any stage, if any): The C3D model is pre-trained on the Sports-1M dataset.

Which additional data has been used in addition to the provided ChaLearn training and validation data (at any stage, if any):

Qualitative advantages of the proposed solution: The advantages of our method are: 1. The 32-frame strategy achieves better results than the original C3D model, which requires 16-frame input. 2. The fused feature brings a significant gain in accuracy over any single feature.

Results of the comparison to other approaches (if any):

Novelty degree of the solution and if it has been previously published: The 32-frame scheme and the saliency video used for fusion are newly proposed and have not been published before.
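The saliency videos mentioned in Section 3.2 are generated with the frequency-tuned salient region detector of Achanta et al. [5]: each pixel's saliency is the distance between the image-wide mean Lab color and a slightly blurred version of that pixel. A minimal per-frame sketch (the kernel size and output normalization are our assumptions, not taken from the report):

```python
import cv2
import numpy as np

def frequency_tuned_saliency(frame_bgr):
    """Frequency-tuned saliency [5] for one video frame."""
    lab = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2LAB).astype(np.float32)
    blurred = cv2.GaussianBlur(lab, (5, 5), 0)       # suppress high-frequency noise
    mean_color = lab.reshape(-1, 3).mean(axis=0)     # image-wide mean Lab vector
    saliency = np.linalg.norm(blurred - mean_color, axis=2)
    return cv2.normalize(saliency, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
```

Applying this to every frame of an RGB video presumably yields the "S" channel of the RGB-D-S data.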

4 Other details

Language and implementation details (including platform, memory, parallelization requirements): Our experiments run on a PC with an Intel Core i7-6700 CPU @ 3.40 GHz x 8, 16 GB RAM, and an Nvidia GeForce GTX TITAN X GPU. C3D model training and feature extraction are carried out under the Caffe framework on Linux Ubuntu 14.04 LTS; the remaining steps, including 32-frame video generation, feature fusion, and SVM classification, are implemented in MATLAB R2012b on 64-bit Windows 7.

Human effort required for implementation, training and validation? Our method is divided into four modules (data pre-processing, feature extraction, classification, and prediction generation), so intermediary data must be transported among these modules. This is especially true for feature extraction, since the network reads and writes data in specific directories.

Training/testing expended time? Training the C3D model takes about 50.9 hours, and feature extraction takes about 1 hour for the 35878 training videos. SVM classification takes about 6-9 hours, depending on the amount of training data and on CPU load.

General comments and impressions of the challenge? What do you expect from a new challenge in face and looking-at-people analysis? The challenge is really fantastic. Some of the gestures are genuinely difficult to distinguish, so we had to try our best to make the network learn the details of each gesture. It would be better if a future face and looking-at-people analysis challenge differed from existing face recognition schemes; a specific application such as criminal identification could be interesting and realistic.