Large-scale gesture recognition based on Multimodal data with C3D and TSN


July 6,

1 Team details

Team name: ASU
Team leader name: Yunan Li
Team leader address, phone number and email address: Xidian University, 2 South Taibai Road, Xi'an, Shaanxi, China; phone number: ; email: xdfzliyunan@163.com
Rest of the team members: Weikang Shi, Xin Xu, Zhenxin Ma, Qiguang Miao (supervisor)
Team website URL (if any):
Affiliation: School of Computer Science and Technology, Xidian University

2 Contribution details

Title of the contribution
Large-scale gesture recognition based on Multimodal data with C3D and TSN

Final score
67.71% (on testing)

General method description
Our method has four parts. First, we enhance the RGB and depth data: Retinex is applied to unify the illumination of the RGB videos, and a median filter removes noise from the depth videos. Optical flow videos are also generated as an additional modality that captures the motion path. Second, two sampling strategies are adopted: uniform sampling and sectional weighted sampling. Third, the sampled videos are sent to the C3D model for feature extraction; to obtain more comprehensive features, we also employ the Temporal Segment Network (TSN) for feature extraction. Finally, the extracted features are blended to boost performance: features derived from the same modality are fused by means of canonical correlation analysis, and features from different modalities are fused by stacking.

References
[1] Jun Wan, Yibing Zhao, Shuai Zhou, Isabelle Guyon, Sergio Escalera and Stan Z. Li. ChaLearn Looking at People RGB-D Isolated and Continuous Datasets for Gesture Recognition. CVPR workshop.
[2] D. Tran, L. Bourdev, R. Fergus, L. Torresani and M. Paluri. Learning Spatiotemporal Features with 3D Convolutional Networks. ICCV.
[3] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama and T. Darrell. Caffe: Convolutional Architecture for Fast Feature Embedding. arXiv.
[4] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar and L. Fei-Fei. Large-scale Video Classification with Convolutional Neural Networks. CVPR.
[5] L. Wang, et al. Temporal Segment Networks: Towards Good Practices for Deep Action Recognition. ECCV.
[6] T. Brox, A. Bruhn, N. Papenberg and J. Weickert. High accuracy optical flow estimation based on a theory for warping. ECCV.
[7] M. Haghighat, M. Abdel-Mottaleb and W. Alhalabi. Fully automatic face normalization and single sample face recognition in unconstrained environments. Expert Systems with Applications, 2016, 47(5).

Representative image / diagram of the method
The diagram is shown in Figure 1.

Describe data preprocessing techniques applied (if any)
1. Data enhancement. We enhance the RGB and depth data separately, according to their characteristics.
Because RGB data are easily affected by varying illumination, we apply single-scale Retinex, which rests on the theory that the chromaticity of objects remains relatively constant under varying illumination conditions, to eliminate the influence of illumination. Noise is obvious in the depth data, so we apply a median filter to alleviate it.
2. An analysis of the distribution of the training data shows that most videos have 32 frames, so we sample every video to unify the frame number at 32. To augment the amount of data, we adopt two sampling strategies. One is uniform sampling, in which we sample frames according to the ratio of the original frame number to 32. The other is sectional weighted sampling: we first cut the video into several sections and calculate the average optical flow value of each section; the number of frames taken from each section is then determined by the amount of motion in that section.

Figure 1: The diagram of our method.

3 Visual Analysis

3.1 Gesture Recognition (or/and Spotting) Stage

3.1.1 Features / Data representation
Describe features used or data representation model FOR GESTURE RECOGNITION (OR/AND SPOTTING) STAGE (if any)
The features we use for gesture recognition are extracted by the C3D model and TSN.

3.1.2 Dimensionality reduction
Dimensionality reduction technique applied FOR GESTURE RECOGNITION (OR/AND SPOTTING) STAGE (if any)
We use PCA for feature dimensionality reduction.
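The two sampling strategies described in the preprocessing step can be sketched as follows. This is only an illustrative sketch, not the team's MATLAB implementation: the function names, the choice of four sections, the proportional allocation with largest-remainder rounding, and the per-frame flow magnitudes passed in as input are all assumptions, since the fact sheet does not specify these details. The video is assumed to have at least as many frames as sections.

```python
def uniform_sample(num_frames, target=32):
    """Pick `target` frame indices spread evenly over `num_frames` frames.

    Frames repeat when the video is shorter than `target`.
    """
    if target <= 0:
        return []
    return [(i * num_frames) // target for i in range(target)]


def allocate_by_motion(section_motion, target=32):
    """Split `target` frames across sections in proportion to their motion.

    Hypothetical rule: proportional shares, rounded with the
    largest-remainder method so the counts sum exactly to `target`.
    """
    total = sum(section_motion)
    if total <= 0:
        shares = [target / len(section_motion)] * len(section_motion)
    else:
        shares = [target * m / total for m in section_motion]
    counts = [int(s) for s in shares]
    by_remainder = sorted(range(len(shares)),
                          key=lambda i: shares[i] - counts[i],
                          reverse=True)
    for i in by_remainder[: target - sum(counts)]:
        counts[i] += 1
    return counts


def sectional_weighted_sample(flow_per_frame, num_sections=4, target=32):
    """Sample more frames from sections with larger average optical flow.

    `flow_per_frame` holds one (assumed precomputed) flow magnitude per frame.
    """
    n = len(flow_per_frame)
    bounds = [(s * n) // num_sections for s in range(num_sections + 1)]
    sections = list(zip(bounds[:-1], bounds[1:]))
    motion = [sum(flow_per_frame[a:b]) / max(b - a, 1) for a, b in sections]
    counts = allocate_by_motion(motion, target)
    indices = []
    for (start, end), count in zip(sections, counts):
        indices.extend(start + i for i in uniform_sample(end - start, count))
    return indices


if __name__ == "__main__":
    flow = [0.0] * 16 + [1.0] * 16 + [3.0] * 16 + [0.0] * 16
    print(uniform_sample(len(flow)))
    print(sectional_weighted_sample(flow))
```

With the toy flow profile above, the still sections at the start and end of the clip contribute no frames, and the section with the strongest motion contributes three times as many frames as the section with weaker motion.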

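The data enhancement step (single-scale Retinex for RGB frames, median filtering for depth frames) can be illustrated on a single grayscale channel. The sketch below is a pure-Python toy version under stated assumptions: the Gaussian surround scale `sigma`, the 3x3 median window, the border clamping and all function names are illustrative choices, not values taken from the fact sheet.

```python
import math
from statistics import median


def gaussian_kernel(sigma):
    """Normalized 1-D Gaussian kernel truncated at 3 * sigma."""
    radius = max(1, int(3 * sigma))
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_1d(values, kernel):
    """Convolve a 1-D signal with the kernel, clamping indices at the borders."""
    radius = len(kernel) // 2
    n = len(values)
    return [sum(w * values[min(max(i + k - radius, 0), n - 1)]
                for k, w in enumerate(kernel))
            for i in range(n)]


def gaussian_blur(image, sigma):
    """Separable Gaussian blur: rows first, then columns."""
    kernel = gaussian_kernel(sigma)
    rows = [blur_1d(row, kernel) for row in image]
    cols = [blur_1d(list(col), kernel) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


def single_scale_retinex(image, sigma=2.0):
    """log(I) - log(Gaussian * I): the smooth surround stands in for illumination."""
    surround = gaussian_blur(image, sigma)
    return [[math.log(p + 1.0) - math.log(s + 1.0)
             for p, s in zip(pixel_row, surround_row)]
            for pixel_row, surround_row in zip(image, surround)]


def median_filter(image, radius=1):
    """Replace each pixel by the median of its (2 * radius + 1)^2 neighbourhood."""
    height, width = len(image), len(image[0])
    return [[median(image[min(max(y + dy, 0), height - 1)]
                         [min(max(x + dx, 0), width - 1)]
                    for dy in range(-radius, radius + 1)
                    for dx in range(-radius, radius + 1))
             for x in range(width)]
            for y in range(height)]


if __name__ == "__main__":
    depth = [[5, 5, 5], [5, 100, 5], [5, 5, 5]]
    print(median_filter(depth))  # the isolated spike is suppressed
```

A uniformly lit flat image maps to zero under the Retinex step, which is the sense in which the slowly varying illumination component is removed; the median filter suppresses isolated depth spikes without averaging them into neighbouring pixels.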
3.1.3 Compositional model
Compositional model used, i.e. pictorial structure FOR GESTURE RECOGNITION (OR/AND SPOTTING) STAGE (if any)

3.1.4 Learning strategy
Learning strategy applied FOR GESTURE RECOGNITION (OR/AND SPOTTING) STAGE (if any)
We adopt a learning-extracting-fusing strategy. After the pre-processing step, we obtain Retinex-filtered RGB data, median-filtered depth data and optical flow data. We then fine-tune a C3D model to extract features from these data. The TSN model is also trained with depth and flow data (for TSN we use the uniformly sampled data only). After that, we fuse the features. For data of one modality sampled with different sampling strategies, we use canonical correlation analysis to calculate the weights of the features and fuse them by weighted addition. Features of different modalities are then stacked together.

3.1.5 Other techniques
Other technique/strategy used not included in previous items FOR GESTURE RECOGNITION (OR/AND SPOTTING) STAGE (if any)

3.1.6 Method complexity
Method complexity FOR GESTURE RECOGNITION (OR/AND SPOTTING) STAGE
The parts of our method most likely to have high complexity are fine-tuning the C3D model and training TSN. These take 19.4 hours for the C3D model and 14.2 hours for TSN, and require about 10 GB of GPU memory for the C3D model and about 4-6 GB for TSN. Classification is performed by a linear SVM classifier, so its complexity is not very high.

3.2 Data Fusion Strategies
List data fusion strategies (how different feature descriptions are combined) for learning the model / network: Single frame, early, slow, late. (if any)
As mentioned above, we enhance the RGB and depth videos with the Retinex algorithm and a median filter respectively, and use uniform sampling and sectional weighted sampling to obtain data with different sampling rates. For features of the same modality, we use canonical correlation analysis to calculate the correlation coefficients between the features, from which we obtain the fusion weights and perform weighted fusion. For data of different modalities, fusion is carried out by stacking.
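The two-level fusion can be summarized in a short sketch: weighted addition within one modality, then stacking (concatenation) across modalities. The fact sheet does not say how the canonical correlation coefficients are turned into weights, so the weights here are simply passed in as hypothetical inputs; the function names and toy feature values are likewise assumptions for illustration.

```python
def weighted_add(features, weights):
    """Fuse same-length feature vectors of one modality by weighted addition.

    `weights` are assumed to come from a canonical correlation analysis;
    how they are derived is not specified in the source.
    """
    dim = len(features[0])
    if any(len(f) != dim for f in features):
        raise ValueError("features of one modality must share a dimension")
    return [sum(w * f[d] for f, w in zip(features, weights))
            for d in range(dim)]


def stack(modality_features):
    """Fuse different modalities by concatenating their feature vectors."""
    return [value for feature in modality_features for value in feature]


if __name__ == "__main__":
    # Toy features for one video; real features would be far longer.
    rgb_uniform, rgb_sectional = [1.0, 2.0], [3.0, 4.0]
    depth_uniform, depth_sectional = [0.0, 2.0], [2.0, 0.0]

    rgb = weighted_add([rgb_uniform, rgb_sectional], [0.5, 0.5])
    depth = weighted_add([depth_uniform, depth_sectional], [0.5, 0.5])
    print(stack([rgb, depth]))
```

Weighted addition keeps the feature dimension of a modality unchanged, while stacking grows the final descriptor with each added modality, which is one reason a dimensionality reduction step such as PCA is useful before classification.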

3.3 Global Method Description

Which pre-trained or external methods have been used (for any stage, if any)
The C3D model is pre-trained on the Sports-1M dataset.

Which additional data has been used in addition to the provided ChaLearn training and validation data (at any stage, if any)
Flow data generated from the RGB data are used to concentrate on the gestures, and the filtered data are used to enhance the RGB and depth videos.

Qualitative advantages of the proposed solution
1) Data-driven frame number unification.
2) Video enhancement: RGB data are filtered according to Retinex theory and depth data by a median filter.
3) Data augmentation through different sampling methods, e.g., uniform sampling and sectional weighted sampling.
4) Multimodal features extracted by the C3D model and TSN.
5) Canonical correlation analysis for calculating the feature weights.

Results of the comparison to other approaches (if any)

Novelty degree of the solution and if it has been previously published
The data enhancement, augmentation strategy and fusion scheme are newly proposed and have not been published before.

4 Other details

Language and implementation details (including platform, memory, parallelization requirements)
Our experiments are run on a PC with Intel Core i GHz 8, 16 GB RAM and an Nvidia GeForce GTX TITAN X GPU. C3D and TSN training and feature extraction are run under the Caffe framework on Linux Ubuntu LTS; the other steps, including data enhancement, optical flow generation, the two sampling strategies, feature fusion and SVM classification, are implemented in MATLAB R2015b on 64-bit Windows 7.

Human effort required for implementation, training and validation?
The output locations of the data generated in the pre-processing step may need manual setup, especially for the frame images for TSN (which must be placed in the data folder of the TSN sub-project). Everything else can be executed by shell/MATLAB scripts if the settings are correct.
Training/testing expended time?
The training time is 19.4 hours for the C3D model and about 14.2 hours for TSN. The classification time of the SVM classifier is about 2-3 hours for all testing data.

General comments and impressions of the challenge? What do you expect from a new challenge in face and looking at people analysis?
The challenge is interesting and still difficult, because the velocity of motion, the background and other factors still limit accuracy and need further study. Social relationship classification based on the gestures or actions of two or more people might be an interesting topic for a new challenge.


More information

Comparison of Fine-tuning and Extension Strategies for Deep Convolutional Neural Networks

Comparison of Fine-tuning and Extension Strategies for Deep Convolutional Neural Networks Comparison of Fine-tuning and Extension Strategies for Deep Convolutional Neural Networks Nikiforos Pittaras 1, Foteini Markatopoulou 1,2, Vasileios Mezaris 1, and Ioannis Patras 2 1 Information Technologies

More information

Recurrent Neural Networks and Transfer Learning for Action Recognition

Recurrent Neural Networks and Transfer Learning for Action Recognition Recurrent Neural Networks and Transfer Learning for Action Recognition Andrew Giel Stanford University agiel@stanford.edu Ryan Diaz Stanford University ryandiaz@stanford.edu Abstract We have taken on the

More information

Face recognition based on improved BP neural network

Face recognition based on improved BP neural network Face recognition based on improved BP neural network Gaili Yue, Lei Lu a, College of Electrical and Control Engineering, Xi an University of Science and Technology, Xi an 710043, China Abstract. In order

More information

arxiv: v1 [cs.ro] 18 Jul 2017

arxiv: v1 [cs.ro] 18 Jul 2017 Choosing Smartly: Adaptive Multimodal Fusion for Object Detection in Changing Environments Oier Mees Andreas Eitel Wolfram Burgard arxiv:1707.05733v1 [cs.ro] 18 Jul 2017 Abstract Object detection is an

More information

Semantic Segmentation

Semantic Segmentation Semantic Segmentation UCLA:https://goo.gl/images/I0VTi2 OUTLINE Semantic Segmentation Why? Paper to talk about: Fully Convolutional Networks for Semantic Segmentation. J. Long, E. Shelhamer, and T. Darrell,

More information

arxiv: v1 [cs.cv] 15 Jun 2018

arxiv: v1 [cs.cv] 15 Jun 2018 arxiv:1806.05810v1 [cs.cv] 15 Jun 2018 Disclaimer: This work has been accepted for publication in the IEEE International Conference on Image Processing: link: https://2018.ieeeicip.org/ Copyright: c 2018

More information

Flowing ConvNets for Human Pose Estimation in Videos

Flowing ConvNets for Human Pose Estimation in Videos Flowing ConvNets for Human Pose Estimation in Videos Tomas Pfister Dept. of Engineering Science University of Oxford tp@robots.ox.ac.uk James Charles School of Computing University of Leeds j.charles@leeds.ac.uk

More information

Flow-Based Video Recognition

Flow-Based Video Recognition Flow-Based Video Recognition Jifeng Dai Visual Computing Group, Microsoft Research Asia Joint work with Xizhou Zhu*, Yuwen Xiong*, Yujie Wang*, Lu Yuan and Yichen Wei (* interns) Talk pipeline Introduction

More information

Multilayer and Multimodal Fusion of Deep Neural Networks for Video Classification

Multilayer and Multimodal Fusion of Deep Neural Networks for Video Classification Multilayer and Multimodal Fusion of Deep Neural Networks for Video Classification Xiaodong Yang, Pavlo Molchanov, Jan Kautz INTELLIGENT VIDEO ANALYTICS Surveillance event detection Human-computer interaction

More information

Automatic Detection of Multiple Organs Using Convolutional Neural Networks

Automatic Detection of Multiple Organs Using Convolutional Neural Networks Automatic Detection of Multiple Organs Using Convolutional Neural Networks Elizabeth Cole University of Massachusetts Amherst Amherst, MA ekcole@umass.edu Sarfaraz Hussein University of Central Florida

More information

Optical Flow Guided Feature: A Fast and Robust Motion Representation for Video Action Recognition

Optical Flow Guided Feature: A Fast and Robust Motion Representation for Video Action Recognition Optical Flow Guided Feature: A Fast and Robust Motion Representation for Video Action Recognition Shuyang Sun 1,2, Zhanghui Kuang 2, Lu Sheng 3, Wanli Ouyang 1, Wei Zhang 2 1 The University of Sydney 2

More information

Robust Face Recognition Based on Convolutional Neural Network

Robust Face Recognition Based on Convolutional Neural Network 2017 2nd International Conference on Manufacturing Science and Information Engineering (ICMSIE 2017) ISBN: 978-1-60595-516-2 Robust Face Recognition Based on Convolutional Neural Network Ying Xu, Hui Ma,

More information

Object Detection Based on Deep Learning

Object Detection Based on Deep Learning Object Detection Based on Deep Learning Yurii Pashchenko AI Ukraine 2016, Kharkiv, 2016 Image classification (mostly what you ve seen) http://tutorial.caffe.berkeleyvision.org/caffe-cvpr15-detection.pdf

More information

Learning Latent Sub-events in Activity Videos Using Temporal Attention Filters

Learning Latent Sub-events in Activity Videos Using Temporal Attention Filters Learning Latent Sub-events in Activity Videos Using Temporal Attention Filters AJ Piergiovanni, Chenyou Fan, and Michael S Ryoo School of Informatics and Computing, Indiana University, Bloomington, IN

More information

arxiv: v3 [cs.cv] 12 Apr 2018

arxiv: v3 [cs.cv] 12 Apr 2018 A Closer Look at Spatiotemporal Convolutions for Action Recognition Du Tran 1, Heng Wang 1, Lorenzo Torresani 1,2, Jamie Ray 1, Yann LeCun 1, Manohar Paluri 1 1 Facebook Research 2 Dartmouth College {trandu,hengwang,torresani,jamieray,yann,mano}@fb.com

More information

Stochastic Function Norm Regularization of DNNs

Stochastic Function Norm Regularization of DNNs Stochastic Function Norm Regularization of DNNs Amal Rannen Triki Dept. of Computational Science and Engineering Yonsei University Seoul, South Korea amal.rannen@yonsei.ac.kr Matthew B. Blaschko Center

More information

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun Presented by Tushar Bansal Objective 1. Get bounding box for all objects

More information

CSE255 Assignment 1 Improved image-based recommendations for what not to wear dataset

CSE255 Assignment 1 Improved image-based recommendations for what not to wear dataset CSE255 Assignment 1 Improved image-based recommendations for what not to wear dataset Prabhav Agrawal and Soham Shah 23 February 2015 1 Introduction We are interested in modeling the human perception of

More information

Supplementary material for Analyzing Filters Toward Efficient ConvNet

Supplementary material for Analyzing Filters Toward Efficient ConvNet Supplementary material for Analyzing Filters Toward Efficient Net Takumi Kobayashi National Institute of Advanced Industrial Science and Technology, Japan takumi.kobayashi@aist.go.jp A. Orthonormal Steerable

More information

arxiv: v2 [cs.cv] 18 Aug 2017

arxiv: v2 [cs.cv] 18 Aug 2017 Flow-Guided Feature Aggregation for Video Object Detection Xizhou Zhu 1,2 Yujie Wang 2 Jifeng Dai 2 Lu Yuan 2 Yichen Wei 2 1 University of Science and Technology of China 2 Microsoft Research ezra0408@mail.ustc.edu.cn

More information

Spotlight: A Smart Video Highlight Generator Stanford University CS231N Final Project Report

Spotlight: A Smart Video Highlight Generator Stanford University CS231N Final Project Report Spotlight: A Smart Video Highlight Generator Stanford University CS231N Final Project Report Jun-Ting (Tim) Hsieh junting@stanford.edu Chengshu (Eric) Li chengshu@stanford.edu Kuo-Hao Zeng khzeng@cs.stanford.edu

More information

ImageNet Classification with Deep Convolutional Neural Networks

ImageNet Classification with Deep Convolutional Neural Networks ImageNet Classification with Deep Convolutional Neural Networks Alex Krizhevsky Ilya Sutskever Geoffrey Hinton University of Toronto Canada Paper with same name to appear in NIPS 2012 Main idea Architecture

More information

Multi-view fusion for activity recognition using deep neural networks

Multi-view fusion for activity recognition using deep neural networks Multi-view fusion for activity recognition using deep neural networks Rahul Kavi a, Vinod Kulathumani a, *, FNU Rohit a, Vlad Kecojevic b a Department of Computer Science and Electrical Engineering, West

More information

Pyramid Person Matching Network for Person Re-identification

Pyramid Person Matching Network for Person Re-identification Proceedings of Machine Learning Research 77:487 497, 2017 ACML 2017 Pyramid Person Matching Network for Person Re-identification Chaojie Mao mcj@zju.edu.cn Yingming Li yingming@zju.edu.cn Zhongfei Zhang

More information

Deep Tracking: Biologically Inspired Tracking with Deep Convolutional Networks

Deep Tracking: Biologically Inspired Tracking with Deep Convolutional Networks Deep Tracking: Biologically Inspired Tracking with Deep Convolutional Networks Si Chen The George Washington University sichen@gwmail.gwu.edu Meera Hahn Emory University mhahn7@emory.edu Mentor: Afshin

More information

Lecture 5: Object Detection

Lecture 5: Object Detection Object Detection CSED703R: Deep Learning for Visual Recognition (2017F) Lecture 5: Object Detection Bohyung Han Computer Vision Lab. bhhan@postech.ac.kr 2 Traditional Object Detection Algorithms Region-based

More information

arxiv: v1 [cs.cv] 10 Apr 2017

arxiv: v1 [cs.cv] 10 Apr 2017 Fully Convolutional Deep Neural Networks for Persistent Multi-Frame Multi-Object Detection in Wide Area Aerial Videos Rodney LaLonde, Dong Zhang, Mubarak Shah Center for Research in Computer Vision, University

More information

arxiv: v1 [cs.cv] 2 May 2015

arxiv: v1 [cs.cv] 2 May 2015 Dense Optical Flow Prediction from a Static Image Jacob Walker, Abhinav Gupta, and Martial Hebert Robotics Institute, Carnegie Mellon University {jcwalker, abhinavg, hebert}@cs.cmu.edu arxiv:1505.00295v1

More information

TRANSPARENT OBJECT DETECTION USING REGIONS WITH CONVOLUTIONAL NEURAL NETWORK

TRANSPARENT OBJECT DETECTION USING REGIONS WITH CONVOLUTIONAL NEURAL NETWORK TRANSPARENT OBJECT DETECTION USING REGIONS WITH CONVOLUTIONAL NEURAL NETWORK 1 Po-Jen Lai ( 賴柏任 ), 2 Chiou-Shann Fuh ( 傅楸善 ) 1 Dept. of Electrical Engineering, National Taiwan University, Taiwan 2 Dept.

More information

PEOPLE IN SEATS COUNTING VIA SEAT DETECTION FOR MEETING SURVEILLANCE

PEOPLE IN SEATS COUNTING VIA SEAT DETECTION FOR MEETING SURVEILLANCE PEOPLE IN SEATS COUNTING VIA SEAT DETECTION FOR MEETING SURVEILLANCE Hongyu Liang, Jinchen Wu, and Kaiqi Huang National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Science

More information