AUTOMATIC 3D HUMAN ACTION RECOGNITION
Ajmal Mian, Associate Professor, Computer Science & Software Engineering
www.csse.uwa.edu.au/~ajmal/

Overview
- Aim of automatic human action recognition
- Applications
- Challenges
- Deep learning
- Data generation
- Training the deep neural network models
- Results

Aim of human action recognition
- Given a video, automatically classify what action is performed in it
- Who is performing the action is not relevant; only the action is, e.g.:
  - Walking or running
  - Falling down
  - Struggling in a pool
  - Answering a phone
  - Holding head / chest / stomach
  - Interactions with others, such as handing over a bag

Applications
- Elderly care / smart houses
- Child minding
- Swimming pool monitoring
- Surveillance and security
- Action-based video search
- At a finer level:
  - Rehabilitation (e.g. quantifying progress in walking)
  - Sports action analysis (injury prevention)

Challenges
- The same action appears very different (to computers) when viewed from different angles
- Changes in illumination can completely change the appearance of a video
- The same action may be performed in different ways
- Noise due to variations in clothing, background and occlusions

Deep Neural Networks (DNNs)
- DNNs have the potential to model all these complex variations
- A DNN learns a non-linear mapping between the input and output data in the form of millions of parameters
- However, to learn these parameters a DNN requires a large corpus of labelled training data
- Human action videos that are (1) multi-view, (2) labelled and (3) large enough in quantity to train DNNs are not available and are expensive to generate

Our solution to the paucity of data
- Generate the data synthetically; this covers the multi-view and quantity requirements
- What about labels? Use dummy labels.

Generating training data synthetically
- We fit 3D human models to Motion Capture (MoCap) data to synthesize 3D videos

Simulating different body shapes
- We use different 3D human body shapes; real and synthetic shapes work equally well

Simulating different camera viewpoints
- We place many cameras on a hemisphere to capture images or videos from different viewpoints
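As a rough illustration of this viewpoint sampling (the exact grid is not given in the talk, so the azimuth/elevation counts and the radius below are placeholders), virtual cameras can be placed on a hemisphere and given look-at rotations towards the subject; a 12 × 9 grid happens to yield the 108 viewpoints mentioned later in the talk:

```python
import numpy as np

def hemisphere_cameras(n_azimuth=12, n_elevation=9, radius=3.0):
    """Sample camera centres on a hemisphere around the origin and return
    (position, rotation) pairs where the rotation looks at the subject.
    Counts and radius are illustrative placeholders, not the values used
    in the original work."""
    cameras = []
    for el in np.linspace(0.0, np.pi / 2, n_elevation, endpoint=False):
        for az in np.linspace(0.0, 2 * np.pi, n_azimuth, endpoint=False):
            # Camera centre on the hemisphere.
            c = radius * np.array([np.cos(el) * np.cos(az),
                                   np.cos(el) * np.sin(az),
                                   np.sin(el)])
            # Build a look-at frame: the z-axis points from camera to subject.
            z = -c / np.linalg.norm(c)
            up = np.array([0.0, 0.0, 1.0])
            x = np.cross(up, z)
            if np.linalg.norm(x) < 1e-6:          # camera directly overhead
                x = np.array([1.0, 0.0, 0.0])
            x /= np.linalg.norm(x)
            y = np.cross(z, x)
            R = np.stack([x, y, z])               # world-to-camera rotation
            cameras.append((c, R))
    return cameras

print(len(hemisphere_cameras()))  # 108 viewpoints with the default grid
```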

Feature extraction from 3D videos

Generating synthetic depth (3D) images
- We choose typical human poses (~350) and generate depth images for each pose from a large number of camera viewpoints
- These depth images resemble real data from 3D sensors
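A minimal sketch of how a viewpoint-specific depth image could be synthesized, assuming the fitted body is available as a 3D point cloud (the actual pipeline renders fitted 3D human models; the intrinsics and image size here are placeholders):

```python
import numpy as np

def render_depth(points_3d, R, t, focal=365.0, size=(240, 320)):
    """Crude z-buffer rendering: project every 3D body point into the
    camera and keep the nearest depth per pixel. A real renderer uses a
    meshed body model; this point-based version only illustrates the idea."""
    h, w = size
    depth = np.full((h, w), np.inf)
    cam = points_3d @ R.T + t                    # world -> camera coordinates
    z = cam[:, 2]
    u = np.round(focal * cam[:, 0] / z + w / 2).astype(int)
    v = np.round(focal * cam[:, 1] / z + h / 2).astype(int)
    valid = (z > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    for ui, vi, zi in zip(u[valid], v[valid], z[valid]):
        depth[vi, ui] = min(depth[vi, ui], zi)   # keep the nearest surface
    depth[np.isinf(depth)] = 0.0                 # background
    return depth

# Example: a random point cloud placed ~2.5 m in front of the camera.
depth_img = render_depth(np.random.rand(5000, 3) + np.array([0.0, 0.0, 2.5]),
                         np.eye(3), np.zeros(3))
```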

3D Human Pose Model (HPM)
- A Convolutional Neural Network (CNN) is trained to map an input depth (3D) image, irrespective of the camera viewpoint, to one of N poses
- HPM: Human Pose Model
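The published HPM was trained with MatConvNet in MATLAB and its exact architecture is not reproduced here; the PyTorch sketch below only illustrates the idea of mapping a single depth image to one of ~350 pose classes while exposing an fc7-style layer whose activations serve as the pose descriptor. All layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class HumanPoseModel(nn.Module):
    """Toy CNN mapping a single-channel depth image to one of N pose
    classes, mirroring the HPM idea (the real model differs)."""

    def __init__(self, num_poses=350):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),
        )
        # fc7-style layer: its activations are later used as the
        # viewpoint-invariant pose descriptor.
        self.fc7 = nn.Linear(128 * 4 * 4, 512)
        self.classifier = nn.Linear(512, num_poses)

    def forward(self, depth):
        x = self.features(depth).flatten(1)
        feat = torch.relu(self.fc7(x))
        return self.classifier(feat), feat

# Training uses synthetic depth images with pose-index labels.
model = HumanPoseModel()
logits, pose_feat = model(torch.randn(8, 1, 224, 224))
loss = nn.functional.cross_entropy(logits, torch.randint(0, 350, (8,)))
```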

Modelling CNN features with a Fourier Temporal Pyramid (FTP)
- The CNN outputs a viewpoint-invariant representation of the human pose
- A sequence of human poses defines an action
- The FTP splits the sequence at each level and performs Fourier analysis on each split to keep the low-frequency components of each individual CNN fc7 feature dimension
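A minimal sketch of a Fourier Temporal Pyramid over per-frame feature vectors: at each level the sequence is split into segments, each segment is transformed with an FFT along time (per feature dimension), and only the lowest-frequency magnitudes are kept and concatenated. The number of levels and retained coefficients here are illustrative choices, not the published settings.

```python
import numpy as np

def fourier_temporal_pyramid(features, levels=3, n_low_freq=4):
    """features: (T, D) array of per-frame descriptors (e.g. CNN fc7
    activations). Returns a fixed-length descriptor built from the
    low-frequency FFT coefficients of each temporal segment."""
    T, D = features.shape
    descriptor = []
    for level in range(levels):
        n_segments = 2 ** level
        for seg in np.array_split(features, n_segments, axis=0):
            # FFT along time for every feature dimension; keep the
            # magnitudes of the lowest-frequency components.
            spectrum = np.abs(np.fft.rfft(seg, axis=0))
            low = spectrum[:n_low_freq]
            # Pad very short segments so all levels contribute equally.
            if low.shape[0] < n_low_freq:
                low = np.vstack([low, np.zeros((n_low_freq - low.shape[0], D))])
            descriptor.append(low.ravel())
    return np.concatenate(descriptor)

video_feats = np.random.rand(60, 512)               # 60 frames of fc7 features
print(fourier_temporal_pyramid(video_feats).shape)  # (1 + 2 + 4) * 4 * 512
```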

Feature extraction from 2D videos

Human action recognition in conventional videos
- Can we synthesize data to train deep neural networks for action recognition in 2D videos?
- Motion trajectories in 3D videos, projected to image planes, resemble motion trajectories in 2D videos
- We fit 3D human models to Motion Capture (MoCap) data and find motion trajectories from 108 viewpoints
- MoCap sequences do not correspond to our actions of interest, so we give each of them a dummy label
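The projection step can be illustrated with a simple pinhole model: a 3D joint trajectory expressed in world coordinates is mapped into the image plane of one virtual camera. The intrinsics below are placeholder values, not those of the synthetic cameras used in the work.

```python
import numpy as np

def project_trajectory(points_3d, R, t, focal=500.0, cx=160.0, cy=120.0):
    """Project a (T, 3) trajectory of 3D points into a virtual camera with
    rotation R (3x3), translation t (3,), and a pinhole intrinsic model."""
    cam = points_3d @ R.T + t            # world -> camera coordinates
    u = focal * cam[:, 0] / cam[:, 2] + cx
    v = focal * cam[:, 1] / cam[:, 2] + cy
    return np.stack([u, v], axis=1)      # (T, 2) image-plane trajectory

# Example: a joint moving along a straight 3D line, seen from one
# synthetic viewpoint (identity rotation, camera 3 m away).
traj_3d = np.linspace([0.0, 0.0, 0.0], [0.5, 0.2, 0.1], num=30)
traj_2d = project_trajectory(traj_3d, np.eye(3), np.array([0.0, 0.0, 3.0]))
```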

From MoCap to 3D Trajectories

Complete pipeline for neural-network training from synthetic dense trajectories

Network architecture
- The network is trained using synthetic data and dummy labels
- It learns to project each action, observed from any viewpoint, to a shared high-level space
- We get the same high-level features irrespective of viewpoint
- R-NKTM: Robust Non-linear Knowledge Transfer Model
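A sketch of the idea, not the published R-NKTM architecture: a fully connected network maps a trajectory descriptor from any viewpoint to its dummy MoCap-sequence label, and the shared hidden activations serve as the view-invariant feature. Layer sizes, input dimensionality and the number of dummy labels are assumptions.

```python
import torch
import torch.nn as nn

class RNKTMSketch(nn.Module):
    """Illustrative fully connected network: trajectory descriptors from any
    viewpoint of the same MoCap sequence share one dummy label, so the
    hidden layers learn a view-invariant representation."""

    def __init__(self, input_dim=2000, num_dummy_labels=500):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(input_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
        )
        self.dummy_head = nn.Linear(1024, num_dummy_labels)

    def forward(self, descriptor):
        shared = self.shared(descriptor)       # view-invariant feature
        return self.dummy_head(shared), shared

model = RNKTMSketch()
logits, view_invariant = model(torch.randn(16, 2000))
loss = nn.functional.cross_entropy(logits, torch.randint(0, 500, (16,)))
```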

Feature extraction from real 2D videos
- Trajectory codebook
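The slide only names the trajectory codebook; as an illustration of how such a codebook is commonly built and applied (assumed details, not necessarily the exact recipe from the talk), trajectory descriptors can be quantised with k-means and each real video encoded as a normalised histogram of codewords:

```python
import numpy as np
from sklearn.cluster import KMeans

# Assume trajectory descriptors have already been extracted; the shapes
# and codebook size below are placeholders.
train_descriptors = np.random.rand(5000, 30)    # descriptors pooled over videos
codebook = KMeans(n_clusters=100, n_init=4, random_state=0).fit(train_descriptors)

def video_histogram(video_descriptors, codebook):
    """Bag-of-words encoding: assign each trajectory descriptor in the
    video to its nearest codeword and return a normalised histogram."""
    words = codebook.predict(video_descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

one_video = np.random.rand(500, 30)             # descriptors from one real video
feature = video_histogram(one_video, codebook)  # input to the trained network
```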

Classification
- Our deep neural network models are trained on synthetic data with:
  - pose labels (image based) rather than action labels
  - dummy labels (video based) that correspond to random actions
- Therefore, we use the deep models as feature extractors and train another classifier on these features using the real video labels
- Classifiers: Support Vector Machines and an L1/L2-regularised classifier
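A minimal sketch of this final stage with placeholder feature shapes: the frozen deep models supply fixed-length features for each real video, and a linear SVM is trained on them with the real action labels (only the SVM variant is sketched; the L1/L2 classifier mentioned on the slide is not shown):

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

# Placeholder features: in practice these come from the frozen HPM / R-NKTM.
X_train = np.random.rand(200, 4096)        # deep features of training videos
y_train = np.random.randint(0, 20, 200)    # real action labels (e.g. 20 classes)
X_test = np.random.rand(50, 4096)
y_test = np.random.randint(0, 20, 50)

clf = LinearSVC(C=1.0, max_iter=5000).fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```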

The N-UCLA benchmark dataset
- Northwestern University / University of California, Los Angeles
- 10 subjects, 20 actions, 3 camera viewpoints
- RGB-D videos (Kinect-1)
- 640×480 RGB resolution, 320×240 depth resolution

The UWA3D-II benchmark dataset
- University of Western Australia
- 20 subjects, 30 actions, 4 camera viewpoints
- RGB-D videos (Kinect-1)
- 640×480 RGB resolution, 320×240 depth resolution

The IXMAS benchmark dataset
- INRIA Xmas Motion Acquisition Sequences
- 11 subjects, 11 actions, 5 camera viewpoints
- RGB videos, 390×291 resolution

Comparative results (2D only)
Mean accuracy (%) on the three datasets when the test camera view is not included in the training data (a dash marks a value not reported).

Method | Year | IXMAS (all cams) | UWA (all cams) | N-UCLA (all cams)
Dense Trajectories | 2011 | 61.7 | 60.4 | 72.7
Hankelets | 2012 | 56.4 | 51.8 | 45.2
Non-linear Circulant Temporal Encoding | 2014 | 67.4 | 61.2 | 68.6
3D Convolutional Neural Networks | 2013 | - | 62.8 | 61.0
Two-stream Convolutional Networks | 2014 | - | 42.5 | 57.6
Our method (R-NKTM) | 2016 | 74.1 | 67.4 | 78.1

Comparative results (3D only)
Mean accuracy (%) on the two datasets when the test camera view is not included in the training data.

Method | Year | Data type | UWA (all cams) | N-UCLA (cam-3 only)
ActionLet | 2013 | Joints | 39.8 | 76.0
3D Skeleton Points in Lie Group | 2014 | Joints | 43.4 | 74.2
Histogram of Oriented 4D Normals | 2013 | Depth | 28.9 | 39.9
Continuous Virtual Path | 2013 | Depth | 25.6 | 53.5
Histogram of Oriented Principal Components | 2016 | Depth | 52.2 | 80.0
Our method (HPM+TM) | 2016 | Depth | 76.9 | 92.0

Comparative results (2D + 3D)
Mean accuracy (%) on the two datasets when the test camera view is not included in the training data. 2D and 3D features were combined to train an L1/L2 classifier.

Type of data | Neural network model | UWA (all cams) | N-UCLA (all cams)
2D videos | NKTM (fully connected) | 67.5 | 78.1
3D videos | Human Pose Model (CNN) | 76.9 | 79.7
Combined | NKTM + HPM + TM | 84.1 | 83.3

Conclusion
- Existing MoCap data can be exploited to synthesize videos for training deep neural networks
- Dummy labels can be used to train deep models that are good feature extractors
- 3D data improves human action recognition accuracy
- Future research may find new uses for legacy MoCap data that has been routinely collected for decades at universities and sports research institutes
- Our future research aims at combining MoCap data from multiple institutions to perform fine-grained action analysis for athlete performance enhancement

Acknowledgements
- This research is supported by the Australian Research Council Discovery Project DP160101458.
- MoCap data was obtained from Carnegie Mellon University, USA.
- Deep learning was performed in MATLAB using the MatConvNet library from the University of Oxford.

Related publications:
[1] Hossein Rahmani, Ajmal Mian and Mubarak Shah, "Learning a Deep Model for Human Action Recognition from Novel Viewpoints", IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2017.
[2] Hossein Rahmani and Ajmal Mian, "3D Action Recognition from Novel Viewpoints", IEEE Conference on Computer Vision and Pattern Recognition (CVPR, oral), 2016.