P-CNN: Pose-based CNN Features for Action Recognition. Iman Rezazadeh

Introduction
Automatic understanding of dynamic scenes
Strong variations of people and scenes in motion and appearance
Fine-grained actions
Statistical representations of local motion descriptors
Coarse actions: standing up, hand-shaking, dancing
We believe action recognition will benefit from spatial and temporal detection and alignment of human poses in videos
Action descriptor based on human poses: tracks of body joints over time
http://jhmdb.is.tue.mpg.de/

Two-Stream Convolutional Networks
(1) Spatial and temporal feature extraction
(2) Building a pyramid
(3) Creating a video representation
(4) Classification

Robust pose features: Pose-CNN
Track human pose in a video (body part tracks)
Extract CNN features (appearance and motion) per part track
Train SVM classifier
Cordelia Schmid

Pose-CNN
(1) Input video
(2) Human pose estimation
(3) Crop RGB and optical flow patches of body parts
(4) Extract CNN features (appearance and motion) per part and per frame
(5) Aggregate per-frame descriptors over time (max/min)
(6) Normalize aggregated descriptors
(7) Concatenate appearance and motion descriptors from all body parts
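Step (3) above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `crop_patch`, the square patch shape, and the zero padding at frame borders are all assumptions made for the example. The same function could be applied to an RGB frame or to a flow field stored as an image-like array.

```python
import numpy as np

def crop_patch(frame, center_xy, size):
    """Crop a size x size patch centred on (x, y).

    Pixels that fall outside the frame are left at zero
    (zero padding is an assumption of this sketch).
    """
    h, w = frame.shape[:2]
    half = size // 2
    cx, cy = int(round(center_xy[0])), int(round(center_xy[1]))
    x0, y0 = cx - half, cy - half  # top-left corner of the requested window
    patch = np.zeros((size, size) + frame.shape[2:], dtype=frame.dtype)
    # Intersection of the requested window with the frame.
    fx0, fy0 = max(x0, 0), max(y0, 0)
    fx1, fy1 = min(x0 + size, w), min(y0 + size, h)
    if fx1 > fx0 and fy1 > fy0:
        patch[fy0 - y0:fy1 - y0, fx0 - x0:fx1 - x0] = frame[fy0:fy1, fx0:fx1]
    return patch
```

A part patch for, say, a hand would be obtained by calling `crop_patch(frame, hand_joint_xy, size)` for every frame of the part track; the patch size to use is not given on the slide.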

Pose-CNN
Compute temporal differences of the per-frame CNN features f_t^p across frames
Aggregation (max and min) of frame descriptors
Concatenation to get static and dynamic video descriptors
Normalization of video descriptor: normalize by the average L2-norm of the f_t^p from the training set (L p)
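The aggregation and normalization described on this slide can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions: the function names are made up for the example, the frame interval `dt` is left as a required argument because its value is not legible on the slide, and the features are assumed to be stored as a `(T, k)` array for one body part.

```python
import numpy as np

def video_descriptor(frame_feats, dt):
    """Aggregate per-frame features of one body part over time.

    frame_feats: (T, k) array, one CNN feature vector per frame.
    dt: frame interval for temporal differences (value assumed by the caller).
    Returns (static, dynamic), each of length 2 * k.
    """
    f = np.asarray(frame_feats, dtype=np.float64)
    if f.shape[0] <= dt:
        raise ValueError("need more than dt frames")
    # Static descriptor: per-dimension min and max over all frames.
    static = np.concatenate([f.min(axis=0), f.max(axis=0)])
    # Dynamic descriptor: min and max of the temporal differences.
    diff = f[dt:] - f[:-dt]
    dynamic = np.concatenate([diff.min(axis=0), diff.max(axis=0)])
    return static, dynamic

def normalize_descriptor(desc, train_frame_feats):
    """Divide by the average L2-norm of the training-set frame features."""
    train = np.asarray(train_frame_feats, dtype=np.float64)
    avg_norm = np.linalg.norm(train, axis=1).mean()
    return desc / avg_norm
```

The normalized static and dynamic descriptors of all parts, for both appearance and motion, would then be concatenated into one video descriptor.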

State-of-the-art methods
Detect and track human poses in videos
Extract poses for individual frames: deformable part model to locate positions of body joints
Extract a large set of pose configurations in each frame and link them over time, constrained to have a high score of the pose estimator
Motion of joints in a pose sequence is constrained to be consistent with the optical flow extracted at joint positions
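One common way to link per-frame pose candidates into a sequence is dynamic programming over the candidates. The sketch below is an assumption about how such linking could be done, not the method used by the cited pose trackers: the function name `link_poses`, the additive objective, and the `motion_weight` parameter are all hypothetical. It rewards high pose-estimator scores and penalizes joint motion that disagrees with the optical flow sampled at the joints.

```python
import numpy as np

def link_poses(scores, positions, flows, motion_weight=1.0):
    """Choose one pose candidate per frame by dynamic programming.

    scores[t]:    (N_t,) pose-estimator score of each candidate in frame t.
    positions[t]: (N_t, J, 2) joint coordinates of each candidate.
    flows[t]:     (N_t, J, 2) optical flow at those joints, pointing to
                  frame t + 1 (needed for t = 0 .. T - 2).
    Returns the list of selected candidate indices, one per frame.
    """
    best = [np.asarray(scores[0], dtype=float)]
    back = []
    for t in range(1, len(scores)):
        # Where the flow predicts each previous candidate's joints to be.
        pred = np.asarray(positions[t - 1], float) + np.asarray(flows[t - 1], float)
        cur = np.asarray(positions[t], dtype=float)
        # Mean joint distance between prediction and candidate: (N_prev, N_cur).
        dist = np.linalg.norm(pred[:, None] - cur[None], axis=-1).mean(axis=-1)
        total = best[-1][:, None] - motion_weight * dist
        back.append(total.argmax(axis=0))
        best.append(total.max(axis=0) + np.asarray(scores[t], dtype=float))
    # Backtrack from the best final candidate.
    path = [int(best[-1].argmax())]
    for pointers in reversed(back):
        path.append(int(pointers[path[-1]]))
    return path[::-1]
```

With `motion_weight=0` the function reduces to picking the highest-scoring candidate in each frame independently; larger weights favour sequences that follow the flow.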

High-level pose features (HLPF)
Encode spatial and temporal relations of body joint positions
Positions of body joints are first normalized
Relative offsets to the head
Distances between all pairs of joints, orientations of the vectors connecting pairs of joints, and inner angles
Dynamic features are obtained from trajectories of body joints
Features are quantized using a separate codebook
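The static quantities listed on this slide can be computed directly from joint coordinates. The following is a minimal sketch for a single frame, with assumptions labelled: the function names, the `head` index, and the `scale` argument (a stand-in for whatever normalization the original features use) are hypothetical, and the codebook quantization step is not shown.

```python
import numpy as np

def static_pose_features(joints, head=0, scale=1.0):
    """Static pose features for one frame.

    joints: (J, 2) array of joint positions.
    head:   index of the head joint (assumed layout).
    scale:  normalization factor, e.g. a person-size estimate (assumption).
    Returns (offsets, dists, orients).
    """
    joints = np.asarray(joints, dtype=float) / scale
    offsets = joints - joints[head]              # offsets relative to the head
    i, j = np.triu_indices(len(joints), k=1)     # all unordered joint pairs
    vec = joints[j] - joints[i]
    dists = np.linalg.norm(vec, axis=1)          # pairwise distances
    orients = np.arctan2(vec[:, 1], vec[:, 0])   # orientation of each pair vector
    return offsets, dists, orients

def inner_angle(a, b, c):
    """Angle at joint b (radians) between the vectors b->a and b->c."""
    u = np.asarray(a, float) - np.asarray(b, float)
    v = np.asarray(c, float) - np.asarray(b, float)
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))
```

Dynamic features would be derived from how these joint positions change along the trajectory of each joint.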

Dense trajectory features


Fisher Vector

Datasets used for evaluation: JHMDB
21 human actions, such as brush hair, climb, golf, run or sit
Clips are restricted to the duration of the action
Between 36 and 55 clips per action, for a total of 928 clips; 3 train/test splits
Each clip contains between 15 and 40 frames of size 320 × 240
Human pose is annotated in each of the 31,838 frames
The metric used is accuracy: each clip is assigned the action label corresponding to the maximum value among the scores returned by the action classifiers
sub-JHMDB: 316 clips distributed over 12 actions in which the human body is fully visible; 3 train/test splits, and the evaluation metric is accuracy
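The accuracy metric described above amounts to an argmax over the per-class classifier scores followed by a comparison with the true labels. A minimal sketch, assuming the scores are stored as a `(num_clips, num_actions)` array (the function names are illustrative):

```python
import numpy as np

def predict_labels(scores):
    """Assign each clip the action whose classifier returned the highest score."""
    return np.asarray(scores).argmax(axis=1)

def accuracy(scores, labels):
    """Fraction of clips whose predicted action matches the true label."""
    return float((predict_labels(scores) == np.asarray(labels)).mean())
```

With several train/test splits, this value would be computed once per split.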

JHMDB

Datasets used for evaluation: MPII Cooking
64 fine-grained actions and an additional background class
A total of 5609 clips, 7 train/test splits, frame size 1624 × 1224
Actions are very similar, such as cut dice, cut slices, and cut stripes, or wash hands and wash objects
sub-MPII Cooking: a selection of two similar classes, wash hands and wash objects, with ground-truth (GT) pose
55 clips for wash hands and 139 clips for wash objects, for a total of 29,997 frames

MPII Cooking

Performance of the individual features
Different body parts are complementary
Appearance and flow are complementary

Robustness of P-CNN
P-CNN is on par with HLPF for ground-truth (GT) poses
P-CNN is significantly more robust for real, noisy poses

CNN Feature Pyramid Architecture
(1) Spatial and temporal feature extraction
(2) Building a pyramid
(3) Creating a video representation
(4) Classification

Hierarchical model for a sample snippet

Extract Binary key-frames