P-CNN: Pose-based CNN Features for Action Recognition
Iman Rezazadeh
Introduction
Automatic understanding of dynamic scenes must cope with strong variations of people and scenes in motion and appearance.
Statistical representations of local motion descriptors work well for coarse actions (standing up, hand-shaking, dancing), but fine-grained actions are harder.
We believe action recognition will benefit from spatial and temporal detection and alignment of human poses in videos.
Proposed: an action descriptor based on human poses, i.e. tracks of body joints over time.
http://jhmdb.is.tue.mpg.de/
Two-Stream Convolutional Networks
(1) Spatial and temporal feature extraction
(2) Building a pyramid
(3) Creating a video representation
(4) Classification
Robust pose features: Pose-CNN
Track human pose in a video to obtain body-part tracks.
Extract CNN features (appearance and motion) per part track.
Train an SVM classifier.
(Cordelia Schmid)
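The final classification step above can be sketched as follows: a linear SVM trained on fixed-length per-clip descriptors. This is a minimal sketch with random stand-in data; the descriptor dimensionality, class setup, and C value are assumptions, not the paper's exact configuration.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Hypothetical stand-in for P-CNN video descriptors: one fixed-length
# vector per clip (in the real pipeline these come from CNN part features).
n_clips, dim = 40, 64
X = rng.normal(size=(n_clips, dim))
y = np.array([0, 1] * (n_clips // 2))
X[y == 1] += 2.0  # shift one action class so the toy problem is separable

# Linear SVM on the video descriptors, as in the pipeline's final step.
clf = LinearSVC(C=1.0).fit(X, y)
train_acc = (clf.predict(X) == y).mean()
```

In the multi-class setting one such classifier is trained per action (one-vs-rest), and a clip takes the label of the highest-scoring classifier.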
Pose-CNN
(1) Input video
(2) Human pose estimation
(3) Crop RGB and optical-flow patches of body parts
(4) Extract CNN features (appearance and motion) per part and per frame
(5) Aggregate per-frame descriptors over time (max/min)
(6) Normalize aggregated descriptors
(7) Concatenate appearance and motion descriptors from all body parts
Pose-CNN
Compute temporal differences of CNN features: Δf_t = f_{t+Δt} − f_t
Aggregation (max and min) of the frame descriptors over time
Concatenation to get static and dynamic video descriptors
Normalization of the video descriptor: divide by the average L2-norm of the f_t from the training set
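The aggregation steps above can be sketched in numpy. This is a minimal sketch assuming per-frame features are already extracted; the function name, the default Δt, and passing the training-set norm as a precomputed scalar are assumptions for illustration.

```python
import numpy as np

def pcnn_video_descriptor(frames, delta=4, train_norm=1.0):
    """Aggregate per-frame CNN features into a video descriptor (sketch).

    frames:     (T, D) array of per-frame CNN features f_t (assumed given).
    delta:      temporal offset for the differences Δf_t = f_{t+Δt} - f_t.
    train_norm: average L2-norm of descriptors on the training set,
                assumed precomputed, used for normalization.
    """
    static = frames                               # per-frame features f_t
    dynamic = frames[delta:] - frames[:-delta]    # temporal differences Δf_t

    # Min and max aggregation over time, then concatenation of the
    # static and dynamic parts into one video descriptor.
    v_stat = np.concatenate([static.min(axis=0), static.max(axis=0)])
    v_dyn = np.concatenate([dynamic.min(axis=0), dynamic.max(axis=0)])
    v = np.concatenate([v_stat, v_dyn])
    return v / train_norm
```

The same aggregation is applied per body part and per stream (appearance and flow), and the results are concatenated.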
State-of-the-art methods detect and track human poses in videos:
Extract poses for individual frames, using a deformable part model to locate the positions of body joints.
Extract a large set of pose configurations in each frame and link them over time, constrained to score highly under the pose estimator.
The motion of joints in a pose sequence is constrained to be consistent with the optical flow extracted at the joint positions.
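The optical-flow consistency constraint can be illustrated with a small score: compare each joint's displacement between frames with the flow sampled at the joint's position. This is a sketch under assumptions (dense flow fields given as arrays, nearest-pixel sampling, L2 error); the function name is hypothetical.

```python
import numpy as np

def flow_consistency(track, flows):
    """Score how well a joint track agrees with dense optical flow.

    track: (T, 2) array of (x, y) joint positions over T frames (assumed).
    flows: (T-1, H, W, 2) optical flow between consecutive frames (assumed).
    Returns the mean L2 error between each joint displacement and the
    flow sampled at the joint's position (lower = more consistent).
    """
    errors = []
    for t in range(len(track) - 1):
        x, y = np.round(track[t]).astype(int)
        predicted = flows[t, y, x]           # flow vector at the joint
        actual = track[t + 1] - track[t]     # observed joint displacement
        errors.append(np.linalg.norm(actual - predicted))
    return float(np.mean(errors))
```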
High-level pose features (HLPF) encode spatial and temporal relations of body joint positions.
Positions of body joints are first normalized as relative offsets to the head.
Static features: distances between all pairs of joints, orientations of the vectors connecting pairs of joints, and inner angles.
Dynamic features are obtained from trajectories of body joints and quantized using a separate codebook.
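The static part of these features can be sketched as follows: head-relative offsets plus pairwise distances and orientations. This is a minimal sketch; the function name, the head index convention, and the omission of inner angles and the dynamic (trajectory) part are simplifications for illustration.

```python
import numpy as np
from itertools import combinations

def hlpf_static(joints, head_idx=0):
    """Static high-level pose features for one frame (sketch).

    joints:   (J, 2) array of (x, y) body-joint coordinates (assumed given).
    head_idx: index of the head joint used for normalization (assumed 0).
    """
    # Normalize positions as offsets relative to the head joint.
    rel = joints - joints[head_idx]
    feats = list(rel.ravel())

    # Pairwise features over all joint pairs.
    for i, j in combinations(range(len(joints)), 2):
        v = joints[j] - joints[i]
        feats.append(np.linalg.norm(v))        # distance between the pair
        feats.append(np.arctan2(v[1], v[0]))   # orientation of the vector
    return np.array(feats)
```

The full HLPF descriptor additionally includes inner angles and trajectory-based dynamic features, quantized with a separate codebook.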
Dense trajectory features
Fisher Vector
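A Fisher vector encodes a set of local descriptors (here, dense-trajectory features) as gradients of a diagonal-covariance GMM. The sketch below follows the standard improved-Fisher-vector recipe (power- plus L2-normalization); the function name and the use of sklearn's GaussianMixture are implementation choices, not taken from the source.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(descriptors, gmm):
    """Fisher vector encoding with a diagonal-covariance GMM (sketch).

    descriptors: (N, D) local descriptors, e.g. dense-trajectory features.
    gmm:         fitted sklearn GaussianMixture, covariance_type='diag'.
    Returns the 2*K*D vector of gradients w.r.t. GMM means and stds.
    """
    N, D = descriptors.shape
    gamma = gmm.predict_proba(descriptors)         # (N, K) soft assignments
    mu, var, w = gmm.means_, gmm.covariances_, gmm.weights_
    sigma = np.sqrt(var)

    diff = (descriptors[:, None, :] - mu) / sigma  # (N, K, D) whitened
    g_mu = (gamma[..., None] * diff).sum(0) / (N * np.sqrt(w)[:, None])
    g_sigma = (gamma[..., None] * (diff**2 - 1)).sum(0) / (N * np.sqrt(2 * w)[:, None])

    fv = np.concatenate([g_mu.ravel(), g_sigma.ravel()])
    # Power-normalization then L2-normalization (improved Fisher vector).
    fv = np.sign(fv) * np.sqrt(np.abs(fv))
    return fv / (np.linalg.norm(fv) + 1e-12)
```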
Datasets used for evaluation: JHMDB
21 human actions, such as brush hair, climb, golf, run, or sit; clips are restricted to the duration of the action.
Between 36 and 55 clips per action, for a total of 928 clips; 3 train/test splits.
Each clip contains between 15 and 40 frames of size 320×240.
Human pose is annotated in each of the 31,838 frames.
The metric is accuracy: each clip is assigned the action label corresponding to the maximum value among the scores returned by the action classifiers.
Sub-JHMDB includes 316 clips distributed over 12 actions in which the human body is fully visible; 3 train/test splits, and the evaluation metric is accuracy.
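The accuracy metric described above is a simple argmax over per-class scores; a minimal sketch (function name hypothetical):

```python
import numpy as np

def clip_accuracy(scores, labels):
    """Accuracy as used on JHMDB: each clip takes the label of the
    classifier with the maximum score.

    scores: (n_clips, n_classes) per-class classifier scores (assumed given).
    labels: (n_clips,) ground-truth action indices.
    """
    predicted = scores.argmax(axis=1)
    return float((predicted == labels).mean())
```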
JHMDB
Datasets used for evaluation: MPII Cooking
64 fine-grained actions and an additional background class; a total of 5,609 clips, 7 train/test splits, frame size 1624×1224.
Actions are very similar, such as cut dice, cut slices, and cut stripes, or wash hands and wash objects.
Sub-MPII Cooking: a selection of two similar classes, wash hands and wash objects, with ground-truth pose.
55 and 139 clips for wash hands and wash objects respectively, for a total of 29,997 frames.
MPII Cooking
Performance of the individual features
Different body parts are complementary.
Appearance and flow are complementary.
Robustness of P-CNN
P-CNN is on par with HLPF for ground-truth (GT) poses.
P-CNN is significantly more robust for real, noisy poses.
CNN Feature Pyramid Architecture
(1) Spatial and temporal feature extraction
(2) Building a pyramid
(3) Creating a video representation
(4) Classification
Hierarchical model for a sample snippet
Extract Binary key-frames