P-CNN: Pose-based CNN Features for Action Recognition
Iman Rezazadeh
Introduction
Automatic understanding of dynamic scenes must cope with strong variations of people and scenes in motion and appearance.
Statistical representations of local motion descriptors work well for coarse actions (standing up, hand-shaking, dancing), but fine-grained actions are harder.
We believe action recognition will benefit from spatial and temporal detection and alignment of human poses in videos.
Proposed: an action descriptor based on human poses, i.e. tracks of body joints over time.
http://jhmdb.is.tue.mpg.de/
Two-Stream Convolutional Networks
(1) Spatial and temporal feature extraction
(2) Building a pyramid
(3) Creating a video representation
(4) Classification
Robust pose features: Pose-CNN
Track human pose in a video to obtain body-part tracks.
Extract CNN features (appearance and motion) per part track.
Train an SVM classifier.
(Cordelia Schmid)
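The final classification step above can be sketched as follows: a linear SVM trained on fixed-length per-clip descriptors. This is a minimal sketch with random stand-in data; the descriptor dimensionality, class setup, and C value are assumptions, not the paper's exact configuration.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Hypothetical stand-in for P-CNN video descriptors: one fixed-length
# vector per clip (in the real pipeline these come from CNN part features).
n_clips, dim = 40, 64
X = rng.normal(size=(n_clips, dim))
y = np.array([0, 1] * (n_clips // 2))
X[y == 1] += 2.0  # shift one action class so the toy problem is separable

# Linear SVM on the video descriptors, as in the pipeline's final step.
clf = LinearSVC(C=1.0).fit(X, y)
train_acc = (clf.predict(X) == y).mean()
```

In the multi-class setting one such classifier is trained per action (one-vs-rest), and a clip takes the label of the highest-scoring classifier.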
Pose-CNN
(1) Input video
(2) Human pose estimation
(3) Crop RGB and optical-flow patches of body parts
(4) Extract CNN features (appearance and motion) per part and per frame
(5) Aggregate per-frame descriptors over time (max/min)
(6) Normalize aggregated descriptors
(7) Concatenate appearance and motion descriptors from all body parts
Pose-CNN
Compute temporal differences of CNN features: Δf_t = f_{t+Δt} − f_t
Aggregation (max and min) of the frame descriptors over time
Concatenation to get static and dynamic video descriptors
Normalization of the video descriptor: divide by the average L2-norm of the f_t from the training set
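The aggregation steps above can be sketched in numpy. This is a minimal sketch assuming per-frame features are already extracted; the function name, the default Δt, and passing the training-set norm as a precomputed scalar are assumptions for illustration.

```python
import numpy as np

def pcnn_video_descriptor(frames, delta=4, train_norm=1.0):
    """Aggregate per-frame CNN features into a video descriptor (sketch).

    frames:     (T, D) array of per-frame CNN features f_t (assumed given).
    delta:      temporal offset for the differences Δf_t = f_{t+Δt} - f_t.
    train_norm: average L2-norm of descriptors on the training set,
                assumed precomputed, used for normalization.
    """
    static = frames                               # per-frame features f_t
    dynamic = frames[delta:] - frames[:-delta]    # temporal differences Δf_t

    # Min and max aggregation over time, then concatenation of the
    # static and dynamic parts into one video descriptor.
    v_stat = np.concatenate([static.min(axis=0), static.max(axis=0)])
    v_dyn = np.concatenate([dynamic.min(axis=0), dynamic.max(axis=0)])
    v = np.concatenate([v_stat, v_dyn])
    return v / train_norm
```

The same aggregation is applied per body part and per stream (appearance and flow), and the results are concatenated.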
State-of-the-art methods detect and track human poses in videos:
Extract poses for individual frames, using a deformable part model to locate the positions of body joints.
Extract a large set of pose configurations in each frame and link them over time, constrained to score highly under the pose estimator.
The motion of joints in a pose sequence is constrained to be consistent with the optical flow extracted at the joint positions.
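The optical-flow consistency constraint can be illustrated with a small score: compare each joint's displacement between frames with the flow sampled at the joint's position. This is a sketch under assumptions (dense flow fields given as arrays, nearest-pixel sampling, L2 error); the function name is hypothetical.

```python
import numpy as np

def flow_consistency(track, flows):
    """Score how well a joint track agrees with dense optical flow.

    track: (T, 2) array of (x, y) joint positions over T frames (assumed).
    flows: (T-1, H, W, 2) optical flow between consecutive frames (assumed).
    Returns the mean L2 error between each joint displacement and the
    flow sampled at the joint's position (lower = more consistent).
    """
    errors = []
    for t in range(len(track) - 1):
        x, y = np.round(track[t]).astype(int)
        predicted = flows[t, y, x]           # flow vector at the joint
        actual = track[t + 1] - track[t]     # observed joint displacement
        errors.append(np.linalg.norm(actual - predicted))
    return float(np.mean(errors))
```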
High-level pose features (HLPF) encode spatial and temporal relations of body joint positions.
Positions of body joints are first normalized as relative offsets to the head.
Static features: distances between all pairs of joints, orientations of the vectors connecting pairs of joints, and inner angles.
Dynamic features are obtained from trajectories of body joints and quantized using a separate codebook.
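The static part of these features can be sketched as follows: head-relative offsets plus pairwise distances and orientations. This is a minimal sketch; the function name, the head index convention, and the omission of inner angles and the dynamic (trajectory) part are simplifications for illustration.

```python
import numpy as np
from itertools import combinations

def hlpf_static(joints, head_idx=0):
    """Static high-level pose features for one frame (sketch).

    joints:   (J, 2) array of (x, y) body-joint coordinates (assumed given).
    head_idx: index of the head joint used for normalization (assumed 0).
    """
    # Normalize positions as offsets relative to the head joint.
    rel = joints - joints[head_idx]
    feats = list(rel.ravel())

    # Pairwise features over all joint pairs.
    for i, j in combinations(range(len(joints)), 2):
        v = joints[j] - joints[i]
        feats.append(np.linalg.norm(v))        # distance between the pair
        feats.append(np.arctan2(v[1], v[0]))   # orientation of the vector
    return np.array(feats)
```

The full HLPF descriptor additionally includes inner angles and trajectory-based dynamic features, quantized with a separate codebook.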
Dense trajectory features
Fisher Vector
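A Fisher vector encodes a set of local descriptors (here, dense-trajectory features) as gradients of a diagonal-covariance GMM. The sketch below follows the standard improved-Fisher-vector recipe (power- plus L2-normalization); the function name and the use of sklearn's GaussianMixture are implementation choices, not taken from the source.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(descriptors, gmm):
    """Fisher vector encoding with a diagonal-covariance GMM (sketch).

    descriptors: (N, D) local descriptors, e.g. dense-trajectory features.
    gmm:         fitted sklearn GaussianMixture, covariance_type='diag'.
    Returns the 2*K*D vector of gradients w.r.t. GMM means and stds.
    """
    N, D = descriptors.shape
    gamma = gmm.predict_proba(descriptors)         # (N, K) soft assignments
    mu, var, w = gmm.means_, gmm.covariances_, gmm.weights_
    sigma = np.sqrt(var)

    diff = (descriptors[:, None, :] - mu) / sigma  # (N, K, D) whitened
    g_mu = (gamma[..., None] * diff).sum(0) / (N * np.sqrt(w)[:, None])
    g_sigma = (gamma[..., None] * (diff**2 - 1)).sum(0) / (N * np.sqrt(2 * w)[:, None])

    fv = np.concatenate([g_mu.ravel(), g_sigma.ravel()])
    # Power-normalization then L2-normalization (improved Fisher vector).
    fv = np.sign(fv) * np.sqrt(np.abs(fv))
    return fv / (np.linalg.norm(fv) + 1e-12)
```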
Datasets used for evaluation: JHMDB
21 human actions, such as brush hair, climb, golf, run, or sit; clips are restricted to the duration of the action.
Between 36 and 55 clips per action, for a total of 928 clips; 3 train/test splits.
Each clip contains between 15 and 40 frames of size 320×240.
Human pose is annotated in each of the 31,838 frames.
The metric is accuracy: each clip is assigned the action label corresponding to the maximum value among the scores returned by the action classifiers.
Sub-JHMDB includes 316 clips distributed over 12 actions in which the human body is fully visible; 3 train/test splits, and the evaluation metric is accuracy.
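The accuracy metric described above is a simple argmax over per-class scores; a minimal sketch (function name hypothetical):

```python
import numpy as np

def clip_accuracy(scores, labels):
    """Accuracy as used on JHMDB: each clip takes the label of the
    classifier with the maximum score.

    scores: (n_clips, n_classes) per-class classifier scores (assumed given).
    labels: (n_clips,) ground-truth action indices.
    """
    predicted = scores.argmax(axis=1)
    return float((predicted == labels).mean())
```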
JHMDB
Datasets used for evaluation: MPII Cooking
64 fine-grained actions and an additional background class; a total of 5,609 clips, 7 train/test splits, frame size 1624×1224.
Actions are very similar, such as cut dice, cut slices, and cut stripes, or wash hands and wash objects.
Sub-MPII Cooking: a selection of two similar classes, wash hands and wash objects, with ground-truth pose.
55 and 139 clips for wash hands and wash objects respectively, for a total of 29,997 frames.
MPII Cooking
Performance of the individual features
Different body parts are complementary.
Appearance and flow are complementary.
Robustness of P-CNN
P-CNN is on par with HLPF for ground-truth (GT) poses.
P-CNN is significantly more robust for real, noisy poses.
CNN Feature Pyramid Architecture
(1) Spatial and temporal feature extraction
(2) Building a pyramid
(3) Creating a video representation
(4) Classification
Hierarchical model for a sample snippet
Extract Binary key-frames