P-CNN: Pose-based CNN Features for Action Recognition. Iman Rezazadeh

1 P-CNN: Pose-based CNN Features for Action Recognition Iman Rezazadeh

2 Introduction
- Automatic understanding of dynamic scenes
- Strong variations of people and scenes in motion and appearance
- Fine-grained actions
- Statistical representations of local motion descriptors
- Coarse actions: standing up, hand-shaking, dancing
- We believe action recognition will benefit from spatial and temporal detection and alignment of human poses in videos
- Action descriptor based on human poses: tracks of body joints over time

3 Two-Stream Convolutional Networks
(1) Spatial and temporal feature extraction
(2) Building a pyramid
(3) Creating a video representation
(4) Classification

4 Robust pose features: Pose-CNN
- Track human pose in a video (body part track)
- Extract CNN features (appearance and motion) per part track
- Train an SVM classifier
Cordelia Schmid

5 Pose-CNN
(1) Input video
(2) Human pose estimation
(3) Crop RGB and optical flow patches of body parts
(4) Extract CNN features (appearance and motion) per part and per frame
(5) Aggregate per-frame descriptors over time (max/min)
(6) Normalize aggregated descriptors
(7) Concatenate appearance and motion descriptors from all body parts
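Step (3) can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes a frame (RGB or optical flow) stored as a NumPy array of shape (height, width, channels), a joint position given in (x, y) pixel coordinates, and a hypothetical `size` parameter for the patch side. Zero-padding outside the frame is also an assumption; the slides do not say how borders are handled.

```python
import numpy as np

def crop_patch(frame, center_xy, size):
    """Crop a size x size patch centred on a joint, zero-padding outside the frame.

    `size` and the zero-padding policy are illustrative assumptions.
    """
    h, w = frame.shape[:2]
    half = size // 2
    cx, cy = int(round(center_xy[0])), int(round(center_xy[1]))
    x0, y0 = cx - half, cy - half          # top-left corner of the window
    x1, y1 = x0 + size, y0 + size          # bottom-right corner (exclusive)

    patch = np.zeros((size, size) + frame.shape[2:], dtype=frame.dtype)

    # Part of the window that actually lies inside the frame.
    sx0, sy0 = max(x0, 0), max(y0, 0)
    sx1, sy1 = min(x1, w), min(y1, h)
    if sx0 < sx1 and sy0 < sy1:
        patch[sy0 - y0:sy1 - y0, sx0 - x0:sx1 - x0] = frame[sy0:sy1, sx0:sx1]
    return patch
```

The same function can be applied to the RGB frame and to the optical flow field, so that appearance and motion patches cover the same body part.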

6 Pose-CNN
- Compute temporal differences of the per-frame CNN features f_t^p (frame t, body part p) between frames a fixed interval apart
- Aggregation (max and min) of frame descriptors
- Concatenation to get static and dynamic video descriptors
- Normalization of video descriptor: normalize by the average L2-norm of the f_t^p from the training set
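The aggregation on this slide can be sketched in NumPy. This is a minimal sketch under stated assumptions: the per-frame features of one body part are already stacked in a (T, D) array, the frame interval `dt` is a hypothetical parameter (the slide does not give its value), and the function names are illustrative.

```python
import numpy as np

def aggregate_part(frame_feats, dt):
    """Min/max aggregation of per-frame features for one body part.

    frame_feats: (T, D) array, one row per frame.
    dt: frame interval for temporal differences (hypothetical parameter, 1 <= dt < T).
    Returns (static, dynamic), each of length 2 * D.
    """
    frame_feats = np.asarray(frame_feats, dtype=float)
    # Static descriptor: per-dimension min and max over time.
    static = np.concatenate([frame_feats.min(axis=0), frame_feats.max(axis=0)])
    # Temporal differences between frames dt apart.
    diffs = frame_feats[dt:] - frame_feats[:-dt]
    # Dynamic descriptor: per-dimension min and max of the differences.
    dynamic = np.concatenate([diffs.min(axis=0), diffs.max(axis=0)])
    return static, dynamic

def average_l2_norm(train_feats):
    """Average L2-norm of frame descriptors over a list of (T, D) training arrays."""
    norms = np.concatenate([np.linalg.norm(np.asarray(f, dtype=float), axis=1)
                            for f in train_feats])
    return float(norms.mean())

def normalized_descriptor(frame_feats, dt, norm_const):
    """Concatenate static and dynamic descriptors and divide by a training-set constant."""
    static, dynamic = aggregate_part(frame_feats, dt)
    return np.concatenate([static, dynamic]) / norm_const
```

Dividing both descriptors by a single constant is a simplification for illustration. Descriptors of all body parts, for both appearance and motion, would then be concatenated into the final video representation.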

7 State-of-the-art methods
- Detect and track human poses in videos
- Extract poses for individual frames
- Deformable part model to locate positions of body joints
- Extract a large set of pose configurations in each frame and link them, constrained to have a high score of the pose estimator
- Motion of joints in a pose sequence is constrained to be consistent with the optical flow extracted at joint positions
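One common way to link per-frame candidates into a sequence is dynamic programming over the candidates. The slide does not name the algorithm, so the sketch below is only an assumed formulation: `unary[t][i]` stands for the pose-estimator score of candidate i in frame t, and `pairwise[t][i, j]` for a consistency score between candidate i in frame t and candidate j in frame t + 1 (for example, agreement with optical flow). All names are hypothetical.

```python
import numpy as np

def link_candidates(unary, pairwise):
    """Pick one candidate per frame maximizing the sum of unary and pairwise scores.

    unary: list of T sequences, unary[t][i] = score of candidate i in frame t.
    pairwise: list of T - 1 matrices, pairwise[t][i][j] = score of moving
              from candidate i (frame t) to candidate j (frame t + 1).
    Returns a list of T candidate indices.
    """
    best = np.asarray(unary[0], dtype=float)
    back_pointers = []
    for t in range(1, len(unary)):
        # total[i, j]: best score ending in candidate i at t - 1, then moving to j.
        total = best[:, None] + np.asarray(pairwise[t - 1], dtype=float)
        back_pointers.append(total.argmax(axis=0))
        best = total.max(axis=0) + np.asarray(unary[t], dtype=float)

    # Trace the best path backwards from the last frame.
    path = [int(best.argmax())]
    for bp in reversed(back_pointers):
        path.append(int(bp[path[-1]]))
    return path[::-1]
```

With zero pairwise scores the result simply follows the best candidate in each frame; a strong penalty on inconsistent transitions makes the chosen sequence smoother.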

8 State-of-the-art methods

9 High-level pose features
- Encode spatial and temporal relations of body joint positions
- Positions of body joints are first normalized
- Relative offsets to the head
- Distances between all pairs of joints, orientations of the vectors connecting pairs of joints, and inner angles
- Dynamic features are obtained from trajectories of body joints
- Quantized using a separate codebook
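The static features listed above can be illustrated with a short sketch. It assumes joints are given as a (J, 2) array of already normalized (x, y) positions and that the head is at index 0; both are assumptions for illustration, and the function names are hypothetical. Codebook quantization and the dynamic (trajectory) features are not covered.

```python
import numpy as np
from itertools import combinations

def static_pose_features(joints, head=0):
    """Offsets to the head, pairwise distances, and pairwise orientations.

    joints: (J, 2) array of normalized (x, y) joint positions.
    head: index of the head joint (assumed to be 0 here).
    """
    joints = np.asarray(joints, dtype=float)
    offsets = joints - joints[head]
    pairs = list(combinations(range(len(joints)), 2))
    # Vector from joint i to joint j for every pair i < j.
    vectors = np.array([joints[j] - joints[i] for i, j in pairs])
    distances = np.linalg.norm(vectors, axis=1)
    orientations = np.arctan2(vectors[:, 1], vectors[:, 0])
    return {"offsets": offsets, "distances": distances, "orientations": orientations}

def inner_angle(a, b, c):
    """Angle (radians) at joint b formed by the segments b-a and b-c."""
    u = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    v = np.asarray(c, dtype=float) - np.asarray(b, dtype=float)
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))
```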

10 Dense trajectory features

11 Dense trajectory features

12 Dense trajectory features

13 Fisher Vector

14 Datasets used for evaluation: JHMDB
- 21 human actions, such as brush hair, climb, golf, run or sit
- Clips are restricted to the duration of the action
- Between 36 and 55 clips per action for a total of 928 clips, 3 train/test splits
- Each clip contains between 15 and 40 frames
- Human pose is annotated in each of the frames
- The metric used is accuracy: each clip is assigned an action label corresponding to the maximum value among the scores returned by the action classifiers
sub-JHMDB
- Includes 316 clips distributed over 12 actions in which the human body is fully visible
- 3 train/test splits; the evaluation metric is accuracy
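The accuracy metric described above amounts to an argmax over classifier scores. A minimal sketch, assuming scores are stored as a (clips, actions) array and ground-truth labels as integer class indices (both layouts are assumptions):

```python
import numpy as np

def predict_labels(scores):
    """Assign each clip the action whose classifier returned the highest score."""
    return np.argmax(np.asarray(scores, dtype=float), axis=1)

def accuracy(scores, labels):
    """Fraction of clips whose predicted action matches the ground-truth label."""
    return float(np.mean(predict_labels(scores) == np.asarray(labels)))
```

With several train/test splits, this value would typically be computed per split and then averaged.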

15 JHMDB

16 Datasets used for evaluation: MPII Cooking
- 64 fine-grained actions and an additional background class
- A total of 5609 clips, 7 training/test splits
- Actions are very similar, such as cut dice, cut slices, and cut stripes, or wash hands and wash objects
sub-MPII Cooking
- Selection of two similar classes, wash hands and wash objects, with GT pose
- 55 and 139 clips for wash hands and wash objects respectively, for a total of 29,997 frames

17 MPII Cooking

18 Performance of the individual features
- Different body parts are complementary
- Appearance and flow are complementary

19 Robustness of P-CNN
- P-CNN is on par with HLPF for ground-truth (GT) poses
- P-CNN is significantly more robust for real, noisy poses

20 CNN Feature Pyramid Architecture
(1) Spatial and temporal feature extraction
(2) Building a pyramid
(3) Creating a video representation
(4) Classification

21 Hierarchical model for a sample snippet

22 Extract Binary key-frames
