Human Pose Estimation with Deep Learning Wei Yang
Applications Understand Activities Family Robots American Heist (2014) - The Bank Robbery Scene 2
What do we need to know to recognize a crime scene? 3
stand stand Cues Scene: bank Abnormal pose Lay down Lay down Lay down Hands up Activity: robbery 4
Why is human pose estimation challenging? 5
#1. Articulation #2. Occlusion #3. Scale variation 6
#1. Articulation #2. Occlusion #3. Scale variation 7
#1. Articulation #2. Occlusion #3. Scale variation 8
Applications Understand Activities Family Robots 9
3D Human Poses Real-Time Imitation of Human Whole-Body Motions by Humanoids. J. Koenemann, F. Burget, and M. Bennewitz. ICRA, 2014. 10
Deep Learning Based Methods Fully Convolutional Network Regression with Euclidean Loss: P heatmaps H p L = 1 σ P 2 2 p=1 H p H p 2 where H p N l p, Σ, s. t., p = 1,, P 11
Outline Scale 3D Pose Gray Feature pyramid learning Black In-the-wild 3D pose estimation ICCV 2017 CVPR 2018 12
Outline Scale 3D Pose Gray Feature pyramid learning Black In-the-wild 3D pose estimation ICCV 2017 CVPR 2018 13
Why the Scale Matters? Yipin Yang, Yao Yu, Yu Zhou, Sidan Du, James Davis, Ruigang Yang. Semantic Parametric Reshaping of Human Body Models. In 3DV Workshop on Dynamic Shape Measurement and Analysis, 2014. 14
Why the Scale Matters? Learning Feature Pyramids for Human Pose Estimation Wei Yang, Shuang Li, Wanli Ouyang, Hongsheng Li, Xiaogang Wang ICCV, 2017 15
Previous work Multi-scale testing Multi-branch network The model itself is not scale invariant Felzenszwalb, Pedro F., et al. "Object detection with discriminatively trained part-based models." TPAMI, 2010. Need much more memory and computation Tompson, Jonathan, et al. "Efficient object localization using convolutional networks." CVPR. 2015. 16
Hourglass Newell A, Yang K, Deng J. Stacked hourglass networks for human pose estimation[c]//european Conference on Computer Vision. Springer, Cham, 2016: 483-499. 17
Identity Mapping Conv PRM + Pool PRM Pyramid Residual Modules 256x256 128 128 64 64 Stack 1 Stack n (a) x (l) Hourglass Hourglass Ratio 1 Ratio n (b) f 1 f C Detailed hourglass structure f 0 g Convolution Pyramid Residual module Score maps Addition x (l+1) Newell et al. Stacked Hourglass Networks for Human Pose Estimation. ECCV, 2016 18
Initialization of Multi-Branch Networks Single-branch networks VGG Multi-branch networks Inceptions Traditional weight initialization methods, e.g., Gaussian, Xavier, MSRA (Kaiming), are not applicable for multi-branch networks. Xavier Glorot, Yoshua Bengio ; Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, PMLR 9:249-256, 2010. 19
Forward Backward Initialization of Multi-Branch Networks x 1 (l) x2 (l) (l) x ci x (l) Conv / FC Conv / FC y (l) C i (l) y (l) = W (l) (l) x c + b (l) c=1 x (l+1) = f(y (l) ) y 1 (l) y2 (l) C o (l) Δx (l) = W l T Δy (l) c=1 Δy (l) = f (y l )Δx (l+1) (l) y co αc i l n i l Var ω l = 1 αc o l n o l Var ω l = 1 * α = 0.5 for ReLU and 1 for Tanh and Sigmoid. 20
OUTPUT STD Initialization of Multi-Branch Networks 1.1 MSR init Ours init 0.9 0.7 0.5 0.3 0.1-0.1 1 2 3 4 5 6 7 8 9 10 11 LAYER INDEX He, Kaiming, et al. "Delving deep into rectifiers: Surpassing human-level performance on imagenet classification." ICCV. 2015. 21
Qualitative Results MPII dataset LSP dataset 22
Evaluation Metric PCK: Percentage of Correct Keypoints α max(h, w) 23
Results on MPII Human Pose State-of-the-art performance http://human-pose.mpi-inf.mpg.de/#results 24
Image Classification Top-1 Test Error on CIFAR-10 25
Semantic Segmentation: PASCAL VOC 2012 dataset (a) Image (b) DeepLab (c) DeepLap+PRM (a) Image (b) DeepLab (c) DeepLap+PRM (a) Image (b) DeepLab (c) DeepLap+PRM
Section Summary Feature pyramid module Generalizable for various networks and tasks Weight initialization for multi-branch networks Learning Feature Pyramids for Human Pose Estimation Wei Yang, Shuang Li, Wanli Ouyang, Hongsheng Li, Xiaogang Wang ICCV, 2017 27
Outline Scale 3D Pose Gray Feature pyramid learning Black In-the-wild 3D pose estimation ICCV 2017 CVPR 2018 28
Challenges: No Annotation Constrained scenes In-the-wild scenes Domain Discrepancy No annotation 29
Which one is more plausible? Discriminator 30
Weakly Supervised Adversarial Learning 3D dataset Images w/o GT Real Fake G 3D Human Pose Estimator Prediction D Multi-source Discriminator Ground-truth 31
Adversarial Learning Fool Generator Discriminator Loss G Euclidean Loss Tell Loss D Classification Loss 32
Conv Residual Residual Depth Generator 2D module Depth module 256x256 128 128 64 64 Stack 1 Stack n Hourglass 2D score maps 3D Poses 33
Discriminator 34
64 P 256 Multi-Source Discriminator Real or Fake samples Image I CNN Geometric descriptor P [Δx, Δy, Δz] [Δx 2, Δy 2, Δz 2 ] CNN Fully Connected layers Real Fake Raw poses CNN 64 2D Heatmaps Depthmaps Concatenation 35
Effectiveness of Adversarial Learning 36
(Ours) Ablation Study on H36M Dataset MPJPE (error in mm) on H36M Image+Pose+Geo Image+Geo Image+Pose Jointly learn 2D + depth Fix 2D, finetune depth Zhou et al. ICCV 17 59.7 60.3 8% less error 61.3 64.8 64.9 65.2 58 60 62 64 66 Full Geo Pose Baseline Baseline (fix 2D) State-of-art* *Zhou et al. ICCV 17 37
Ours baseline Results on Images in the Wild 38
Multi-view Results 39
Section Summary Weakly supervised adversarial learning for 3D pose estimation in the wild Multi-source discriminator 3D Human Pose Estimation in the Wild by Adversarial Learning Wei Yang, Wanli Ouyang, Xiaolong Wang, Hongsheng Li, Xiaogang Wang CVPR, 2018 40
Code Open-source PyTorch code https://github.com/bearpaw/pytorch-pose ICCV 17 https://github.com/bearpaw/pyranet 41
Thanks! wyang@ee.cuhk.edu.hk http://www.ee.cuhk.edu.hk/~wyang/ @bearpaw 42