Human Pose Estimation with Deep Learning. Wei Yang

Transcription:

Applications: understanding activities; family robots. Example: American Heist (2014), the bank robbery scene.

What do we need to know to recognize a crime scene?

Cues. Scene: bank. Abnormal poses: some people stand, others lie down or hold their hands up. Activity: robbery.

Why is human pose estimation challenging?

#1. Articulation. #2. Occlusion. #3. Scale variation.

Applications: understanding activities; family robots.

3D Human Poses. Real-Time Imitation of Human Whole-Body Motions by Humanoids. J. Koenemann, F. Burget, and M. Bennewitz. ICRA, 2014.

Deep Learning Based Methods. A fully convolutional network regresses $P$ heatmaps $\hat{H}_p$ with a Euclidean loss:
$$L = \frac{1}{2} \sum_{p=1}^{P} \lVert \hat{H}_p - H_p \rVert_2^2, \quad \text{where } H_p \sim \mathcal{N}(l_p, \Sigma), \; p = 1, \dots, P.$$
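
The heatmap regression on this slide can be illustrated with a minimal NumPy sketch. It assumes that each target heatmap is an unnormalized 2D Gaussian centered at the joint location, with an isotropic covariance `sigma**2 * I`; the function names, the 64x64 map size, and the joint coordinates are hypothetical and chosen only for illustration.

```python
import numpy as np

def gaussian_heatmap(height, width, center, sigma=1.0):
    """Unnormalized 2D Gaussian peaked at center = (x, y).

    Assumption: isotropic covariance sigma^2 * I stands in for the slide's Sigma.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    cx, cy = center
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

def euclidean_loss(pred, target):
    """L = 1/2 * sum over the P heatmaps of the squared L2 distance."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    return 0.5 * float(np.sum((pred - target) ** 2))

# Two hypothetical joints, given as (x, y) pixel coordinates on a 64x64 map.
joints = [(16, 20), (40, 12)]
target = np.stack([gaussian_heatmap(64, 64, j, sigma=2.0) for j in joints])

# A network that predicts all zeros gets a positive loss; a perfect one gets 0.
pred = np.zeros_like(target)
loss = euclidean_loss(pred, target)
```

At test time, a joint location would typically be read off as the position of the maximum of each predicted heatmap.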

Outline. Scale: feature pyramid learning (ICCV 2017). 3D pose: in-the-wild 3D pose estimation (CVPR 2018).

Why Does Scale Matter? Yipin Yang, Yao Yu, Yu Zhou, Sidan Du, James Davis, Ruigang Yang. Semantic Parametric Reshaping of Human Body Models. In 3DV Workshop on Dynamic Shape Measurement and Analysis, 2014.

Why Does Scale Matter? Learning Feature Pyramids for Human Pose Estimation. Wei Yang, Shuang Li, Wanli Ouyang, Hongsheng Li, Xiaogang Wang. ICCV, 2017.

Previous work. Multi-scale testing: the model itself is not scale invariant (Felzenszwalb, Pedro F., et al. "Object detection with discriminatively trained part-based models." TPAMI, 2010). Multi-branch network: needs much more memory and computation (Tompson, Jonathan, et al. "Efficient object localization using convolutional networks." CVPR, 2015).

Hourglass. Newell A, Yang K, Deng J. Stacked hourglass networks for human pose estimation. European Conference on Computer Vision. Springer, Cham, 2016: 483-499.

Pyramid Residual Modules (PRM). [Figure: (a) network overview: a 256x256 input passes through Conv, PRM + Pool, and PRM layers (feature maps of 128x128, then 64x64) into hourglass stacks 1 to n, which output score maps; (b) detailed hourglass structure and a pyramid residual module from $x^{(l)}$ to $x^{(l+1)}$, with an identity mapping, branches $f_0, f_1, \dots, f_C$ at ratios 1 to n, a convolution $g$, and an addition.] Newell et al. Stacked Hourglass Networks for Human Pose Estimation. ECCV, 2016.

Initialization of Multi-Branch Networks. Single-branch networks: VGG. Multi-branch networks: Inceptions. Traditional weight initialization methods, e.g., Gaussian, Xavier, and MSRA (Kaiming), are not applicable to multi-branch networks. Xavier Glorot, Yoshua Bengio. Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, PMLR 9:249-256, 2010.

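Initialization of Multi-Branch Networks. [Figure: a Conv / FC layer with input branches $x_1^{(l)}, x_2^{(l)}, \dots, x_{C_i}^{(l)}$ and output branches $y_1^{(l)}, y_2^{(l)}, \dots, y_{C_o}^{(l)}$.]
Forward: $y^{(l)} = W^{(l)} \sum_{c=1}^{C_i^{(l)}} x_c^{(l)} + b^{(l)}$, $\; x^{(l+1)} = f(y^{(l)})$. Condition: $\alpha \, C_i^{(l)} n_i^{(l)} \, \mathrm{Var}[w^{(l)}] = 1$.
Backward: $\Delta x^{(l)} = \sum_{c=1}^{C_o^{(l)}} W^{(l)T} \Delta y^{(l)}$, $\; \Delta y^{(l)} = f'(y^{(l)}) \, \Delta x^{(l+1)}$. Condition: $\alpha \, C_o^{(l)} n_o^{(l)} \, \mathrm{Var}[w^{(l)}] = 1$.
* $\alpha = 0.5$ for ReLU and 1 for Tanh and Sigmoid.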
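
The forward condition can be checked numerically with a small sketch. It rests on several assumptions that are not on the slide: $n_i$ is read as the number of input units per branch, the branch inputs are independent zero-mean, unit-variance Gaussians, and $\alpha = 1$ is used so that the activation does not rescale the variance. All function names are hypothetical.

```python
import numpy as np

def multibranch_weight_std(n_in, num_in_branches, alpha=1.0):
    """Std of zero-mean weights satisfying alpha * C_i * n_i * Var[w] = 1."""
    return float(np.sqrt(1.0 / (alpha * num_in_branches * n_in)))

def simulated_output_std(weight_std, n_in=256, n_out=256,
                         num_in_branches=3, batch=512, seed=0):
    """Std of y = W * sum_c x_c for independent unit-variance input branches."""
    rng = np.random.default_rng(seed)
    # One independent Gaussian input per branch: (branches, batch, n_in).
    x = rng.standard_normal((num_in_branches, batch, n_in))
    summed = x.sum(axis=0)  # the layer sees the sum of its input branches
    w = rng.standard_normal((n_in, n_out)) * weight_std
    return float((summed @ w).std())

# Init that ignores the branch count (C_i = 1) versus one that accounts for it.
naive_std = simulated_output_std(multibranch_weight_std(256, 1))
branch_aware_std = simulated_output_std(multibranch_weight_std(256, 3))
```

Under these assumptions, the branch-unaware initialization inflates the output standard deviation by roughly the square root of the number of input branches, while the branch-aware one keeps it near 1. The backward condition has the same form, with $C_o$ and $n_o$ in place of $C_i$ and $n_i$.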

Initialization of Multi-Branch Networks. [Plot: output standard deviation (y-axis, -0.1 to 1.1) versus layer index (x-axis, 1 to 11), comparing MSR init with our init.] He, Kaiming, et al. "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification." ICCV. 2015.

Qualitative Results: MPII dataset and LSP dataset.

Evaluation Metric: PCK (Percentage of Correct Keypoints). A predicted keypoint counts as correct if it lies within $\alpha \cdot \max(h, w)$ of the ground truth.
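
As a minimal sketch of the PCK metric, the function below assumes that `h` and `w` are the height and width of the person bounding box (the slide does not say what they refer to) and that distances are measured in pixels. The function name, the default `alpha`, and the sample coordinates are hypothetical.

```python
import numpy as np

def pck(pred, gt, h, w, alpha=0.2):
    """Fraction of keypoints whose error is at most alpha * max(h, w)."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    dist = np.linalg.norm(pred - gt, axis=-1)  # per-keypoint Euclidean error
    return float(np.mean(dist <= alpha * max(h, w)))

# Three hypothetical keypoints; the threshold is 0.2 * max(100, 50) = 20 pixels.
pred = [[10, 10], [50, 50], [100, 100]]
gt = [[12, 10], [50, 80], [100, 105]]
score = pck(pred, gt, h=100, w=50, alpha=0.2)  # errors 2, 30, 5 -> 2 of 3 correct
```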

Results on MPII Human Pose: state-of-the-art performance. http://human-pose.mpi-inf.mpg.de/#results

Image Classification: top-1 test error on CIFAR-10.

Semantic Segmentation: PASCAL VOC 2012 dataset. [Figure: three examples, each with panels (a) Image, (b) DeepLab, (c) DeepLab+PRM.]

Section Summary. Feature pyramid module: generalizable to various networks and tasks. Weight initialization for multi-branch networks. Learning Feature Pyramids for Human Pose Estimation. Wei Yang, Shuang Li, Wanli Ouyang, Hongsheng Li, Xiaogang Wang. ICCV, 2017.

Outline. Scale: feature pyramid learning (ICCV 2017). 3D pose: in-the-wild 3D pose estimation (CVPR 2018).

Challenges: No Annotation. Constrained scenes versus in-the-wild scenes: domain discrepancy and no annotation.

Which one is more plausible? Discriminator.

Weakly Supervised Adversarial Learning. [Figure: images from a 3D dataset and images w/o GT are fed to G, the 3D human pose estimator; its predictions and the ground truth go to D, the multi-source discriminator, which labels them real or fake.]

Adversarial Learning. Generator: fool the discriminator; Loss G: Euclidean loss. Discriminator: tell real from fake; Loss D: classification loss.
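
The two objectives can be sketched as follows. This is only an illustration under stated assumptions, not necessarily the loss used in the paper: the classification loss is taken to be binary cross-entropy, the generator is assumed to add an adversarial "fool the discriminator" term to its Euclidean loss, and `adv_weight` is a hypothetical weighting factor.

```python
import numpy as np

def bce(prob, target, eps=1e-7):
    """Binary cross-entropy between predicted probabilities and a 0/1 target."""
    prob = np.clip(np.asarray(prob, dtype=float), eps, 1.0 - eps)
    return float(-np.mean(target * np.log(prob)
                          + (1.0 - target) * np.log(1.0 - prob)))

def discriminator_loss(d_real, d_fake):
    """Classification loss: ground-truth poses -> 1, predicted poses -> 0."""
    return bce(d_real, 1.0) + bce(d_fake, 0.0)

def generator_loss(pred_maps, gt_maps, d_fake, adv_weight=0.01):
    """Euclidean loss plus an assumed adversarial term that rewards fooling D."""
    diff = np.asarray(pred_maps, dtype=float) - np.asarray(gt_maps, dtype=float)
    euclidean = 0.5 * float(np.sum(diff ** 2))
    return euclidean + adv_weight * bce(d_fake, 1.0)

# An undecided discriminator (probability 0.5 everywhere) pays ln 2 per term.
d_loss = discriminator_loss(d_real=[0.5], d_fake=[0.5])
g_loss = generator_loss(np.zeros((2, 4, 4)), np.zeros((2, 4, 4)),
                        d_fake=[0.5], adv_weight=1.0)
```

For in-the-wild images without ground truth, only the adversarial term would be available to the generator, which is what makes the scheme weakly supervised.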

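Generator. [Figure: a 256x256 input passes through Conv and Residual layers (feature maps of 128x128, then 64x64) into hourglass stacks 1 to n; the 2D module outputs 2D score maps, and the depth module (Residual, Depth) turns them into 3D poses.]

Discriminator.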

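Multi-Source Discriminator. [Figure: three sources, each processed by a CNN: the image I (size label 256); a geometric descriptor (size label P) built from $[\Delta x, \Delta y, \Delta z]$ and $[\Delta x^2, \Delta y^2, \Delta z^2]$; and the raw poses as 2D heatmaps and depthmaps (size label 64). The features are concatenated and passed through fully connected layers to classify samples as real or fake.]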
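
One plausible reading of the geometric descriptor is a set of pairwise offsets between joints together with their squares. The sketch below builds such a descriptor under that assumption; the `(6, P, P)` layout, the sign convention (joint i minus joint j), and the function name are hypothetical.

```python
import numpy as np

def geometric_descriptor(joints):
    """Pairwise [dx, dy, dz, dx^2, dy^2, dz^2] between all joint pairs.

    joints: array of shape (P, 3) with one (x, y, z) row per joint.
    Returns an array of shape (6, P, P); entry [:, i, j] describes joint i
    relative to joint j (assumed convention).
    """
    joints = np.asarray(joints, dtype=float)
    diff = joints[:, None, :] - joints[None, :, :]     # (P, P, 3) offsets
    desc = np.concatenate([diff, diff ** 2], axis=-1)  # (P, P, 6)
    return np.transpose(desc, (2, 0, 1))               # (6, P, P)

# Two hypothetical joints: the origin and the point (1, 2, 3).
desc = geometric_descriptor([[0.0, 0.0, 0.0], [1.0, 2.0, 3.0]])
```

Because the descriptor depends only on differences between joints, it is unchanged when the whole pose is translated.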

Effectiveness of Adversarial Learning.

Ablation Study on H36M Dataset. [Bar chart: MPJPE (error in mm) on H36M for six variants: Full (Image+Pose+Geo, ours), Geo (Image+Geo), Pose (Image+Pose), Baseline (jointly learn 2D + depth), Baseline (fix 2D, finetune depth), and State-of-art* (Zhou et al. ICCV 17). Bar values shown: 59.7, 60.3, 61.3, 64.8, 64.9, 65.2; annotation: 8% less error.] *Zhou et al. ICCV 17.
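
MPJPE (mean per joint position error) is the average Euclidean distance between predicted and ground-truth 3D joints. A minimal sketch follows, assuming both poses are already expressed in the same coordinate frame and in millimetres; the slide does not specify any alignment step, so none is applied here, and the sample poses are hypothetical.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Euclidean distance over joints, in the units of the inputs."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1)))

# Two hypothetical joints in mm: one is off by a 3-4-5 triangle, one is exact.
pred = [[0.0, 0.0, 0.0], [100.0, 0.0, 0.0]]
gt = [[3.0, 4.0, 0.0], [100.0, 0.0, 0.0]]
error_mm = mpjpe(pred, gt)  # joint errors 5 mm and 0 mm -> mean 2.5 mm
```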

Results on Images in the Wild: ours versus baseline.

Multi-view Results.

Section Summary. Weakly supervised adversarial learning for 3D pose estimation in the wild. Multi-source discriminator. 3D Human Pose Estimation in the Wild by Adversarial Learning. Wei Yang, Wanli Ouyang, Xiaolong Wang, Hongsheng Li, Xiaogang Wang. CVPR, 2018.

Code. Open-source PyTorch code: https://github.com/bearpaw/pytorch-pose. ICCV 17: https://github.com/bearpaw/pyranet

Thanks! wyang@ee.cuhk.edu.hk http://www.ee.cuhk.edu.hk/~wyang/ @bearpaw