Jamie Shotton, Andrew Fitzgibbon, Mat Cook, Toby Sharp, Mark Finocchio, Richard Moore, Alex Kipman, Andrew Blake CVPR 2011


1 Jamie Shotton, Andrew Fitzgibbon, Mat Cook, Toby Sharp, Mark Finocchio, Richard Moore, Alex Kipman, Andrew Blake CVPR 2011

2 Auto-initialize a tracking algorithm and recover from failures. Handle all human poses, shapes, and sizes. Limited compute budget: super-real-time on Xbox 360, to allow games to run concurrently.

3 [Figure labels: right hand; neck; left shoulder; right elbow]

4 No temporal information: frame-by-frame. Local pose estimate of parts: each pixel and each body joint treated independently; reduced training data and computation time. Very fast: simple depth image features; parallel decision forest classifier.

5 [Figure labels: road, building, car, grass, water, cow, cat, bicycle] [Shotton, Winn, Rother, Criminisi] [Winn & Shotton 06] [Shotton, Johnson, Cipolla 08]

6 1. capture depth image & remove background; 2. infer body parts per pixel; 3. cluster pixels to hypothesize body joint positions; 4. fit model & track skeleton

7 Compute $P(c_i \mid w_i)$: pixels $i = (x, y)$, body part $c_i$, image window $w_i$. Image windows move with classifier. Discriminative approach: learn classifier $P(c_i \mid w_i)$ from training data.

8 Record mocap: 500k frames distilled to 100k poses. Retarget to several models. Render (depth, body parts) pairs. Train invariance to:

9 [Figure panels: synthetic (train & test); real (test)]

10 Depth comparisons: very fast to compute. Feature response: $f(I, \mathbf{x}) = d_I(\mathbf{x}) - d_I(\mathbf{x} + \Delta)$, where $\mathbf{x}$ is the image coordinate, $d_I$ is the depth in the input depth image $I$, and $\Delta$ is the offset. The offset $\Delta = \mathbf{v} / d_I(\mathbf{x})$ scales inversely with depth. Background pixels: $d$ = large constant.
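The depth comparison feature on slide 10 can be sketched in a few lines. This is only an illustration under stated assumptions: the depth image is a list of rows in meters, a stored value of 0 marks background, and the names `BACKGROUND_DEPTH`, `depth_at`, and `feature` are hypothetical, not taken from the talk.

```python
# Assumed stand-in for the slide's "large constant" used for background pixels.
BACKGROUND_DEPTH = 1e6


def depth_at(depth, x, y):
    """Depth at pixel (x, y); the large constant for background or off-image probes.

    Assumption: a stored depth of 0 encodes a background pixel.
    """
    height, width = len(depth), len(depth[0])
    if 0 <= y < height and 0 <= x < width and depth[y][x] > 0:
        return depth[y][x]
    return BACKGROUND_DEPTH


def feature(depth, x, y, v):
    """f(I, x) = d_I(x) - d_I(x + v / d_I(x)) for one probe offset v = (vx, vy).

    v is in pixel-meters; dividing by the depth at x gives an offset in pixels
    that shrinks as the body moves away from the camera.
    """
    d = depth_at(depth, x, y)
    dx = int(round(v[0] / d))
    dy = int(round(v[1] / d))
    return d - depth_at(depth, x + dx, y + dy)
```

Because the offset is divided by the depth at the reference pixel, the same `v` probes roughly the same point on the body whether the person stands near or far from the camera.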

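11 Toy example: distinguish left (L) and right (R) sides of the body. Image window centred at $\mathbf{x}$. The first node tests $f(I, \mathbf{x}; \Delta_1) > \theta_1$ (no / yes); a further node tests $f(I, \mathbf{x}; \Delta_2) > \theta_2$ (no / yes). Each leaf stores a distribution $P(c)$ over L and R.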
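The toy example on slide 11 amounts to walking a small binary tree of threshold tests until a leaf distribution is reached. The sketch below assumes the second test hangs off the "yes" branch of the first, and all thresholds and leaf probabilities are invented for illustration; `Node`, `classify`, and `toy_tree` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Node:
    # A leaf has `dist` set; a split node has `feature`, `theta`, `no`, and `yes` set.
    dist: Optional[Dict[str, float]] = None
    feature: Optional[Callable[[dict], float]] = None
    theta: float = 0.0
    no: Optional["Node"] = None
    yes: Optional["Node"] = None


def classify(node, pixel):
    """Descend from `node` to a leaf and return its distribution P(c)."""
    while node.dist is None:
        node = node.yes if node.feature(pixel) > node.theta else node.no
    return node.dist


# Hypothetical tree: the pixel is represented by precomputed feature responses.
toy_tree = Node(
    feature=lambda p: p["f1"], theta=0.0,
    no=Node(dist={"L": 0.9, "R": 0.1}),
    yes=Node(
        feature=lambda p: p["f2"], theta=0.5,
        no=Node(dist={"L": 0.5, "R": 0.5}),
        yes=Node(dist={"L": 0.1, "R": 0.9}),
    ),
)
```

Each pixel costs only one feature evaluation per tree level, which is what keeps per-pixel classification cheap.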

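12 Node $n$ holds the set $Q_n = \{(I, \mathbf{x})\}$ for all pixels reaching it, with body part distribution $P_n(c)$. The test $f(I, \mathbf{x}; \Delta_n) > \theta_n$ splits $Q_n$ into $Q_l$ (no) and $Q_r$ (yes), with distributions $P_l(c)$ and $P_r(c)$, to reduce entropy [Breiman et al. 84]. Take $(\Delta, \theta)$ that maximises information gain: $\Delta E = -\frac{|Q_l|}{|Q_n|} E(Q_l) - \frac{|Q_r|}{|Q_n|} E(Q_r)$. Goal: drive entropy at leaf nodes to zero.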
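The split selection on slide 12 can be illustrated with a small sketch. It assumes Shannon entropy in bits and a fixed list of candidate thresholds for a single feature; it also adds the parent entropy E(Q_n) to the gain, which is constant for a given node and so does not change which candidate wins. The function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (bits) of a list of body part labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())


def information_gain(samples, theta):
    """Gain of splitting `samples` = [(feature_response, label), ...] at theta."""
    left = [c for f, c in samples if f <= theta]   # "no" branch, Q_l
    right = [c for f, c in samples if f > theta]   # "yes" branch, Q_r
    n = len(samples)
    parent = entropy([c for _, c in samples])
    return (parent
            - len(left) / n * entropy(left)
            - len(right) / n * entropy(right))


def best_threshold(samples, candidates):
    """Candidate threshold with the highest information gain."""
    return max(candidates, key=lambda t: information_gain(samples, t))
```

In a full trainer the same search would also range over candidate offsets, with each (offset, threshold) pair scored the same way.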

13 [Figure panels: input depth; ground truth parts; inferred parts (soft); depth]

14 [Plots: average per-class accuracy vs. depth of trees, for synthetic test data and real test data]

15 [Amit & Geman 97] [Breiman 01] [Geurts et al. 06] Trees 1 to $T$ each map $(I, \mathbf{x})$ to a posterior, $P_1(c)$ to $P_T(c)$. Each tree is trained on a different random subset of images; bagging helps avoid over-fitting. Average tree posteriors: $P(c \mid I, \mathbf{x}) = \frac{1}{T} \sum_{t=1}^{T} P_t(c \mid I, \mathbf{x})$
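Averaging the tree posteriors from slide 15 is a one-line computation once each tree has returned its leaf distribution for a pixel. A minimal sketch, assuming each posterior is a dict from body part label to probability; `average_posteriors` is a hypothetical name.

```python
def average_posteriors(tree_posteriors):
    """P(c | I, x) = (1/T) * sum over t of P_t(c | I, x).

    `tree_posteriors` is a list of T dicts; a label missing from a tree's
    leaf is treated as probability 0 for that tree.
    """
    num_trees = len(tree_posteriors)
    classes = set().union(*tree_posteriors)
    return {
        c: sum(p.get(c, 0.0) for p in tree_posteriors) / num_trees
        for c in classes
    }
```

For example, three trees voting 0.9, 0.6, and 0.3 for the left side give an averaged probability of about 0.6.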

16 [Plot: average per-class accuracy vs. number of trees. Figure panels: ground truth; inferred body parts (most likely) for 1 tree, 3 trees, 6 trees]

17 [Plot: average per-class accuracy vs. maximum probe offset (pixel meters). Figure panel: ground truth]

18 [Plot: average per-class accuracy vs. number of training images (log scale). Series: synthetic test set; real test set; silhouette (scale); silhouette (no scale)] NB: trees fixed to maximum depth 20.

19 3. hypothesize body joints. Define 3D world space density [equation not recovered; labelled terms: 3D coord, pixel weight, 3D coord of $i$th pixel, pixel index $i$, bandwidth, inferred probability, depth at $i$th pixel]. Mean shift for mode detection.
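The mode-finding step on slide 19 can be sketched as weighted mean shift. The slide's own equation did not survive extraction, so the form used here is an assumption: a Gaussian kernel with a single bandwidth, and per-pixel weights supplied by the caller (for instance, inferred probability times squared depth, as in the accompanying paper). `mean_shift_mode` and its parameters are hypothetical names.

```python
import math


def mean_shift_mode(points, weights, start, bandwidth, iters=50, tol=1e-6):
    """Climb from `start` to a nearby mode of a weighted Gaussian density.

    points:  list of 3D world coordinates, one per pixel
    weights: one non-negative weight per pixel (assumed precomputed)
    """
    x = list(start)
    for _ in range(iters):
        num = [0.0, 0.0, 0.0]
        den = 0.0
        for p, w in zip(points, weights):
            sq = sum((a - b) ** 2 for a, b in zip(x, p)) / bandwidth ** 2
            k = w * math.exp(-sq)  # weighted kernel response at the current estimate
            den += k
            for j in range(3):
                num[j] += k * p[j]
        if den == 0.0:
            break  # no pixel has support near x; keep the current estimate
        new_x = [n / den for n in num]
        shift = math.dist(new_x, x)
        x = new_x
        if shift < tol:
            break
    return x
```

Starting the climb from several pixels that score highly for a given body part and keeping the distinct modes found would yield multiple joint hypotheses per part; that usage is likewise an assumption rather than something the slide spells out.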

20 [Figure panels: input depth; inferred body parts; inferred joint positions (front view, side view, top view). No tracking or smoothing]

21 [Figure panels: input depth; inferred body parts; inferred joint positions (front view, side view, top view). No tracking or smoothing]

22 [Bar chart: average precision per joint. Joints: Center Head, Center Neck, Left Shoulder, Right Shoulder, Left Elbow, Right Elbow, Left Wrist, Right Wrist, Left Hand, Right Hand, Left Knee, Right Knee, Left Ankle, Right Ankle, Left Foot, Right Foot; plus Mean AP]

23 [Bar chart: average precision per joint. Joints: Center Head, Center Neck, Left Shoulder, Right Shoulder, Left Elbow, Right Elbow, Left Wrist, Right Wrist, Left Hand, Right Hand, Left Knee, Right Knee, Left Ankle, Right Ankle, Left Foot, Right Foot; plus Mean AP. Series: joint prediction from ground truth body parts; joint prediction from inferred body parts]

24 4. track skeleton. Use 3D joint hypotheses, kinematic constraints, and temporal coherence to give: full skeleton; higher accuracy; invisible joints; multi-player.

25 Frame-by-frame gives robustness. Body parts representation for efficiency. Fast, simple machine learning. Significant engineering to scale to a massive, varied training data set.


27 With thanks to: Andrew Fitzgibbon, Mat Cook, Andrew Blake, Toby Sharp, Ollie Williams, Sebastian Nowozin, Antonio Criminisi, Mihai Budiu, Ross Girshick, Duncan Robertson, John Winn, Shahram Izadi, Pushmeet Kohli. The whole Kinect team, especially: Alex Kipman, Mark Finocchio, Ryan Geiss, Richard Moore, Robert Craig, Momin Al-Ghosien, Matt Bronder, Craig Peeper.

Real-Time Human Pose Recognition in Parts from Single Depth Images
