Real-Time Human Pose Recognition in Parts from Single Depth Images
Jamie Shotton, Andrew Fitzgibbon, Mat Cook, Toby Sharp, Mark Finocchio, Richard Moore, Alex Kipman, Andrew Blake
CVPR 2011
Presenter: Ahsan Abdullah
PROBLEM
APPROACH
Partitioning the body into parts helps localize the joints.
[Figure: body partitioned into labelled parts, e.g. neck, left shoulder, right elbow, right hand]
PIPELINE
Design goals: efficiency, robustness.
1. Capture depth image & remove background
2. Infer body parts per pixel
3. Cluster pixels to hypothesize body joint positions
4. Fit model & track skeleton
BODY PART CLASSIFICATION
Compute P(c_i | w_i) for each pixel i = (x, y), where c_i is the body part and w_i an image window around pixel i.
Discriminative approach: image windows move with the pixel; learn the classifier P(c_i | w_i) from training data.
LEARNING DATA
Synthetic data: train & test. Real data: test only.
LEARNING DATA SYNTHESIS
1. Record MoCap: 500k frames distilled to 100k poses
2. Retarget to several body models
3. Render (depth, body parts) image pairs
FEATURE SET
Depth comparisons: very fast to compute. Feature response:
f(I, x) = d_I(x) − d_I(x + Δ)
where I is the input depth image, x an image coordinate, Δ an offset, and d_I the depth. The offset Δ = v / d_I(x) scales inversely with depth, making the response approximately depth-invariant. Background pixels are assigned d = a large constant.
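The depth-comparison feature above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name, the `(dx, dy)` offset convention, and the choice of background constant are all assumptions for the example.

```python
import numpy as np

LARGE_DEPTH = 1e6  # background / off-image pixels get a large constant depth


def depth_feature(depth, x, y, v):
    """Feature response f(I, x) = d_I(x) - d_I(x + Delta).

    The offset Delta = v / d_I(x) is divided by the depth at x, so a
    fixed world-space offset v maps to fewer pixels for far-away bodies,
    making the response approximately depth-invariant.
    """
    d = depth[y, x]
    dx = int(round(v[0] / d))  # offset scales inversely with depth
    dy = int(round(v[1] / d))
    yy, xx = y + dy, x + dx
    h, w = depth.shape
    if 0 <= yy < h and 0 <= xx < w:
        d2 = depth[yy, xx]
    else:
        d2 = LARGE_DEPTH  # probing off-image behaves like background
    return d - d2
```

A probe that lands on the body returns a small difference; one that lands on background returns a large negative value, which is exactly what makes these features cheap but discriminative.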
DECISION FORESTS Aggregation of decision trees
TRAINING DECISION TREES
At node n, the set Q_n = {(I, x)} of all training pixels, with body-part distribution P_n(c), is split by the test f(I, x; Δ_n) > θ_n into a left ("no") subset with distribution P_l(c) and a right ("yes") subset with P_r(c). Take the (Δ, θ) that maximises information gain, i.e. reduces entropy. [Breiman et al. 84]
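The split-selection step can be sketched as follows. This is a simplified illustration of entropy-based information gain, not the paper's training code; the function names and the exhaustive loop over candidates are assumptions for the example (the real system samples candidate (Δ, θ) pairs and trains on a distributed cluster).

```python
import numpy as np


def entropy(labels, n_classes):
    # Shannon entropy of the empirical body-part distribution P_n(c)
    counts = np.bincount(labels, minlength=n_classes).astype(float)
    p = counts / counts.sum()
    p = p[p > 0]
    return -(p * np.log2(p)).sum()


def information_gain(labels, responses, theta, n_classes):
    """Gain of splitting Q_n by the test f(I, x; Delta) > theta."""
    right = responses > theta          # "yes" branch
    left = ~right                      # "no" branch
    n = len(labels)
    if right.sum() == 0 or left.sum() == 0:
        return 0.0                     # degenerate split, no gain
    h_before = entropy(labels, n_classes)
    h_after = (left.sum() / n) * entropy(labels[left], n_classes) \
            + (right.sum() / n) * entropy(labels[right], n_classes)
    return h_before - h_after


def best_split(labels, candidate_responses, thetas, n_classes):
    """Pick the (Delta, theta) candidate maximising information gain.

    candidate_responses maps each candidate offset Delta to the feature
    responses f(I, x; Delta) over all pixels in Q_n.
    """
    best = (None, None, -1.0)
    for delta, resp in candidate_responses.items():
        for theta in thetas:
            g = information_gain(labels, resp, theta, n_classes)
            if g > best[2]:
                best = (delta, theta, g)
    return best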
DECISION TREE CLASSIFICATION
Toy example: distinguish left (L) and right (R) sides of the body. An image window centred at x is passed down the tree: first test f(I, x; Δ_1) > θ_1, then (on the "yes" branch) f(I, x; Δ_2) > θ_2. Each leaf stores a distribution P(c) over L and R.
DECISION FOREST CLASSIFIER
A pixel (I, x) is passed down each of T trees, yielding per-tree posteriors P_1(c), ..., P_T(c). Each tree is trained on a different random subset of the images; this bagging helps avoid over-fitting. [Amit & Geman 97] [Breiman 01] [Geurts et al. 06]
Average the tree posteriors:
P(c | I, x) = (1/T) Σ_{t=1}^{T} P_t(c | I, x)
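The averaging step is a one-liner; a minimal sketch (function name is illustrative, and the per-tree posteriors are assumed to be already-normalised arrays over classes):

```python
import numpy as np


def forest_posterior(tree_posteriors):
    """P(c | I, x) = (1/T) * sum_t P_t(c | I, x).

    tree_posteriors: list of T arrays, each a class distribution
    produced by the leaf one tree routed the pixel (I, x) to.
    """
    return np.mean(np.stack(tree_posteriors), axis=0)
```

The predicted body part for the pixel is then simply the argmax of the averaged distribution; because each tree evaluates independently, the forest parallelises trivially across pixels and trees.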
NUMBER OF TREES
[Figure: average per-class accuracy vs. number of trees (1-6); accuracy rises from roughly 40% with 1 tree to roughly 55% with 6 trees. Inferred body parts (most likely) for 1, 3, and 6 trees shown against ground truth]
TREE DEPTH
[Figure: average per-class accuracy vs. depth of trees, on synthetic and real test data; accuracy rises from roughly 30% at shallow depths to roughly 60-65% at depth 20]
BODY PARTS TO JOINT HYPOTHESES
Define a density over 3D world space:
f_c(x̂) ∝ Σ_i w_ic · exp(−‖(x̂ − x̂_i) / b_c‖²)
where i is the pixel index, x̂_i the 3D coordinate of the i-th pixel, b_c a per-part bandwidth, and the pixel weight w_ic combines the inferred probability with the depth at the i-th pixel: w_ic = P(c | I, x_i) · d_I(x_i)².
Mean shift is used for mode detection.
3. hypothesize body joints
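The mode detection can be sketched with a single weighted mean-shift trajectory under a Gaussian kernel. This is a minimal illustration, not the paper's implementation: the function name, the fixed iteration cap, and the convergence tolerance are assumptions for the example, and in practice the weights would be w_ic = P(c | I, x_i) · d_I(x_i)².

```python
import numpy as np


def mean_shift_mode(points, weights, bandwidth, start, iters=50):
    """Follow one mean-shift trajectory to a local mode of the
    weighted Gaussian-kernel density over the 3D pixel cloud."""
    x = np.asarray(start, dtype=float)
    for _ in range(iters):
        # squared distances scaled by the per-part bandwidth b_c
        d2 = ((points - x) ** 2).sum(axis=1) / bandwidth ** 2
        k = weights * np.exp(-d2)              # kernel * pixel weight w_ic
        x_new = (k[:, None] * points).sum(axis=0) / k.sum()
        if np.allclose(x_new, x, atol=1e-6):   # converged to a mode
            break
        x = x_new
    return x
```

Starting several trajectories from high-weight pixels and keeping the distinct modes they converge to yields the body joint hypotheses.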
[Figure: input depth → inferred body parts → inferred joint positions, shown in front, side, and top views. No tracking or smoothing]
JOINT PREDICTION ACCURACY
[Figure: average precision (0.0-1.0) per joint — head, neck, left/right shoulder, elbow, wrist, hand, knee, ankle, foot — and mean AP, comparing joint prediction from ground-truth body parts against joint prediction from inferred body parts]
ANALYSIS
No temporal information: operates frame by frame.
Very fast: simple depth image feature; parallel decision forest classifier.
KINECT SYSTEM
Uses the 3D joint hypotheses, kinematic constraints, and temporal coherence to give the full skeleton: higher accuracy, invisible joints, multi-player support.
4. track skeleton
SUMMARY
Frame-by-frame operation gives robustness.
Body parts representation for efficiency.
Fast, simple machine learning.
Significant engineering to scale to a massive, varied training data set.
QUESTIONS