Real-Time Human Pose Recognition in Parts from Single Depth Images
Jamie Shotton, Andrew Fitzgibbon, Mat Cook, Toby Sharp, Mark Finocchio, Richard Moore, Alex Kipman, Andrew Blake
CVPR 2011
Presenter: Ahsan Abdullah
PROBLEM
APPROACH
Partitioning the body into parts helps localize the joints.
[Figure: body partitioned into labelled parts, e.g. neck, left shoulder, right elbow, right hand]
PIPELINE
Design goals: efficiency, robustness.
1. Capture depth image & remove background
2. Infer body parts per pixel
3. Cluster pixels to hypothesize body joint positions
4. Fit model & track skeleton
BODY PART CLASSIFICATION
Compute P(c_i | w_i) for each pixel i = (x, y), where c_i is the body part and w_i an image window around pixel i.
Discriminative approach: image windows move with the pixel; learn the classifier P(c_i | w_i) from training data.
LEARNING DATA
Synthetic data: train & test. Real data: test only.
LEARNING DATA SYNTHESIS
1. Record MoCap: 500k frames distilled to 100k poses
2. Retarget to several body models
3. Render (depth, body parts) image pairs
FEATURE SET
Depth comparisons: very fast to compute. Feature response:
f(I, x) = d_I(x) − d_I(x + Δ)
where I is the input depth image, x an image coordinate, Δ an offset, and d_I the depth. The offset Δ = v / d_I(x) scales inversely with depth, making the response approximately depth-invariant. Background pixels are assigned d = a large constant.
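The depth-comparison feature above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name, the `(dx, dy)` offset convention, and the choice of background constant are all assumptions for the example.

```python
import numpy as np

LARGE_DEPTH = 1e6  # background / off-image pixels get a large constant depth


def depth_feature(depth, x, y, v):
    """Feature response f(I, x) = d_I(x) - d_I(x + Delta).

    The offset Delta = v / d_I(x) is divided by the depth at x, so a
    fixed world-space offset v maps to fewer pixels for far-away bodies,
    making the response approximately depth-invariant.
    """
    d = depth[y, x]
    dx = int(round(v[0] / d))  # offset scales inversely with depth
    dy = int(round(v[1] / d))
    yy, xx = y + dy, x + dx
    h, w = depth.shape
    if 0 <= yy < h and 0 <= xx < w:
        d2 = depth[yy, xx]
    else:
        d2 = LARGE_DEPTH  # probing off-image behaves like background
    return d - d2
```

A probe that lands on the body returns a small difference; one that lands on background returns a large negative value, which is exactly what makes these features cheap but discriminative.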
DECISION FORESTS Aggregation of decision trees
TRAINING DECISION TREES
At node n, the set Q_n = {(I, x)} of all training pixels, with body-part distribution P_n(c), is split by the test f(I, x; Δ_n) > θ_n into a left ("no") subset with distribution P_l(c) and a right ("yes") subset with P_r(c). Take the (Δ, θ) that maximises information gain, i.e. reduces entropy. [Breiman et al. 84]
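The split-selection step can be sketched as follows. This is a simplified illustration of entropy-based information gain, not the paper's training code; the function names and the exhaustive loop over candidates are assumptions for the example (the real system samples candidate (Δ, θ) pairs and trains on a distributed cluster).

```python
import numpy as np


def entropy(labels, n_classes):
    # Shannon entropy of the empirical body-part distribution P_n(c)
    counts = np.bincount(labels, minlength=n_classes).astype(float)
    p = counts / counts.sum()
    p = p[p > 0]
    return -(p * np.log2(p)).sum()


def information_gain(labels, responses, theta, n_classes):
    """Gain of splitting Q_n by the test f(I, x; Delta) > theta."""
    right = responses > theta          # "yes" branch
    left = ~right                      # "no" branch
    n = len(labels)
    if right.sum() == 0 or left.sum() == 0:
        return 0.0                     # degenerate split, no gain
    h_before = entropy(labels, n_classes)
    h_after = (left.sum() / n) * entropy(labels[left], n_classes) \
            + (right.sum() / n) * entropy(labels[right], n_classes)
    return h_before - h_after


def best_split(labels, candidate_responses, thetas, n_classes):
    """Pick the (Delta, theta) candidate maximising information gain.

    candidate_responses maps each candidate offset Delta to the feature
    responses f(I, x; Delta) over all pixels in Q_n.
    """
    best = (None, None, -1.0)
    for delta, resp in candidate_responses.items():
        for theta in thetas:
            g = information_gain(labels, resp, theta, n_classes)
            if g > best[2]:
                best = (delta, theta, g)
    return best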
DECISION TREE CLASSIFICATION
Toy example: distinguish left (L) and right (R) sides of the body. An image window centred at x is passed down the tree: first test f(I, x; Δ_1) > θ_1, then (on the "yes" branch) f(I, x; Δ_2) > θ_2. Each leaf stores a distribution P(c) over L and R.
DECISION FOREST CLASSIFIER
A pixel (I, x) is passed down each of T trees, yielding per-tree posteriors P_1(c), ..., P_T(c). Each tree is trained on a different random subset of the images; this bagging helps avoid over-fitting. [Amit & Geman 97] [Breiman 01] [Geurts et al. 06]
Average the tree posteriors:
P(c | I, x) = (1/T) Σ_{t=1}^{T} P_t(c | I, x)
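The averaging step is a one-liner; a minimal sketch (function name is illustrative, and the per-tree posteriors are assumed to be already-normalised arrays over classes):

```python
import numpy as np


def forest_posterior(tree_posteriors):
    """P(c | I, x) = (1/T) * sum_t P_t(c | I, x).

    tree_posteriors: list of T arrays, each a class distribution
    produced by the leaf one tree routed the pixel (I, x) to.
    """
    return np.mean(np.stack(tree_posteriors), axis=0)
```

The predicted body part for the pixel is then simply the argmax of the averaged distribution; because each tree evaluates independently, the forest parallelises trivially across pixels and trees.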
NUMBER OF TREES
[Figure: average per-class accuracy vs. number of trees (1-6); accuracy rises from roughly 40% with 1 tree to roughly 55% with 6 trees. Inferred body parts (most likely) for 1, 3, and 6 trees shown against ground truth]
TREE DEPTH
[Figure: average per-class accuracy vs. depth of trees, on synthetic and real test data; accuracy rises from roughly 30% at shallow depths to roughly 60-65% at depth 20]
BODY PARTS TO JOINT HYPOTHESES
Define a density over 3D world space:
f_c(x̂) ∝ Σ_i w_ic · exp(−‖(x̂ − x̂_i) / b_c‖²)
where i is the pixel index, x̂_i the 3D coordinate of the i-th pixel, b_c a per-part bandwidth, and the pixel weight w_ic combines the inferred probability with the depth at the i-th pixel: w_ic = P(c | I, x_i) · d_I(x_i)².
Mean shift is used for mode detection.
3. hypothesize body joints
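The mode detection can be sketched with a single weighted mean-shift trajectory under a Gaussian kernel. This is a minimal illustration, not the paper's implementation: the function name, the fixed iteration cap, and the convergence tolerance are assumptions for the example, and in practice the weights would be w_ic = P(c | I, x_i) · d_I(x_i)².

```python
import numpy as np


def mean_shift_mode(points, weights, bandwidth, start, iters=50):
    """Follow one mean-shift trajectory to a local mode of the
    weighted Gaussian-kernel density over the 3D pixel cloud."""
    x = np.asarray(start, dtype=float)
    for _ in range(iters):
        # squared distances scaled by the per-part bandwidth b_c
        d2 = ((points - x) ** 2).sum(axis=1) / bandwidth ** 2
        k = weights * np.exp(-d2)              # kernel * pixel weight w_ic
        x_new = (k[:, None] * points).sum(axis=0) / k.sum()
        if np.allclose(x_new, x, atol=1e-6):   # converged to a mode
            break
        x = x_new
    return x
```

Starting several trajectories from high-weight pixels and keeping the distinct modes they converge to yields the body joint hypotheses.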
[Figure: input depth → inferred body parts → inferred joint positions, shown in front, side, and top views. No tracking or smoothing]
JOINT PREDICTION ACCURACY
[Figure: average precision (0.0-1.0) per joint — head, neck, left/right shoulder, elbow, wrist, hand, knee, ankle, foot — and mean AP, comparing joint prediction from ground-truth body parts against joint prediction from inferred body parts]
ANALYSIS
No temporal information: operates frame by frame.
Very fast: simple depth image feature; parallel decision forest classifier.
KINECT SYSTEM
Uses the 3D joint hypotheses, kinematic constraints, and temporal coherence to give the full skeleton: higher accuracy, invisible joints, multi-player support.
4. track skeleton
SUMMARY
Frame-by-frame operation gives robustness.
Body parts representation for efficiency.
Fast, simple machine learning.
Significant engineering to scale to a massive, varied training data set.
QUESTIONS