Advanced Interaction Techniques
The Kinect Sensor
Luís Carriço, FCUL, 2014/15
Sources:
- MS Kinect for Xbox 360
- John C. Tang. Using Kinect to explore NUI, MS Research; from Stanford CS247
- Shotton et al. Real-Time Human Pose Recognition in Parts from Single Depth Images, CVPR 2011
- Larry Zitnick. Kinect Case Study, CSE 576
- John MacCormick. How does the Kinect work?
The Kinect Sensor
The RGB Camera: 640 x 480-pixel resolution, running at 30 FPS (frames per second)
The Depth Sensor — technology: structured IR light. [Figure: depth image annotated with missing pixels (non-IR-reflective surfaces), shadow regions, and near/far range]
The Depth Sensor: an infrared projector plus an infrared camera; 640 x 480-pixel resolution, running at 30 FPS; depth reported in mm from the camera. Based on structured light (Zhang et al., 3DPVT, 2002), combined with other algorithms, e.g. depth from focus.
How does it work? A structured-light 3D scanner.
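Once the projected IR dot pattern is located in the IR camera's image, depth follows from triangulation: a dot's horizontal shift (disparity) between the projector and camera views is inversely proportional to distance. The sketch below shows the standard relation Z = f·b / disparity; the focal length and baseline values are illustrative defaults, not the Kinect's actual calibration constants.

```python
# Sketch: converting a measured dot disparity to depth by triangulation.
# f_px (focal length in pixels) and b_m (projector-camera baseline in metres)
# are assumed example values, not real Kinect calibration data.

def depth_from_disparity(disparity_px, f_px=580.0, b_m=0.075):
    """Classic structured-light/stereo relation: Z = f * b / disparity."""
    if disparity_px <= 0:
        return float("inf")  # no measurable shift -> treated as out of range
    return f_px * b_m / disparity_px

# A dot shifted by 29 pixels lies at 580 * 0.075 / 29 = 1.5 m.
z = depth_from_disparity(29.0)
print(round(z, 2))  # 1.5
```

This also explains the "missing pixels" in the slide above: where no dot is detected (shadows, non-reflective surfaces), there is no disparity to measure and no depth value.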
[Figure: the projected IR dot pattern on a scene, with and without a book] Source: http://www.futurepicture.org/?p=97
The Depth Map [Figure: the resulting point cloud, shown in top and side views]
RGB vs. depth for pose estimation.
RGB: only works well lit; background clutter; scale unknown; varies with clothing and skin colour.
Depth: works in low light; the person pops out from the background; scale known; uniform texture; but shadows and missing pixels.
Skeletal Data Provided: skeleton-space coordinates are expressed in meters.
Skeleton Recognition. Two main steps: find body parts, then compute joint positions. Pipeline: capture depth image & remove background → infer body parts per pixel → cluster pixels to hypothesize body-joint positions → fit model & track skeleton.
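The pipeline above can be sketched as a chain of stages. Everything inside the functions below is a stand-in (an assumed depth cut-off, random labels, per-part centroids) just to show how the stages connect — it is not the shipped Kinect algorithm.

```python
import numpy as np

NUM_PARTS = 32  # body-segment count used later in these slides

def remove_background(depth):
    """Stage 1 (stand-in): zero out pixels beyond a crude distance cut-off."""
    fg = depth.copy()
    fg[depth > 2.5] = 0.0  # assumed 2.5 m cut-off; the real system segments the player
    return fg

def infer_body_parts(depth_fg):
    """Stage 2 (stand-in): per-pixel part labels; here random, really a classifier."""
    rng = np.random.default_rng(0)
    labels = rng.integers(0, NUM_PARTS, size=depth_fg.shape)
    labels[depth_fg == 0.0] = -1  # background pixels get no part label
    return labels

def hypothesize_joints(labels, depth_fg):
    """Stage 3 (stand-in): one 3D joint hypothesis per part = centroid of its pixels."""
    joints = {}
    for part in range(NUM_PARTS):
        ys, xs = np.nonzero(labels == part)
        if len(xs):
            joints[part] = (xs.mean(), ys.mean(), depth_fg[ys, xs].mean())
    return joints

# Synthetic frame: background at 3 m, with a "person" block at 1.2 m.
depth = np.full((48, 64), 3.0)
depth[10:40, 20:44] = 1.2
fg = remove_background(depth)
joints = hypothesize_joints(infer_body_parts(fg), fg)
print(len(joints), "joint hypotheses")  # stage 4 would fit a skeleton model to these
```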
Body part recognition: no temporal information (frame-by-frame); local pose estimates of parts (each pixel & each body joint treated independently, reducing training data and computation time); very fast (simple depth-image features, parallel decision-forest classifier).
Features
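The features from Shotton et al. are depth comparisons: for a pixel x, subtract the depth at two probe offsets u and v, with the offsets scaled by 1/depth(x) so the feature is invariant to how far the person stands from the camera. A minimal sketch, where the "large constant" for out-of-image or background probes is an assumed value:

```python
import numpy as np

LARGE = 10.0  # metres; stand-in for the paper's "large positive constant"

def probe(depth, y, x):
    """Depth at (y, x), or LARGE if the probe falls off the image/background."""
    h, w = depth.shape
    if 0 <= y < h and 0 <= x < w and depth[y, x] > 0:
        return depth[y, x]
    return LARGE

def feature(depth, x, u, v):
    """f(I, x) = d(x + u/d(x)) - d(x + v/d(x)), with offsets u, v scaled by 1/d(x)."""
    y0, x0 = x
    d0 = depth[y0, x0]
    uy, ux = int(round(u[0] / d0)), int(round(u[1] / d0))
    vy, vx = int(round(v[0] / d0)), int(round(v[1] / d0))
    return probe(depth, y0 + uy, x0 + ux) - probe(depth, y0 + vy, x0 + vx)

# Synthetic frame with a depth edge down the middle: 2 m on the left, 4 m right.
depth = np.full((10, 10), 2.0)
depth[:, 5:] = 4.0
f = feature(depth, (5, 2), (0.0, 8.0), (0.0, -2.0))
print(f)  # probes land either side of the edge: 4.0 - 2.0 = 2.0
```

Each feature is trivially cheap (two image reads and a subtraction), which is what makes the classifier fast enough for real time.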
Classification Learning:
1. Randomly choose a set of thresholds and features for splits.
2. Pick the threshold and feature that provide the largest information gain.
3. Recurse until a target accuracy or the maximum tree depth is reached.
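Step 2 can be illustrated with the standard entropy-based information-gain score for a candidate (feature, threshold) split over body-part labels. The labels and feature values below are synthetic:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a label list."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, feature_values, threshold):
    """Parent entropy minus the size-weighted entropy of the two child splits."""
    left = [l for l, f in zip(labels, feature_values) if f < threshold]
    right = [l for l, f in zip(labels, feature_values) if f >= threshold]
    if not left or not right:
        return 0.0
    n = len(labels)
    return entropy(labels) - (len(left) / n) * entropy(left) \
                           - (len(right) / n) * entropy(right)

# Feature values that separate "hand" from "head" perfectly at threshold 0.5:
labels = ["hand"] * 4 + ["head"] * 4
values = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
best = max(information_gain(labels, values, t) for t in values)
print(best)  # a perfect binary split yields gain = entropy(labels) = 1.0
```

In training, this score is evaluated for every candidate feature/threshold pair at a node, and the winner becomes that node's split.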
Implementation details: 3 trees (depth 20); 300k unique training images per tree; 2000 candidate features and 50 thresholds; one day of training on a 1000-core cluster.
Tracking Body Parts: the trained classifier assigns each pixel a probability of belonging to each body part, then picks out areas of maximum probability for each body-part type.
And the Skeleton: the mean-shift algorithm is used to robustly compute modes of the per-part probability distributions. Mean shift is simple, fast, and effective.
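A minimal mean-shift sketch: repeatedly move a point to the Gaussian-weighted mean of its neighbours until it settles on a density mode. The bandwidth and the synthetic "joint votes" below are illustrative, not the tracker's actual settings.

```python
import numpy as np

def mean_shift(points, start, bandwidth=1.0, iters=50):
    """Climb to a mode of the point density, starting from `start`."""
    x = np.asarray(start, dtype=float)
    for _ in range(iters):
        # Gaussian kernel weights for every point, centred on the current estimate.
        w = np.exp(-np.sum((points - x) ** 2, axis=1) / (2 * bandwidth ** 2))
        x_new = (w[:, None] * points).sum(axis=0) / w.sum()
        if np.allclose(x_new, x):
            break
        x = x_new
    return x

# Pixels "voting" for a joint: a tight cluster around (2, 3) plus two outliers.
rng = np.random.default_rng(1)
votes = np.vstack([rng.normal([2.0, 3.0], 0.1, size=(50, 2)),
                   [[8.0, 8.0], [9.0, 1.0]]])
mode = mean_shift(votes, start=[2.5, 3.5])
print(np.round(mode, 1))  # converges near the dominant cluster at (2, 3)
```

The robustness the slide mentions is visible here: the two outliers barely move the mode, because their kernel weights are negligible — which is exactly what is needed when some pixels are misclassified.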
Vision Algorithm (Summary)
- Object recognition approach.
- An intermediate body-parts representation maps the difficult pose-estimation problem into a simpler per-pixel classification problem.
- A large and highly varied training dataset allows the classifier to estimate body parts invariant to pose, body shape, clothing, etc.
- Confidence-scored 3D proposals of several body joints are generated by reprojecting the classification result and finding local modes.
- The system runs at 200 frames per second on consumer hardware.
- Evaluation shows high accuracy on both synthetic and real test sets.
- State-of-the-art accuracy in comparison with related work, and improved generalization over exact whole-skeleton nearest-neighbour matching.
In Practice
- Collect training data: thousands of visits to global households filming real users, plus the Hollywood motion-capture studio; generated billions of images.
- Apply state-of-the-art object recognition and real-time semantic segmentation research.
- Build a training set: classify each pixel's probability of being in any of 32 body segments, determine the probabilistic cluster of body configurations consistent with those, and present the most probable.
- Millions of training images and millions of classifier parameters, hard to parallelize — hence a new algorithm for distributed decision-tree training.
- Fun fact: a major use of DryadLINQ (large-scale distributed cluster computing).
To learn more (warning: there is a lot of wrong information on the web):
Great site by Daniel Reetz: http://www.futurepicture.org/?p=97
Kinect patents:
http://www.faqs.org/patents/app/20100118123
http://www.faqs.org/patents/app/20100020078
http://www.faqs.org/patents/app/20100007717