Human Body Recognition and Tracking: How the Kinect Works


Kinect RGB-D Camera
Microsoft Kinect (Nov. 2010): a color video camera plus a laser-projected IR dot pattern read by an IR camera (640 x 480, 30 fps). Price: $120 (April 2012); Kinect 1.5 due 5/2012. (Many slides by D. Hoiem.)

What the Kinect Does
Produces a depth image and estimates body parts and joint positions.

How Kinect Works: Overview
IR projector + IR sensor -> projected light pattern -> stereo algorithm -> depth image -> segmentation and part prediction -> body parts and joint positions -> application (e.g., a game).

Part 1: Stereo from Projected Dots
1. Overview of depth from stereo
2. How it works for a projector/sensor pair
3. The stereo algorithm used by PrimeSense (Kinect)

Depth from Stereo Images
Goal: recover depth by finding, for each image coordinate x in image 1, the corresponding coordinate x' in image 2, yielding a dense depth map. A scene point X at depth z projects through two cameras with focal length f and optical centers C and C' separated by baseline B. (Some of the following slides adapted from Steve Seitz and Lana Lazebnik.)

Depth from Disparity
With optical centers O and O' separated by baseline B and focal length f, a scene point X projects to x and x'. The disparity is d = x - x' = f*B / z, so depth z = f*B / d: disparity is inversely proportional to depth.

Basic Stereo Matching Algorithm
For each pixel in the first image: find the corresponding epipolar line in the right image, examine all pixels on the epipolar line, pick the best match, and triangulate the matches to get depth. If necessary, first rectify the two stereo images so that epipolar lines become scanlines; then for each pixel x, search along the corresponding scanline for the best match x', compute the disparity x - x', and set depth(x) = f*B / (x - x').

Correspondence Search
Slide a window along the right scanline and compare its contents with the reference window in the left image. Matching cost: SSD or normalized correlation.
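The disparity-to-depth relation and the window-based scanline search can be sketched directly. A minimal illustration in Python (the window size, disparity range, and use of SSD are illustrative choices, not the Kinect's actual parameters):

```python
import numpy as np

def depth_from_disparity(d, f, B):
    """z = f * B / d: depth is inversely proportional to disparity."""
    d = np.asarray(d, dtype=float)
    with np.errstate(divide="ignore"):
        return np.where(d > 0, f * B / d, np.inf)

def match_scanline(left, right, x, y, win=2, max_disp=16):
    """Window-based SSD search along a rectified scanline.

    Returns the disparity whose window in the right image best matches
    the reference window around (x, y) in the left image.
    """
    ref = left[y - win:y + win + 1, x - win:x + win + 1].astype(float)
    best_d, best_cost = 0, np.inf
    for d in range(0, max_disp + 1):
        xr = x - d                    # candidate position on the right scanline
        if xr - win < 0:
            break
        cand = right[y - win:y + win + 1, xr - win:xr + win + 1].astype(float)
        cost = np.sum((ref - cand) ** 2)   # SSD matching cost
        if cost < best_cost:
            best_cost, best_d = cost, d
    return best_d
```

For a right image that is the left image shifted 5 pixels, the search recovers a disparity of 5, and plugging that into z = f*B/d gives the depth.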

Results with Window Search
Window-based matching is noisy compared with ground truth; adding constraints and solving with graph cuts comes much closer (Y. Boykov, O. Veksler, and R. Zabih, "Fast Approximate Energy Minimization via Graph Cuts," PAMI 2001). For the latest and greatest results: http://www.middlebury.edu/stereo/

Failures of Correspondence Search
Textureless surfaces; occlusions; repeated structures; non-Lambertian surfaces and specularities.

Structured Light
Basic principle: use a projector to create known features (e.g., points or lines). If we project distinctive points, matching becomes easy.

Example: Book vs. No Book (source: http://www.futurepicture.org/?p=97)

Kinect's Projected Dot Pattern (image slides)

Same Stereo Algorithms Apply
The projector/sensor pair acts like a stereo pair, so the same algorithms apply.

Region-Growing Random Dot Matching
1. Detect dots ("speckles") and label them unknown.
2. Randomly select a region anchor, a dot with unknown depth:
   a. Windowed search via normalized cross-correlation along the scanline. Check that the best match score is greater than a threshold; if not, mark the dot as invalid and go back to step 2.
   b. Region growing: neighboring pixels are added to a queue; each pixel in the queue is initialized with the anchor's shift and then searched over a small local neighborhood; if it matches, its neighbors are added to the queue; stop when the queue is empty.
3. Stop when all dots have known depth or are marked invalid.
(http://www.wipo.int/patentscope/search/en/wo2007043036)

Kinect RGB-D Camera Implementation
An in-camera ASIC computes an 11-bit 640 x 480 depth map at 30 Hz. Range limit for tracking: 0.7 to 6 m (2.3 to 20 ft). Practical range limit: 1.2 to 3.5 m.
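The region-growing matcher above can be sketched as follows. This is a simplified, hypothetical reading of the patented procedure: the NCC threshold, window radius, disparity range, and the Manhattan-distance neighbor rule are all made-up values, and the anchor choice here is arbitrary rather than truly random.

```python
import numpy as np
from collections import deque

def ncc(a, b):
    """Normalized cross-correlation of two equal-size windows."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom > 0 else -1.0

def patch(img, y, x, r):
    """(2r+1) x (2r+1) window centered at (y, x)."""
    return img[y - r:y + r + 1, x - r:x + r + 1].astype(float)

def region_grow_match(left, right, dots, r=2, max_disp=12, thresh=0.8):
    """Region-growing random dot matching (simplified sketch).

    dots : set of (y, x) detected speckle centers, all initially 'unknown'
    Returns {(y, x): disparity} for the dots that matched above threshold.
    """
    h, w = left.shape

    def search(y, x, lo, hi):
        """Best NCC match for (y, x) over disparities [lo, hi]."""
        if not (r <= y < h - r and r <= x < w - r):
            return None, -1.0
        ref = patch(left, y, x, r)
        best_d, best_s = None, -1.0
        for d in range(max(0, lo), hi + 1):
            if x - d - r < 0:
                break
            s = ncc(ref, patch(right, y, x - d, r))
            if s > best_s:
                best_d, best_s = d, s
        return best_d, best_s

    def near(p, q):
        """Made-up neighborhood rule: dots within Manhattan distance 4."""
        return abs(p[0] - q[0]) + abs(p[1] - q[1]) <= 4

    disp = {}
    unknown = set(dots)
    while unknown:
        anchor = unknown.pop()                 # 2. pick an anchor (arbitrary here)
        d, s = search(*anchor, 0, max_disp)    # 2a. full scanline NCC search
        if d is None or s < thresh:
            continue                           # below threshold: anchor invalid
        disp[anchor] = d
        q = deque(p for p in unknown if near(p, anchor))
        while q:                               # 2b. grow the region
            p = q.popleft()
            if p not in unknown:
                continue
            pd, ps = search(*p, d - 2, d + 2)  # init with the anchor's shift
            if pd is not None and ps >= thresh:
                unknown.discard(p)
                disp[p] = pd
                q.extend(n for n in unknown if near(n, p))
    return disp                                # 3. all dots known or invalid
```

The payoff of region growing is that only the anchor pays for a full scanline search; every grown pixel searches just a small neighborhood around the anchor's disparity.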

Part 2: Pose from Depth
Goal: estimate pose from the depth image, i.e., segment it into body parts and predict joint positions. Based on: "Real-Time Human Pose Recognition in Parts from a Single Depth Image," J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake, Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2011.

Step 1: find body parts. Step 2: compute joint positions.

Challenges: lots of variation in bodies, orientations, and poses, and the method needs to be very fast (their algorithm runs at 200 fps on the Xbox 360 GPU).

Pose Examples
RGB, depth, part label map, and joint positions. (Video: http://research.microsoft.com/apps/video/default.aspx?id=144455)

Finding Body Parts
First extract the body pixels by thresholding depth. Feature: differences in depth. Classifier: random decision forests.

Features
Each feature is the difference of depth at two pixels, with the offset Δ = (u, v) to the second pixel scaled by the depth at the reference pixel so the feature is invariant to the body's distance from the camera: f(I, x) = d_I(x + Δ / d_I(x)) - d_I(x), where d_I(x) is the depth image.

Part Classification with Random Forests
A randomized decision forest is a collection of independently trained binary decision trees. Each tree is a classifier that predicts the likelihood of a pixel x belonging to body part class c. A non-leaf node corresponds to a thresholded feature; a leaf node corresponds to a conjunction of several features and stores a learned distribution P(c | I, x).
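The depth-comparison feature can be written down directly. A small sketch, assuming the single-offset form suggested by the slide (the Shotton et al. paper also uses a two-offset variant); the large constant returned for probes that leave the image is an assumed value:

```python
import numpy as np

BIG = 1e6  # assumed large constant for probes that fall outside the image

def depth_feature(d, x, delta):
    """f(I, x) = d(x + delta/d(x)) - d(x), with delta = (u, v).

    d     : 2-D depth image
    x     : (row, col) reference pixel
    delta : offset to the second pixel, scaled by 1/d(x) for depth invariance
    """
    p = np.round(np.asarray(x) + np.asarray(delta) / d[x]).astype(int)
    if (p < 0).any() or p[0] >= d.shape[0] or p[1] >= d.shape[1]:
        return BIG - d[x]
    return d[tuple(p)] - d[x]
```

Because delta is divided by d(x), the same feature probes a wide pixel offset when the body is close and a narrow one when it is far away, so it responds to the same physical distance on the body at any range.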

Classification
Testing phase: classify each pixel x in image I using all T decision trees and average the distributions stored at the leaves: P(c | I, x) = (1/T) Σ_t P_t(c | I, x).

Learning phase:
1. For each tree, pick a randomly sampled subset of the training data.
2. Randomly choose a set of candidate features and thresholds at each node.
3. Pick the feature and threshold that give the largest information gain.
4. Recurse until a target accuracy or maximum tree depth is reached.

Implementation
31 body parts; 3 trees of depth 20; 300,000 training images per tree, randomly selected from 1M training images; 2,000 training example pixels per image; 2,000 candidate features and 50 candidate thresholds per feature. The decision forest was constructed in 1 day on a 1,000-core cluster.
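Test-phase classification is just an average of the per-tree leaf distributions. A toy sketch with hand-built depth-1 "trees" (hypothetical thresholds and class histograms over 3 classes, standing in for the real learned depth-20 trees over 31 parts):

```python
import numpy as np

def tree_predict(tree, feature_value):
    """One stump tree: a thresholded feature at the root, a class histogram per leaf."""
    threshold, left_dist, right_dist = tree
    return left_dist if feature_value < threshold else right_dist

def forest_predict(trees, feature_value):
    """P(c | I, x) = (1/T) * sum_t P_t(c | I, x): average the leaf distributions."""
    return np.mean([tree_predict(t, feature_value) for t in trees], axis=0)

# Three hypothetical depth-1 trees over 3 body-part classes
trees = [
    (0.5, np.array([0.8, 0.1, 0.1]), np.array([0.1, 0.2, 0.7])),
    (0.4, np.array([0.7, 0.2, 0.1]), np.array([0.2, 0.2, 0.6])),
    (0.6, np.array([0.9, 0.05, 0.05]), np.array([0.1, 0.3, 0.6])),
]
p = forest_predict(trees, 0.45)   # each tree votes with its leaf histogram
```

Averaging independent trees is what makes the forest robust: a single deep tree overfits its random feature choices, but the mean of several de-correlated trees gives a smoother posterior over body-part classes.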

Get Lots of Training Data
Capture and sample 500K mocap frames of people kicking, driving, dancing, etc. Build 3D models for 15 bodies with a variety of weights, heights, etc., and synthesize the mocap data for all 15 body types. Synthetic data is used for training and testing; real data for testing.

Results
Plots (image slides): classification accuracy vs. number of training examples, and a comparison with nearest-neighbor matching of the whole body.

Step 2: Joint Position Estimation
Joints are estimated with the mean-shift clustering algorithm applied to the labeled pixels: a Gaussian-weighted density estimator for each body part finds that part's mode in 3D, and each cluster mode is then pushed back in depth to lie at the approximate center of the body part. Accuracy: 73% joint prediction (on head, shoulders, elbows, hands).

Mean Shift Clustering Algorithm
Mean shift seeks the mode, i.e., the point of highest density, of a data distribution:
1. Choose a search window size.
2. Choose the initial location of the search window.
3. Compute the mean location (centroid of the data) in the search window.
4. Center the search window at the mean location computed in step 3.
5. Repeat steps 3 and 4 until convergence.

Intuitive description: given a distribution of identical billiard balls, place a region of interest, compute its center of mass, and repeatedly shift the window along the mean-shift vector toward the densest region.
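The five steps above can be sketched with a flat circular window (the Kinect system uses a Gaussian-weighted density estimator per body part; the radius, tolerance, and toy data here are illustrative):

```python
import numpy as np

def mean_shift(points, start, radius=1.5, tol=1e-4, max_iter=100):
    """Iterate steps 3-4: move the window to the centroid of the points inside it."""
    center = np.asarray(start, dtype=float)
    for _ in range(max_iter):
        inside = points[np.linalg.norm(points - center, axis=1) < radius]
        if len(inside) == 0:
            break
        new_center = inside.mean(axis=0)       # step 3: centroid in the window
        if np.linalg.norm(new_center - center) < tol:
            break                              # step 5: converged at a mode
        center = new_center                    # step 4: recenter the window
    return center

# Toy data: a dense blob near (5, 5) plus sparse uniform outliers
rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(5, 0.3, (200, 2)),
                 rng.uniform(0, 10, (20, 2))])
mode = mean_shift(pts, start=[4.0, 4.0])
```

Started anywhere in the blob's attraction basin, the window climbs the density and stops near (5, 5), which is exactly how a cloud of pixels labeled "left hand" yields one hand position.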


Clustering
A cluster is all data points in the attraction basin of a mode; the attraction basin is the region for which all trajectories lead to the same mode. ("Mean Shift: A Robust Approach Toward Feature Space Analysis," Comaniciu and Meer.)

Results and Failures (image slides)

Runtime Implementation
Runs on the Xbox 360: the Xenon CPU with 3 cores and a 500 MHz GPU with 10 MB eDRAM.

Applications
Gesture recognition: http://www.youtube.com/watch?v=e0c2b3pbvrw
Robot SLAM: http://www.youtube.com/watch?v=ainx-vpdhmo and http://www.youtube.com/watch?v=58_xg8akcae
Object recognition: http://www.youtube.com/watch?v=kawlwzgdswq
Nano helicopters: http://www.youtube.com/watch?v=yqimgv5vtd4

Uses of Kinect
Mario (gesture recognition): http://www.youtube.com/watch?v=8ctjl5lujhg
Robot control: http://www.youtube.com/watch?v=w8bmgtmkfby
Capture for holography: http://www.youtube.com/watch?v=4lw8wgmfpte
Virtual dressing room: http://www.youtube.com/watch?v=1jbvnk1t4vq
Fly wall: http://vimeo.com/user3445108/kiwibankinteractivewall
3D scanner: http://www.youtube.com/watch?v=v7lthxroesw

Closing image slides: uses of Kinect for robot SLAM, nano helicopters, and object recognition.