Articulated Pose Estimation with Flexible Mixtures-of-Parts


Articulated Pose Estimation with Flexible Mixtures-of-Parts PRESENTATION: JESSE DAVIS CS 3710 VISUAL RECOGNITION

Outline Modeling Special Cases Inferences Learning Experiments

Problem and Relevance Problem: It is difficult to capture and identify the many different poses of the human body Relevance: Improvements in pose recognition can assist many modern applications

Challenges and Previous Approaches Challenges: Many degrees of freedom in the body's joints Limbs can vary in appearance (clothes, body shape, camera orientation, etc.) Translating the method from static images to video Previous Approaches: Tree-structured models have primarily been used, with different strategies for tackling the double-counting phenomenon

Approach Object recognition to retrieve bounding boxes Bounding boxes on different parts of the body Label the bounding boxes

Step by Step Approach Define the model (as well as special cases) Infer parts based on model Supervised learning

Defining the Model Label the parts of the model, as well as their orientation (type). Score a specific configuration of part locations p and types t with

S(I, p, t) = S(t) + \sum_i w_i^{t_i} \cdot \phi(I, p_i) + \sum_{ij \in E} w_{ij}^{t_i, t_j} \cdot \psi(p_i - p_j)

The first sum is the local score of the template for part i of type t_i at location p_i; the second sum is the switching (spring) model that scores paired types. Here \phi(I, p_i) is the feature vector extracted at pixel location p_i, and S(t) is the compatibility (co-occurrence) score for the chosen part types. The relative-location features are \psi(p_i - p_j) = [dx \;\; dx^2 \;\; dy \;\; dy^2]^\top, where dx = x_i - x_j and dy = y_i - y_j encode the location of part i relative to part j on the pixel grid.
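
A minimal sketch (not the authors' code) of evaluating the score above for one fixed choice of part locations and types; the container layout for the learned weights w_app, w_def and the co-occurrence biases b_pair is an assumption made only for illustration.

```python
import numpy as np

def deformation_features(p_i, p_j):
    """psi(p_i - p_j) = [dx, dx^2, dy, dy^2] with dx = x_i - x_j, dy = y_i - y_j."""
    dx, dy = p_i[0] - p_j[0], p_i[1] - p_j[1]
    return np.array([dx, dx * dx, dy, dy * dy])

def configuration_score(phi, locations, types, edges, w_app, w_def, b_pair):
    """Score one candidate configuration of part locations and types.

    phi[i]                   -- appearance feature vector at part i's location
    locations[i]             -- (x, y) pixel location of part i
    types[i]                 -- mixture type t_i chosen for part i
    edges                    -- list of (i, j) parent/child pairs in the tree
    w_app[i][t]              -- appearance template for part i, type t
    w_def[(i, j)][(ti, tj)]  -- deformation weights for the pair
    b_pair[(i, j)][(ti, tj)] -- type co-occurrence bias contributing to S(t)
    """
    score = 0.0
    for i, feat in enumerate(phi):                 # local appearance terms
        score += float(np.dot(w_app[i][types[i]], feat))
    for (i, j) in edges:                           # pairwise spring + co-occurrence terms
        ti, tj = types[i], types[j]
        score += b_pair[(i, j)][(ti, tj)]
        score += float(np.dot(w_def[(i, j)][(ti, tj)],
                              deformation_features(locations[i], locations[j])))
    return score
```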

Learning Supervised learning Given labeled positive/negative examples, define a prediction objective function: positive examples should score > +1, negative examples should score < -1
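
A small illustrative sketch of the max-margin criterion described above, written as a standard L2-regularized hinge loss over per-image scores; the solver the paper actually uses is not reproduced here, and all names are placeholders.

```python
import numpy as np

def hinge_objective(scores, labels, weights, C=1.0):
    """Max-margin objective sketched from the slide: positive examples should
    score above +1 and negatives below -1; violations incur hinge loss.

    scores  -- best configuration score found in each training image
    labels  -- +1 for images with a ground-truth pose, -1 for negatives
    weights -- concatenated model parameters (for the L2 regularizer)
    """
    violations = np.maximum(0.0, 1.0 - np.asarray(labels) * np.asarray(scores))
    return 0.5 * float(np.dot(weights, weights)) + C * violations.sum()
```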

Learning... Prior models: Score high on ground-truth poses Score low on alternate poses This paper's model: Scores high on ground-truth poses (Presumably) scores decently on alternate poses Scores low on images without people

Learning...

Experiments Datasets: Image Parse dataset Buffy dataset (and a subset) Metrics: Number of mixtures/types Number of parts Alternative methods: Rigid HOG template Mixtures of deformable parts Additional methods detailed in their respective papers

Datasets Parse: 305 pose-annotated images of detailed full body poses Buffy: 748 pose-annotated video frames sampled from 5 episodes

Metrics Number of Mixtures/Types Number of Parts Two setups used: a 14-part model and a 26-part model

Results

Results...

Results...

High Level Summary Supervised Learning for a Model Utilizes Pose Estimation AND Pose Detection Label a Model Score model based on co-occurrence Maximize model score

Discussion Strengths/Weaknesses of Approach Ideas for Improvement Weaknesses of Experiment Other methods for video application

Real-Time Human Pose Recognition in Parts from Single Depth Images PRESENTATION: JESSE DAVIS CS 3710 VISUAL RECOGNITION

Outline Data Gathering Real Data vs Synthetic Data Body Part Inference and Joint Proposals Intermediate Body Parts Representation Depth image comparison and features Random Decision Forests/Trees Joint Positioning Experiments

Main Idea/Problem Addressed Real-time human pose recognition on consumer-grade hardware Invariance to: Shape Size Pose Clothing Etc.

Problem Relevance Provides opportunities for: monetary gain quality-of-life improvements improved human-computer interaction Examples: Microsoft Kinect, pedestrian tracking, computer interfaces

Challenges The human body is capable of many different poses Need a way to generate many of these poses with different body shapes, scales, clothing types, etc. in mind Must decide how to identify parts of the body, and how detailed the labeling should be (e.g. leg vs. lower leg, knee, and upper leg) Invariance to backgrounds, light levels, color, and texture

Previous Approaches Using conventional intensity cameras (computationally expensive): Examine/learn an initial pose, then learn variations of that pose Estimate/search for the locations of body segments from which to build up the body Using depth cameras: Base the search on 3D models, divide the model into parts, and search for those parts

Their Approach Large and varied dataset Both synthetic and motion captured data Object Recognition Approach (by parts) Intermediate Body Part Representation Pose Estimation reduced to Per-Pixel Classification Create scored proposals of body joints

Data Gathering Depth Imaging Benefits Color and texture invariance Does well in low light levels Gives accurate scaling estimates Resolves silhouette ambiguity Background subtraction Creates a base for synthesizing additional depth images

Data Types: synthetic (train & test), real (test)

Real Data (Motion Capture Data) A large database built using motion capture of human actions related to the target applications (dancing, running, etc.) The classifier is expected to generalize to unseen poses A wide range of poses is targeted rather than all possible combinations Many redundant poses are discarded based on the initial data and furthest-neighbour clustering 100k poses (frames) are kept such that no two poses are within 5 cm of one another
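
A hedged sketch of the pruning criterion above, not the paper's exact furthest-neighbour clustering: poses are kept greedily so that no two retained poses lie within min_dist of each other, with max joint distance assumed as the pose metric.

```python
import numpy as np

def subsample_poses(poses, min_dist=0.05, max_poses=100_000):
    """Greedy pruning: keep a pose only if it differs from every already-kept
    pose by at least min_dist metres in some joint (max-joint-distance metric).
    poses has shape (N, J, 3) -- N frames of J 3D joint positions."""
    kept = []
    for pose in poses:
        if all(np.max(np.linalg.norm(pose - k, axis=1)) >= min_dist for k in kept):
            kept.append(pose)
            if len(kept) >= max_poses:
                break
    return np.stack(kept) if kept else poses[:0]
```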

Synthetic Data Goals: Realism and Variety Use of randomized rendering pipeline, which produces samples that are fully labeled and can be trained on Learning is used to provide invariance towards: camera pose, body pose, body size, and body shape Other slight variations accounted for: height, weight, mocap frame, camera noise, clothing, hairstyle

The Rendering Pipeline Necessary to account for pose variation and model variation Start with: Base Character and Pose Transform on: Rotation and Translation; Hair and Clothing; Weight and Height Variations; Camera Position and Orientation; Camera Noise Add transformations to dataset

Different Renderings

Rendering Pipeline... Steps: 1) Sample a set of parameters (randomly) 2) Render depth and body part images from 3D meshes 3) Using Autodesk MotionBuilder, retarget the mocap data to the 15 base meshes spanning different body shapes and sizes 4) Apply further randomized variations to the parameters (to expand pose coverage)

Body Part Labeling Body parts are broken down into an intermediate representation of 31 body parts (object by parts) Observations: Parts should be small enough to localize the different body joints Parts should not be so numerous that classifier capacity is wasted

Depth Image Features For a given pixel x, the features are computed as

f_\theta(I, x) = d_I\!\left(x + \frac{u}{d_I(x)}\right) - d_I\!\left(x + \frac{v}{d_I(x)}\right)

where d_I(x) is the depth at pixel x in image I, \theta = (u, v) describes the two probe offsets u and v, and normalizing the offsets by \frac{1}{d_I(x)} ensures the features are depth invariant.
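
A minimal sketch of the feature above, assuming depth is stored as a 2D array and that off-image or background probes return a large constant depth value.

```python
import numpy as np

LARGE_DEPTH = 1e6  # assumed constant for background / off-image probes

def depth_feature(depth, x, u, v):
    """f_theta(I, x): difference of two depth probes whose offsets u and v are
    scaled by 1 / d_I(x) so the response is invariant to the subject's depth.
    depth is a 2D array; x, u, v are (row, col) pairs."""
    def probe(p):
        r, c = int(round(p[0])), int(round(p[1]))
        if 0 <= r < depth.shape[0] and 0 <= c < depth.shape[1]:
            return depth[r, c]
        return LARGE_DEPTH
    d_x = depth[x[0], x[1]]
    return (probe((x[0] + u[0] / d_x, x[1] + u[1] / d_x))
            - probe((x[0] + v[0] / d_x, x[1] + v[1] / d_x)))
```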

Depth Image Features...

Depth Image Features... Features are weak individually Solution: Combine with a decision forest Solution is efficient: One feature reads at most 3 image pixels Performs at most 5 arithmetic operations Can be implemented on a GPU

Toy example: distinguishing the left (L) and right (R) sides of the body. [Slide figure: an image window centred at pixel x descends a small decision tree; at each split node a feature test f(I, x; \Delta_k) > \theta_k branches left (no) or right (yes), and each leaf stores a distribution P(c) over the labels L and R.]

Training a single tree [Breiman et al. 84]: the set Q_n of all (I, x) pixel samples reaching node n induces a distribution P_n(c) over body parts c. The split test f(I, x; \Delta_n) > \theta_n sends each sample left (no) or right (yes), giving child distributions P_l(c) and P_r(c). Take the (\Delta, \theta) that maximises the information gain

\Delta E = E(Q_n) - \frac{|Q_l|}{|Q_n|} E(Q_l) - \frac{|Q_r|}{|Q_n|} E(Q_r)

i.e. the split that most reduces entropy. Goal: drive the entropy at the leaf nodes to zero.

[Amit & Geman 97] [Breiman 01] [Geurts et al. 06] tree 1 (I, x) (I, x) tree T P T (c) P 1 (c) c Trained on different random subset of images bagging helps avoid over-fitting Average tree posteriors c P c I, x = 1 P T t (c I, x) t=1 T

[Plot: average per-class accuracy (roughly 40%-55% on the y axis) against the number of trees (1 to 6 on the x axis), with accuracy improving as trees are added; ground-truth and inferred (most likely) body part labelings are shown for forests of 1, 3, and 6 trees.]

Randomized Decision Forests... Consist of T decision trees Split nodes: store a feature f(\theta) and a threshold \tau To classify pixel x from image I, start at the root and evaluate equation (1), branching left or right according to the comparison with \tau Leaf nodes: store a learned distribution over the body part labels c The distributions are averaged over all trees to give the final classification decision
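
A minimal test-time sketch of this procedure, assuming a simple dict-based tree layout (the paper does not prescribe a data structure) and reusing the depth_feature sketch above.

```python
import numpy as np

def classify_pixel(forest, depth, x):
    """Average the tree posteriors P_t(c | I, x) over all T trees.
    Each node is assumed to be a dict: split nodes carry 'u', 'v', 'tau',
    'left', 'right'; leaf nodes carry 'leaf_dist' (a distribution over c)."""
    posteriors = []
    for tree in forest:
        node = tree
        while 'leaf_dist' not in node:                  # descend to a leaf
            f = depth_feature(depth, x, node['u'], node['v'])
            node = node['right'] if f > node['tau'] else node['left']
        posteriors.append(node['leaf_dist'])
    return np.mean(posteriors, axis=0)
```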

Training the Decision Forest Trees are trained on different sets of randomly synthesized images using the following algorithm:
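
The algorithm itself appears as a figure in the original slides; below is a hedged sketch of its per-node step only (greedy selection of the split (θ, τ) by information gain), with the recursion, stopping criteria, and bagging omitted and all names illustrative.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a multiset of body part labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def best_split(samples, labels, candidate_thetas, candidate_taus, feature_fn):
    """Greedy node training: choose the (theta, tau) pair that maximises the
    information gain over the (image, pixel) samples reaching this node.
    feature_fn(sample, theta) evaluates the depth feature for one sample."""
    labels = np.asarray(labels)
    base_entropy = entropy(labels)
    best_theta, best_tau, best_gain = None, None, -np.inf
    for theta in candidate_thetas:
        responses = np.array([feature_fn(s, theta) for s in samples])
        for tau in candidate_taus:
            go_right = responses > tau
            if go_right.all() or not go_right.any():
                continue                          # degenerate split, skip it
            gain = base_entropy - (go_right.mean() * entropy(labels[go_right])
                                   + (~go_right).mean() * entropy(labels[~go_right]))
            if gain > best_gain:
                best_theta, best_tau, best_gain = theta, tau, gain
    return best_theta, best_tau, best_gain
```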

Joint Position Proposal Body part recognition provides per-pixel information; this must be turned into reliable proposals for the body joints. Method: a local mode-finding approach based on mean shift with a weighted Gaussian kernel. The density estimator for body part c is

f_c(\hat{x}) \propto \sum_{i=1}^{N} w_{ic} \exp\!\left(-\left\lVert \frac{\hat{x} - \hat{x}_i}{b_c} \right\rVert^2\right)

where \hat{x} is a coordinate in 3D world space, N is the number of image pixels, w_{ic} is the pixel weighting, \hat{x}_i is the re-projection of image pixel x_i into world space given its depth, and b_c is a learned per-part bandwidth.

Joint Position Proposal... The pixel weighting w_{ic} is defined as

w_{ic} = P(c \mid I, x_i) \cdot d_I(x_i)^2

which accounts for both the inferred body part probability P(c \mid I, x_i) and the world surface area of the pixel, d_I(x_i)^2.

[Slide figure: the 3D world-space density above with its terms annotated (3D coordinate, pixel weight, re-projected pixel coordinate, bandwidth, inferred probability, depth at the i-th pixel); mean shift is then run for mode detection, and body joints are hypothesized at the detected modes.]

Joint Position Proposal (cont) Key details: Density estimates are depth invariant (eq. 8) Depending on the application, inferred body parts can be pre-accumulated, e.g. multiple body parts over the same area can be merged to form a localized joint Mean shift finds the modes efficiently Pixels above a learned per-part probability threshold are used as starting points for part c (eqs. 7, 8) The confidence of a final joint estimate is the sum of the pixel weights reaching its mode
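
A compact sketch of the weighted mean-shift step for a single body part, under the assumptions that pixels have already been thresholded and re-projected to 3D, and that the proposal confidence can be approximated by the summed weight of points ending within one bandwidth of the mode (the paper sums the weights of the pixels that actually reach the mode).

```python
import numpy as np

def propose_joint(points, weights, bandwidth, n_iters=30, tol=1e-4):
    """Weighted mean shift for one body part c.  points (N, 3) are the
    re-projected pixel positions, weights are w_ic = P(c|I,x_i) * d_I(x_i)^2.
    Starts at the highest-weight pixel and climbs to the nearest density mode."""
    x = points[np.argmax(weights)].astype(float).copy()
    for _ in range(n_iters):
        d2 = np.sum(((points - x) / bandwidth) ** 2, axis=1)
        k = weights * np.exp(-d2)                      # Gaussian kernel weights
        x_new = (k[:, None] * points).sum(axis=0) / k.sum()
        if np.linalg.norm(x_new - x) < tol:
            break
        x = x_new
    near = np.sum(((points - x) / bandwidth) ** 2, axis=1) < 1.0
    return x, float(weights[near].sum())               # mode and its confidence
```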

Experiments Test Data: A set of 5000 synthesized depth images (whose poses were held out when generating the training data) A real dataset of 8808 frames from more than 15 different subjects Depth data from [13] (28 depth image sequences ranging from short motions to full actions)

Parameters and Metrics 3 trees, depth 20, 300k training images per tree, 2000 training example pixels per image, 2000 candidate features (\theta), 50 candidate thresholds (\tau) per feature Classification and joint prediction accuracy are quantified as follows: Classification: average per-class accuracy (all body parts weighted equally) Joint prediction: average precision per joint, summarized as mean average precision (mAP)

Joint Prediction Given a ground-truth joint position: the first joint proposal (in confidence order) within D meters (D = 0.1 m) is counted as the true positive, while all other proposals are false positives Joint proposals farther than D from the ground truth are also false positives All proposals are counted, not just the confident ones Invisible joints are not penalized
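
A small sketch of this counting rule for one visible joint in one image; accumulating the (tp, fp) counts across images and confidence thresholds into average precision is omitted, and the function name is illustrative.

```python
import numpy as np

def count_matches(proposals, gt_position, D=0.1):
    """Proposals are assumed sorted by confidence; the first one within D metres
    of the ground truth is the single true positive, every other proposal is a
    false positive.  Returns (true_positives, false_positives)."""
    tp, fp, matched = 0, 0, False
    gt = np.asarray(gt_position, dtype=float)
    for p in proposals:
        close = np.linalg.norm(np.asarray(p, dtype=float) - gt) <= D
        if close and not matched:
            tp, matched = 1, True
        else:
            fp += 1
    return tp, fp
```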

Qualitative Results

Classification Accuracy and Metrics Metrics: # of training images Synthesized silhouette images Depth of trees Max probe offset

Classification Accuracy and Metrics...

Joint Prediction Accuracy Comparisons Nearest Neighbor Comparison, two variants: Var. 1: match the ground-truth test skeleton to the training skeletons with translational alignment in 3D space (note: there is no access to the test skeleton in practice) Var. 2: chamfer matching [14], using depth edges and 12 orientation bins

Joint Prediction Accuracy Comparisons... Versus [13]: [13] uses body part proposals from [28] and tracks the skeleton with kinematic and temporal information, on data from a time-of-flight depth camera This paper's algorithm gives better joint prediction (measured by average precision) and runs at least 10x faster (no exact timings given)

Joint Prediction Accuracy Comparisons...

Full Rotations and Multiple People mAP decreases, but the method still achieves an impressive accuracy of 65.5% Trained on 900k images containing full rotations; evaluated on 5k synthetic images with full rotations

Discussion Improvements? Weaknesses? Applications? Future work?

Resources/References
1. Y. Yang and D. Ramanan. Articulated Pose Estimation with Flexible Mixtures of Parts. CVPR 2011.
2. J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake. Real-Time Human Pose Recognition in Parts from Single Depth Images. CVPR 2011.
3. OpenCV Dev Team. "Meanshift and Camshift." OpenCV 3.0.0-dev documentation. Web. 20 Feb. 2015.
4. T. K. Ho. Random Decision Forests. ICDAR 1995.
5. V. Lepetit, P. Lagger, and P. Fua. Randomized Trees for Real-Time Keypoint Recognition. In Proc. CVPR, pages 2:775-781, 2005.