Inferring 3D People from 2D Images

Similar documents
Probabilistic Tracking and Reconstruction of 3D Human Motion in Monocular Video Sequences

Predicting 3D People from 2D Pictures

Part I: HumanEva-I dataset and evaluation metrics

Visual Motion Analysis and Tracking Part II

3D Human Motion Analysis and Manifolds

Measure Locally, Reason Globally: Occlusion-sensitive Articulated Pose Estimation

CS 664 Flexible Templates. Daniel Huttenlocher

Graphical Models and Their Applications

Efficient Nonparametric Belief Propagation with Application to Articulated Body Tracking

Human Upper Body Pose Estimation in Static Images

Inferring 3D from 2D

A novel approach to motion tracking with wearable sensors based on Probabilistic Graphical Models

Boosted Multiple Deformable Trees for Parsing Human Poses

Efficient Mean Shift Belief Propagation for Vision Tracking

Hierarchical Part-Based Human Body Pose Estimation

Video Texture. A.A. Efros

3D Tracking for Gait Characterization and Recognition

Learning to Estimate Human Pose with Data Driven Belief Propagation

2D Articulated Tracking With Dynamic Bayesian Networks

Learning and Recognizing Visual Object Categories Without First Detecting Features

Visual Tracking of Human Body with Deforming Motion and Shape Average

Behaviour based particle filtering for human articulated motion tracking

International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Vol. XXXIV-5/W10

Automatic Detection and Tracking of Human Motion with a View-Based Representation

Practical Course WS12/13 Introduction to Monte Carlo Localization

Nonflat Observation Model and Adaptive Depth Order Estimation for 3D Human Pose Tracking

3D Human Body Tracking using Deterministic Temporal Motion Models

Part-Based Models for Object Class Recognition Part 2

Part-Based Models for Object Class Recognition Part 2

Monocular Human Motion Capture with a Mixture of Regressors. Ankur Agarwal and Bill Triggs GRAVIR-INRIA-CNRS, Grenoble, France

Stochastic Tracking of 3D Human Figures Using 2D Image Motion

An Adaptive Appearance Model Approach for Model-based Articulated Object Tracking

Generating Different Realistic Humanoid Motion

Tracking People. Tracking People: Context

Computer vision: models, learning and inference. Chapter 10 Graphical Models

Structured Models in. Dan Huttenlocher. June 2010

Epitomic Analysis of Human Motion

Segmentation. Bottom up Segmentation Semantic Segmentation

Scene Grammars, Factor Graphs, and Belief Propagation

Object Recognition Using Pictorial Structures. Daniel Huttenlocher Computer Science Department. In This Talk. Object recognition in computer vision

Scene Grammars, Factor Graphs, and Belief Propagation

On the Sustained Tracking of Human Motion

Efficient Extraction of Human Motion Volumes by Tracking

A Unified Spatio-Temporal Articulated Model for Tracking

Bayesian Reconstruction of 3D Human Motion from Single-Camera Video

Probabilistic Robotics

Multi-view Human Motion Capture with An Improved Deformation Skin Model

Estimation of Human Figure Motion Using Robust Tracking of Articulated Layers

Computer Vision II Lecture 14

Algorithms for Markov Random Fields in Computer Vision

Representation and Matching of Articulated Shapes

A New Hierarchical Particle Filtering for Markerless Human Motion Capture

Data-driven methods: Video & Texture. A.A. Efros

Human Activity Recognition Using Multidimensional Indexing

radius length global trans global rot

Automatic Model Initialization for 3-D Monocular Visual Tracking of Human Limbs in Unconstrained Environments

Regularization and Markov Random Fields (MRF) CS 664 Spring 2008

08 An Introduction to Dense Continuous Robotic Mapping

Finding and Tracking People from the Bottom Up

A Swarm Intelligence Based Searching Strategy for Articulated 3D Human Body Tracking

Tracking Algorithms. Lecture16: Visual Tracking I. Probabilistic Tracking. Joint Probability and Graphical Model. Deterministic methods

Motion Capture. Motion Capture in Movies. Motion Capture in Games

3D Human Pose Estimation from Monocular Image Sequences

Loose-limbed People: Estimating 3D Human Pose and Motion Using Non-parametric Belief Propagation

Multi-camera Tracking of Articulated Human Motion Using Motion and Shape Cues

Motion Interpretation and Synthesis by ICA

Conditional Visual Tracking in Kernel Space

Humanoid Robotics. Monte Carlo Localization. Maren Bennewitz

Modeling 3D Human Poses from Uncalibrated Monocular Images

Estimating Human Pose with Flowing Puppets

Clustered Pose and Nonlinear Appearance Models for Human Pose Estimation

Reducing Particle Filtering Complexity for 3D Motion Capture using Dynamic Bayesian Networks

Automatic detection and tracking of human motion with a view-based representation

Tracking Human Body Parts Using Particle Filters Constrained by Human Biomechanics

Recognition Rate. 90 S 90 W 90 R Segment Length T

Guiding Model Search Using Segmentation

AUTOMATIC OBJECT EXTRACTION IN SINGLE-CONCEPT VIDEOS. Kuo-Chin Lien and Yu-Chiang Frank Wang

Probabilistic Detection and Tracking of Motion Boundaries

A Dynamic Human Model using Hybrid 2D-3D Representations in Hierarchical PCA Space

Learning Image Statistics for Bayesian Tracking

Tracking People Interacting with Objects

Introduction to Mobile Robotics Bayes Filter Particle Filter and Monte Carlo Localization. Wolfram Burgard

Particle Filtering. CS6240 Multimedia Analysis. Leow Wee Kheng. Department of Computer Science School of Computing National University of Singapore

Edge Filter Response. Ridge Filter Response

Human Tracking with Mixtures of Trees

Tracking Generic Human Motion via Fusion of Low- and High-Dimensional Approaches

Learning Articulated Skeletons From Motion

Motion Synthesis and Editing. Yisheng Chen

Local cues and global constraints in image understanding

Object Tracking with an Adaptive Color-Based Particle Filter

Entropy Driven Hierarchical Search for 3D Human Pose Estimation

18 October, 2013 MVA ENS Cachan. Lecture 6: Introduction to graphical models Iasonas Kokkinos

Human Tracking with Mixtures of Trees

Optimal motion trajectories. Physically based motion transformation. Realistic character animation with control. Highly dynamic motion

Learning and Inferring Depth from Monocular Images. Jiyan Pan April 1, 2009

Graphical Models, Bayesian Method, Sampling, and Variational Inference

Capturing Skeleton-based Animation Data from a Video

Probabilistic Graphical Models

Tracking in Action Space

Reducing Particle Filtering Complexity for 3D Motion Capture using Dynamic Bayesian Networks

Transcription:

Inferring 3D People from 2D Images Department of Computer Science Brown University http://www.cs.brown.edu/~black

Collaborators Hedvig Sidenbladh, Swedish Defense Research Inst. Leon Sigal, Brown University Michael Isard, Microsoft Research Ben Sigelman, Brown University David Fleet, PARC Inc. (soon: U. Toronto) Funding: DARPA HumanID project.

Capturing Humans in Motion Loss of depth and motion in projection to 2D images. E T I E N N E - J U L E S M A R E Y, 1882. Marker-based tracking EADWEARD MUYBRIDGE, 1884-5. Multiple cameras.

Mocap Today Cameras CMU Mocap lab. 2003

Markerless Motion Capture Kinematic tree: Marr&Nishihara 78 0 [ θ 1,0 x, θ 1,0 y, θ 1,0 z 1 2,1 [ ] 2 θ x ] Also model angular velocities ~ 50D space [ τ [ θ, τ, τ 0, g 0, g 0, g x y z, θ, θ 0, g 0, g 0, g x y z ] ] Represent a pose at time t by a vector of these parameters: φ t

Markerless Motion Capture

Markerless Motion Capture Find the pose φ t * No special clothing * Unknown, cluttered, environment * Incremental estimation such that the projection matches the image data.

Markerless Motion Capture Find the pose φ t such that the projection matches the image data.

Motivation * Markerless Mocap animation, film, games, archival footage sports and rehabilitation medicine * Tracking gait recognition (biometric person identification) surveillance * Understanding HCI/gesture recognition (cars, elder care, games, ) video search/annotation

Why is it Difficult? The appearance of people can vary dramatically. Bones and joints are unobservable (muscle, skin, clothing hide the underlying structure).

Why is it Difficult? Loss of 3D in 2D projection Unusual poses Self occlusion Low contrast

Clothing and Lighting

Large Motions Long-range motions. (makes search and matching hard) Motion blur. (nothing to match)

Ambiguities Ambiguous matches Self occlusion

Requirements 1. Represent uncertainty and multiple hypotheses. 2. Model complex kinematics of the body. Correlations between joints and over time. 3. Exploit multiple image cues in a robust fashion. 4. Integrate information over time. The recovery of human motion is fundamentally a problem of inference from ambiguous and uncertain measurements.

Bayesian formulation p(model cues) = Approach p(cues model) p(model) p(cues) 1. Model: Kinematic tree. Cues: filter responses. 2. Likelihood: exploit learned likelihood of filter responses conditioned on the projected model. 3. Prior: statistical model, learned from examples. 4. Search: discretize intelligently using factored sampling and search using a particle filter.

Towards a Rigorous Likelihood 1. Project 3D model into image to predict the location of limb regions in the scene. 2. Compute rich set of spatial and temporal filter responses conditioned on the predicted orientation of the limb. 3. Compute likelihood of filter responses using a statistical model learned from examples.

Natural Image Statistics * Marginal statistics of image derivatives are non-gaussian. * Consistent across scale. filter response Ruderman. Lee, Mumford, Huang. Portilla and Simoncelli. Olshausen & Field. Xu, Wu, & Mumford.

Human-Specific Statistics f i filter response p ( f φ on i t Learn marginal statistics. )

Generic Background Statistics f i filter response p ( f off i )

Likelihood crude independence assumptions p( f 1,..., f k φ t ) limbs filters i= 1 p p on ( off fi ( φt f ) i )

Prior Bayesian formulation p(model cues) = p(cues model) p(model) p(cues) 1. Model: Kinematic tree. Cues: filter responses. 2. Likelihood: exploit learned likelihood of filter responses conditioned on the projected model. 3. Prior: statistical model, learned from examples.

Learning Human Motion * constrain the posterior to likely & valid poses/motions * What we want: joint angles p( φ φ ) t t 1 3D motion-capture data. * Database with multiple actors and a variety of motions. (from M. Gleicher) time

Implicit Probabilistic Prior Problem: * insufficient data to learn an explicit prior probabilistic model of human motion. Alternative: * the data represents all we know Efros & Freeman 01 * replace representation and learning with search. (challenge: search has to be fast)

Texture Synthesis Database Synthetic Texture * e.g. De Bonnet&Viola, Efros&Leung, Efros&Freeman, Paztor&Freeman, Hertzmann et al. * Image(s) as an implicit probabilistic model. Efros & Freeman 01

Mocap Soup [Cohen] SIGGRAPH 2002: Arikan & Forsyth. Interactive motion generation from examples Li et al. Motion textures: A two-level statistical model for character motion synthesis Lee et al. Interactive control of avatars animated with human motion data Kovar et al. Motion graphs Pullen & Bregler. Motion capture assisted animation: Texturing and synthesis Here we formulate a probabilistic model suitable for stochastic search and Bayesian tracking.

Motion Texture Φ t 1 s φ t Ψ i 1 ψ i

Probabilistic Formulation Want samples from the temporal prior s p( φ Φ ) t t 1 pose at time t history of poses up to t-1 Instead, sample from Database index i-1 p( Ψ Φ ) i 1 t 1 and take φ ts = ψ i Problem: Compute p( Ψi Φt ) for all motions i in the database?

Approximate Probabilistic Model Approach: Only need samples. Trade accuracy for speed of sampling.

Probabilistic Database Search Each level in the tree corresponds to one PCA coefficient l. Sort each motion example i into a tree: Left subtree for negative value of c l,i, right for positive value.

Probabilistic Database Search Probabilistic Database Search ) ( ) ( t i t i p p c c Φ Ψ Approximated by sampling from tree iteratively: ) 0 ( ) ( ) 0 (,,, 2 2 2 exp 2 1,,,, t l i l left l dz c z l z l t l i l right l c c p p c c p p t l < = = = = βσ πβσ

Motion Texture Synthesis Start with a small motion chunk, sample to generate a new sequence of poses. Changing color indicates new example sequence. Walking

Bayesian Formulation Posterior over model parameters given an image sequence. r p( φt It ) = Temporal model (prior) r κ p( I φ ) ( p( φ φ dφ t Likelihood of observing the image features (filters responses) given the model parameters t t t 1 ) p( φt 1 It 1)) t 1 Posterior from previous time instant Monte Carlo integration

Bayesian formulation Inference/Search p(model cues) = p(cues model) p(model) p(cues) 1. Model: Kinematic tree. Cues: filter responses. 2. Likelihood: exploit learned likelihood of filter responses conditioned on the projected model. 3. Prior: statistical model, learned from examples. 4. Search: discretize intelligently using factored sampling and search using a particle filter.

Key Idea: Represent Ambiguity * Represent a multi-modal posterior probability distribution over model parameters - sampled representation - each sample is a pose and its probability (likelihood weighting) - predict over time using a particle filter. Samples from a distribution over 3D poses.

r Posterior p φ I ) ( t 1 t 1 Particle Filter Isard & Blake 96

Posterior p φ I ) sample r ( t 1 t 1 Particle Filter Isard & Blake 96

Posterior p φ I ) sample Temporal dynamics p( φt φt 1) ( t 1 t 1 sample r Particle Filter Isard & Blake 96

Posterior p φ I ) sample Temporal dynamics p( φt φt 1) Likelihood ( t 1 t 1 sample r p( It φt) Particle Filter Isard & Blake 96

r Posterior p φ I ) ( t 1 t 1 Particle Filter Temporal dynamics Posterior sample p( φt φt 1) Likelihood sample r p( φt It) p( It φt) normalize Isard & Blake 96

Stochastic 3D Tracking monocular sequence * 2500 samples. * circa 2000.

How Strong is the Prior? * Learned walking prior. * No likelihood = hallucination.

Related Work Cham & Rehg 99 * Single camera, multiple hypotheses. * 2D templates (no change in viewpoint) * Manual initialization.

Related Work * multiple cameras * simplified clothing and light * manual initialization * weak prior Deutscher, North, Bascle, & Blake 99 * annealed particle filter

* manual initialization * monocular * complex background Related Work * multiple cues (motion, edges, motion discontinuities) * weak prior * careful attention paid to the optimization problem. Sminchisescu &Triggs 01

Are we done? Current systems: * require manual initialization * are brittle (can t re-initialize) * can t easily exploit robust, bottom-up, detectors The search space is huge. Particle filtering on the whole space requires strong priors or huge numbers of samples.

Kinematic Tree * brittle if it does not fit perfectly. * starts with torso which is hardest to find. * faces, hands, and feet may be easier to find. Traditional kinematic tree (dogma) * these are defined wrt the other limbs results in a full high-dimensional search to fit bottom up measurements.

Attractive People Traditional kinematic tree (dogma) Push puppet toy (dog)

Attractive People Loose-limbed body (graphical model) Push puppet toy

Loosely-Jointed Bodies (with Michael Isard and Leon Sigal) Soft constraints (messages) between limbs. Pose estimation becomes inference in a graphical model (Belief Propagation). Allows bottom up initialization. Deals well with unobserved limbs. Pictorial structures Fischler and Elschlager 73 More recently (2D, discretized): * Felzenszwalb & Huttenlocher 00, Forsyth et al 00-03.

Loosely-Jointed Bodies (with Michael Isard and Leon Sigal) * 6D (position & orientation) discretization not practical

Belief random variable (position and orientation of node i) local evidence observations (image and filter responses) neighbors of node i p ( X Y) = α λ( Y X ) m ( X ) i i i k A i ki i incoming messages

Messages message from node i to node j local evidence m ij ( X ) = α ψ ( X, X ) λ( X ) m ( X ) dx j ij i j i k A \ i j ki i i potential from i to j (prob of X j conditioned on X i ) neighbors of ( spatial or temporal prior) i, not including j incoming messages at i this is the hard part if things aren t Gaussian

Messages message foundation m ij ( X ) = α ψ ( X, X ) λ( X ) m ( X ) dx j ij i j i k A \ Monte Carlo integration. * draw samples from normalized foundation * propagate through potential function i j ki i i

Potential Functions forward backward * represented by a mixture of Gaussians (fit to mocap data).

Temporal Potentials Introduces loops Time 6 6 ψ t, t+1 6 6 * we also define potentials backwards/forwards in time.

Approach Problems: * state space is continuous and relatively high dimensional * likelihoods are non-gaussian and multi-modal * relationships between limbs are complex (not Gaussian) Approach: * exploit particle set idea to represent messages * Algorithm: Non-parametric Belief Propagation (Isard 03, Sudderth et al 03)

1. represent messages and beliefs by a discrete set of weighted samples (ie. mixture of narrow Gaussians). Algorithm Sketch

2. compute product of incoming messages (also a mixture of Gaussians). Product of d mixtures of n Gaussians: n d Algorithm Sketch

Algorithm Sketch 2. compute product of incoming messages (also a mixture of Gaussians). Product of d mixtures of n Gaussians: n d Gibbs sampler (Sudderth et al): fix d-1 samples take the product with n Gaussians.

Algorithm Sketch 2. compute product of incoming messages (also a mixture of Gaussians). Product of d mixtures of n Gaussians: n d Gibbs sampler (Sudderth et al): weight the samples

Algorithm Sketch 2. compute product of incoming messages (also a mixture of Gaussians). Repeat until you ve drawn n samples. Cost: O(dkn 2 ) Gibbs sampler (Sudderth et al): Sample to select a new Gaussian. * Repeat with each message. * Repeat the process k times. * Take the product of the selected Gaussians and draw a sample.

Algorithm Sketch 3. to construct a message, we draw samples from a proposal distribution (including incoming message product, belief, and bottom up processes); importance re-weight. Noisy bottom up process (limb detector)

4. propagate samples through the potential to get new message. Repeat. Algorithm Sketch

Iteration 0

Iteration 1

Iteration 2

Iteration 3

Iteration 7

Iteration 10

Summary We have tackled four important parts of the problem: likelihood prior search initialization 1. Probabilistically modeling human appearance in a generic, yet constraining, way. 2. Representing the range of possible motions using techniques from texture modeling. 3. Dealing with ambiguities and non-linearities using particle filtering for Bayesian inference. 4. Automatic initialization using Belief Propagation.

What Next? * part detectors (faces, limbs, hands, feet, etc) * interactions between non-adjacent limbs introduces new nodes and loops to the graphical model * new techniques for learning probabilistic models in high dimensional spaces with limited training data.

Outlook 5-10 years: - Relatively reliable people tracking in monocular video - Accurate with multiple cameras - Path is pretty clear. Next step: Beyond person-centric - people interacting with object/world solve the vision problem. Beyond that: Recognizing action - goals, intentions,... solve the AI problem.