Semi-Supervised Hierarchical Models for 3D Human Pose Reconstruction

Semi-Supervised Hierarchical Models for 3D Human Pose Reconstruction
Atul Kanaujia, CBIM, Rutgers
Cristian Sminchisescu, TTI-C
Dimitris Metaxas, CBIM, Rutgers

3D Human Pose Inference Difficulties
Towards automatic monocular methods
- Background clutter
- Geometric transforms
- Scale & viewpoint change
- Illumination changes
- Fast motions
- (Self-)occlusion
- Variability in the human body proportions

Standard Discriminative Approach
- Train structured model to predict 3D human poses given image descriptor inputs
  - Multi-valued predictor necessary for multiple plausible pose interpretations
- Train using images and corresponding 3D human poses
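
A multi-valued predictor can be sketched as a small mixture of experts: each expert proposes one pose, and a gate ranks the proposals for the given descriptor. The sketch below is only an illustration under stated assumptions. The linear experts, the softmax gate, and the names `predict_modes`, `gate_w` and `expert_w` are hypothetical and are not taken from the talk.

```python
import numpy as np

def predict_modes(r, gate_w, expert_w):
    """Return ranked pose hypotheses for one image descriptor.

    r:        (d,) image descriptor.
    gate_w:   (m, d) gate weights, one row per expert (assumed softmax gate).
    expert_w: (m, p, d) linear experts mapping a descriptor to a p-dim pose.
    """
    logits = gate_w @ r
    gates = np.exp(logits - logits.max())  # subtract max for numerical stability
    gates /= gates.sum()
    poses = expert_w @ r                   # (m, p): one pose hypothesis per expert
    order = np.argsort(-gates)             # most probable expert first
    return gates[order], poses[order]
```

Keeping all ranked hypotheses, instead of averaging them, is what lets the predictor represent several plausible poses for one ambiguous image.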

Problems we address
- Lack of typical training data (lab, quasi-real)
- Image descriptors are unstable w.r.t. geometric deformations and background clutter
[Figure: example images labeled Clean, Quasi-real (QReal), Real]
- Predictor cannot generalize!

This Talk
[Diagram: Hierarchical Image Encodings, Distance Metric Learning, Multi-Valued Semi-supervised Learning]
1. Learning hierarchical image descriptors
  - Multi-level / coarse-to-fine encodings stable w.r.t. deformation and misalignment in the training set
  - Metrics for noise suppression (clutter removal)
2. Semi-supervised generalization to multi-valued prediction

Why do we need better features?
- Global histograms (bag of features) are robust to local deformation but sensitive to background clutter
- Regular grid descriptors can be made robust to clutter but are sensitive to training set misalignments & local deformations

Hierarchical Image Descriptors
- Coarse-to-fine encodings designed to represent multiple degrees of selectivity & invariance
- Progressively relax rigid local spatial encodings to weaker models of geometry accumulated over larger regions
- Layered encodings (e.g. spatial pyramid) or bottom-up, hierarchical aggregation based on successive template matching + max pooling (e.g. HMAX)

Hierarchical Image Descriptors: HMAX (Poggio et al, 2002-06)
[Figure: HMAX layers S1, C1, S2, C2, with Ω = MAX. Labels: 16 Gabor filter response; Encoding; Select patches / object parts to match against results from previous layer]
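
The alternation of template matching (S layers) and max pooling (C layers) can be sketched as two small functions. This is a minimal illustration, not the HMAX implementation used in the talk. The function names, the non-overlapping pooling grid, and the Gaussian radial-basis similarity are assumptions made for the example.

```python
import numpy as np

def max_pool(responses, pool=2):
    """C-layer step: local max over non-overlapping pool x pool neighborhoods."""
    h, w = responses.shape
    h, w = h - h % pool, w - w % pool  # crop so the map tiles evenly
    blocks = responses[:h, :w].reshape(h // pool, pool, w // pool, pool)
    return blocks.max(axis=(1, 3))

def template_match(feature_map, template):
    """S-layer step: similarity of a stored template to every patch of the map.

    Uses an assumed Gaussian radial-basis similarity, exp(-||patch - template||^2).
    """
    th, tw = template.shape
    h, w = feature_map.shape
    out = np.empty((h - th + 1, w - tw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = feature_map[i:i + th, j:j + tw]
            out[i, j] = np.exp(-np.sum((patch - template) ** 2))
    return out
```

Pooling with a max makes the response tolerant to small shifts of a part, while matching against templates keeps the encoding selective for that part.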

Multilevel Image Descriptors: Spatial Pyramid (Lazebnik et al, 2006)
- SIFT descriptor, followed by vector quantization
- Votes accumulation in a spatial region
- Concatenate to descriptor
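
The three steps on this slide can be sketched as follows, assuming the local descriptors have already been quantized to visual-word indices. The function name, the normalized coordinates, and the unweighted concatenation are assumptions made for the example; the published spatial pyramid also weights the levels, which is omitted here.

```python
import numpy as np

def spatial_pyramid(xy, words, vocab_size, levels=3):
    """Concatenate visual-word histograms over 1x1, 2x2, 4x4, ... grids.

    xy:    (n, 2) keypoint coordinates normalized to [0, 1).
    words: (n,) visual-word index per keypoint (after vector quantization).
    """
    xy = np.asarray(xy, dtype=float)
    words = np.asarray(words, dtype=int)
    parts = []
    for level in range(levels):
        cells = 2 ** level
        # Cell index of each keypoint along x and y at this resolution.
        cx = np.minimum((xy[:, 0] * cells).astype(int), cells - 1)
        cy = np.minimum((xy[:, 1] * cells).astype(int), cells - 1)
        # One flat bin per (cell row, cell column, visual word).
        flat = (cy * cells + cx) * vocab_size + words
        parts.append(np.bincount(flat, minlength=cells * cells * vocab_size))
    return np.concatenate(parts).astype(float)
```

The first `vocab_size` entries form the global bag-of-words histogram; later entries add progressively finer spatial layout.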

Dealing with Background Clutter
- Multi-level encodings are still perturbed by background
- Need to suppress noise
- Feature selection based on e.g. sparse linear regression tends to be ineffective for global descriptors

Distance Metric Learning
[Figure: Chunklet 1, Chunklet 2]
- Learn Mahalanobis distance that maximizes similarity within chunklets
- Chunklets = sets of images of people in similar poses, but differently proportioned and placed on different backgrounds
- Relevant Component Analysis (Bar-Hillel et al. 2003)
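
As a rough sketch of the idea, Relevant Component Analysis is commonly described as pooling the within-chunklet covariance and using its inverse as the Mahalanobis matrix, so that directions of within-chunklet variation (here, background and body proportions) are down-weighted. The function names and the small ridge term `reg` are assumptions added for this example.

```python
import numpy as np

def rca_metric(chunklets, reg=1e-6):
    """Inverse pooled within-chunklet covariance, used as the Mahalanobis matrix.

    chunklets: list of (n_j, d) arrays; each array holds descriptors that
               should be treated as similar (same pose, different background).
    """
    centered = [np.asarray(c, dtype=float) - np.mean(c, axis=0) for c in chunklets]
    stacked = np.vstack(centered)
    cov = stacked.T @ stacked / len(stacked)
    cov += reg * np.eye(cov.shape[0])  # assumed ridge term keeps the inverse stable
    return np.linalg.inv(cov)

def mahalanobis(a, b, metric):
    """Distance between two descriptors under the learned metric."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(np.sqrt(diff @ metric @ diff))
```

With this metric, two descriptors that differ only along a direction that varies inside chunklets end up closer than two descriptors that differ along a direction the chunklets hold fixed.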

Suppressing Background Clutter
[Figure: example images labeled Clean, Quasi-real (QReal), Real]
- The distance between the learned descriptors computed on different backgrounds is diminished

This talk
[Diagram: Flexible training; Hierarchical Image Encodings, Distance Metric Learning, Multi-Valued Semi-supervised Learning]
1. Learning hierarchical image descriptors
  - Multi-level / coarse-to-fine encodings stable w.r.t. deformation and misalignment in the training set
  - Metrics for noise suppression & clutter removal
2. Semi-supervised generalization to multi-valued prediction

Semi-supervised Multi-valued Prediction
[Figure: 3D human pose x (points x1, x2) against image descriptor r (points r1, r2)]
Manifold Assumption
- If two image descriptors are close in their intrinsic geometry (e.g. encoded by the graph Laplacian), their 3D outputs should vary smoothly
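
The manifold assumption can be illustrated with a Laplacian-regularized linear regressor that maps descriptors r to poses x: the graph is built over labeled and unlabeled descriptors, and the penalty discourages predictions that change quickly between neighboring descriptors. This is a single-valued sketch in the style of Laplacian regularized least squares, not the multi-valued model of the talk. The Gaussian affinity, the function names, and the parameters `lam`, `gamma` and `sigma` are assumptions made for the example.

```python
import numpy as np

def graph_laplacian(r, sigma=1.0):
    """Unnormalized graph Laplacian L = D - W with Gaussian affinities."""
    r = np.asarray(r, dtype=float)
    sq = np.sum((r[:, None, :] - r[None, :, :]) ** 2, axis=-1)
    w = np.exp(-sq / (2 * sigma ** 2))
    np.fill_diagonal(w, 0.0)
    return np.diag(w.sum(axis=1)) - w

def laplacian_ridge(r_lab, x_lab, r_unlab, lam=1e-3, gamma=1e-2, sigma=1.0):
    """Linear map from descriptors to poses with ridge and manifold penalties.

    Minimizes ||R_lab W - X_lab||^2 + lam ||W||^2 + gamma tr(W^T R^T L R W),
    where R stacks labeled and unlabeled descriptors.
    """
    r_all = np.vstack([r_lab, r_unlab])
    lap = graph_laplacian(r_all, sigma)
    a = (r_lab.T @ r_lab
         + lam * np.eye(r_lab.shape[1])
         + gamma * r_all.T @ lap @ r_all)
    return np.linalg.solve(a, r_lab.T @ x_lab)
```

Unlabeled descriptors enter only through the Laplacian term, which is how they shape the predictor without needing 3D poses.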

Semi-supervised Multi-valued Prediction
[Figure: 3D human pose x (points x1, x2, x3) against image descriptor r (points r1, r2, r3)]
Expert Ranking Assumption (Mixture of experts)
- If two image descriptors are close in their intrinsic image geometry (graph Laplacian), their 3D outputs should be smooth only if predicted by the same expert (prevent smoothing across partitions)
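
One simple way to picture "smooth only if predicted by the same expert" is to drop graph edges between samples assigned to different experts before forming the Laplacian. The sketch below assumes a hard expert assignment per sample, which is a simplification introduced for this example; the function names are hypothetical and this is not presented as the authors' formulation.

```python
import numpy as np

def expert_masked_laplacian(affinity, expert_ids):
    """Laplacian that keeps only edges between samples of the same expert."""
    affinity = np.asarray(affinity, dtype=float)
    ids = np.asarray(expert_ids)
    same = ids[:, None] == ids[None, :]
    w = np.where(same, affinity, 0.0)  # cut edges across expert partitions
    return np.diag(w.sum(axis=1)) - w

def smoothness_penalty(lap, x):
    """tr(X^T L X) = 0.5 * sum_ij w_ij * ||x_i - x_j||^2 for outputs x of shape (n, p)."""
    x = np.asarray(x, dtype=float)
    return float(np.trace(x.T @ lap @ x))
```

With the mask, a large pose difference between neighbors handled by different experts no longer contributes to the penalty, so distinct pose interpretations are not averaged together.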

Experiments
- Multi-level encodings
  - 5 hierarchical descriptors
  - ~1500d image descriptor
- Dataset of human poses
  - 56d human joint angle state vector
  - 5 motions obtained with motion capture: Walk, Pantomime, Bending Pickup, Dancing and Running
  - 3247 x 3 images + 1000 unlabeled images
- Multi-valued predictor uses 5 experts

Prediction Accuracy: Multilevel vs. Global Descriptors
- Multilevel / hierarchical descriptors perform significantly better than global histograms or single-layer (fine) grids of local descriptors

Prediction Accuracy: Before / after Metric Learning
- Metric learning reduces the prediction error for global histogram descriptors

Run Lola Run Movie: Automatic 3D Pose Reconstruction
- Integrated scanning detection window + 3D prediction
- Notice: scale change, occlusions (trees), self-occlusion, fast motion

Run Lola Run Movie: 3D Pose Reconstruction
- Integrated scanning detection window + 3D prediction
- Notice: scale change, fast motion, transparencies

We have argued for
1. Learning of hierarchical image descriptors for better generalization under shape variability and background clutter
2. Semi-supervised generalization to multi-valued prediction
Ongoing work
- Jointly learn the features and the predictor
- Scaling to large datasets (> 500K samples)