Object Recognition Using Pictorial Structures. Daniel Huttenlocher, Computer Science Department.


Joint work with Pedro Felzenszwalb, MIT AI Lab

In This Talk
- Object recognition in computer vision: brief definition and overview
- Part-based models of objects: pictorial structures for 2D modeling
- A Bayesian framework: formalize both the learning and recognition problems
- Efficient algorithms for pictorial structures: learning models from labeled examples, and recognizing objects (anywhere) in images

Object Recognition
- Given some kind of model of an object
  - Shape and geometric relations: two- or three-dimensional
  - Appearance and reflectance: color, texture, ...
  - Generic object class versus specific object
- Recognition involves
  - Detection: determining whether an object is visible in an image (or how likely)
  - Localization: determining where an object is in the image

Our Recognition Goal
- Detect and localize multi-part objects that are at arbitrary locations in a scene
- Generic object models such as person or car
- Allow for articulated objects
- Combine geometry and appearance
- Provide efficient and practical algorithms

Pictorial Structures
- Local models of appearance with non-local geometric or spatial constraints
  - Image patches describing color, texture, etc.
  - 2D spatial relations between pairs of patches
- Simultaneous use of appearance and spatial information
  - Simple part models alone are too non-distinctive

A Brief History of Recognition
- Pictorial structures date from the early 1970s
  - Practical recognition algorithms proved difficult
- Purely geometric models widely used
  - Combinatorial matching to image features
  - Dominant approach through the early 1990s
  - Don't capture appearance such as color or texture
- Appearance-based models for some tasks
  - Templates or patches of image; lose geometry
  - Generally learned from examples
  - Face recognition a common application

Other Part-Based Approaches
- Geometric part decompositions: solid modeling (e.g., Biederman, Dickinson)
- Person models: first detect local features, then apply geometric constraints of body structure (Forsyth & Fleck)
- Local image patches with geometric constraints
  - Gaussian model of the spatial distribution of parts (Burl & Perona)
  - Pictorial structure style models (Lipson et al.)

Formal Definition of Our Model
- Set of parts V = {v_1, ..., v_n}
- Configuration L = (l_1, ..., l_n)
  - Random field specifying the locations of the parts
- Appearance parameters A = (a_1, ..., a_n)
- Edge e_ij, (v_i, v_j) ∈ E, for neighboring parts
  - Explicit dependency between l_i and l_j
- Connection parameters C = {c_ij | e_ij ∈ E}
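As a concrete sketch of the model definition above, the pieces of Θ = (A, E, C) can be held in a small data structure; the field names and example values below are illustrative, not from the original implementation:

```python
from dataclasses import dataclass

# Sketch of the model Theta = (A, E, C): parts with appearance parameters,
# edges between neighboring parts, and spatial connection parameters per edge.

@dataclass
class PictorialStructure:
    parts: list          # part names v_1 .. v_n
    appearance: dict     # a_i: appearance parameters, keyed by part
    edges: list          # pairs (i, j) with an explicit l_i, l_j dependency
    connections: dict    # c_ij: spatial parameters, keyed by edge (i, j)

model = PictorialStructure(
    parts=["torso", "head"],
    appearance={"torso": {"mean_color": (90, 80, 70)},
                "head": {"mean_color": (200, 170, 150)}},
    edges=[(0, 1)],
    connections={(0, 1): {"mean_offset": (0.0, -50.0), "var": (25.0, 25.0)}},
)
```

A configuration L would then be one location per entry of `model.parts`.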

Quick Review of Probabilistic Models
- Random variable X characterizes events
  - E.g., the sum of two dice
- Distribution p(X) maps events to probabilities
  - E.g., p(2) = 1/36, p(5) = 1/9, ...
- Joint distribution p(X,Y) for multiple events
  - E.g., rolling a 2 and a 5
  - p(X,Y) = p(X)p(Y) when the events are independent
- Conditional distribution p(X|Y)
  - E.g., the sum given the value of one die
- A random field is a set of dependent random variables

Problems We Address
- Recognizing model Θ = (A, E, C) in image I
  - Find the most likely location L for the parts, or multiple highly likely locations
  - Measure how likely it is that the model is present
- Learning a model Θ from labeled example images I_1, ..., I_m and L_1, ..., L_m
  - Known form of model parameters A and C
  - E.g., constant-color rectangle: learn a_i, the average color and its variation
  - E.g., relative translation of parts: learn c_ij, the average position and its variation
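The dice example above can be checked directly; this small script builds the distribution of the sum of two dice and the joint probability of two independent rolls:

```python
from fractions import Fraction
from itertools import product

# Distribution of the sum of two fair dice: count outcomes over the 36
# equally likely pairs, as in the slide's p(2) = 1/36, p(5) = 1/9 example.

def sum_distribution():
    counts = {}
    for a, b in product(range(1, 7), repeat=2):
        counts[a + b] = counts.get(a + b, 0) + 1
    return {s: Fraction(c, 36) for s, c in counts.items()}

p = sum_distribution()
# p[2] -> 1/36, p[5] -> 4/36 = 1/9

# Joint distribution of two independent rolls: p(x, y) = p(x) p(y).
p_roll = Fraction(1, 6)
p_2_and_5 = p_roll * p_roll  # probability of (first die = 2, second die = 5)
```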

Standard Bayesian Approach
- Estimate the posterior distribution p(L|I,Θ)
  - Probabilities of various configurations L given image I and model Θ
  - Find the maximum (MAP) or high values (sampling)
- Proportional to p(I|L,Θ) p(L|Θ) [Bayes' rule]
  - Likelihood p(I|L,Θ): seeing image I given the configuration and model
    - For fixed L, depends only on appearance: p(I|L,A)
  - Prior p(L|Θ): obtaining configuration L given just the model
    - No image; depends only on the constraints: p(L|E,C)

Class of Models
- Computational difficulty depends on Θ
  - Form of the posterior distribution
  - Structure of the graph G = (V,E) is important
- G represents a Markov Random Field (MRF)
  - Each random variable depends explicitly on its neighbors
  - Require G to be a tree
- Prior on relative location: p(L|E,C) = ∏_{(v_i,v_j)∈E} p(l_i, l_j | c_ij)
  - Natural for models of animate objects (skeleton)
  - Reasonable for many other objects with a central reference part (star graph)
  - Prior can be computed efficiently
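The factorization above, p(L|I,Θ) ∝ p(I|L,A) p(L|E,C) with the prior a product over tree edges, can be illustrated numerically; all probabilities below are made-up toy values:

```python
import math

# Unnormalized posterior for a tree model: likelihood is a product over
# parts, prior is a product over edges (here indexed by discrete locations).

def posterior_unnorm(L, part_lik, edge_prior, edges):
    lik = math.prod(part_lik[i][L[i]] for i in range(len(L)))
    prior = math.prod(edge_prior[(i, j)][L[i]][L[j]] for (i, j) in edges)
    return lik * prior

part_lik = [[0.7, 0.3], [0.4, 0.6]]            # p(I | l_i, a_i) per candidate location
edge_prior = {(0, 1): [[0.8, 0.2], [0.1, 0.9]]}  # p(l_0, l_1 | c_01)
p = posterior_unnorm((0, 1), part_lik, edge_prior, [(0, 1)])  # 0.7 * 0.6 * 0.2
```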

Class of Models (continued)
- Likelihood: p(I|L,A) = ∏_i p(I | l_i, a_i)
  - Product of individual likelihoods for the parts
  - Good approximation when the parts don't overlap
- Form of the connection also important
  - Transformed space with a deformation distance: p(l_i, l_j | c_ij) ∝ N(T_ij(l_i) - T_ji(l_j), 0, Σ_ij)
  - Normal distribution in the transformed space: T_ij, T_ji capture the ideal relative locations of the parts, and Σ_ij measures the deformation
  - Mahalanobis distance in the transformed space (weighted squared Euclidean distance)

Bayesian Formulation of Learning
- Given example images I_1, ..., I_m with configurations L_1, ..., L_m
  - Supervised or labeled learning problem
- Obtain estimates for the model Θ = (A, E, C)
- Maximum likelihood (ML) estimate is
  argmax_Θ p(I_1, ..., I_m, L_1, ..., L_m | Θ)
  = argmax_Θ ∏_k p(I_k, L_k | Θ) for independent examples
- Rewrite the joint probability as a product; appearance and dependencies separate:
  argmax_Θ ∏_k p(I_k | L_k, A) ∏_k p(L_k | E, C)
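The Gaussian connection term amounts to a weighted squared distance in the transformed space; here is a minimal sketch with a diagonal Σ_ij, where the transforms and numbers are illustrative assumptions:

```python
# -log p(l_i, l_j | c_ij) up to a constant: a Mahalanobis distance between
# transformed locations. sigma_diag holds the diagonal of Sigma_ij.

def connection_cost(l_i, l_j, T_ij, T_ji, sigma_diag):
    diff = [a - b for a, b in zip(T_ij(l_i), T_ji(l_j))]
    return 0.5 * sum(d * d / s for d, s in zip(diff, sigma_diag))

# Example transforms: the ideal offset of part j relative to part i is (0, 50),
# so T_ji subtracts it; the cost is zero exactly at the ideal relative location.
T_ij = lambda l: (float(l[0]), float(l[1]))
T_ji = lambda l: (float(l[0]), float(l[1]) - 50.0)
sigma = (25.0, 25.0)

cost_ideal = connection_cost((10, 0), (10, 50), T_ij, T_ji, sigma)  # 0 at ideal offset
cost_off = connection_cost((10, 0), (10, 60), T_ij, T_ji, sigma)    # grows with deformation
```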

Efficiently Learning Models
- Estimating appearance p(I_k | L_k, A)
  - ML estimation for a particular type of part
  - E.g., for a constant-color patch use a Gaussian model, computing the mean color and covariance
- Estimating dependencies p(L_k | E, C)
  - Estimate C for pairwise locations, p(l_ik, l_jk | c_ij)
  - E.g., for translation, compute the mean offset between parts and the variation in offset
  - Best tree using a minimum spanning tree (MST) algorithm: pairs with the smallest relative spatial variation

Example: Generic Face Model
- Each part is a local image patch
  - Represented as responses to oriented filters
  - Vector a_i corresponding to each part
- Pairs of parts constrained in terms of their relative (x,y) position in the image
- Consider two models: 5 parts and 9 parts
  - 5 parts: eyes, tip of nose, corners of mouth
  - 9 parts: each eye split into pupil, left side, right side
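Choosing the tree structure as described, with edge weights given by relative spatial variation, can be sketched with Kruskal's algorithm and a tiny union-find; the part locations below are made-up training data:

```python
# Edge weight: variance of the offset between two parts across examples.
def offset_variance(locs_a, locs_b):
    offs = [(ax - bx, ay - by) for (ax, ay), (bx, by) in zip(locs_a, locs_b)]
    mx = sum(o[0] for o in offs) / len(offs)
    my = sum(o[1] for o in offs) / len(offs)
    return sum((ox - mx) ** 2 + (oy - my) ** 2 for ox, oy in offs) / len(offs)

# Kruskal's MST over all part pairs, keeping edges with the least variation.
def mst(n_parts, weights):  # weights: {(i, j): w}
    parent = list(range(n_parts))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    tree = []
    for (i, j), _ in sorted(weights.items(), key=lambda kv: kv[1]):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree.append((i, j))
    return tree

# Three parts, two training examples: the offset between parts 0 and 1 is
# constant (zero variance), so edge (0, 1) is chosen over noisier pairs.
locs = {0: [(0, 0), (10, 10)], 1: [(5, 0), (15, 10)], 2: [(0, 9), (30, 2)]}
w = {(i, j): offset_variance(locs[i], locs[j]) for i in range(3) for j in range(i + 1, 3)}
edges = mst(3, w)
```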

Learned 9-Part Face Model
- Appearance and structure parameters learned from labeled frontal views
- Structure captures the pairs with the most predictable relative location (least uncertainty)
- Gaussian (covariance) model captures the direction of spatial variations, which differs per part

Example: Generic Person Model
- Each part represented as a rectangle
  - Fixed width, varying length: learn the average and variation
- Connections approximate revolute joints
  - Joint location, relative position, orientation, foreshortening
  - Estimate the average and variation
- Learned 10-part model
  - All parameters learned, including joint locations
  - Shown at its ideal configuration

Bayesian Formulation of Recognition
- Given model Θ and image I, seek a good configuration L
- Maximum a posteriori (MAP) estimate
  - Best (highest-probability) configuration: L* = argmax_L p(L|I,Θ)
- Sampling from the posterior distribution
  - Values of L where p(L|I,Θ) is high
  - With some other measure for testing the hypotheses
- Brute-force solutions are intractable
  - With n parts and s possible discrete locations per part: O(s^n)

Efficiently Recognizing Objects
- MAP estimation algorithm
  - Tree structure allows use of Viterbi-style dynamic programming
  - O(ns^2) rather than O(s^n) for s locations and n parts
  - Still too slow to be useful in practice (s in the millions)
- New dynamic programming method for finding the best pairwise locations in linear time
  - Resulting O(ns) method
  - Requires a distance, not an arbitrary cost
- Similar techniques allow sampling from the posterior distribution in O(ns) time

The Minimization Problem
- Recall that the best location is
  L* = argmax_L p(L|I,Θ) = argmax_L p(I|L,A) p(L|E,C)
- Given the graph structure (MRF), there are just pairwise dependencies:
  L* = argmax_L ∏_V p(I | l_i, a_i) ∏_E p(l_i, l_j | c_ij)
- Standard approach is to take the negative log:
  L* = argmin_L Σ_V m_j(l_j) + Σ_E d_ij(l_i, l_j)
  - m_j(l_j) = -log p(I | l_j, a_j): how well part v_j matches the image at l_j
  - d_ij(l_i, l_j) = -log p(l_i, l_j | c_ij): how well locations l_i, l_j agree with the model

Minimizing Over Tree Structures
- Use dynamic programming to minimize Σ_V m_j(l_j) + Σ_E d_ij(l_i, l_j)
- Can express as a function for pairs, B_j(l_i)
  - Cost of the best location of v_j given location l_i of v_i
- Recursive formulas in terms of the children C_j of v_j:
  B_j(l_i) = min_{l_j} ( m_j(l_j) + d_ij(l_i, l_j) + Σ_{c∈C_j} B_c(l_j) )
  - For a leaf node there are no children, so the last term is empty
  - For the root node there is no parent, so the second term is omitted
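The recursion above can be sketched as a bottom-up tree dynamic program; the brute-force inner minimization over l_j makes this the O(ns^2) version (the distance-transform speedup is discussed on the following slides). All inputs are illustrative:

```python
# Compute B_j(l_i) bottom-up over a tree of parts, then backtrack to read off
# the minimizing configuration (one discrete location index per part).

def best_config(locations, match, pair_cost, tree, root=0):
    """locations[j]: candidate locations of part j; match[j][lj]: -log
    appearance likelihood; pair_cost(i, j, li, lj): -log prior term;
    tree: {parent: [children]}."""
    B, arg = {}, {}
    def down(j, parent):
        for c in tree.get(j, []):
            down(c, j)
        if parent is None:
            return
        B[j], arg[j] = [], []
        for li in range(len(locations[parent])):
            costs = [match[j][lj] + pair_cost(parent, j, li, lj)
                     + sum(B[c][lj] for c in tree.get(j, []))
                     for lj in range(len(locations[j]))]
            best = min(range(len(costs)), key=costs.__getitem__)
            B[j].append(costs[best]); arg[j].append(best)
    down(root, None)
    root_costs = [match[root][l] + sum(B[c][l] for c in tree.get(root, []))
                  for l in range(len(locations[root]))]
    L = {root: min(range(len(root_costs)), key=root_costs.__getitem__)}
    stack = [root]
    while stack:  # backtrack from the root to recover child locations
        p = stack.pop()
        for c in tree.get(p, []):
            L[c] = arg[c][L[p]]
            stack.append(c)
    return L

# Two parts, three locations each; pair cost is |li - lj| (a distance).
locations = {0: [0, 1, 2], 1: [0, 1, 2]}
match = {0: [5, 0, 5], 1: [5, 5, 0]}
L = best_config(locations, match, lambda i, j, li, lj: abs(li - lj), {0: [1]})
```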

Running Time
- Compute the minimum using these equations
  - Start with the leaf nodes, building up sub-trees
- O(ns^2) running time for n parts and s locations of each part
  - Each part pair defines one equation B_j(l_i)
  - O(s^2) time per pair, O(n) pairs
- When d_ij is a distance, we don't need to consider all location pairs
  - Define B_j(l_i) as a kind of distance transform
  - For each location of v_i, the minimizing location of v_j

Classical Distance Transforms
- Defined for a set of points P: D_P(x) = min_{y∈P} ||x - y||
  - For each location x, the distance to the nearest y in P
  - Think of cones rooted at each point of P
- Commonly computed on a grid Γ using
  D_P(x) = min_{y∈Γ} ( ||x - y|| + 1_P(y) )
  where 1_P(y) = 0 when y ∈ P, otherwise ∞
[Figure: small grid of distance values around the points of P]
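The grid computation above admits a two-pass algorithm for the L1 norm, as the next slide notes; here is a 1-D sketch, where each pass takes a running minimum with a neighbor plus one:

```python
# Two-pass L1 distance transform on a 1-D grid, O(s) for s cells.

INF = float("inf")

def l1_distance_transform(indicator):
    """indicator[i] == 0 for grid cells in P, INF otherwise."""
    d = list(indicator)
    for i in range(1, len(d)):            # forward pass
        d[i] = min(d[i], d[i - 1] + 1)
    for i in range(len(d) - 2, -1, -1):   # backward pass
        d[i] = min(d[i], d[i + 1] + 1)
    return d

# P = {2, 6} on a grid of 8 cells
grid = [INF, INF, 0, INF, INF, INF, 0, INF]
dt = l1_distance_transform(grid)  # [2, 1, 0, 1, 2, 1, 0, 1]
```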

Computing Distance Transforms
- Two-pass algorithm for the L1 norm
  - O(sD) time for s locations on a D-dimensional grid
  - On each pass, take the min of the sum of a mask and the distance array (in place)
- Simple method to approximate L_p norms
- More involved exact method for L2 that also reports which point is closest
[Figure: example grids illustrating the two-pass computation]

Generalized Distance Transforms
- Replace the indicator function with an arbitrary function f:
  D_f(x) = min_{y∈Γ} ( ||x - y|| + f(y) )
- Intuitively, for grid location x, find a y where f(y) plus the distance to x is small
  - A distance plus a cost for each location
- The change in D_f(x) is bounded by the change in x
  - A small value of f dominates nearby large values
- This generalized distance transform (GDT) is computed the same way as the classic DT
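The claim that the GDT is computed the same way as the classic DT is easy to see in the L1 case: the two-pass algorithm is unchanged except that the array is initialized with an arbitrary cost f(y) instead of a 0/∞ indicator. A sketch with toy costs:

```python
# Generalized L1 distance transform: D_f(x) = min_y ( |x - y| + f(y) ),
# computed by the same forward/backward passes as the classic transform.

def gdt_l1(f):
    d = list(f)
    for i in range(1, len(d)):
        d[i] = min(d[i], d[i - 1] + 1)
    for i in range(len(d) - 2, -1, -1):
        d[i] = min(d[i], d[i + 1] + 1)
    return d

costs = [9, 9, 1, 9, 0, 9]
out = gdt_l1(costs)  # [3, 2, 1, 1, 0, 1]: small costs dominate nearby large ones
```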

O(ns) Algorithm for MAP Estimate
- Can express B_j(l_i) in the recursive minimization formulas as a GDT, D_f(T_ij(l_i))
  - Cost function for the GDT: f(y) = m_j(T_ji^{-1}(y)) + Σ_{c∈C_j} B_c(T_ji^{-1}(y))
  - T_ij maps locations to a space where the difference between l_i and l_j is a squared distance
  - The distance is zero at the ideal relative locations
- Have n recursive equations
  - Each can be computed in O(sD) time
  - D is the number of dimensions of the parameter space, but it is fixed (in our case D is 2 to 4)

Recognizing Faces
- Generic model of a frontal view
  - Using the learned 5- and 9-part models
  - Local oriented filters for the parts
- Relatively small spatial variation in part locations
  - Similar overall size and orientation of the face
- MAP estimation to find the best match
  - Posterior estimate of configuration L is accurate because the parts do not overlap
- Consider all possible locations in the image
- Runs at several frames per second on a desktop workstation

Example: Recognizing Faces
[Figure: face detection results]

Example: Recognizing People
- Frontal-view models
  - Generic model using binary rectangles for parts, matched to a difference image
  - Specific model using color rectangles for parts, matched to the original image
- Sampling the posterior to find good matches
  - Posterior estimate of L can be high for several configurations due to the overlap of parts
  - Use the best of the samples, measured using correlation (Chamfer matching)
- Search over all locations runs in under a minute

Sampling the Posterior
- Generate good possible matches as hypotheses
  - Locations where p(L|I,Θ) is large
  - Validate or compare using another technique; here a correlation-like measure (Chamfer)
- Computation similar to MAP estimation
  - Recursive equations, one per part
- Ability to solve each equation in linear time
  - Via convolution with a Gaussian
  - Linear-time dynamic programming approximation using box filters (due to Wells)

Example: Recognizing People
[Figure: person detection results]
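Once the per-part recursive equations are solved, drawing a configuration from a tree-structured posterior reduces to ancestral sampling: pick the root location from its marginal weights, then sample each child given its parent. The weights below are illustrative numbers, not values from the talk:

```python
import random

# Ancestral sampling over a tree of parts with discrete candidate locations.
# root_weights: unnormalized marginal for the root; child_given_parent[(p, c)]
# gives, for each parent location, the weights over the child's locations.

def sample_tree(root_weights, child_given_parent, tree, root=0, rng=random):
    L = {root: rng.choices(range(len(root_weights)), weights=root_weights)[0]}
    stack = [root]
    while stack:
        p = stack.pop()
        for c in tree.get(p, []):
            w = child_given_parent[(p, c)][L[p]]
            L[c] = rng.choices(range(len(w)), weights=w)[0]
            stack.append(c)
    return L

tree = {0: [1]}
root_w = [1.0, 3.0]                         # two candidate root locations
cond = {(0, 1): [[0.9, 0.1], [0.2, 0.8]]}   # child weights per parent location
samples = [sample_tree(root_w, cond, tree) for _ in range(200)]
```

Each sample is then scored with the separate Chamfer-style measure described above.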

Variety of Poses
[Figures: example matches across a variety of poses]

Samples From Posterior
[Figure: sampled configurations]

Model of Specific Person
[Figure: results with a person-specific model]

Summary
- Pictorial structures combine local part appearance and global spatial constraints
  - Don't try to localize parts first; exploit context
  - Suitable for generic models of object classes
- Bayesian framework provides a natural learning problem
  - ML estimation
  - Only requires placing part models in images; structure and parameters are learned
- Practical algorithms for searching over all possible locations in an image
  - Best match or good matches (high posterior)

What's Next
- Allow for occluded parts
  - Make the part likelihood p(I | l_i, a_i) a robust measure
- Apply to tracking people in video
  - Incorporate the location at the previous time frame into the prior
  - Use for more efficient methods
- Start with generic models and use them to learn person-specific models
  - Discriminate between people
  - Use the person and face methods together