Combining PGMs and Discriminative Models for Upper Body Pose Detection

Similar documents
Estimating Human Pose in Images. Navraj Singh December 11, 2009

Part-Based Models for Object Class Recognition Part 2

Part-Based Models for Object Class Recognition Part 2

High Level Computer Vision

Structured Models in. Dan Huttenlocher. June 2010

CS231A Course Project Final Report Sign Language Recognition with Unsupervised Feature Learning

Category vs. instance recognition

Tri-modal Human Body Segmentation

Object Detection with Partial Occlusion Based on a Deformable Parts-Based Model

Body Parts Dependent Joint Regressors for Human Pose Estimation in Still Images

Multiple-Person Tracking by Detection

HOG-based Pedestriant Detector Training

Mobile Human Detection Systems based on Sliding Windows Approach-A Review

Detecting and Parsing of Visual Objects: Humans and Animals. Alan Yuille (UCLA)

Deep Tracking: Biologically Inspired Tracking with Deep Convolutional Networks

Tracking People. Tracking People: Context

CS 231A Computer Vision (Fall 2011) Problem Set 4

COSC160: Detection and Classification. Jeremy Bolton, PhD Assistant Teaching Professor

Object Category Detection: Sliding Windows

Face detection and recognition. Detection Recognition Sally

The Caltech-UCSD Birds Dataset

Object recognition (part 1)

Detecting Object Instances Without Discriminative Features

Object Detection Design challenges

Learning and Inferring Depth from Monocular Images. Jiyan Pan April 1, 2009

Data-driven Depth Inference from a Single Still Image

Clustered Pose and Nonlinear Appearance Models for Human Pose Estimation

A Keypoint Descriptor Inspired by Retinal Computation

Strong Appearance and Expressive Spatial Models for Human Pose Estimation

OCCLUSION BOUNDARIES ESTIMATION FROM A HIGH-RESOLUTION SAR IMAGE

Automatic Tracking of Moving Objects in Video for Surveillance Applications

Articulated Pose Estimation by a Graphical Model with Image Dependent Pairwise Relations

Hand Posture Recognition Using Adaboost with SIFT for Human Robot Interaction

Feature descriptors. Alain Pagani Prof. Didier Stricker. Computer Vision: Object and People Tracking

Object detection using non-redundant local Binary Patterns

Supplementary: Cross-modal Deep Variational Hand Pose Estimation

Generic Face Alignment Using an Improved Active Shape Model

Face Recognition using Eigenfaces SMAI Course Project

Pose Machines: Articulated Pose Estimation via Inference Machines

Supplementary Material Estimating Correspondences of Deformable Objects In-the-wild

Easy Minimax Estimation with Random Forests for Human Pose Estimation

Category-level localization

Efficient Detector Adaptation for Object Detection in a Video

Poselet Conditioned Pictorial Structures

Robust PDF Table Locator

Study of Viola-Jones Real Time Face Detector

CAP 6412 Advanced Computer Vision

Selective Search for Object Recognition

An Implementation on Histogram of Oriented Gradients for Human Detection

EE368 Project: Visual Code Marker Detection

Kinect Cursor Control EEE178 Dr. Fethi Belkhouche Christopher Harris Danny Nguyen I. INTRODUCTION

The Kinect Sensor. Luís Carriço FCUL 2014/15

Object Detection by 3D Aspectlets and Occlusion Reasoning

Spatial Localization and Detection. Lecture 8-1

2D Image Processing Feature Descriptors

Human Detection and Tracking for Video Surveillance: A Cognitive Science Approach

Modeling 3D viewpoint for part-based object recognition of rigid objects

Detection of Multiple, Partially Occluded Humans in a Single Image by Bayesian Combination of Edgelet Part Detectors

Data driven 3D shape analysis and synthesis

Traffic Signs Recognition using HP and HOG Descriptors Combined to MLP and SVM Classifiers

Project 3 Q&A. Jonathan Krause

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Segmentation and Tracking of Partial Planar Templates

Deformable Part Models

Linear combinations of simple classifiers for the PASCAL challenge

Qualitative Pose Estimation by Discriminative Deformable Part Models

Combining Discriminative Appearance and Segmentation Cues for Articulated Human Pose Estimation

A novel template matching method for human detection

Combining Selective Search Segmentation and Random Forest for Image Classification

Pedestrian Detection Using Structured SVM

CS 231A Computer Vision (Fall 2012) Problem Set 3

Human Upper Body Pose Estimation in Static Images

Definition, Detection, and Evaluation of Meeting Events in Airport Surveillance Videos

Articulated Pose Estimation with Flexible Mixtures-of-Parts

CS 231A Computer Vision (Fall 2012) Problem Set 4

Region-based Segmentation and Object Detection

CS 223B Computer Vision Problem Set 3

DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution and Fully Connected CRFs

Critique: Efficient Iris Recognition by Characterizing Key Local Variations

Decomposing a Scene into Geometric and Semantically Consistent Regions

HUMAN PARSING WITH A CASCADE OF HIERARCHICAL POSELET BASED PRUNERS

Structured Completion Predictors Applied to Image Segmentation

CRF Based Point Cloud Segmentation Jonathan Nation

Human Activity Recognition Using Multidimensional Indexing

Lecture 10: Semantic Segmentation and Clustering

Supplementary Material: Decision Tree Fields

Human Body Recognition and Tracking: How the Kinect Works. Kinect RGB-D Camera. What the Kinect Does. How Kinect Works: Overview

Artificial Neuron Modelling Based on Wave Shape

Test-time Adaptation for 3D Human Pose Estimation

Recognition of Animal Skin Texture Attributes in the Wild. Amey Dharwadker (aap2174) Kai Zhang (kz2213)

Human detection using local shape and nonredundant

Using k-poselets for detecting people and localizing their keypoints

CSE/EE-576, Final Project

Lecture 1 Notes. Outline. Machine Learning. What is it? Instructors: Parth Shah, Riju Pahwa

Supporting Information

Face Detection and Alignment. Prof. Xin Yang HUST

A System of Image Matching and 3D Reconstruction


Object Category Detection. Slides mostly from Derek Hoiem

Deep Learning With Noise

Transcription:

Combining PGMs and Discriminative Models for Upper Body Pose Detection Gedas Bertasius May 30, 2014 1 Introduction In this project, I utilized probabilistic graphical models together with discriminative models such as SVM to perform upper body pose estimation. Figure 1 illustrates the definition of a problem in terms of input and output. Pose estimation is an important problem that has a wide array of applications such as pedestrian detection, sports video analysis, action recognition, etc. The entire human body can be seen as a graph where each body part is the node in the graph and two adjacent body parts are connected via an edge. Therefore, it is natural to utilize probabilistic graphical models to solve pose detection related problems. Additionally, the idea of incorporating discriminative models has been shown to be very successful as well, specifically in [1] [2]. Intuitively it makes sense that combination of generative and discriminative models should produce a model that is more powerful than each of these individually. Therefore, my hope is that combining these two techniques will produce solid results for this challenging problem. Figure 1: Illustration of a pose estimation problem in terms of input and output. Given an RGB image we want to detect joint locations in human body. In this specific project, I focused on the upper body estimation, which made the problem slightly easier. 1

2 Dataset For this project, I utilized LSP (Leeds Sports Pose) dataset, which is an extremely challenging dataset and consists of 10000 training images and 2000 training images. What makes this dataset so challenging is a large variation of human poses, different scales, in which humans appear, the fact that some of the body parts are occluded and also multiple humans appearing in the same image. Some of the examples from the dataset are illustrated in Figure 2. To reduce computational complexity I used 1500 training and 300 training images in this case. Additionally, to make the problem a bit easier I eliminated some of the examples where most of the parts were occluded. Figure 2: Sample images from Leeds Sports Dataset. As the examples illustrate, the dataset is extremely challenging because it contains a wide array of unusual poses. Additionally, details as variance in occlusion, illumination, etc makes the dataset even more difficult. 3 Proposed Method The basic methodology that I will be using was inspired by [2] [1]. The main idea behind the proposed method is to model a pose detection problem in the form of a PGM where each node in the graph is a joint in humans body (elbow, shoulder, knee, etc) and the edges in PGM simply denote the connection between two adjacent body joints. The method can be decomposed into three main stages: constructing the unary potentials, building the pairwise potentials, and then performing MAP inference. As already mentioned,the main power of the method comes from using discriminative models to construct unary and pairwise potentials and then incorporating these for the inference. I will now discuss each of these steps in greater detail. 3.1 Unary Potentials The basic idea behind constructing unary potentials is as follows. Intuitively, we want our unary potentials to capture the likelihood that a given region in the image contains some specific body part just as shown in Figure 3. I will now describe how to actually compute unary potentials. For simplicity, let s consider an example where we want to evaluate probabilities pertaining to only one human body part Head. Given any region in an image we want to compute the probability of that particular region containing head. We can do so 2

Figure 3: Illustration of broad idea behind constructing unary potentials. We simply want to evaluate the likelihood that a given region in the image contains some specific body part. by utilizing one-vs-all SVM classifier in the following way. First, we use ground truth labels in our training data to collect regions that contain head. We use these regions as positive examples in the proposed SVM framework. We also sample a large collection of examples that do not contain head and use these as negative examples. Using this setup, we then train one-vs-all SVM classifier, which will act as our head detector. We do this for every part of human body, which we want to consider. After training all of these classifiers we can compute a probability that a given image region contains a specific body part in the following way. First, we extract features that represent a given image region (I used HOG features in this case). Then we use these features as an input to our trained SVM classifiers. Each of these SVM classifiers will output a probability that a given region contains a particular body part. The illustration of this approach is presented in Figure 4. Figure 4: To construct unary potentials we utilize one-vs-all SVM framework. First, using the ground truth labels from our training data we collect positive examples (image regions that contain human body part of interest). Then we collect a large sample of random regions that do not contain the body part of interest. Using this setup we can train one-vs-all SVM classifier, which will act as a detector for that particular body part. As already mentioned, to represent the images, I utilized HOG features, which provided 81 dimensional feature representation. Such a low dimensional representation provided very efficient framework but at the same time enforced a limit on the power of the model as will be discussed later. 3

3.2 Pairwise Potentials In addition to unary potentials, I also needed to construct pairwise potentials for our PGM model. The intuitive idea behind pairwise potential construction is to learn a model that would tell us how well two given body part candidates fit together. For example, one would imagine that in the traditional setup the location of left shoulder should be below and to the left of head s position. Based on such a model, this configuration would give us high probability, whereas some two randomly generated configurations should give very low probabilities. Now I will present specific details how to implement this idea. Given two candidate regions that may or may not contain specific pair of body parts we want to evaluate how well this pair fits to each other. We can represent these regions in terms of (x, y) coordinates. Then, using this pair of coordinates we can easily compute the distance between two given regions on the horizontal and vertical axis. Additionally, we can compute the angle between these two locations with the center of the image being our origin points. Then, we can concatenate all of these metrics into a single feature vector that would capture relative distance and relative position between these two given regions. To build our model, once again we turn to one-vs-all SVM framework. Using the ground truth labels from our training data we can sample the coordinates of each pair of body parts that are connected in our model. Using these coordinates, we can then build a feature vector that captures relative distance and position of these two body parts as described earlier, and use these features as our positive examples. For the negative examples, we simply sample a random pair of locations and build the feature vector in the same way. Using these feature vectors we can once again train one-vs-all SVM classifier specific to each pair of connected body parts in our model. We have to train these SVM classifiers for every pair of body parts that have edges between them in our specified PGM. Figure 5 illustrates the basic intuition how to construct pairwise potentials. Figure 5: We want to construct pairwise potentials in such a way so that given two candidate locations of adjacent body parts we could evaluate how well these two locations fit together based on our trained SVM model. 4

3.3 Inference To perform inference I used an external software that implemented inference algorithm presented in [3]. The algorithm provides an approximate inference and is based on Linear Programming techniques. Since I did not study this algorithm in much detail but simply used an already existing implementation, I will not discuss it in any more details. The key thing here is that ir provides an efficient MAP inference with solid results as illustrated in [3]. 4 Experiments 4.1 Quantitative Results In this section, I present the results produced by some state of the art methods in Figure 6 and the results produced by my method in Table 1. It is important to notice that the direct comparison between them cannot be done accurately because the presented state of the art methods predict the actual body parts whereas my method predicts the joints in the human body. However, looking into these results we can still make several observations. First, the results produced by my method suggests that the performance of the method is not that great. We will discuss the reasons for that in the conclusion. However, there are also some positive things related to my proposed method. Firstly, my method is much more efficient than the proposed state of the art methods. Whereas my method can perform testing in 10 seconds per image, state of the art methods can take several minutes to label body parts for one image. Additionally, it is worth noticing that state of the methods struggle with arm predictions. That includes both lower and upper arms. However, as illustrated in Table 1, my proposed method is actually pretty good in predicting upper arm configurations such as shoulders and elbows. Therefore, it may be possible to improve state of the art accuracy on arm predictions by incorporating some of the details from my proposed method. Figure 6: Body part detection results by state of the art methods. Additionally, in Figure 7 I present some additional results produced by my method. Intuitively this figure could be seen as a precision recall curve. Here is the idea of how I generated this figure. The threshold on the x axis depicts what is the maximum distance between the prediction and the ground truth label such that prediction is still considered to be correct. For instance, if we set the threshold to 0 that means that the prediction has to be exactly on the 5

Joint Accuracy Right Wrist 0.186 Right Elbow 0.397 Right Shoulder 0.487 Left Shoulder 0.482 Left Elbow 0.492 Left Wrist 0.236 Neck 0.437 Head Top 0.432 Torso 0.402 Table 1: Results that were produced by my method. In this case my method is predicting the locations of joints in the upper human body. location where the ground truth label is marked. Naturally, as we increase the threshold the accuracy increases as well. The key observations from this figure are similar to what we discussed earlier. Firstly, it is clear that the method performs pretty well for joints such as shoulders and elbows. However, the method produces pretty poor accuracy for wrists even as we increase the threshold. This can be expected since wrists are highly dynamic joints and are probably among the most difficult joints to identify correctly. Figure 7: The detection accuracy rates as we allow predictions to be further away from the ground truth labels. 4.2 Qualitative Results Additionally, to give an intuition of how the proposed method works in practice, I provide some qualitative visualization of the results. Figure 8 depicts predictions that look relatively good whereas Figure 9 illustrates predictions that are quite poor. It is clear that the method performs reasonably well with 6

the poses that are close to standard vertical standing pose. This is good since it obviously means that our model is able to capture at least some structure in human pose. However, in the more difficult cases such as the ones presented in Figure 9, the method is clearly not performing well. This suggests that some of the modeling decisions that I made simplified the model too much and negatively affected the performance. In the conclusion, I will discuss some possible reasons why the model is not performing as well as one would desire. Figure 8: Predictions produced by my proposed method that look relatively OK. Figure 9: Predictions by my proposed method that illustrate the cases where the method performs poorly. 7

5 Conclusions and Future Work After many arduous hours of debugging the code and trying to understand why my proposed method does not work better, I came up with the following reasons, which may explain some weaknesses in my proposed method. Firstly, the main reason that significantly impacts method s performance is the image representation. As already mentioned, in this project I used HOG representation, which provided a 81 dimensional vector representation. This is clearly not enough to capture all of the intricacies of the context in the image. For instance, the authors in [1] used a pretty complicated feature construction scheme that employs several descriptors applied on different orientations and scales and then a concatenation of all of these descriptors. Due to unclear descriptions, and lack of time I was not able to experiment with such complicated features. However, such features would have made a significant impact and I believe would have made my method much closer to state of the art results. Another very important reason that may have degraded the performance was the modeling of pairwise potentials. The authors in [1] used similar methodology as I did. However, in addition to modeling image regions as simply (x, y) coordinates they also introduced scale and orientation, which gave a lot of extra information to the classifier. Additionally, in [1] the entire relationship between two adjacent body parts is represented as a transformation into another space, which is then treated as a Gaussian. This representation is also clearly much more powerful than the one I used. However, due to many technicalities involved in this scheme I was not able to implement it in a given time. I also identified some secondary reasons that may have contributed a little bit to the quality of the performance. As I mentioned, I utilized an approximate inference method. Even though, it is shown to yield solid results in the paper it may still not match Sum-Product algorithm, which performs inference exactly. I used this approximate inference scheme because I experimented with non-tree structured models, for which exact inference is intractable. In addition, for my unary and pairwise potential learning, I utilized linear SVMs. This provided me with very efficient framework at the prediction stage, but may have degraded performance a little bit. A non-linear classifier could have provided more power to the model. Overall, even though the results are not as good as I was expecting, I am still pretty happy with the progress I made. I presented a pretty simple and an extremely efficient method of performing pose estimation. As demonstrated in the results, the proposed method actually works pretty well for some certain body parts such as shoulders or elbows. I believe that with some minor tweaks and fixes, which I outlined in this section the results could be made significantly better. References [1] Mykhaylo Andriluka, Stefan Roth, and Bernt Schiele. Pictorial structures revisited: People detection and articulated pose estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2009. Best Paper Award Honorable Mention by IGD. 8

[2] Mykhaylo Andriluka, Stefan Roth, and Bernt Schiele. Discriminative appearance models for pictorial structures. International Journal of Computer Vision (IJCV), 99(3):259 280, 2012. [3] Marius Leordeanu and Martial Hebert. Efficient map approximation for dense energy functions. In ICML, pages 545 552, 2006. 9