Combining PGMs and Discriminative Models for Upper Body Pose Detection

Gedas Bertasius

May 30, 2014

1 Introduction

In this project, I combine probabilistic graphical models (PGMs) with discriminative models such as SVMs to perform upper body pose estimation. Figure 1 illustrates the problem in terms of its input and output. Pose estimation is an important problem with a wide array of applications, such as pedestrian detection, sports video analysis, and action recognition.

The human body can be viewed as a graph in which each body part is a node and two adjacent body parts are connected by an edge. It is therefore natural to use probabilistic graphical models for pose detection problems. Incorporating discriminative models has also been shown to be very successful, specifically in [1] [2]. Intuitively, a combination of generative and discriminative models should be more powerful than either one individually, so my hope is that combining these two techniques will produce solid results for this challenging problem.

Figure 1: Illustration of a pose estimation problem in terms of input and output. Given an RGB image, we want to detect the joint locations of the human body. In this project, I focused on upper body estimation, which made the problem slightly easier.
2 Dataset

For this project, I used the LSP (Leeds Sports Pose) dataset, an extremely challenging dataset that consists of 10000 training images and 2000 test images. What makes this dataset so challenging is the large variation in human poses, the different scales at which humans appear, the fact that some body parts are occluded, and the presence of multiple humans in the same image. Some examples from the dataset are shown in Figure 2. To reduce computational complexity, I used 1500 training and 300 test images. Additionally, to make the problem a bit easier, I eliminated some of the examples in which most of the parts were occluded.

Figure 2: Sample images from the Leeds Sports Pose dataset. As the examples illustrate, the dataset is extremely challenging because it contains a wide array of unusual poses. Factors such as variation in occlusion and illumination make the dataset even more difficult.

3 Proposed Method

The basic methodology that I use was inspired by [2] [1]. The main idea is to model the pose detection problem as a PGM in which each node is a joint of the human body (elbow, shoulder, knee, etc.) and each edge denotes the connection between two adjacent body joints. The method can be decomposed into three main stages: constructing the unary potentials, building the pairwise potentials, and performing MAP inference. As already mentioned, the main power of the method comes from using discriminative models to construct the unary and pairwise potentials and then incorporating these into the inference. I will now discuss each of these steps in greater detail.

3.1 Unary Potentials

Intuitively, we want the unary potentials to capture the likelihood that a given region in the image contains a specific body part, as shown in Figure 3.
I will now describe how to actually compute the unary potentials. For simplicity, let's consider an example in which we evaluate probabilities for only one body part, the head. Given any region in an image, we want to compute the probability that the region contains a head. We can do so by using a one-vs-all SVM classifier in the following way. First, we use the ground truth labels in our training data to collect regions that contain a head, and use these regions as positive examples. We also sample a large collection of regions that do not contain a head and use these as negative examples. With this setup, we train a one-vs-all SVM classifier, which acts as our head detector. We repeat this for every body part that we want to consider.

Figure 3: Illustration of the broad idea behind constructing unary potentials. We simply want to evaluate the likelihood that a given region in the image contains a specific body part.

After training all of these classifiers, we can compute the probability that a given image region contains a specific body part as follows. First, we extract features that represent the image region (I used HOG features in this case). Then we use these features as input to the trained SVM classifiers. Each SVM classifier outputs a probability that the region contains its particular body part. This approach is illustrated in Figure 4.

Figure 4: To construct unary potentials we use a one-vs-all SVM framework. First, using the ground truth labels from our training data, we collect positive examples (image regions that contain the body part of interest). Then we collect a large sample of random regions that do not contain the body part of interest. With this setup we can train a one-vs-all SVM classifier, which acts as a detector for that particular body part.

As already mentioned, I represented image regions with HOG features, which provided an 81-dimensional feature representation. Such a low-dimensional representation made the framework very efficient, but at the same time limited the power of the model, as will be discussed later.
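The part detector described above can be sketched as follows, in Python with only the standard library. This is not the code used in this project: the SGD-trained linear SVM, the logistic squashing of the SVM score into a pseudo-probability, the function names, and the toy 2-dimensional features (standing in for the 81-dimensional HOG descriptors) are all assumptions made for illustration.

```python
import math
import random


def train_linear_svm(features, labels, lam=0.01, lr=0.1, epochs=50, seed=0):
    """Train a linear SVM by subgradient descent on the regularized hinge loss.

    features: list of equal-length feature vectors (lists of floats).
    labels:   list of +1 (region contains the part) or -1 (it does not).
    Returns the weight vector w and the bias b.
    """
    rng = random.Random(seed)
    w = [0.0] * len(features[0])
    b = 0.0
    order = list(range(len(features)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, y = features[i], labels[i]
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            # Gradient step on the L2 regularizer (shrinks the weights).
            w = [(1.0 - lr * lam) * wj for wj in w]
            # Subgradient step on the hinge loss, only for margin violations.
            if margin < 1.0:
                w = [wj + lr * y * xj for wj, xj in zip(w, x)]
                b += lr * y
    return w, b


def part_probability(w, b, feature):
    """Map the SVM score to (0, 1) with a logistic function.

    Assumption: the report does not say how SVM scores were converted to
    probabilities; an uncalibrated sigmoid is used here as a stand-in.
    """
    score = sum(wj * xj for wj, xj in zip(w, feature)) + b
    return 1.0 / (1.0 + math.exp(-score))


# Toy stand-ins for descriptors of "head" regions (+1) and other regions (-1).
head_regions = [[2.0, 2.0], [2.5, 1.5], [1.5, 2.5]]
other_regions = [[-2.0, -2.0], [-2.5, -1.5], [-1.5, -2.5]]
features = head_regions + other_regions
labels = [1] * len(head_regions) + [-1] * len(other_regions)

w, b = train_linear_svm(features, labels)
p_head = part_probability(w, b, [2.0, 2.0])     # unary potential, head-like region
p_other = part_probability(w, b, [-2.0, -2.0])  # unary potential, background region
```

One such detector would be trained per body part, and its output for each candidate region would serve as that region's unary potential for the corresponding node.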
3.2 Pairwise Potentials

In addition to unary potentials, I also needed to construct pairwise potentials for the PGM. The intuitive idea is to learn a model that tells us how well two given body part candidates fit together. For example, in a typical pose one would expect the left shoulder to be below and to the left of the head's position. A good model should assign high probability to this configuration and very low probability to a pair of randomly generated locations.

I will now present the specific details of how to implement this idea. Given two candidate regions that may or may not contain a specific pair of body parts, we want to evaluate how well the pair fits together. We represent each region by its (x, y) coordinates. Using this pair of coordinates, we can easily compute the distance between the two regions along the horizontal and vertical axes. Additionally, we can compute the angle between the two locations, with the center of the image as the origin. We then concatenate all of these measurements into a single feature vector that captures the relative distance and relative position of the two regions.

To build the model, we once again turn to the one-vs-all SVM framework. Using the ground truth labels from our training data, we sample the coordinates of each pair of body parts that are connected in our model. From these coordinates we build a feature vector that captures the relative distance and position of the two body parts as described above, and use these vectors as positive examples. For the negative examples, we simply sample a random pair of locations and build the feature vector in the same way. Using these feature vectors, we train a one-vs-all SVM classifier specific to each pair of connected body parts in our model.
We have to train such an SVM classifier for every pair of body parts that share an edge in the specified PGM. Figure 5 illustrates the intuition behind constructing pairwise potentials.

Figure 5: We want to construct pairwise potentials such that, given two candidate locations of adjacent body parts, we can evaluate how well these two locations fit together based on our trained SVM model.
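The pairwise feature construction from this section can be sketched as follows. This is an illustrative sketch, not the project code: the function names, the use of signed offsets for the horizontal and vertical distances, the wrapping of the angle to (-pi, pi], and the uniform sampling of negative pairs are all assumptions, since the text above does not fix these details.

```python
import math
import random


def pairwise_features(p, q, image_size):
    """Build the relative-position feature vector for two candidate locations.

    p, q:       (x, y) coordinates of the two candidate body part locations.
    image_size: (width, height) of the image; its center is used as the origin
                for the angle computation.
    Returns [dx, dy, angle].
    """
    cx, cy = image_size[0] / 2.0, image_size[1] / 2.0
    dx = q[0] - p[0]  # horizontal offset (signed, by assumption)
    dy = q[1] - p[1]  # vertical offset (signed, by assumption)
    # Directions of both points as seen from the image center.
    angle_p = math.atan2(p[1] - cy, p[0] - cx)
    angle_q = math.atan2(q[1] - cy, q[0] - cx)
    # Angle between the two directions, wrapped to (-pi, pi].
    diff = angle_q - angle_p
    angle = math.atan2(math.sin(diff), math.cos(diff))
    return [dx, dy, angle]


def sample_negative_pair(image_size, rng):
    """Sample two uniformly random locations to serve as a negative example."""
    width, height = image_size
    p = (rng.uniform(0, width), rng.uniform(0, height))
    q = (rng.uniform(0, width), rng.uniform(0, height))
    return p, q


# Positive example: hypothetical ground truth head and left shoulder locations.
positive = pairwise_features((60, 50), (50, 60), (100, 100))

# Negative example: a random pair of locations in the same image.
rng = random.Random(0)
neg_p, neg_q = sample_negative_pair((100, 100), rng)
negative = pairwise_features(neg_p, neg_q, (100, 100))
```

Feature vectors built this way from ground truth pairs (positives) and random pairs (negatives) would then be passed to an SVM trained separately for each edge of the model.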
3.3 Inference

To perform inference, I used an external software package that implements the inference algorithm presented in [3]. The algorithm provides approximate inference and is based on linear programming techniques. Since I did not study this algorithm in much detail but simply used an existing implementation, I will not discuss it further. The key point is that it provides efficient MAP inference with solid results, as illustrated in [3].

4 Experiments

4.1 Quantitative Results

In this section, I present the results produced by some state-of-the-art methods in Figure 6 and the results produced by my method in Table 1. It is important to note that the two cannot be compared directly, because the state-of-the-art methods predict the actual body parts whereas my method predicts the joints of the human body. However, we can still make several observations.

First, the results suggest that the accuracy of my method is not that great. We will discuss the reasons for this in the conclusion. There are, however, also some positive aspects. Firstly, my method is much more efficient than the state-of-the-art methods: whereas my method takes about 10 seconds per image at test time, the state-of-the-art methods can take several minutes to label the body parts in one image. Additionally, it is worth noting that the state-of-the-art methods struggle with arm predictions, for both lower and upper arms. As Table 1 shows, my method is actually fairly good at predicting upper arm joints such as shoulders and elbows. It may therefore be possible to improve state-of-the-art accuracy on arm predictions by incorporating some elements of my method.

Figure 6: Body part detection results of state-of-the-art methods.
In Figure 7, I present some additional results produced by my method. Intuitively, this figure can be read like a precision-recall curve. It was generated as follows. The threshold on the x axis is the maximum distance between a prediction and the ground truth label at which the prediction is still considered correct. For instance, a threshold of 0 means that the prediction has to be exactly at the location where the ground truth label is marked. Naturally, as we increase the threshold, the accuracy increases as well. The key observations from this figure are similar to those discussed earlier. Firstly, it is clear that the method performs fairly well for joints such as shoulders and elbows. However, the method produces poor accuracy for wrists, even as we increase the threshold. This is to be expected, since wrists move the most and are probably among the most difficult joints to identify correctly.

Joint            Accuracy
Right Wrist      0.186
Right Elbow      0.397
Right Shoulder   0.487
Left Shoulder    0.482
Left Elbow       0.492
Left Wrist       0.236
Neck             0.437
Head Top         0.432
Torso            0.402

Table 1: Results produced by my method. In this case, my method predicts the locations of the joints in the upper human body.

Figure 7: Detection accuracy as we allow predictions to be further away from the ground truth labels.

4.2 Qualitative Results

To give an intuition of how the proposed method works in practice, I also provide some qualitative visualizations of the results. Figure 8 depicts predictions that look relatively good, whereas Figure 9 illustrates predictions that are quite poor. It is clear that the method performs reasonably well with
the poses that are close to the standard vertical standing pose. This is encouraging, since it means that the model captures at least some of the structure of human pose. However, in more difficult cases such as the ones presented in Figure 9, the method clearly does not perform well. This suggests that some of the modeling decisions I made simplified the model too much and hurt its performance. In the conclusion, I will discuss some possible reasons why the model does not perform as well as one would like.

Figure 8: Predictions produced by my proposed method that look relatively good.

Figure 9: Predictions produced by my proposed method in cases where it performs poorly.
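As a concrete reading of the thresholded accuracy plotted in Figure 7 (Section 4.1), the computation for a single joint can be sketched as follows. The function names are hypothetical, and the use of Euclidean distance in pixels is an assumption, since the report does not state the distance measure or any normalization.

```python
import math


def accuracy_at_threshold(predictions, ground_truth, threshold):
    """Fraction of predictions within `threshold` of their ground truth label.

    predictions, ground_truth: equal-length lists of (x, y) joint locations,
    one entry per test image.
    """
    hits = sum(
        1
        for (px, py), (gx, gy) in zip(predictions, ground_truth)
        if math.hypot(px - gx, py - gy) <= threshold
    )
    return hits / len(predictions)


def accuracy_curve(predictions, ground_truth, thresholds):
    """Accuracy for each threshold, i.e. one curve of the plot for one joint."""
    return [
        accuracy_at_threshold(predictions, ground_truth, t) for t in thresholds
    ]


# Hypothetical predictions for one joint on three test images.
predictions = [(0, 0), (3, 4), (10, 10)]
ground_truth = [(0, 0), (0, 0), (0, 0)]
curve = accuracy_curve(predictions, ground_truth, [0, 5, 20])
```

Because a prediction that counts as correct at one threshold also counts at every larger threshold, the resulting curve can never decrease, which matches the behavior described for Figure 7.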
5 Conclusions and Future Work

After many arduous hours of debugging the code and trying to understand why my proposed method does not work better, I identified the following reasons, which may explain some of its weaknesses.

Firstly, the factor that most significantly impacts the method's performance is the image representation. As already mentioned, in this project I used a HOG representation, which provided an 81-dimensional feature vector. This is clearly not enough to capture all of the intricacies of the context in the image. For instance, the authors of [1] used a fairly complicated feature construction scheme that applies several descriptors at different orientations and scales and then concatenates all of them. Due to unclear descriptions and lack of time, I was not able to experiment with such complicated features. However, I believe such features would have had a significant impact and would have brought my method much closer to state-of-the-art results.

Another very important factor that may have degraded performance is the modeling of the pairwise potentials. The authors of [1] used a methodology similar to mine. However, in addition to modeling image regions simply as (x, y) coordinates, they also introduced scale and orientation, which gave the classifier a lot of extra information. Additionally, in [1] the relationship between two adjacent body parts is represented as a transformation into another space, in which it is then treated as a Gaussian. This representation is clearly much more powerful than the one I used. However, due to the many technicalities involved in this scheme, I was not able to implement it in the given time.

I also identified some secondary factors that may have had a small effect on performance. As I mentioned, I used an approximate inference method.
Even though it is shown to yield solid results in [3], it may still not match exact inference, such as the sum-product algorithm (or its max-product variant for MAP queries) on tree-structured models. I used the approximate scheme because I experimented with non-tree-structured models, for which exact inference is generally intractable. In addition, I used linear SVMs to learn the unary and pairwise potentials. This gave me a very efficient framework at prediction time, but may have degraded performance a little. A non-linear classifier could have given the model more power.

Overall, even though the results are not as good as I was expecting, I am still fairly happy with the progress I made. I presented a simple and extremely efficient method for pose estimation. As demonstrated in the results, the proposed method actually works fairly well for certain joints, such as shoulders and elbows. I believe that with the changes outlined in this section, the results could be made significantly better.

References

[1] Mykhaylo Andriluka, Stefan Roth, and Bernt Schiele. Pictorial structures revisited: People detection and articulated pose estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2009. Best Paper Award Honorable Mention by IGD.
[2] Mykhaylo Andriluka, Stefan Roth, and Bernt Schiele. Discriminative appearance models for pictorial structures. International Journal of Computer Vision (IJCV), 99(3):259-280, 2012.

[3] Marius Leordeanu and Martial Hebert. Efficient MAP approximation for dense energy functions. In ICML, pages 545-552, 2006.