Multiple Pose Context Trees for estimating Human Pose in Object Context

Vivek Kumar Singh, Furqan Muhammad Khan, Ram Nevatia
University of Southern California, Los Angeles, CA
{viveksin furqankh

Abstract

We address the problem of estimating pose in a static image of a human performing an action that may involve interaction with scene objects. In such scenarios, pose can be estimated more accurately using the knowledge of scene objects. Previous approaches do not make use of such contextual information. We propose Pose Context trees to jointly model human pose and object, which allows both accurate and efficient inference when the nature of the interaction is known. To estimate the pose in an image, we present a Bayesian framework that infers the optimal pose-object pair by maximizing the likelihood over multiple pose context trees for all interactions. We evaluate our approach on a dataset of 65 images, and show that the joint inference of pose and context gives higher pose accuracy.

1. Introduction

We consider the problem of estimating 2D human pose in static images where the human is performing an action that involves interaction with scene objects. For example, a person interacts with the soccer ball with his/her leg while dribbling or kicking the ball. In such cases, when the part-object interaction is known, the object position can be used to improve the pose estimation. For instance, if we know the position of either the soccer ball or the leg, it can be used to improve the estimate of the other (see figure 1). However, to determine the interaction in an image, pose and object estimates are themselves needed. We propose a framework that simultaneously estimates the human pose and determines the nature of the human-object interaction. Note that the primary objective of this work is to estimate the human pose; however, in order to improve the pose estimate using object context, we also determine the interaction.
Numerous approaches have been developed for estimating human pose in static images [4, 10, 18, 16, 14], but these do not use any contextual information from the scene. Multiple attempts have been made to recognize human pose and/or action, both with and without using scene context [6, 9, 8, 17]. These approaches first estimate the human pose and then use the estimated pose as input to determine the action. Although general human pose estimation has seen significant advances in recent years, especially using part-based models [4], these approaches still produce ambiguous results that are not accurate enough to recognize human actions. [5] used a part-based model [14] to obtain pose hypotheses and used these hypotheses to obtain descriptors to query poses in unseen images. [8] used the upper body estimates obtained using [4] to obtain hand trajectories, and used these trajectories to simultaneously detect objects and recognize human-object interaction. [17] discovers action classes using the shape of humans described by shape context histograms. Recently, attempts have also been made to classify image scenes by jointly inferring over multiple object classes [13, 3], such as the co-occurrence of human and racket for recognizing tennis [13]. [3] also used the relative spatial context of the objects to improve the object detection.

Figure 1. Effect of object context on human pose estimation. (a), (d) show sample images of players playing soccer and basketball respectively; (b), (e) show the human pose estimated using tree-structured models [14]; (c), (f) show the human pose estimated using the object context.

Previous approaches that model human-object interaction either use a coarse estimate of the human to model the interaction [9, 3] or estimate the human pose as a pre-step to simultaneous object and action classification [8]. In this work, we use an articulated human pose model with 10 body parts, which allows accurate modeling of human and object interaction. More precisely, we propose a graphical model, the Pose Context Tree, to simultaneously estimate the human pose and the object. The model is obtained by adding an object node to the tree-structured part model for human pose [4, 10, 18, 16] such that the resulting structure is still a tree, and thus allows efficient and exact inference [4]. To automatically determine the interaction, we consider multiple pose context trees, one for each possible human-object interaction, based on which part may interact with the object. The best pose is inferred as the pose that corresponds to the maximum likelihood score over the set of all trees. We also consider the probability of absence of an object, which allows us to determine whether the image contains none of the known interactions.

Thus, our contribution is two-fold:
(i) Pose context trees to jointly estimate detailed human pose and object, which allows for an accurate interaction model.
(ii) A Bayesian framework to jointly infer human pose and human-object interaction in a single image; when the interaction in the image is not modeled, our algorithm reports an unknown interaction and estimates the human pose without any contextual information.

To evaluate our approach, we collected images from the Internet and previously released datasets [2, 14, 9]. Our dataset has 65 images, of which 44 show a person either kicking or dribbling a soccer ball, or holding a basketball. We demonstrate that our approach improves the pose accuracy over the dataset by 5% for all parts and by about 8% for parts that are involved in the interaction, for example, legs for soccer.
In the rest of the paper, we first discuss the pose context tree in section 2 and the algorithm for simultaneous pose estimation and action recognition in section 3. Next, we present the model learning in section 4, followed by the experiments in section 5 and the conclusion in section 6.

2. Modeling Human Body and Object Interaction

We use a tree-structured model to represent the interaction between the human pose and a manipulable object involved in an action. We refer to this as the Pose Context Tree. Our model is based on the observation that during an activity that involves human-object interaction, such as playing tennis or basketball, humans interact with the object using the part extremities, i.e. hands, lower legs and head; for example, basketball players hold the ball in their hands while shooting. We first describe a tree-structured model of the human body, and then demonstrate how we extend it to a pose context tree that can be used to simultaneously infer the pose and the object.

[Tree-structured Human Body Model] We represent the human pose using a tree pictorial structure [4] with the torso at the root and the four limbs and head as its branches (see figure 2). The human body $X$ is denoted as a joint configuration of body parts $\{x_i\}$, where $x_i = (p_i, \theta_i)$ encodes the position and orientation of part $i$. Given the full object $X$, the model assumes that the likelihood maps for the parts are conditionally independent and that the parts are kinematically constrained using model parameters $\Theta$. Under this assumption, the posterior likelihood of the object $X$ given the image observations $Y$ is

$$
\begin{aligned}
P(X \mid Y, \Theta) &\propto P(X \mid \Theta)\, P(Y \mid X, \Theta) = P(X \mid \Theta) \prod_i P(Y \mid x_i) \\
&\propto \exp\Big( \sum_{ij \in E} \psi_{ij}(x_i, x_j \mid \Theta) + \sum_i \phi_i(Y \mid x_i) \Big) \qquad (1)
\end{aligned}
$$

where $(V, E)$ is the graphical model; $\phi_i(\cdot)$ is the likelihood potential for part $i$; and $\psi_{ij}(\cdot)$ is the kinematic prior potential between parts $i$ and $j$, modeled using $\Theta$. For efficiency, the priors are assumed to be Gaussian [4].

[Pose Context Tree] The pose context tree models the interaction between the human body and an object involved in the action.
Since humans often interact with scene objects using part extremities (legs, hands), pose context trees are obtained by adding an object node to a leaf node in the human tree model (see figure 2). We represent the pose and object jointly using $X^o = \{x_i\} \cup z_o$, where $z_o$ is the position of the object $O$. The joint posterior is

$$
P(X^o \mid Y, \Theta) \propto P(X^o \mid \Theta)\, P(Y \mid X^o, \Theta) = P(X^o \mid \Theta) \Big( \prod_i P(Y \mid x_i) \Big) P(Y \mid z_o) \qquad (2)
$$

where $P(X^o \mid \Theta)$ is the joint prior for the body parts and the object $O$. Since the object and interaction are known, we assume knowledge of the body part involved in the interaction with the object. As the graphical model with the context node is a tree, the joint kinematic prior $P(X^o \mid \Theta)$ can be written as $P(X \mid \Theta_P)\, P(x_k, z_o \mid \Theta_a)$, where $k$ is the part interacting with the object, $\Theta_P$ is the kinematic prior for the body parts, and $\Theta_a$ is the spatial prior for interaction model $a$ between $z_o$ and $x_k$. Thus, the joint likelihood can now be written as

$$
\begin{aligned}
P(X^o \mid Y, \Theta) &\propto P(X \mid \Theta_P)\, P(x_k, z_o \mid \Theta_a) \Big( \prod_i P(Y \mid x_i) \Big) P(Y \mid z_o) \\
&\propto P(X \mid Y, \Theta_P)\, \exp\big( \psi_a(x_k, z_o \mid \Theta_a) + \phi_o(Y \mid z_o) \big) \qquad (3)
\end{aligned}
$$

where $\phi_o(\cdot)$ is the likelihood potential of the object $O$, and $\psi_a(\cdot)$ is the object-pose interaction potential between $O$ and the interacting body part $k$ for interaction model $a$, given by

$$
\psi_a(x_k, z_o) =
\begin{cases}
1 & \text{if } |T_{ko}(x_k) - T_{ok}(z_o)| < d^{a}_{ko} \\
0 & \text{otherwise}
\end{cases}
\qquad (4)
$$

where $T_{ko}(\cdot)$ is the relative position of the point of interaction between $O$ and part $k$ in the coordinate frame of the object $O$.

Figure 2. Pose Model: (a) Tree-structured human model with observation nodes; (b) Pose Context Tree for object interaction with the left lower leg; the object node is marked as double-lined. Node labels: $x_t$, $x_h$, $x_{lua}$, $x_{lla}$, $x_{rua}$, $x_{rla}$, $x_{lul}$, $x_{lll}$, $x_{rul}$, $x_{rll}$, and the object node $x_o$.

3. Human Pose Estimation using Context

We use a Bayesian formulation to jointly infer the human pose and the part-object interaction in a single image. Here, by inferring the part-object interaction we mean estimating the object position and the interacting body part. Note that unlike [9], our model is generative and does not assume that the set of interactions forms a closed set, i.e. we consider the set of interactions $A = \{a_i\} \cup \phi$. The joint optimal pose configuration and interaction pair $(X^*, a^*)$ is defined as

$$
(X^*, a^*) = \arg\max_{X, a} P(X, a \mid Y, \Theta) = \arg\max_{X, a} P(a \mid X, Y, \Theta)\, P(X \mid Y, \Theta) \qquad (5)
$$

We define the conditional likelihood of interaction $a$ given the pose estimate $X$, observations $Y$ and model parameters $\Theta$ as the product of the likelihood of the corresponding object $O$ in the neighborhood of $X$ and the likelihoods of absence of the objects that correspond to the other interactions in $A$, i.e.
$$
P(a \mid X, Y, \Theta) \propto P(z_o \mid X, Y, \Theta) \prod_{a' \in A \setminus \{a\}} \big( 1 - P(z_{o(a')} \mid X, Y, \Theta) \big) \qquad (6)
$$

Combining equations 5 and 6, we obtain the joint optimal pose-action pair as

$$
\begin{aligned}
(X^*, a^*) &= \arg\max_{X, a} P(z_o \mid X, Y, \Theta)\, P(X \mid Y, \Theta) \prod_{a' \in A \setminus \{a\}} \big( 1 - P(z_{o(a')} \mid X, Y, \Theta) \big) \\
&= \arg\max_{X^o, a} P(X^o \mid Y, \Theta) \prod_{a' \in A \setminus \{a\}} \big( 1 - P(z_{o(a')} \mid X, Y, \Theta) \big) \qquad (7)
\end{aligned}
$$

The joint pose-interaction pair likelihood given in equation 7 can be represented as a graphical model; however, the graph in such a case will have cycles because of edges from object nodes to the left and right body parts. One may use loopy belief propagation to jointly infer over all interactions [9], but in this work we use an alternate approach: we efficiently and accurately solve for each interaction independently and then select the best pose-interaction pair. For each interaction $a$, we estimate the best pose $X^{o*}$ and then add penalties for other objects present close to $X^{o*}$. The optimal pose-interaction pair is then given by

$$
(X^*, a^*) = \arg\max_{a} \Big( \max_{X^o} P(X^o \mid Y, \Theta) \Big) \prod_{a' \in A \setminus \{a\}} \Big( 1 - \max_{z_{o(a')}} P(z_{o(a')} \mid X^{o*}, Y, \Theta) \Big) \qquad (8)
$$

where $X^{o*}$ can be obtained by solving the Pose Context Tree for the corresponding interaction $a$ (described later in this section). Observe that when $a = \phi$, the problem reduces to finding the best pose given the observation, with a penalty added if objects are found near the inferred pose. Thus our model can be applied to any image even when the interaction in the image is not modeled, thereby making our model more apt for estimating human poses in general scenarios.

3.1. Pose Inference for known Object Context using Pose Context Tree

Given the object context, i.e. object position and interaction model, the pose is inferred using the pose context tree by maximizing the joint likelihood given by equation 2. Since the corresponding energy equation for the pose context tree (eqn 2) has a similar form to that of the tree-structured human body model (eqn 1), both can be minimized using similar algorithms [4, 14, 5, 1]. These approaches apply part/object detectors over all image positions and orientations to obtain part hypotheses and then enforce kinematic constraints on these hypotheses using belief propagation [11] over the graphical model. This is sometimes referred to as parsing. Given an image parse of the parts, the best pose is obtained from the part hypotheses by methods such as importance sampling [4], maximum likelihood [1], or data-driven MCMC [12].

[Body Part and Object Detectors] We used the boundary and region templates trained by Ramanan et al [14] for localizing human pose (see Figure 3(a, b)). Each template is a weighted sum of oriented bar filters, where the weights are obtained by maximizing the conditional joint likelihood (refer to [15] for details on training). The likelihood of a part is obtained by convolving the part boundary template with the Sobel edge map, and the part region template with the part's appearance likelihood map. Since the appearance of the parts is not known at the start, part estimates inferred using boundary templates are used to build the part appearance models (see iterative parsing [14]). For each part, an RGB histogram of the part, $h_{fg}$, and of its background, $h_{bg}$, is learnt; the appearance likelihood map for the part is then simply given by the binary map $p(h_{fg} \mid c) > p(h_{bg} \mid c)$. For more details, please refer to [14].

For each object class, such as soccer ball, we trained a separate detector with a variant of Histogram of Gradients features [2], the mean RGB and HSV values, and the normalized hue and saturation histograms. The detectors were trained using Gentle AdaBoost [7]. We use a sliding window approach to detect objects in the image; a window is tested for the presence of an object by extracting the image features and running them through boosted decision trees learned from training examples. The details on learning object detectors are described in Section 4.2.
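The selection rule of equation 8 can be illustrated with a minimal sketch. It assumes that each pose context tree has already been solved, so that the best joint likelihood per interaction and the detection probabilities of the other objects near each best pose are available as plain numbers. The function name `select_interaction`, the dictionary layout, and the label `"none"` for the null interaction are hypothetical and not taken from the paper.

```python
def select_interaction(pose_likelihood, nearby_object_prob):
    """Pick the interaction with the highest penalized score (cf. equation 8).

    pose_likelihood: dict mapping interaction a -> max over X^o of P(X^o | Y, Theta),
        assumed to be precomputed by solving the pose context tree for a.
    nearby_object_prob: dict mapping interaction a -> dict mapping another
        interaction a' -> max probability that the object of a' lies near the
        best pose found for a. The null interaction has no object of its own,
        so it never appears as an inner key.
    """
    best_interaction, best_score = None, float("-inf")
    for interaction, likelihood in pose_likelihood.items():
        score = likelihood
        # Penalize the presence of objects that belong to other interactions.
        for other, prob in nearby_object_prob.get(interaction, {}).items():
            if other != interaction:
                score *= 1.0 - prob
        if score > best_score:
            best_interaction, best_score = interaction, score
    return best_interaction, best_score
```

With made-up inputs such as a likelihood of 0.6 for soccer and a 0.1 chance of a basketball near that pose, the soccer tree scores 0.6 × 0.9; the null interaction is scored the same way but is penalized by every detected object. In practice the products would more likely be accumulated in the log domain to avoid underflow.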
[Infer Part and Object Distributions] For each part and object, we apply the detector over all image positions and orientations to obtain a dense distribution over the entire configuration space. We then simultaneously compute the posterior distribution of all parts by locally exchanging messages about kinematic information between parts that are connected. More precisely, the message from part $i$ to part $j$ is the distribution of the joint connecting parts $i$ and $j$, based on the observation at part $i$. This distribution is efficiently obtained by transforming the part distribution into the coordinate system of the connecting joint and applying a zero-mean Gaussian whose variance determines the stiffness between the parts [4].

[Selecting the Best Pose and Object] Since the tree-structured model does not represent inter-part occlusion between parts that are not connected, the pose obtained by assembling the maximum posterior estimates for each part [1] does not result in a kinematically consistent pose. Thus, we use a top-down approach for obtaining a pose: we first find the maximum likelihood estimate of the torso (the root) and then find each child part given its parent estimate. This ensures a kinematically consistent pose.

Figure 3. Part Templates: (a) Edge-based part templates; (b) Region-based part templates. Dark areas correspond to a low probability of edge, and bright areas correspond to a high probability.

Figure 4. Inference on Pose Context Tree: (a) Sample image of a soccer player; (b) Distributions obtained after applying edge templates; (c) Joint part distributions of edge and region templates; (d) Inferred pose and object.

4. Model Learning

Model learning includes learning the potential functions in the pose context tree, i.e. the body part and object detectors for computing the likelihood potentials, and the prior potentials for the Pose Context Tree. For the body part detectors, we use the templates provided by Ramanan et al [14] (for learning these templates, please refer to [15]).

4.1. Prior potentials for Pose Context Tree

Model parameters include the kinematic functions between the parts, the $\psi_{ij}$'s, and the spatial context for each manipulable object $O$, $\psi_{ko}$.

[Human Body Kinematic Prior]: The kinematic function is modeled with Gaussians, i.e. the position of the connecting joint in the coordinate systems of both parts, $(m_{ij}, \sigma_{ij})$ and $(m_{ji}, \sigma_{ji})$, and the relative angle of the parts at the connecting joint, $(m^{\theta}_{ij}, \sigma^{\theta}_{ij})$. Given the joint annotations available in the training data, we learn the Gaussian parameters with a Maximum Likelihood Estimator [4], [1].

[Pose-Object Spatial Prior]: The spatial function is modeled as a binary potential with a box prior (eqn 4). The box prior $\psi_{ko}$ is parameterized by a mean and variance $(m, \sigma)$ and spans the region $[m - \tfrac{1}{2}\sigma,\; m + \tfrac{1}{2}\sigma]$. Given the pose and object annotations, we learn these parameters from the training data.

4.2. Object Detector

For each type of object, we train a separate detector for its class using Gentle AdaBoost [7]. We use a variation of Histogram of Gradients [2] to model the edge boundary distribution, and mean RGB and HSV values for color. For efficiency, we do not perform all the steps suggested by [2] for computing HOGs. We divide the image into a rectangular grid of patches, and for each cell in the grid a histogram of gradients is constructed over orientation bins. Each pixel in the cell casts a vote equal to its gradient magnitude to the bin that corresponds to its orientation. The histograms are then sum-normalized to 1.0. For the appearance model of the objects, normalized histograms over hue and saturation values are constructed for each cell. Thus our descriptor for each cell consists of mean RGB and HSV values, and normalized histograms of oriented gradients, hue and saturation.
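The gradient part of the per-cell descriptor described above can be sketched as follows. This is only an illustration under stated assumptions: gradients are taken as central differences on a grayscale image given as a list of rows, orientations are unsigned (folded into 180 degrees), and the defaults of a 5×5 grid with 12 bins are illustrative choices. The function name `cell_gradient_histograms` is hypothetical, and the color features (mean RGB/HSV, hue and saturation histograms) are omitted.

```python
import math


def cell_gradient_histograms(gray, grid=(5, 5), n_bins=12):
    """Sum-normalized gradient-orientation histogram for each grid cell.

    gray: 2D list of floats (rows of pixel intensities).
    Returns a rows x cols x n_bins nested list.
    """
    height, width = len(gray), len(gray[0])
    rows, cols = grid
    hists = [[[0.0] * n_bins for _ in range(cols)] for _ in range(rows)]
    for y in range(height):
        for x in range(width):
            # Central differences, clamped at the image border.
            gx = gray[y][min(x + 1, width - 1)] - gray[y][max(x - 1, 0)]
            gy = gray[min(y + 1, height - 1)][x] - gray[max(y - 1, 0)][x]
            magnitude = math.hypot(gx, gy)
            if magnitude == 0.0:
                continue
            # Unsigned orientation in [0, 180) degrees.
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            bin_index = min(int(angle / (180.0 / n_bins)), n_bins - 1)
            r = min(y * rows // height, rows - 1)
            c = min(x * cols // width, cols - 1)
            # Each pixel votes with its gradient magnitude.
            hists[r][c][bin_index] += magnitude
    for r in range(rows):
        for c in range(cols):
            total = sum(hists[r][c])
            if total > 0:
                # Sum-normalize each cell histogram to 1.0.
                hists[r][c] = [v / total for v in hists[r][c]]
    return hists
```

Concatenating the cell histograms (together with the color statistics, which are not shown) would give the feature vector passed to the boosted classifier. Cells without any gradient are left as all-zero histograms here, which is a design choice of this sketch rather than something the text specifies.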
For training, we use images downloaded from the Internet for each class. We collected 50 positive samples for each class and 3500 negative samples obtained by randomly sampling windows of fixed size from the images. While training the detector for one class, positive examples from the other classes were also added to the negative set. For robustness, we increased the positive sample set by including small affine perturbations of the positive samples (rotation and scale). For each object detector, the detection parameters include the number of horizontal and vertical partitions of the window, the number of histogram bins for gradients, hue and saturation, and the number of boosted trees and their depths for the classifier; they were selected based on the performance of the detector on the validation set. The validation set contained 15 images for each object class and 15 images containing none of them, and does not overlap with the training or test sets. We select the classifier that gives the lowest False Positive Rate.

5. Experiments

To validate the model, we created a dataset with images downloaded from the Internet and other datasets [9, 2, 14]. The dataset has images for 3 interactions: legs with soccer ball, hands with basketball, and miscellaneous. For ease of writing, we refer to these as soccer, basketball and misc respectively. The soccer set includes images of players kicking or dribbling the ball, the basketball set has images of players shooting or holding the ball, and the misc set includes images from the People dataset [14] and the INRIA pedestrian dataset [2]. Similar to the People dataset [14], we resize each image such that the person is roughly 100 pixels high. Figure 5 shows some sample images from our dataset. Note that we do not evaluate our algorithm on the existing dataset [9], as our system assumes that the entire person is within the image.

5.1. Object Detection

We evaluate the object detection for each object type, i.e. soccer ball and basketball.
Figure 6 shows some positive examples from the training set. For evaluation, we consider a detection hypothesis to be correct if the detection bounding box overlaps the ground truth for the same class by more than 50%. We first describe the detection parameters used in our experiments, and then evaluate the detectors on the test set.

[Detection Parameters]: As mentioned previously in the Learning section, we set the detection parameters for each object based on its performance on the validation set. We select the detection parameters by experimenting over the grid size (1×1, 3×3, 5×5), the number of histogram bins for gradients (8, 10, 12 over 180 degrees), hue (8, 10, 12) and saturation (4, 5, 6), the number of boosted trees (100, 150, 200) and their depths (1, 2) for the classifier, and the threshold on detection confidence (0.2, 0.3, ..., 0.9). We use a training window size of [...] for both soccer ball and basketball. For both objects, we select the detection parameter setting that gives the lowest False Positive Rate with a miss rate of at most 20%. For soccer ball, the detector trained with 12 gradient orientation, 10 hue and 4 saturation bins over a 5×5 grid gave the lowest FPPW, [...], when boosting over 150 trees of depth [...]. For basketball, on the other hand, 200 boosted trees of depth 2 on a 5×5 grid gave the lowest FPPW of [...] for 12 HOG, 8 hue and 6 saturation bins.

[Evaluation]: We evaluate the detectors by applying them on test images at known scales. To compute the detection accuracy for each detector on the test set, we merge overlapping windows after rejecting responses below a threshold. The detection and false alarm rates for each detector on the entire test set are reported in Table 1 for a threshold of 0.5.

Table 1. Object detection accuracy over the test set.

Object        Detection Rate   False Positives Per Window
Soccer ball   91.7%            2 × 10^-4
Basketball    84.2%            1.5 × 10^-3

Figure 5. Sample images from the dataset; Rows 1, 2 contain examples from the basketball class, Rows 3, 4 from soccer, and Row 5 from misc.

Figure 6. Sample positive examples used for training object detectors; Rows 1 and 2 show positive examples for basketball and soccer ball respectively.

5.2. Pose Estimation

For evaluation, we compute the pose estimation accuracy over the entire dataset with and without using the object context. Pose accuracy is computed as the average correctness of each body part over all the images (a total of 650 parts). An estimated body part is considered correct if its segment endpoints lie within 50% of the length of the ground-truth segment from their annotated locations, as in earlier reported results [5, 1]. To demonstrate the effectiveness of each module in our algorithm, we compute pose accuracy using 3 approaches: Method A, which estimates the human pose using [14] without any contextual information; Method B, which estimates the human pose with a known object, i.e. using the pose context tree; and Method C, which jointly infers the object and estimates the human pose in the image. Note that Method B essentially corresponds to the case when all interactions are correctly recognized and the joint pose and object estimation is done using pose context trees; hence the performance of Method B gives an upper bound on the performance of Method C, which is the fully automatic approach. Figure 8 shows sample results obtained using all 3 methods.

Table 2. Pose accuracy over the entire dataset with a total of 65 × 10 = 650 parts. S corresponds to the accuracy over images in the soccer set, B to the basketball set and M to the misc set.

Approach               S       B         M       Overall
A  No context [14]     67.1    43.2      64.3    57.33
B  KnownObject-PCT     72.9    63.[...]  [...]   [...]
C  Multiple PCT        [...]   [...]     [...]   [...]

The pose accuracy obtained using the above methods is shown in Table 2. Notice that the use of contextual knowledge improves the pose accuracy by 9%, and that using our approach, which is fully automatic, we obtain an increase of 5%. To clearly demonstrate the strength of our model, we also report the accuracy over the parts involved in the interactions in the soccer and basketball sets. As shown in Table 3, the methods using context (B and C) significantly outperform Method A, which does not use context. Notice that the improvement in accuracy is especially high for the basketball set. This is because the basketball set is significantly more cluttered than the soccer and misc sets, and hence pose estimation is much harder; the use of context provides additional constraints that help more accurate pose estimation.

Table 3. Accuracy over parts involved in the object interaction; for soccer only the legs are considered and for basketball only the hands; thus, accuracy is computed over 44 × 4 = 176 parts.

Approach               S (legs)   B (hands)   Overall
A  No context [14]     [...]      [...]       [...]
B  KnownObject-PCT     [...]      [...]       [...]
C  Multiple PCT        [...]      [...]       [...]

For pose accuracy to improve using contextual information, the interaction in the image must also be correctly inferred. Thus, in addition to the pose accuracy, we also compute the accuracy of interaction categorization. For comparison, we use an alternate approach that categorizes an image using the joint spatial likelihood of the detected object and the human pose estimated without using context. This is similar to the scene and object recognition approach [13]. Figure 7 shows the confusion matrices for both methods, with the object-based approach as (a) and the use of multiple pose context trees as (b). Notice that the average categorization accuracy using multiple pose context trees is much higher.

Figure 7. Confusion matrix for interaction categorization over the basketball, soccer and misc classes: (a) using scene object detection (recognition rate: 80.6%); (b) using multiple pose context trees (recognition rate: 90%).

6. Conclusion

In this paper we proposed an approach to estimate the human pose when interacting with a scene object, and demonstrated that the joint inference of the human pose and object increases pose accuracy. We proposed Pose Context trees to jointly model the human pose and the object interaction, such as dribbling or kicking the soccer ball.
To simultaneously infer the interaction category and estimate the human pose, our algorithm considers multiple pose context trees, one for each possible human-object interaction, and finds the tree that gives the highest joint likelihood score. We applied our approach to estimate human pose in a single image over a dataset of 65 images with 3 interactions, including a category with assorted unknown interactions, and demonstrated that the use of contextual information improves pose accuracy by about 5% (8% over the parts involved in the interaction, such as legs for soccer).

Figure 8. Results on Pose Dataset (columns: Input, Iterative Parse, KnownObject-PCT, Multiple PCT). (a), (b), (c) are images from the basketball set and (d), (e) are from the soccer set. The posterior distributions are also shown for the Iterative Parsing approach and for PCT when the action and object position are known. Notice that even in cases where the MAP pose is similar, the pose distribution obtained using PCT is closer to the ground truth. Soccer ball responses are marked in white, and basketballs are marked in yellow. In example (c), the basketball gets detected as a soccer ball and thus results in a poor pose estimate using Multiple-PCT; however, when the context is known, the true pose is detected using PCT.

References

[1] M. Andriluka, S. Roth, and B. Schiele. Pictorial structures revisited: People detection and articulated pose estimation. In IEEE CVPR 2009, June.
[2] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, volume 1, June.
[3] C. Desai, D. Ramanan, and C. Fowlkes. Discriminative models for multi-class object layout. In International Conference on Computer Vision (ICCV).
[4] P. Felzenszwalb and D. P. Huttenlocher. Pictorial structures for object recognition. IJCV 2005, 61(1):55-79.
[5] V. Ferrari, M. Marin-Jimenez, and A. Zisserman. Progressive search space reduction for human pose estimation. In IEEE Conference on Computer Vision and Pattern Recognition, 2008, pages 1-8.
[6] V. Ferrari, M. Marin-Jimenez, and A. Zisserman. Pose search: Retrieving people using their pose. In CVPR, pages 1-8.
[7] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. In The Annals of Statistics, volume 38.
[8] A. Gupta and L. S. Davis. Objects in action: An approach for combining action understanding and object perception. In CVPR, volume 0, pages 1-8.
[9] A. Gupta, A. Kembhavi, and L. S. Davis. Observing human-object interactions: Using spatial and functional compatibility for recognition. In T-PAMI, volume 31.
[10] G. Hua, M.-H. Yang, and Y. Wu. Learning to estimate human pose with data driven belief propagation. In IEEE CVPR 2005, volume 2, June.
[11] F. Kschischang, B. Frey, and H.-A. Loeliger. Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2), Feb.
[12] M. Lee and I. Cohen. Proposal maps driven MCMC for estimating human body pose in static images. In IEEE CVPR 2004.
[13] L.-J. Li and L. Fei-Fei. What, where and who? Classifying event by scene and object recognition. In ICCV.
[14] D. Ramanan. Learning to parse images of articulated bodies. In Advances in Neural Information Processing Systems 19. MIT Press, Cambridge, MA.
[15] D. Ramanan and C. Sminchisescu. Training deformable models for localization. In IEEE CVPR 2006, volume 1, June.
[16] L. Sigal, B. Sidharth, S. Roth, M. Black, and M. Isard. Tracking loose-limbed people. In IEEE CVPR 2004, volume I, June.
[17] Y. Wang, H. Jiang, M. S. Drew, Z.-N. Li, and G. Mori. Unsupervised discovery of action classes. In CVPR.
[18] J. Zhang, J. Luo, R. Collins, and Y. Liu. Body localization in still images using hierarchical models and hybrid search. In IEEE CVPR 2006, volume II, June.
