Action Recognition in Cluttered Dynamic Scenes using Pose-Specific Part Models


Vivek Kumar Singh, University of Southern California, Los Angeles, CA, USA
Ram Nevatia, University of Southern California, Los Angeles, CA, USA

Abstract

We present an approach to recognizing single actor human actions in complex backgrounds. We adopt a Joint Tracking and Recognition approach, which tracks the actor pose by sampling from 3D action models. Most existing such approaches require large training data or MoCAP to handle multiple viewpoints, and often rely on clean actor silhouettes. The action models in our approach are obtained by annotating keyposes in 2D, lifting them to 3D stick figures and then computing the transformation matrices between the 3D keypose figures. Poses sampled from coarse action models may not fit the observations well; to overcome this difficulty, we propose an approach for efficiently localizing a pose by generating a Pose-Specific Part Model (PSPM) which captures appropriate kinematic and occlusion constraints in a tree structure. In addition, our approach does not require pose silhouettes. We show improvements over previous results on two publicly available datasets as well as on a novel, augmented dataset with dynamic backgrounds.

1. Introduction

The objective of this work is to recognize single actor human actions in videos captured from a single camera. This has been a popular research topic over the past few years, as effective solutions to this problem find applications in surveillance, HCI, and video retrieval, among others. Existing action recognition methods work well under variations in actor appearance; however, handling viewpoint variations with low training requirements, and dealing with cluttered dynamic backgrounds, is still a challenge. While view-invariant approaches using 3D models have been proposed, they either require 3D MoCAP for learning models and/or require videos from multiple viewpoints. We present a simultaneous tracking and recognition approach which tracks the actor pose by sampling from 3D action models and localizing each pose sample; this allows view-invariant action recognition. To deal with cluttered dynamic backgrounds, we accurately localize each pose using a 2D part model.

We model an action as a sequence of transformations between keyposes. These action models can be obtained by annotating keyposes in 2D, lifting them to 3D stick figures and then computing the transformation matrices between the 3D keyposes [18]; this avoids large training data and MoCAP. However, poses sampled from such coarse models do not match observations well. Thus, during inference, errors due to pose approximation and observation noise would accumulate over time and result in tracking failures and lower recognition rates, especially in cluttered dynamic scenes. We address these issues by a more accurate localization of the human pose using a 2D part model with kinematic constraints. Such models have been successfully applied to localize human pose in cluttered images, under the assumption that the parts are not occluded [4, 20, 1]. However, poses often have multiple occluded parts, and hence modeling inter-part occlusions is useful for accurately localizing such poses. Existing methods such as [9, 27] that model such constraints are too inefficient for tracking and recognition, where multiple poses may need to be localized every few frames.
We propose a novel framework to select a tree-structured model that captures appropriate kinematic and inter-part occlusion constraints for a particular pose in order to accurately localize that pose; we call this model the Pose-Specific Part Model (PSPM). To determine the PSPM for a given pose, we search over many possible tree models and select the model with the highest localizability score. We demonstrate our approach on two publicly available datasets: Full Body Gestures [17], with 6 actions captured from multiple viewpoints in cluttered, dynamic backgrounds, and Hand Gestures [18], with 12 actions with subtle pose variations in a static background. To further demonstrate robustness to background changes, we evaluate our method on an augmented Hand Gestures set with 25 real sequences with camera shake and background object motion, and 215 sequences with embedded dynamic backgrounds. We also evaluate localization using PSPM on an image dataset with different poses and backgrounds, and show improvements over the standard Pictorial Structures [4] and other pose localization methods.

In the rest of the paper, we first review related work in Section 2. We then present the action representation and inference in Section 3. Next, we describe pose localization from 3D priors using the Pose-Specific Part Model (PSPM) in Section 4, followed by the results.

2. Related Work

A natural approach to recognizing actions is to first estimate the body pose and then infer the action based on the pose dynamics [8, 5]. However, the effectiveness of such approaches depends on reliable human pose tracking methods. A popular approach is to avoid pose tracking and directly match image descriptors to the action models by learning action classifiers, using SVMs [13] or graphical models such as CRFs [16] and LDA [15]; however, it is difficult to capture temporal relationships in such models. Furthermore, these methods typically require a large amount of training data from multiple viewpoints.

Another approach is to simultaneously track the pose and recognize the action; we refer to these as Joint-Tracking-and-Recognition methods. These methods learn action models that capture the evolution of the actor pose in 3D and, during inference, use the action priors for tracking the pose and the estimated pose to recognize actions. While these methods work well across viewpoints, most of them require 3D MoCAP data for learning accurate models [28, 22, 19, 17] and/or rely on person silhouettes for localization and matching [21, 24, 23, 31, 19, 6, 18], which assumes a static background. Recently, [18] proposed a multi-view approach without using MoCAP, by learning 3D action models from 2D keypose annotations and recognizing actions by matching poses sampled from the action models to actor silhouettes. However, poses sampled from such coarse models result in large matching errors which accumulate over time and significantly affect recognition, especially in cluttered scenes. In our work, we address this issue by accurately localizing the pose using part models.

An important aspect of Joint-Tracking-and-Recognition methods is to reliably localize/match the pose to the video. Recently, part-based graphical models (pictorial structures [4]) have been shown to accurately localize 2D poses in complex backgrounds [1, 20], but they do not model inter-part occlusion. Localization of poses with inter-part occlusions requires simultaneous modeling of body kinematics and inter-part occlusion, which makes inference hard. Existing approaches model such constraints using common-factor models [12] or multiple trees [29], or represent them in a kinematic graph (with cycles) and infer the pose using non-parametric message passing [23] or branch-and-bound [9, 27]. However, these methods either use person silhouettes [23], require training data from all viewpoints [12], or are too inefficient [9, 27] for tracking. Recently, [2] trained multiple view-specific models for estimating pose in the walking action; however, for multiple actions, a large number of models would need to be trained.

3. Action Recognition

In this work, we build on the Joint-Tracking-and-Recognition approach of combining Tracking-by-Priors and Recognition-by-Tracking. For each action, we obtain an approximate model of the human pose dynamics in a scale- and pan-normalized 3D space; this allows a scale- and viewpoint-invariant representation. This is done by scaling the poses to a fixed known height.
For inference, we match image observations to the action models by tracking using a 3D human model in the action-restricted pose space, and find the action with the highest matching score [31, 14, 18]. Here, we first present the action representation and model learning, followed by action and pose inference. The pose localization is described later, in Section 4.

3.1. Representation and Learning

We learn a separate model for each action that captures the dynamics of the human pose. Our models are based on the concept that a single actor human action can be represented as a sequence of linear transformations between a few, representative keyposes. Our action model is inspired by [18], which refers to the linear transformation between a keypose pair as a primitive. For example, the walking action can be represented with four primitives: left leg forward → right leg crosses left leg → right leg forward → left leg crosses right leg. Note that each primitive is a conjunction of rotations of body parts (e.g., during walking, rotation of the upper leg about the hip and rotation of the lower leg about the knee), and thus can be represented as a linear transformation in joint-angle space. This is illustrated in Figure 1.

Figure 1. Geometric interpretation of the action model for walking, in the scale-normalized joint-angle space; dotted red curves denote different instances of the walking action; the piecewise linear curve (in gray) denotes the learnt action model, with keyposes marked as circles (in black).
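To make the primitive representation concrete, the following is a minimal sketch of sampling a pose along a primitive, assuming poses are stored as flat joint-angle vectors; the names and the optional noise term are illustrative, not taken from the paper's implementation.

```python
import numpy as np

def sample_pose_on_primitive(keypose_a, keypose_b, f, joint_sigma=0.0):
    """Sample a pose a fraction f of the way along a primitive.

    keypose_a, keypose_b: flat joint-angle vectors of the two keyposes.
    f:           fraction of the primitive elapsed, in [0, 1].
    joint_sigma: optional per-joint Gaussian noise, mimicking the per-joint
                 Gaussian keypose model (an assumption of this sketch).
    """
    f = float(np.clip(f, 0.0, 1.0))
    pose = (1.0 - f) * keypose_a + f * keypose_b    # linear in joint angles
    if joint_sigma > 0.0:
        pose = pose + np.random.normal(0.0, joint_sigma, size=pose.shape)
    return pose

# e.g. halfway through the "left leg forward -> legs crossed" primitive:
# pose = sample_pose_on_primitive(k_left_forward, k_legs_crossed, f=0.5)
```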

To capture the variations in keyposes across different instances of the same action, we model each keypose by a set of Gaussian distributions, one for every 3D joint position. For speed variations, we model the length of each primitive with a truncated sigmoid function. We normalize each primitive to unit length and learn a Gaussian over the fraction of the primitive that gets covered at each time step. Thus, an action with N_k keyposes is modeled by a set of N_k (N_j + 1) Gaussians, where N_j is the number of 3D joints (= 15).

We learn action models by annotating 2D poses and the primitive action boundaries in the training videos. For each action model, we first manually select the set of keyposes; intuitively, we select a keypose whenever there is a big change in the pose dynamics. Alternatively, if 3D MoCAP is available, keyposes can be obtained automatically as discontinuities in the pose energy [14]. We then learn the 3D model for each keypose from the 2D annotations by lifting, using our implementation of [30]. For each primitive, we obtain the expected change in duration by collecting primitive lengths from the action boundary annotations and fitting a Gaussian.

3.2. Conditional Action Network

Given the action models, we embed them into a Dynamic Conditional Random Field [26], which we refer to as the Conditional Action Network (CAN), illustrated in Figure 2.

Figure 2. Conditional Action Network

We define the state s_t of the CAN at time t by a tuple of action and pose variables (s^{act}_t, s^{pose}_t); the action state s^{act}_t = (a_t, p_t, f_t) includes the action label a_t, the current primitive p_t and the fraction of the primitive elapsed f_t, and the pose state s^{pose}_t = x_t is the current pose x_t. To infer the action from an observation sequence of length T, we estimate the optimal state sequence over all actions by maximizing the log-linear likelihood, which takes the following form:

s^{best}_{[1:T]} = \arg\max_{s_{[1:T]}} \sum_{t=1}^{T} \Big\{ \sum_{f=1}^{n} w_f \, \phi_f(s_t, s_{t-1}, I_t) \Big\}

where \phi_f(s_t, s_{t-1}, I_t) are the observation and transition potentials and w = \{w_f\} is the weight vector, one weight for each potential function.

[Transition Potentials] The action transition potential \phi(a_t, f_t, a_{t-1}, f_{t-1}) is modeled as a truncated sigmoid function over the fraction of the primitive elapsed f_t, such that the probability of staying in the same primitive p_t decreases as f_t approaches 1 and the probability of transitioning to a new primitive increases. The pose transition potential \phi(x_t, x_{t-1}) is modeled using a Normal distribution N(0, \sigma) over the displacement of the neck position and the height h_t.

[Observation Potentials] We compute the observation likelihood of a pose sample x_t, sampled from the action-pose potential \phi(a_t, f_t, x_t), by combining shape and motion likelihoods. We first localize the pose using a part-based model which is generated from the spatial prior available from the action model and handles constraints due to occlusion. We then compute the shape likelihood as the normalized log likelihood of the parts used in the model; the details of this step are described in Section 4:

\phi_{shape}(x) = \frac{1}{|P|} \sum_{i \in P} \phi_i(x_i, I_t)

where P is the set of parts in the pose model and x_i is the i-th part of pose x. The motion likelihood is computed by matching the observed optical flow against the direction of motion of each part, using the cosine distance. We use the Lucas-Kanade algorithm (in OpenCV 1.0) for computing optical flow and quantize the flow into 8 orientation bins.
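As a rough illustration of the motion likelihood, the sketch below scores a part's predicted motion direction against the quantized optical flow. The paper used the Lucas-Kanade algorithm in OpenCV 1.0; this sketch substitutes OpenCV's dense Farneback flow as a stand-in, and the magnitude weighting and [0, 1] mapping are assumptions of the sketch.

```python
import numpy as np
import cv2

def motion_likelihood(prev_gray, cur_gray, part_mask, part_dir):
    """Score a part's predicted motion direction against the observed flow.

    prev_gray, cur_gray: consecutive grayscale frames (uint8).
    part_mask: boolean mask of the pixels covered by the part.
    part_dir:  predicted 2D unit motion direction of the part.
    Returns a score in [0, 1] from the cosine similarity between the
    predicted direction and the (8-bin quantized) flow directions.
    """
    # dense Farneback flow as a stand-in for the paper's Lucas-Kanade flow
    flow = cv2.calcOpticalFlowFarneback(prev_gray, cur_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    fx = flow[..., 0][part_mask]
    fy = flow[..., 1][part_mask]
    mag = np.hypot(fx, fy)
    ang = np.arctan2(fy, fx) % (2 * np.pi)
    # quantize the flow into 8 orientation bins, as in the paper
    bins = (ang / (2 * np.pi / 8)).astype(int) % 8
    bin_dir = (bins + 0.5) * (2 * np.pi / 8)          # bin-center direction
    cos_sim = np.cos(bin_dir) * part_dir[0] + np.sin(bin_dir) * part_dir[1]
    weights = mag / (mag.sum() + 1e-6)                # weight by flow strength
    return float(0.5 * (1.0 + (weights * cos_sim).sum()))  # [-1,1] -> [0,1]
```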
[Weight Learning] We assume uniform transition weights across different actions/primitives, and hence weight learning only involves learning 3 weight values, one for each potential. In this work, we use the Voted Perceptron algorithm [3] due to its efficiency and ease of implementation. The ground truth pose estimates for all frames were obtained by running our inference with the known action label for the sequence.
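A minimal sketch of how such perceptron-based weight learning could look, assuming helper routines `decode` (Viterbi decoding of the CAN under given weights) and `feature_sums` (summed potential values along a state sequence) exist; averaging the weights is used here as the usual approximation to voting.

```python
import numpy as np

def averaged_perceptron(sequences, decode, feature_sums, epochs=5):
    """Learn the three potential weights with Collins' perceptron [3].

    sequences:    list of (observations, gold_state_sequence) pairs.
    decode(w, obs):          best state sequence under weights w (Viterbi
                             over the CAN) -- assumed to exist.
    feature_sums(obs, seq):  3-vector of summed potential values along a
                             state sequence -- assumed to exist.
    """
    w = np.zeros(3)
    w_sum, n = np.zeros(3), 0
    for _ in range(epochs):
        for obs, gold in sequences:
            pred = decode(w, obs)
            if pred != gold:
                # push the weights toward the gold sequence's features
                w = w + feature_sums(obs, gold) - feature_sums(obs, pred)
            w_sum, n = w_sum + w, n + 1
    return w_sum / n       # averaged weights approximate the voted variant
```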

3.3. Tracking and Recognition

Since our action models are continuous and our graphical model has cycles, exact inference is infeasible. Thus, we use a particle filtering approach [22, 25], sampling poses from the action models and matching each pose to the scene observations. During tracking, we first find the person by applying a full-body and a head-shoulder pedestrian detector [7]; multiple detectors help achieve reliable detection, especially in complex scenes. We then uniformly sample poses from the action models and localize the poses to fit the observations, using the approximate position (neck) and scale (person standing height) available from the detection responses. The details of the localization method are described in Section 4. For viewpoint invariance, poses are matched to the observations at various pan angles.

To propagate each sample s_t over time, we increment f_t (the fraction of the primitive elapsed) to obtain the next action state s^{act}_{t+1}; note that if f_t is toward the end of a primitive, the next state may transition to the next primitive or action. We then perturb the position and scale of the person, and obtain the next pose by localizing the pose to the observations; note that the localization step takes into account the spatial prior on the pose from the action model (a_{t+1}, p_{t+1}, f_{t+1}). During actions that are performed while standing at the same location, such as sitting on the ground, we impose a constraint that the feet of the person remain on the ground at roughly the same location (using a penalty function modeled as a zero-mean Gaussian). This constraint makes our tracker more robust to drift. The best state sequence from the state distribution over all frames is then obtained using the Viterbi algorithm.
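The following sketch illustrates how a single particle's action state (a_t, p_t, f_t) might be advanced, under an assumed Gaussian step model and a truncated-sigmoid transition; the constants and the wrap-around behavior are illustrative, not the paper's exact scheme.

```python
import math
import random

def propagate_action_state(a, p, f, num_primitives, mean_len, step_sigma, k=10.0):
    """Advance one particle's action state (a_t, p_t, f_t) by one frame.

    mean_len:   expected primitive length in frames, so the mean per-frame
                step in f is 1 / mean_len (the paper learns a Gaussian over
                the fraction covered per time step).
    k:          sigmoid steepness -- an illustrative constant.
    """
    f += max(random.gauss(1.0 / mean_len, step_sigma), 0.0)
    # truncated-sigmoid transition: staying put gets less likely as f -> 1
    p_transition = 1.0 / (1.0 + math.exp(-k * (f - 1.0)))
    if random.random() < p_transition:
        p = (p + 1) % num_primitives   # advance to the next primitive
        f = 0.0                        # (transition to another action a is
                                       #  handled analogously; omitted here)
    return a, p, min(f, 1.0)
```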
4. Accurate Pose Localization from 3D Priors

In this section, we present our approach to accurately localize a hypothesized pose (from the action model) to the image observations. Given prior information such as scale and position, localization involves searching through the pose space to infer the pose that best describes the image evidence. In our setting, where the pose is being tracked using approximate action models, the prior on the pose includes coarse 2D position and scale information and the pose subspace which is likely to include the true pose. It is natural to assume that in cluttered environments, the 2D position and scale priors may be quite noisy. Furthermore, the pose subspace induced by the action model can be large, especially for fast moving parts, e.g. the hands during waving.

For efficient localization, we first project the 3D pose search space onto the 2D image to obtain a spatial prior on the 2D pose, then localize the 2D pose using image observations, and finally estimate the 3D pose from the aligned 2D pose. For 2D pose localization, we use a part-based graphical model approach (similar to pictorial structures [4, 20, 1]) which represents the human body by its constituent parts (see Figure 3(a)) and imposes pairwise constraints over the parts during inference. These pairwise constraints model the kinematic and/or the inter-part occlusion relationships between the parts; however, when all such constraints are imposed, the graphical model has loops (see Figure 3(b)). Attempts have been made to infer pose using models with loops, but these tend to be computationally expensive [9, 27]. Thus, for efficient and exact inference, tree-structured models are preferred. We develop an approach to automatically select a tree-structured model that is most likely to give an accurate localization for a given pose, by leveraging the fact that under occlusion some kinematic constraints may be relaxed in order to model constraints that would be more effective for localization; we call this model the Pose-Specific Part Model (PSPM).

Next, we first present 2D pose localization using a tree-structured part model. We then describe the PSPM selection and learning, followed by 3D pose localization using the PSPM.

Figure 3. Graphical models for the 2D pose: (a) kinematic tree model [4]; (b) graph with edges modeling both kinematic and inter-part occlusion constraints; observation nodes are not shown for clarity.

4.1. Localizing 2D Pose using a Part Model

In a 2D pose model, each part is represented as a node and the edges represent pairwise constraints between the parts. During inference, detectors for all parts are applied independently on the image, and then the best pose x is obtained by maximizing the joint likelihood given by

p(x, I | \Theta) = \prod_{i \in P} p(I | x_i, \Theta^s_i) \prod_{ij \in E} p(x_i | x_j, \Theta^p_{ij})    (1)

where x_i denotes part i; (P, E) is the graphical model over the parts P; p(I | x_i, \Theta^s_i) represents the likelihood of the part hypothesis x_i obtained by applying the part detector; p(x_i | x_j, \Theta^p_{ij}) represents the pairwise constraints; and \Theta = (\Theta^s, \Theta^p) are the model priors for the unary and pairwise potentials. A commonly used 2D pose model [4, 20, 1] assumes a tree structure, as efficient and exact inference can then be performed [4].

4.1.1 Part Detection

Recently, [1] reported that better part detectors can significantly improve localization results; however, better part detectors are also computationally expensive. Thus, in this work, we experiment with two types of detectors that can be applied efficiently and have previously been used for localizing 2D body parts: geometric templates [10] and boundary and region templates [20]. We briefly describe the part detectors here.

[Geometric Templates] Each part is modeled with a simple geometric object: the head with an ellipse, the torso with an oriented rectangle, and each arm with a pair of line segments. The log likelihood score of a part is obtained by accumulating the edge strength and orientation match over the boundary points.

[Boundary and Region Templates] Each template is a weighted sum of oriented bar filters, where the weights are obtained by maximizing the conditional joint likelihood [20]. We use the detectors provided by the authors.
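For concreteness, below is a generic sketch of exact MAP inference on a tree-structured part model over discretized part placements, i.e. the maximization of Eq. 1 in log space. It is the textbook pictorial-structures computation, not the authors' implementation (which, following [4], can be accelerated with distance transforms); all names are illustrative.

```python
import numpy as np

def tree_map_inference(parent, root, unary, pairwise):
    """Exact MAP inference on a tree-structured part model (Eq. 1, in logs).

    parent:   dict mapping each non-root part to its parent part.
    unary[i]: 1-D array of log detector scores over candidate placements of i.
    pairwise[(p, c)]: matrix of log pairwise scores, indexed
                      [placement of parent p, placement of child c]
                      (e.g. a Gaussian over relative position/orientation).
    Returns {part: index of its best placement}.
    """
    children = {}
    for c, p in parent.items():
        children.setdefault(p, []).append(c)

    belief, back = {}, {}

    def upward(i):
        # leaf-to-root pass: fold each child's best response into part i
        b = unary[i].astype(float)
        for c in children.get(i, []):
            upward(c)
            tab = pairwise[(i, c)] + belief[c][None, :]
            back[c] = tab.argmax(axis=1)   # best child placement per placement of i
            b = b + tab.max(axis=1)
        belief[i] = b

    upward(root)
    best = {root: int(belief[root].argmax())}

    def downward(i):
        # root-to-leaf pass: read off the argmax placements
        for c in children.get(i, []):
            best[c] = int(back[c][best[i]])
            downward(c)

    downward(root)
    return best
```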

4.1.2 Pairwise Constraints

The pairwise kinematic potential between parts is defined using a Gaussian distribution, similar to [4, 1]. To prevent overlapping parts from occupying exactly the same place, we add an additional repulsion constraint that reduces the likelihood of the occluded part overlapping with the occluder. For parts x_i and x_j such that x_i is occluding x_j, we define the pairwise potential as

p(x_i | x_j, \Theta_{ij}) = \mathcal{N}(l_i - l_j; \mu_{ij}, \sigma_{ij}) \, \Lambda(l_i, l_j)

where l_i denotes the position and orientation of x_i, \Theta_{ij} = (\mu_{ij}, \sigma_{ij}) is a Gaussian prior over the relative part position and orientation, and \Lambda(l_i, l_j) is the repulsive prior between the overlapping parts [5].

4.2. Pose-Specific Part Models for Localization

Given spatial priors on a 3D pose, the Pose-Specific Part Model (PSPM) is a tree-structured graph tuned to accurately localize the specified pose. Obtaining the PSPM for a pose involves selecting the model (the set of parts P and the structure E) and estimating the model prior \Theta which is likely to maximize the joint likelihood. Accurate localization can then be obtained by maximizing Eqn. 1.

[Part Selection] For accurate localization, we select the parts that are at least partially visible, since the part detectors do not work well for heavily occluded parts. To achieve this, we project the 3D pose to obtain the approximate position and orientation of each part. This information, together with the relative depth ordering of the parts, is used to estimate the visibility of each part. The visibility v(p_i) is computed as the fraction of part p_i that is unoccluded, i.e.

v(p_i) = 1 - \sum_{j \neq i} ovlp(p_i, p_j)    (2)

where ovlp(p_i, p_j) is the fraction of part p_i occluded by p_j. For model selection, we only consider the parts with visibility greater than 0.5.
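A minimal sketch of this part-selection step, assuming an `ovlp` routine computed from the projected 2D part shapes and a known front-to-back depth ordering (how occlusion fractions are obtained is an assumption here):

```python
def select_visible_parts(parts, depth_order, ovlp, threshold=0.5):
    """Select the at-least-partially-visible parts (Eq. 2).

    parts:       list of part ids.
    depth_order: parts sorted front-to-back (closest to the camera first).
    ovlp(pi, pj): fraction of part pi covered by part pj, assumed to be
                  computed from the projected 2D part shapes.
    """
    rank = {p: r for r, p in enumerate(depth_order)}
    selected = []
    for pi in parts:
        # only parts in front of pi can occlude it
        occluded = sum(ovlp(pi, pj) for pj in parts
                       if pj != pi and rank[pj] < rank[pi])
        visibility = 1.0 - occluded            # Eq. (2)
        if visibility > threshold:
            selected.append(pi)
    return selected
```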
[Structure Selection] This step involves selecting, from all possible trees, a tree that captures appropriate constraints for localizing the given pose. For localizing poses with partially or fully occluded parts, we can relax some kinematic constraints in the standard tree model of Figure 3(a) and add an approximate neighborhood-cum-non-overlap constraint such that the resulting model is still a tree. For example, consider the pose in Figure 4(a). An alternate model to the standard kinematic model connects the left lower leg to the right lower leg, and results in a better pose estimate than the standard kinematic tree. Since the upper and lower parts of the body are rarely coupled (i.e., kinematically connected or occluding each other), we ignore edges between an arm and a leg. Figure 3(b) shows the edges considered for structure selection.

Figure 4. Pose localization using the Pose-Specific Part Model: (a) image of a person sitting down; (b) selected Pose-Specific Part Model (occluded parts are marked with dotted lines); (c) localized 2D parts obtained using the selected PSPM.

A standard approach for structure selection is to find the tree structure that maximizes the joint likelihood over labeled data [11]. This involves estimating the prior parameters (mean and variance) for all pairs of parts that are connected, and then finding a tree structure with the lowest score (sum of variances over all edges). Since the tree structure that maximizes the joint likelihood may be different for different poses, the standard learning approach would require labeled data for all poses in the action model, from various viewpoints, which is prohibitively large.

In this work, we instead propose a measure for the model score based on the geometry of the pose. To arrive at an appropriate measure, we annotated 2D and 3D poses for 200 images and estimated the tree model with the highest localization score by performing an exhaustive search over all tree-structured models from the graph shown in Figure 3(b). Note that the number of all possible tree models is quite large; to reduce the search space, we consider only those trees which include the kinematic edges and those non-kinematic edges where the connected pair of parts overlap. From our experiments, we observed that for poses with unoccluded parts, the best tree had mostly kinematic edges in it; however, non-kinematic edges were preferred when parts occluded each other. Based on this observation, we propose a score, the localization effect L(e_{ij}) of an edge, which captures the usefulness of that edge toward localizing the given pose. We define the localization effect as the product of the detection accuracy of the part detectors and the degree of occlusion of the connected parts:

L(e_{ij}) = \begin{cases} D(p_i)\,D(p_j)\,\min\{v(p_i), v(p_j)\}, & e_{ij} \in K \\ D(p_i)\,D(p_j)\,\max\{ovlp(p_i, p_j), ovlp(p_j, p_i)\}, & e_{ij} \notin K \end{cases}

where K is the set of kinematic edges, D(p_i) is the detection accuracy of the detector for part p_i, and the min/max term captures the degree of occlusion. The tree selection for accurate localization can then be formulated as a search over the set of edges that maximizes the total localization effect.

Since the localization effect of an edge is independent of the others, the optimal tree structure E^* can be estimated as

E^* = \arg\max_{E \subseteq G} \sum_{ij \in E} L(e_{ij}) \quad \text{s.t. } E \text{ is a tree}    (3)

where G is the graph with all pairwise constraints. Note that Equation 3 can be solved efficiently by finding the maximum spanning tree in the graph G, with L(e_{ij}) as the weight of e_{ij}.
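Since Eq. 3 reduces to a maximum spanning tree problem, the structure selection can be sketched with Kruskal's algorithm; the `D`, `v`, and `ovlp` callables are assumed to come from the part-selection step above, and the function names are illustrative.

```python
def pspm_structure(kin_edges, cand_edges, D, v, ovlp):
    """PSPM structure selection (Eq. 3) as a maximum spanning tree (Kruskal).

    kin_edges:  set of kinematic edges (i, j).
    cand_edges: candidate edges of the full graph of Fig. 3(b).
    D(p):  detection accuracy of the detector for part p.
    v(p):  visibility of part p (Eq. 2).
    ovlp(pi, pj): fraction of part pi occluded by pj.
    Returns the tree edges maximizing the total localization effect.
    """
    def localization_effect(i, j):
        base = D(i) * D(j)
        if (i, j) in kin_edges or (j, i) in kin_edges:
            return base * min(v(i), v(j))            # kinematic edge
        return base * max(ovlp(i, j), ovlp(j, i))    # occlusion edge

    # Kruskal: greedily add the heaviest edges that keep the graph acyclic
    root = {}
    def find(p):
        root.setdefault(p, p)
        while root[p] != p:
            root[p] = root[root[p]]   # path compression
            p = root[p]
        return p

    tree = []
    for i, j in sorted(cand_edges, key=lambda e: localization_effect(*e),
                       reverse=True):
        ri, rj = find(i), find(j)
        if ri != rj:
            root[ri] = rj
            tree.append((i, j))
    return tree
```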
[Estimating Model Prior Θ] We define the pairwise potential using a Gaussian (Section 4.1.2). Previous methods work with an uninformed prior and hence learn the parameters of the Gaussian from labeled data [4]. In our case, where prior knowledge of the pose is available, learning pose-specific parameters is more meaningful. However, learning pose-specific parameters would require a prohibitively large number of pose samples (for all poses from various viewpoints). We therefore estimate these parameters from the prior on the 3D pose. The model parameters, the mean and variance at each joint, are estimated by projecting the 3D pose prior, modeled as Gaussian distributions, to 2D. For example, the mean relative position \mu_{ij} of part i w.r.t. part j is simply the difference between the mid-point of the end-joints of part p_i and that of part p_j.

4.3. Localizing Pose from 3D Action Priors

The action prior includes the 3D prior on the pose, represented with Gaussian distributions (one for each joint), and the approximate position and scale of the person available from the tracker. Given this prior, we obtain an accurate 2D localization of the pose using the PSPM (as described earlier). Note that during inference, we only apply each part detector in the neighborhood of the projected 2D position, orientation and scale of that part. After localizing the pose in 2D, we then estimate the 3D pose from the 2D joint positions. While estimating a 3D pose from 2D joints is ambiguous, in our case the spatial priors on the pose available from the action model and the tracking information help remove such ambiguities. For accurate 3D pose estimation from a 2D pose with known depth ordering of parts, one can estimate the 3D joints using non-linear least squares to fit the 2D estimates while constraining the joints to stay within the pose search space (similar to [30]); in this work, we simply update each joint position, starting from the neck, assuming the 3D lengths of the parts do not change. An initial estimate of the 3D part lengths is obtained by scaling a canonical 3D model in standing pose such that the height of the model matches the observed height of the actor (available from tracking).

5. Experiments

We first demonstrate our pose localization approach using PSPMs on an image dataset with pose annotations. We then evaluate our action recognition algorithm, which uses PSPMs for localization, on two publicly available datasets: Full Body Gestures [17] and Hand Gestures [18]. Compared to the KTH [13], Weizmann, HumanEva [23] and Hand Gestures [18] datasets, which have a clean background and/or few viewpoint variations, the Full Body Gestures set includes videos with cluttered dynamic backgrounds, captured at various viewpoints. We also report results on hand gestures in dynamic scenes.

5.1. Pose Localization

We selected frames from existing action recognition datasets [17, 18] and created a collection of 195 images with a variety of poses. For each image, we annotated the 3D pose of the actor by marking the 2D joint positions and their relative depths, followed by lifting to 3D (similar to the keypose annotations). To quantitatively evaluate pose localization, we computed the average localization score over the visible parts: a part is considered correctly localized if it overlaps more than 50% with the ground truth part. Recall that the pose prior includes approximate 2D scale and position information from the tracker, and the approximate 3D pose (represented as a set of Gaussian distributions over the 3D joint positions). To simulate the noisy prior obtained from the action models, we set the variance of each 3D joint to 5% of the part length. This prior was then used as input to the various localization methods.

We first apply our implementation of Pictorial Structures (PS) [4], which is a tree-structured model with kinematic edges and an uninformed prior. Using Boundary Templates (BT), PS gives a localization accuracy of 44.53%. We then modify PS by applying the part detectors only in the search region provided by the prior and enforcing kinematics using parameters estimated from the prior; we refer to this as CPS (Constrained Pictorial Structures). Applying CPS with Boundary Templates gives a localization accuracy of 63.74%, which, compared to PS, clearly shows the importance of incorporating the pose prior. We then apply our Pose-Specific Part Model with Boundary Templates [20] and achieve a much higher localization accuracy of 71%, which demonstrates the advantage of modeling occlusion-based constraints. We also compare with [17], which uses the Hausdorff distance between the pose boundary and Canny edges as a shape likelihood measure to localize the pose; this approach achieved a lower accuracy of 62.71%.

We test the robustness of our approach to uncertainty in the position and scale of the pose (which is likely to occur during tracking). Figure 5 shows the accuracy plots for the various localization methods against the degree of uncertainty. Notice that localization using PSPM and CPS with Boundary Templates is quite robust to position uncertainty compared to the Hausdorff method.

Using CPS with Geometric Templates and with Boundary Templates gave comparable accuracy scores at low uncertainty, but the Geometric Templates deteriorate as the uncertainty increases; this indicates that Boundary Templates are more robust to noise. Also notice in Figure 5(b) that PSPM with Boundary Templates tolerates small errors (about 10%) in the height estimate. However, PSPM-based localization is substantially slower than localization using the Hausdorff distance.

Figure 5. Localization accuracy of the different approaches (Hausdorff, PS-BT, CPS-BT, CPS-BT-IIP, CPS-GT, PSPM-BT): (a) with uncertainty in position (shown as the ratio of position error to person height); (b) with uncertainty in the height estimate (scale).

5.2. Action Recognition

From the pose localization experiments, we observe that the Hausdorff distance based method localizes well when the predicted pose is not far from the true pose. Thus, for efficiency, we apply PSPMs every 5th frame and use the Hausdorff distance based method for the intermediate frames. In addition, for efficient localization using PSPM, we scale down the image so that the actor is 100 pixels high. Our entire system runs at 1 frame per second on a 3GHz Xeon CPU running Windows/C++ programs. We now present our results on three datasets.

Hand Gestures Dataset [18]: This dataset has 5-6 instances of 12 actions from 8 different actors in an indoor lab setting, for a total of 495 action sequences across all actions. Even though the background is not cluttered, the recognition task is still challenging due to the large number of actions with small pose differences. For evaluation, we train the models on a subset of actors and test on the rest. We compare our approach to [18], which uses a similar joint tracking and recognition approach but uses discrete action duration models and foreground based features for localization and matching. [18] reports recognition rates of 78% and 90% with 1:8 and 3:5 train:test splits respectively. Our algorithm achieves 92% recognition accuracy with a 1:8 train:test split. If we replace the PSPM based localization with the Hausdorff distance based method, the recognition rate drops to 84%. This illustrates that even in clean backgrounds, the use of PSPMs improves action recognition.

Table 1. Evaluation results on the Hand Gestures and USC Gestures datasets.

Dataset         Method                  Train:Test   Recognition %
Hand Gestures   Natarajan et al. [18]   1:8          78%
                Natarajan et al. [18]   3:5          90%
                CAN (Hausdorff)         1:8          84.2%
                CAN (PSPM)              1:8          92%
USC Gestures    SFD-CRF [17]            MoCAP        77.45%
                CAN (PSPM)              1:6          89.5%

Augmented Hand Gestures Dataset: To demonstrate robustness to cluttered dynamic backgrounds, we generated a dataset by embedding 45 action instances from the original dataset [18] into videos with complex dynamic backgrounds (see Figure 6(f-k) for sample images). The dataset has 215 videos, including 3 different actors performing hand gestures in 5 different scenes. Our algorithm achieves 91% recognition accuracy; note that the recognition accuracy on the original 45 videos from [18] that were used for embedding was about 95%. To process these videos, we used the parameters trained on the original hand gestures dataset [18]. In addition, we also collected 25 videos of 4 hand gestures performed in dynamic scenes, with camera shake and/or objects moving in the background. Our algorithm, trained on the original dataset, correctly recognized 20 of these action instances (80% accuracy).
USC Gestures Dataset [17]: This dataset has videos of 6 full body actions, captured at various pan and tilt angles; the actions are sit-on-ground, stand-up-from-ground, sit-on-chair, stand-from-chair, pickup-from-ground and point-forward. We evaluated our approach on the part of this dataset that was captured at 0° tilt in 6 varying backgrounds, including cluttered indoor scenes and outdoor scenes in front of moving vehicles; the rest of the dataset was captured at other tilt angles against a relatively clean, static background. The selected set includes actions captured at 5 different camera pan angles w.r.t. the actor (0°, 45°, 90°, 270°, 315°), for a total of 240 action instances, each performed either by a different actor, at a different pan, or in a different background. For our experiments, we trained our models using 2 action instances from one actor and evaluated on the rest. Note that the models were trained on only 2 viewpoints and tested on 5 different viewpoints. On segmented action instances, our approach achieved an accuracy of 75.91%. Figure 6(n-s) shows sample results. [17] reports an accuracy of 77.35%; however, they assume that the sit-on-chair and sit-on-ground actions are followed by stand-from-chair and stand-from-ground respectively. When we incorporate this information, our action recognition accuracy improves to 89.5%, a 12% improvement over [17].

6. Conclusion

We have presented an approach for joint pose tracking and action recognition in cluttered dynamic environments which has low training requirements and does not require 3D MoCAP data. We achieve this by proposing an accurate and efficient pose localization approach using Pose-Specific Part Models (PSPMs). We have demonstrated that our localization approach is robust to noise and works well in cluttered environments.

Figure 6. Results obtained on the Gesture datasets: (a-e) Hand Gestures [18], (f-m) Augmented Hand Gestures, (n-s) USC Gestures [17]. The estimated pose is overlaid on each image (in red), and the corresponding part distribution obtained by applying the PSPM is shown next to it.

Further, we have also demonstrated our approach for action recognition on hand gestures as well as on the USC Gestures dataset with full body gestures in cluttered and dynamic environments.

Acknowledgements. This research was supported, in part, by the Office of Naval Research under grants #N and #N.

References

[1] M. Andriluka, S. Roth, and B. Schiele. Pictorial structures revisited: People detection and articulated pose estimation. In CVPR, 2009.
[2] M. Andriluka, S. Roth, and B. Schiele. Monocular 3d pose estimation and tracking by detection. In CVPR, 2010.
[3] M. Collins. Discriminative training methods for hidden markov models: Theory and experiments with perceptron algorithms. In EMNLP, 2002.
[4] P. F. Felzenszwalb and D. P. Huttenlocher. Pictorial structures for object recognition. IJCV, 61(1):55-79, 2005.
[5] V. Ferrari, M. J. Marín-Jiménez, and A. Zisserman. Pose search: Retrieving people using their pose. In CVPR, 2009.
[6] Y. Hu, L. Cao, F. Lv, S. Yan, Y. Gong, and T. Huang. Action detection in complex scenes with spatial and temporal ambiguities. In ICCV, 2009.
[7] C. Huang and R. Nevatia. High performance object detection by collaborative learning of joint ranking of granules features. In CVPR, 2010.
[8] N. Ikizler and D. A. Forsyth. Searching video for complex activities with finite state models. In CVPR, 2007.
[9] H. Jiang and D. Martin. Global pose estimation using non-tree models. In CVPR, 2008.
[10] S. X. Ju, M. J. Black, and Y. Yacoob. Cardboard people: A parameterized model of articulated image motion. In FG, 1996.
[11] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
[12] X. Lan and D. P. Huttenlocher. Beyond trees: Common-factor models for 2d human pose recovery. In ICCV, 2005.
[13] I. Laptev. On space-time interest points. IJCV, 64(2-3):107-123, 2005.
[14] F. Lv and R. Nevatia. Single view human action recognition using key pose matching and viterbi path searching. In CVPR, 2007.
[15] R. Messing, C. Pal, and H. Kautz. Activity recognition using the velocity histories of tracked keypoints. In ICCV, 2009.
[16] L.-P. Morency, A. Quattoni, and T. Darrell. Latent-dynamic discriminative models for continuous gesture recognition. In CVPR, 2007.
[17] P. Natarajan and R. Nevatia. View and scale invariant action recognition using multiview shape-flow models. In CVPR, 2008.
[18] P. Natarajan, V. K. Singh, and R. Nevatia. Learning 3d action models from a few 2d videos for view invariant action recognition. In CVPR, 2010.
[19] H. Ning, W. Xu, Y. Gong, and T. S. Huang. Latent pose estimator for continuous action recognition. In ECCV, 2008.
[20] D. Ramanan. Learning to parse images of articulated bodies. In NIPS, 2006.
[21] R. Rosales and S. Sclaroff. Inferring body pose without tracking body parts. In CVPR, 2000.
[22] L. Sigal, A. O. Balan, and M. J. Black. Combined discriminative and generative articulated pose and non-rigid shape estimation. In NIPS, 2007.
[23] L. Sigal and M. J. Black. Measure locally, reason globally: Occlusion-sensitive articulated pose estimation. In CVPR, 2006.
[24] C. Sminchisescu, A. Kanaujia, Z. Li, and D. Metaxas. Conditional random fields for contextual human motion recognition. In ICCV, 2005.
[25] J. Sullivan and S. Carlsson. Recognizing and tracking human action. In ECCV, 2002.
[26] C. Sutton, K. Rohanimanesh, and A. McCallum. Dynamic conditional random fields: Factorized probabilistic models for labeling and segmenting sequence data. In ICML, 2004.
[27] T.-P. Tian and S. Sclaroff. Fast globally optimal 2d human detection with loopy graph models. In CVPR, 2010.
[28] R. Urtasun, D. J. Fleet, and P. Fua. 3d people tracking with gaussian process dynamical models. In CVPR, 2006.
[29] Y. Wang and G. Mori. Multiple tree models for occlusion and spatial constraints in human pose estimation. In ECCV, 2008.
[30] X. K. Wei and J. Chai. Modeling 3d human poses from uncalibrated monocular images. In ICCV, 2009.
[31] D. Weinland, E. Boyer, and R. Ronfard. Action recognition from arbitrary views using 3d exemplars. In ICCV, 2007.


More information

Face detection in a video sequence - a temporal approach

Face detection in a video sequence - a temporal approach Face detection in a video sequence - a temporal approach K. Mikolajczyk R. Choudhury C. Schmid INRIA Rhône-Alpes GRAVIR-CNRS, 655 av. de l Europe, 38330 Montbonnot, France {Krystian.Mikolajczyk,Ragini.Choudhury,Cordelia.Schmid}@inrialpes.fr

More information

Learning Articulated Skeletons From Motion

Learning Articulated Skeletons From Motion Learning Articulated Skeletons From Motion Danny Tarlow University of Toronto, Machine Learning with David Ross and Richard Zemel (and Brendan Frey) August 6, 2007 Point Light Displays It's easy for humans

More information

Efficient Detector Adaptation for Object Detection in a Video

Efficient Detector Adaptation for Object Detection in a Video 2013 IEEE Conference on Computer Vision and Pattern Recognition Efficient Detector Adaptation for Object Detection in a Video Pramod Sharma and Ram Nevatia Institute for Robotics and Intelligent Systems,

More information

Action recognition in videos

Action recognition in videos Action recognition in videos Cordelia Schmid INRIA Grenoble Joint work with V. Ferrari, A. Gaidon, Z. Harchaoui, A. Klaeser, A. Prest, H. Wang Action recognition - goal Short actions, i.e. drinking, sit

More information

Located Hidden Random Fields: Learning Discriminative Parts for Object Detection

Located Hidden Random Fields: Learning Discriminative Parts for Object Detection Located Hidden Random Fields: Learning Discriminative Parts for Object Detection Ashish Kapoor 1 and John Winn 2 1 MIT Media Laboratory, Cambridge, MA 02139, USA kapoor@media.mit.edu 2 Microsoft Research,

More information

Multiple-Person Tracking by Detection

Multiple-Person Tracking by Detection http://excel.fit.vutbr.cz Multiple-Person Tracking by Detection Jakub Vojvoda* Abstract Detection and tracking of multiple person is challenging problem mainly due to complexity of scene and large intra-class

More information

Recognition Rate. 90 S 90 W 90 R Segment Length T

Recognition Rate. 90 S 90 W 90 R Segment Length T Human Action Recognition By Sequence of Movelet Codewords Xiaolin Feng y Pietro Perona yz y California Institute of Technology, 36-93, Pasadena, CA 925, USA z Universit a dipadova, Italy fxlfeng,peronag@vision.caltech.edu

More information

CS 231A Computer Vision (Winter 2014) Problem Set 3

CS 231A Computer Vision (Winter 2014) Problem Set 3 CS 231A Computer Vision (Winter 2014) Problem Set 3 Due: Feb. 18 th, 2015 (11:59pm) 1 Single Object Recognition Via SIFT (45 points) In his 2004 SIFT paper, David Lowe demonstrates impressive object recognition

More information

A Hierarchical Compositional System for Rapid Object Detection

A Hierarchical Compositional System for Rapid Object Detection A Hierarchical Compositional System for Rapid Object Detection Long Zhu and Alan Yuille Department of Statistics University of California at Los Angeles Los Angeles, CA 90095 {lzhu,yuille}@stat.ucla.edu

More information

People-Tracking-by-Detection and People-Detection-by-Tracking

People-Tracking-by-Detection and People-Detection-by-Tracking People-Tracking-by-Detection and People-Detection-by-Tracking Mykhaylo Andriluka Stefan Roth Bernt Schiele Computer Science Department TU Darmstadt, Germany {andriluka, sroth, schiele}@cs.tu-darmstadt.de

More information

3D Pictorial Structures for Multiple Human Pose Estimation

3D Pictorial Structures for Multiple Human Pose Estimation 3D Pictorial Structures for Multiple Human Pose Estimation Vasileios Belagiannis 1, Sikandar Amin 2,3, Mykhaylo Andriluka 3, Bernt Schiele 3, Nassir Navab 1, and Slobodan Ilic 1 1 Computer Aided Medical

More information

CS 223B Computer Vision Problem Set 3

CS 223B Computer Vision Problem Set 3 CS 223B Computer Vision Problem Set 3 Due: Feb. 22 nd, 2011 1 Probabilistic Recursion for Tracking In this problem you will derive a method for tracking a point of interest through a sequence of images.

More information

Improved Human Parsing with a Full Relational Model

Improved Human Parsing with a Full Relational Model Improved Human Parsing with a Full Relational Model Duan Tran and David Forsyth University of Illinois at Urbana-Champaign, USA {ddtran2,daf}@illinois.edu Abstract. We show quantitative evidence that a

More information

Separating Objects and Clutter in Indoor Scenes

Separating Objects and Clutter in Indoor Scenes Separating Objects and Clutter in Indoor Scenes Salman H. Khan School of Computer Science & Software Engineering, The University of Western Australia Co-authors: Xuming He, Mohammed Bennamoun, Ferdous

More information

Modeling 3D Human Poses from Uncalibrated Monocular Images

Modeling 3D Human Poses from Uncalibrated Monocular Images Modeling 3D Human Poses from Uncalibrated Monocular Images Xiaolin K. Wei Texas A&M University xwei@cse.tamu.edu Jinxiang Chai Texas A&M University jchai@cse.tamu.edu Abstract This paper introduces an

More information

3D Spatial Layout Propagation in a Video Sequence

3D Spatial Layout Propagation in a Video Sequence 3D Spatial Layout Propagation in a Video Sequence Alejandro Rituerto 1, Roberto Manduchi 2, Ana C. Murillo 1 and J. J. Guerrero 1 arituerto@unizar.es, manduchi@soe.ucsc.edu, acm@unizar.es, and josechu.guerrero@unizar.es

More information

Introduction to behavior-recognition and object tracking

Introduction to behavior-recognition and object tracking Introduction to behavior-recognition and object tracking Xuan Mo ipal Group Meeting April 22, 2011 Outline Motivation of Behavior-recognition Four general groups of behaviors Core technologies Future direction

More information

Chapter 9 Object Tracking an Overview

Chapter 9 Object Tracking an Overview Chapter 9 Object Tracking an Overview The output of the background subtraction algorithm, described in the previous chapter, is a classification (segmentation) of pixels into foreground pixels (those belonging

More information

Better appearance models for pictorial structures

Better appearance models for pictorial structures EICHNER, FERRARI: BETTER APPEARANCE MODELS FOR PICTORIAL STRUCTURES 1 Better appearance models for pictorial structures Marcin Eichner eichner@vision.ee.ethz.ch Vittorio Ferrari ferrari@vision.ee.ethz.ch

More information

An Adaptive Appearance Model Approach for Model-based Articulated Object Tracking

An Adaptive Appearance Model Approach for Model-based Articulated Object Tracking An Adaptive Appearance Model Approach for Model-based Articulated Object Tracking Alexandru O. Bălan Michael J. Black Department of Computer Science - Brown University Providence, RI 02912, USA {alb, black}@cs.brown.edu

More information

Measure Locally, Reason Globally: Occlusion-sensitive Articulated Pose Estimation

Measure Locally, Reason Globally: Occlusion-sensitive Articulated Pose Estimation Measure Locally, Reason Globally: Occlusion-sensitive Articulated Pose Estimation Leonid Sigal Michael J. Black Department of Computer Science, Brown University, Providence, RI 02912 {ls,black}@cs.brown.edu

More information

Sports Field Localization

Sports Field Localization Sports Field Localization Namdar Homayounfar February 5, 2017 1 / 1 Motivation Sports Field Localization: Have to figure out where the field and players are in 3d space in order to make measurements and

More information

Detecting and Parsing of Visual Objects: Humans and Animals. Alan Yuille (UCLA)

Detecting and Parsing of Visual Objects: Humans and Animals. Alan Yuille (UCLA) Detecting and Parsing of Visual Objects: Humans and Animals Alan Yuille (UCLA) Summary This talk describes recent work on detection and parsing visual objects. The methods represent objects in terms of

More information

Efficient Extraction of Human Motion Volumes by Tracking

Efficient Extraction of Human Motion Volumes by Tracking Efficient Extraction of Human Motion Volumes by Tracking Juan Carlos Niebles Princeton University, USA Universidad del Norte, Colombia jniebles@princeton.edu Bohyung Han Electrical and Computer Engineering

More information

Human Action Recognition Using Dynamic Time Warping and Voting Algorithm (1)

Human Action Recognition Using Dynamic Time Warping and Voting Algorithm (1) VNU Journal of Science: Comp. Science & Com. Eng., Vol. 30, No. 3 (2014) 22-30 Human Action Recognition Using Dynamic Time Warping and Voting Algorithm (1) Pham Chinh Huu *, Le Quoc Khanh, Le Thanh Ha

More information

Poselet Conditioned Pictorial Structures

Poselet Conditioned Pictorial Structures Poselet Conditioned Pictorial Structures Leonid Pishchulin 1 Mykhaylo Andriluka 1 Peter Gehler 2 Bernt Schiele 1 1 Max Planck Institute for Informatics, Saarbrücken, Germany 2 Max Planck Institute for

More information

CRF Based Point Cloud Segmentation Jonathan Nation

CRF Based Point Cloud Segmentation Jonathan Nation CRF Based Point Cloud Segmentation Jonathan Nation jsnation@stanford.edu 1. INTRODUCTION The goal of the project is to use the recently proposed fully connected conditional random field (CRF) model to

More information

CS201: Computer Vision Introduction to Tracking

CS201: Computer Vision Introduction to Tracking CS201: Computer Vision Introduction to Tracking John Magee 18 November 2014 Slides courtesy of: Diane H. Theriault Question of the Day How can we represent and use motion in images? 1 What is Motion? Change

More information

Virtual Training for Multi-View Object Class Recognition

Virtual Training for Multi-View Object Class Recognition Virtual Training for Multi-View Object Class Recognition Han-Pang Chiu Leslie Pack Kaelbling Tomás Lozano-Pérez MIT Computer Science and Artificial Intelligence Laboratory Cambridge, MA 02139, USA {chiu,lpk,tlp}@csail.mit.edu

More information

Object Pose Detection in Range Scan Data

Object Pose Detection in Range Scan Data Object Pose Detection in Range Scan Data Jim Rodgers, Dragomir Anguelov, Hoi-Cheung Pang, Daphne Koller Computer Science Department Stanford University, CA 94305 {jimkr, drago, hcpang, koller}@cs.stanford.edu

More information

Category-level localization

Category-level localization Category-level localization Cordelia Schmid Recognition Classification Object present/absent in an image Often presence of a significant amount of background clutter Localization / Detection Localize object

More information

Scene Grammars, Factor Graphs, and Belief Propagation

Scene Grammars, Factor Graphs, and Belief Propagation Scene Grammars, Factor Graphs, and Belief Propagation Pedro Felzenszwalb Brown University Joint work with Jeroen Chua Probabilistic Scene Grammars General purpose framework for image understanding and

More information

Part-based and local feature models for generic object recognition

Part-based and local feature models for generic object recognition Part-based and local feature models for generic object recognition May 28 th, 2015 Yong Jae Lee UC Davis Announcements PS2 grades up on SmartSite PS2 stats: Mean: 80.15 Standard Dev: 22.77 Vote on piazza

More information

Segmentation and Tracking of Multiple Humans in Complex Situations Λ

Segmentation and Tracking of Multiple Humans in Complex Situations Λ Segmentation and Tracking of Multiple Humans in Complex Situations Λ Tao Zhao Ram Nevatia Fengjun Lv University of Southern California Institute for Robotics and Intelligent Systems Los Angeles CA 90089-0273

More information

Introduction to SLAM Part II. Paul Robertson

Introduction to SLAM Part II. Paul Robertson Introduction to SLAM Part II Paul Robertson Localization Review Tracking, Global Localization, Kidnapping Problem. Kalman Filter Quadratic Linear (unless EKF) SLAM Loop closing Scaling: Partition space

More information

Joint Vanishing Point Extraction and Tracking. 9. June 2015 CVPR 2015 Till Kroeger, Dengxin Dai, Luc Van Gool, Computer Vision ETH Zürich

Joint Vanishing Point Extraction and Tracking. 9. June 2015 CVPR 2015 Till Kroeger, Dengxin Dai, Luc Van Gool, Computer Vision ETH Zürich Joint Vanishing Point Extraction and Tracking 9. June 2015 CVPR 2015 Till Kroeger, Dengxin Dai, Luc Van Gool, Computer Vision Lab @ ETH Zürich Definition: Vanishing Point = Intersection of 2D line segments,

More information

Monocular Human Motion Capture with a Mixture of Regressors. Ankur Agarwal and Bill Triggs GRAVIR-INRIA-CNRS, Grenoble, France

Monocular Human Motion Capture with a Mixture of Regressors. Ankur Agarwal and Bill Triggs GRAVIR-INRIA-CNRS, Grenoble, France Monocular Human Motion Capture with a Mixture of Regressors Ankur Agarwal and Bill Triggs GRAVIR-INRIA-CNRS, Grenoble, France IEEE Workshop on Vision for Human-Computer Interaction, 21 June 2005 Visual

More information

Accurate 3D Face and Body Modeling from a Single Fixed Kinect

Accurate 3D Face and Body Modeling from a Single Fixed Kinect Accurate 3D Face and Body Modeling from a Single Fixed Kinect Ruizhe Wang*, Matthias Hernandez*, Jongmoo Choi, Gérard Medioni Computer Vision Lab, IRIS University of Southern California Abstract In this

More information

Computer Vision II Lecture 14

Computer Vision II Lecture 14 Computer Vision II Lecture 14 Articulated Tracking I 08.07.2014 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Outline of This Lecture Single-Object Tracking Bayesian

More information

CHAPTER 6 PERCEPTUAL ORGANIZATION BASED ON TEMPORAL DYNAMICS

CHAPTER 6 PERCEPTUAL ORGANIZATION BASED ON TEMPORAL DYNAMICS CHAPTER 6 PERCEPTUAL ORGANIZATION BASED ON TEMPORAL DYNAMICS This chapter presents a computational model for perceptual organization. A figure-ground segregation network is proposed based on a novel boundary

More information

Supplemental Material: Discovering Groups of People in Images

Supplemental Material: Discovering Groups of People in Images Supplemental Material: Discovering Groups of People in Images Wongun Choi 1, Yu-Wei Chao 2, Caroline Pantofaru 3 and Silvio Savarese 4 1. NEC Laboratories 2. University of Michigan, Ann Arbor 3. Google,

More information