Action Recognition in Cluttered Dynamic Scenes using Pose-Specific Part Models


Vivek Kumar Singh, University of Southern California, Los Angeles, CA, USA
Ram Nevatia, University of Southern California, Los Angeles, CA, USA

Abstract

We present an approach to recognizing single actor human actions in complex backgrounds. We adopt a Joint Tracking and Recognition approach, which tracks the actor pose by sampling from 3D action models. Most existing such approaches require large training data or MoCAP to handle multiple viewpoints, and often rely on clean actor silhouettes. The action models in our approach are obtained by annotating keyposes in 2D, lifting them to 3D stick figures and then computing the transformation matrices between the 3D keypose figures. Poses sampled from coarse action models may not fit the observations well; to overcome this difficulty, we propose an approach for efficiently localizing a pose by generating a Pose-Specific Part Model (PSPM) which captures appropriate kinematic and occlusion constraints in a tree structure. In addition, our approach does not require pose silhouettes. We show improvements over previous results on two publicly available datasets as well as on a novel, augmented dataset with dynamic backgrounds.

1. Introduction

The objective of this work is to recognize single actor human actions in videos captured from a single camera. This has been a popular research topic over the past few years, as effective solutions to this problem find applications in surveillance, HCI, and video retrieval, among others. Existing action recognition methods work well under variations in actor appearance; however, handling viewpoint variations with low training requirements, and dealing with cluttered dynamic backgrounds, is still a challenge. While view-invariant approaches using 3D models have been proposed, they either require 3D MoCAP for learning models and/or require videos from multiple viewpoints. We present a simultaneous tracking and recognition approach which tracks the actor pose by sampling from 3D action models and localizing each pose sample; this allows view-invariant action recognition. To deal with cluttered dynamic backgrounds, we accurately localize each pose using a 2D part model.

We model an action as a sequence of transformations between keyposes. These action models can be obtained by annotating keyposes in 2D, lifting them to 3D stick figures and then computing the transformation matrices between the 3D keyposes [18]; this avoids large training data and MoCAP. However, poses sampled from such coarse models do not match observations well. Thus, during inference, errors due to pose approximation and observation noise would accumulate over time and result in tracking failures and lower recognition rates, especially in cluttered dynamic scenes. We address these issues by a more accurate localization of the human pose using a 2D part model with kinematic constraints. Such models have been successfully applied to localize human pose in cluttered images, under the assumption that the parts are not occluded [4, 20, 1]. However, poses often have multiple occluded parts, and hence modeling inter-part occlusions is useful for accurately localizing such poses. Existing methods such as [9, 27] that model such constraints are too inefficient for tracking and recognition, where multiple poses may need to be localized every few frames.
We propose a novel framework to select a tree-structured model that captures appropriate kinematic and inter-part occlusion constraints for a particular pose in order to accurately localize that pose; we call this model the Pose-Specific Part Model (PSPM). To determine the PSPM for a given pose, we search over many possible tree models and select the model with the highest localizability score. We demonstrate our approach on two publicly available datasets: Full Body Gestures [17], with 6 actions captured from multiple viewpoints in cluttered, dynamic backgrounds, and Hand Gestures [18], with 12 actions with subtle pose variations in a static background. To further demonstrate robustness to background changes, we evaluate our method on an augmented Hand Gestures set with 25 real sequences with camera shake and background object motion, and 215 sequences with embedded dynamic backgrounds. We also evaluate localization using PSPM on an image dataset with different poses and backgrounds, and show improvements over the standard Pictorial Structures [4] and other pose localization methods.

In the rest of the paper, we first review related work in Section 2. We then present the action representation and inference in Section 3. Next, we describe pose localization from 3D priors using the Pose-Specific Part Model (PSPM) in Section 4, followed by the results.

2. Related Work

A natural approach to recognizing actions is to first estimate the body pose and then infer the action based on the pose dynamics [8, 5]. However, the effectiveness of such approaches depends on reliable human pose tracking methods. A popular approach is to avoid pose tracking and directly match image descriptors to the action models by learning action classifiers, using SVMs [13] or graphical models such as CRFs [16] and LDA [15]; however, it is difficult to capture temporal relationships in such models. Furthermore, these methods typically require a large amount of training data from multiple viewpoints.

Another approach is to simultaneously track the pose and recognize the action; we refer to these as Joint-Tracking-and-Recognition methods. These methods learn action models that capture the evolution of the actor pose in 3D and, during inference, use the action priors for tracking the pose and the estimated pose to recognize actions. While these methods work well across viewpoints, most of them require 3D MoCAP data for learning accurate models [28, 22, 19, 17] and/or rely on person silhouettes for localization and matching [21, 24, 23, 31, 19, 6, 18], which assumes a static background. Recently, [18] proposed a multi-view approach without using MoCAP, by learning 3D action models from 2D keypose annotations and recognizing actions by matching poses sampled from the action models to actor silhouettes. However, poses sampled from such coarse models result in large matching errors which accumulate over time and significantly affect recognition, especially in cluttered scenes. In our work, we address this issue by accurately localizing the pose using part models.

An important aspect of Joint-Tracking-and-Recognition methods is to reliably localize/match the pose to the video. Recently, part-based graphical models (pictorial structures [4]) have been shown to accurately localize 2D poses in complex backgrounds [1, 20], but they do not model inter-part occlusion. Localization of poses with inter-part occlusions requires simultaneous modeling of body kinematics and inter-part occlusion, which makes inference hard. Existing approaches model such constraints using common-factor models [12] or multiple trees [29], or represent them in a kinematic graph (with cycles) and infer the pose using non-parametric message passing [23] or branch-and-bound [9, 27]. However, these methods either use person silhouettes [23], require training data from all viewpoints [12], or are too inefficient [9, 27] for tracking. Recently, [2] trained multiple view-specific models for estimating pose in the walking action; however, for multiple actions, a large number of models would need to be trained.

3. Action Recognition

In this work, we build on the Joint-Tracking-and-Recognition approach of combining Tracking-by-Priors and Recognition-by-Tracking. For each action, we obtain an approximate model of the human pose dynamics in a scale- and pan-normalized 3D space; this allows a scale- and viewpoint-invariant representation. This is done by scaling the poses to a fixed known height.
For inference, we match image observations to the action models by tracking using a 3D human model in the action-restricted pose space, and find the action with the highest matching score [31, 14, 18]. Here, we first present the action representation and model learning, followed by action and pose inference. The pose localization is described later, in Section 4.

3.1. Representation and Learning

We learn a separate model for each action that captures the dynamics of the human pose. Our models are based on the concept that a single actor human action can be represented as a sequence of linear transformations between a few, representative keyposes. Our action model is inspired by [18], which refers to the linear transformation between a keypose pair as a primitive. For example, the walking action can be represented with four primitives: left leg forward → right leg crosses left leg → right leg forward → left leg crosses right leg. Note that each primitive is a conjunction of rotations of body parts (e.g., during walking, rotation of the upper leg about the hip and rotation of the lower leg about the knee), and thus can be represented as a linear transformation in joint-angle space. This is illustrated in Figure 1.

Figure 1. Geometric interpretation of the action model for walking, in the scale-normalized joint-angle space; dotted red curves denote different instances of the walking action; the piecewise linear curve (in gray) denotes the learnt action model, with keyposes marked as circles (in black).
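To make the primitive representation concrete, the following is a minimal sketch of sampling a pose along a primitive, assuming poses are stored as flat joint-angle vectors; the names and the optional noise term are illustrative, not taken from the paper's implementation.

```python
import numpy as np

def sample_pose_on_primitive(keypose_a, keypose_b, f, joint_sigma=0.0):
    """Sample a pose a fraction f of the way along a primitive.

    keypose_a, keypose_b: flat joint-angle vectors of the two keyposes.
    f:           fraction of the primitive elapsed, in [0, 1].
    joint_sigma: optional per-joint Gaussian noise, mimicking the per-joint
                 Gaussian keypose model (an assumption of this sketch).
    """
    f = float(np.clip(f, 0.0, 1.0))
    pose = (1.0 - f) * keypose_a + f * keypose_b    # linear in joint angles
    if joint_sigma > 0.0:
        pose = pose + np.random.normal(0.0, joint_sigma, size=pose.shape)
    return pose

# e.g. halfway through the "left leg forward -> legs crossed" primitive:
# pose = sample_pose_on_primitive(k_left_forward, k_legs_crossed, f=0.5)
```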

To capture the variations in keyposes across different instances of the same action, we model each keypose by a set of Gaussian distributions, one for every 3D joint position. For speed variations, we model the length of each primitive with a truncated sigmoid function. We normalize each primitive to unit length and learn a Gaussian over the fraction of the primitive that gets covered at each time step. Thus, an action with N_k keyposes is modeled by a set of N_k (N_j + 1) Gaussians, where N_j is the number of 3D joints (= 15).

We learn action models by annotating 2D poses and the primitive action boundaries in the training videos. For each action model, we first manually select the set of keyposes; intuitively, we select a keypose whenever there is a big change in the pose dynamics. Alternatively, if 3D MoCAP is available, keyposes can be obtained automatically as discontinuities in the pose energy [14]. We then learn the 3D model for each keypose from the 2D annotations by lifting, using our implementation of [30]. For each primitive, we obtain the expected change in duration by collecting primitive lengths from the action boundary annotations and fitting a Gaussian.

3.2. Conditional Action Network

Given the action models, we embed them into a Dynamic Conditional Random Field [26], which we refer to as the Conditional Action Network (CAN), illustrated in Figure 2.

Figure 2. Conditional Action Network

We define the state s_t of the CAN at time t by a tuple of action and pose variables (s^{act}_t, s^{pose}_t); the action state s^{act}_t = (a_t, p_t, f_t) includes the action label a_t, the current primitive p_t and the fraction of the primitive elapsed f_t, and the pose state s^{pose}_t = x_t is the current pose x_t. To infer the action from an observation sequence of length T, we estimate the optimal state sequence over all actions by maximizing the log-linear likelihood, which takes the following form:

s^{best}_{[1:T]} = \arg\max_{s_{[1:T]}} \sum_{t=1}^{T} \Big\{ \sum_{f=1}^{n} w_f \, \phi_f(s_t, s_{t-1}, I_t) \Big\}

where \phi_f(s_t, s_{t-1}, I_t) are the observation and transition potentials and w = \{w_f\} is the weight vector, one weight for each potential function.

[Transition Potentials] The action transition potential \phi(a_t, f_t, a_{t-1}, f_{t-1}) is modeled as a truncated sigmoid function over the fraction of the primitive elapsed f_t, such that the probability of staying in the same primitive p_t decreases as f_t approaches 1 and the probability of transitioning to a new primitive increases. The pose transition potential \phi(x_t, x_{t-1}) is modeled using a Normal distribution N(0, \sigma) over the displacement of the neck position and the height h_t.

[Observation Potentials] We compute the observation likelihood of a pose sample x_t, sampled from the action-pose potential \phi(a_t, f_t, x_t), by combining shape and motion likelihoods. We first localize the pose using a part-based model which is generated from the spatial prior available from the action model and handles constraints due to occlusion. We then compute the shape likelihood as the normalized log likelihood of the parts used in the model; the details of this step are described in Section 4:

\phi_{shape}(x) = \frac{1}{|P|} \sum_{i \in P} \phi_i(x_i, I_t)

where P is the set of parts in the pose model and x_i is the i-th part of pose x. The motion likelihood is computed by matching the observed optical flow against the direction of motion of each part, using the cosine distance. We use the Lucas-Kanade algorithm (in OpenCV 1.0) for computing optical flow and quantize the flow into 8 orientation bins.
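As a rough illustration of the motion likelihood, the sketch below scores a part's predicted motion direction against the quantized optical flow. The paper used the Lucas-Kanade algorithm in OpenCV 1.0; this sketch substitutes OpenCV's dense Farneback flow as a stand-in, and the magnitude weighting and [0, 1] mapping are assumptions of the sketch.

```python
import numpy as np
import cv2

def motion_likelihood(prev_gray, cur_gray, part_mask, part_dir):
    """Score a part's predicted motion direction against the observed flow.

    prev_gray, cur_gray: consecutive grayscale frames (uint8).
    part_mask: boolean mask of the pixels covered by the part.
    part_dir:  predicted 2D unit motion direction of the part.
    Returns a score in [0, 1] from the cosine similarity between the
    predicted direction and the (8-bin quantized) flow directions.
    """
    # dense Farneback flow as a stand-in for the paper's Lucas-Kanade flow
    flow = cv2.calcOpticalFlowFarneback(prev_gray, cur_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    fx = flow[..., 0][part_mask]
    fy = flow[..., 1][part_mask]
    mag = np.hypot(fx, fy)
    ang = np.arctan2(fy, fx) % (2 * np.pi)
    # quantize the flow into 8 orientation bins, as in the paper
    bins = (ang / (2 * np.pi / 8)).astype(int) % 8
    bin_dir = (bins + 0.5) * (2 * np.pi / 8)          # bin-center direction
    cos_sim = np.cos(bin_dir) * part_dir[0] + np.sin(bin_dir) * part_dir[1]
    weights = mag / (mag.sum() + 1e-6)                # weight by flow strength
    return float(0.5 * (1.0 + (weights * cos_sim).sum()))  # [-1,1] -> [0,1]
```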
[Weight Learning] We assume uniform transition weights across different actions/primitives, and hence weight learning only involves learning 3 weight values, one for each potential. In this work, we use the Voted Perceptron algorithm [3] due to its efficiency and ease of implementation. The ground truth pose estimates for all frames were obtained by running our inference with the known action label for the sequence.
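A minimal sketch of how such perceptron-based weight learning could look, assuming helper routines `decode` (Viterbi decoding of the CAN under given weights) and `feature_sums` (summed potential values along a state sequence) exist; averaging the weights is used here as the usual approximation to voting.

```python
import numpy as np

def averaged_perceptron(sequences, decode, feature_sums, epochs=5):
    """Learn the three potential weights with Collins' perceptron [3].

    sequences:    list of (observations, gold_state_sequence) pairs.
    decode(w, obs):          best state sequence under weights w (Viterbi
                             over the CAN) -- assumed to exist.
    feature_sums(obs, seq):  3-vector of summed potential values along a
                             state sequence -- assumed to exist.
    """
    w = np.zeros(3)
    w_sum, n = np.zeros(3), 0
    for _ in range(epochs):
        for obs, gold in sequences:
            pred = decode(w, obs)
            if pred != gold:
                # push the weights toward the gold sequence's features
                w = w + feature_sums(obs, gold) - feature_sums(obs, pred)
            w_sum, n = w_sum + w, n + 1
    return w_sum / n       # averaged weights approximate the voted variant
```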

3.3. Tracking and Recognition

Since our action models are continuous and our graphical model has cycles, exact inference is infeasible. Thus, we use a particle filtering approach [22, 25], sampling poses from the action models and matching each pose to the scene observations. During tracking, we first find the person by applying a full-body and a head-shoulder pedestrian detector [7]; multiple detectors help achieve reliable detection, especially in complex scenes. We then uniformly sample poses from the action models and localize the poses to fit the observations, using the approximate position (neck) and scale (person standing height) available from the detection responses. The details of the localization method are described in Section 4. For viewpoint invariance, poses are matched to the observations at various pan angles.

To propagate each sample s_t over time, we increment f_t (the fraction of the primitive elapsed) to obtain the next action state s^{act}_{t+1}; note that if f_t is toward the end of a primitive, the next state may transition to the next primitive or action. We then perturb the position and scale of the person, and obtain the next pose by localizing the pose to the observations; note that the localization step takes into account the spatial prior on the pose from the action model (a_{t+1}, p_{t+1}, f_{t+1}). During actions that are performed while standing at the same location, such as sitting on the ground, we impose a constraint that the feet of the person remain on the ground at roughly the same location (using a penalty function modeled as a zero-mean Gaussian). This constraint makes our tracker more robust to drift. The best state sequence from the state distribution over all frames is then obtained using the Viterbi algorithm.
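The following sketch illustrates how a single particle's action state (a_t, p_t, f_t) might be advanced, under an assumed Gaussian step model and a truncated-sigmoid transition; the constants and the wrap-around behavior are illustrative, not the paper's exact scheme.

```python
import math
import random

def propagate_action_state(a, p, f, num_primitives, mean_len, step_sigma, k=10.0):
    """Advance one particle's action state (a_t, p_t, f_t) by one frame.

    mean_len:   expected primitive length in frames, so the mean per-frame
                step in f is 1 / mean_len (the paper learns a Gaussian over
                the fraction covered per time step).
    k:          sigmoid steepness -- an illustrative constant.
    """
    f += max(random.gauss(1.0 / mean_len, step_sigma), 0.0)
    # truncated-sigmoid transition: staying put gets less likely as f -> 1
    p_transition = 1.0 / (1.0 + math.exp(-k * (f - 1.0)))
    if random.random() < p_transition:
        p = (p + 1) % num_primitives   # advance to the next primitive
        f = 0.0                        # (transition to another action a is
                                       #  handled analogously; omitted here)
    return a, p, min(f, 1.0)
```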
4. Accurate Pose Localization from 3D Priors

In this section, we present our approach to accurately localize a hypothesized pose (from the action model) to the image observations. Given prior information such as scale and position, localization involves searching through the pose space to infer the pose that best describes the image evidence. In our setting, where the pose is being tracked using approximate action models, the prior on the pose includes coarse 2D position and scale information and the pose subspace which is likely to include the true pose. It is natural to assume that in cluttered environments, the 2D position and scale priors may be quite noisy. Furthermore, the pose subspace induced by the action model can be large, especially for fast moving parts, e.g. the hands during waving.

For efficient localization, we first project the 3D pose search space onto the 2D image to obtain a spatial prior on the 2D pose, then localize the 2D pose using image observations, and finally estimate the 3D pose from the aligned 2D pose. For 2D pose localization, we use a part-based graphical model approach (similar to pictorial structures [4, 20, 1]) which represents the human body by its constituent parts (see Figure 3(a)) and imposes pairwise constraints over the parts during inference. These pairwise constraints model the kinematic and/or the inter-part occlusion relationships between the parts; however, when all such constraints are imposed, the graphical model has loops (see Figure 3(b)). Attempts have been made to infer pose using models with loops, but these tend to be computationally expensive [9, 27]. Thus, for efficient and exact inference, tree-structured models are preferred. We develop an approach to automatically select a tree-structured model that is most likely to give an accurate localization for a given pose, by leveraging the fact that under occlusion some kinematic constraints may be relaxed in order to model constraints that would be more effective for localization; we call this model the Pose-Specific Part Model (PSPM).

Next, we first present 2D pose localization using a tree-structured part model. We then describe the PSPM selection and learning, followed by 3D pose localization using the PSPM.

Figure 3. Graphical models for the 2D pose: (a) kinematic tree model [4]; (b) graph with edges modeling both kinematic and inter-part occlusion constraints; observation nodes are not shown for clarity.

4.1. Localizing 2D Pose using a Part Model

In a 2D pose model, each part is represented as a node and the edges represent pairwise constraints between the parts. During inference, detectors for all parts are applied independently on the image, and then the best pose x is obtained by maximizing the joint likelihood given by

p(x, I | \Theta) = \prod_{i \in P} p(I | x_i, \Theta^s_i) \prod_{ij \in E} p(x_i | x_j, \Theta^p_{ij})    (1)

where x_i denotes part i; (P, E) is the graphical model over the parts P; p(I | x_i, \Theta^s_i) represents the likelihood of the part hypothesis x_i obtained by applying the part detector; p(x_i | x_j, \Theta^p_{ij}) represents the pairwise constraints; and \Theta = (\Theta^s, \Theta^p) are the model priors for the unary and pairwise potentials. A commonly used 2D pose model [4, 20, 1] assumes a tree structure, as efficient and exact inference can then be performed [4].

4.1.1 Part Detection

Recently, [1] reported that better part detectors can significantly improve localization results; however, better part detectors are also computationally expensive. Thus, in this work, we experiment with two types of detectors that can be applied efficiently and have previously been used for localizing 2D body parts: geometric templates [10] and boundary and region templates [20]. We briefly describe the part detectors here.

[Geometric Templates] Each part is modeled with a simple geometric object: the head with an ellipse, the torso with an oriented rectangle, and each arm with a pair of line segments. The log likelihood score of a part is obtained by accumulating the edge strength and orientation match over the boundary points.

[Boundary and Region Templates] Each template is a weighted sum of oriented bar filters, where the weights are obtained by maximizing the conditional joint likelihood [20]. We use the detectors provided by the authors.
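For concreteness, below is a generic sketch of exact MAP inference on a tree-structured part model over discretized part placements, i.e. the maximization of Eq. 1 in log space. It is the textbook pictorial-structures computation, not the authors' implementation (which, following [4], can be accelerated with distance transforms); all names are illustrative.

```python
import numpy as np

def tree_map_inference(parent, root, unary, pairwise):
    """Exact MAP inference on a tree-structured part model (Eq. 1, in logs).

    parent:   dict mapping each non-root part to its parent part.
    unary[i]: 1-D array of log detector scores over candidate placements of i.
    pairwise[(p, c)]: matrix of log pairwise scores, indexed
                      [placement of parent p, placement of child c]
                      (e.g. a Gaussian over relative position/orientation).
    Returns {part: index of its best placement}.
    """
    children = {}
    for c, p in parent.items():
        children.setdefault(p, []).append(c)

    belief, back = {}, {}

    def upward(i):
        # leaf-to-root pass: fold each child's best response into part i
        b = unary[i].astype(float)
        for c in children.get(i, []):
            upward(c)
            tab = pairwise[(i, c)] + belief[c][None, :]
            back[c] = tab.argmax(axis=1)   # best child placement per placement of i
            b = b + tab.max(axis=1)
        belief[i] = b

    upward(root)
    best = {root: int(belief[root].argmax())}

    def downward(i):
        # root-to-leaf pass: read off the argmax placements
        for c in children.get(i, []):
            best[c] = int(back[c][best[i]])
            downward(c)

    downward(root)
    return best
```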

4.1.2 Pairwise Constraints

The pairwise kinematic potential between parts is defined using a Gaussian distribution, similar to [4, 1]. To prevent overlapping parts from occupying exactly the same place, we add an additional repulsion constraint that reduces the likelihood of the occluded part overlapping with the occluder. For parts x_i and x_j such that x_i is occluding x_j, we define the pairwise potential as

p(x_i | x_j, \Theta_{ij}) = \mathcal{N}(l_i - l_j; \mu_{ij}, \sigma_{ij}) \, \Lambda(l_i, l_j)

where l_i denotes the position and orientation of x_i, \Theta_{ij} = (\mu_{ij}, \sigma_{ij}) is a Gaussian prior over the relative part position and orientation, and \Lambda(l_i, l_j) is the repulsive prior between the overlapping parts [5].

4.2. Pose-Specific Part Models for Localization

Given spatial priors on a 3D pose, the Pose-Specific Part Model (PSPM) is a tree-structured graph tuned to accurately localize the specified pose. Obtaining the PSPM for a pose involves selecting the model (the set of parts P and the structure E) and estimating the model prior \Theta which is likely to maximize the joint likelihood. Accurate localization can then be obtained by maximizing Eqn. 1.

[Part Selection] For accurate localization, we select the parts that are at least partially visible, since the part detectors do not work well for heavily occluded parts. To achieve this, we project the 3D pose to obtain the approximate position and orientation of each part. This information, together with the relative depth ordering of the parts, is used to estimate the visibility of each part. The visibility v(p_i) is computed as the fraction of part p_i that is unoccluded, i.e.

v(p_i) = 1 - \sum_{j \neq i} ovlp(p_i, p_j)    (2)

where ovlp(p_i, p_j) is the fraction of part p_i occluded by p_j. For model selection, we only consider the parts with visibility greater than 0.5.
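A minimal sketch of this part-selection step, assuming an `ovlp` routine computed from the projected 2D part shapes and a known front-to-back depth ordering (how occlusion fractions are obtained is an assumption here):

```python
def select_visible_parts(parts, depth_order, ovlp, threshold=0.5):
    """Select the at-least-partially-visible parts (Eq. 2).

    parts:       list of part ids.
    depth_order: parts sorted front-to-back (closest to the camera first).
    ovlp(pi, pj): fraction of part pi covered by part pj, assumed to be
                  computed from the projected 2D part shapes.
    """
    rank = {p: r for r, p in enumerate(depth_order)}
    selected = []
    for pi in parts:
        # only parts in front of pi can occlude it
        occluded = sum(ovlp(pi, pj) for pj in parts
                       if pj != pi and rank[pj] < rank[pi])
        visibility = 1.0 - occluded            # Eq. (2)
        if visibility > threshold:
            selected.append(pi)
    return selected
```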
[Structure Selection] This step involves selecting, from all possible trees, a tree that captures appropriate constraints for localizing the given pose. For localizing poses with partially or fully occluded parts, we can relax some kinematic constraints in the standard tree model of Figure 3(a) and add an approximate neighborhood-cum-non-overlap constraint such that the resulting model is still a tree. For example, consider the pose in Figure 4(a). An alternate model to the standard kinematic model connects the left lower leg to the right lower leg, and results in a better pose estimate than the standard kinematic tree. Since the upper and lower parts of the body are rarely coupled (i.e., kinematically connected or occluding each other), we ignore edges between an arm and a leg. Figure 3(b) shows the edges considered for structure selection.

Figure 4. Pose localization using the Pose-Specific Part Model: (a) image of a person sitting down; (b) selected Pose-Specific Part Model (occluded parts are marked with dotted lines); (c) localized 2D parts obtained using the selected PSPM.

A standard approach for structure selection is to find the tree structure that maximizes the joint likelihood over labeled data [11]. This involves estimating the prior parameters (mean and variance) for all pairs of parts that are connected, and then finding a tree structure with the lowest score (sum of variances over all edges). Since the tree structure that maximizes the joint likelihood may be different for different poses, the standard learning approach would require labeled data for all poses in the action model, from various viewpoints, which is prohibitively large.

In this work, we instead propose a measure for the model score based on the geometry of the pose. To arrive at an appropriate measure, we annotated 2D and 3D poses for 200 images and estimated the tree model with the highest localization score by performing an exhaustive search over all tree-structured models from the graph shown in Figure 3(b). Note that the number of all possible tree models is quite large; to reduce the search space, we consider only those trees which include the kinematic edges and those non-kinematic edges where the connected pair of parts overlap. From our experiments, we observed that for poses with unoccluded parts, the best tree had mostly kinematic edges in it; however, non-kinematic edges were preferred when parts occluded each other. Based on this observation, we propose a score, the localization effect L(e_{ij}) of an edge, which captures the usefulness of that edge toward localizing the given pose. We define the localization effect as the product of the detection accuracy of the part detectors and the degree of occlusion of the connected parts:

L(e_{ij}) = \begin{cases} D(p_i)\,D(p_j)\,\min\{v(p_i), v(p_j)\}, & e_{ij} \in K \\ D(p_i)\,D(p_j)\,\max\{ovlp(p_i, p_j), ovlp(p_j, p_i)\}, & e_{ij} \notin K \end{cases}

where K is the set of kinematic edges, D(p_i) is the detection accuracy of the detector for part p_i, and the min/max term captures the degree of occlusion. The tree selection for accurate localization can then be formulated as a search over the set of edges that maximizes the total localization effect.

Since the localization effect of an edge is independent of the others, the optimal tree structure E^* can be estimated as

E^* = \arg\max_{E \subseteq G} \sum_{ij \in E} L(e_{ij}) \quad \text{s.t. } E \text{ is a tree}    (3)

where G is the graph with all pairwise constraints. Note that Equation 3 can be solved efficiently by finding the maximum spanning tree in the graph G, with L(e_{ij}) as the weight of e_{ij}.
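Since Eq. 3 reduces to a maximum spanning tree problem, the structure selection can be sketched with Kruskal's algorithm; the `D`, `v`, and `ovlp` callables are assumed to come from the part-selection step above, and the function names are illustrative.

```python
def pspm_structure(kin_edges, cand_edges, D, v, ovlp):
    """PSPM structure selection (Eq. 3) as a maximum spanning tree (Kruskal).

    kin_edges:  set of kinematic edges (i, j).
    cand_edges: candidate edges of the full graph of Fig. 3(b).
    D(p):  detection accuracy of the detector for part p.
    v(p):  visibility of part p (Eq. 2).
    ovlp(pi, pj): fraction of part pi occluded by pj.
    Returns the tree edges maximizing the total localization effect.
    """
    def localization_effect(i, j):
        base = D(i) * D(j)
        if (i, j) in kin_edges or (j, i) in kin_edges:
            return base * min(v(i), v(j))            # kinematic edge
        return base * max(ovlp(i, j), ovlp(j, i))    # occlusion edge

    # Kruskal: greedily add the heaviest edges that keep the graph acyclic
    root = {}
    def find(p):
        root.setdefault(p, p)
        while root[p] != p:
            root[p] = root[root[p]]   # path compression
            p = root[p]
        return p

    tree = []
    for i, j in sorted(cand_edges, key=lambda e: localization_effect(*e),
                       reverse=True):
        ri, rj = find(i), find(j)
        if ri != rj:
            root[ri] = rj
            tree.append((i, j))
    return tree
```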
[Estimating Model Prior Θ] We define the pairwise potential using a Gaussian (Section 4.1.2). Previous methods work with an uninformed prior and hence learn the parameters of the Gaussian from labeled data [4]. In our case, where prior knowledge of the pose is available, learning pose-specific parameters is more meaningful. However, learning pose-specific parameters would require a prohibitively large number of pose samples (for all poses from various viewpoints). We therefore estimate these parameters from the prior on the 3D pose. The model parameters, the mean and variance at each joint, are estimated by projecting the 3D pose prior, modeled as Gaussian distributions, to 2D. For example, the mean relative position \mu_{ij} of part i w.r.t. part j is simply the difference between the mid-point of the end-joints of part p_i and that of part p_j.

4.3. Localizing Pose from 3D Action Priors

The action prior includes the 3D prior on the pose, represented with Gaussian distributions (one for each joint), and the approximate position and scale of the person available from the tracker. Given this prior, we obtain an accurate 2D localization of the pose using the PSPM (as described earlier). Note that during inference, we only apply each part detector in the neighborhood of the projected 2D position, orientation and scale of that part. After localizing the pose in 2D, we then estimate the 3D pose from the 2D joint positions. While estimating a 3D pose from 2D joints is ambiguous, in our case the spatial priors on the pose available from the action model and the tracking information help remove such ambiguities. For accurate 3D pose estimation from a 2D pose with known depth ordering of parts, one can estimate the 3D joints using non-linear least squares to fit the 2D estimates while constraining the joints to stay within the pose search space (similar to [30]); in this work, we simply update each joint position, starting from the neck, assuming the 3D lengths of the parts do not change. An initial estimate of the 3D part lengths is obtained by scaling a canonical 3D model in standing pose such that the height of the model matches the observed height of the actor (available from tracking).

5. Experiments

We first demonstrate our pose localization approach using PSPMs on an image dataset with pose annotations. We then evaluate our action recognition algorithm, which uses PSPMs for localization, on two publicly available datasets: Full Body Gestures [17] and Hand Gestures [18]. Compared to the KTH [13], Weizmann, HumanEva [23] and Hand Gestures [18] datasets, which have a clean background and/or few viewpoint variations, the Full Body Gestures set includes videos with cluttered dynamic backgrounds, captured at various viewpoints. We also report results on hand gestures in dynamic scenes.

5.1. Pose Localization

We selected frames from existing action recognition datasets [17, 18] and created a collection of 195 images with a variety of poses. For each image, we annotated the 3D pose of the actor by marking the 2D joint positions and their relative depths, followed by lifting to 3D (similar to the keypose annotations). To quantitatively evaluate pose localization, we computed the average localization score over the visible parts: a part is considered correctly localized if it overlaps more than 50% with the ground truth part. Recall that the pose prior includes approximate 2D scale and position information from the tracker, and the approximate 3D pose (represented as a set of Gaussian distributions over the 3D joint positions). To simulate the noisy prior obtained from the action models, we set the variance of each 3D joint to 5% of the part length. This prior was then used as input to the various localization methods.

We first apply our implementation of Pictorial Structures (PS) [4], which is a tree-structured model with kinematic edges and an uninformed prior. Using Boundary Templates (BT), PS gives a localization accuracy of 44.53%. We then modify PS by applying the part detectors only in the search region provided by the prior and enforcing kinematics using parameters estimated from the prior; we refer to this as CPS (Constrained Pictorial Structures). Applying CPS with Boundary Templates gives a localization accuracy of 63.74%, which, compared to PS, clearly shows the importance of incorporating the pose prior. We then apply our Pose-Specific Part Model with Boundary Templates [20] and achieve a much higher localization accuracy of 71%, which demonstrates the advantage of modeling occlusion-based constraints. We also compare with [17], which uses the Hausdorff distance between the pose boundary and Canny edges as a shape likelihood measure to localize the pose; this approach achieved a lower accuracy of 62.71%.

We test the robustness of our approach to uncertainty in the position and scale of the pose (which is likely to occur during tracking). Figure 5 shows the accuracy plots for the various localization methods against the degree of uncertainty. Notice that localization using PSPM and CPS with Boundary Templates is quite robust to position uncertainty compared to the Hausdorff method.

Using CPS with Geometric Templates and with Boundary Templates gave comparable accuracy scores at low uncertainty, but the Geometric Templates deteriorate as the uncertainty increases; this indicates that Boundary Templates are more robust to noise. Also notice in Figure 5(b) that PSPM with Boundary Templates tolerates small errors (about 10%) in the height estimate. However, PSPM-based localization is substantially slower than localization using the Hausdorff distance.

Figure 5. Localization accuracy of the different approaches (Hausdorff, PS-BT, CPS-BT, CPS-BT-IIP, CPS-GT, PSPM-BT): (a) with uncertainty in position (shown as the ratio of position error to person height); (b) with uncertainty in the height estimate (scale).

5.2. Action Recognition

From the pose localization experiments, we observe that the Hausdorff distance based method localizes well when the predicted pose is not far from the true pose. Thus, for efficiency, we apply PSPMs every 5th frame and use the Hausdorff distance based method for the intermediate frames. In addition, for efficient localization using PSPM, we scale down the image so that the actor is 100 pixels high. Our entire system runs at 1 frame per second on a 3GHz Xeon CPU running Windows/C++ programs. We now present our results on three datasets.

Hand Gestures Dataset [18]: This dataset has 5-6 instances of 12 actions from 8 different actors in an indoor lab setting, for a total of 495 action sequences across all actions. Even though the background is not cluttered, the recognition task is still challenging due to the large number of actions with small pose differences. For evaluation, we train the models on a subset of actors and test on the rest. We compare our approach to [18], which uses a similar joint tracking and recognition approach but uses discrete action duration models and foreground based features for localization and matching. [18] reports recognition rates of 78% and 90% with 1:8 and 3:5 train:test splits respectively. Our algorithm achieves 92% recognition accuracy with a 1:8 train:test split. If we replace the PSPM based localization with the Hausdorff distance based method, the recognition rate drops to 84%. This illustrates that even in clean backgrounds, the use of PSPMs improves action recognition.

Table 1. Evaluation results on the Hand Gestures and USC Gestures datasets.

Dataset         Method                  Train:Test   Recognition %
Hand Gestures   Natarajan et al. [18]   1:8          78%
                Natarajan et al. [18]   3:5          90%
                CAN (Hausdorff)         1:8          84.2%
                CAN (PSPM)              1:8          92%
USC Gestures    SFD-CRF [17]            MoCAP        77.45%
                CAN (PSPM)              1:6          89.5%

Augmented Hand Gestures Dataset: To demonstrate robustness to cluttered dynamic backgrounds, we generated a dataset by embedding 45 action instances from the original dataset [18] into videos with complex dynamic backgrounds (see Figure 6(f-k) for sample images). The dataset has 215 videos, including 3 different actors performing hand gestures in 5 different scenes. Our algorithm achieves 91% recognition accuracy; note that the recognition accuracy on the original 45 videos from [18] that were used for embedding was about 95%. To process these videos, we used the parameters trained on the original hand gestures dataset [18]. In addition, we also collected 25 videos of 4 hand gestures performed in dynamic scenes, with camera shake and/or objects moving in the background. Our algorithm, trained on the original dataset, correctly recognized 20 of these action instances (80% accuracy).
USC Gestures Dataset [17]: This dataset has videos of 6 full body actions, captured at various pan and tilt angles; the actions are sit-on-ground, stand-up-from-ground, sit-on-chair, stand-from-chair, pickup-from-ground and point-forward. We evaluated our approach on the part of this dataset that was captured at 0° tilt in 6 varying backgrounds, including cluttered indoor scenes and outdoor scenes in front of moving vehicles; the rest of the dataset was captured at other tilt angles against a relatively clean, static background. The selected set includes actions captured at 5 different camera pan angles w.r.t. the actor (0°, 45°, 90°, 270°, 315°), for a total of 240 action instances, each performed either by a different actor, at a different pan, or in a different background. For our experiments, we trained our models using 2 action instances from one actor and evaluated on the rest. Note that the models were trained on only 2 viewpoints and tested on 5 different viewpoints. On segmented action instances, our approach achieved an accuracy of 75.91%. Figure 6(n-s) shows sample results. [17] reports an accuracy of 77.35%; however, they assume that the sit-on-chair and sit-on-ground actions are followed by stand-from-chair and stand-from-ground respectively. When we incorporate this information, our action recognition accuracy improves to 89.5%, a 12% improvement over [17].

6. Conclusion

We have presented an approach for joint pose tracking and action recognition in cluttered dynamic environments which has low training requirements and does not require 3D MoCAP data. We achieve this by proposing an accurate and efficient pose localization approach using Pose-Specific Part Models (PSPMs). We have demonstrated that our localization approach is robust to noise and works well in cluttered environments.

Figure 6. Results obtained on the Gesture datasets: (a-e) Hand Gestures [18], (f-m) Augmented Hand Gestures, (n-s) USC Gestures [17]. The estimated pose is overlaid on each image (in red), and the corresponding part distribution obtained by applying the PSPM is shown next to it.

Further, we have also demonstrated our approach for action recognition on hand gestures as well as on the USC Gestures dataset with full body gestures in cluttered and dynamic environments.

Acknowledgements. This research was supported, in part, by the Office of Naval Research under grants #N and #N.

References

[1] M. Andriluka, S. Roth, and B. Schiele. Pictorial structures revisited: People detection and articulated pose estimation. In CVPR, 2009.
[2] M. Andriluka, S. Roth, and B. Schiele. Monocular 3d pose estimation and tracking by detection. In CVPR, 2010.
[3] M. Collins. Discriminative training methods for hidden markov models: Theory and experiments with perceptron algorithms. In EMNLP, 2002.
[4] P. F. Felzenszwalb and D. P. Huttenlocher. Pictorial structures for object recognition. IJCV, 61(1):55-79, 2005.
[5] V. Ferrari, M. J. Marín-Jiménez, and A. Zisserman. Pose search: Retrieving people using their pose. In CVPR, 2009.
[6] Y. Hu, L. Cao, F. Lv, S. Yan, Y. Gong, and T. Huang. Action detection in complex scenes with spatial and temporal ambiguities. In ICCV, 2009.
[7] C. Huang and R. Nevatia. High performance object detection by collaborative learning of joint ranking of granules features. In CVPR, 2010.
[8] N. Ikizler and D. A. Forsyth. Searching video for complex activities with finite state models. In CVPR, 2007.
[9] H. Jiang and D. Martin. Global pose estimation using non-tree models. In CVPR, 2008.
[10] S. X. Ju, M. J. Black, and Y. Yacoob. Cardboard people: A parameterized model of articulated image motion. In FG, 1996.
[11] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
[12] X. Lan and D. P. Huttenlocher. Beyond trees: Common-factor models for 2d human pose recovery. In ICCV, 2005.
[13] I. Laptev. On space-time interest points. IJCV, 64(2-3):107-123, 2005.
[14] F. Lv and R. Nevatia. Single view human action recognition using key pose matching and viterbi path searching. In CVPR, 2007.
[15] R. Messing, C. Pal, and H. Kautz. Activity recognition using the velocity histories of tracked keypoints. In ICCV, 2009.
[16] L.-P. Morency, A. Quattoni, and T. Darrell. Latent-dynamic discriminative models for continuous gesture recognition. In CVPR, 2007.
[17] P. Natarajan and R. Nevatia. View and scale invariant action recognition using multiview shape-flow models. In CVPR, 2008.
[18] P. Natarajan, V. K. Singh, and R. Nevatia. Learning 3d action models from a few 2d videos for view invariant action recognition. In CVPR, 2010.
[19] H. Ning, W. Xu, Y. Gong, and T. S. Huang. Latent pose estimator for continuous action recognition. In ECCV, 2008.
[20] D. Ramanan. Learning to parse images of articulated bodies. In NIPS, 2006.
[21] R. Rosales and S. Sclaroff. Inferring body pose without tracking body parts. In CVPR, 2000.
[22] L. Sigal, A. O. Balan, and M. J. Black. Combined discriminative and generative articulated pose and non-rigid shape estimation. In NIPS, 2007.
[23] L. Sigal and M. J. Black. Measure locally, reason globally: Occlusion-sensitive articulated pose estimation. In CVPR, 2006.
[24] C. Sminchisescu, A. Kanaujia, Z. Li, and D. Metaxas. Conditional random fields for contextual human motion recognition. In ICCV, 2005.
[25] J. Sullivan and S. Carlsson. Recognizing and tracking human action. In ECCV, 2002.
[26] C. Sutton, K. Rohanimanesh, and A. McCallum. Dynamic conditional random fields: Factorized probabilistic models for labeling and segmenting sequence data. In ICML, 2004.
[27] T.-P. Tian and S. Sclaroff. Fast globally optimal 2d human detection with loopy graph models. In CVPR, 2010.
[28] R. Urtasun, D. J. Fleet, and P. Fua. 3d people tracking with gaussian process dynamical models. In CVPR, 2006.
[29] Y. Wang and G. Mori. Multiple tree models for occlusion and spatial constraints in human pose estimation. In ECCV, 2008.
[30] X. K. Wei and J. Chai. Modeling 3d human poses from uncalibrated monocular images. In ICCV, 2009.
[31] D. Weinland, E. Boyer, and R. Ronfard. Action recognition from arbitrary views using 3d exemplars. In ICCV, 2007.


More information

Face detection in a video sequence - a temporal approach

Face detection in a video sequence - a temporal approach Face detection in a video sequence - a temporal approach K. Mikolajczyk R. Choudhury C. Schmid INRIA Rhône-Alpes GRAVIR-CNRS, 655 av. de l Europe, 38330 Montbonnot, France {Krystian.Mikolajczyk,Ragini.Choudhury,Cordelia.Schmid}@inrialpes.fr

More information

Learning Articulated Skeletons From Motion

Learning Articulated Skeletons From Motion Learning Articulated Skeletons From Motion Danny Tarlow University of Toronto, Machine Learning with David Ross and Richard Zemel (and Brendan Frey) August 6, 2007 Point Light Displays It's easy for humans

More information

Efficient Detector Adaptation for Object Detection in a Video

Efficient Detector Adaptation for Object Detection in a Video 2013 IEEE Conference on Computer Vision and Pattern Recognition Efficient Detector Adaptation for Object Detection in a Video Pramod Sharma and Ram Nevatia Institute for Robotics and Intelligent Systems,

More information

Action recognition in videos

Action recognition in videos Action recognition in videos Cordelia Schmid INRIA Grenoble Joint work with V. Ferrari, A. Gaidon, Z. Harchaoui, A. Klaeser, A. Prest, H. Wang Action recognition - goal Short actions, i.e. drinking, sit

More information

Located Hidden Random Fields: Learning Discriminative Parts for Object Detection

Located Hidden Random Fields: Learning Discriminative Parts for Object Detection Located Hidden Random Fields: Learning Discriminative Parts for Object Detection Ashish Kapoor 1 and John Winn 2 1 MIT Media Laboratory, Cambridge, MA 02139, USA kapoor@media.mit.edu 2 Microsoft Research,

More information

Multiple-Person Tracking by Detection

Multiple-Person Tracking by Detection http://excel.fit.vutbr.cz Multiple-Person Tracking by Detection Jakub Vojvoda* Abstract Detection and tracking of multiple person is challenging problem mainly due to complexity of scene and large intra-class

More information

Recognition Rate. 90 S 90 W 90 R Segment Length T

Recognition Rate. 90 S 90 W 90 R Segment Length T Human Action Recognition By Sequence of Movelet Codewords Xiaolin Feng y Pietro Perona yz y California Institute of Technology, 36-93, Pasadena, CA 925, USA z Universit a dipadova, Italy fxlfeng,peronag@vision.caltech.edu

More information

CS 231A Computer Vision (Winter 2014) Problem Set 3

CS 231A Computer Vision (Winter 2014) Problem Set 3 CS 231A Computer Vision (Winter 2014) Problem Set 3 Due: Feb. 18 th, 2015 (11:59pm) 1 Single Object Recognition Via SIFT (45 points) In his 2004 SIFT paper, David Lowe demonstrates impressive object recognition

More information

A Hierarchical Compositional System for Rapid Object Detection

A Hierarchical Compositional System for Rapid Object Detection A Hierarchical Compositional System for Rapid Object Detection Long Zhu and Alan Yuille Department of Statistics University of California at Los Angeles Los Angeles, CA 90095 {lzhu,yuille}@stat.ucla.edu

More information

People-Tracking-by-Detection and People-Detection-by-Tracking

People-Tracking-by-Detection and People-Detection-by-Tracking People-Tracking-by-Detection and People-Detection-by-Tracking Mykhaylo Andriluka Stefan Roth Bernt Schiele Computer Science Department TU Darmstadt, Germany {andriluka, sroth, schiele}@cs.tu-darmstadt.de

More information

3D Pictorial Structures for Multiple Human Pose Estimation

3D Pictorial Structures for Multiple Human Pose Estimation 3D Pictorial Structures for Multiple Human Pose Estimation Vasileios Belagiannis 1, Sikandar Amin 2,3, Mykhaylo Andriluka 3, Bernt Schiele 3, Nassir Navab 1, and Slobodan Ilic 1 1 Computer Aided Medical

More information

CS 223B Computer Vision Problem Set 3

CS 223B Computer Vision Problem Set 3 CS 223B Computer Vision Problem Set 3 Due: Feb. 22 nd, 2011 1 Probabilistic Recursion for Tracking In this problem you will derive a method for tracking a point of interest through a sequence of images.

More information

Improved Human Parsing with a Full Relational Model

Improved Human Parsing with a Full Relational Model Improved Human Parsing with a Full Relational Model Duan Tran and David Forsyth University of Illinois at Urbana-Champaign, USA {ddtran2,daf}@illinois.edu Abstract. We show quantitative evidence that a

More information

Separating Objects and Clutter in Indoor Scenes

Separating Objects and Clutter in Indoor Scenes Separating Objects and Clutter in Indoor Scenes Salman H. Khan School of Computer Science & Software Engineering, The University of Western Australia Co-authors: Xuming He, Mohammed Bennamoun, Ferdous

More information

Modeling 3D Human Poses from Uncalibrated Monocular Images

Modeling 3D Human Poses from Uncalibrated Monocular Images Modeling 3D Human Poses from Uncalibrated Monocular Images Xiaolin K. Wei Texas A&M University xwei@cse.tamu.edu Jinxiang Chai Texas A&M University jchai@cse.tamu.edu Abstract This paper introduces an

More information

3D Spatial Layout Propagation in a Video Sequence

3D Spatial Layout Propagation in a Video Sequence 3D Spatial Layout Propagation in a Video Sequence Alejandro Rituerto 1, Roberto Manduchi 2, Ana C. Murillo 1 and J. J. Guerrero 1 arituerto@unizar.es, manduchi@soe.ucsc.edu, acm@unizar.es, and josechu.guerrero@unizar.es

More information

Introduction to behavior-recognition and object tracking

Introduction to behavior-recognition and object tracking Introduction to behavior-recognition and object tracking Xuan Mo ipal Group Meeting April 22, 2011 Outline Motivation of Behavior-recognition Four general groups of behaviors Core technologies Future direction

More information

Chapter 9 Object Tracking an Overview

Chapter 9 Object Tracking an Overview Chapter 9 Object Tracking an Overview The output of the background subtraction algorithm, described in the previous chapter, is a classification (segmentation) of pixels into foreground pixels (those belonging

More information

Better appearance models for pictorial structures

Better appearance models for pictorial structures EICHNER, FERRARI: BETTER APPEARANCE MODELS FOR PICTORIAL STRUCTURES 1 Better appearance models for pictorial structures Marcin Eichner eichner@vision.ee.ethz.ch Vittorio Ferrari ferrari@vision.ee.ethz.ch

More information

An Adaptive Appearance Model Approach for Model-based Articulated Object Tracking

An Adaptive Appearance Model Approach for Model-based Articulated Object Tracking An Adaptive Appearance Model Approach for Model-based Articulated Object Tracking Alexandru O. Bălan Michael J. Black Department of Computer Science - Brown University Providence, RI 02912, USA {alb, black}@cs.brown.edu

More information

Measure Locally, Reason Globally: Occlusion-sensitive Articulated Pose Estimation

Measure Locally, Reason Globally: Occlusion-sensitive Articulated Pose Estimation Measure Locally, Reason Globally: Occlusion-sensitive Articulated Pose Estimation Leonid Sigal Michael J. Black Department of Computer Science, Brown University, Providence, RI 02912 {ls,black}@cs.brown.edu

More information

Sports Field Localization

Sports Field Localization Sports Field Localization Namdar Homayounfar February 5, 2017 1 / 1 Motivation Sports Field Localization: Have to figure out where the field and players are in 3d space in order to make measurements and

More information

Detecting and Parsing of Visual Objects: Humans and Animals. Alan Yuille (UCLA)

Detecting and Parsing of Visual Objects: Humans and Animals. Alan Yuille (UCLA) Detecting and Parsing of Visual Objects: Humans and Animals Alan Yuille (UCLA) Summary This talk describes recent work on detection and parsing visual objects. The methods represent objects in terms of

More information

Efficient Extraction of Human Motion Volumes by Tracking

Efficient Extraction of Human Motion Volumes by Tracking Efficient Extraction of Human Motion Volumes by Tracking Juan Carlos Niebles Princeton University, USA Universidad del Norte, Colombia jniebles@princeton.edu Bohyung Han Electrical and Computer Engineering

More information

Human Action Recognition Using Dynamic Time Warping and Voting Algorithm (1)

Human Action Recognition Using Dynamic Time Warping and Voting Algorithm (1) VNU Journal of Science: Comp. Science & Com. Eng., Vol. 30, No. 3 (2014) 22-30 Human Action Recognition Using Dynamic Time Warping and Voting Algorithm (1) Pham Chinh Huu *, Le Quoc Khanh, Le Thanh Ha

More information

Poselet Conditioned Pictorial Structures

Poselet Conditioned Pictorial Structures Poselet Conditioned Pictorial Structures Leonid Pishchulin 1 Mykhaylo Andriluka 1 Peter Gehler 2 Bernt Schiele 1 1 Max Planck Institute for Informatics, Saarbrücken, Germany 2 Max Planck Institute for

More information

CRF Based Point Cloud Segmentation Jonathan Nation

CRF Based Point Cloud Segmentation Jonathan Nation CRF Based Point Cloud Segmentation Jonathan Nation jsnation@stanford.edu 1. INTRODUCTION The goal of the project is to use the recently proposed fully connected conditional random field (CRF) model to

More information

CS201: Computer Vision Introduction to Tracking

CS201: Computer Vision Introduction to Tracking CS201: Computer Vision Introduction to Tracking John Magee 18 November 2014 Slides courtesy of: Diane H. Theriault Question of the Day How can we represent and use motion in images? 1 What is Motion? Change

More information

Virtual Training for Multi-View Object Class Recognition

Virtual Training for Multi-View Object Class Recognition Virtual Training for Multi-View Object Class Recognition Han-Pang Chiu Leslie Pack Kaelbling Tomás Lozano-Pérez MIT Computer Science and Artificial Intelligence Laboratory Cambridge, MA 02139, USA {chiu,lpk,tlp}@csail.mit.edu

More information

Object Pose Detection in Range Scan Data

Object Pose Detection in Range Scan Data Object Pose Detection in Range Scan Data Jim Rodgers, Dragomir Anguelov, Hoi-Cheung Pang, Daphne Koller Computer Science Department Stanford University, CA 94305 {jimkr, drago, hcpang, koller}@cs.stanford.edu

More information

Category-level localization

Category-level localization Category-level localization Cordelia Schmid Recognition Classification Object present/absent in an image Often presence of a significant amount of background clutter Localization / Detection Localize object

More information

Scene Grammars, Factor Graphs, and Belief Propagation

Scene Grammars, Factor Graphs, and Belief Propagation Scene Grammars, Factor Graphs, and Belief Propagation Pedro Felzenszwalb Brown University Joint work with Jeroen Chua Probabilistic Scene Grammars General purpose framework for image understanding and

More information

Part-based and local feature models for generic object recognition

Part-based and local feature models for generic object recognition Part-based and local feature models for generic object recognition May 28 th, 2015 Yong Jae Lee UC Davis Announcements PS2 grades up on SmartSite PS2 stats: Mean: 80.15 Standard Dev: 22.77 Vote on piazza

More information

Segmentation and Tracking of Multiple Humans in Complex Situations Λ

Segmentation and Tracking of Multiple Humans in Complex Situations Λ Segmentation and Tracking of Multiple Humans in Complex Situations Λ Tao Zhao Ram Nevatia Fengjun Lv University of Southern California Institute for Robotics and Intelligent Systems Los Angeles CA 90089-0273

More information

Introduction to SLAM Part II. Paul Robertson

Introduction to SLAM Part II. Paul Robertson Introduction to SLAM Part II Paul Robertson Localization Review Tracking, Global Localization, Kidnapping Problem. Kalman Filter Quadratic Linear (unless EKF) SLAM Loop closing Scaling: Partition space

More information

Joint Vanishing Point Extraction and Tracking. 9. June 2015 CVPR 2015 Till Kroeger, Dengxin Dai, Luc Van Gool, Computer Vision ETH Zürich

Joint Vanishing Point Extraction and Tracking. 9. June 2015 CVPR 2015 Till Kroeger, Dengxin Dai, Luc Van Gool, Computer Vision ETH Zürich Joint Vanishing Point Extraction and Tracking 9. June 2015 CVPR 2015 Till Kroeger, Dengxin Dai, Luc Van Gool, Computer Vision Lab @ ETH Zürich Definition: Vanishing Point = Intersection of 2D line segments,

More information

Monocular Human Motion Capture with a Mixture of Regressors. Ankur Agarwal and Bill Triggs GRAVIR-INRIA-CNRS, Grenoble, France

Monocular Human Motion Capture with a Mixture of Regressors. Ankur Agarwal and Bill Triggs GRAVIR-INRIA-CNRS, Grenoble, France Monocular Human Motion Capture with a Mixture of Regressors Ankur Agarwal and Bill Triggs GRAVIR-INRIA-CNRS, Grenoble, France IEEE Workshop on Vision for Human-Computer Interaction, 21 June 2005 Visual

More information

Accurate 3D Face and Body Modeling from a Single Fixed Kinect

Accurate 3D Face and Body Modeling from a Single Fixed Kinect Accurate 3D Face and Body Modeling from a Single Fixed Kinect Ruizhe Wang*, Matthias Hernandez*, Jongmoo Choi, Gérard Medioni Computer Vision Lab, IRIS University of Southern California Abstract In this

More information

Computer Vision II Lecture 14

Computer Vision II Lecture 14 Computer Vision II Lecture 14 Articulated Tracking I 08.07.2014 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Outline of This Lecture Single-Object Tracking Bayesian

More information

CHAPTER 6 PERCEPTUAL ORGANIZATION BASED ON TEMPORAL DYNAMICS

CHAPTER 6 PERCEPTUAL ORGANIZATION BASED ON TEMPORAL DYNAMICS CHAPTER 6 PERCEPTUAL ORGANIZATION BASED ON TEMPORAL DYNAMICS This chapter presents a computational model for perceptual organization. A figure-ground segregation network is proposed based on a novel boundary

More information

Supplemental Material: Discovering Groups of People in Images

Supplemental Material: Discovering Groups of People in Images Supplemental Material: Discovering Groups of People in Images Wongun Choi 1, Yu-Wei Chao 2, Caroline Pantofaru 3 and Silvio Savarese 4 1. NEC Laboratories 2. University of Michigan, Ann Arbor 3. Google,

More information