3D Human Motion Tracking Using Dynamic Probabilistic Latent Semantic Analysis

Size: px

Start display at page:

Download "3D Human Motion Tracking Using Dynamic Probabilistic Latent Semantic Analysis"

Elfreda Joseph
5 years ago
Views:

1 3D Human Motion Tracking Using Dynamic Probabilistic Latent Semantic Analysis Kooksang Moon and Vladimir Pavlović Department of Computer Science, Rutgers University Piscataway, NJ 885 {ksmoon, Abstract We propose a generative statistical approach to human motion modeling and tracking that utilizes probabilistic latent semantic (PLSA) models to describe the mapping of image features to 3D human pose estimates. PLSA has been successfully used to model the co-occurrence of dyadic data on problems such as image annotation where image features are mapped to word categories via latent variable semantics. We apply the PLSA approach to motion tracking by extending it to a sequential setting where the latent variables describe intrinsic motion semantics linking human figure appearance to 3D pose estimates. This dynamic PLSA (DPLSA) approach is in contrast to many current methods that directly learn the often high-dimensional image-to-pose mappings and utilize subspace projections as a constraint on the pose space alone. As a consequence, such mappings may often exhibit increased computational complexity and insufficient generalization performance. We demonstrate the utility of the proposed model on the synthetic dataset and the task of 3D human motion tracking in monocular image sequences with arbitrary camera views. Our experiments show that the proposed approach can produce accurate pose estimates at a fraction of the computational cost of alternative subspace tracking methods.. Introduction Estimating 3D body pose from D monocular image is a fundamental problem for many applications ranging from surveillance to advanced human-machine interfaces. However, the shape variation of D images caused by changes in pose, camera setting, and view points makes this estimation a challenging problem. Computational approaches to pose estimation in these settings are often characterized by complex algorithms and a tradeoff between the estimation accuracy and computational efficiency. In this paper we propose the low-dimensional embedding method for 3D pose estimation that exhibits both high accuracy, tractable estimation, and invariance to viewing direction. 3D human pose estimation from monocular D images can be formulated as the task of matching an image of the tracked subject to the most likely 3D pose. To learn such a mapping one needs to deal with a dyadic set of high dimensional objects the poses, y and the image features, z. Because of the high dimensionality of the two spaces learning a direct mapping z y often results in complex models with poor generalization properties. One way to solve this problem is to map the two high dimensional vectors to a lower dimensional subspace x: x z and x y [3, ]. However, in these approaches, the correlation between the pose and the image feature is weakened by learning the two mappings independently and the temporal relationship is ignored during the embedding procedure. Our approach to pose estimation is inspired by probabilistic latent semantic analysis (PLSA) that is often used in domains such as the computational language modeling to correlate complex processes. In this paper we extended PLSA to account for the dynamic nature of sequential data. We choose the Gaussian Process Latent Variable model (GPLVM) to form a pair of mapping functions between the latent variable and the two objects. A GPLVM framework is particularly suited for this model because the dynamic nature of sequence can be directly integrated into the embedding procedure in a probabilistic manner [5,, 3]. The two generative nonlinear embedding models and the marginal dynamics results in a new hybrid model called the dynamic PLSA (DPLSA). DPLSA models the semantical relationship between a sequence of 3D poses and a sequence of image features via the shared latent dynamic space. This paper is organized as follows. We first define the DPLSA model that utilizes the marginal dynamic prior to learn the latent space of sequential data. We then propose the new framework for human motion modeling based on the DPLSA model and suggest learning and inference methods in this specific modeling context. The framework can

2 be directly extended for multiple view points by using the mixture model in the space of the latent variables and the image features. The utility of the the new framework is examined thorough a set of experiments of tracking 3D human figure motion from synthetic and real image sequences.. Related Work Dyadic data refers to a domain with two sets of objects in which data is measured on pairs of units. One of the popular approaches for learning from this kind of data is the latent semantic analysis (LSA) that was devised for document indexing. Deerwester et al. [] considered the term-document association data and used singular-value decomposition to decompose document matrix into a set of orthogonal matrices. LSA has been applied to a wide range of problems such as information retrieval and natural language processing [, ]. Probabilistic Latent Semantic Analysis (PLSA) [5] is a generalization of LSA to probabilistic settings. The main purpose of LSA and PLSA is to reveal semantical relations between the data entities by mapping the high dimensional data such text documents to a lower dimensional representation called latent semantic space. Some exemplary application areas of PLSA in computer vision include image annotation [] and image category recognition [, 8]. Latent space approach for high-dimensional data has been applied to human motion tracking problems in the past. Various dimensionality reduction techniques such as Principal Components Analysis (PCA), isometric feature mapping (Isomap), Local linear (LLE) and spectral embedding have been successfully used in human tracking [6, 3, 9]. Recently, the GPLVM that produces a continuous mapping between the latent space and the high dimensional data in a probabilistic manner [8] was used for human motion tracking. Tian et al. [] use a GPLVM to estimate the D upper body pose from the D silhouette features. Urtasun et al. [] exploit the SGPLVM for 3D people tracking. Wang et al. [5] introduced Gaussian Process Dynamic Model (GPDM) that utilizes the dynamic priors for embedding and the GPDM is effectively used for 3D human motion tracking [3]. In [] marginal AR prior for GPLVM embedding is proposed and utilized for 3D human pose estimation from the synthetic and real image sequences. Lawrence and Moore [9] propose the extension of GPLVM using a hierarchical model in which the conditional independency between human body parts is exploited with low dimensional non-linear manifolds. However, these approaches utilize only the pose in latent space estimation and as a consequence, the optimized latent space cannot guarantee the proper dependency between the poses and the image observations in regression setting. Shon et al. [5] propose a shared latent structure model that utilizes the latent space that links corresponding pairs of observations from the multiple different spaces and apply it for image synthesis and robotic imitation of human actions. Although their model also utilizes GPLVM as the embedding model, their applications are limited to nonsequential cases and the linkage between two observations is explicit(e.g.image-image or pose-pose). The shared latent structure model using GPLVM is utilized for pose estimation in [3]. This work focuses on the semi-supervised regression learning and makes use of unlabeled data (only pose or image) to regularize the regression model. In contrast, our work, using a statistical foundation of PLSA, focuses on the computational advantages of the shared latent space. In addition, it explicitly considers the latent dynamics and the multi-view setting ignored in [3]. 3. Dynamic PLSA with GPLVM The starting point of our framework design is the symmetric parameterization of Probabilistic Latent Semantic Analysis [5]. In this setting the co-occurrence data y Y and z Z are associated via an unobserved latent variable x X: P (y, z) = x X P (x)p (y x)p (z x). () With a conditional independence assumption, the joint probability over data can be easily computed by marginalizing over the latent variable. We extend the idea to the case in which the two sets of objects, Y and Z are sequences and the latent variable x t is only associated with the dyadic pair (y t, z t ) at time t. And we solve the dual problem by marginalizing the parameters in the conditional probability models instead of marginalization of Z. Consider the sequence of length T of M-dimensional vectors, Y = [y y...y T ], where y t is a human pose (e.g.joint angles) at time t. The corresponding sequence Z = [z z...z T ] represent the sequence of N-dimensional image features observed for the given poses. The key idea of our Dynamic Probabilistic Latent Semantic Analysis (DPLSA) model is that the correlation between the pose Y and the image feature Z can be modeled using a latent-variable model where two mappings between the latent variable X and Y and between X and Z are defined using a Gaussian Process latent variable model of [8]. In other words, X can be regarded as the intrinsic subspace that Y and Z jointly share. The graphical representation of DPLSA for human motion modeling is depicted in Fig.. We assume that sequence X R D T of length T is generated by possibly nonlinear dynamics modeled as a known

3 y x y x y 3... x 3 z z z y T x T z T Figure. Graphical model of DPLSA. mapping φ parameterized by parameter γ x [5, 3] such as x t = A φ t (x t γ x,t )+A φ t (x t γ x,t )+...+w t. () Then the st order nonlinear dynamics is characterized by the kernel matrix K xx = φ(x γ x )φ(x γ x ) T + α I. (3) The model can further be generalized to higher order dynamics. The mapping from X to Y is a generative model defined using a GPLVM [8]. We assume that the relationship between the latent variable and the pose is nonlinear with additive noise, v t a zero-mean Gaussian noise with covariance βy I: y t = Cf(x t γ y ) + v t. () C represents a linear mapping matrix and f( ) is a nonlinear mapping function with a hyperparameter γ y. By choosing the simple prior of a unit covariance, zero mean Gaussian distribution on the element c ij in C and x t, marginalization of C results in a mapping: { P (Y X, β y ) K yx M/ exp } tr{k yx Y Y T } where (5) K yx (X, X) = f(x γ y )f(x γ y ) T + β y I. (6) Similarly, the mapping from the latent variable X into the image feature Z can be defined by z t = Dg(x t γ z ) + u t. (7) The marginal distribution of this mapping becomes { P (Z X, β z ) K zx N/ exp } tr{k zx ZZ T } (8) Notice that the kernel functions and parameters are different in the two mappings from the common latent variable sequence X to Y and to Z. The joint distribution of all co-occurrence data and all intrinsic sequence in a Y Z X space is finally modeled as P (X, Y, Z θ x, θ y, θ z ) = P (X θ x )P (Y X, θ y )P (Z X, θ z ) () where θ y {β y, γ y } and θ z {β z, γ z } represent the sets of hyperparameters in the two mapping functions from X to Y and from X to Z. θ x represents a set of hyperparameters in the dynamic model (e.g.α for a linear model and α, γ x for a nonlinear model). 3.. Human Motion Modeling Using Dynamic PLSA In human motion modeling, one s goal is to recover two important aspects of human motion from image feature: () 3D posture of the human figure in each image and () an intrinsic representation of the motion. Given a sequence of image features Z, the joint conditional model of the pose sequence Y and the corresponding embedded sequence X can be expressed as P (X, Y Z, θ x, θ y, θ z ) P (X θ x )P (Y X, θ y )P (Z X, θ z ). () Notice that the two mapping processes P (Y X) and P (Z X) have different noise models which can account for different factors (e.g.motion capture noise for the pose and camera noise for the image) that influence one but not the other process. 3.. Learning Human motion model is parameterized using a set of hyperparameters θ x, θ y and θ z, and the choice of kernel functions, K yx and K zx. Given both the sequence of poses and the corresponding image features, the learning task is to infer the subspace sequence X in the marginal dynamics space and the hyperparameters. Using the Bayes rule and () the joint likelihood is in the form P (X, Y, Z, θ x, θ y, θ z ) = P (X θ x )P (Y X, θ y ) P (Z X, θ z )P (θ x )P (θ y )P (θ z ). () where K zx (X, X) = g(x γ z )g(x γ z ) T + β z I. (9) To mitigate the overfitting problem, we utilize priors over the hyperparameters [5, 3, 5] such as P (θ x ) α (or α γx ), P (θ y ) βy γy and P (θ z ) βz γy. 3

4 The task of estimating the mode X and the hyperparameters, {θ x, θ y, θ z} can then be formulated as the ML/MAP estimation problem {X, θ x, θ y, θ z} = arg max X,θ x,θ y,θ z {log P (X θ x ) + log P (Y X, θ y ) + log P (Z X, θ z )} (3) which can be achieved using generalized gradient optimization such as CG, SCG or BFG. The task s nonconvex objective can give rise to point-based estimates of the posterior P (X Y, Z) that can be obtained by starting the optimization process from different initial points Inference and Tracking Having learned the DPLSA on training data Y and Z, the motion model can be used effectively in inference and tracking. Because we have two conditionally independent GPs, estimating current pose (distribution) y t and estimating current point x t in the embedded space can be decoupled. Given image features z t in frame t the optimal point estimate x t is the result of the following nonlinear optimization x t = arg max x t P (x t x t, θ x )P (z t x t, θ z ). () Due to GP nature of dependencies, the second term assumes conditional Gaussian form, however its dependency on x t is nonlinear [8] even with linear motion models in x. As a result, the tracking posterior P (x t z t, z t,...) may become highly multimodal. We utilize a particle-based tracker for our final pose estimation during tracking. However, because the search space is the low dimensional embedding space, only a small number of particles (<, empirical result) is sufficient for tracking allowing us to effectively avoid the computational problems associated with sampling in high dimensional spaces. A sketch of this procedure using particle filter based on the sequential importance sampling algorithm with N P particles and weights (w (i), i =,..., N P ) is shown below. Input : Image z t, Human motion model e.g.() and prior point estimates (w (i) t, x(i) t, y(i) t ) Z..t, i =,..., N P. Output: Current intrinsic state estimates (w (i) t, x (i) t ) Z..t, i =,..., N P ) Draw the initial estimates x (i) t p(x t x (i) t, θ x). ) Find optimal estimates x (i) t using nonlinear optimization in (). 3) Find point weights w (i) t P (x (i) t x (i) t, θ x)p (z (i) t x (i) t, θ z ). Algorithm : Particle filter in human motion tracking. Finally, because the mapping from X to Y is a GP function, we can easily compute the distribution of poses y t for each particle x (i) t by using the well known result from GP theory: P (y t x (i) t ) N (µ (i), σ (i) I). µ (i) = µ Y + Y T K yx (X, X) K yx (X, x (i) t ) (5) σ (i) = K yx (x (i) t, x (i) t ) K yx (X, x (i) t ) T K yx (X, X) K yx (X, x (i) t ) (6) where µ Y is the mean of training set. The distribution of poses at time t is thus approximated by a Gaussian mixture model. The mode of this distribution can be selected as the final pose estimate.. Mixture Models for Unknown View The image feature for the specific pose can vary according to a camera view point and orientation of the person with respect to the imaging plane. In a dynamic PLSA framework, the view point factor R can be easily combined into the generative model P (Z X) that represents the image formation process. P (X, Y, Z, R θ x ) = P (X θ x )P (Y X)P (Z X, R)P (R). (7) While the continuous representation of R is possible, learning such a representation from a finite set of view samples may be infeasible in practice. As an alternative, we use a quantized set of view points and suggest a mixture model, P (Z X, β z, γ r ) = S P (Z X, R = r, βz, r γz)p r (R = r) r= (8) where S denotes the number of views. Note that all the kernel parameters (β r z, γ r z) can be potentially different for different r... Learning Collecting enough training data for a large set of view points can be a tedious task. Instead, by using realistic synthetic data generated by 3D rendering software which allows us to simulate the realistic humanoid model and render textured images from a desired point of view, one can build a large training set for multi-view human motion model. In this setting one can simultaneously use all views to jointly estimate a complete set of DPLSA parameters as well as the latent space X. Given the pairs of the pose and the corresponding image feature with view point, learning the com-

5 plete mixture models reduces to joint optimization of P (X, Y, Z, Z,..., Z S, R,..., R S ) = P (X)P (Y X) P (Z s X, R s = s)p (R s = s). (9) s where S is the number of quantized views. The optimization of X and model parameters is a straightforward generalization of the method described in Sec Inference and Tracking The presence of an unknown viewing direction during tracking necessitates its estimation in addition to that of the latent state x t. This joint estimation of x t and R can be accomplished by directly extending the particle tracking method of Sec This approach is reminiscent of [3] in that it maintains the multiple view-based manifolds representing the various mappings caused from different view points. 5. Experiments 5.. Synthetic Data In our first experiment we demonstrate the advantage of our DPLSA framework on the subspace selection problem. We also compare the predictive ability of DPLSA when estimating the sequence Y from the observation Z. We generate a set of synthetic sequences using the following model: intrinsic motion X is generated with periodic functions in the R T space. The sequences are then mapped to a higher dimensional space of Y (in R T 7 ) through a mapping which is a linear combination of nonlinear features x, x, x, x. Z (in R T 3 ) is finally generated by mapping Y into a non-linear lower observation space in a similar manner. Examples of the three sequences are depicted in Fig.. This model is reminiscent of the Figure. A example of synthetic sequences. Left: X in the intrinsic subspace. Middle: Y generated from X Right: Z generated from Y. See text for detail. generative process that may reasonably model the mapping from intrinsic human motion to image appearance/features. We apply three different motion modeling approaches to model the relationship between the intrinsic motion X, the 3D pose space Y and the image feature space Z. The first method (Model ) is the manifold mapping of [3] which learns the embedding space using LLE or Isomap based on the observation Z and optimizes the mapping between X and Y using Generalized RBF interpolation. The second approach (Model ) is the human motion modeling using Marginal Nonlinear Dynamic System (MNDS) [], a model that attempts to closely approximate the data generation process. Model 3 is our proposed DPLSA approach described in Sec. 3.. During the learning process, initial embedding is estimated using probabilistic PCA. Initial kernel hyperparameter values were typically set to, except for the dynamic models were the variances were initially assigned values of.. We evaluate predictive accuracy of models in inferring Y from Z. We generate 5 sequences using the procedure described above. We generate the testing Z by adding white noise to the training sequence and infer Y from this Z. Table shows individual mean square error (MSE) rates of predicting all 7 dimensions in Y. All values are normalized with respect to the total variance in the true Y. The results demonstrate that the DPLSA model outperforms both the LLE-based model as well as the MNDS. We attribute the somewhat surprising result when compared to Model to the sensitivity of this model to estimates of the initial parameters of Z Y mapping. This problem can be mitigated by careful manual selection of the initial parameters, a typically burdensome task. However, another crucial advantage of our DPLSA model over Model is the computational cost in inferring Y. For instance, the mean number of iterations of scaled CG optimization is 7.9 for Model 3 and 3.8 for Model. This advantage will be further exemplified in the next set of experiment. 5.. Synthetic Human Motion Data 5... Single view point. In a controlled study we conducted experiments using a database of motion capture data for a 59 d.o.f body model from the CMU Graphics Lab Motion Capture Database [6] and synthetic video sequences. We used 5 walking sequences from 3 different subjects and running sequences from different subjects. The models were trained and tested on different subjects to emphasize the robustness of the approach to changes in the motion style. Initial model values, prior to learning updates, were set in the manner described in Sec. 5.. We exclude 6 joint angles that exhibit very small variances but very noisy (e.g.clavicle and finger). The human figure images are rendered on Maya using the software generously provided by the authors of [, ]. Following this, we extract the silhouette images to compute the -dimensional Alt moment 5

Table. MSE rates of predicting Y from Z. Model ē y ē y ē y3 ē y ē y5 ē y6 ē y7 ēyi LLE+GRBF.6..8.6.7.6..8 MNDS..6.6.6.3...7 DPLSA.3.6.3.3..8..3 image features as in [].

6 Table. MSE rates of predicting Y from Z. Model ē y ē y ē y3 ē y ē y5 ē y6 ē y7 ēyi LLE+GRBF MNDS DPLSA image features as in []. Also 3D latent space is employed for all motion tracking experiments. Fig. 3 depicts the log precisions of P (Y X) and P (Z X) on the D projection of latent space learned from cycle of walking sequence. Note that the precisions around the embedded X are different in the two spaces even though the X is commonly shared by both GPLVM models. To evaluate our DPLSA average, requires /5 of the iterations to achieve the same level of accuracy as the competing model. This can be explained by the presence of complex direct interaction between high dimensional features and pose states in the competing model. In DPLSA such interactions summarized via the low dimensional subspace. As a result, DPLSA representation can lead to potentially more suitable algorithm for real-time tracking MNDS Tracking DPLSA Tracking 8 MNDS Tracking DPLSA Tracking Mean Error.5 SCG Iterations Figure 3. Latent spaces with the grayscale map of log precisions. Left: P (Y X). Right: P (Z X). model in human motion tracking, we again compare it to the MNDS model in [] that utilizes the direct mapping between the poses and the image features. The models are learned from one specific motion sequence and tested on different sequences. Fig. shows the mean error in the 3D joint position estimation and the number of iterations in SCG per frame during the tracking. We use the error metric similar to the one in [7]. The error between estimated pose Ŷ and the ground truth pose Y from motion capture is E(Y, Ŷ ) = J j= ŷ j y j /J where y j is the 3D location of specific joint and J is the number of joints considered in the error matric. We choose 9 joints which have a wide motion range (e.g.throx and right & left wrist, humerus, femur and tibia). The height of human figure in this virtual space is 8 and the error unit can be relatively computed (e.g.when the height of man is 75cm, the error unit is 75/8 6.5cm). When the model was learned from walking sequence and tested on other sequences, the average error was.9 for MNDS tracking and.85 for DPLSA tracking. Results of full pose estimation for one running sequence are depicted in Fig. 5. The models achieve similarly a good accuracy in pose estimation. The distinct difference between the two models, however, is exhibited in the computational complexity of the learning and inference stages as shown in Fig.. The DPLSA model, on Frame Frame Figure. Tracking performance comparison. Left: pose estimation accuracy. Right: mean number of iterations of SCG. Figure 5. Input silhouettes and 3D reconstructions from a known viewpoint of π. First row: true poses. Second rows: silhouette images. Third row: estimated poses Multiple view points. We used one person sequence to learn the mixture models of the 8 different views (view angles = π i, i =,,..., 8 in clockwise direction, for frontal view) and human motion model. We made testing sequences by picking different motion capture sequences and rendering the images from 8 view points. Fig. 6 shows the one example of 3D tracking results from different view points. In the experiment the view point of input images are unknown and inferred frame by frame during 6

7 tracking with pose estimation. Although the pose estimation for some ambiguous silhouette is erroneous, our system can track the pose until the end of sequence with the proper view point estimation and 3D reconstructions are matched well to the true poses. Notice that the last two rows depict poses viewed from 3π/, i.e.the subject walking in the direction of the top left corner Real Video Sequence We applied our method to tracking of real monocular image sequences with fixed view point. We used the sideview sequences from CMU Mobo database [7]. DPLSA model was trained on walking sequences from the Mocap data and tested on the motion sequences from the Mobo set. Fig. 7 shows two example sequences of our tracking result. The lengths of the testing sequence are 3 and 3 frames. Although the frame rates and the walking style are different and there exists noise in the silhouette images, the reconstructed pose sequence depicts a plausible walking motion that agrees with the observed images. 6. Conclusions We have reformulated the shared latent space approach for learning models from sequential dyadic data. The reformulated generative statistical model is called the Dynamic Probabilistic Latent Semantic Analysis (DPLSA), which extends the successful PLSA formalism to a continuous state estimation problem of mapping sequences of human figure appearances in images to estimates of the 3D figure pose. Our preliminary results indicate that the DPLSA formalism can result in highly accurate trackers that exhibit fractional computational cost of the traditional subspace tracking methods. Moreover, the method is easily amenable to extensions to unknown or multi-view camera tracking tasks. The proposed model has the potential to handle complex classes of tracking problems, such as the challenging rapidly changing motions, when coupled with multiple model frameworks such as the switching dynamic models. Our future work will address these new directions as well as focus on continuing extensive evaluation of DPLSA on additional motion datasets. References [] J. Bellegarda. Exploiting both local and global constraints for multi-spanstatistical languagemodeling. In ICASSP, volume, pages , 998. [] S. C. Deerwester, S. T. Dumais, T. K. Landauer, and G. W. F. andr. A. Harshman. Indexing by latent semantic analysis. JASIST, (6):39 7, 99. [3] A. Elgammal and C.-S. Lee. Inferring 3d body pose from silhouettes using activity manifold learning. In CVPR, volume, pages ,. [] R. Fergus, L. Fei-Fei, P. Perona, and A. Zisserman. Learning object categories from google s image search. In ICCV, pages 86 83, 5. [5] T. Hofmann. Probabilistic latent semantic analysis. In Proc. of Uncertainty in A.I,, Stockhol, 999. [6] [7] [8] N. D. Lawrence. Gaussian process latent variable models for visualisation of high dimensionaldata. In NIPS.. [9] N. D. Lawrence and A. J. Moore. Hierarchical gaussian process latent variable models. In ICML, pages ACM, 7. [] C.-S. Lee and A. Elgammal. Simultaneous inference of view and body pose using torus manifolds. In ICPR, 6. [] F. Monay and D. Gatica-Perez. Plsa-based image autoannotation: constraining the latent space. In ACM Inter. Conf. on Multimedia, pages 38 35,. [] K. Moon and V. Pavlovic. Impact of dynamics on subspace embedding and tracking of sequences. In CVPR, pages 98 5, June 6. [3] R. Navaratnam, A. Fitzgibbon, and R. Cipolla. Semisupervised joint manifold learning for multi-valued regression. In ICCV, 7. [] P. W. Peter and S. T. Dumais. Personalized information delivery: an analysis of information filteringmethods. Commun. ACM, 35:5 6, December 99. [5] A. Shon, K. Grochow, A. Hertzmann, and R. Rao. Learning shared latent structure for image synthesis. NIPS, 5. [6] H. Sidenbladh, M. J. Black, and D. J. Fleet. Stochastic tracking of 3d human figures using d image motion. In ECCV, pages 7 78,. [7] L. Sigal, S. Bhatia, S. Roth, M. J. Black, and M. Isard. Tracking loose-limbed people. In CVPR, pages 8,. [8] J. Sivic, B. Russell, A. A. Efros, A. Zisserman, and W. T. Freeman. Discovering object categories in image collections. In ICCV, volume, pages , 5. [9] C. Sminchisescu and A. Jepson. Generative modeling for continuous non-linearly embedded visual inference. In ICML,. [] C. Sminchisescu, A. Kanaujia, Z. Li, and D. Metaxas. Discriminative density propagation for 3d human motion estimation. In CVPR, pages , 5. [] C. Sminchisescu, A. Kanujia, Z. Li, and D. Metaxas. Conditional visual tracking in kernel space. In NIPS, 5. [] T.-P. Tian, R. Li, and S. Sclaroff. Articulated pose estimation in a learned smooth space of feasible solutions. In CVPR, 5. [3] R. Urtasun, D. J. Fleet, and P. Fua. 3d people tracking with gaussian process dynamical models. In CVPR, pages 38 5, 6. [] R. Urtasun, D. J. Fleet, A. Hertzmann, and P. Fua. Priors for people tracking from small training sets. In ICCV, 5. [5] J. M. Wang, D. J. Fleet, and A. Hertzmann. Gaussian process dynamical models. In NIPS. 5. 7

First: Input real walking images of subject.

8 Figure 6. Input images with unknown view point and 3D reconstructions using DPLSA tracking. First row: true pose. Second and third rows: π view angle. Fourth and fifth rows: 3π view angle. Figure 7. First: Input real walking images of subject. Second row: Image silhouettes. Third row: Images of the reconstructed 3D poses. Fourth row: Input real walking images of subject 5. Fifth row: Images of the reconstructed 3D poses. 8

Graphical Models for Human Motion Modeling

Graphical Models for Human Motion Modeling Kooksang Moon and Vladimir Pavlović Rutgers University, Department of Computer Science Piscataway, NJ 8854, USA Abstract. The human figure exhibits complex and