Spectral Latent Variable Models for Perceptual Inference

Size: px

Start display at page:

Download "Spectral Latent Variable Models for Perceptual Inference"

Jessie Jeffry Nash
5 years ago
Views:

1 Technical Report, Department of Computer Science, Rutgers University, #DCS-TR-6, February 7 Spectral Latent Variable for Perceptual Inference Atul Kanaujia 1 Cristian Sminchisescu 2,3 Dimitris Metaxas 1 1 Department of Computer Science, Rutgers University, USA {kanaujia,dnm}@cs.rutgers.edu, kanaujia,dnm 2 Department of Computer Science, University of Toronto, Canada 3 TTI-Chicago, USA crismin@cs.toronto.edu, crismin May 22, 7 Abstract We propose low-dimensional non-linear latent variable models adapted for perceptual inference. We introduce a family of algorithms, referred to as Sparse Spectral Latent Variable (SLVM) that combine the advantages of spectral embeddings with the ones of parametric latent variable models like the Generative Topographic Mapping (): (1) they provide local-optima free embeddings that preserve intuitive properties, e.g. local or global geometry of the data, (2) offer low-dimensional generative models with probabilistic, bi-directional mappings between latent and ambient spaces, (3) are consistent and efficient to estimate. We show empirically that SLV versions of ISOMAP or LLE can be used successfully for complex visual inference tasks like the fully automatic reconstruction of 3d human poses in un-instrumented monocular video. 1 Introduction Many models in computer vision or machine learning are based on high-dimensional ambient observations (images, sound), considered to have been generated by non-linear, intrinsically low-dimensional processes. Many methods exist for computing such representations, but relatively few provide all the components necessary for inference tasks, e.g. tracking and reconstructing the three-dimensional pose of an object based on image observations. For such problems we need global representations and low-dimensional models with bidirectional mappings between latent and observation spaces. We also need to preserve global or local geometric properties of the data in latent space a model used for visual tracking based on ambient observations needs locality: small state changes in latent space should only produce commensurate changes in the observation space. If this doesn t hold the tracker may need to make realistically large and difficult to predict jumps in latent space in order to follow smooth trajectories in the observation space. We also need models that are probabilistic and can be learnt consistently, e.g. all their components estimated jointly, using ML. In this work we identify components that can provide consistent models with all the necessary properties: we marriage spectral embeddings and parametric latent variable models. This is a seemingly simple idea, but we know of no precursor. Spectral methods are attractive for their intuitive geometric properties and efficient global computation, but do not have a probabilistic extension, whereas regular grid-based generative methods like the Generative Topographic Mapping () are not geometry preserving, and encounter difficulties with latent spaces of dimension higher than 2-3, or when unfolding complex non-linear structure due to local optima. We propose a family of algorithms that combines the advantages of the two classes of methods: (1) they offer low-dimensional generative models with bi-directional, possible multimodal distributions (mappings) between latent and ambient spaces, (2) are probabilistically consistent and efficient to estimate. We refer to these probabilistic constructs defined on top of the irregular grid obtained from a spectral embedding as Sparse Spectral Latent Variable (SLVM). We show how SLV versions of ISOMAP, LLE of HE can 1

2 be used successfully for complex visual inference tasks like the fully automatic 3d reconstruction of human poses in non-instrumented monocular video. 1.1 Related Work Our research relates to work in spectral manifold learning, latent variable models and visual tracking. Spectral methods can model intuitive local or global geometric constraints [17, 21, 3, 7, 2] and their, local-optima free, polynomial time calculations are amenable to efficient optimization either sparse eigenvalue problems, or dense problems that can be solved with algebraic multigrid methods [4]. A variety of latent variable models exist, both linear (factor analysis, P) [16], and non-linear, e.g. mixture of factor analyzers or P [22], or grid-based methods like []. These methods are powerful but besides computational difficulties caused by multiple local optima, the models do not provide either global coordinate systems or latent spaces with flexible geometric structure. Recent work on visual tracking has recognized the need for global, low-dimensional, probabilistic models with intuitive geometric properties and latent-ambient mappings. More complex models have emerged by complementing with mappings the results of spectral methods. Elgammal & Lee [8] fit RBFs to the corresponding points of an LLE-embedding, but their model devoid of a latent space prior is not fully probabilistic and there are no mappings to the latent space. In independent work, Sminchisescu & Jepson [19] augment spectral embeddings (e.g. Laplacian Eigenmaps) with both latent space Gaussian mixture priors and sparse kernel regression mappings from latent to ambient space. Their model qualifies as a latent variable one, but ambient to latent mappings are computationally difficult to compute and the model is trained piece-wise not jointly. Urtasun et al [24] use an efficient Gaussian Process Latent Variable Model () of Lawrence [], but this optimizes a data reconstruction criterion, not based on the preservation of local or global geometric properties. The zero mean unit covariance prior in latent space makes the model somewhat agnostic for visual inference applications, where it is useful to penalize drifts from the manifold of admissible visual configurations. For improved stability, [24] use an augmented constrained (latent,ambient) state for tracking. We demonstrate the SLVMs for discriminative 3d human pose estimation, but here we predict a lowdimensional latent variable of the human pose directly from image sequences or human photographs. Existing research on visual inference using low-dimensional methods (e.g. 3d human pose tracking) has focused on generative tracking frameworks, based on either complex appearance (observation) models [19] or simpler models based on image tracks of the human body, assumed known, and assigned to the corresponding joints of a 3d model [24]. In practice such correspondences are difficult to obtain, and existing methods had to be initialized by hand. Initialization becomes more difficult as the dimension of the latent space increases, even for known correspondences - ultimately the search complexity is exponential in the latent space dimension. The advantages of latent variable models with generative trackers are counterbalanced by the difficulty to make models based on alignment work when the representation capacity of the 3d body model is reduced. E.g., even for simple walking motions, the amplitude of a tested motion may be different from the one in the training set. The 3d model may not align well with the image features and may gradually loose track. This explains why existing generative frameworks either relay on quite large training sets [19], assume that long-range model-image correspondences are given, or assume clean image silhouette data exists [8]. We depart from alignment approaches, to construct (inverse) image-based predictors that regress the latent body pose directly against image evidence. This allows more robust tracking when the latent variable models used have reduced representation capacity. 2 Sparse Spectral Latent Variable (SLVM) We work with two sets of vectors, of equal size N (possibly in correspondence), in two spaces referred as latent and ambient. Both sets are un-ordered, hence there is no constraint to place either one on a regular grid (i.e. a matrix in 2d or the cells of cube in 3d, as is the case of the latent vectors in []). The latent space vectors are generically denoted x, with dim(x) = d, the ambient vectors are y with dim(y) = D, typically with D >> d. 2

3 Spectral Embeddings We assume that vector-valued points in ambient space Y are captured from a highdimensional process (images, sound, human motion capture systems), whereas latent space points X are obtained using a spectral, non-linear embedding method like ISOMAP, LLE, Hessian or Laplacian Eigenmaps, etc. This provides several advantages: (1) Preserves intuitive local or global geometric properties. These are important for visual tracking applications, which allow efficient, smooth estimates of the state of an object in latent space for stability, it is important that small changes in latent space lead to small, continuous changes in the ambient / observation space, (2) no local optima (3) provide corresponding points with similar geometric properties. The resulting irregular grid of embedded points is used in order to construct a joint probability distribution over latent and ambient variables. Sparse Generative Spectral Mappings For our models, we assume that ambient vectors are related to the latent ones using a nonlinear vector-valued function F(x, W, α) with parameters W and output noise covariance σ. In other words the joint distribution has a non-linear constraint between two blocks of its variables. 1 The model can be derived analogously with the Generative Topographic Map [] (hence the name of this section), except that it is defined on an irregular grid X in latent space, as computed from a spectral embedding: p(y x,w, α, σ) = G(y F(x,W, α), σ) (1) where G is a Gaussian function of variable y, with mean F and covariance σ. F is chosen here to be a generalized (non-linear parametric) regression model: F(x,W, α) = Wφ(x) (2) with φ(x) = [G(x x 1, δ),..., G(x x M, δ)], having Gaussian kernels (others can be used) placed at an M-sized subset of the training set (we will select this automatically using sparsity priors, notice the difference from where the location of the basis functions is selected apriori), with covariance δ, and W is a weight matrix of size DxM. The model can be made computationally efficient and more robust to overfitting by introducing hierarchical priors on the parameters W of the mapping F, in order to select a sparse subset of X for prediction [23, 14]: p(w α) = D j=1 k=1 N G(w jk, 1 α k ) (3) with: N p(α) = Gamma(α i a, b) (4) Gamma(α a, b) = Γ(a) 1 b a α a 1 e bα () where a = 2, b = 4 are chosen to give broad hyperpriors. The ambient space prior can be obtained by integrating in latent space: p(y W, α, σ) = p(y x, W, α, σ)p(x)dx (6) where p(x) is the latent space prior, which for tractability, can be chosen conveniently as a sum of delta functions [], here placed on an irregular grid in latent space: 1 This is one possible construction for the joint distribution p(x,y) = p(x)p(y x), arguably not the only possible one. For instance one can approximate it as a product of marginal distributions in latent and ambient space p(x,y) = p(x)p(y) = P N p(x,xi)p(y,yi), and rely on the constraints already present in the embedding to account for correlations. 3

4 This leads to the ambient prior: p(x) = 1 K δ(x x i ) (7) K p(y W, α, σ) = 1 K K p(y x i,w, α, σ) (8) The conditional distribution in latent space can be obtained using Bayes rule: p (i,n) = p(x i y n ) = p(y n x i )p(x i ) p(y n ) = p(y n x i ) N p(y n x i,w, α, σ) This allows computing either the conditional mean or the mode (better for multimodal distributions) in latent space: K < x y n,w, α, σ >= p(x y n,w, α, σ)xdx = p (i,n) x i () i max = argmax p (i,n) (11) i The model contains all the ingredients for efficient computation in both latent and ambient space: eq. (7) gives the prior in latent space, (8) provides the induced prior in ambient space, (1) provides the conditional distribution (or mapping) from latent to ambient space, and () and (11) give the mean or mode of the mapping from ambient to latent space. The model can be trained my maximizing the log-likelihood of the ambient data: (9) L = log N p(y n W, α, σ) = N log{ 1 K n=1 K p(y n x i,w, α, σ)} (12) Maximizing the likelihood provides estimates for W, α, σ,(consider σ is diagonal with values σ for simplicity) where: Σ = (σ 2 Φ GΦ + S) 1 (13) W = σ 2 ΣΦ RY (14) where S = diag(α 1,..., α K ) with α corresponding only to the active set, G = diag(g 1,..., G M ) with G i = N j=1 p (i,j), R is a KxN matrix with elements p (i,j), and Y is an NxD matrix that stores the output vectors y i, i = 1...N columnwise, and Φ is a KxM matrix with elements G(x i x j, δ). The variance is estimated as usual from prediction error []: σ 2 = 1 ND N n=1 k=1 K p (kn) W φ(x) y n 2 (1) where a superscript identifies an updated variable estimate. The hyperparameters are re-estimated using the relevance determination equations [14]: α i = where µ i is the i-th column of W. The algorithm is summarized in fig. 1. γ i µ i 2, γ i = 1 α i Σ ii (16) 4

5 SLVM Learning Algorithm Input: Set of high-dimensional, ambient points Y = {y i }...N. Output: Sparse, Spectral Latent Variable Model (SLVM), with parameters (W, α, σ), defined on a irregular, fixed grid in latent space X = {x i }...N, that preserves local or global geometric properties of Y. Step 1. Compute spectral (non-linear) embedding of Y to obtain corresponding latent points X, using any standard embedding method like ISOMAP, LLE, HE, LE. Step 2. Compute inital estimates (W,α,σ ), using sparse kernel regression [23, 11] with training points (x i,y i ) N. Step 3. EM Algorithm to learn Sparse Spectral Generative Mappings on an irregular latent space grid. E-step Compute posterior probabilities p (i,n) for the assignment of latent to ambient space points using (9). M-step: Solve weighted non-linear regression problem to update (W, α, σ) acording to (13), (1), and (16). This uses Laplace approximation for the hyperparameters and analytical integration for the weights, and optimization with greedy weight subset selection [1, 14, 23, 11]. Figure 1: 3 Image-based Direct 3D Pose Prediction We estimate a distribution over solutions in latent space, given input descriptors derived from images obtained by cropping the person using a bounding-box. We then map from latent states to 3d human joint angles (using e.g. (1)) in order to recover body configurations for visualization or error reporting. To predict latent space distributions from image features, we use a conditional mixture of expert predictors. Each one is paired with an observation dependent gate function that scores its competence when presented with different images. As this change, different experts may be active and their rankings (relative probabilities) may change. The model is: p(x r) = M g i (r)p i (x r) (17) g i (r) = exp(λ i r) k exp(λ k r) (18) p i (x r) = G(x E i r,ω i ) (19) with r image descriptors, x state outputs, g i input dependent gates, computed using linear regressors c.f. (18), with weights λ i. g are normalized to sum to 1 for any given input r and p i are Gaussian functions (19)

6 with covariance Ω i, centered at the expert predictions, sparse linear regressors with weights E i. The model is trained using a double-loop EM algorithms [9]. 4 Experiments In order to illustrate the SLVM algorithms, we run experiments on a simple Swiss roll toy data problem and on a real world computer vision application: the reconstruction of 3d human poses from monocular images of people photographed against complex, non-stationary backgrounds. 4.1 Swiss Roll This set of experiments are illustrated in fig. 2. The original data set (top-row, fig.a) consists of points sampled regularly on the surface and (in most plots) color coded to show their relative spatial positions. In fig.b), we show reconstruction using []. The reconstruction is not entirely inaccurate and scrambles the geometric ordering when traversing the S sheet. uses a regular grid and points subsampled on this grid c.f. []. Fig.c), top row, shows accurate reconstructions from our sparse SLV-ISOMAP model, which is also computationally efficient - the 16 (out of ) latent space centers selected by the model are shown in fig.d). The bottom row of fig. 2 illustrates the computation of conditional distributions and the induced prior in latent space for SLV-ISOMAP. In fig.a) we show a dataset together with an out-of-sample ambient point (at one extremity) for which the conditional latent space distribution is computed. Fig. b) shows the intuitive bi-modal distribution in latent space, computed c.f. (9). Figs.c) and d) on the bottom row of fig. 2 show the SLV-ISOMAP prior induced in ambient space, computed using (8), for two different levels of sparsity, % (regular sub-sampling by a factor of 2) and automatic, as computed by our model at 1.6%. Notice that unlike all the other plots shown in the figure, in the rightmost two on the bottom row, the color of the points represents probability, not geometric ordering. The prior distribution has the intuitive property to peak higher on the principal curve of the S-shape, away from the borders of the manifold (where there is less data, on average) d Human Pose Reconstruction In this section, we report quantitative comparisons and qualitative 3d reconstruction (joint angles) of human motion from video or photographs. Image Descriptor A difficulty for reliable discriminative pose prediction is the design of image descriptors that are distinctive enough to differentiate among different poses, yet invariant to within the same pose class deformations and spatial misalignments people in similar stances, but differently proportioned, or photographed on different backgrounds. Exiting methods have successfully demonstrated that bag of features or regular-grid based representations of local descriptors (e.g. bag of shape context features, block of SIFT features [2, ]) can be surprisingly effective at predicting 3d human poses, but the representations tend to be too inflexible for reconstruction in general scenes. It is more appropriate to view them as two useful extremes of a multilevel, hierarchical representation of images - a family of descriptors that progressively relaxes block-wise, rigid local spatial image encodings to increasingly weaker spatial models of position / geometry accumulated over increasingly larger image regions. We derive such an encoding called Multilevel Spatial Blocks (MSB) which computes a description of the image at multiple levels of details. The putative image bounding-box of a person is split using a regular grid of overlapping image blocks, and increasingly large (SIFT) descriptor cell sizes are extracted. SIFT [13] is a local image encoding that builds histograms of edge orientations in a particular image region. We concatenate SIFTs within each layer and across layers, orderly, in order to obtain encodings of an entire image or sub-window. For our problem we use MSBs of size 1344, obtained by decomposing the image into blocks at 3 different levels, with 16, 4, 1 SIFT block, 4x4 cells per block, 12x12 pixel cell size 8 image gradient orientations histogramed within each cell. 6

x 3 4. 4 3. 3 2. 2 1. 1. Figure 2: Analysis of SLV-ISOMAP and on the Swiss roll.

(sampled) from, with associations between datapoints color-coded scrambles the points and does not preserve their

(c) Reconstruction from our sparse SLV-ISOMAP model with associations color-coded; (d) Active (sparse) basis setwith

Bottom row: Illustration of computations of conditional distributions and induced prior in latent space by SLV-ISOMAP.

(b) The multimodal conditional distribution in latent space, computed c.f. (9).

computer graphics human model that was animated using the Maya graphics package, and rendered on real image

We have 3247 different 3d poses from the CMU motion capture database [1] and these are rendered to produce supervised

one side, dancing, conversation, bending and picking, running and pantomime (one of the 3 training sets of 3247 poses

The test motions are executed by different subjects, not in the training set.

The other two test sets are progressively more complicated: one has the model randomly placed at different locations,

In all cases, a 3x2 bounding box of the model and the background is obtained, possibly using rescaling.

7 x Figure 2: Analysis of SLV-ISOMAP and on the Swiss roll. Top Row: (a) Original dataset of points, color coded to emphasize the geometric ordering; (b) Data reconstructed (sampled) from, with associations between datapoints color-coded scrambles the points and does not preserve their ambient geometric structure. The reconstruction is not accurate. (c) Reconstruction from our sparse SLV-ISOMAP model with associations color-coded; (d) Active (sparse) basis setwith 1.6%=16 of the datapoints shown in latent space, as automatically selected by SLV- ISOMAP. Bottom row: Illustration of computations of conditional distributions and induced prior in latent space by SLV-ISOMAP. (a) Dataset with ambient point for which conditional latent space distribution is computed. (b) The multimodal conditional distribution in latent space, computed c.f. (9). (c-d) Prior of SLV-ISOMAP in ambient space, computed using (8) for two different levels of sparsity, % (sub-sampling by a factor of 2) and automatic, 1.6%. Important: Please, notice that unlike all the other plots shown in the figure the color of the points represents probability, not geometric ordering!. Database and Multivalued Predictor For qualitative experiments we use images from a movie (Run Lola Run) and the INRIA pedestrian database [6]. For quantitative experiments we use our own database consisting of = 9741 quasi-real images, generated using a computer graphics human model that was animated using the Maya graphics package, and rendered on real image backgrounds (see fig. ). We have 3247 different 3d poses from the CMU motion capture database [1] and these are rendered to produce supervised (3d human joint angle, MSB image descriptor) pairs for different patterns of walks, either viewed frontally or from one side, dancing, conversation, bending and picking, running and pantomime (one of the 3 training sets of 3247 poses is placed on a clean background). We collect three test sets of poses for each of the five motion classes. The test motions are executed by different subjects, not in the training set. We also render one test set on a clean background to use as baseline. The other two test sets are progressively more complicated: one has the model randomly placed at different locations, but on the same images as in the training set, the other has the model placed on unseen backgrounds. In all cases, a 3x2 bounding box of the model and the background is obtained, possibly using rescaling. There is significant variability and lack of centering in this dataset because certain poses are vertically (and horizontally) more symmetric than others (e.g. compare a person who picks an object with one who is standing, or pointing an arm in one direction). We train multivalued predictors (conditional Bayesian mixture of sparse linear regressors) on each activity in the dataset (the latent variable models are also computed separately for each activity). Empirically, for the experts, we observe sparsity patterns in the range 1% 4%, with lower values usually associated with models that generalize better. We use sparse linear regressors, for robustness to image descriptor components affected by fluctuating background clutter c.f. 3. The error is always computed w.r.t. the most 7

8 Average Joint Angle Errors SideWalk SLV ISOMAP(NN1) SLV LLE(NN1) SLV HE(NN1) Average Joint Angle Errors Running SLV ISOMAP(NN1) SLV LLE(NN1) SLV HE(NN1) Average Joint Angle Errors BendPickup SLV ISOMAP(NN1) SLV LLE(NN1) SLV HE(NN1) Pantomime Dancing Cumulative Plot Average Joint Angle Errors SLV ISOMAP(NN1) SLV LLE(NN1) SLV HE(NN1) Average Joint Angle Errors SLV ISOMAP(NN1) SLV LLE(NN1) SLV HE(NN1) Average Joint Angle Errors SLV ISOMAP(NN1) SLV LLE(NN1) SLV HE(NN1) Figure 3: Quantitative results (prediction error, per joint angle, in degrees) for different motions (+ a cumulative plot) and 3 different imagining conditions, as follows: people on Clean backgrounds, people on Clutter1 backgrounds, used in the training set (but the test image has the person in a different position w.r.t. the background, in a relative position not seen in the training set) and people on Clutter2 backgrounds, not seen at all in the training set. We compare several methods, including our SLVM-(ISOMAP,LLE,HE), [], [] and, all with 2 latent space dimensions. We use Discriminative image-based predictors (multivalued mappings from images to latent space), based on conditional Bayesian mixtures of experts, sparse linear regressors, for robustness to image descriptor components affected by fluctuating background clutter c.f. 3. Both LVM models and corresponding predictors have been trained separately for each motion type. The error is always computed w.r.t. the most probable expert. Figure 4: Posterior plots showing the predicted ditribution p(x r) in latent space, for (a-c) and for one of our SLVM models (d-f) for a a walking motion viewed from a side. The topmost points in (a-c) are the training points and for illustration we link points corresponding to adjacent frames in the motion sequence. The ground truth is shown with a circle, the predicted posterior is color-coded. Half of the points on the grid have RBFs placed on top, regularly subsampled (not only top). Notice the loss of track, and the assignments of ambient points to multiple grid points (top). Figs (d-f) shows tracking with our SLV-ISOMAP model. The distribution is ocassionally multimodal and sometimes the most probable peak is not the true one. In general, however, the ground truth is tracked closely. probable expert. In fig. 3, we show quantitative comparisons (prediction error, per joint angle, in degrees) for different motions (+ a cumulative plot) and 3 different imagining conditions: Clean backgrounds, Clutter1 back- 8

Figure : Images from our training and testing database where a

HE), [], [] and, all with 2 latent space dimensions embedded

both a separate LVM and an imagebased latent state predictor

For visualization and error reporting we use the conditional

latent point estimates x to joint angles y.

6 is based on real images from the movie Run Lola Run, where

humans in non-instrumented, complex We use a model trained

model placed on real backgrounds, rendered from 8 different

humans running and walking in cluttered scenes.

are not entirely accurate in a classical alignment sense.

images from the movie Run Lola Run (block of leftmost 4 images)

(a) Top row shows the original images, (b) Bottom row shows

Conclusions We have presented spectral latent variable models

are probabilistic; (b) preserve intuitive geometric properties

9 Figure : Images from our training and testing database where a virtual, CG human impostor, is placed in real-world environments. grounds and Clutter2 backgrounds, not seen at all in the training set. We compare several methods, including our SLVM-(ISOMAP, LLE, HE), [], [] and, all with 2 latent space dimensions embedded from 6d ambient human joint angles (recall that in each case, both a separate LVM and an imagebased latent state predictor have been computed). For visualization and error reporting we use the conditional estimate of the ambient state (that uses function F) to map latent point estimates x to joint angles y. The final set of results we show in fig. 6 is based on real images from the movie Run Lola Run, where our model is seduced by Lola and runs after her, and INRIA pedestrian dataset [6]. These are all automatic 3d reconstructions of fast moving humans in non-instrumented, complex environments. We use a model trained with walking and running labeled poses (quasireal data of our model placed on real backgrounds, rendered from 8 different viewpoints) with an additional unlabeled (real) images of humans running and walking in cluttered scenes. As usual with discriminative predictive methods, the solutions are not entirely accurate in a classical alignment sense. However, we find that the 3d reconstructions have reasonable perceptual accuracy. Figure 6: Qualitative 3d reconstruction results obtained on images from the movie Run Lola Run (block of leftmost 4 images) and the INRIA pedestrian dataset (rightmost 4 images) [6]. (a) Top row shows the original images, (b) Bottom row shows automatic 3d reconstructions. Conclusions We have presented spectral latent variable models (SLVM) and showed their potential for visual inference applications. We have argued in support of low-dimensional models that: (a) are probabilistic; (b) preserve intuitive geometric properties of the ambient distribution, e.g. locality, as required for visual tracking applications; (c) provide mappings, or more generally multimodal conditional distributions between latent and ambient spaces, and (d) are consistent, efficient to learn and estimate and applicable with any embedding method like ISOMAP, LLE or LE. To allow for (a)-(d), we introduce models that combine the geometric and computational properties of spectral embeddings with the probabilistic formulation and the mappings 9

10 offered by LVMs like the. The resulting modes are simple, but we are not aware of a precursor with similar properties and consistency in the literature. We empirically show that (in conjunction with discriminative pose prediction methods and multilevel image encodings), SLVMs are useful for the fully automatic 3d reconstruction of complex human poses from non-instrumented monocular images. Future work We currently investigate alternative approximations to the latent space prior distribution as well as alternative constraints between latent and ambient points. We also plan to study the behavior of our algorithms for unfolding more complex structures, e.g. motion combinations and higher dimensional latent spaces, where the non-linear models are expected perform best. References [1] (3). CMU Human Motion Capture DataBase. Available online at [2] Agarwal, A., & Triggs, B. (6). A local basis representation for estimating human pose from cluttered images. ACCV. [3] Belkin, M., & Niyogi, P. (2). Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering. NIPS. [4] Bengio, J., Paiement, J., & Vincent, P. (3). Out-of-Sample Extensions for LLE, Isomap, MDS, Eigenmaps and Spectral Clustering. NIPS. [] Bishop, C., Svensen, M., & Williams, C. K. I. (1998). Gtm: The generative topographic mapping. Neural Computation, [6] Dalal, N., & Triggs, B. (). Histograms of oriented gradients for human detection. CVPR. [7] Donoho, D., & Grimes, C. (3). Hessian Eigenmaps: Locally Linear Embedding Techniques for High-dimensional Data. Proc. Nat. Acad. Arts and Sciences. [8] Elgammal, A., & Lee, C. (4). Inferring 3d body pose from silhouettes using activity manifold learning. CVPR. [9] Jordan, M., & Jacobs, R. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, [] Lawrence, N. (). Probabilistic non-linear component analysis with gaussian process latent variable models. JMLR, [11] Lawrence, N., Seeger, M., & Herbrich, R. (3). Fast sparse Gaussian process methods: the informative vector machine. NIPS. [12] LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proc. of IEEE. [13] Lowe, D. (4). Distinctive image features from scale-invariant keypoints. IJCV, 6. [14] Mackay, D. (1998). Comparison of Approximate Methods for Handling Hyperparameters. Neural Computation, 11. [1] Neal, R. (1996). Bayesian learning for neural networks. Springer-Verlag. [16] Roweis, S., & Ghahramani, Z. (1997). A unifying review of linear gaussian models. Neural Computation. [17] Roweis, S., & Saul, L. (). Nonlinear Dimensionality Reduction by Locally Linear Embedding. Science.

11 [18] Schölkopf, B., Smola, A., & Müller, K. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation,, [19] Sminchisescu, C., & Jepson, A. (4). Generative Modeling for Continuous Non-Linearly Embedded Visual Inference. ICML (pp ). Banff. [] Sminchisescu, C., Kanaujia, A., Li, Z., & Metaxas, D. (). Discriminative Density Propagation for 3D Human Motion Estimation. CVPR (pp ). [21] Tenenbaum, J., Silva, V., & Langford, J. (). A Global Geometric Framewok for Nonlinear Dimensionality Reduction. Science. [22] Tipping, M. (1998). Mixtures of probabilistic principal component analysers. Neural Computation. [23] Tipping, M. (1). Sparse Bayesian learning and the Relevance Vector Machine. JMLR. [24] Urtasun, R., Fleet, D., Hertzmann, A., & Fua, P. (). Priors for people tracking in small training sets. ICCV. [2] Weinberger, K., & Saul, L. (4). Unsupervised Learning of Image Manifolds by Semidefinite Programming. CVPR. 11

Semi-Supervised Hierarchical Models for 3D Human Pose Reconstruction

Semi-Supervised Hierarchical Models for 3D Human Pose Reconstruction Atul Kanaujia, CBIM, Rutgers Cristian Sminchisescu, TTI-C Dimitris Metaxas,CBIM, Rutgers 3D Human Pose Inference Difficulties Towards