HUMAN pose estimation is an important problem in computer

Size: px

Start display at page:

Download "HUMAN pose estimation is an important problem in computer"

Leonard Hunt
5 years ago
Views:

1 JOURNAL OF L A T E X CLASS FILES, VOL. 6, NO., JANUARY 7 Robust 3D Huan Pose Estiation fro Single Iages or Video Sequences Chunyu Wang, Student Meber, IEEE, Yizhou Wang, Meber, IEEE, Zhouchen Lin, Meber, IEEE, and Alan L. Yuille, Fellow, IEEE Abstract We propose a ethod for estiating 3D huan poses fro single iages or video sequences. The task is challenging because: (a any 3D poses can have siilar D pose projections which akes the lifting abiguous, and (b current D joint detectors are not accurate which can cause big errors in 3D estiates. We represent 3D poses by a sparse cobination of bases which encode structural pose priors to reduce the lifting abiguity. This prior is strengthened by adding lib length constraints. We estiate the 3D pose by iizing an L nor easureent error between the D pose and the 3D pose because it is less sensitive to inaccurate D poses. We odify our algorith to output K 3D pose candidates for an iage, and for videos, we ipose a teporal soothness constraint to select the best sequence of 3D poses fro the candidates. We deonstrate good results on 3D pose estiation fro static iages and iproved perforance by selecting the best 3D pose fro the K proposals. Our results on video sequences also show iproveents (over static iages of roughly 5%. Index Ters 3D huan pose estiation, sparse basis, anthropoorphic constraints, L -nor penalty function INTRODUCTION HUMAN pose estiation is an iportant proble in coputer vision which has received uch attention because any applications require huan poses as inputs for further processing [] [] [3] [4]. Representing huan otion by poses is arguably better than using low-level features [5] because it is ore interpretable and copact [3]. In recent years there has been uch progress in estiating D poses fro iages [6] [7] [8] [9] and videos [] [] [3]. A D pose is typically represented by a set of body joints [6] [] or body parts [7] [8]. Then a graphical odel is forulated where the graph node corresponds to a joint (or body part and the edges between the nodes encode spatial relations. This can be extended to video sequences [] [] [3] to explore the teporal cues for iproving perforance. Nevertheless, it sees ore natural to represent huans in ters of their 3D poses because this is invariant to viewpoint and the spatial relations between joints are sipler. But estiating 3D poses fro a single iage is difficult for any reasons. Firstly, it is an under-constrained proble because we are issing depth inforation and any 3D poses can give rise to siilar D poses after projection into the iage plane. In short, there are severe abiguities when lifting D poses to 3D. Secondly, estiating 3D poses requires first estiating the D joint locations in iages which can ake istakes and can result in bad estiates for soe joints. All these issues can degrade 3D pose estiation if not dealt with carefully. Thirdly, the situation becoes even worse if the caera paraeters are unknown which is typically the case in real applications. Hence we ust estiate the 3D pose and caera paraeters jointly [3] which leads to non-convex forulations which is difficult.. Method Overview We present an overview of our approach illustrating our ain contributions. This builds on, and gives a ore detailed description of, our preliary work [4] which estiated 3D poses fro a single iage. Our new contributions include extending the work to output K candidate 3D poses, to iprove our 3D pose estiates by post-processing, and to estiate 3D poses fro videos exploiting teporal cues. We break our approach down into five coponents described below. These are: (i the 3D pose prior, (ii the easureent error between the D pose and the 3D projection, (iii the inference algorith for estiating 3D pose and caera paraeters using the alternate direction ethod (ADM, (iv our ethod for outputting K candidate 3D poses, and (v the extension to video sequences... The Prior for 3D Poses We represent 3D poses by a linear cobination of basis functions. This is partly otivated by earlier work [3] which used PCA to estiate the bases fro a 3D dataset. By contrast, we learn the bases by iposing a sparsity constraint. This iplies that for a typical 3D pose only a sall nuber of the basis coefficients will be non-zero. We argue that sparse bases are ore natural than PCA for representing 3D poses because the space of 3D poses is highly non-linear. The sparsity requireent puts a strong prior on the space of 3D poses. But this prior needs to be strengthened because unrealistic 3D poses can still have sparse representations. To strengthen the sparsity prior we build on previous work on anthropoorphic constraints [5] [6] which shows that the lib length ratios of people are siilar and can be exploited to estiate 3D pose (but when used by theselves anthropoorphic constraints have abiguities. We were otivated to use anthropoorphic constraints by observing that any of the

2 JOURNAL OF LATEX CLASS FILES, VOL. 6, NO., JANUARY 7 see Fig. 4. We show that we can iprove perforance by a second stage where we select the candidate that best satisfies the anthropoorphic constraints. ( ( M M Input Iage D Pose 3D Pose Caera (3 Fig.. Method overview. ( On a test iage, we first estiate the D joint locations and obtain an initial 3D pose by the ean pose in the training data. This initializes an alternating direction ethod which recursively alternates the two steps (i.e. steps and 3. ( Estiate the the caera paraeters fro the D pose and current estiate of the 3D pose. (3 Re-estiate the 3D pose using the D pose and the current estiates of the caera paraeters. The algorith converges when the difference of the estiates is sall. unrealistic 3D poses often violate the. Hence we suppleent the sparse basis representation with hard lib length ratio constraints to discourage incorrect poses... Robust Measureent Error: L -nor The difficulty of 3D pose estiation is that we frequently get large errors, or outliers, in the positions of soe D joints. We use an L nor to copute the easureent error between the detected D poses and the projections of the 3D poses. We argue that this is better than using the standard L nor because the L is uch ore robust to large errors, or outliers, in the D pose estiates. The greater robustness of the L nor is well-known in the statistics literature[7]...3 Algorith to iize the objective function We forulate an objective function by cobining the easureent error with the sparsity penalty and the anthropoorphic constraints. Our inference algorith iizes this objective function to jointly estiate the 3D pose and the caera paraeters. We first initialize the 3D pose and then estiate the caera paraeter and the 3D pose alternatively (with the other fixed. See Fig.. The estiations (for 3D pose and caera separately are done using the alternating direction ethod (ADM which yields a fast algorith capable of dealing with the constraints...4 Multiple candidate proposals and selection For a single iage, our best estiate of the 3D pose is usually good [4] but not perfect. There are two ain reasons for this. Firstly, the proble is highly non-convex so our estiation algorith can get trapped in a local iu. Secondly, our basis functions are learnt fro 3D pose datasets of liited size, which ay cause soe errors. To address this issue, we odify our approach to output a set of K 3D poses, where K takes a default value of eight. Our experient shows that one of our top eight candidates is typically very close to the groundtruth, but the best candidate ay not be the one that iizes our objective function,..5 Selecting the Best Pose: Teporal Soothness If we have a video then we can obtain candidate 3D proposals for each frae and select the by iposing teporal soothness. This assues that the 3D pose does not change uch between adjacent fraes. We select the 3D pose by iizing an objective function which iposes teporal consistency and agreeent with the 3D pose priors. In suary, the ain novel contributions of this paper are the use of sparsity to obtain a prior for 3D poses which can effectively reduce the 3D pose lifting abiguities. Our ethod for suppleenting this with anthropoorphic constraints is also novel (but different fors of lib length constraints have been explored in history [6]. The use of the L nor to penalize easureent errors is new for this application (but well-known in the statistics literature [7]. Our use of the ADM algorith to ipose non-linear constraints is novel for this application. Our work on estiating the K best poses builds on prior work, e.g., [8], but they do not extend this to 3D poses and video sequences. The first part of this work (single 3D pose estiation was first presented in our preliary work [4] but in less detail. The paper is organized as follows: We first review related work in section. Section 3 and section 4 describe the details of iage/video based pose estiation, respectively. The basis learning ethod is discussed in section 5. Sections 6 and 6.4 give the experient results. We conclude in section 7. Appendix A presents the optiization ethod.. R ELATED W ORK Related Work on 3D Pose Estiation Existing work on 3D pose estiation can be classified into four categories by their inputs. The first class takes iages and caera paraeters as inputs. We only list a few of the here due to space liitations. Please see [9] for a ore coprehensive overview. Lee et al. [] first paraeterize the body parts by truncated cones. Then they optiize the rotations of body parts to iize the silhouette discrepancy between the odel projections and the iage by a sapling algorith. The ost challenging factor for 3D pose estiation fro a single caera is the twofold forwards/backwards flipping abiguity for each body part which leads to an exponential nuber of local ia. Rehg, Morris and Kanade [] coprehensively analyze the abiguities and propose a two-diensional scaled prisatic odel for figure registration which has fewer abiguity probles. Schisescu and Triggs [] propose to apply inverse kineatics to systeatically explore the coplete set of configurations which shows iproved perforance over the baselines. Then in a later work, they [3] propose to reduce the nuber of local ia by building roadaps of nearby ia linked by transition pathways which are found by searching for the codiension- saddle points. Sio-Serra et al. [4] first estiate the D joint locations and odel each joint by a Gaussian distribution. Then they propagate

3 JOURNAL OF L A T E X CLASS FILES, VOL. 6, NO., JANUARY 7 3 the uncertainty to the 3D pose space and saple a set of 3D skeletons there. They learn a SVM to resolve the abiguity by selecting the ost feasible skeleton. In a later work [5], they propose to detect the D and 3D poses siultaneously by first sapling the 3D poses fro a generative odel then reweighting the saplers by a discriative D part detector odel. They repeat the process until convergence. The second class uses anually labelled body joints in ultiple iages as inputs. The use of ultiple iages eliates uch of the abiguity of lifting D to 3D. Valadre et al. [6] first apply rigid structure fro otion to estiate the caera paraeters and the 3D poses of the torsos (which are assued to be rigid, and then requires huan input to resolve the depth abiguities for non-torso joints. Siilarly, Wei et al. [7] propose rigid body constraints to reove the abiguity. They assue that the pelvis and the left and right hip joints for a rigid structure, and require that the distance between any two joints on the rigid structure reain unchanged. They estiate the 3D poses by iizing the discrepancy between the 3D pose projections and the D joint detections without violating the rigid body constraints. The third class takes the joints in a single iage as inputs. For exaple, Taylor [5] assues the lib lengths are known and calculates the relative depths of the libs. Barron and Kakadiaris [8] extend this idea by estiating the lib length paraeters. Both approaches [5] [8] suffer fro sign abiguities. Pons-Moll, Fleet and Rosenhahn [9] propose to tackle the abiguities by seantic pose attributes. These attributes represent Boolean geoetric relationships between body parts which can be directly inferred fro iage data using a structured SVM odel. They saple ultiple poses fro the distribution and select the best one by the attributes. Raakrishna et al. [3] represent a 3D pose by a linear cobination of PCA bases. They greedily add the ost correlated basis into the odel and estiate the basis coefficients by iizing an L -nor error between the projection of the 3D pose and the D pose. They also enforce a constraint on the su of the lib lengths of the 3D poses. This constraint is weak because the individual lib lengths are not necessarily correct, even if the su is. Akhter and Black [3] propose to learn an even ore strict prior, i.e. pose-conditioned joint angle liits fro a large otion capture dataset. They also use sparse bases to represent poses. But different fro ours, they do not use the robust reconstruction loss ter neither the lib lengths constraints. The fourth class [3] [33] [34] [35] [3] [36] [37] requires only a single iage or iage features. Mori et al. [3] atch a test iage to the stored exeplars, and transfer the atched D pose to the test iage. They lift the D pose to 3D by [5]. Gregory et al. [34] propose to learn a set of hashing functions that efficiently index the training 3D poses. Bo et al. [38] use twin Gaussian Process to odel the correlations between iages and 3D poses. Elgaal et al. [33] learn a viewbased silhouette anifold by Locally Linear Ebedding (LLE and the apping function fro the anifold to 3D poses. Agarwal et al. [35] present a ethod to recover 3D poses fro silhouettes by direct nonlinear regression of the joint angles fro the silhouette shape descriptors. These approaches do not explicitly estiate caera paraeters and require a lot of training data fro different viewpoints in order to generalize to other datasets. Ionescu, Carreira and Schisescu [9] propose to siulate the Kinect systes to first label the iage pixels and then regress the 3D joint locations fro the derived features. In [39], the authors apply deep networks to regress 3D huan poses and D joint detections in iages together under a ulti-task fraework. In [36], the authors propose a univeral network to regress the pixelwise segentations, D poses and the 3D poses. The authors in [37] propose an dualsource approach to cobine the D and 3D pose estiation datasets which iproves the results when using only one data source. In a recent work [3], the authors propose to first detect the D poses in an iage and then fit a 3D huan shape odel by iizing the projection errors. Our ethod only requires a single iage or a video sequence as inputs. Unlike [3] [33] [34] [35], we explicitly estiate the caera paraeters which reduces dependence on training data. Our ethod is siilar to [3] but there are five distinctive differences: (i we do not require huan intervention. We obtain the D joint locations by applying a D pose detector [6] on the iage instead of by anual labeling; (ii we use the L -nor penalty instead of the L - nor because it is ore robust [7] to inaccurate D joint detections; (iii we enforce eight lib length constraints, which is ore effective in reoving iplausible poses. By contrast, they [3] enforce a single constraint, i.e., the su of the lib lengths, which is insufficient to eliate incorrect poses; (iv we add an explicit L -nor regularization ter on the basis coefficients in our forulation to encourage sparsity; while they greedily add a liited nuber of bases into the odel. They need to re-estiate the basis coefficients every tie a new basis is added; (v We learn the bases on training data which cobines all the actions while their approach splits the training data into classes, applies PCA to each class and finally cobines the principal coponents as bases. Our approach is easier to generalize to other datasets because it does not require people to anually split the training data.. Related Work on M-best Models Meltzer et al. [4] observe that axiu a posteriori (MAP estiates often do not agree with the groundtruth. There are several reasons accounting for this phenoenon. Firstly, the algorith coputing the MAP estiate ay get stuck in a local iu. Secondly, the odel itself is only an approxiation and ay depend on paraeters which are learnt inaccurately fro a sall training dataset. To address the proble, several work [4][4][43][44][3] propose to copute ultiple 3D huan poses for postprocessing. For exaple, Schisescu and Jepson [4] present a ixture soother for non-linear dynaical systes which can accurately locate ultiple trajectories. They use dynaic prograg or axiu a posteriori for picking the final solutions. This is siilar to ours except that our work focuses on how to generate ultiple solutions for a single static iage. Kazei et al. [44] first generate a set of high scoring D poses and then reorder the by training a ranksv using a ore coplicated scoring function. Batra et al.

4 JOURNAL OF L A T E X CLASS FILES, VOL. 6, NO., JANUARY 7 4 [43] generate a diverse set of proposal solutions progressively by augenting the energy function with a ter easuring the siilarity to previous solutions. Siilarly, Park and Raanan [4] propose an iterative ethod for coputing M-best D pose solutions fro a part odel that do not overlap, by iteratively partitioning the solution space into M sets and selecting the best solution fro each set. But it doesn t deal with 3D poses neither the videos as our ethod. Our work for producing ultiple solutions has two odules: candidates generation and candidate selection. In ters of coputing ultiple solutions, our work is related to [43] which also generates diverse solutions under the fraework of Markov Rando Fields. However, our work differs fro [43] in ters of that we propose a novel diversity ter which can be naturally integrated into our 3D pose estiation odel. It is also related to [4]. But in [4], generating ultiple solutions is natural because of the tree structure based inference (each root node location results in a solution. So the focus of [4]lies in how to select M-best fro the. In contrast, our work focuses on obtaining ultiple solutions by adding a diversity ter. In ters of selecting the best candidate, our use of the lib length constraints is novel. However, the use of teporal coherence cues and dynaic prograg has been explored extensively in previous work [3] [4]. 3 POSE ESTIMATION FROM A SINGLE IMAGE This section describes how our algorith can estiate candidate solutions for the 3D pose and caera paraeters fro a single iage. The input is the estiates of the D pose produced by a state-of-the-art detection algorith [45]. The first two sections describe the aterial that was first presented in our preliary work which outputs a single 3D pose estiate [4] while including additional details. The third section describes an extension which outputs ultiple 3D poses and selects the best of these candidates. 3. The 3D Pose Representation We describe the D and 3D pose representations and the caera odel in section 3.., the easureent error in section 3.., the sparse linear cobinations of bases in section 3..3, the anthropoorphic constraint in section 3..4 and the caera paraeter estiation in section The Representation and Caera Projection We represent D and 3D poses by n joint locations x R n and y R 3n, respectively. These can be expressed in atric fors by X R n and Y R 3 n respectively, where the i th colun are the D and 3D locations of the i th joint. We assue that the D and 3D poses have already been eancentered, i.e. the ean value of each row of X and Y is zero. We assue that people are not close to the caera which enables us to use a weak perspective caera ( odel. The T caera projection atrix is denoted by M = R 3 where T =. The scale paraeters have been iplicitly considered in the and. In other words, is not necessarily to be one. Then the D projection x of a 3D pose T y is given by: x = My, where M = I n M, in which I n is an identity atrix and is the Kronecker product operator. 3.. The Measureent Error: L or L nor The easureent error quantifies the difference between the D pose and the projection of the 3D pose. In this paper we consider two different easureent errors, specified by the L and L nors respectively: x My, L nor and x My, L nor. ( The L -nor is the ost widely used easureent error in the coputer vision literature. But, as discussed earlier, there can be large errors, or outliers, in the D pose estiation due to occlusion and other factors. Fig. gives an exaple of an outlier easureent. The right foot location (estiated by [6] is very inaccurate and biases the 3D estiate to the wrong solution if the L nor is used, while the L nor gives a better estiate. Hence we prefer to use the L nor because it is ore robust to outliers [7]. In the experiental section we show that the L nor gives better results Sparse Linear Cobination of Bases We represent a 3D pose y as a linear cobination of a set of bases B = {b,, b k }, i.e., y = k i= α ib i + µ (or y = Bα + µ, where the α are the basis coefficients and µ is the ean pose. The bases and the ean pose are learned fro a dataset of 3D poses as described in Section 5. Cobining this with the easureent error gives a penalty: x M (Bα + µ. ( In addition, we apply an L sparsity penalty θ α on the coefficients α so that typically only a few bases are activated for each 3D pose. This is aied to reduce the effective diension of the 3D pose space. Although huan poses are highly variable geoetrically it is clear that they do not for a linear space so not all linear cobination of bases should be allowed. In fact researchers have shown that 3D poses can be odelled by a low diensional non-linear space [46]. This leads to an objective penalty, which cobines the easureent error with the sparsity prior: α x M (Bα + µ + θ α (3 where θ > is a paraeter which balances the projection error and the sparsity penalty. The sparsity penalty can be thought of as a sparsity prior on the set of 3D poses when cobined with the requireent that each pose is a linear su of bases. In our experients, we set the paraeter θ by cross-validation. More specifically, it is set to be. in all experients. There is, however, a proble with using Eq. (3 by itself. We have observed that there are 3D configurations y for which the objective function takes sall values, but which are unlike huan poses. Hence the sparsity prior is not sufficient and needs to be strengthened.

In contrast, using L -nor returns a reasonable pose which does not have obvious errors despite the right foot joint.. See Section 3.

5 JOURNAL OF L A T E X CLASS FILES, VOL. 6, NO., JANUARY 7 5 (a (b (c Fig.. (a The estiated D joint locations where the right foot location is inaccurate. (b-c are the estiated 3D poses using the L -nor and L -nor projection error, respectively. Using L -nor biases the estiation to a copletely wrong pose. In contrast, using L -nor returns a reasonable pose which does not have obvious errors despite the right foot joint.. See Section Anthropoorphic Constraints It is known that the lib length ratios of different people are siilar in spite of the differences in their heights. This is soeties called an anthropoorphic, or structural, prior and has been explored to decrease the abiguities for huan pose estiation [5] [6] [3]. By itself it is not sufficient because it allows sign abiguities and cannot distinguish, for exaple, between an ar pointing forward or backward. We use the anthropoorphic prior to strengthen our sparsity prior. This requires forulating it as a anthropoorphic constraint which we can incorporate into our objective function. More specifically, we require that the lengths of the eight libs of a 3D pose should coply with certain proportions. The eight libs are the Left Upper Ar (LUA, Right Upper Ar (RUA, Left Lower Ar (LLA, Right Lower Ar (RLA, Left Upper Leg (LUL, Right Upper Leg (RUL, Left Lower Leg (LLL, and Right Lower Leg (RLL, respectively. These lib proportions are coputed fro the statistics of the poses in the training dataset (they are independent of individual subjects. We now proceed with the forulation. We define a joint selection atrix E j = [,, I,, ] R 3 3n, where the j th block is an identity atrix of diension 3 3 and the other blocks are zeros. We can verify that the product of E j and y returns the 3D location of the j th joint in pose y. Let C i = E i E i. Then C i y is the squared length of the i th lib whose ends are the i -th and i -th joints. We noralize the squared lib length of right lower leg to one and copute the average squared lengths of the other seven libs (say L i correspondingly fro the training data. We propose the following constraints C i (Bα + µ = L i. Input ( ( (3 (4 (5 (6 Fig. 3. Top-six 3D pose estiations of a saple iage. The plots in blue and red are the ground-truth and estiated 3D poses, respectively. The fifth estiation is the best aong the candidates. See section 3.3. and by solving the following proble: ( X T Y, s.t. T =. (4, T 3. The Inference Algorith: Estiating the 3D Pose and the Caera Paraeters Given the discussions above, we obtain the coplete objective function which depends on both the basis coefficients and the caera paraeters: α,m x M (Bα + µ + θ α s.t. C i (Bα + µ = L i, i =,, 8 T =, where M = I n M and M = ( T T (5. We iize the objective function by alternating between M (with α fixed and α (with M fixed. We first initialize the 3D pose to be the ean pose of the training dataset and optiize M. Then with the estiated M we optiize the basis coefficients α. Both the basis coefficient estiation and caera paraeter estiation probles are not convex because of the quadratic equality constraints. We solve the proble by using an alternating direction ethod (ADM [47]. Briefly, we define an augent Lagrangian function which contains prial variables (the 3D pose coefficients and the caera paraeters and dual variables (Lagrange ultipliers which enforce the equality constraints. The ADM updates the variables by extreizing an augented Lagrangian function with respect to the prial and dual variables alternately. Although there is no guarantee of global optiu, we alost always obtain reasonably good solutions (see section 3.3 for how we address this nonconvexity issue by producing ultiple solutions Robust Caera Estiation Another coponent of the approach is to estiate the caera paraeters M given the estiated 3D pose y and the corresponding D pose x by iizing the L -nor projection error. Ideally the equality relationship between( the D and 3D T poses: X = M Y should hold, where M = is the projection atrix of a weak perspective caera. Note that the scale paraeters have been iplicitly considered in the and. So we propose to estiate the caera paraeters T 3.3 Producing Several Candidate Poses The objective function in Eq. (5 is non-convex in M and α so we cannot guarantee that our algorith has found the global iu. This otivates us to extend our approach to output a diverse set of K solutions. After that we describe how to select the best 3D pose fro these candidates. We require that the K 3D poses Y returned by the odel are dissiilar to each other to avoid redundancy. We use the squared Euclidean distance between two poses y i and y j as the dissiilarity easure, i.e., (y i, y j = y i y j. Note

6 JOURNAL OF L A T E X CLASS FILES, VOL. 6, NO., JANUARY 7 6 Percentage Rank of the Best Estiation Percentage x=8 Error Dist. of st Estiate Error Dist. of Best Estiate Estiation Error Fig. 4. All results are based on the HuanEva dataset for three subjects. Left figure: The rank distribution of the best poses aong the eight candidates. Right figure: The error distribution when choosing the first candidate vs. choosing the best candidate (by oracle. X-axis is the average joint error of the three subjects and the y- axis represents the percentage of cases whose errors are saller than X. The estiation error units are illietres (s. See Section 3.3. that we use L -nor rather than the L -nor here because two poses are dissiilar even when only one joint of the two poses are dissiilar. To estiate the K th candidate pose y K given the D pose x and the first K poses {y i i =,, K }, we solve an augented iization proble which iizes a linear cobination of the original objective function and the y K s siilarity to the existing poses: α K x M (Bα + µ + θ α + θ s.t. C i (Bα + µ = L i, i =,, 8, (6 where θ is a paraeter which balances the loss ter and the siilarity ter. The proble can be solved by a sall odification of our original ADM based optiization algorith. We set the paraeter θ the sae as the θ in the previous odel. The paraeter θ is set by cross-validation. In particular, it is set to be. in all experients. We now select the best 3D pose fro the set of K candidates. We observe that the 3D pose estiate that iizes the objective function in Eq. (5 is soeties not the best solution. Fig. 3 shows an exaple where the fifth candidate rather than the first one is the best aong the six candidates. Fig. 4 (left shows the rank distributions of the best pose aong the candidates. We can see fro the left ost green bar that, for only about 3% of testing saples, the first estiate is the best estiate (closest to the groundtruth. In other words, the best one is not in the first position (rank one for nearly 7% of the cases. The right figure shows the estiation error distributions of selecting the first pose vs. selecting the best pose (using oracle fro the candidates. We can see that the perforance can be significantly iproved if we can select the correct estiate fro the candidates. Why is the best solution (as evaluated by groundtruth not always the 3D pose that iizes the objective function? This ay occur because of the liitations of the objective function. But it can also happen that because of the nature of our ADM algorith the anthropoorphic constraints have i= not fully been enforced. Hence the 3D pose candidates ay partially violate the anthropoorphic constraints. Fig. 7 (blue bars shows that the lib length errors for the eight libs are not zero although they are sall. This otivates selecting the candidate 3D pose which best satisfies the anthropoorphic constraints. We show, in the experiental section, that this iproves our results. 4 POSE ESTIMATION FROM A VIDEO Now suppose we have a video sequence as input. This gives another way to iprove 3D pose estiation using our K best candidates. For each iage frae, we estiate K candidates and use teporal inforation to select the best sequence. We define an objective function which encourages siilarity aong the 3D poses in adjacent fraes [48] [3] and uses unary ters siilar to those defined for static iages. We copute the Euclidean distance between two neighboring poses as the pairwise ter (y i, y j. This ter discourages sharp pose changes which can be helpful for difficult iages especially when their neighbouring estiations are accurate. Suppose a video sequence consists of T fraes. For each frae I t, i =,, T, we first obtain the K-best estiations yj t, j =,, K using the proposed ethod. Then we infer the best pose yj t t, j t K for each frae by iizing an objective function: T T j = f(yj t t + θ 3 (yj t t, y t+ j t+. (7 (j,,j T t= Bα + µ y i Here f(, easures how well a 3D pose obeys the anthropoorphic constraints (unary ter while the (, function easures the siilarity between adjacent poses (the pairwise ter, θ 3 specifies the trade off between the unary ter and the pairwise ter, which is set by cross validation. t= We use dynaic prograg to estiate j by iizing the objective function in Eq. (7. This exploits the one-diensional nature of the proble and is efficient since we only enforce teporal soothness between adjacent tie fraes. Experients show that this siple extension can yield iproveents in the 3D pose estiation results. 4. The Unary Ter We investigated two candidate unary ters in this work. The first is the easureent error between the projected 3D pose and the estiated D pose. We found this was not effective because the easureent errors of the top K candidates are very siilar. The reason is that incorrect 3D poses can have sall easureent errors because the caera paraeters copensate (i.e., bad 3D poses with bad caera paraeters can still give sall projection/easureent errors. Secondly, we easured how well the 3D pose satisfies the anthropoorphic constraints. More precisely, we coputed the absolute difference between the estiated lib length and the ean lib length L obtained during training. This was effective and we used the second ethod in our experient. Note that we also use anthropoorphic constraints when selecting the best solution for a single iage fro a set of K proposals, i.e. we use these constraints for both static iages and video sequences to select fro a set of candidates.

7 JOURNAL OF L A T E X CLASS FILES, VOL. 6, NO., JANUARY 7 7 Reconstruction error PCA Classwise PCA Sparse bases 5 5 Nuber of bases (a Probability PCA Classwise PCA Sparse bases 4 6 Activated basis nuber Fig. 5. Coparison of the three basis learning ethods. (a 3D pose reconstruction errors using different nuber of bases. In this experient, the 3D poses are noralized so that the length of the right lower leg is one.(b Distribution of the nuber of activated bases for representing a 3D pose. The y-axis is the percentage of the cases whose nuber of activated bases is less than x. See section BASIS LEARNING We now describe how we learn the bases using sparse coding in experients. We copare with two ost popular basis learning ethods including PCA and classwise PCA. 5. Our Basis Learning Method We learn the bases fro a set of of 3D skeletons Y = [y,, y l ] by optiizing the epirical cost function: f l (B l (b l c(y i, B. (8 i= B = [b,, b k ] R 3n k is the basis dictionary to be learned with each colun representing a pose basis, and c(.,. is a loss function such that c(y, B is sall if B represents the pose y well. Also c(.,. iposes sparsity so that each pose is typically represented by only a sall nuber of bases. Note that overcoplete dictionaries with k 3n are allowed. As previous work, e.g., see [49], and consistent with our objective function (Eq.(3 the loss function c(y, B is given by: c(y, B = y Bα + θ α (9 To prevent B fro being arbitrarily large we constrain its coluns, i.e. the nors of each basis, to have an L -nor less than or equal to one. B,α s.t. l l i= y i Bα i + λ α i b T j b j, j =,, k ( This proble is not convex with respect to B and α but convex with respect to each of the two variables when the other is fixed. We use [49] to solve this optiization proble which alternates between the two variables. 5. Other Basis Learning Methods We copare our approach with two classic basis learning ethods proposed in the literature. The first ethod [46] applies PCA to the training otion capture data and uses the principal eigenvectors as the bases. Note that the axiu nuber of bases is liited by the diension of the poses (3n in our case because the eigenvectors are orthogonal to each other. As discussed in [3], it is probleatic to directly apply PCA to the poses for all 3D actions because PCA is ost suitable to data which coes fro a single-ode Gaussian distribution. Usually this is a very strong assuption. Hence Raakrishna et al. [3] split the training dataset into different classes using the action labels and assue that the data of each action class follows the Gaussian distribution. They apply PCA on each class, and cobine the principal coponents in each class as the bases. We nae this approach classwise PCA. Since the bases learned for different classes are learned separately, there ay be redundancy when copared with those jointly learned bases. We evaluate the three different basis learning ethods (i.e. PCA, classwise PCA and sparse coding in a 3D pose reconstruction setting. We reconstruct each 3D pose y by solving an L -nor regularized least square proble: α y Bα + λ α. ( We copute the reconstruction error y Bα as the evaluation etric. The average reconstruction errors using different nuber of bases are shown in Fig. 5 (a. Note that the axiu nuber of bases for PCA and classwise PCA ethods is 36 (which is the diension of a 3D pose and 44 (36 * 4 classes, respectively. The reconstruction error of PCA bases is the largest (slightly above.5 because the poses do not follow Gaussian distribution and the nuber of PCA bases is sall. Although the reconstruction errors of classwise PCA bases gradually decrease as ore bases are introduced, they are still larger than those of the sparse bases. One of the ain reasons ight be that the classwise PCA ethod does not encourage basis sharing between action classes. Hence the bases ight contain redundancy. In contrast, the sparse bases are shared between classes as they are learned fro the training data of different action classes together. In addition, Fig. 5 (b shows that fewer bases are activated using sparse bases. This also justifies their representative power. 6 EVALUATION ON SINGLE IMAGES We conduct two types of experients to evaluate our approach. The first type is synthetic where we assue the groundtruth D joint locations are known and recover the 3D poses. We systeatically evaluate: (i the influence of the three factors in the odel, i.e. the L -nor easureent error, the anthropoorphic constraints and the L -nor sparsity regularization on the basis coefficients; (ii the influence of the D pose accuracy; (iii the influence of the relative huan-caera angles; (iv the generalization capabilities of the learned bases. The second type of experients is real: we estiate the D joint locations by running a state-of-the-art D pose detector [45] and then estiate the 3D poses. We copare our ethod with the state-of-the-art ones [3] [4] [5]. We also observe that our approach can refine the original D pose estiations by projecting the inferred 3D pose back to the iages.

8 JOURNAL OF L A T E X CLASS FILES, VOL. 6, NO., JANUARY 7 8 We use body joints, i.e., the left and right shoulders, elbows, hands, hips, knees and feet, for quantitative evaluation, which is consistent with the D pose detector [45]. We learn bases for all experients and approxiately 4 bases are activated for representing a 3D pose. 6. The Datasets We evaluate our approach on two benchark datasets: the HuanEva dataset [5]and the H3.6M dataset [5]. The HuanEva dataset contains ulti-view videos, 3D poses and the caera paraeters. Following the previous work, e.g.,[4], for fair coparisons, we use the walking and jogging actions of the three subjects for evaluation (the fourth subject is withheld by the authors and learn the bases on the training subset of the poses independently for each action. We report results on the validation sequences. Note that we also evaluate the generalization ability of the learned bases to other actions in section The H3.6M dataset [5] is a new large dataset which provides illions of 3D poses and iages. It includes subjects perforg 5 actions, such as eating, posing and walking. We use the data of subjects S, S5, S6, S7 and S8 for training and the data of S9 and S for testing. 6. Synthetic Experients: Known D Poses We assue the D poses x are known and recover the 3D poses y fro x. We use ean 3D joint error [5] as evaluation etric which is the average error over all joints. The results are reported using unit of illietres (s. All synthetic experients are on the HuanEva dataset. 6.. Necessity of Basis Representation We first discuss the necessity of learning bases to represent 3D poses. We design a baseline which represents a 3D pose by the nearest neighbor. Intuitively, we treat all the l training poses as basis and represent a testing pose by the nearest training pose. Since the training poses should approxiately satisfy the lib length constraints, we reove those constraints. We also reove the sparsity ter because only one basis will be activated. The 3D pose which can iize the L -nor projection error is the final estiate. The ean square error on the HuanEva dataset for this ethod is about 7s while the result for our proposed ethod is about 4s. We think the ain reason for the degraded perforance is because the training poses differ fro the testing poses and the nearest neighbor ethod does not have the capability to represent the unseen poses. In contrast, the ethods which use bases to represent poses such as ours have ore flexibility. 6.. Influence of the Three Factors We evaluate the influence of the three factors in the proposed ethod: the robust L -nor easureent error, the anthropoorphic constraints and the sparsity regularization ter. We copare our approach with seven baselines. The first is sybolized as LS which uses the L -nor easureent error function and Sparsity ter on the basis coefficients. The second baseline is LA which uses the L -nor easureent average ean squared error L LS LA LAS L LS LA LAS Fig. 6. 3D pose estiation errors of the baselines and our ethod (LAS. The units for estiation errors are s. lib length error LAS LS LUA LLA RUA RLA LUL LLL RUL RLL Fig. 7. Average lib length error of the LAS and LS. error function and Anthropoorphic constraints. The reaining baselines are sybolized as L, LA, LAS, L and LS whose eanings can be siilarly understood by their naes. We solve the non-convex optiization probles in LA and LAS by the ADM ethod used to solve our ethod (LAS. To solve the optiization probles in the other baselines, we use CVX, a package for solving convex progras [53]. Fig. 6 shows the results on the HuanEva dataset. First, the four baselines without the sparsity ter (i.e., L, LA, L and LA achieve uch larger estiation errors than those with the sparsity ter (i.e., LS, LAS, LS and LAS. The significant gains deonstrate that the sparse bases encode iportant structural priors in 3D huan skeletons and can prevent overfitting to D poses given enough bases, the D projection error could always be decreased to zero but the resulting 3D pose ight still have large errors. Using sparse bases sees to help prevent it fro happening. Second, enforcing the eight lib length constraints further iproves the perforance, e.g., LAS outperfors LS and our approach outperfors LS. Fig. 7 shows that the lib lengths of the estiated poses are ore accurate by enforcing the anthropoorphic constraints. Third, using the L -nor reconstruction error outperfors L -nor, e.g., LAS is better than LAS and our approach is slightly better than LAS. However, the difference is sall because the poses in this experient are accurate which does not reveal the potential influences of the L -nor penalty function. It is interesting to see that L is worse than L. The reason is that ignoring the anthropoorphic constraints and the sparsity ter will result in iplausible 3D poses. In this case, using the robust ter will tolerate inconsistent atches between the 3D pose projections and the D poses (which are accurate in this experient.

9 JOURNAL OF L A T E X CLASS FILES, VOL. 6, NO., JANUARY 7 9 Percentage LAS-noise- LAS-noise-7 LAS-noise-. LAS-noise-7 LS-noise- LS-noise Estiation Error Fig. 8. Results when different levels of noises are added to D poses. The x-axis is the estiation error and the y- axis is the percentage of cases where the estiation error is less than the corresponding x value Influence of Inaccurate D Poses We evaluate the robustness of our approach to inaccurate D pose estiations. We generate outlier D poses by adding seven levels (agnitudes of noises to the accurate D poses to siulate D pose estiation errors. In particular, for each D pose, we randoly select a body joint, generate a rando D spatial shift orientation, and add the corresponding transforation (of a certain agnitude to the selected joint. The agnitude of the i th ( i 7 level of noises is 8i pixels. We estiate the 3D poses fro the corrupted D joint locations and report the estiation accuracy. Fig. 8 deonstrates the results for LAS, LAS and LS on the HuanEva dataset. We do not report the results for the other five baselines because the estiation errors are very large copared to these three. We can see that LAS perfors uch better than the other two. For exaple, the estiation errors are saller than for 6% of data for LAS even when the largest (7th level of noises are added. In contrast, this nuber is decreased to only 4% for LAS. The results verify that L - nor is ore robust to inaccurate D poses than L -nor. LS perfors worst aong the three which also shows the iportance of anthropoorphic constraints especially when the D joint locations are inaccurate Influence of Huan-Caera Angles The degree of abiguities in 3D pose estiation depends on the relative angle between huan and caera. Generally speaking, abiguity is largest when people face the caeras and is the sallest when people turn sideways. We quantitatively evaluate the approach s disabiguation ability on various huan-caera angles. We synthesize ten virtual caeras of different panning angles and project the 3D poses to D using the virtual caeras. Then we estiate the 3D poses fro the D projections in each caera and copare the estiation results. More specifically, we first transfor the 3D poses into a local coordinate syste, where the x-axis is defined by the line passing the two hips, the y-axis is defined by the line of spine and the z-axis is the cross product of the x-axis and y-axis. Then we rotate the 3D poses around y-axis by a particular angle, ranging fro to 8, and project the to D by a weak perspective caera. Note that the y-axis is the axis where ost viewpoint variations happen in real world Pose Estiation Error Our approach Raakrishna ethod Huan Caera Angles Fig. 9. 3D pose estiation errors when the huancaera angle varies fro to 8 degrees. We copare with Raakrishna s ethod [3]. See Section Percentage yaw pitch roll Caera Rotation Angle Estiation Error Percentage GroundTruth Caera Estiate Caera Initialize With 3 Clusters D Pose Estiation Error Fig.. Left figure: Error distribution of the estiated Caera rotation angles. The units are degrees. Right figure: 3D pose estiation errors when caera paraeters are ( set by ground truth, ( estiated by initializing the 3D pose with ean pose, or (3 estiated by initializing the 3D pose with 3 cluster centers for parallel optiization (but only the best result is reported. The y-axis is the percentage of the cases whose estiation error is less than x. The units for x are s. See section iages. Hence we only report the perforance for the y-axis rotations for space liitations. Fig. 9 shows that the average estiation errors of the ethod proposed in [3] increase quickly as huan oves fro profile (9 degrees towards frontal pose ( degree. However, our approach is ore robust against viewpoint changes due to the structural prior iposed by the sparse bases and the strong lib length constraints Influence of Caera Paraeter Estiation Fig. (left shows the estiation errors of the caera rotation angles (i.e., yaw, pitch and roll. We can see that the errors are sall for ost cases. Fig. (right shows the 3D pose estiation errors using the estiated caeras and ground truth caeras, respectively. Note that when using ground truth caera paraeters, we do not update the in each iteration. We can see that caera estiation results can affect the 3D pose estiation to soe extent. The initialization of the 3D pose influences the 3D pose estiation accuracy ore accurate 3D pose initializations can iprove the final result. So we cluster the training poses into 3 finer clusters and initialize the 3D pose with each of the centers respectively. We optiize the 3D poses for the 3 initializations in parallel and keep the one which is closest to ground truth. The perforance can be further iproved using finer initializations as shown in Fig..

10 JOURNAL OF L A T E X CLASS FILES, VOL. 6, NO., JANUARY Generalization Capabilities of the Bases We conduct three types of experients to validate the generalization capabilities of our approach: ( cross-subject, ( cross-action and (3 cross-datasets experients. For the cross-subject experient (on the HuanEva dataset including all the six actions, we use the leave-one-subjectout criteria, i.e., training on the two subjects and testing on the reaining one. The average error is about 43.s. This is coparable with the previous experient setup (the result is 4 when training and testing on all subjects. Siilarly for the cross-action experient (on the HuanEva dataset, we use the leave-one-action-out criteria. In this experient, we use the sequences of all the six actions provided in the dataset in addition to the walking and jogging sequences. In particular, we train on the poses of the five actions and test on the reaining one. We repeat the above process for all the six configurations and report the average estiation error. In this experient, The average estiation error is about 48.4s which is slightly higher than training/testing on the sae actions. This slight perforance degradation is reasonable as the bases are learned fro different actions which ight have different sets of poses. For the cross-datasets experient, we train on the H3.6M dataset (including all actions and test on the HuanEva dataset. The average estiation error is about 57.3s. The estiation error is larger than the previous one and the reason ight be because the two datasets have slightly different annotations. For exaple, the hip joints ight correspond to slightly different parts of the huan body. Another reason ight be because the two datasets have different sets of actions. But overall, this is still a reasonable perforance. 6.3 Real Experients: Unknown D Poses We first estiate the D joint locations in the test iages by running a D pose detector [45]. Then we estiate the 3D poses fro the D joint locations. We copare our ethod with the state-of-the-arts in section We also observe that by projecting the estiated 3D poses and caera paraeters to D, we can actually iprove the D pose estiation results. This is stated in section Coparison to the State-of-the-arts We copare our approach with the state-of-the-art ones [4] [5] [5] on the HuanEva and the H3.6M datasets. Table shows the ean squared errors and the standard deviations. Note that the results are not directly coparable because of different experient setups. First, it is fair to copare the results of ours (using D detector [6] with the ethods [5] [4] as they use the sae D pose detectors. Second, our ethod using the state-of-the-art D pose detector outperfors our ethod using [6] which is ainly due to the iproveent fro the D joint location estiation. Method [38] achieves siilar perforance as our ethod by assug that the silhouettes are known. Table shows the results on the H3.6M dataset. We can see that our ethod achieve coparable perforance as the state-of-the-arts. TABLE Real experient on the HuanEva dataset: coparison with the state-of-the-art ethods [4] [5]. We present results for both walking and jogging actions of all three subjects and caera C. The nubers in each cell are the ean 3D joint errors and standard deviation, respectively. We use the unit of illieter as in [4] and [5]. The length of the right lower leg is about 38. See Section Walking S S S3 Average Ours (D[6] 54.3 ( ( ( Ours (D[45] 4.3 ( ( ( [5] 65. ( ( ( [4] 99.6 ( ( (4..76 [5] [38] 38. ( (3. 4. ( Jogging S S S3 Average Ours (D[6] 54.6 ( ( (. 44. Ours (D[45] 39.7 ( ( ( [5] 74. ( ( ( [4] 9. ( ( ( [38] 4. ( ( ( Evaluation on D Pose Estiation We observe that projecting the estiated 3D pose by the caera paraeters can iprove the original D pose estiations. The reason is that the sparse bases and the anthropoetric constraints could bias the estiated 3D pose to a correct configuration in spite of the errors in D joint locations. In experients, we project the estiated 3D poses to D and copare it with the original D pose estiation [6] and [3]. For [3], we project its estiated 3D pose to D iage. We report the results using two criteria. The first is the probability of correct pose (PCP an estiated body part is considered correct if its segent endpoints lie within 5% of the length of the ground-truth segent fro their annotated location as in [6]. The second criterion is the Euclidean distance between the estiated D pose and the groundtruth in pixels as in [4]. Table 3 shows the estiation accuracy on each of the eight body parts and the overall accuracy. We can see that our approach perfors the best on six body parts. In particular, we iprove over the original D pose estiators by about.3 (.74 vs..74 using the first PCP criteria. Our approach also perfors best using the second criterion. 6.4 Evaluation on the Videos We now evaluate the influence of integrating the teporal consistency into our odel. In particular, we quantitatively investigate the influence of the two factors (i.e., the unary ter and pairwise ter in the Video-Based Pose Estiation (VBPE ethod on the HuanEva dataset. We divide the long videos into short snippets with each snippet having five fraes. We also experient with other length choices but it does not ake uch difference unless it has fewer than three fraes or ore than ten fraes which will degrade the perforance. The balancing paraeter between the unary and pairwise ters is set by cross-validation. In particular, in our experient, this is set to be. on the HuanEva dataset.

11 JOURNAL OF L A T E X CLASS FILES, VOL. 6, NO., JANUARY 7 TABLE Real experient on the H3.6M dataset: coparison with the state-of-the-art ethods. Directions Discussion Eating Greeting Phoning Photo Posing Purchases LinKDE [5] * Li et al. [54] Tekin et al. [55] Zhou et al. [56] SMPLify [3] Ours (D detector [45] Sitting SittingDown Soking Waiting WalkDog Walking WalkTogether Average LinKDE [5] * Li et al. [54] Tekin et al. [55] Zhou et al. [56] SMPLify [3] Ours (D detector [45] * The result is obtained on the testing dataset. TABLE 3 D pose estiation results. We report: ( the Probability of Correct Pose (PCP for the eight body parts and the whole pose, (3 and the Euclidean distance between the estiated D pose and the groundtruth in pixels. PCP LUA LLA RUA RLA LUL LLL RUL RLL Overall Pixel Diff. Yang et al. [6] Raakrishna et al. [3] Ours Fig. shows the advantages of VBPE (green line over the single Iage-Based-Pose-Estiation (IBPE, red line ethod. First, the VBPE outperfors the IBPE which verifies that estiating poses on videos by considering both the new unary ter and the pairwise ter can iprove the perforance. The average estiation error of VBPE is decreased by about 5% copared with IBPE. Secondly, relying on the pairwise ter alone defined on the teporal consistency (agenta line offers soe benefits over IBPE. It iproves the estiation results for iages having large estiation errors (between 4 and 6. Third, using only the unary ter defined on lib length (blue provides larger gains. This observation shows the iportance of the anthropoorphic constraints. It also suggests that the optiizer for IBPE could possibly get trapped in local optiu and return a 3D pose that does not well satisfy the anthropoorphic easureents. Percentage First Pose Unary Pairwise Cobined Estiation Error Fig.. The 3D pose estiation results on videos. The results using unary ter, pairwise ter and both of the two ters are reported. The red line shows the result on static iages. The error units are s. See section CONCLUSION We address the proble of estiating 3D huan poses fro a single iage or a video sequence. We first tackle the abiguity of lifting a D pose to 3D by proposing a sparse basis based representation of 3D poses and anthropoorphic constraints. Second, we use an L -nor easureent error which akes the approach robust to inaccurate D pose estiates. Third, the proble of local optiu is alleviated by generating several probable but diverse solutions and selecting the correct one using teporal consistency cues. APPENDIX A OPTIMIZATION We sketch the ajor steps of ADM for solving our pose estiation (Eq. (6 and caera paraeter estiation (Eq. (4 probles. The k and l are the nuber of iterations. A. 3D Pose Estiation Given the currently estiated caera paraeters M and the detected D pose x, we estiate the 3D pose by solving the following L iization proble using ADM: α K x M (Bα + µ + θ α + θ i= Bα + µ y i s.t. C i (Bα + µ = L i, i =,, ( We introduce two auxiliary variables β and γ and rewrite Eq. ( as: α,β,γ γ + θ β + θ K i= Bα + µ y i s.t. γ = x M (Bα + µ, α = β, C i (Bα + µ = L i, i =,,. (3

12 JOURNAL OF L A T E X CLASS FILES, VOL. 6, NO., JANUARY 7 The augented Lagrangian function of Eq. (3 is: L (α, β, γ, λ, λ, η = γ + θ β + K θ i= Bα + µ y i + λ T [γ x + M(Bα + µ] + λ T (α β+ [ γ x + M(Bα + µ + α β ] η where λ and λ are the Lagrange ultipliers and η > is the penalty paraeter. ADM is to update the variables by iizing the augented Lagrangian function w.r.t. the variables α, β and γ alternately. A.. Update γ We discard the ters in L which are independent of γ and update γ by: γ k+ = arg γ + η [ ] k γ γ x M(Bα k + µ λk which has a closed for solution [57]. A.. Update β We drop the ters in L which are independent of β and update β by: β k+ = arg β + η ( k λ k β θ β + α k η k which also has a closed for solution [57]. η k where G is the Lagrange Multiplier and δ > is the penalty paraeter. We update Q and P alternately. Update Q: Q l+ = arg tr(ω i Q =, i =,, L (Q, P l, G l, δ l. (7 This is a constrained least square proble and has a closed for solution. Update P : We discard the ters in L which are independent of P and update P by: P l+ = arg P, rank(p P Q F (8 where Q = Q l+ + δ l G l. Note that P Q is equal F to P Q T + Q. Then (8 has a closed for solution F by the lea A.. Update G: We update the Lagrangian ultiplier G by: G l+ = G l + δ l (Q l+ P l+ (9 Update δ: We update the penaly paraeter by: δ l+ = (δ l ρ, δ ax, ( A..3 Update α We disiss the ters in L which are independent of α and update α by: α k+ = arg α z T W z s.t. z T Ω i z =, i =,, (4 is P = ax(ξ, ν ν T, where S is a syetric atrix and where z = [α T ] T ξ, and ν are the largest eigenvalue and eigenvector of S, B T M T MB + I + θ(k η k BB T respectively. W= [ ( T ] Proof: Since P is a syetric sei-positive definite γ k+ x + Mµ + λk η k MB β k+ + λk η k + D atrix and its rank is one, we can write P as: P = ξνν T, ( B and Ω i = T Ci T C ib B T Ci T C where ξ. Let the largest eigenvalue of S be ξ, then we iµ µ T Ci T C ib µ T Ci T C. have ν T Sν ξ, ν. Then we have: iµ L i D = θ K η i= (µ y ib. P S F = P F + S F tr(p T S Let Q = zz T. Then the objective function becoes z T W z = tr(w Q and Eq. (4 is transfored to: Q tr(w Q s.t. tr(ω i Q =, i =,,, Q, rank(q. (5 We still solve proble (5 by the alternating direction ethod [57]. We introduce an auxiliary variable P and rewrite the proble as: Q,P tr(w Q s.t. tr(ω i Q =, i =,,, P = Q, rank(p, P. Its augented Lagrangian function is: (6 where ρ and δ ax are constant paraeters. Lea A.: The solution to P P S F s.t. P, rank(p ( ξ + n i= ξ i ξξ = (ξ ξ + n i= ξ i n i= ξ i + (ξ, ( The iu value can be achieved when ξ = ax(ξ, and ν = ν. A..4 Update λ We update the Lagrangian ultiplier λ by: λ k+ = λ k + η k ( γ k+ x + M ( Bα k+ + µ (3 A..5 Update λ We update the Lagrangian ultiplier λ by: L (Q, P, G, δ = tr(w Q + tr(g T (Q P + δ Q P F λ k+ = λ k + η k ( α k+ β k+ (4

13 JOURNAL OF L A T E X CLASS FILES, VOL. 6, NO., JANUARY 7 3 A..6 Update η We update the penalty paraeter η by: η k+ = (η k ρ, η ax, (5 where ρ and η ax are the constant paraeters. A. Caera Paraeter Estiation Given estiated D pose X and 3D pose Y, we estiate caera paraeters by solving the following optiization proble: ( X T Y, s.t. T =. (6, T We introduce an auxiliary variable R and rewrite Eq. (6 as: R,, R s.t. R = X ( T T Y, T =. (7 We still use ADM to solve proble (7. Its augented Lagrangian function is: L 3 (R,, (, H, [( ζ, τ ] = R + tr H T T T Y + R X + ζ( T [ ( + τ T Y + R X + ( ] T T where H and ζ are Lagrange ultipliers and τ > is the penalty paraeter. A.. Update R We discard the ters in L 3 which are independent of R and update R by: ( R k+ = arg R + τ ( k R R + k T ( k T Y X + Hk τ k which has a closed for solution [57]. A.. Update We discard the ters in L 3 which are independent of and update by: ( = arg T ( k T k+ F + ( T k + ζk τ k Y + R k+ X + Hk τ k This is a least square proble and has a closed for solution. A..3 Update We discard the ters in L 3 which are independent of and update by: ( ( k+ = arg T ( ( k+ + k+ T Y + R k+ X + Hk τ k T + ζk τ k This is a least square proble and has a closed for solution. F F A..4 Update H We update Lagrange ultiplier H by: (( ( H k+ = H k + τ k k+ T ( k+ T Y + R k+ X (8 F A..5 Update ζ We update the Lagrange ultiplier ζ by: ζ k+ = ζ k + τ k ( k+ T k+ (9 REFERENCES [] L. W. Capbell and A. F. Bobick, Recognition of huan body otion using phase space constraints, in ICCV, 995, pp [] Y. Yacoob and M. J. Black, Paraeterized odeling and recognition of activities, in ICCV, 998, pp. 7. [3] C. Wang, Y. Wang, and A. L. Yuille, An approach to pose-based action recognition, in CVPR, 3, pp [4] J. Wang, Z. Liu, Y. Wu, and J. Yuan, Mining actionlet enseble for action recognition with depth caeras, in CVPR,, pp [5] I. Laptev, On space-tie interest points, IJCV, vol. 64, no. -3, pp. 7 3, 5. [6] Y. Yang and D. Raanan, Articulated pose estiation with flexible ixtures-of-parts, in CVPR,, pp [7] V. Ferrari, M. Marin-Jienez, and A. Zisseran, Progressive search space reduction for huan pose estiation, in CVPR, 8, pp. 8. [8] S. Ioffe and D. Forsyth, Huan tracking with ixtures of trees, in ICCV, vol.,, pp [9] C. Ionescu, J. Carreira, and C. Schisescu, Iterated second-order label sensitive pooling for 3d huan pose estiation, in CVPR, 4, pp [] D. Raanan, D. A. Forsyth, and A. Zisseran, Strike a pose: Tracking people by finding stylized poses, in CVPR, vol., 5, pp [] B. Sapp, D. Weiss, and B. Taskar, Parsing huan otion with stretchable odels, in CVPR,, pp [] M. Dantone, J. Gall, C. Leistner, and L. Van Gool, Huan pose estiation using body parts dependent joint regressors, in CVPR, 3, pp [3] V. Raakrishna, T. Kanade, and Y. Sheikh, Reconstructing 3d huan pose fro d iage landarks, in ECCV,, pp [4] C. Wang, Y. Wang, Z. Lin, A. L. Yuille, and W. Gao, Robust estiation of 3d huan poses fro single iages, in CVPR, 4. [5] C. J. Taylor, Reconstruction of articulated objects fro point correspondences in a single uncalibrated iage, in CVPR, vol.,, pp [6] H.-J. Lee and Z. Chen, Deteration of 3d huan body postures fro a single view, CVGIP, vol. 3, no., pp , 985. [7] P. J. Huber, Robust statistics. Springer,. [8] P. Yadollahpour, D. Batra, and G. Shakhnarovich, Diverse -best solutions in rfs, in Workshop on Discrete Optiization in Machine Learning, NIPS,. [9] G. Pons-Moll and B. Rosenhahn, Model-based pose estiation, in Visual analysis of huans. Springer,, pp [] M. W. Lee and I. Cohen, Proposal aps driven cc for estiating huan body pose in static iages, in CVPR, vol., 4, pp. II 334. [] J. M. Rehg, D. D. Morris, and T. Kanade, Abiguities in visual tracking of articulated objects using two-and three-diensional odels, The International Journal of Robotics Research, vol., no. 6, pp , 3. [] C. Schisescu and B. Triggs, Kineatic jup processes for onocular 3d huan tracking, in CVPR, vol.. IEEE, 3, pp. I I. [3], Building roadaps of ia and transitions in visual odels, IJCV, vol. 6, no., pp. 8, 5. [4] E. Sio-Serra, A. Raisa, G. Alenyà, C. Torras, and F. Moreno-Noguer, Single Iage 3D Huan Pose Estiation fro Noisy Observations, in CVPR,. [5] E. Sio-Serra, A. Quattoni, C. Torras, and F. Moreno-Noguer, A Joint Model for D and 3D Pose Estiation fro a Single Iage, in CVPR, 3. [6] J. Valadre and S. Lucey, Deteristic 3d huan pose estiation using rigid structure, in ECCV,, pp [7] X. K. Wei and J. Chai, Modeling 3d huan poses fro uncalibrated onocular iages, in ICCV, 9, pp

JOURNAL OF L A T E X CLASS FILES, VOL. 6, NO., JANUARY 7 4 [8] C. Barron and I. A. Kakadiaris, Estiating anthropoetry and pose fro a single iage, in CVPR, vol.,, pp. 669 676. [9] G. Pons-Moll, D. J. Fleet, and B.

446 455. [3] F. Bogo, A. Kanazawa, C. Lassner, P. Gehler, J. Roero, and M. J. Black, Keep it spl: Autoatic estiation of 3d huan pose and shape fro a single iage, in ECCV. Springer, 6, pp. 56 578.

Lee, Inferring 3d body pose fro silhouettes using activity anifold learning, in CVPR, vol., 4, pp. II 68. [34] G. Shakhnarovich, P. Viola, and T.

[36] A.-I. Popa, M. Zanfir, and C. Schisescu, Deep ultitask architecture for integrated d and 3d huan sensing, arxiv preprint arxiv:7.8985, 7. [37] H. Yasin, U. Iqbal, B. Kruger, A. Weber, and J.

14 JOURNAL OF L A T E X CLASS FILES, VOL. 6, NO., JANUARY 7 4 [8] C. Barron and I. A. Kakadiaris, Estiating anthropoetry and pose fro a single iage, in CVPR, vol.,, pp [9] G. Pons-Moll, D. J. Fleet, and B. Rosenhahn, Posebits for onocular huan pose estiation, in CVPR, 4, pp [3] I. Akhter and M. J. Black, Pose-conditioned joint angle liits for 3d huan pose reconstruction, in CVPR, 5, pp [3] F. Bogo, A. Kanazawa, C. Lassner, P. Gehler, J. Roero, and M. J. Black, Keep it spl: Autoatic estiation of 3d huan pose and shape fro a single iage, in ECCV. Springer, 6, pp [3] G. Mori and J. Malik, Recovering 3d huan body configurations using shape contexts, PAMI, vol. 8, no. 7, pp. 5 6, 6. [33] A. Elgaal and C.-S. Lee, Inferring 3d body pose fro silhouettes using activity anifold learning, in CVPR, vol., 4, pp. II 68. [34] G. Shakhnarovich, P. Viola, and T. Darrell, Fast pose estiation with paraeter-sensitive hashing, in ICCV, 3, pp [35] A. Agarwal and B. Triggs, Recovering 3d huan pose fro onocular iages, PAMI, vol. 8, no., pp , 6. [36] A.-I. Popa, M. Zanfir, and C. Schisescu, Deep ultitask architecture for integrated d and 3d huan sensing, arxiv preprint arxiv:7.8985, 7. [37] H. Yasin, U. Iqbal, B. Kruger, A. Weber, and J. Gall, A dual-source approach for 3d pose estiation fro a single iage, in CVPR, 6, pp [38] L. Bo and C. Schisescu, Twin gaussian processes for structured prediction, IJCV, vol. 87, no. -, pp. 8 5,. [39] S. Li and A. B. Chan, 3d huan pose estiation fro onocular iages with deep convolutional neural network, in ACCV. Springer, 4, pp [4] T. Meltzer, C. Yanover, and Y. Weiss, Globally optial solutions for energy iization in stereo vision using reweighted belief propagation, in ICCV, vol., 5, pp [4] C. Schisescu and A. Jepson, Variational ixture soothing for nonlinear dynaical systes, in CVPR, vol.. IEEE, 4, pp. II II. [4] D. Park and D. Raanan, N-best axial decoders for part odels, in ICCV. IEEE,, pp [43] D. Batra, P. Yadollahpour, A. Guzan-Rivera, and G. Shakhnarovich, Diverse -best solutions in arkov rando fields, in ECCV,, pp. 6. [44] V. Kazei and J. Sullivan, Using richer odels for articulated pose estiation of footballers. in BMVC,, pp.. [45] A. Newell, K. Yang, and J. Deng, Stacked hourglass networks for huan pose estiation, in ECCV. Springer, 6, pp [46] A. Safonova, J. K. Hodgins, and N. S. Pollard, Synthesizing physically realistic huan otion in low-diensional, behavior-specific spaces, TOG, vol. 3, no. 3, pp. 54 5, 4. [47] Z. Lin, R. Liu, and Z. Su, Linearized alternating direction ethod with adaptive penalty for low-rank representation, in NIPS,, pp [48] V. Ferrari, M. Marín-Jiénez, and A. Zisseran, d huan pose estiation in tv shows, in Statistical and Geoetrical Approaches to Visual Motion Analysis, 9, pp [49] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, Online dictionary learning for sparse coding, in ICML. ACM, 9, pp [5] B. Daubney and X. Xie, Tracking 3D huan pose with large root node uncertainty, in CVPR,, pp [5] L. Sigal and M. J. Black, Huaneva: Synchronized video and otion capture dataset for evaluation of articulated huan otion, Brown Univertsity TR, vol., 6. [5] C. Ionescu, D. Papava, V. Olaru, and C. Schisescu, Huan3. 6: Large scale datasets and predictive ethods for 3d huan sensing in natural environents, TPAMI, vol. 36, no. 7, pp , 4. [53] M. Grant and S. Boyd, CVX: Matlab software for disciplined convex prograg, version. beta, Sep. 3. [54] S. Li, W. Zhang, and A. B. Chan, Maxiu-argin structured learning with deep networks for 3d huan pose estiation, in ICCV, 5, pp [55] B. Tekin, X. Sun, X. Wang, V. Lepetit, and P. Fua, Predicting peoples 3d poses fro short sequences, arxiv preprint arxiv: 54.8, 5. [56] X. Zhou, M. Zhu, S. Leonardos, K. G. Derpanis, and K. Daniilidis, Sparseness eets deepness: 3d huan pose estiation fro onocular video, in CVPR, 6, pp [57] R. Liu, Z. Lin, and Z. Su, Linearized alternating direction ethod with parallel splitting and adaptive penalty for separable convex progras in achine learning. ACML, 3. Chunyu Wang is an associate researcher in Microsoft Research Asia. He received his Ph.D in coputer science fro Peking University in 6. His research interests are in coputer vision, artificial intelligence and achine learning. Yizhou Wang is a Professor of the Coputer Science Departent at Peking University, China. He received his Ph.D. in coputer science fro University of California at Los Angeles (UCLA in 5. He was a Research Staff of the Palo Alto Research Center (Xerox-PARC fro 5 to 8. His research interests include coputer vision, statistical odeling and learning. Zhouchen Lin received the Ph.D. degree in applied atheatics fro Peking University in. He is a Professor with the Coputer Science Departent, Peking University. Before, he was a Lead Researcher with the Visual Coputing Group, Microsoft Research Asia. His research interests include coputer vision, achine learning, pattern recognition, and nuerical coputation and optiization. He is an Associate Editor of Neurocoputing. Alan L. Yuille received the B.A. degree in atheatics and the Ph.D. degree in theoretical physics studying under Stephen Hawking fro the University of Cabridge, in 976 and 98, respectively. He joined the Artificial Intelligence Laboratory, MIT, fro 98 to 986, and followed this with a faculty position with the Division of Applied Sciences, Harvard, fro 986 to 995. Fro 995 to, he was a Senior Scientist with the Sith-Kettlewell Eye Research Institute, San Francisco. In, he accepted a position as a Full Professor with the Departent of Statistics, University of California, Los Angeles. He has over two hundred peer-reviewed publications in vision, neural networks, and physics, and has co-authored two books: Data Fusion for Sensory Inforation Processing Systes (with J. J. Clark and Two- and Three-Diensional Patterns of the Face (with P. W. Hallinan, G. G. Gordon, P. J. Giblin, and D. B. Muford. He received several acadeic prizes.

Robust Estimation of 3D Human Poses from a Single Image

Robust Estimation of 3D Human Poses from a Single Image arxiv:46.8v [cs.cv] 9 Jun 4 CBMM Meo No. 3 June, 4 Robust Estiation of 3D Huan Poses fro a Single Iage by Chunyu Wang,Yizhou Wang,Zhouchen Lin,Alan L. Yuille,Wen Gao Peking University, Beijing, China University