Ordinal Depth Supervision for 3D Human Pose Estimation

Georgios Pavlakos 1, Xiaowei Zhou 2, Kostas Daniilidis 1
1 University of Pennsylvania, 2 State Key Lab of CAD&CG, Zhejiang University

Abstract

Our ability to train end-to-end systems for 3D human pose estimation from single images is currently constrained by the limited availability of 3D annotations for natural images. Most datasets are captured using Motion Capture (MoCap) systems in a studio setting, and it is difficult to reach the variability of 2D human pose datasets, like MPII or LSP. To alleviate the need for accurate 3D ground truth, we propose to use a weaker supervision signal provided by the ordinal depths of human joints. This information can be acquired by human annotators for a wide range of images and poses. We showcase the effectiveness and flexibility of training Convolutional Networks (ConvNets) with these ordinal relations in different settings, always achieving competitive performance with ConvNets trained with accurate 3D joint coordinates. Additionally, to demonstrate the potential of the approach, we augment the popular LSP and MPII datasets with ordinal depth annotations. This extension allows us to present quantitative and qualitative evaluation in non-studio conditions. Simultaneously, these ordinal annotations can be easily incorporated in the training procedure of typical ConvNets for 3D human pose. Through this inclusion we achieve new state-of-the-art performance on the relevant benchmarks and validate the effectiveness of ordinal depth supervision for 3D human pose.

1. Introduction

Human pose estimation has been one of the most remarkable successes for deep learning approaches. Leveraging large-scale datasets with extensive 2D annotations has immensely benefited 2D pose estimation [55, 34, 31, 61], semantic part labeling [11, 57] and multi-person pose estimation [15, 30, 9]. In contrast, the complexity of collecting images with corresponding 3D ground truth has constrained 3D human pose datasets to small scale [19] or to strictly studio settings [42, 16]. The goal of this paper is to demonstrate that in the absence of accurate 3D ground truth, end-to-end learning can be competitive by using weaker supervision in the form of ordinal depths of the joints (Figure 1).

[Figure 1: Summary of our approach. In the absence of accurate 3D ground truth we propose the use of ordinal depth relations (closer-farther) of the human body joints, e.g., Z(right ankle) < Z(right hip), for end-to-end training of 3D human pose estimation systems.]

Aiming to boost end-to-end discriminative approaches, different techniques attempt to augment the training data. Synthetic examples can be produced in abundance [13, 53], but there is no guarantee that they come from the same distribution as natural images. Multi-view systems for accurate capture of 3D ground truth can work outdoors [26], but they need to be synchronized and calibrated, so data collection is not practical and hard to scale. These limitations have favored reconstruction approaches, e.g., [5, 25], which employ reliable 2D pose detectors and recover 3D pose in a subsequent step using the 2D joint estimates.
Unfortunately, even in the presence of perfect 2D correspondences, the final 3D reconstruction can be erroneous. This 2D-to-3D reconstruction ambiguity is mainly attributed to the binary ordinal depth relations of the joints (closer-farther) [45]. Leveraging image-based evidence, such as occlusion and shading, can largely resolve the ambiguity, yet this information is discarded by reconstruction approaches. Motivated by the particular power of ordinal depth relations at resolving reconstruction ambiguities, and by the fact that this information can be acquired by human annotators, we propose to use ordinal depth relations to train ConvNets for 3D human pose estimation.

Since humans can easily perceive pose [24] and are better at estimating ordinal depth than explicit metric depth [50], annotators can provide pairwise ordinal depth relations for a wide range of imaging conditions, activities, and viewpoints. We build on the idea of ordinal relations, demonstrating their flexibility and effectiveness in a variety of settings: 1) we use them to predict directly the depths of joints, 2) we combine them with 2D keypoint annotations to predict 3D poses, 3) we demonstrate how they can be incorporated within a volumetric representation of 3D pose [32]. In every case, the weak supervision signal provided by these ordinal relations leads to performance competitive with fully supervised approaches that employ the actual 3D ground truth.

Additionally, to motivate the use of ordinal depth relations for human pose, we provide ordinal depth annotations for two popular 2D human pose datasets, LSP [18] and MPII [2]. This extension allows us to provide quantitative and qualitative evaluation of our approach in non-studio settings. Simultaneously, these ordinal annotations for in-the-wild images can be easily incorporated in the training procedure of typical ConvNets for 3D human pose, leading to new state-of-the-art results on the standard benchmarks of Human3.6M and HumanEva-I. These performance benefits underline the effectiveness of ordinal depth supervision for human pose problems and provide motivation for further exploration using the available annotations.

Our contributions can be summarized as follows:
- We propose the use of ordinal depth relations of human joints for 3D human pose estimation, bypassing the need for accurate 3D ground truth.
- We showcase the flexibility of the ordinal relations by incorporating them in different network settings, where we always achieve results competitive with training on the actual 3D ground truth.
- We augment two popular 2D pose datasets (LSP and MPII) with ordinal depth annotations and demonstrate the applicability of the proposed approach to 3D pose estimation in non-studio conditions.
- We include our ordinal annotations in the training procedure of typical ConvNets for 3D human pose and exemplify their effectiveness by achieving new state-of-the-art results on the standard benchmarks.

2. Related work

Since the literature on 3D human pose estimation is vast, here we discuss works closely related to our approach and refer the interested reader to Sarafianos et al. [41] for a recent survey on the topic.

Reconstruction approaches: A long line of approaches follows the reconstruction paradigm by employing 2D pose detectors to localize 2D human joints and using these locations to estimate plausible 3D poses [10, 17]. Zhou et al. [68, 69] use 2D heatmaps from a 2D pose ConvNet to reconstruct 3D pose in a video sequence. Bogo et al. [5] fit a statistical model of 3D human shape to the predicted 2D joints. Alternatively, a network can also handle the step of lifting 2D estimates to 3D poses [56, 28, 51]. Notably, Martinez et al. [25] achieve state-of-the-art results with a simple multilayer perceptron that regresses 3D joint locations, given 2D keypoints as input. Despite the success of this paradigm, it comes with important drawbacks. No image-based evidence is used during the reconstruction step, the result is too reliant on an imperfect 2D pose detector, and even for perfect 2D correspondences, the 3D estimate might fail because of the reconstruction ambiguity.
In contrast, by using ordinal depth relations we can leverage rich image-based information during estimation, without relinquishing the accuracy of reconstruction approaches, which can also be integrated in our framework (Section 3.4).

Discriminative approaches: Discriminative approaches are orthogonal to the reconstruction paradigm, since they estimate the 3D pose directly from the image. Prior work uses ConvNets to regress the coordinates of the 3D joints [22, 47, 48, 51, 44, 26], to regress 3D heatmaps [32], or to classify each image into the appropriate pose class [39, 40]. The main critique of these end-to-end approaches is that images with corresponding 3D ground truth are required for training. Our work attempts to relax this important constraint by training with weak 3D information in the form of ordinal depth relations for the joints and 2D keypoints. Weak supervision was also used in recent work [64] by constraining the lengths of the predicted limbs. However, we argue that our supervision does not simply constrain the output of the network, but also provides novel information for in-the-wild images and further enhances training.

Generating training examples: The limited availability of 3D ground truth for training 3D human pose ConvNets has also been addressed in various ways in recent works. The most straightforward solution is to use graphics to augment the training data [13, 53, 26]. Differently, Rogez and Schmid [39] propose a collage approach, composing human parts from different images to produce combinations with known 3D pose. In both cases though, most examples do not reach the detail and variety of in-the-wild images. Mehta et al. [26] record multiple views outdoors and estimate accurate 3D ground truth for every view. However, multi-view systems need to be synchronized and calibrated, so large-scale data collection is not trivial.

3D annotations: Prior works have also relied on humans to perceive and annotate 3D properties that are lost through the projection of a 3D scene on a 2D image. Bell et al. [3] and Chen et al. [12] annotate the ordinal relations for the apparent depth of pixels in the image.

In the work of Xiang et al. [59, 58], humans align 3D CAD models with single images to provide viewpoint information. Concerning 3D human pose annotations, the famous poselets work of Bourdev and Malik [6] uses an interactive tool for annotators to adjust the 3D pose, making the procedure laborious. Maji et al. [23] provide 3D annotations for human pose, but only in the form of yaw angles for the head and torso. The idea of ordinal depth relations is also explored by Pons-Moll et al. [35], where attributes regarding the relative 3D position of the body parts are included in their posebits database. Unlike them, we provide annotations by humans for a much larger set of images (i.e., more than 15k images with our annotations compared to 1k for the posebits dataset), and instead of exploring an extensive set of pose attributes, we propose a cleaner training scheme that requires only 2D keypoint locations and ordinal depth relations. In recent work, Lassner et al. [21] estimate proposals of 3D human shape fits for single images, which are accepted or rejected by annotators. Despite the rich ground truth in the case of a good fit, many automatic proposals are of low quality, leading to many discards. Our work aims for a more balanced solution, where the 3D annotations have a weaker form, but the task is easy for humans, so that they can provide annotations on a large scale for practically any available image.

Ordinal relations: There is a long history of learning from ordinal relations outside the field of computer vision, with particular interest in the area of information retrieval, where many algorithms for learning-to-rank have been developed [7, 8, 46]. In the context of computer vision, previous works have used relations to learn the apparent depth [70, 12] or reflectance [29, 63] of a scene. We share a common motivation with these approaches in the sense that ordinal relations are easier for humans to annotate, compared to metric depth or absolute reflectance values.

3. Technical approach

In this section we present our proposed approach for different settings of 3D human pose estimation. First, in Section 3.1 we predict only the depths of the human joints, relying on ordinal depth relations and a ranking loss for training. Then, in Section 3.2 we combine the ordinal relations with 2D keypoint annotations to predict the 3D pose coordinates. In Section 3.3 we explore the incorporation of ordinal relations within a volumetric representation for 3D human pose [32]. Finally, Section 3.4 presents the extension of the previous networks with a component designed to encode a geometric 3D pose prior.

3.1. Depth prediction

Our initial goal is to establish the training procedure such that we can leverage ordinal depth relations to learn to predict the depths of the human joints. This is the simplest case, where instead of explicitly predicting the 3D pose, we only predict depth values for the joints. Let us represent the human body with N joints. For each joint i we want to predict its depth z_i. The provided data are in the form of pairwise ordinal depth relations. For a pair of joints (i, j), we denote the ordinal depth relation as r(i, j), taking the value:
- +1, if joint i is closer than j,
- -1, if joint j is closer than i,
- 0, if their depths are roughly the same.

The ConvNet we use for this task takes the image as input and predicts N depth values z_i, one for each joint.
Given the relation r(i, j) and assuming that the ConvNet produces the depth estimates z_i and z_j for the two corresponding joints, the loss for this pair is:

L_{i,j} =
  \begin{cases}
    \log\left(1 + \exp(z_i - z_j)\right), & r(i,j) = +1, \\
    \log\left(1 + \exp(z_j - z_i)\right), & r(i,j) = -1, \\
    (z_i - z_j)^2, & r(i,j) = 0.
  \end{cases}    (1)

This is a differentiable ranking loss, which has similarities with early works in the learning-to-rank literature [7] and was also adopted by [12] for apparent depth estimation. Intuitively, it enforces a large margin between the values z_i and z_j if one of them has been annotated as closer than the other; otherwise it enforces them to be equal. Denoting with I the set of pairs of joints that have been annotated with an ordinal relation, the complete expression for the loss takes the form:

L_{rank} = \sum_{(i,j) \in I} L_{i,j}.    (2)

An interesting property of this loss is that we do not require the relations for all pairs of joints to be available during training. The loss can be computed based only on the subset of pairs that have been annotated. Additionally, the relations do not have to be consistent, i.e., no strict global ordering is required. Instead, the ConvNet is allowed to learn a consensus from the provided relationships by minimizing the incurred loss. This is a helpful property in case there are ambiguities in the annotations.

3.2. Coordinate prediction for 3D pose

Our initial ConvNet only predicts the depths of the human joints. To enable full 3D pose reconstruction, we additionally need to precisely localize the corresponding joints on the image. Given the ConvNet used in the previous section, the most natural extension is to enrich its output by predicting the 2D coordinates of the joints as well. Thus, we predict 2N additional values which correspond to the pixel coordinates w = (x, y) of each joint. We consider this combination of 2D keypoints with ordinal depth as a form of weak 3D information, and we refer to the corresponding ConvNet as the weakly supervised version.

Let us denote with w_n the ground truth 2D location for joint n, and with \hat{w}_n the corresponding ConvNet prediction. Assuming the availability of 2D keypoint annotations, the familiar l2 regression loss can be applied:

L_{keyp} = \sum_{n=1}^{N} \| w_n - \hat{w}_n \|_2^2.    (3)

By combining the ranking loss for the values z_n and the regression loss for the keypoint coordinates w_n, we can train the ConvNet end-to-end: L = L_{rank} + \lambda L_{keyp}, where the value \lambda = 100 is used for our experiments.
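To make the preceding losses concrete, here is a minimal sketch of the ranking loss of Equations (1)-(2) and the combined weakly supervised objective of Section 3.2. PyTorch is assumed, and all names, tensor shapes, and the use of softplus for log(1 + exp(·)) are our own choices, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def ordinal_rank_loss(z, pairs, r):
    """Ranking loss of Eq. (1)-(2).

    z     : (B, N) predicted depths, one value per joint.
    pairs : (P, 2) long tensor with the annotated joint index pairs (i, j).
    r     : (B, P) relations in {+1, -1, 0}; +1 means joint i is closer.
    """
    zi = z[:, pairs[:, 0]]                       # (B, P)
    zj = z[:, pairs[:, 1]]
    closer = F.softplus(zi - zj)                 # log(1 + exp(z_i - z_j))
    farther = F.softplus(zj - zi)                # log(1 + exp(z_j - z_i))
    equal = (zi - zj) ** 2                       # pull depths together for r = 0
    loss = torch.where(r == 1, closer,
                       torch.where(r == -1, farther, equal))
    return loss.mean()

def weak_3d_loss(z, w_pred, w_gt, pairs, r, lam=100.0):
    """Combined objective L = L_rank + lambda * L_keyp (Section 3.2)."""
    l_keyp = ((w_pred - w_gt) ** 2).sum(dim=-1).mean()  # l2 loss on 2D keypoints
    return ordinal_rank_loss(z, pairs, r) + lam * l_keyp
```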

3.3. Volumetric prediction for 3D pose

Apart from direct regression of the 3D pose coordinates, recent work has investigated the use of a volumetric representation for 3D human pose [32]. In this case, the space around the subject is discretized, and the ConvNet predicts per-voxel likelihoods for every joint in the 3D space. The training target for the volumetric space is a 3D Gaussian centered at the 3D location of each joint. However, without explicit 3D ground truth, supervising this volume is not trivial. To demonstrate the general applicability of ordinal relations, we adapt this representation to make it compatible with ordinal depth supervision as well.

To bypass this seemingly complex issue, we propose to preserve the volumetric structure of the output, but decompose the supervision a) in the 2D image plane and b) in the z dimension (depth), as presented in Figure 2. Precisely, for every joint n, the ConvNet predicts score maps \Psi_n, which can be transformed to a probability distribution by applying a softmax operation \sigma. So, joint n is located at position u = (x, y, z) with probability p(u|n) = \sigma[\Psi_n]_u. The marginalized probability distribution in the 2D plane is:

p(x, y | n) = \sum_{z} p(u | n),    (4)

and can be computed efficiently as a sum-pooling operation across all the slices of the volume. This operation is equivalent to adopting a weak perspective camera model. Similarly, the marginalized probability distribution for the depth dimension is:

p(z | n) = \sum_{x,y} p(u | n),    (5)

and can again be computed as a sum-pooling operation across all the pixels of a slice.

[Figure 2: Visualization of the volumetric output for an individual joint. The predictions are volumetric, but in the absence of accurate 3D ground truth, the supervision is applied independently on the 2D image plane and the depth dimension. The marginalized likelihoods are computed by means of sum-pooling operations.]

This decomposition has the advantage that even if we do not have complete 3D ground truth, we can still supervise the ConvNet. The 2D image plane (values of equation 4) and the depth dimension (values of equation 5) are supervised independently, but they are connected by the underlying volumetric representation, which enforces the 3D consistency. Our loss function takes the form: L = L_{rank} + \lambda L_{heat}. The loss for the z-dimension, L_{rank}, is the same ranking loss as before (equation 2), where we recover the depth of each joint by taking the mean value of the estimated soft distribution: z_n = \sum_z z \, p(z|n). For the x-y dimensions, the target for each keypoint is a heatmap with a Gaussian centered around its ground truth location, and L_{heat} is an l2 loss between the predicted and the ground truth heatmaps [52, 33]. We stress that the alterations presented up to this point refer only to the supervision type, without interfering with the network architecture.
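Since the decomposition above reduces to a softmax followed by sum-pooling, it is compact to express in code. The sketch below (again PyTorch, with a tensor layout of our choosing) computes the marginals of Equations (4)-(5) and the soft depth estimate z_n that is fed to the ranking loss:

```python
import torch

def decompose_volume(psi):
    """Split a volumetric score map into 2D and depth likelihoods.

    psi : (B, N, D, H, W) per-joint scores Psi_n over the discretized volume.
    Returns the marginal p(x,y|n), the marginal p(z|n), and the soft
    depth estimate z_n used by the ranking loss.
    """
    B, N, D, H, W = psi.shape
    # Softmax over the whole volume turns scores into a distribution p(u|n).
    p = torch.softmax(psi.reshape(B, N, -1), dim=-1).reshape(B, N, D, H, W)
    p_xy = p.sum(dim=2)                          # Eq. (4): sum over depth slices
    p_z = p.sum(dim=(3, 4))                      # Eq. (5): sum over each slice
    zs = torch.arange(D, dtype=p.dtype, device=p.device)
    z_n = (p_z * zs).sum(dim=-1)                 # z_n = sum_z z * p(z|n)
    return p_xy, p_z, z_n
```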
This allows most of the state-of-the-art discriminative ConvNets [64, 44, 48, 32] to be used as-is, and be complemented with the proposed ordinal depth supervision when 3D ground truth is not available.

3.4. Integration with a reconstruction component

The strength of the aforementioned networks is that they leverage image-based information to resolve the single-view depth ambiguities and produce depth estimates z_n that respect the ordinal depths of the human joints. However, the predicted depth values do not typically match the exact metric depths of the joints, since no full 3D pose example has been used to train the networks. This motivates us to enhance the architecture with our proposed reconstruction component, which takes as input the estimated 2D keypoints w_n and the ordinal depth estimates z_n, for all joints n, and reconstructs the 3D pose S \in R^{N \times 3}. This input-output relation is presented in Figure 3a.
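As a reference for the architecture described above, the following is a sketch of a Martinez-style [25] reconstruction component with two residual "bilinear units"; the hidden width of 1024 and the dropout rate are assumptions on our part, not values stated in the text.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """One 'bilinear unit' in the spirit of [25]: two linear layers with
    batch norm, ReLU, dropout, and a residual connection."""
    def __init__(self, dim=1024, p=0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim), nn.BatchNorm1d(dim), nn.ReLU(), nn.Dropout(p),
            nn.Linear(dim, dim), nn.BatchNorm1d(dim), nn.ReLU(), nn.Dropout(p))

    def forward(self, x):
        return x + self.net(x)

class ReconstructionComponent(nn.Module):
    """Maps 2D keypoints and ordinal depth estimates (3N input values)
    to 3D joint positions (3N output values)."""
    def __init__(self, num_joints=16, dim=1024):
        super().__init__()
        self.inp = nn.Linear(3 * num_joints, dim)
        self.blocks = nn.Sequential(ResBlock(dim), ResBlock(dim))
        self.out = nn.Linear(dim, 3 * num_joints)

    def forward(self, x):                        # x: (B, 3N) = (x_i, y_i, z_i)
        return self.out(self.blocks(self.inp(x)))
```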

[Figure 3: (a) The reconstruction component is a multi-layer perceptron with two bilinear units [25]. The input is the concatenation of the pixel locations of the joints (x_i, y_i) and the ordinal depths z_i, while the output is the 3D pose coordinates S_i. (b) Integration of the reconstruction module in the full framework. The ConvNet of Section 3.2 or 3.3 estimates 2D keypoint locations and depths, which are used by the reconstruction module to predict a coherent 3D pose.]

Conveniently, for the training of this component we require only MoCap data, which are available in abundance. During training, we simply project each 3D pose skeleton to the 2D image plane. To simulate the input, we use the projected 2D joint locations and a noisy version of the depths of the joints, such that the majority of their ordinal relations are preserved, while their values might not necessarily match the actual depth. Denoting with \hat{S}_n the output 3D joints of the ConvNet and with S_n the joints of the 3D pose that was used to generate the input, our supervision is an l2 loss:

L_{3D} = \sum_{n=1}^{N} \| S_n - \hat{S}_n \|_2^2.    (6)

This module can be easily incorporated in an end-to-end framework by using as input the output of the ConvNet from Section 3.2 or Section 3.3. This is presented schematically in Figure 3b. The benefit from employing such a reconstruction module is demonstrated empirically in Section 4.3.

4. Empirical evaluation

This section concerns the empirical evaluation of the proposed approach. First, we present the benchmarks that we employed for quantitative and qualitative evaluation. Then, we provide some essential implementation details. Finally, quantitative and qualitative results are presented on the selected datasets.

4.1. Datasets

We employed two standard indoor benchmarks, Human3.6M [16] and HumanEva-I [42], along with a recent dataset captured in indoor and outdoor conditions, MPI-INF-3DHP [26, 27]. Additionally, we extended two popular 2D human pose datasets, the Leeds Sports Pose dataset (LSP) [18] and the MPII human pose dataset (MPII) [2], with ordinal depth annotations for the human joints.

Human3.6M: It is a large-scale dataset captured in an indoor environment that contains multiple subjects performing typical actions like Eating and Walking. Following the most popular protocol (e.g., [68]), we train using subjects S1, S5, S6, S7, and S8, and test on subjects S9 and S11. The original videos are downsampled from 50fps to 10fps to remove redundancy. A single model is trained for all actions. Results are reported using the mean per joint error and the reconstruction error, which allows a Procrustes alignment of the prediction with the ground truth.

HumanEva-I: It is a smaller scale dataset compared to Human3.6M, including fewer subjects and actions. We follow the typical protocol (e.g., [4]), where the training sequences of subjects S1, S2 and S3 are used for training and the validation sequences of the same subjects are used for testing. We train a single model for all actions and subjects, and we report results using the reconstruction error.

MPI-INF-3DHP: It is a recent dataset that includes both indoor and outdoor scenes. We use it exclusively for evaluation, without employing the training data, to demonstrate the robustness of the trained model under significant domain shift. Following the typical protocol ([26, 64]), results are reported using the PCK3D and AUC metrics.
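The reconstruction error used by Human3.6M and HumanEva-I aligns the prediction to the ground truth with a similarity transform before measuring joint distances. A NumPy sketch of this Procrustes alignment follows; it is our illustration of the metric, not the official evaluation code.

```python
import numpy as np

def reconstruction_error(pred, gt):
    """Mean per joint error after Procrustes alignment.

    pred, gt : (N, 3) arrays of 3D joint positions (e.g., in mm).
    """
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    p, g = pred - mu_p, gt - mu_g
    # Optimal rotation via SVD of the cross-covariance (orthogonal Procrustes).
    U, S, Vt = np.linalg.svd(p.T @ g)
    d = np.sign(np.linalg.det(Vt.T @ U.T))       # avoid reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    s = (S * [1.0, 1.0, d]).sum() / (p ** 2).sum()   # optimal scale
    aligned = s * p @ R.T + mu_g
    return np.linalg.norm(aligned - gt, axis=1).mean()
```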
LSP + MPII Ordinal: Leeds Sports Pose [18] and MPII human pose [2] are two of the most widely used benchmarks for 2D human pose. Here we extend both of them, offering ordinal depth annotations for the human joints. For LSP we annotate all 2k images, while for MPII we annotate the subset of 13k images used by Lassner et al. [21]. Annotators were presented with a pair of joints for each image and answered which joint was closer to the camera. The option "ambiguous/hard to tell" was also offered. We considered 14 joints, excluding the thorax and spine joints of MPII, which are often not used for training (e.g., [55]). The questions for each image were continued until a global ordering could be inferred for all the joints. By enforcing a global ordering we conveniently do not encounter any contradicting annotations. More importantly, this approach significantly decreased annotation time. If the relative questions had to be answered for all pairs of joints, we would require C(14, 2) = 91 questions per image. In contrast, with the procedure we followed, we could infer a global ordering with roughly 17 questions per image in the mean case. This resulted in 5 times faster annotation. Additionally, we observed that annotators were much more efficient when they were asked continuously about a specific pair of joints, instead of changing the pair of focus.
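As one plausible realization of this adaptive questioning (the authors' exact annotation tool is not described), the annotator can act as the comparator of a standard comparison sort, which needs on the order of n log n queries instead of all C(14, 2) = 91 pairs. A toy sketch:

```python
import functools

def global_depth_order(joints, ask):
    """Sort joints from closest to farthest using pairwise queries.

    joints : list of joint names.
    ask    : callback ask(i, j) -> +1 (i closer), -1 (j closer), 0 (same);
             in practice this would be the human annotator's answer.
    """
    queries = 0
    def cmp(i, j):
        nonlocal queries
        queries += 1
        return -ask(i, j)            # "i is closer" sorts i before j
    ordering = sorted(joints, key=functools.cmp_to_key(cmp))
    return ordering, queries
```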

As a result, we created groups of 50 images containing questions about the same pair of joints. This way we could get annotations at a rate of 3.5 seconds per question, meaning that in total the procedure required roughly 1 minute per image.

We clarify that our goal for this dataset is to provide a novel information source (ordinal depth) for in-the-wild images. We do not use it for evaluation, since it is not a mm-level accuracy benchmark like Human3.6M or HumanEva-I. Furthermore, the goal is not to conduct a computational study concerning the level of accuracy at which humans perceive 3D poses, as this has already been examined in the past [24]. In contrast, we use these annotations to demonstrate that: a) they can boost the performance of 3D human pose estimation on standard benchmarks, and b) they help our ConvNets generalize properly, making them applicable in non-studio conditions or in cases with significant domain shift.

4.2. Implementation details

For the ConvNets that predict 2D keypoints and/or depths, we follow the hourglass design [31]. When the output is in coordinate form (Sections 3.1 and 3.2), we use one hourglass with a fully connected layer at the end, while when we have a volumetric target (Section 3.3), we use two hourglasses (unless stated otherwise). For comparisons with the state-of-the-art, we follow a mixed training strategy, combining images with 3D ground truth from the respective dataset (Human3.6M or HumanEva-I) with LSP+MPII Ordinal images. For the LSP+MPII Ordinal examples, the loss is computed based on the human annotations (weak supervision), while for the respective dataset examples, the loss is computed based on the known ground truth (full supervision). We train the network with a batch size of 4, a learning rate of 2.5e-4, and rmsprop for the optimization. Augmentation for rotation (±30°), scale, and flipping (left-right) is also used. The duration of the training depends on the size of the dataset (300k iterations for Human3.6M data only, 2.5M iterations for mixed Human3.6M and LSP+MPII Ordinal data, 1.5M iterations for mixed HumanEva-I and LSP+MPII Ordinal data). For the reconstruction component (Section 3.4), we follow the design of [25].
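Schematically, the mixed training strategy amounts to dispatching each example to the appropriate loss. In the sketch below, `model`, `loader`, `full_3d_loss`, and the batch fields are placeholders of our own; only the optimizer choice, learning rate, and batch size come from the text, and `weak_3d_loss` is the function sketched in Section 3.2.

```python
import torch

# Placeholders: `model`, `loader`, and `full_3d_loss` stand in for the
# hourglass network, the mixed Human3.6M / LSP+MPII Ordinal data loader,
# and the fully supervised loss.
optimizer = torch.optim.RMSprop(model.parameters(), lr=2.5e-4)

for batch in loader:                      # batch size 4, mixed data sources
    out = model(batch['image'])
    if batch['has_3d_gt']:                # Human3.6M / HumanEva-I examples
        loss = full_3d_loss(out, batch['pose_3d'])         # full supervision
    else:                                 # LSP+MPII Ordinal examples
        loss = weak_3d_loss(out.depths, out.keypoints,     # weak supervision
                            batch['keypoints_2d'],
                            batch['pairs'], batch['relations'])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```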
We train it with a batch size of 64, a learning rate of 2.5e-4, and rmsprop for the optimization; the training lasts for 200k iterations.

4.3. Ablative studies

Ordinal supervision: First, we examine the effect of using ordinal depth supervision versus employing the actual 3D ground truth for training. For this part, we focus on Human3.6M, which is a large-scale benchmark and provides 3D ground truth for the quantitative comparison. To define the ordinal depth relations, the depth values for each pair of joints are considered. If they differ by less than 100mm, the corresponding relation is set to r = 0 (similar depth). Otherwise, it is set to r = ±1, depending on which joint is closer. Since for this comparison we want to focus on the form of supervision, this is the only set of experiments that uses ordinal depth relations inferred from the 3D ground truth; for all remaining evaluations, the ordinal depth relations were provided by human annotators. Following the analysis of Section 3, we explore three different prediction schemes, i.e., depth prediction, coordinate regression, and volume regression. For each one of them, we compare a version where ordinal supervision is used versus employing the actual 3D ground truth for training.
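For reference, the 100mm rule above can be written as a short routine (our sketch, assuming depths in mm with smaller values closer to the camera):

```python
import numpy as np

def depths_to_relations(z_gt, thresh=100.0):
    """Derive ordinal relations r(i, j) from ground-truth depths (mm).

    Returns pairs (i, j) with r = 0 if |z_i - z_j| < thresh, otherwise
    r = +1 when joint i is closer and r = -1 when joint j is closer.
    """
    n = len(z_gt)
    pairs, rels = [], []
    for i in range(n):
        for j in range(i + 1, n):
            d = z_gt[i] - z_gt[j]
            r = 0 if abs(d) < thresh else (1 if d < 0 else -1)
            pairs.append((i, j))
            rels.append(r)
    return np.array(pairs), np.array(rels)
```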

The detailed results are presented in Table 1.

[Table 1: Effect of training with the actual 3D ground truth versus employing the weaker ordinal depth supervision on Human3.6M, for the depth prediction, coordinate regression, and volume regression (one and two hourglasses) architectures. The results are mean per joint errors (mm).]

Interestingly, in all cases, the weaker ordinal supervision signal is competitive and achieves results very close to the fully supervised baseline. The gap increases only when we employ more powerful architectures, i.e., the volume regression case with two hourglass components. In fact, in this case the average error is already very low (below 80mm), and one would expect that for even lower prediction errors, the highly accurate 3D ground truth would be necessary for training.

Improving 3D pose detectors: After the sanity check that ordinal supervision is competitive with training on the full 3D ground truth, we explore using ordinal depth annotations provided by humans to boost the performance of a standard ConvNet for 3D human pose [32]. As detailed in Section 4.2, we follow a mixed training strategy, leveraging Human3.6M images with 3D ground truth and LSP+MPII Ordinal images with our annotations. Data augmentation using natural images with 2D keypoint annotations is a standard practice [48, 26, 36, 64, 44], but here we also consider the effect of our ordinal depth supervision. Optionally, the reconstruction component can be used at the end of the network, helping with coherent 3D pose prediction. The detailed results of the ablative study are presented in Table 2.

Table 2: Ablative study on Human3.6M demonstrating the effect of incorporating additional data sources in the training procedure (2D keypoints and ordinal depth relations), as well as integrating a reconstruction component. The numbers are mean per joint errors (mm).

  Training data                      Avg error
  Human3.6M                          71.9
  Human3.6M + 2D keyp                66.6
  Human3.6M + 2D keyp + Ord          62.1
  Human3.6M + 2D keyp + Rec          59.1
  Human3.6M + 2D keyp + Ord + Rec    56.2

Unsurprisingly, using more training examples improves performance. The supervision with 2D keypoints is helpful (line 2), however the addition of our ordinal depth supervision provides novel information to the network and further improves the results (line 3). The refinement step using the reconstruction module (lines 4 and 5) is also beneficial and helps provide coherent 3D pose results. In fact, the last line corresponds to state-of-the-art results for this dataset, which we discuss in more detail in Section 4.4.

Robustness to domain shift: Besides boosting current state-of-the-art models, we ultimately aspire to use our ordinal supervision for better generalization of the trained models, so that they are applicable to in-the-wild images. To demonstrate this potential, we test our approach on the MPI-INF-3DHP dataset. This dataset is not considered exactly in-the-wild, but it has a significant domain shift compared to Human3.6M.
The complete results for this ablative experiment are presented in Table 3.

[Table 3: Ablative study on MPI-INF-3DHP demonstrating that supervision through our ordinal annotations is important for proper generalization. Results are reported using the PCK3D and AUC metrics.]

Interestingly, the model trained only on Human3.6M data (line 1) performs poorly because of heavy overfitting. Using additional in-the-wild images with 2D keypoints (line 2) is helpful, but from inspection of the results, the benefit comes mainly from better 2D pose estimates, while the depth prediction is generally mediocre. The best generalization comes after also incorporating the ordinal depth supervision (line 3), elevating the model to state-of-the-art results.

4.4. Comparison with the state-of-the-art

Human3.6M: We use for evaluation the same ConvNet as in the previous section, which follows a mixed training strategy and includes the reconstruction component. The detailed results in terms of mean per joint error and reconstruction error are presented in Tables 4 and 5 respectively.

[Table 4: Detailed results on Human3.6M [16] per action, compared with [49, 68, 14, 66, 10, 51, 40, 32, 60, 48, 64, 25]. Numbers are mean per joint errors (mm). The results of all approaches are obtained from the original papers. We outperform all other approaches across the table.]

[Table 5: Detailed results on Human3.6M [16] per action, compared with [1, 38, 67, 5, 28, 32, 25]. Numbers are reconstruction errors (mm). The results of all approaches are obtained from the original papers, except for [1, 38, 67], which were obtained from [5]. We outperform all other approaches across the table.]

Our complete approach achieves state-of-the-art results across all actions and metrics, with a relative error reduction over 10% on average. Since most other works (e.g., [51, 64, 48, 25]) also use in-the-wild images with 2D keypoints for supervision, most of the improvement for our approach comes from augmenting training with ordinal depth relations for these examples. In particular, the error decrease with respect to previous work is more significant for challenging actions like Sitting Down, Photo, or Sitting, with a lot of self-occlusions and rare poses. This benefit can be attributed to the greater variety of the LSP+MPII Ordinal images, not just in terms of appearance (this also benefits the other approaches), but mainly in terms of 3D poses, which are observed by our ConvNet in a weak 3D form.

HumanEva-I: The ConvNet architecture remains the same, with HumanEva-I and LSP+MPII Ordinal images used for mixed training. The reconstruction component is trained only on HumanEva-I MoCap. Our results are presented in Table 6 and show an important accuracy benefit over previous approaches.

[Table 6: Results on the HumanEva-I [42] dataset (Walking and Jogging sequences of subjects S1, S2, S3), compared with [37, 54, 43, 4, 20, 62, 28, 32, 25]. Numbers are reconstruction errors (mm). The results of all approaches are obtained from the original papers.]

On average, the relative error reduction is again over 10%, which is a solid improvement considering that the numbers for this dataset have mostly saturated.

MPI-INF-3DHP: For MPI-INF-3DHP, we report results using the same ConvNet we trained for Human3.6M, with Human3.6M and LSP+MPII Ordinal images. In Table 7 we compare with two recent baselines which are not trained on this dataset, and we outperform them, with a particularly large margin for the Outdoor sequence.

[Table 7: Detailed results on the test set of MPI-INF-3DHP [26] (Studio GS, Studio no GS, Outdoor; 3DPCK and AUC), compared with Mehta et al. [26] and Zhou et al. [64]. The results for all approaches are taken from the original papers. No training data from this dataset have been used for training by any method.]

4.5. Qualitative evaluation

In Figure 4 we have collected a sample of 3D pose outputs of our approach, focusing on MPI-INF-3DHP, since it is the main dataset that we evaluate without touching the training data. A richer collection of success and failure examples is included in the supplementary material.

[Figure 4: Typical qualitative results from MPI-INF-3DHP, from the original and a novel viewpoint.]

5. Summary

The goal of this paper was to present a solution for training end-to-end ConvNets for 3D human pose estimation in the absence of accurate 3D ground truth, by using a weaker supervision signal in the form of ordinal depth relations of the joints. We investigated the flexibility of these ordinal relations by incorporating them in recent ConvNet architectures for 3D human pose and demonstrated competitive performance with their fully supervised versions.
Furthermore, we extended the MPII and LSP datasets with ordinal depth annotations for the human joints, allowing us to present compelling results for non-studio conditions. Finally, these annotations were incorporated in the training procedure of recent ConvNets for 3D human pose, achieving state-of-the-art results on the standard benchmarks.

Project Page: pavlakos/projects/ordinal

Acknowledgements: We gratefully appreciate support through the following grants: NSF-IIP (I/UCRC), ARL RCTA W911NF, ONR N, DARPA FLA program and NSF/IUCRC.

References

[1] I. Akhter and M. J. Black. Pose-conditioned joint angle limits for 3D human pose reconstruction. In CVPR, 2015.
[2] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele. 2D human pose estimation: New benchmark and state of the art analysis. In CVPR, 2014.
[3] S. Bell, K. Bala, and N. Snavely. Intrinsic images in the wild. ACM Transactions on Graphics (TOG), 33(4):159, 2014.
[4] L. Bo and C. Sminchisescu. Twin Gaussian processes for structured prediction. IJCV, 87(1-2):28-52, 2010.
[5] F. Bogo, A. Kanazawa, C. Lassner, P. Gehler, J. Romero, and M. J. Black. Keep it SMPL: Automatic estimation of 3D human pose and shape from a single image. In ECCV, 2016.
[6] L. Bourdev and J. Malik. Poselets: Body part detectors trained using 3D human pose annotations. In ICCV, 2009.
[7] C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender. Learning to rank using gradient descent. In ICML, 2005.
[8] Z. Cao, T. Qin, T.-Y. Liu, M.-F. Tsai, and H. Li. Learning to rank: from pairwise approach to listwise approach. In ICML, 2007.
[9] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh. Realtime multi-person 2D pose estimation using part affinity fields. In CVPR, 2017.
[10] C.-H. Chen and D. Ramanan. 3D human pose estimation = 2D pose estimation + matching. In CVPR, 2017.
[11] L.-C. Chen, Y. Yang, J. Wang, W. Xu, and A. L. Yuille. Attention to scale: Scale-aware semantic image segmentation. In CVPR, 2016.
[12] W. Chen, Z. Fu, D. Yang, and J. Deng. Single-image depth perception in the wild. In NIPS, 2016.
[13] W. Chen, H. Wang, Y. Li, H. Su, Z. Wang, C. Tu, D. Lischinski, D. Cohen-Or, and B. Chen. Synthesizing training images for boosting human 3D pose estimation. In 3DV, 2016.
[14] Y. Du, Y. Wong, Y. Liu, F. Han, Y. Gui, Z. Wang, M. Kankanhalli, and W. Geng. Marker-less 3D human motion capture with monocular image sequence and height-maps. In ECCV, 2016.
[15] E. Insafutdinov, M. Andriluka, L. Pishchulin, S. Tang, B. Andres, and B. Schiele. Articulated multi-person tracking in the wild. In CVPR, 2017.
[16] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu. Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments. PAMI, 36(7), 2014.
[17] E. Jahangiri and A. L. Yuille. Generating multiple hypotheses for human 3D pose consistent with 2D joint detections. In ICCVW, 2017.
[18] S. Johnson and M. Everingham. Clustered pose and nonlinear appearance models for human pose estimation. In BMVC, 2010.
[19] V. Kazemi, M. Burenius, H. Azizpour, and J. Sullivan. Multi-view body part recognition with random forests. In BMVC, 2013.
[20] I. Kostrikov and J. Gall. Depth sweep regression forests for estimating 3D human pose from images. In BMVC, 2014.
[21] C. Lassner, J. Romero, M. Kiefel, F. Bogo, M. J. Black, and P. V. Gehler. Unite the people: Closing the loop between 3D and 2D human representations. In CVPR, 2017.
[22] S. Li and A. B. Chan. 3D human pose estimation from monocular images with deep convolutional neural network. In ACCV, 2014.
[23] S. Maji, L. Bourdev, and J. Malik. Action recognition from a distributed representation of pose and appearance. In CVPR, 2011.
[24] E. Marinoiu, D. Papava, and C. Sminchisescu. Pictorial human spaces: A computational study on the human perception of 3D articulated poses. IJCV, 119(2), 2016.
[25] J. Martinez, R. Hossain, J. Romero, and J. J. Little. A simple yet effective baseline for 3D human pose estimation. In ICCV, 2017.
[26] D. Mehta, H. Rhodin, D. Casas, O. Sotnychenko, W. Xu, and C. Theobalt. Monocular 3D human pose estimation in the wild using improved CNN supervision. In 3DV, 2017.
[27] D. Mehta, S. Sridhar, O. Sotnychenko, H. Rhodin, M. Shafiei, H.-P. Seidel, W. Xu, D. Casas, and C. Theobalt. VNect: Real-time 3D human pose estimation with a single RGB camera. ACM Transactions on Graphics, 36, 2017.
[28] F. Moreno-Noguer. 3D human pose estimation from a single image via distance matrix regression. In CVPR, 2017.
[29] T. Narihira, M. Maire, and S. X. Yu. Learning lightness from human judgement on relative reflectance. In CVPR, 2015.
[30] A. Newell, Z. Huang, and J. Deng. Associative embedding: End-to-end learning for joint detection and grouping. In NIPS, 2017.
[31] A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. In ECCV, 2016.
[32] G. Pavlakos, X. Zhou, K. G. Derpanis, and K. Daniilidis. Coarse-to-fine volumetric prediction for single-image 3D human pose. In CVPR, 2017.
[33] T. Pfister, J. Charles, and A. Zisserman. Flowing convnets for human pose estimation in videos. In ICCV, 2015.
[34] L. Pishchulin, E. Insafutdinov, S. Tang, B. Andres, M. Andriluka, P. V. Gehler, and B. Schiele. DeepCut: Joint subset partition and labeling for multi person pose estimation. In CVPR, 2016.
[35] G. Pons-Moll, D. J. Fleet, and B. Rosenhahn. Posebits for monocular human pose estimation. In CVPR, 2014.
[36] A.-I. Popa, M. Zanfir, and C. Sminchisescu. Deep multitask architecture for integrated 2D and 3D human sensing. In CVPR, 2017.
[37] I. Radwan, A. Dhall, and R. Goecke. Monocular image 3D human pose estimation under self-occlusion. In ICCV, 2013.
[38] V. Ramakrishna, T. Kanade, and Y. Sheikh. Reconstructing 3D human pose from 2D image landmarks. In ECCV, 2012.
[39] G. Rogez and C. Schmid. MoCap-guided data augmentation for 3D pose estimation in the wild. In NIPS, 2016.
[40] G. Rogez, P. Weinzaepfel, and C. Schmid. LCR-Net: Localization-classification-regression for human pose. In CVPR, 2017.
[41] N. Sarafianos, B. Boteanu, B. Ionescu, and I. A. Kakadiaris. 3D human pose estimation: A review of the literature and analysis of covariates. CVIU, 152:1-20, 2016.
[42] L. Sigal, A. O. Balan, and M. J. Black. HumanEva: Synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion. IJCV, 87(1-2):4-27, 2010.
[43] E. Simo-Serra, A. Quattoni, C. Torras, and F. Moreno-Noguer. A joint model for 2D and 3D pose estimation from a single image. In CVPR, 2013.
[44] X. Sun, J. Shang, S. Liang, and Y. Wei. Compositional human pose regression. In ICCV, 2017.
[45] C. J. Taylor. Reconstruction of articulated objects from point correspondences in a single uncalibrated image. In CVPR, 2000.
[46] M. Taylor, J. Guiver, S. Robertson, and T. Minka. SoftRank: optimizing non-smooth rank metrics. In WSDM, 2008.
[47] B. Tekin, I. Katircioglu, M. Salzmann, V. Lepetit, and P. Fua. Structured prediction of 3D human pose with deep neural networks. In BMVC, 2016.
[48] B. Tekin, P. Marquez-Neila, M. Salzmann, and P. Fua. Learning to fuse 2D and 3D image cues for monocular body pose estimation. In ICCV, 2017.
[49] B. Tekin, A. Rozantsev, V. Lepetit, and P. Fua. Direct prediction of 3D body poses from motion compensated sequences. In CVPR, 2016.
[50] J. T. Todd and J. F. Norman. The visual perception of 3-D shape from multiple cues: Are observers capable of perceiving metric structure? Perception & Psychophysics, 65(1):31-47, 2003.
[51] D. Tome, C. Russell, and L. Agapito. Lifting from the deep: Convolutional 3D pose estimation from a single image. In CVPR, 2017.
[52] J. J. Tompson, A. Jain, Y. LeCun, and C. Bregler. Joint training of a convolutional network and a graphical model for human pose estimation. In NIPS, 2014.
[53] G. Varol, J. Romero, X. Martin, N. Mahmood, M. Black, I. Laptev, and C. Schmid. Learning from synthetic humans. In CVPR, 2017.
[54] C. Wang, Y. Wang, Z. Lin, A. L. Yuille, and W. Gao. Robust estimation of 3D human poses from a single image. In CVPR, 2014.
[55] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh. Convolutional pose machines. In CVPR, 2016.
[56] J. Wu, T. Xue, J. J. Lim, Y. Tian, J. B. Tenenbaum, A. Torralba, and W. T. Freeman. Single image 3D interpreter network. In ECCV, 2016.
[57] F. Xia, P. Wang, L.-C. Chen, and A. L. Yuille. Zoom better to see clearer: Human and object parsing with hierarchical auto-zoom net. In ECCV, 2016.
[58] Y. Xiang, W. Kim, W. Chen, J. Ji, C. Choy, H. Su, R. Mottaghi, L. Guibas, and S. Savarese. ObjectNet3D: A large scale database for 3D object recognition. In ECCV, 2016.
[59] Y. Xiang, R. Mottaghi, and S. Savarese. Beyond PASCAL: A benchmark for 3D object detection in the wild. In WACV, 2014.
[60] B. Xiaohan Nie, P. Wei, and S.-C. Zhu. Monocular 3D human pose estimation by predicting depth on joints. In ICCV, 2017.
[61] W. Yang, S. Li, W. Ouyang, H. Li, and X. Wang. Learning feature pyramids for human pose estimation. In ICCV, 2017.
[62] H. Yasin, U. Iqbal, B. Krüger, A. Weber, and J. Gall. A dual-source approach for 3D pose estimation from a single image. In CVPR, 2016.
[63] T. Zhou, P. Krähenbühl, and A. A. Efros. Learning data-driven reflectance priors for intrinsic image decomposition. In ICCV, 2015.
[64] X. Zhou, Q. Huang, X. Sun, X. Xue, and Y. Wei. Towards 3D human pose estimation in the wild: A weakly-supervised approach. In ICCV, 2017.
[65] X. Zhou, S. Leonardos, X. Hu, and K. Daniilidis. 3D shape estimation from 2D landmarks: A convex relaxation approach. In CVPR, 2015.
[66] X. Zhou, X. Sun, W. Zhang, S. Liang, and Y. Wei. Deep kinematic pose regression. In ECCVW, 2016.
[67] X. Zhou, M. Zhu, S. Leonardos, and K. Daniilidis. Sparse representation for 3D shape estimation: A convex relaxation approach. PAMI, 2017.
[68] X. Zhou, M. Zhu, S. Leonardos, K. Derpanis, and K. Daniilidis. Sparseness meets deepness: 3D human pose estimation from monocular video. In CVPR, 2016.
[69] X. Zhou, M. Zhu, G. Pavlakos, S. Leonardos, K. G. Derpanis, and K. Daniilidis. MonoCap: Monocular human motion capture using a CNN coupled with a geometric prior. arXiv preprint, 2017.
[70] D. Zoran, P. Isola, D. Krishnan, and W. T. Freeman. Learning ordinal relationships for mid-level vision. In ICCV, 2015.


More information

Multi-scale Adaptive Structure Network for Human Pose Estimation from Color Images

Multi-scale Adaptive Structure Network for Human Pose Estimation from Color Images Multi-scale Adaptive Structure Network for Human Pose Estimation from Color Images Wenlin Zhuang 1, Cong Peng 2, Siyu Xia 1, and Yangang, Wang 1 1 School of Automation, Southeast University, Nanjing, China

More information

Multi-Scale Structure-Aware Network for Human Pose Estimation

Multi-Scale Structure-Aware Network for Human Pose Estimation Multi-Scale Structure-Aware Network for Human Pose Estimation Lipeng Ke 1, Ming-Ching Chang 2, Honggang Qi 1, and Siwei Lyu 2 1 University of Chinese Academy of Sciences, Beijing, China 2 University at

More information

arxiv: v2 [cs.cv] 21 Mar 2018

arxiv: v2 [cs.cv] 21 Mar 2018 2D/3D Pose Estimation and Action Recognition using Multitask Deep Learning Diogo C. Luvizon 1, David Picard 1,2, Hedi Tabia 1 1 ETIS UMR 8051, Paris Seine University, ENSEA, CNRS, F-95000, Cergy, France

More information

arxiv: v1 [cs.cv] 6 Oct 2017

arxiv: v1 [cs.cv] 6 Oct 2017 Human Pose Regression by Combining Indirect Part Detection and Contextual Information arxiv:1710.02322v1 [cs.cv] 6 Oct 2017 Diogo C. Luvizon Abstract In this paper, we propose an end-to-end trainable regression

More information

arxiv: v1 [cs.cv] 16 Nov 2015

arxiv: v1 [cs.cv] 16 Nov 2015 Coarse-to-fine Face Alignment with Multi-Scale Local Patch Regression Zhiao Huang hza@megvii.com Erjin Zhou zej@megvii.com Zhimin Cao czm@megvii.com arxiv:1511.04901v1 [cs.cv] 16 Nov 2015 Abstract Facial

More information

Thin-Slicing Network: A Deep Structured Model for Pose Estimation in Videos

Thin-Slicing Network: A Deep Structured Model for Pose Estimation in Videos Thin-Slicing Network: A Deep Structured Model for Pose Estimation in Videos Jie Song Limin Wang2 AIT Lab, ETH Zurich Luc Van Gool2 2 Otmar Hilliges Computer Vision Lab, ETH Zurich Abstract Deep ConvNets

More information

Multi-Scale Structure-Aware Network for Human Pose Estimation

Multi-Scale Structure-Aware Network for Human Pose Estimation Multi-Scale Structure-Aware Network for Human Pose Estimation Lipeng Ke 1, Ming-Ching Chang 2, Honggang Qi 1, Siwei Lyu 2 1 University of Chinese Academy of Sciences, Beijing, China 2 University at Albany,

More information

Pose Estimation on Depth Images with Convolutional Neural Network

Pose Estimation on Depth Images with Convolutional Neural Network Pose Estimation on Depth Images with Convolutional Neural Network Jingwei Huang Stanford University jingweih@stanford.edu David Altamar Stanford University daltamar@stanford.edu Abstract In this paper

More information

Combining PGMs and Discriminative Models for Upper Body Pose Detection

Combining PGMs and Discriminative Models for Upper Body Pose Detection Combining PGMs and Discriminative Models for Upper Body Pose Detection Gedas Bertasius May 30, 2014 1 Introduction In this project, I utilized probabilistic graphical models together with discriminative

More information

An Efficient Convolutional Network for Human Pose Estimation

An Efficient Convolutional Network for Human Pose Estimation RAFI, KOSTRIKOV, GALL, LEIBE: EFFICIENT CNN FOR HUMAN POSE ESTIMATION 1 An Efficient Convolutional Network for Human Pose Estimation Umer Rafi 1 rafi@vision.rwth-aachen.de Ilya Kostrikov 1 ilya.kostrikov@rwth-aachen.de

More information

Deep globally constrained MRFs for Human Pose Estimation

Deep globally constrained MRFs for Human Pose Estimation Deep globally constrained MRFs for Human Pose Estimation Ioannis Marras, Petar Palasek and Ioannis Patras School of Electronic Engineering and Computer Science, Queen Mary University of London, United

More information

A Cascaded Inception of Inception Network with Attention Modulated Feature Fusion for Human Pose Estimation

A Cascaded Inception of Inception Network with Attention Modulated Feature Fusion for Human Pose Estimation A Cascaded Inception of Inception Network with Attention Modulated Feature Fusion for Human Pose Estimation Submission ID: 2065 Abstract Accurate keypoint localization of human pose needs diversified features:

More information

arxiv: v1 [cs.cv] 29 Sep 2016

arxiv: v1 [cs.cv] 29 Sep 2016 arxiv:1609.09545v1 [cs.cv] 29 Sep 2016 Two-stage Convolutional Part Heatmap Regression for the 1st 3D Face Alignment in the Wild (3DFAW) Challenge Adrian Bulat and Georgios Tzimiropoulos Computer Vision

More information

arxiv: v1 [cs.cv] 17 May 2016

arxiv: v1 [cs.cv] 17 May 2016 arxiv:1605.05180v1 [cs.cv] 17 May 2016 TEKIN ET AL.: STRUCTURED PREDICTION OF 3D HUMAN POSE 1 Structured Prediction of 3D Human Pose with Deep Neural Networks Bugra Tekin 1 bugra.tekin@epfl.ch Isinsu Katircioglu

More information

Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-Scale Convolutional Architecture David Eigen, Rob Fergus

Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-Scale Convolutional Architecture David Eigen, Rob Fergus Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-Scale Convolutional Architecture David Eigen, Rob Fergus Presented by: Rex Ying and Charles Qi Input: A Single RGB Image Estimate

More information

arxiv: v1 [cs.cv] 15 Mar 2018

arxiv: v1 [cs.cv] 15 Mar 2018 arxiv:1803.05959v1 [cs.cv] 15 Mar 2018 Mo 2 Cap 2 : Real-time Mobile 3D Motion Capture with a Cap-mounted Fisheye Camera Weipeng Xu 1 Avishek Chatterjee 1 Michael Zollhoefer 1,2 Helge Rhodin 3 Pascal Fua

More information

Cross-View Person Identification by Matching Human Poses Estimated with Confidence on Each Body Joint

Cross-View Person Identification by Matching Human Poses Estimated with Confidence on Each Body Joint Cross-View Person Identification by Matching Human Poses Estimated with Confidence on Each Body Joint Guoqiang Liang 1,, Xuguang Lan 1, Kang Zheng, Song Wang, 3, Nanning Zheng 1 1 Institute of Artificial

More information

arxiv: v5 [cs.cv] 4 Oct 2017

arxiv: v5 [cs.cv] 4 Oct 2017 Monocular 3D Human Pose Estimation In The Wild Using Improved CNN Supervision Dushyant Mehta 1, Helge Rhodin 2, Dan Casas 3, Pascal Fua 2, Oleksandr Sotnychenko 1, Weipeng Xu 1, and Christian Theobalt

More information

arxiv: v2 [cs.cv] 8 Nov 2018

arxiv: v2 [cs.cv] 8 Nov 2018 3D Human Pose Estimation with 2D Marginal Heatmaps Aiden Nibali Zhen He Stuart Morgan Luke Prendergast La Trobe University, Australia anibali@students.latrobe.edu.au arxiv:1806.01484v2 [cs.cv] 8 Nov 2018

More information

Depth Sweep Regression Forests for Estimating 3D Human Pose from Images

Depth Sweep Regression Forests for Estimating 3D Human Pose from Images KOSTRIKOV, GALL: DEPTH SWEEP REGRESSION FORESTS 1 Depth Sweep Regression Forests for Estimating 3D Human Pose from Images Ilya Kostrikov ilya.kostrikov@rwth-aachen.de Juergen Gall gall@iai.uni-bonn.de

More information

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li Learning to Match Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li 1. Introduction The main tasks in many applications can be formalized as matching between heterogeneous objects, including search, recommendation,

More information

Proceedings of the International MultiConference of Engineers and Computer Scientists 2018 Vol I IMECS 2018, March 14-16, 2018, Hong Kong

Proceedings of the International MultiConference of Engineers and Computer Scientists 2018 Vol I IMECS 2018, March 14-16, 2018, Hong Kong , March 14-16, 2018, Hong Kong , March 14-16, 2018, Hong Kong , March 14-16, 2018, Hong Kong , March 14-16, 2018, Hong Kong TABLE I CLASSIFICATION ACCURACY OF DIFFERENT PRE-TRAINED MODELS ON THE TEST DATA

More information

Human Upper Body Pose Estimation in Static Images

Human Upper Body Pose Estimation in Static Images 1. Research Team Human Upper Body Pose Estimation in Static Images Project Leader: Graduate Students: Prof. Isaac Cohen, Computer Science Mun Wai Lee 2. Statement of Project Goals This goal of this project

More information

ECCV Presented by: Boris Ivanovic and Yolanda Wang CS 331B - November 16, 2016

ECCV Presented by: Boris Ivanovic and Yolanda Wang CS 331B - November 16, 2016 ECCV 2016 Presented by: Boris Ivanovic and Yolanda Wang CS 331B - November 16, 2016 Fundamental Question What is a good vector representation of an object? Something that can be easily predicted from 2D

More information

Recovering Accurate 3D Human Pose in The Wild Using IMUs and a Moving Camera

Recovering Accurate 3D Human Pose in The Wild Using IMUs and a Moving Camera Recovering Accurate 3D Human Pose in The Wild Using IMUs and a Moving Camera Timo von Marcard 1, Roberto Henschel 1, Michael J. Black 2, Bodo Rosenhahn 1, and Gerard Pons-Moll 3 1 Leibniz Universität Hannover,

More information

arxiv: v4 [cs.cv] 2 Sep 2016

arxiv: v4 [cs.cv] 2 Sep 2016 Direct Prediction of 3D Body Poses from Motion Compensated Sequences Bugra Tekin 1 Artem Rozantsev 1 Vincent Lepetit 1,2 Pascal Fua 1 1 CVLab, EPFL, Lausanne, Switzerland, {firstname.lastname}@epfl.ch

More information

Cost-alleviative Learning for Deep Convolutional Neural Network-based Facial Part Labeling

Cost-alleviative Learning for Deep Convolutional Neural Network-based Facial Part Labeling [DOI: 10.2197/ipsjtcva.7.99] Express Paper Cost-alleviative Learning for Deep Convolutional Neural Network-based Facial Part Labeling Takayoshi Yamashita 1,a) Takaya Nakamura 1 Hiroshi Fukui 1,b) Yuji

More information

Real-time Hand Tracking under Occlusion from an Egocentric RGB-D Sensor Supplemental Document

Real-time Hand Tracking under Occlusion from an Egocentric RGB-D Sensor Supplemental Document Real-time Hand Tracking under Occlusion from an Egocentric RGB-D Sensor Supplemental Document Franziska Mueller 1,2 Dushyant Mehta 1,2 Oleksandr Sotnychenko 1 Srinath Sridhar 1 Dan Casas 3 Christian Theobalt

More information

Deep Supervision with Shape Concepts for Occlusion-Aware 3D Object Parsing

Deep Supervision with Shape Concepts for Occlusion-Aware 3D Object Parsing Deep Supervision with Shape Concepts for Occlusion-Aware 3D Object Parsing Supplementary Material Introduction In this supplementary material, Section 2 details the 3D annotation for CAD models and real

More information

Pose estimation using a variety of techniques

Pose estimation using a variety of techniques Pose estimation using a variety of techniques Keegan Go Stanford University keegango@stanford.edu Abstract Vision is an integral part robotic systems a component that is needed for robots to interact robustly

More information

Content-Based Image Recovery

Content-Based Image Recovery Content-Based Image Recovery Hong-Yu Zhou and Jianxin Wu National Key Laboratory for Novel Software Technology Nanjing University, China zhouhy@lamda.nju.edu.cn wujx2001@nju.edu.cn Abstract. We propose

More information

Stereo Human Keypoint Estimation

Stereo Human Keypoint Estimation Stereo Human Keypoint Estimation Kyle Brown Stanford University Stanford Intelligent Systems Laboratory kjbrown7@stanford.edu Abstract The goal of this project is to accurately estimate human keypoint

More information

DeepIM: Deep Iterative Matching for 6D Pose Estimation - Supplementary Material

DeepIM: Deep Iterative Matching for 6D Pose Estimation - Supplementary Material DeepIM: Deep Iterative Matching for 6D Pose Estimation - Supplementary Material Yi Li 1, Gu Wang 1, Xiangyang Ji 1, Yu Xiang 2, and Dieter Fox 2 1 Tsinghua University, BNRist 2 University of Washington

More information

arxiv: v1 [cs.cv] 15 Jan 2019

arxiv: v1 [cs.cv] 15 Jan 2019 Feature Boosting Network For 3D Pose Estimation arxiv:1901.04877v1 [cs.cv] 15 Jan 2019 Jun Liu 1 jliu029@ntu.edu.sg Xudong Jiang 1 exdjiang@ntu.edu.sg Abstract Henghui Ding 1 ding0093@ntu.edu.sg In this

More information

LOCALIZATION OF HUMANS IN IMAGES USING CONVOLUTIONAL NETWORKS

LOCALIZATION OF HUMANS IN IMAGES USING CONVOLUTIONAL NETWORKS LOCALIZATION OF HUMANS IN IMAGES USING CONVOLUTIONAL NETWORKS JONATHAN TOMPSON ADVISOR: CHRISTOPH BREGLER OVERVIEW Thesis goal: State-of-the-art human pose recognition from depth (hand tracking) Domain

More information

RSRN: Rich Side-output Residual Network for Medial Axis Detection

RSRN: Rich Side-output Residual Network for Medial Axis Detection RSRN: Rich Side-output Residual Network for Medial Axis Detection Chang Liu, Wei Ke, Jianbin Jiao, and Qixiang Ye University of Chinese Academy of Sciences, Beijing, China {liuchang615, kewei11}@mails.ucas.ac.cn,

More information

A Cascaded Inception of Inception Network with Attention Modulated Feature Fusion for Human Pose Estimation

A Cascaded Inception of Inception Network with Attention Modulated Feature Fusion for Human Pose Estimation The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18) A Cascaded Inception of Inception Network with Attention Modulated Feature Fusion for Human Pose Estimation Wentao Liu, 1,2 Jie Chen,

More information

Learning and Inferring Depth from Monocular Images. Jiyan Pan April 1, 2009

Learning and Inferring Depth from Monocular Images. Jiyan Pan April 1, 2009 Learning and Inferring Depth from Monocular Images Jiyan Pan April 1, 2009 Traditional ways of inferring depth Binocular disparity Structure from motion Defocus Given a single monocular image, how to infer

More information

What are we trying to achieve? Why are we doing this? What do we learn from past history? What will we talk about today?

What are we trying to achieve? Why are we doing this? What do we learn from past history? What will we talk about today? Introduction What are we trying to achieve? Why are we doing this? What do we learn from past history? What will we talk about today? What are we trying to achieve? Example from Scott Satkin 3D interpretation

More information

Marker-less 3D Human Motion Capture with Monocular Image Sequence and Height-Maps

Marker-less 3D Human Motion Capture with Monocular Image Sequence and Height-Maps Marker-less 3D Human Motion Capture with Monocular Image Sequence and Height-Maps Yu Du 1, Yongkang Wong 2, Yonghao Liu 1, Feilin Han 1, Yilin Gui 1, Zhen Wang 1, Mohan Kankanhalli 2,3, Weidong Geng 1

More information

Deep Supervision with Shape Concepts for Occlusion-Aware 3D Object Parsing Supplementary Material

Deep Supervision with Shape Concepts for Occlusion-Aware 3D Object Parsing Supplementary Material Deep Supervision with Shape Concepts for Occlusion-Aware 3D Object Parsing Supplementary Material Chi Li, M. Zeeshan Zia 2, Quoc-Huy Tran 2, Xiang Yu 2, Gregory D. Hager, and Manmohan Chandraker 2 Johns

More information

Lecture 7: Semantic Segmentation

Lecture 7: Semantic Segmentation Semantic Segmentation CSED703R: Deep Learning for Visual Recognition (207F) Segmenting images based on its semantic notion Lecture 7: Semantic Segmentation Bohyung Han Computer Vision Lab. bhhanpostech.ac.kr

More information

Discrete Optimization of Ray Potentials for Semantic 3D Reconstruction

Discrete Optimization of Ray Potentials for Semantic 3D Reconstruction Discrete Optimization of Ray Potentials for Semantic 3D Reconstruction Marc Pollefeys Joined work with Nikolay Savinov, Christian Haene, Lubor Ladicky 2 Comparison to Volumetric Fusion Higher-order ray

More information

Single Image 3D Interpreter Network

Single Image 3D Interpreter Network Single Image 3D Interpreter Network Jiajun Wu* Josh Tenenbaum Tianfan Xue* Joseph Lim Antonio Torralba ECCV 2016 Yuandong Tian Bill Freeman (* equal contributions) What do we see from these images? Motivation

More information

Ensemble Convolutional Neural Networks for Pose Estimation

Ensemble Convolutional Neural Networks for Pose Estimation Ensemble Convolutional Neural Networks for Pose Estimation Yuki Kawana b, Norimichi Ukita a,, Jia-Bin Huang c, Ming-Hsuan Yang d a Toyota Technological Institute, Japan b Nara Institute of Science and

More information

arxiv: v4 [cs.cv] 2 Sep 2017

arxiv: v4 [cs.cv] 2 Sep 2017 RMPE: Regional Multi-Person Pose Estimation Hao-Shu Fang 1, Shuqin Xie 1, Yu-Wing Tai 2, Cewu Lu 1 1 Shanghai Jiao Tong University, China 2 Tencent YouTu fhaoshu@gmail.com qweasdshu@sjtu.edu.cn yuwingtai@tencent.com

More information

Multi-Person Pose Estimation with Local Joint-to-Person Associations

Multi-Person Pose Estimation with Local Joint-to-Person Associations Multi-Person Pose Estimation with Local Joint-to-Person Associations Umar Iqbal and Juergen Gall omputer Vision Group University of Bonn, Germany {uiqbal,gall}@iai.uni-bonn.de Abstract. Despite of the

More information

arxiv: v5 [cs.cv] 4 Feb 2018

arxiv: v5 [cs.cv] 4 Feb 2018 RMPE: Regional Multi-Person Pose Estimation Hao-Shu Fang 1, Shuqin Xie 1, Yu-Wing Tai 2, Cewu Lu 1 1 Shanghai Jiao Tong University, China 2 Tencent YouTu fhaoshu@gmail.com qweasdshu@sjtu.edu.cn yuwingtai@tencent.com

More information

Perceiving the 3D World from Images and Videos. Yu Xiang Postdoctoral Researcher University of Washington

Perceiving the 3D World from Images and Videos. Yu Xiang Postdoctoral Researcher University of Washington Perceiving the 3D World from Images and Videos Yu Xiang Postdoctoral Researcher University of Washington 1 2 Act in the 3D World Sensing & Understanding Acting Intelligent System 3D World 3 Understand

More information

Supplementary Material Estimating Correspondences of Deformable Objects In-the-wild

Supplementary Material Estimating Correspondences of Deformable Objects In-the-wild Supplementary Material Estimating Correspondences of Deformable Objects In-the-wild Yuxiang Zhou Epameinondas Antonakos Joan Alabort-i-Medina Anastasios Roussos Stefanos Zafeiriou, Department of Computing,

More information

arxiv: v1 [cs.cv] 31 Mar 2016

arxiv: v1 [cs.cv] 31 Mar 2016 Object Boundary Guided Semantic Segmentation Qin Huang, Chunyang Xia, Wenchao Zheng, Yuhang Song, Hao Xu and C.-C. Jay Kuo arxiv:1603.09742v1 [cs.cv] 31 Mar 2016 University of Southern California Abstract.

More information

Combining Local Appearance and Holistic View: Dual-Source Deep Neural Networks for Human Pose Estimation

Combining Local Appearance and Holistic View: Dual-Source Deep Neural Networks for Human Pose Estimation Combining Local Appearance and Holistic View: Dual-Source Deep Neural Networks for Human Pose Estimation Xiaochuan Fan, Kang Zheng, Yuewei Lin, Song Wang Department of Computer Science & Engineering, University

More information

FACIAL POINT DETECTION BASED ON A CONVOLUTIONAL NEURAL NETWORK WITH OPTIMAL MINI-BATCH PROCEDURE. Chubu University 1200, Matsumoto-cho, Kasugai, AICHI

FACIAL POINT DETECTION BASED ON A CONVOLUTIONAL NEURAL NETWORK WITH OPTIMAL MINI-BATCH PROCEDURE. Chubu University 1200, Matsumoto-cho, Kasugai, AICHI FACIAL POINT DETECTION BASED ON A CONVOLUTIONAL NEURAL NETWORK WITH OPTIMAL MINI-BATCH PROCEDURE Masatoshi Kimura Takayoshi Yamashita Yu Yamauchi Hironobu Fuyoshi* Chubu University 1200, Matsumoto-cho,

More information

Learning to generate 3D shapes

Learning to generate 3D shapes Learning to generate 3D shapes Subhransu Maji College of Information and Computer Sciences University of Massachusetts, Amherst http://people.cs.umass.edu/smaji August 10, 2018 @ Caltech Creating 3D shapes

More information

Strong Appearance and Expressive Spatial Models for Human Pose Estimation

Strong Appearance and Expressive Spatial Models for Human Pose Estimation 2013 IEEE International Conference on Computer Vision Strong Appearance and Expressive Spatial Models for Human Pose Estimation Leonid Pishchulin 1 Mykhaylo Andriluka 1 Peter Gehler 2 Bernt Schiele 1 1

More information

Structured Prediction using Convolutional Neural Networks

Structured Prediction using Convolutional Neural Networks Overview Structured Prediction using Convolutional Neural Networks Bohyung Han bhhan@postech.ac.kr Computer Vision Lab. Convolutional Neural Networks (CNNs) Structured predictions for low level computer

More information

arxiv: v1 [cs.cv] 29 Nov 2017

arxiv: v1 [cs.cv] 29 Nov 2017 DeepSkeleton: Skeleton Map for 3D Human Pose Regression Qingfu Wan Fudan University qfwan13@fudan.edu.cn Wei Zhang Fudan University weizh@fudan.edu.cn Xiangyang Xue Fudan University xyxue@fudan.edu.cn

More information

3D model classification using convolutional neural network

3D model classification using convolutional neural network 3D model classification using convolutional neural network JunYoung Gwak Stanford jgwak@cs.stanford.edu Abstract Our goal is to classify 3D models directly using convolutional neural network. Most of existing

More information

A Joint Model for 2D and 3D Pose Estimation from a Single Image

A Joint Model for 2D and 3D Pose Estimation from a Single Image 213 IEEE Conference on Computer Vision and Pattern Recognition A Joint Model for 2D and 3D Pose Estimation from a Single Image E. Simo-Serra 1, A. Quattoni 2, C. Torras 1, F. Moreno-Noguer 1 1 Institut

More information