Rapid Skin: Estimating the 3D Human Pose and Shape in Real-Time


Matthias Straka, Stefan Hauswiesner, Matthias Rüther, and Horst Bischof
Institute for Computer Graphics and Vision, Graz University of Technology, Austria
{straka,hauswiesner,ruether,bischof}@icg.tugraz.at

Abstract: We present a novel approach to adapt a watertight polygonal model of the human body to multiple synchronized camera views. While previous approaches yield excellent quality for this task, they require processing times of several seconds, especially for high-resolution meshes. Our approach delivers high-quality results at interactive rates when a roughly initialized pose and a generic articulated body model are available. The key novelty of our approach is a Gauss-Seidel type solver that iteratively satisfies nonlinear constraints which deform the surface of the model according to silhouette images. We evaluate both the visual quality and the accuracy of the adapted body shape on multiple test persons. While maintaining a reconstruction quality similar to previous approaches, our algorithm reduces processing times by a factor of 20. It thus becomes possible to use a simple human model to represent the body shape of moving people in interactive applications.

Keywords: human body; multi-view geometry; silhouette; Laplacian mesh adaption; real-time

I. INTRODUCTION

Marker-less human pose and body shape estimation from images has numerous applications in video games, virtual try-ons, augmented reality and motion capture for the entertainment industry. Recent advances in real-time human pose estimation make it possible to create interactive environments in which the only controller is the body of the user [1]. However, an estimate of the human pose alone is sometimes not sufficient. For example, displaying a realistic user-controlled avatar that mimics not only the pose of the user but also his appearance requires a full representation of the body surface.
Such an avatar is an important component in augmented reality applications such as a virtual mirror [2]. The main challenges of capturing the body shape lie in the articulation of the human body and in the variation of size, age and visual appearance between different persons. For static objects it is fairly easy to generate a realistic and accurate model in real-time, even with a single, moving camera [3]. However, people change their pose continuously in an interactive scenario. This requires pose estimation and shape adaption for every single frame. Several authors have tackled the task of body shape adaption by recording images from a multi-view camera setup and deforming a human body mesh such that it is consistent with the background-subtracted silhouette in each view [4]-[9]. While most approaches yield convincing results, two common limitations remain: a previously scanned model of the actor is required, and processing times in the order of several seconds per frame have to be expected.

Figure 1. Our approach estimates the shape of the human body in real-time. We take a generic template mesh (a), correct its pose and size (b) and deform it according to multi-view silhouettes to obtain an accurate model (c). Projective texturing is used for realistic rendering (d).

In this paper, we present a novel approach that allows adapting a generic model of the human body to multi-view images (see Fig. 1). We improve over existing methods by introducing a constraint-based mesh deformation and propose a real-time capable solver based on Gauss-Seidel iterations. We start with a polygonal mesh of the human body and create nonlinear constraints that align vertices with image features but keep the overall mesh smooth. The key to real-time operation is to process each constraint individually, which allows for fast and stable estimation of the three-dimensional shape of the human body, making interactive applications feasible.
The main contributions of this paper are as follows. We derive constraints that deform a mesh such that it becomes consistent with multi-view silhouette contours, and we propose an automatic constraint weighting scheme. Our approach enables performing this deformation in real-time even for large meshes, which has not been possible before. We adapt the size of the mesh at runtime by changing the length of skeleton bones, which allows us to represent a wide range of people of different age, gender and size using only a single template mesh. Finally, we demonstrate a method to transform multi-view silhouette data into depth maps, which allows using real-time pose estimation methods such as [1] directly.

Section II reviews existing work in the field of human

body shape adaption. In Section III, we present our shape adaption algorithm, consisting of constraints and a real-time capable solver. In Section IV, we show how to use our algorithm for full human bodies in an interactive scenario. We evaluate our algorithm on several recorded sequences and provide quantitative measurements of both accuracy and speed in Section V. Finally, Section VI concludes the paper and gives an outlook on future work.

II. RELATED WORK

The idea of deforming a polygonal mesh such that it is consistent with images of the body silhouette is not new. In the current literature, several approaches make use of the Laplacian Mesh Editing (LME) framework [10]. The basic idea is to represent each vertex by its delta coordinates, defined as the difference between the vertex position and the weighted sum of the positions of its neighboring vertices. Deformation of a mesh is then expressed as a sparse system of linear equations, which allows modifying the position of selected vertices while the delta coordinates enforce a smooth deformation of the mesh. The LME framework is used by Gall et al. [8] and Vlasic et al. [9], who transform a model of the human body to align its pose with the recorded person, and then align vertices of the model with silhouette contours in multi-view camera images. Aguiar et al. [4] propose a similar method for mesh deformation, but omit the explicit pose estimation step. Instead, they track the mesh over multiple frames based on silhouette and texture correspondences. While the previously mentioned approaches ignore the skeletal structure during surface adaption, the method of [5] jointly optimizes for bones and surface. Most LME-based approaches use global least-squares optimization. This prohibits real-time operation since solving the linear system can be slow for reasonably sized meshes.
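For concreteness, the LME pipeline sketched above (delta coordinates plus a sparse least-squares solve with a few control vertices) might look as follows on a toy mesh. This is an illustrative sketch only: it uses uniform rather than cotangent weights, and all names are invented here, not taken from any of the cited implementations.

```python
# Illustrative LME sketch: delta coordinates with uniform weights on a
# 4-vertex chain, then a least-squares solve with two soft control vertices.
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def uniform_laplacian(n_verts, edges):
    """L = I - W, where W averages over the 1-ring neighbors."""
    pairs = edges + [(j, i) for i, j in edges]
    rows, cols = zip(*pairs)
    adj = sp.coo_matrix((np.ones(len(pairs)), (rows, cols)),
                        shape=(n_verts, n_verts)).tocsr()
    deg = np.asarray(adj.sum(axis=1)).ravel()
    W = sp.diags(1.0 / deg) @ adj
    return sp.identity(n_verts) - W

# Toy mesh: a path of 4 vertices along the x-axis.
V = np.array([[0.0, 0, 0], [1, 0, 0], [2, 0, 0], [3, 0, 0]])
edges = [(0, 1), (1, 2), (2, 3)]
L = uniform_laplacian(4, edges)
delta = L @ V                       # delta coordinates of the rest shape

# Soft constraints on vertices 0 and 3 (the "control vertices"):
# minimize ||L V' - delta||^2 + ||V'[{0,3}] - targets||^2.
C = sp.coo_matrix(([1.0, 1.0], ([0, 1], [0, 3])), shape=(2, 4))
targets = np.array([[0.0, 0, 0], [3, 1, 0]])   # lift the last vertex
A = sp.vstack([L, C]).tocsr()
b = np.vstack([delta, targets])

# Solve the normal equations column by column (x, y, z).
AtA = (A.T @ A).tocsc()
Atb = A.T @ b
V_def = np.column_stack([spla.spsolve(AtA, Atb[:, k]) for k in range(3)])
```

Pulling one control vertex upward bends the whole chain smoothly, because the delta-coordinate rows penalize deviations from the rest-shape curvature.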
Hofmann and Gavrila [11] present an automatic pose and shape estimation method that not only adapts a mesh to a single frame, but optimizes over a series of frames in order to obtain a stable body shape. A large database of human body scans makes it possible to build a statistical body model that guides the shape deformation based on silhouette data [12], laser scans [13] or even single depth images [14]. Bottom-up methods create a new mesh from merged depth maps [15] or point clouds [16] and therefore do not require any previously known body scan. Straka et al. [2] present a method to capture a moving 3D human body without an explicit model. They use image-based rendering to create an interactive virtual mirror image of the user from multiple real cameras. However, it is not possible to obtain an explicit body shape with this method.

None of the previously mentioned approaches is able to estimate the pose and shape of the human body at interactive frame rates. Recently, it was shown how to perform pose estimation in real-time [1], but mesh deformation still requires several seconds. Our method is closely related to [8] and [9] as we follow their two-stage approach with separate pose estimation and shape deformation. The major difference to previous methods is the solver used for optimizing the deformed shape. Our method is inspired by position-based physics simulations [17], which are able to compute realistic interactions between soft bodies in real-time. The key to real-time operation is to apply decoupled constraints to individual vertices of a deformable mesh and to optimize for a stable shape using an iterative method. We show that this decoupled optimization is suitable for mesh deformation guided by image-space correspondences such that the final mesh resembles the content of the input images.
III. REAL-TIME 3D SHAPE ESTIMATION

In this section, we present our novel approach for real-time estimation of the shape of an object, which is represented by its silhouette in multiple images. The main idea is to iteratively deform a template mesh consisting of vertices and faces such that the projection of the mesh into the source images is identical to the silhouette of the object. For now, we assume that an initial mesh with the same topology is available in roughly the same pose as the object inside a calibrated multi-camera system. In Section IV, we show how to quickly initialize an articulated human body mesh such that it fulfills these requirements.

A. Constraint-based Mesh Deformation

We consider the problem of deforming a polygonal mesh M = {V, N, F}, consisting of vertices V = {v_i ∈ R^3 | i = 1 ... |V|}, vertex normals N = {n_i ∈ R^3 | i = 1 ... |V|} and triangular faces F, such that all vertices satisfy a set of constraints

    C_j(V | Φ_j) = 0,    1 ≤ j ≤ M.    (1)

Each constraint is a function C_j : R^{3|V|} → R with a set of parameters Φ_j that encodes a relationship between selected vertices and other vertices of M or the scene. For example, a constraint can be responsible for aligning the mesh with image data. We use the parameters Φ_j to store constraint properties such as the initial curvature or correspondences. Usually, these parameters are initialized before optimization. The vertex positions of the deformed mesh are obtained by minimizing over all constraints:

    Ṽ = argmin_V Σ_{j=1}^{M} k_j ||C_j(V | Φ_j)||    (2)

where k_j ∈ [0, 1] is a weighting term and ||·|| denotes the length of a vector. Note that such constraints need not be linear but only differentiable. Inspired by the Gauss-Seidel algorithm for linear systems of equations [18], we do not minimize (2) as a whole.

Figure 2. Silhouette constraints pull rim-vertices towards the silhouette contour in every camera image.

Instead, we break it down into individual constraints and project each C_j onto the vertices independently. We use a first-order Taylor series expansion to find a position-correction term ΔV such that

    C_j(V + ΔV) ≈ C_j(V) + ∇_V C_j(V) · ΔV = 0    (3)

where ∇_V C_j denotes the gradient of constraint j. Solving for ΔV yields the step for the iterative minimization

    ΔV = − C_j(V) / ||∇_V C_j(V)||^2 · ∇_V C_j(V)    (4)

which is similar to the standard Newton-Raphson method. We use (4) to perform a weighted correction V ← V + k_j ΔV of the current vertex positions for every constraint C_j. Analogous to the Gauss-Seidel algorithm, we use updated values of V for subsequent calculations as soon as they are available. This requires less memory and allows the solution to converge faster while keeping the time complexity linear in the number of constraints. By iterating the constraint projection multiple times, we allow the effect of the constraints to propagate along the surface of the mesh until all vertices of the deformed mesh reach a stable position. A similar strategy can be found in real-time physics simulation, where internal and external forces of simulated objects are integrated using iterative constraint projection [17].

B. Constraints

The presented algorithm is capable of handling nonlinear constraints of any type. We propose two specific types of constraints for the task of template-based shape estimation. First, silhouette constraints C^sil align rim vertices of the template mesh with silhouette contours in the images. The second type, C^sm, is a smoothness constraint that acts as a regularization term. This allows (2) to be rewritten as

    Ṽ = argmin_V [ Σ_{j=1}^{M_sil} k^sil_j ||C^sil_j(V)|| + k^sm Σ_{i=1}^{|V|} ||C^sm_i(V)|| ]    (5)

with two distinct sets of constraints.
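The projection step of Eq. (4) combined with Gauss-Seidel sweeps can be sketched on a toy 2-D problem. Everything below (the ray constraint, the helper names, the iteration count) is invented for illustration; the actual system uses the silhouette and smoothness constraints described next.

```python
# Illustrative Gauss-Seidel constraint projection (Eq. 4):
# each constraint is linearized and projected onto the vertices it touches,
# reusing updated positions immediately.
import numpy as np

def project(C_fn, grad_fn, V, k=1.0):
    """One projection of Eq. (4): V <- V - k * C / ||grad C||^2 * grad C."""
    c = C_fn(V)
    g = grad_fn(V)                          # {vertex index: gradient vector}
    denom = sum(float(gi @ gi) for gi in g.values())
    if denom > 1e-12:
        for i, gi in g.items():
            V[i] += -k * c / denom * gi     # in-place update (Gauss-Seidel)

# Toy problem: a 3-vertex chain; a "silhouette" constraint pulls the middle
# vertex onto the ray y = x, while unit edge-length constraints play the
# role of the smoothness regularizer.
V = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
d = np.array([1.0, 1.0]) / np.sqrt(2.0)     # ray direction through the origin

def C_ray(V):  return d[0] * V[1][1] - d[1] * V[1][0]  # signed point-line distance
def g_ray(V):  return {1: np.array([-d[1], d[0]])}

def edge(i, j, rest):
    def C(V):  return float(np.linalg.norm(V[i] - V[j])) - rest
    def g(V):
        n = (V[i] - V[j]) / np.linalg.norm(V[i] - V[j])
        return {i: n, j: -n}
    return C, g

constraints = [(C_ray, g_ray), edge(0, 1, 1.0), edge(1, 2, 1.0)]

for _ in range(200):                        # inner iterations (N_i)
    for C_fn, g_fn in constraints:          # one Gauss-Seidel sweep
        project(C_fn, g_fn, V)
```

After a few hundred sweeps the middle vertex sits on the ray while both edges keep their rest length, even though each projection only ever looks at one constraint at a time.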
We now describe these constraints in detail and show how to choose the weights k^sil_j automatically.

Figure 3. Calculation of delta coordinates using the 1-ring of neighboring vertices.

Silhouette Consistency: In order to achieve silhouette consistency, we apply a method related to [4], [8], [9] to align rim vertices of the mesh with the silhouette contour. Rim vertices lie on the contour of the mesh when projected onto a camera image I_c. In order to find rim vertices, we project the vertices v_i of mesh M into all camera views using the corresponding 3×4 projection matrices P_c = K_c [R_c | t_c] and rotate the corresponding vertex normals n_i onto the image plane using the rotation matrix R_c ∈ R^{3×3}:

    v^c_i = ( P^(1)_c [v_i; 1] / P^(3)_c [v_i; 1],  P^(2)_c [v_i; 1] / P^(3)_c [v_i; 1] )^T,    n^c_i = R_c n_i    (6)

where P^(r)_c denotes the r-th row of the projection matrix. We calculate the vertex normals n_i as the normalized mean of the face normals adjacent to the vertex v_i. A rim vertex in image I_c is a vertex with a normal almost parallel to the image plane of camera c. For such vertices, we sample pixels from I_c along the 2D line l(t) = v^c_i + t · n^c_i(1:2) for intersections with the silhouette contour, where −τ ≤ t ≤ τ defines the search region in pixels. Note that it is important that only intersections with a contour gradient similar to the normal direction n^c_i are considered a match p^c_i ∈ R^2. Simply matching the closest contour pixel, as in [4], can lead to false matches, especially if the initialization of mesh M is inaccurate. Each successfully matched rim-vertex/contour pair (v^c_i, p^c_i) yields a 2D correspondence in image space. We translate this correspondence into a constraint C^sil_j which enforces that vertex v_i is pulled towards the viewing ray R_j, the 3D line from the projection center of camera c through the contour pixel p^c_i:

    C^sil_j(V | R_j, i) = d_pl(R_j, v_i) = 0    (7)

where d_pl denotes the shortest Euclidean distance between a point and a line in 3D. In Fig.
2, we visualize the effect of the silhouette constraints.

Mesh Smoothing: Smoothness constraints are based on the delta coordinates δ_i ∈ R^3, which are calculated as

    δ_i = Σ_{j ∈ N(i)} w_ij (v_i − v_j)    (8)

where N(i) denotes the 1-ring of neighboring vertices of v_i (see Fig. 3). Each weight w_ij is calculated using the cotangent weighting scheme [10] with Σ_j w_ij = 1 for all i. For

each vertex v_i, we define a smoothness constraint C^sm_i that ensures that the delta coordinate δ_i of vertex v_i stays close to its initial value, which is computed from the undeformed mesh M using (8):

    C^sm_i(V | δ_i) = || Σ_{j ∈ N(i)} w_ij (v_i − v_j) − δ_i || = 0.    (9)

Automatic Constraint Weighting: Each silhouette constraint C^sil_j is weighted by a scalar k^sil_j. We propose a weighting scheme that takes the quality of the silhouette contour matches into consideration and adapts the influence of the constraints automatically. When a vertex is far away from a silhouette contour, there is a large uncertainty about which contour pixel it should correspond to. In this case, we put more trust in the smoothness term. In contrast, when the distance between a projected vertex and the silhouette contour is small, we consider this a good match and keep the vertex close to the corresponding viewing ray R_j. We encode this uncertainty into the silhouette constraint weights by applying an unnormalized Gaussian kernel to the initial Euclidean pixel distance between the projected vertex v^c_i and the matched contour pixel p^c_i of C^sil_j:

    k^sil_j = exp( −||v^c_i − p^c_i||^2 / (2 α^2) ).    (10)

Good matches therefore give the corresponding constraint C^sil_j(V) a weight close to 1, while an increasing distance leads to smaller weights (α > 0 controls the width of the Gaussian lobe). All smoothness constraints C^sm_i(V) are equally weighted with k^sm = 1 throughout this paper. We perform multiple iterations of finding correspondences and deforming the mesh according to the resulting constraints. With the proposed weighting scheme, rim vertices that are already close to the silhouette contour are kept close to their optimal position. Distant matches are initially affected more by the smoothness constraints, but eventually gain higher weights as they become aligned with the silhouette contour during optimization.
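Eq. (10) is cheap to sketch. The helper name below is invented, and α = 40 pixels is only an assumed value for the example:

```python
# Illustrative sketch of the automatic weighting of Eq. (10).
import numpy as np

def silhouette_weights(v_proj, p_match, alpha=40.0):
    """k_j = exp(-||v - p||^2 / (2 alpha^2)), one weight per match."""
    d2 = np.sum((np.asarray(v_proj) - np.asarray(p_match)) ** 2, axis=1)
    return np.exp(-d2 / (2.0 * alpha ** 2))

# A close match keeps a weight near 1; a distant one is down-weighted
# so that the smoothness term dominates.
v = np.array([[100.0, 100.0], [100.0, 100.0]])  # projected rim vertices
p = np.array([[102.0, 100.0], [180.0, 100.0]])  # matched contour pixels
k = silhouette_weights(v, p)
```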
C. The Iterative Solver

In Laplacian Mesh Editing (LME), (8) is used as a regularization term and a few selected control vertices guide the shape deformation. Even for a large number of vertices, the deformed mesh can be computed efficiently when the set of control vertices does not change. In this case, the solution of the linear system of equations can be precomputed once via a Cholesky decomposition, and a deformed mesh can then be obtained through simple back-substitution whenever the positions of the control vertices change [10]. However, the set of control vertices changes continuously when a mesh is deformed using iteratively updated image correspondences. Thus, no pre-computations are possible and the optimization has to be performed from scratch every time. In contrast to LME, there are hundreds of control vertices in shape deformation, and they are often applied to neighboring vertices. In addition, it is usually possible to obtain initial vertex positions close to the optimal deformation when adapting a mesh to images. Therefore, we argue that an iterative solver is suitable for optimizing (5). By using nonlinear constraints and an update step weighting similar to [17], we achieve high-quality deformation results. In Section V, we show that our solver requires fewer iterations than the iterative Conjugate Gradient method [18] with linear constraints only.

Algorithm 1 Constraint projection algorithm.
Require: V = {v_1 ... v_|V|}
 1: {Φ_1 ... Φ_M} ← initialize(V)
 2: for number of outer iterations N_o do
 3:   {Φ_1 ... Φ_M, k_1 ... k_M} ← update(V, Φ_1 ... Φ_M)
 4:   for number of inner iterations N_i do
 5:     for j = 1 ... M do
 6:       V ← V − k_j · C_j(V | Φ_j) / ||∇_V C_j(V | Φ_j)||^2 · ∇_V C_j(V | Φ_j)
 7:     end for
 8:   end for
 9: end for

Our iterative solver for initializing and updating the constraint parameters Φ_1 ... Φ_M and projecting the constraints C_1 ... C_M is outlined in Algorithm 1. In Line 1, we set up all constraints using the initial vertex position estimates (i.e. we calculate δ_i).
The solver contains two loops: the outer loop (Line 2) is entered N_o times and controls how often the constraint parameters are updated (i.e. the matching of rim vertices with the silhouette contour), while the inner loop (Line 4) projects the constraints. Since the constraints are projected independently of each other, the number of inner iterations N_i influences how far the effect of each constraint can propagate along the surface of the mesh. We do not multiply the correction steps by k_j directly, but use a modified weight k'_j = 1 − (1 − k_j)^{1/N_i}, which makes the overall effect of projecting a constraint depend linearly on N_i [17]. The constraint projection in Line 6 prohibits parallelization because each calculation depends on the updated vertex positions of the previous projection. When a parallel processing architecture such as a GPU is available, it is possible to calculate the update step ΔV from the same vertex positions V for all constraints in parallel. However, the number of inner iterations N_i then needs to be increased since the convergence rate is slower than with the Gauss-Seidel type solver. Vertex positions can be updated in parallel as well, but it has to be ensured that a vertex is not updated by multiple constraints at the same time.

IV. ESTIMATING THE HUMAN BODY SHAPE

One application of shape estimation is to deform a template mesh such that it fits the shape of a human body

recorded by a synchronized multi-camera system. In this section, we show how to initialize a generic model such that we can apply our constraints and solver. We first estimate the 3D pose of the human body from multi-view camera images. Then, we transform the mesh such that it has roughly the correct body dimensions and posture. Finally, we deform the mesh until it best fits the image data.

Figure 4. (a) Aligning limbs using local rotation and length transformations. (b) The SCAPE mesh [19] in its default pose and its skeleton.

A. Pose Estimation

The availability of an affordable depth sensor (the Kinect) has led to major improvements in real-time pose estimation. Shotton et al. [1] show how to cast pose estimation as a depth-map labeling problem which can be solved efficiently in real-time using randomized decision forests. The output of such an algorithm is a set of joint positions g_k ∈ R^3 belonging to a skeleton with K joints. Our algorithm can be initialized from such joint positions. For each joint k, we determine a homogeneous transformation matrix T_k ∈ R^{4×4} that allows transforming our template mesh such that it has a pose similar to the user. We calculate T_k directly from g_k as a global transformation T^G and local limb transformations T^L_k:

    T_k = T^G Π_{j=1}^{|c_k|} T^L_{c_k(j)}    (11)

where c_k is the mapping that represents the order of joints along the kinematic chain from the root node to joint k. The global transformation aligns the upper body of the skeleton by means of rotation, scale and translation. Each local limb transformation T^L_k rotates and scales the bone between joint k and its parent joint such that it is aligned with g_k. Fig. 4a illustrates this alignment process, which automatically adapts the template skeleton to the actual size of the body.

B. The Articulated Body Model

Template-based shape estimation requires a mesh M_0 of the human body.
To handle arbitrary poses, the model must support deformation by an underlying articulated skeleton. In this work, we use the static SCAPE mesh model [19] in its default pose, as shown in Fig. 4b. Any other watertight mesh is suitable for this purpose as well. The skeleton with K joints is embedded into the mesh, and linear skinning weights ρ_{i,k}, which link each vertex to one or more joints, are calculated using a rigging algorithm [20]. Linear blend skinning is used to transform the mesh M_0 into the mesh M with the current pose of the user:

    v_i = Σ_{k=1}^{K} ρ_{i,k} T_k [v^0_i; 1]    (12)

where the vertex positions v_i are obtained as a linear combination of the template vertex positions v^0_i transformed by the weighted joint transformations T_k.

C. Shape Estimation

We use the transformed mesh M for the initialization of both the vertex positions and the constraints in our shape estimation method. We no longer consider the underlying bone structure when deforming the mesh, since we have often observed a non-negligible offset between the real joint position and the estimate given by the skeleton tracker. Usually, our shape estimation algorithm corrects such offsets without visible artifacts.

V. EXPERIMENTS

We evaluate our approach on multiple video sequences of moving persons, either recorded with our own multi-camera setup or simulated by rendering artificial data. Besides visual quality, we evaluate our algorithm in terms of reconstruction quality and run-time, and compare it to related approaches. Specifically, we compare the mesh adapted with our method to the output of related methods based on linear Laplacian Mesh Editing (LME), such as [8], [9]. We set up the linear systems of equations for LME mesh deformation using the same template mesh and the same rim-vertex/contour correspondences as for our approach. As suggested in [10], we solve for the optimal vertex positions in a least-squares sense using a sparse Cholesky decomposition.
In addition, we compare our method to the iterative conjugate gradient algorithm [18], which is an alternative way of solving least-squares linear systems.

A. Experimental Setup

Our recording hardware consists of a studio environment with ten synchronized cameras connected to a single computer [2]. Each camera delivers a color image with 640×480 pixels at 15 frames per second. Silhouettes of the user are obtained through color-based background segmentation. Based on the image resolution, we set the search region for silhouette contour matches to τ = 30 pixels and use α = 40 pixels for the calculation of the weights k^sil_j.
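With τ fixed as above, the 1-D contour search of Sec. III-B might be sketched as follows. This is a simplified, illustrative version that only looks for a binary foreground/background transition and omits the gradient-direction check the paper requires; all names are invented.

```python
# Illustrative 1-D search for a silhouette contour crossing along the
# projected normal direction of a rim vertex.
import numpy as np

def find_contour_match(sil, v, n, tau=30):
    """sil: binary HxW image; v: projected vertex (x, y); n: 2-D normal."""
    n = n / np.linalg.norm(n)
    prev = None
    for t in range(-tau, tau + 1):          # search region -tau <= t <= tau
        x, y = np.round(v + t * n).astype(int)
        if not (0 <= y < sil.shape[0] and 0 <= x < sil.shape[1]):
            continue
        cur = sil[y, x]
        if prev is not None and cur != prev:  # silhouette contour crossed
            return np.array([x, y], dtype=float)
        prev = cur
    return None                             # no match within the search region

# Toy silhouette: foreground fills the left half of a 64x64 image.
sil = np.zeros((64, 64), dtype=np.uint8)
sil[:, :32] = 1
match = find_contour_match(sil, v=np.array([28.0, 16.0]),
                           n=np.array([1.0, 0.0]))
```

A vertex projected at x = 28 with a horizontal normal finds the contour at x = 32; a vertex farther than τ pixels from any contour returns no match and is governed by the smoothness term alone.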

Figure 5. Convergence quality after 2 and 8 solver iterations of the constraint-based deformation (a) and the least-squares conjugate gradient method (b). The mesh obtained by solving the linear system using a Cholesky decomposition is shown in (c).

Figure 6. Evaluation of the number of contour matching iterations. The dotted line represents the value N_o = 8 used in this paper.

For estimating the human body pose, any algorithm that computes skeleton joint positions in real-time is suitable. For example, Straka et al. [21] compute the skeleton pose directly from silhouette images and Shotton et al. [1] use depth maps as input. We use the OpenNI framework [22], which includes a real-time pose estimation module similar to [1]. Instead of using a Kinect camera, which would require additional calibration and synchronization with our multi-view system, we generate a volumetric 3D model [2] and render a depth map from a virtual viewpoint. Note that [22] only supports typical Kinect poses; our implementation would therefore benefit from more advanced real-time pose estimation systems such as [1], [23], which are unfortunately not publicly available.

B. Visual Quality

Our solver and the conjugate gradient method require multiple iterations until a satisfying mesh deformation is obtained. In Fig. 5, we compare the quality of the resulting mesh (2,500 vertices) after two and eight solver iterations N_i while keeping the rim-vertex/contour matches constant. The constraint-based approach produces smooth results after only two iterations, while the conjugate gradient solver still yields a noisy mesh. After eight iterations, both approaches yield similar results, which are comparable to the mesh obtained by solving the LME system via Cholesky decomposition.
The reason for the fast convergence of our algorithm is that we use nonlinear constraints and that the step size is automatically tuned according to the number of iterations. For high-quality results, we iterate between contour matching and mesh deformation in an iterative closest point fashion. In Fig. 6, we analyze how many iterations are needed until the contour correspondences stabilize (at 100%, all vertices have converged to a stable position). We use N_o = 8 iterations as a good trade-off between quality and speed in the following experiments.

In Fig. 7a, we analyze the distribution of the remaining error by rendering silhouettes of an artificial human body: we render a known human mesh from virtual cameras that mimic our real camera setup. After applying our mesh deformation algorithm, we determine the offset between the deformed vertex positions and the ground-truth data using the Hausdorff distance. The error stays below 10 mm for the majority of the body surface. In concave areas such as the crotch region the error is higher, since these regions are not visible in silhouette images. Fig. 7b shows a wireframe representation of the deformed mesh, overlaid on recorded camera images.

Figure 7. (a) Deformation error measured via the Hausdorff distance. (b) Mesh overlaid on captured images.

Related methods often present the deformation of a subject-specific laser scan, which includes details such as the face and wrinkles of clothes [4], [8]. In contrast, we deform the same template mesh to multi-view silhouette images of a variety of people (see Fig. 8). This means that the mesh will only adapt to details that are visible in silhouette contours. However, we can recover additional details in a rendering stage through projective texturing (see Fig. 1d). The advantage of using a generic mesh is that we can estimate the body shape of previously unknown people without additional 3D scanning.
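The error measurement in Fig. 7a can be sketched as follows; this is a plausible reconstruction rather than the paper's actual evaluation code, and it assumes both surfaces are given as dense point samples.

```python
import numpy as np
from scipy.spatial import cKDTree

def vertex_errors(deformed, ground_truth):
    """Per-vertex distance from each deformed vertex to the nearest
    ground-truth surface sample; the maximum over both directions is
    the symmetric Hausdorff distance.

    deformed, ground_truth : (N, 3) and (M, 3) point arrays
    Returns (per-vertex distances, Hausdorff distance).
    """
    d_fwd = cKDTree(ground_truth).query(deformed)[0]  # deformed -> GT
    d_bwd = cKDTree(deformed).query(ground_truth)[0]  # GT -> deformed
    return d_fwd, max(d_fwd.max(), d_bwd.max())
```

The per-vertex distances d_fwd are what a color-coded error plot such as Fig. 7a visualizes on the mesh surface.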
Note that the quality of the feet in our results is comparatively low, as the majority of our cameras are pointed towards the upper body.

C. Runtime Performance

We analyze the runtime performance of constraint-based mesh deformation on a single-threaded 3 GHz processor. In addition, we show that our approach can take advantage of current GPU architectures such as the NVIDIA GTX 480 by processing all constraints in parallel. For runtime measurements, we perform N_o = 8 iterations of contour

matching and use N_i = 8 solver iterations for our method and for conjugate gradients. In Fig. 9, we analyze the time required for the deformation of a mesh at different resolutions and compare the runtime to standard linear solvers. Note that we exclude the time for matching rim-vertices with silhouette contours, which is the same for all methods. Our method (GPU/CPU Constraint) clearly outperforms both linear solvers: it is faster by a factor of about 20 in the sequential implementation and more than 100 times faster when executed on a GPU. The bottleneck of the linear solvers lies in time-consuming matrix decompositions or matrix-vector products.

Figure 8. A single template mesh can be deformed to people of different size and gender. The color images are background-segmented camera images and the mesh is rendered from a similar viewing angle.

Figure 9. Comparison of the runtime of different optimization methods (time per frame in seconds, log scale, over the number of vertices) for the GPU constraint solver, Cholesky decomposition, the CPU constraint solver, and conjugate gradients.

Table I
COMPARISON OF THE TIME REQUIRED TO DEFORM A HUMAN MESH TO MULTI-CAMERA DATA IN SECONDS PER FRAME.

Authors                 | Model            | Vertices | Time
Aguiar et al. [4]       | Scan             | 2 K      | 27 s
Cagniart et al. [7]     | Scan/Visual Hull | 10 K     | 25 s
Hofmann & Gavrila [11]  | Parametric       | N/A      | 15 s
Vlasic et al. [9]       | Scan             | 10 K     | 4.8 s
Gall et al. [8]         | Scan             | N/A      | 1.7 s
This work (CPU)         | SCAPE            | 12 K     | 0.15 s
This work (GPU)         | SCAPE            | 12 K     | 0.02 s

The complete pipeline for mesh deformation includes human pose estimation and mesh initialization. The implementation of our approach is able to adapt a mesh with 12 000 vertices to multi-view silhouette data within 150 ms on a single CPU (or only 20 ms on a GPU). This allows for mesh deformation at the frame rate of our camera setup. Obviously, we can decrease this processing time even further when the number of vertices is reduced.
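The two linear-solver baselines compared in Fig. 9 can be illustrated on a toy stand-in for a sparse least-squares mesh system. This sketch only demonstrates the two solver families (direct Cholesky-style factorization vs. iterative conjugate gradients); the system below is an arbitrary regularized 1D Laplacian, not the actual mesh equations.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve, cg

# Toy SPD system: chain Laplacian plus soft position regularization.
n = 200
L = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n))
A = (L + 0.1 * sp.eye(n)).tocsc()
b = np.random.default_rng(0).normal(size=n)

# Direct solve (factorization-based, like the Cholesky baseline).
x_direct = spsolve(A, b)

# Iterative conjugate gradient solve (matrix-vector products only).
x_cg, info = cg(A, b)

assert info == 0                                   # CG converged
assert np.allclose(x_direct, x_cg, atol=1e-2)      # same solution
```

Both routes reach the same solution; the difference measured in Fig. 9 is purely in cost, since the factorization and the repeated matrix-vector products dominate the runtime as the vertex count grows.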
Especially when texture is applied to the mesh, a few thousand vertices are sufficient for a realistic display. In Table I, we compare the runtime of our approach with existing methods. A direct comparison of these methods is not possible, nor is it entirely fair to compare run-times measured on different platforms. However, this paper presents the first method that eliminates the performance bottleneck of the solver. So far, only our system is able to achieve interactive frame rates when adapting the shape of a human body model to image data.

D. Limitations

The current implementation relies on a fairly accurate initialization of the skeleton joints. Small displacements of joints can be handled without loss of quality, since the mesh automatically gets pulled towards the silhouette contour. However, if the displacement is too large or completely wrong, the search for silhouette contours will fail and no silhouette constraints can be generated for the affected vertices. Furthermore, our approach cannot adapt the body shape if the user wears substantially different clothing than the template mesh (e.g. a skirt). In this case, a specialized template with similar clothing is needed.

VI. CONCLUSIONS

We have presented a novel method which allows us to automatically estimate the shape of the human body from multi-view images in real-time. This is achieved by deforming a generic template mesh such that rim-vertices are aligned with silhouette contours in all input images. In contrast to existing approaches, we optimize the mesh using an iterative solver which allows the integration of nonlinear constraints. We have shown that the execution time of our solver outperforms previous work by a factor of 20 or more, while we maintain a comparable visual quality of the deformed mesh. Thus, we are able to estimate the pose and shape of a human body in an interactive environment. This opens up the possibility for a variety of applications, including live 3D video transmission and augmented reality
applications where the user can control his or her own personal avatar. Related work typically demonstrates surface adaptation using subject-specific laser scans [4], [8]. We have demonstrated that our constraints are sufficient to deform a generic mesh [19] to fit a variety of persons, as long as they wear tight-fitting clothing. This makes our method particularly suited for multi-user environments where no person-specific template mesh is available or where building such a model is the desired task.

In this paper, we have focused on mesh deformation based on silhouettes. However, our method is capable of adapting a mesh to different input data as well. For example, it is possible to create constraints that deform a mesh to fit oriented point clouds [16] or depth maps [14]. Recently, it has been shown how to jointly optimize the mesh surface and the underlying skeleton in a linear way [5], which is compatible with our constraint definitions. Therefore, future work will focus on including such skeleton constraints in our algorithm to make the deformation process even more robust.

ACKNOWLEDGMENT

This work was supported by the Austrian Research Promotion Agency (FFG) under the BRIDGE program, project #822702 (NARKISSOS). Furthermore, we would like to thank the reviewers for their valuable comments and suggestions. We also want to thank everyone who spent their time during the evaluation of this work.

REFERENCES

[1] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake, "Real-time human pose recognition in parts from single depth images," in Proc. of CVPR, 2011.
[2] M. Straka, S. Hauswiesner, M. Rüther, and H. Bischof, "A free-viewpoint virtual mirror with marker-less user interaction," in Proc. of SCIA 2011, LNCS 6688, A. Heyden and F. Kahl, Eds., 2011, pp. 635-645.
[3] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohli, J. Shotton, S. Hodges, and A. Fitzgibbon, "KinectFusion: Real-time dense surface mapping and tracking," in Proc. of IEEE ISMAR, 2011.
[4] E. de Aguiar, C. Stoll, C. Theobalt, N. Ahmed, H.-P. Seidel, and S. Thrun, "Performance capture from sparse multi-view video," ACM Transactions on Graphics, vol. 27, no. 3, 2008.
[5] M. Straka, S. Hauswiesner, M. Rüther, and H. Bischof, "Simultaneous shape and pose adaption of articulated models using linear optimization," in Proc. of ECCV 2012, Part I, LNCS 7572, 2012, pp. 724-737.
[6] L. Ballan and G. M. Cortelazzo, "Marker-less motion capture of skinned models in a four camera set-up using optical flow and silhouettes," in Proc. of 3DPVT, 2008.
[7] C. Cagniart, E. Boyer, and S. Ilic, "Probabilistic deformable surface tracking from multiple videos," in Proc. of ECCV 2010, Part IV, LNCS 6314, 2010, pp. 326-339.
[8] J. Gall, C. Stoll, E. de Aguiar, C. Theobalt, B. Rosenhahn, and H.-P. Seidel, "Motion capture using joint skeleton tracking and surface estimation," in Proc. of CVPR, 2009.
[9] D. Vlasic, I. Baran, W. Matusik, and J. Popović, "Articulated mesh animation from multi-view silhouettes," ACM Transactions on Graphics, vol. 27, no. 3, 2008.
[10] M. Botsch and O. Sorkine, "On linear variational surface deformation methods," IEEE Trans. on Visualization and Computer Graphics, vol. 14, no. 1, pp. 213-230, 2008.
[11] M. Hofmann and D. M. Gavrila, "3D human model adaptation by frame selection and shape-texture optimization," Computer Vision and Image Understanding, vol. 115, no. 11, pp. 1559-1570, 2011.
[12] A. Kanaujia, N. Haering, G. Taylor, and C. Bregler, "3D human pose and shape estimation from multi-view imagery," in Proc. of CVPR Workshops, 2011.
[13] N. Hasler, C. Stoll, B. Rosenhahn, T. Thormählen, and H.-P. Seidel, "Estimating body shape of dressed humans," Computers & Graphics, vol. 33, no. 3, pp. 211-216, 2009.
[14] A. Weiss, D. Hirshberg, and M. J. Black, "Home 3D body scans from noisy image and range data," in Proc. of ICCV, 2011, pp. 1951-1958.
[15] K. Li, Q. Dai, and W. Xu, "Markerless shape and motion capture from multiview video sequences," IEEE Transactions on Circuits and Systems for Video Technology, vol. 21, no. 3, pp. 320-334, 2011.
[16] Y. Furukawa and J. Ponce, "Dense 3D motion capture from synchronized video streams," in Proc. of CVPR, 2008.
[17] M. Müller, B. Heidelberger, M. Hennix, and J. Ratcliff, "Position based dynamics," Journal of Visual Communication and Image Representation, vol. 18, no. 2, pp. 109-118, 2007.
[18] R. Barrett, M. Berry, T. F. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout, R. Pozo, C. Romine, and H. van der Vorst, Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, 2nd ed. SIAM, 1994.
[19] D. Anguelov, P. Srinivasan, D. Koller, S. Thrun, J. Rodgers, and J. Davis, "SCAPE: Shape completion and animation of people," in Proc. of ACM SIGGRAPH, 2005.
[20] I. Baran and J. Popović, "Automatic rigging and animation of 3D characters," in Proc. of ACM SIGGRAPH, 2007.
[21] M. Straka, S. Hauswiesner, M. Rüther, and H. Bischof, "Skeletal graph based human pose estimation in real-time," in Proc. of BMVC, J. Hoey, S. McKenna, and E. Trucco, Eds., 2011.
[22] (2012) OpenNI. [Online]. Available: http://www.openni.org/
[23] C. Stoll, N. Hasler, J. Gall, H.-P. Seidel, and C. Theobalt, "Fast articulated motion tracking using a sums of Gaussians body model," in Proc. of ICCV, 2011, pp. 951-958.