Articulated Tracking with a Dynamic High-Resolution Surface Model
Aaron Walsman, Tanner Schmidt, Dieter Fox
Paul G. Allen School of Computer Science and Engineering, University of Washington
{awalsman, tws10, fox}@cs.washington.edu

I. INTRODUCTION

In order for robots to interact with complex deformable objects, a vision system must produce a perceptual representation that combines both spatial and semantic information, and must be fast enough to keep up with the object's physical motion. This is especially true of systems that interact directly with humans, where safety is critical and mistakes can be dangerous. We present a real-time system that tracks the surface and kinematic pose of deformable objects using a model-based optimization. Our primary goal is to provide accurate surface estimation that not only adapts to match a particular subject, but does so dynamically, tracking complex surface details such as folds and wrinkles as they appear and disappear.

Our approach fits a skeletal model and high-resolution polygon mesh to a point cloud. The skeleton is designed to capture the underlying kinematic structure of the subject and estimate its large-scale motion, while the polygon mesh captures volume differences between subjects and more complex surface details. This gives a robotic system both a low-dimensional pose that can be used for gesture and activity recognition, and a dense surface estimate which can be used for precise physical interaction. Furthermore, because our mesh comes from a predefined template, it is semantically consistent across capture sessions with different subjects. This means that we can tell not only where the surface of the subject is, but which regions of this surface correspond to different body parts.

Our skeleton fitting approach is similar to other generative methods for articulated model tracking [3, 9, 10, 12, 20, 23].
This requires model initialization, but means that we can easily track new subject types as long as an appropriate template model is available. It also means we can avoid the expensive data collection efforts required by approaches with a discriminative component [8, 13, 19, 21, 22]. Our surface fitting technique was inspired by dynamic surface reconstruction methods [7, 14, 17]. These use reconstruction techniques such as volumetric SDF fusion [5] while simultaneously estimating deformation parameters that warp the surface into place. Rather than build a new mesh for each capture session, we instead use a template model that is attached to the kinematic skeleton using traditional smooth binding approaches from computer graphics [15] and further deform it to fit an observed point cloud.

Fig. 1. Our model tracking a point cloud. Top left: colored point cloud input. Top right: estimated skeleton and surface mesh without surface tracking. Bottom left: high-resolution mesh tracking the dynamic shape of the subject. Bottom right: the high-resolution mesh with projected colors.

The motivating applications behind our work are robotic tasks that require precise spatial information about humans and deformable objects. There are several robotics applications in personal assistance, health care and rehabilitation that are hampered by a lack of reliable human pose and surface estimation. In addition to robotics, this approach has further applications in augmented reality, performance capture and interactive games.

II. TEMPLATE MODEL

Similar to Ye and Yang [23] and Schmidt et al. [20], our technique uses an iterative gradient-based approach to fit a kinematic model to observed data. We assume that the tracking sequence starts with an initial estimate of the skeletal pose. From that initialization, we iteratively optimize the pose to fit each incoming frame. We then use a second optimization to update the vertex positions of a triangle mesh representing the object's surface.
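The per-frame loop just described (pose first, then shape) can be sketched as follows. This is an illustrative control-flow sketch only: `pose_step` and `shape_step` are hypothetical stand-ins for the damped-least-squares updates the paper derives later, and the iteration counts come from the figures reported in the paper (ten to fifteen pose iterations, two shape iterations per frame).

```python
import numpy as np

# Hypothetical stand-ins for the real solver steps, so the loop is runnable.
# In the actual system these are damped-least-squares updates.
def pose_step(model, cloud):
    return np.zeros_like(model["theta"])

def shape_step(model, cloud):
    return np.zeros_like(model["phi"])

def track_sequence(model, frames, pose_iters=15, shape_iters=2):
    """Per-frame tracking loop: optimize pose, then refine per-vertex shape."""
    for cloud in frames:
        for _ in range(pose_iters):      # kinematic (pose) optimization
            model["theta"] = model["theta"] - pose_step(model, cloud)
        for _ in range(shape_iters):     # per-vertex shape optimization
            model["phi"] = model["phi"] - shape_step(model, cloud)
        yield model["theta"].copy(), model["phi"].copy()
```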
A. Dual Quaternion Kinematics

Our model consists of a kinematic skeleton with an attached mesh. The skeleton forms a tree of rigid link frames connected by flexible joints. Each link has a position and orientation represented as a transformation in 3D space relative to the camera. We use dual quaternions [4] to parameterize this space of rigid transformations SE(3). For the sake of space, we omit a thorough coverage of the mathematical details and instead refer readers to [6]. We use single-axis hinge and prismatic joints to describe the offset between frames, meaning that the pose of each joint can be described by a single parameter θ_j. For joints with multiple degrees of freedom, such as the shoulder or wrist, we use multiple overlapping joints. The pose of the entire skeleton can be described as a single vector Θ containing all of these joint parameters.

B. Surface Representation and Binding

The model's surface is represented as a high-resolution triangle mesh. This consists of a set of 3D vertex positions V = {v_1 ... v_|V|}, v_i ∈ R³, as well as a triangle list F. Each triangle is represented as a set of three integers referencing vertex indices: F = {f_1 ... f_|F|}, f_k ∈ Z³. The mesh is attached to the skeleton using dual quaternion binding [15], which provides a way to smoothly blend the influence of links between different regions of the mesh. This technique is also used by DynamicFusion [17] to attach the reconstructed mesh to a set of warp node transforms. Dual quaternion binding requires a bind-pose dual quaternion H_j^0 for each link in the skeleton as well as a weight matrix Ω describing the influence of each frame on each vertex. The bind pose represents the pose for which the kinematic skeleton matches the default pose of the mesh. Each column ω_i of Ω, corresponding to vertex i, is constrained to sum to one.
Given this information, the binding function transforms a vertex by computing the offset between the bind pose and the current pose H_j of each frame, and then constructing a linear blend H_i^σ of these offsets for each vertex v_i based on the weights:

    H_i^σ = Σ_j ω_ij (H_j (H_j^0)⁻¹)    (1)

The skinned vertex position v_i can be computed by multiplying this transformation by the vertex position in the default model, v_i^0:

    v_i = H_i^σ v_i^0    (2)

Our human mesh originated from a model on the website CG Trader [2] by the NoneCG group [18], used with permission. We heavily modified this mesh and constructed the skeleton hierarchy and skin weights using Autodesk's Maya software [1].

C. Dynamic Shape Parameters

Dynamic shape warping is represented by a set of offsets Φ containing a three-vector φ_i ∈ R³ for each vertex, describing a translation away from its default position. We can augment Equation 2 above to compute the warped position v_i:

    v_i = H_i^σ (v_i^0 + φ_i)    (3)

III. OPTIMIZATION

Using this model, we formulate pose and shape tracking as a damped-least-squares optimization process. We will first discuss our data association and residual computation before going into more detail on the kinematic and shape optimization steps.

A. Data Association and Residual

Given the model described in Section II, the task of estimating pose and shape requires estimating the joint angles Θ and the vertex offsets Φ. This is achieved by first generating a residual term describing the distance between the model and the observations, computing the derivative of that residual with respect to the parameters Θ and Φ, and then using an optimization step to compute an offset to these parameters that reduces the error. Our observations take the form of a point cloud P = {p_1 ... p_|P|}, p_k ∈ R³. The error term is computed by first constructing an assignment between vertices and observations.
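Returning to the binding of Equations (1)-(3), the blend of per-link offsets can be written out concretely. The sketch below is a minimal NumPy implementation of dual quaternion blending under the paper's conventions; it makes explicit the renormalization of the blended dual quaternion that is standard in dual quaternion skinning [15], and all function names are illustrative rather than taken from the authors' code.

```python
import numpy as np

def qmul(a, b):
    """Hamilton product of quaternions stored as (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return np.array([aw*bw - ax*bx - ay*by - az*bz,
                     aw*bx + ax*bw + ay*bz - az*by,
                     aw*by - ax*bz + ay*bw + az*bx,
                     aw*bz + ax*by - ay*bx + az*bw])

def qconj(q):
    return q * np.array([1.0, -1.0, -1.0, -1.0])

def dq_from_rt(q, t):
    """Unit dual quaternion (real, dual) for rotation q and translation t."""
    real = np.asarray(q, dtype=float)
    dual = 0.5 * qmul(np.array([0.0, *t]), real)
    return real, dual

def dq_mul(a, b):
    return qmul(a[0], b[0]), qmul(a[0], b[1]) + qmul(a[1], b[0])

def dq_inv(a):
    """Inverse of a *unit* dual quaternion: conjugate both parts."""
    return qconj(a[0]), qconj(a[1])

def dq_apply(a, p):
    """Apply the rigid transform encoded by unit dual quaternion a to point p."""
    real, dual = a
    rotated = qmul(qmul(real, np.array([0.0, *p])), qconj(real))[1:]
    translation = 2.0 * qmul(dual, qconj(real))[1:]
    return rotated + translation

def skin_vertex(w, H, H0, v0, phi=None):
    """Eqs. (1)-(3): blend the per-link offsets H_j (H_j^0)^-1 with weights w,
    renormalize, and transform the (optionally offset) default vertex v0."""
    blend_real = np.zeros(4)
    blend_dual = np.zeros(4)
    for w_j, H_j, H0_j in zip(w, H, H0):
        off_real, off_dual = dq_mul(H_j, dq_inv(H0_j))
        blend_real += w_j * off_real
        blend_dual += w_j * off_dual
    norm = np.linalg.norm(blend_real)
    v = np.asarray(v0, dtype=float) if phi is None else np.asarray(v0, dtype=float) + phi
    return dq_apply((blend_real / norm, blend_dual / norm), v)
```

For example, blending two links that purely translate by (2, 0, 0) and (0, 2, 0) with equal weights moves a vertex at the origin to (1, 1, 0).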
The data assignment is computed bidirectionally by finding the closest observation for each vertex and the closest vertex for each observation. Schmidt et al. [20] have shown that this improves robustness in certain conditions. Once we have the assignment, we compute a point-plane error term [3] for each vertex based on these sources. If we let N = {n_1 ... n_|V|} be the vertex normals, P̃ = {p̃_1 ... p̃_|V|} be the closest observed point to each vertex, and P̄ = {p̄_1 ... p̄_|V|} be the average of the observation points for which each vertex in V is the closest, we can construct the residual for each vertex as

    r_i = n_i^T (λ_a (p̃_i − v_i) + (1 − λ_a)(p̄_i − v_i))    (4)

B. Kinematic Optimization

The residual in Equation 4 is a function of the vertex positions V. The position of each vertex is a function of the skeleton pose Θ, the bind pose H^0, the default mesh vertices V^0, the vertex offsets Φ and the skin weights Ω. Therefore, we use the chain rule to compute the Jacobian of this residual, J_r = ∂r/∂Θ, as a product of two partial derivatives. This means that each row J_ri of J_r corresponding to vertex i can be written as:

    J_ri = ∂r_i/∂Θ = (∂r_i/∂V)(∂V/∂Θ)

Because we use point-plane error, the derivative of the residual with respect to the vertex position is the vertex normal. The derivative of the vertex position with respect to the skeleton pose Θ is more complex, but can be computed analytically. Due to space, we omit the details of the full derivative here. In addition to the per-vertex residual in Equation 4, we add a residual per joint that encourages the skeleton to move towards its default pose. This takes the form r_j = s_j θ_j, where s_j is a scalar that depends on the number of vertices affected by each joint. We construct the Jacobian J_k for the full kinematics by concatenating the vertex Jacobian J_r with the Jacobian of the pose prior, which has rows of the form J_θj = ∂r_j/∂Θ.
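The bidirectional association, the residual of Equation (4), and the damped-least-squares solve that the paper's optimization applies to the resulting Jacobians can be sketched as follows. This is an illustrative NumPy sketch: nearest neighbors are brute force here (a k-d tree would be used on real point clouds), and the function names are assumptions, not the authors' API.

```python
import numpy as np

def closest_indices(A, B):
    """For each row of A, the index of the nearest row of B (brute force)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

def point_plane_residual(V, N, P, lam_a=0.5):
    """Eq. (4): blend vertex->observation and observation->vertex terms."""
    p_tilde = P[closest_indices(V, P)]     # closest observed point per vertex
    owner = closest_indices(P, V)          # closest vertex per observation
    # Average of the observations claimed by each vertex; fall back to v_i
    # (which zeroes that term) for vertices that claim no observations.
    sums = np.zeros_like(V)
    np.add.at(sums, owner, P)
    counts = np.bincount(owner, minlength=len(V)).astype(float)
    p_bar = V.copy()
    claimed = counts > 0
    p_bar[claimed] = sums[claimed] / counts[claimed][:, None]
    diff = lam_a * (p_tilde - V) + (1.0 - lam_a) * (p_bar - V)
    return (N * diff).sum(axis=-1)

def dls_step(J, r, lam):
    """Damped least squares: solve (J^T J + lam diag(J^T J)) dx = J^T r
    using a Cholesky factorization of the damped normal matrix."""
    A = J.T @ J
    A += lam * np.diag(np.diag(A))
    L = np.linalg.cholesky(A)
    return np.linalg.solve(L.T, np.linalg.solve(L, J.T @ r))
```

The pose update subtracts `dls_step(J_k, r, λ_d1)` from Θ; the shape update applies the same solve independently to one small 3×3 system per vertex.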
The stacked system is

    J_k = [ J_r ]
          [ J_θ ]

We then use damped least squares [16] to solve for a pose offset ΔΘ:

    (J_k^T J_k + λ_d1 diag(J_k^T J_k)) ΔΘ = J_k^T r

The matrix (J_k^T J_k + λ_d1 diag(J_k^T J_k)) is positive definite, meaning the system can be solved efficiently using Cholesky decomposition [11]. Once ΔΘ has been computed, it is subtracted from the current pose Θ and the process is repeated. In practice we have found that ten to fifteen iterations of this optimization for each incoming frame are sufficient to match the pose of the target. The top right frame of Figure 1 shows the result of fitting the kinematic model with the default mesh onto a point cloud.

C. Shape Optimization

Once the pose has been fit, we update the shape deformation parameters Φ. These consist of a three-vector φ_i for each vertex. The shape optimization uses the residual from Equation 4 but incorporates additional regularization terms:

    r_φi = r_i + λ_s ‖φ_i‖₂² + λ_n Σ_{n ∈ N(i)} ‖φ_i − φ_n‖₂²    (5)

The first regularization term, weighted by λ_s, penalizes the magnitude of the deformation vector φ_i. This helps prevent the mesh from drifting off the skeleton. The second regularization term penalizes the difference between each vertex offset and those of its neighboring vertices, which helps prevent surface discontinuities and creases. As before, we compute the derivative of this residual with respect to each parameter of Φ, but make one important simplifying approximation. Technically, the second regularization term introduces interdependence between each φ_i and its neighbors, but in order to simplify the computation we treat each φ_i as if it were independent. This means that instead of solving one large but sparse 3|V| by 3|V| linear system, we can break it up into a separate 3 by 3 linear system for each vertex and solve them in parallel. We therefore compute a Jacobian J_φi for each vertex as

    J_φi = ∂r_φi/∂φ_i = ∂r_i/∂φ_i + 2λ_s φ_i + 2λ_n Σ_{n ∈ N(i)} (φ_i − φ_n)

Fig. 3.
Our performance on the EVAL dataset compared to the Articulated ICP baseline reported by [9], Ganapathi et al. [9], Schmidt et al. [20] and Ye and Yang [23].

Once we have computed the Jacobian J_φi, we use damped least squares as before and solve

    (J_φi^T J_φi + λ_d2 diag(J_φi^T J_φi)) Δφ_i = J_φi^T r_φi

for Δφ_i, which is subtracted from φ_i. The bottom left frame of Figure 1 shows the result of the shape deformation after the kinematic pose has been fit. In practice, only two iterations of shape refinement are necessary for each incoming video frame.

IV. EXPERIMENTS

There are a number of existing datasets designed to test the capabilities of markerless motion capture systems, but to our knowledge there are no existing benchmarks for evaluating both pose and high-resolution dynamic shape estimation simultaneously. We therefore evaluate our pose tracking on the EVAL dataset [9] without dynamic shape warping, and do further experiments with our own data to quantify the performance of our mesh tracking.

A. EVAL Dataset

Fig. 2. Histograms of the distance from forward-facing vertices to their closest observation point, computed over all frames for each of sixteen sequences. The results are grouped by distance from the camera. The x axis spans 0 to 5 cm in half-millimeter bins; the y axis is the percentage of vertices whose distance falls into each bin. Eighty percent of points have a value less than the dashed lines.

The EVAL dataset consists of twenty-four RGBD sequences split evenly across three human subjects with varying body proportions. We use the standard evaluation criterion, which is the percentage of frames in which the estimated joint position is within ten centimeters of the ground truth. Because the ground truth data relies on joint locations specific to a
particular model, we follow the technique of [23] and use mean-subtraction to find the best placement of the tracked joints relative to our model. Figure 3 shows our performance compared to the reported scores of other methods. While we do not perform as well as other state-of-the-art techniques, our method was not designed to work with low-resolution depth data, and our results are comparable with the range of existing literature.

Fig. 5. The point cloud is shown on the left. The center shows the mesh with no dynamic shape warping (Φ = 0). The fully deformed mesh is shown on the right. The center and right meshes are colored to show distance to the closest point in the point cloud.

B. Shape Fitting Experiments

In order to test our shape estimation, we generated a new dataset consisting of sixteen high-resolution RGBD sequences. These sequences were captured from four subjects and feature basic motions such as waves, stretches, pointing and handovers. They include data at three different distances, framing the head and shoulders, the upper torso, and the full body. To show the degree to which our shape optimization improves surface fitting, we tracked our model through each sequence twice. One run is performed normally; in the other, the vertex offsets are disabled so that the model maintains its default shape throughout the sequence. For each run we record the distance from each visible model vertex to its closest observation point in each frame. Because we do not have dense ground truth labels for this data, these results do not show overall tracking performance. Instead, they show how much the dynamic shape estimation improves fitting of the point-cloud observations. Figure 2 shows these errors binned into histograms, while Figure 5 shows one frame with both meshes colored to indicate this error. The shape estimation shows improvement in the form of a higher concentration of error at low values for all sequences.
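The summaries in Figure 2 can be reproduced from the recorded per-vertex distances with a short binning routine. A sketch, assuming distances in centimeters and the bin width and percentile marker described in the figure caption (the function name is illustrative):

```python
import numpy as np

def error_histogram(dist_cm, max_cm=5.0, bin_cm=0.05):
    """Fraction of vertex-to-observation distances per half-millimeter bin
    from 0 to 5 cm, plus the 80th-percentile distance (the dashed line)."""
    edges = np.arange(0.0, max_cm + bin_cm, bin_cm)
    counts, _ = np.histogram(np.clip(dist_cm, 0.0, max_cm), bins=edges)
    return counts / float(len(dist_cm)), np.percentile(dist_cm, 80)
```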
However, this improvement is most pronounced at closer ranges, where more detail is visible.

C. Qualitative Results

Figure 4 shows several colored point clouds and their corresponding tracked meshes at varying distances from the camera. In all cases the tracker was given a rough initialization of the subject's pose at the start of the sequence, and the pose and shape were tracked from that point forward.

V. CONCLUSION

We have demonstrated an approach for tracking deformable articulated objects using model-based optimization, and shown that it can produce dense and accurate estimates of detailed deforming surfaces in real time. The system also provides useful pose estimates of the model's kinematic structure for gesture recognition and motion prediction.

Fig. 4. A collection of still frames showing the results of our system. For each pair of images, the left shows the colored point cloud while the right shows our warped output mesh.
REFERENCES

[1] Autodesk Incorporated. Maya. URL com/products/maya/overview.
[2] CG Trader. URL
[3] Yang Chen and Gérard Medioni. Object modelling by registration of multiple range images. Image and Vision Computing, 10(3).
[4] William Kingdon Clifford. Mathematical Papers. Macmillan and Company.
[5] Brian Curless and Marc Levoy. A volumetric method for building complex models from range images. In Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques. ACM.
[6] Konstantinos Daniilidis. Hand-eye calibration using dual quaternions. The International Journal of Robotics Research, 18(3).
[7] Mingsong Dou, Sameh Khamis, Yury Degtyarev, Philip Davidson, Sean Ryan Fanello, Adarsh Kowdle, Sergio Orts Escolano, Christoph Rhemann, David Kim, Jonathan Taylor, et al. Fusion4D: Real-time performance capture of challenging scenes. ACM Transactions on Graphics (TOG), 35(4):114.
[8] Varun Ganapathi, Christian Plagemann, Daphne Koller, and Sebastian Thrun. Real time motion capture using a single time-of-flight camera. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE.
[9] Varun Ganapathi, Christian Plagemann, Daphne Koller, and Sebastian Thrun. Real-time human pose tracking from range data. In European Conference on Computer Vision. Springer.
[10] Cristina Garcia Cifuentes, Jan Issac, Manuel Wüthrich, Stefan Schaal, and Jeannette Bohg. Probabilistic articulated real-time tracking for robot manipulation. IEEE Robotics and Automation Letters (RA-L).
[11] Philip E. Gill and Walter Murray. Newton-type methods for unconstrained and linearly constrained optimization. Mathematical Programming, 7(1).
[12] Daniel Grest, Jan Woetzel, and Reinhard Koch. Nonlinear body pose estimation from depth images. In Joint Pattern Recognition Symposium. Springer.
[13] Thomas Helten, Andreas Baak, Gaurav Bharaj, Meinard Müller, Hans-Peter Seidel, and Christian Theobalt. Personalization and evaluation of a real-time depth-based full body tracker. In 2013 International Conference on 3D Vision (3DV). IEEE.
[14] Matthias Innmann, Michael Zollhöfer, Matthias Nießner, Christian Theobalt, and Marc Stamminger. VolumeDeform: Real-time volumetric non-rigid reconstruction. In Proceedings of the European Conference on Computer Vision (ECCV).
[15] Ladislav Kavan, Steven Collins, Jiří Žára, and Carol O'Sullivan. Geometric skinning with approximate dual quaternion blending. ACM Transactions on Graphics (TOG), 27(4):105.
[16] Donald W. Marquardt. An algorithm for least-squares estimation of nonlinear parameters. Journal of the Society for Industrial and Applied Mathematics, 11(2).
[17] Richard A. Newcombe, Dieter Fox, and Steven M. Seitz. DynamicFusion: Reconstruction and tracking of non-rigid scenes in real-time. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[18] NoneCG. URL
[19] Christian Plagemann, Varun Ganapathi, Daphne Koller, and Sebastian Thrun. Real-time identification and localization of body parts from depth images. In Robotics and Automation (ICRA), 2010 IEEE International Conference on. IEEE.
[20] Tanner Schmidt, Richard Newcombe, and Dieter Fox. DART: Dense articulated real-time tracking. In Proceedings of Robotics: Science and Systems, Berkeley, USA.
[21] Jamie Shotton, Toby Sharp, Alex Kipman, Andrew Fitzgibbon, Mark Finocchio, Andrew Blake, Mat Cook, and Richard Moore. Real-time human pose recognition in parts from single depth images. Communications of the ACM, 56(1).
[22] Jonathan Tompson, Murphy Stein, Yann LeCun, and Ken Perlin. Real-time continuous pose recovery of human hands using convolutional networks. ACM Transactions on Graphics (TOG), 33(5):169.
[23] Mao Ye and Ruigang Yang. Real-time simultaneous pose and shape estimation for articulated objects using a single depth camera. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014.
Robust Human Body Shape and Pose Tracking Chun-Hao Huang 1 Edmond Boyer 2 Slobodan Ilic 1 1 Technische Universität München 2 INRIA Grenoble Rhône-Alpes Marker-based motion capture (mocap.) Adventages:
More informationGame Programming. Bing-Yu Chen National Taiwan University
Game Programming Bing-Yu Chen National Taiwan University Character Motion Hierarchical Modeling Character Animation Motion Editing 1 Hierarchical Modeling Connected primitives 2 3D Example: A robot arm
More informationMulti-view stereo. Many slides adapted from S. Seitz
Multi-view stereo Many slides adapted from S. Seitz Beyond two-view stereo The third eye can be used for verification Multiple-baseline stereo Pick a reference image, and slide the corresponding window
More informationMotion Capture & Simulation
Motion Capture & Simulation Motion Capture Character Reconstructions Joint Angles Need 3 points to compute a rigid body coordinate frame 1 st point gives 3D translation, 2 nd point gives 2 angles, 3 rd
More informationACQUIRING 3D models of deforming objects in real-life is
IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS 1 Robust Non-rigid Motion Tracking and Surface Reconstruction Using L 0 Regularization Kaiwen Guo, Feng Xu, Yangang Wang, Yebin Liu, Member, IEEE
More informationCS 775: Advanced Computer Graphics. Lecture 4: Skinning
CS 775: Advanced Computer Graphics Lecture 4: http://www.okino.com/conv/skinning.htm Binding Binding Always done in a standard rest or bind pose. Binding Always done in a standard rest or bind pose. Associate
More informationHuman body animation. Computer Animation. Human Body Animation. Skeletal Animation
Computer Animation Aitor Rovira March 2010 Human body animation Based on slides by Marco Gillies Human Body Animation Skeletal Animation Skeletal Animation (FK, IK) Motion Capture Motion Editing (retargeting,
More informationMarkerless Motion Capture with Multi-view Structured Light
Markerless Motion Capture with Multi-view Structured Light Ricardo R. Garcia, Avideh Zakhor; UC Berkeley; Berkeley, CA (a) (c) Figure 1: Using captured partial scans of (a) front and (b) back to generate
More informationHandSonor: A Customizable Vision-based Control Interface for Musical Expression
HandSonor: A Customizable Vision-based Control Interface for Musical Expression Srinath Sridhar MPI Informatik and Universita t des Saarlandes Campus E1.4, 66123 Saarbru cken, Germany ssridhar@mpi-inf.mpg.de
More informationMotion capture: An evaluation of Kinect V2 body tracking for upper limb motion analysis
Motion capture: An evaluation of Kinect V2 body tracking for upper limb motion analysis Silvio Giancola 1, Andrea Corti 1, Franco Molteni 2, Remo Sala 1 1 Vision Bricks Laboratory, Mechanical Departement,
More informationNIH Public Access Author Manuscript Proc Int Conf Image Proc. Author manuscript; available in PMC 2013 May 03.
NIH Public Access Author Manuscript Published in final edited form as: Proc Int Conf Image Proc. 2008 ; : 241 244. doi:10.1109/icip.2008.4711736. TRACKING THROUGH CHANGES IN SCALE Shawn Lankton 1, James
More informationGrasp Recognition using a 3D Articulated Model and Infrared Images
Grasp Recognition using a 3D Articulated Model and Infrared Images Koichi Ogawara Institute of Industrial Science, Univ. of Tokyo, Tokyo, Japan Jun Takamatsu Institute of Industrial Science, Univ. of Tokyo,
More informationSegmentation and Tracking of Partial Planar Templates
Segmentation and Tracking of Partial Planar Templates Abdelsalam Masoud William Hoff Colorado School of Mines Colorado School of Mines Golden, CO 800 Golden, CO 800 amasoud@mines.edu whoff@mines.edu Abstract
More informationCapturing Skeleton-based Animation Data from a Video
Capturing Skeleton-based Animation Data from a Video Liang-Yu Shih, Bing-Yu Chen National Taiwan University E-mail: xdd@cmlab.csie.ntu.edu.tw, robin@ntu.edu.tw ABSTRACT This paper presents a semi-automatic
More informationInverse KKT Motion Optimization: A Newton Method to Efficiently Extract Task Spaces and Cost Parameters from Demonstrations
Inverse KKT Motion Optimization: A Newton Method to Efficiently Extract Task Spaces and Cost Parameters from Demonstrations Peter Englert Machine Learning and Robotics Lab Universität Stuttgart Germany
More informationSemantic 3D Reconstruction of Heads Supplementary Material
Semantic 3D Reconstruction of Heads Supplementary Material Fabio Maninchedda1, Christian Ha ne2,?, Bastien Jacquet3,?, Amae l Delaunoy?, Marc Pollefeys1,4 1 ETH Zurich 2 UC Berkeley 3 Kitware SAS 4 Microsoft
More informationComputational Design. Stelian Coros
Computational Design Stelian Coros Schedule for presentations February 3 5 10 12 17 19 24 26 March 3 5 10 12 17 19 24 26 30 April 2 7 9 14 16 21 23 28 30 Send me: ASAP: 3 choices for dates + approximate
More informationModeling 3D Human Poses from Uncalibrated Monocular Images
Modeling 3D Human Poses from Uncalibrated Monocular Images Xiaolin K. Wei Texas A&M University xwei@cse.tamu.edu Jinxiang Chai Texas A&M University jchai@cse.tamu.edu Abstract This paper introduces an
More informationCoordinate Free Perspective Projection of Points in the Conformal Model Using Transversions
Coordinate Free Perspective Projection of Points in the Conformal Model Using Transversions Stephen Mann Abstract Goldman presented a method for computing a versor form of the perspective projection of
More informationFEATURE-BASED REGISTRATION OF RANGE IMAGES IN DOMESTIC ENVIRONMENTS
FEATURE-BASED REGISTRATION OF RANGE IMAGES IN DOMESTIC ENVIRONMENTS Michael Wünstel, Thomas Röfer Technologie-Zentrum Informatik (TZI) Universität Bremen Postfach 330 440, D-28334 Bremen {wuenstel, roefer}@informatik.uni-bremen.de
More informationInverse Kinematics II and Motion Capture
Mathematical Foundations of Computer Graphics and Vision Inverse Kinematics II and Motion Capture Luca Ballan Institute of Visual Computing Comparison 0 1 A B 2 C 3 Fake exponential map Real exponential
More informationAnimations. Hakan Bilen University of Edinburgh. Computer Graphics Fall Some slides are courtesy of Steve Marschner and Kavita Bala
Animations Hakan Bilen University of Edinburgh Computer Graphics Fall 2017 Some slides are courtesy of Steve Marschner and Kavita Bala Animation Artistic process What are animators trying to do? What tools
More informationClothed and Naked Human Shapes Estimation from a Single Image
Clothed and Naked Human Shapes Estimation from a Single Image Yu Guo, Xiaowu Chen, Bin Zhou, and Qinping Zhao State Key Laboratory of Virtual Reality Technology and Systems School of Computer Science and
More informationDynamic Time Warping for Binocular Hand Tracking and Reconstruction
Dynamic Time Warping for Binocular Hand Tracking and Reconstruction Javier Romero, Danica Kragic Ville Kyrki Antonis Argyros CAS-CVAP-CSC Dept. of Information Technology Institute of Computer Science KTH,
More informationLearning Articulated Skeletons From Motion
Learning Articulated Skeletons From Motion Danny Tarlow University of Toronto, Machine Learning with David Ross and Richard Zemel (and Brendan Frey) August 6, 2007 Point Light Displays It's easy for humans
More informationChaplin, Modern Times, 1936
Chaplin, Modern Times, 1936 [A Bucket of Water and a Glass Matte: Special Effects in Modern Times; bonus feature on The Criterion Collection set] Multi-view geometry problems Structure: Given projections
More informationCSE452 Computer Graphics
CSE452 Computer Graphics Lecture 19: From Morphing To Animation Capturing and Animating Skin Deformation in Human Motion, Park and Hodgins, SIGGRAPH 2006 CSE452 Lecture 19: From Morphing to Animation 1
More informationHistogram of 3D Facets: A Characteristic Descriptor for Hand Gesture Recognition
Histogram of 3D Facets: A Characteristic Descriptor for Hand Gesture Recognition Chenyang Zhang, Xiaodong Yang, and YingLi Tian Department of Electrical Engineering The City College of New York, CUNY {czhang10,
More informationPERFORMANCE CAPTURE FROM SPARSE MULTI-VIEW VIDEO
Stefan Krauß, Juliane Hüttl SE, SoSe 2011, HU-Berlin PERFORMANCE CAPTURE FROM SPARSE MULTI-VIEW VIDEO 1 Uses of Motion/Performance Capture movies games, virtual environments biomechanics, sports science,
More informationCOMPUTER ANIMATION 3 KEYFRAME ANIMATION, RIGGING, SKINNING AND CHARACTER ANIMATION. Rémi Ronfard, Animation, M2R MOSIG
COMPUTER ANIMATION 3 KEYFRAME ANIMATION, RIGGING, SKINNING AND CHARACTER ANIMATION Rémi Ronfard, Animation, M2R MOSIG 2 Outline Principles of animation Keyframe interpolation Rigging, skinning and walking
More informationMotion Capture. Motion Capture in Movies. Motion Capture in Games
Motion Capture Motion Capture in Movies 2 Motion Capture in Games 3 4 Magnetic Capture Systems Tethered Sensitive to metal Low frequency (60Hz) Mechanical Capture Systems Any environment Measures joint
More informationDynamic Geometry Processing
Dynamic Geometry Processing EG 2012 Tutorial Will Chang, Hao Li, Niloy Mitra, Mark Pauly, Michael Wand Tutorial: Dynamic Geometry Processing 1 Articulated Global Registration Introduction and Overview
More informationAutomatic 3D Model Construction for Turn-Table Sequences - a simplification
Automatic 3D Model Construction for Turn-Table Sequences - a simplification Fredrik Larsson larsson@isy.liu.se April, Background This report introduces some simplifications to the method by Fitzgibbon
More informationVNect: Real-time 3D Human Pose Estimation with a Single RGB Camera
VNect: Real-time 3D Human Pose Estimation with a Single RGB Camera By Dushyant Mehta, Srinath Sridhar, Oleksandr Sotnychenko, Helge Rhodin, Mohammad Shafiei, Hans-Peter Seidel, Weipeng Xu, Dan Casas, Christian
More informationMotion Estimation. There are three main types (or applications) of motion estimation:
Members: D91922016 朱威達 R93922010 林聖凱 R93922044 謝俊瑋 Motion Estimation There are three main types (or applications) of motion estimation: Parametric motion (image alignment) The main idea of parametric motion
More informationCVPR 2014 Visual SLAM Tutorial Kintinuous
CVPR 2014 Visual SLAM Tutorial Kintinuous kaess@cmu.edu The Robotics Institute Carnegie Mellon University Recap: KinectFusion [Newcombe et al., ISMAR 2011] RGB-D camera GPU 3D/color model RGB TSDF (volumetric
More informationOnline structured learning for Obstacle avoidance
Adarsh Kowdle Cornell University Zhaoyin Jia Cornell University apk64@cornell.edu zj32@cornell.edu Abstract Obstacle avoidance based on a monocular system has become a very interesting area in robotics
More informationPose estimation using a variety of techniques
Pose estimation using a variety of techniques Keegan Go Stanford University keegango@stanford.edu Abstract Vision is an integral part robotic systems a component that is needed for robots to interact robustly
More informationAnimation Lecture 10 Slide Fall 2003
Animation Lecture 10 Slide 1 6.837 Fall 2003 Conventional Animation Draw each frame of the animation great control tedious Reduce burden with cel animation layer keyframe inbetween cel panoramas (Disney
More informationSTRUCTURAL ICP ALGORITHM FOR POSE ESTIMATION BASED ON LOCAL FEATURES
STRUCTURAL ICP ALGORITHM FOR POSE ESTIMATION BASED ON LOCAL FEATURES Marco A. Chavarria, Gerald Sommer Cognitive Systems Group. Christian-Albrechts-University of Kiel, D-2498 Kiel, Germany {mc,gs}@ks.informatik.uni-kiel.de
More informationWhat have we leaned so far?
What have we leaned so far? Camera structure Eye structure Project 1: High Dynamic Range Imaging What have we learned so far? Image Filtering Image Warping Camera Projection Model Project 2: Panoramic
More informationCMSC 425: Lecture 10 Skeletal Animation and Skinning
CMSC 425: Lecture 10 Skeletal Animation and Skinning Reading: Chapt 11 of Gregory, Game Engine Architecture. Recap: Last time we introduced the principal elements of skeletal models and discussed forward
More informationEE795: Computer Vision and Intelligent Systems
EE795: Computer Vision and Intelligent Systems Spring 2012 TTh 17:30-18:45 WRI C225 Lecture 02 130124 http://www.ee.unlv.edu/~b1morris/ecg795/ 2 Outline Basics Image Formation Image Processing 3 Intelligent
More information3D Computer Vision. Dense 3D Reconstruction II. Prof. Didier Stricker. Christiano Gava
3D Computer Vision Dense 3D Reconstruction II Prof. Didier Stricker Christiano Gava Kaiserlautern University http://ags.cs.uni-kl.de/ DFKI Deutsches Forschungszentrum für Künstliche Intelligenz http://av.dfki.de
More informationICPIK: Inverse Kinematics based Articulated-ICP
ICPIK: Inverse Kinematics based Articulated-ICP Shachar Fleishman Mark Kliger Alon Lerner Gershom Kutliroff Intel {shahar.fleishman,mark.kliger,alan.lerner,gershom.kutliroff}@intel.com Abstract In this
More informationUsing temporal seeding to constrain the disparity search range in stereo matching
Using temporal seeding to constrain the disparity search range in stereo matching Thulani Ndhlovu Mobile Intelligent Autonomous Systems CSIR South Africa Email: tndhlovu@csir.co.za Fred Nicolls Department
More informationMultiple camera fusion based on DSmT for tracking objects on ground plane
Multiple camera fusion based on DSmT for tracking objects on ground plane Esteban Garcia and Leopoldo Altamirano National Institute for Astrophysics, Optics and Electronics Puebla, Mexico eomargr@inaoep.mx,
More informationThis week. CENG 732 Computer Animation. Warping an Object. Warping an Object. 2D Grid Deformation. Warping an Object.
CENG 732 Computer Animation Spring 2006-2007 Week 4 Shape Deformation Animating Articulated Structures: Forward Kinematics/Inverse Kinematics This week Shape Deformation FFD: Free Form Deformation Hierarchical
More informationSurfNet: Generating 3D shape surfaces using deep residual networks-supplementary Material
SurfNet: Generating 3D shape surfaces using deep residual networks-supplementary Material Ayan Sinha MIT Asim Unmesh IIT Kanpur Qixing Huang UT Austin Karthik Ramani Purdue sinhayan@mit.edu a.unmesh@gmail.com
More informationT6: Position-Based Simulation Methods in Computer Graphics. Jan Bender Miles Macklin Matthias Müller
T6: Position-Based Simulation Methods in Computer Graphics Jan Bender Miles Macklin Matthias Müller Jan Bender Organizer Professor at the Visual Computing Institute at Aachen University Research topics
More informationShape Preserving RGB-D Depth Map Restoration
Shape Preserving RGB-D Depth Map Restoration Wei Liu 1, Haoyang Xue 1, Yun Gu 1, Qiang Wu 2, Jie Yang 1, and Nikola Kasabov 3 1 The Key Laboratory of Ministry of Education for System Control and Information
More information3D Computer Vision. Depth Cameras. Prof. Didier Stricker. Oliver Wasenmüller
3D Computer Vision Depth Cameras Prof. Didier Stricker Oliver Wasenmüller Kaiserlautern University http://ags.cs.uni-kl.de/ DFKI Deutsches Forschungszentrum für Künstliche Intelligenz http://av.dfki.de
More information