Multi-camera Tracking of Articulated Human Motion Using Motion and Shape Cues

Multi-camera Tracking of Articulated Human Motion Using Motion and Shape Cues

Aravind Sundaresan and Rama Chellappa
Department of Electrical and Computer Engineering, University of Maryland, College Park, MD 20742, USA
{aravinds, rama}@cfar.umd.edu
http://www.cfar.umd.edu/users/aravinds/

P.J. Narayanan et al. (Eds.): ACCV 2006, LNCS 3852, pp. 131-140, 2006. (c) Springer-Verlag Berlin Heidelberg 2006.

Abstract. We present a framework and algorithm for tracking articulated human motion. We use multiple calibrated cameras and an articulated human shape model. Tracking is performed using motion cues as well as image-based cues (such as silhouettes and motion residues, hereafter referred to as spatial cues), as opposed to constructing 3D volumes or visual hulls. Our algorithm consists of a predictor and a corrector: the predictor estimates the pose at time t + 1 using motion information between the images at t and t + 1. The error in the estimated pose is then corrected using spatial cues from the images at t + 1. In the predictor, we use robust multi-scale parametric optimisation to estimate the pixel displacement for each body segment. We then use an iterative procedure to estimate the change in pose from the pixel displacements of points on the individual body segments. We present a method for fusing information from different spatial cues, such as silhouettes and motion residues, into a single energy function. We then express this energy function in terms of the pose parameters and find the pose for which the energy is minimised.

1 Introduction

The complex articulated structure of human beings makes tracking articulated human motion a difficult task. It is necessary to use multiple cameras to deal with occlusion and kinematic singularities. We also need shape models to deal with the large number of body segments and to exploit their articulated structure. In our work, we use shape models whose parameters are known to build a system that can track articulated human body motion using multiple cameras in a robust and accurate manner. A tracking system works better when more observations are available for estimating the pose, and to that end our system uses different kinds of cues that can be estimated from the images. We use both motion information (in the form of pixel displacements) and spatial information (such as silhouettes and motion residues, hereafter referred to as spatial cues). The motion and spatial cues are complementary in nature. We present a framework for unifying different spatial cues into a single energy image. The energy of a pose can be described in terms of this energy image, and we can then obtain the pose with the least energy using optimisation techniques.

Much of the work in the past has focused on using either motion or spatial cues; in this paper we present an algorithm that fuses information from these two kinds of cues. Since we use both motion and spatial cues in our tracking algorithm, we are better able to deal with cases where the body segments are close to each other, such as when the arms are by the side of the body. Purely silhouette-based methods typically experience difficulties in such cases. Silhouette- or edge-based methods also have the weakness that they cannot deal with rotation about the axis of a body segment. Estimating the initial pose is a different problem from tracking and is difficult due to the large number of unknown parameters (joint angles). It is computationally intensive and typically requires several additional algorithms, such as head or hand detectors, and stochastic algorithms such as particle filtering or optimisation methods are required for the sake of robustness. While the methods we present in this paper can be used for initialisation as well, we concentrate on the tracking aspect.

Fig. 1. Overview of the algorithm.

Fig. 2. 3D model comparison: (a) 3D scan; (b) super-quadric model.

In our work, we use eight cameras that are placed around the subject. We use parametric shape models connected in an articulated tree to represent the human body, as described in Section 1.2. Our system, the block diagram of which is presented in Figure 1, consists of two parts: a predictor and a corrector. We assume that the initial pose is known. The tracking algorithm is as follows (a schematic sketch of the loop is given at the end of this section).

1. Compute the 2D pixel displacement between frames at times t and t + 1.
2. Predict the 3D pose at t + 1 based on the 2D motion from multiple cameras.
3. Compute an energy function that fuses information from different spatial cues.
4. Use the energy function to refine the estimate of the pose at t + 1.

We represent the pose, $\varphi_t$, in parametric form as a vector of the position of the base body (6 degrees of freedom) and the joint angles of the various articulated body segments (3 degrees of freedom for each joint); $\delta$ represents the incremental pose vector. We summarise prior work in articulated tracking in Section 1.1. We then describe the models in Section 1.2 and the details of our algorithm in Section 2. We validate our algorithm using real images captured from eight cameras, and the results are presented in Section 3.
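As a rough illustration of the predictor-corrector loop listed above, the control flow might be organised as in the following sketch. This is not the authors' implementation: the four step functions are hypothetical placeholders for the procedures detailed in Section 2.

```python
import numpy as np

def track_sequence(frames, cameras, pose_0,
                   estimate_pixel_displacement, predict_pose,
                   build_energy_images, refine_pose):
    """Hypothetical predictor-corrector loop for articulated tracking.

    frames  : list of per-time-step image sets, one image per camera
    cameras : calibrated camera models
    pose_0  : known initial pose (6-DoF base body + 3-DoF joint angles)
    The four function arguments stand in for the algorithm's steps.
    """
    poses = [pose_0]
    for t in range(len(frames) - 1):
        # Predictor: 2D pixel displacement of each body segment between
        # frames t and t+1, measured in every camera view.
        displacements = estimate_pixel_displacement(
            frames[t], frames[t + 1], poses[t], cameras)
        # Predict the 3D pose at t+1 from the multi-camera 2D motion.
        pose_pred = predict_pose(poses[t], displacements, cameras)

        # Corrector: fuse silhouettes and motion residues into one
        # energy image per camera, then refine the predicted pose.
        energies = build_energy_images(frames[t + 1], pose_pred, cameras)
        poses.append(refine_pose(pose_pred, energies, cameras))
    return poses
```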

1.1 Prior Work

We address the problem of tracking articulated human motion using multiple cameras. Gavrila and Davis [1], Aggarwal and Cai [2], and Moeslund and Granum [3] provide surveys of human motion tracking and analysis methods. Here we look at some existing methods that use either motion-based methods or silhouette- or edge-based methods to perform tracking.

Yamamoto and Koshikawa [4] analyse human motion based on a robot model, and Yamamoto et al. [5] track human motion using multiple cameras. Gavrila and Davis [6] discuss a multi-view approach for 3D model-based tracking of humans in action. They use a generate-and-test algorithm in which they search for poses in a parameter space and match them using a variant of Chamfer matching. Bregler and Malik [7] use an orthographic camera model and optical flow. Rehg and Morris [8] and Rehg et al. [9] describe ambiguities and singularities in the tracking of articulated objects, and Cham and Rehg [10] propose a 2D scaled prismatic model. Sidenbladh et al. [11] provide a framework to track 3D human figures using 2D image motion and particle filters, with a constrained motion model that restricts the kinds of motions that can be tracked.

Kakadiaris and Metaxas [12] use silhouettes from multiple cameras to estimate 3D motion. Plänkers and Fua [13] use articulated soft objects with an underlying articulated skeleton as a model, and use stereo and silhouette data for shape and motion recovery. Theobalt et al. [14] project the texture of the model obtained from silhouette-based methods and refine the pose using the flow field. Delamarre and Faugeras [15] use 3D articulated models for tracking with silhouettes: they apply forces to the contours obtained from the projection of the 3D model so that they move towards the silhouette contours obtained from multiple images. Cheung et al. [16] use shapes from silhouettes to estimate human body kinematics. Chu et al. [17] use volume data to acquire and track a human body model. Wachter and Nagel [18] track persons in monocular image sequences; they use an IEKF with a constant motion model and use edge and region information in the pose update step. Moeslund and Granum [19] use multiple cues for model-based human motion capture and use kinematic constraints to estimate the pose of a human arm. The multiple cues are depth (obtained from a stereo rig) and the extracted silhouette, and the kinematic constraints are applied to restrict the parameter space by excluding impossible poses. Sigal et al. [20, 21] use non-parametric belief propagation to track in a multi-view setup. Lan and Huttenlocher [22] use hidden Markov temporal models. Demirdjian et al. [23] constrain pose vectors based on kinematic models using SVMs.

Rohr [24] performs automated initialisation of the pose for single-camera motion. Krahnstoever [25] addresses the issue of model acquisition and initialisation. Mikic et al. [26] automatically extract the model and pose using voxel data. Ramanan and Forsyth [27] also suggest an algorithm that performs rough pose estimation and can be used in an initialisation step. Sminchisescu and Triggs present a method for monocular video sequences using robust image matching, joint limits and non-self-intersection constraints [28]; they also address the efficient removal of kinematic ambiguities in monocular pose estimation [29].

Our method differs in that we use both motion and spatial cues to track the pose, as opposed to using volume- or visual-hull-based techniques or only optical flow. We use spatial and motion cues obtained from multiple views in order to obtain robust results that overcome occlusions and kinematic singularities. We also present a novel method to use spatial cues such as silhouettes and motion residues, and it is also possible to incorporate edges in our method. Finally, we do not constrain the motion or the pose parameters to specific types of motion (such as walking), so our method is general.

1.2 Models

A good human shape model should allow the system to represent the human body in all of its postures and yet be simple enough to minimise the number of parameters required to represent the body accurately. We use tapered super-quadrics to represent the different body segments; more complex triangular mesh models can be used if their parameters can be acquired. We illustrate the 3D model used in our experiments in Figure 2. The dimensions of the super-quadrics were obtained manually with the help of the 3D scanned model in the figure. The motion of the different body segments is constrained by the articulated structure of the body. The base body (trunk) has 6 degree-of-freedom (DoF) motion. All other body segments are attached to the base body in a kinematic chain and have at most 3 DoF of rotational motion with respect to their parent node. Besides the shape of each body segment, the body model also includes the locations of the joints between body segments.

2 Algorithm

We compute the pose at time t + 1 given the pose at time t, using the images at times t and t + 1. The pose at t + 1 is estimated in two steps, a prediction step and a correction step. The steps required to estimate the pose at time t + 1 are first listed and then described in detail in the sections that follow.

1. Pixel-body registration at time t using the known pose at t.
2. Estimate the pixel displacement between time t and time t + 1.
3. Predict the pose at time t + 1 using the pixel displacement.
4. Combine silhouettes and motion residue for each body segment into an energy image for each image.
5. Correct the predicted pose at time t + 1 using the energy image obtained in step 4.

2.1 Pixel-Body Registration

Pixel-body registration is the process of registering each pixel in each image to a body segment, as well as obtaining approximate 3D coordinates of the corresponding point. We thus obtain a 2D mask for each body segment that we can use while estimating the pixel displacement. We convert each body segment into a triangular mesh, project it onto each image, and compute the depth at each pixel by interpolating the depths of the triangle vertices. We can thus fairly easily extend our algorithm to use triangular mesh models instead of super-quadrics. Since the depths of all pixels are known, we can compute occlusions. Figure 3 illustrates the projection of the body onto images from two cameras, with different colours indicating different body segments. We compute approximate 3D coordinates of pixels in a similar fashion.

Fig. 3. Pixel registration: (a) view 1; (b) view 2.
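The registration step can be pictured as a per-camera z-buffer rendering in which every pixel records the nearest body segment and its interpolated depth, which is also how occlusions fall out. The following is a minimal sketch under that reading; the project function, the mesh layout, and all names are illustrative assumptions rather than the paper's code.

```python
import numpy as np

def register_pixels(meshes, project, width, height):
    """Assign each pixel a body-segment label and an interpolated depth.

    meshes  : {segment_id: (vertices Nx3, triangles Mx3 int)}  (assumed layout)
    project : camera function mapping a 3D point to (u, v, depth)
    Returns a label image (-1 = background) and a depth z-buffer.
    """
    labels = np.full((height, width), -1, dtype=int)
    zbuf = np.full((height, width), np.inf)
    for seg_id, (verts, tris) in meshes.items():
        uvd = np.array([project(v) for v in verts])  # N x 3: u, v, depth
        for tri in tris:
            (u0, v0, d0), (u1, v1, d1), (u2, v2, d2) = uvd[tri]
            # Bounding box of the projected triangle, clipped to the image.
            xs = range(max(int(min(u0, u1, u2)), 0),
                       min(int(max(u0, u1, u2)) + 1, width))
            ys = range(max(int(min(v0, v1, v2)), 0),
                       min(int(max(v0, v1, v2)) + 1, height))
            den = (v1 - v2) * (u0 - u2) + (u2 - u1) * (v0 - v2)
            if abs(den) < 1e-12:           # degenerate (edge-on) triangle
                continue
            for y in ys:
                for x in xs:
                    # Barycentric coordinates of the pixel centre.
                    a = ((v1 - v2) * (x - u2) + (u2 - u1) * (y - v2)) / den
                    b = ((v2 - v0) * (x - u2) + (u0 - u2) * (y - v2)) / den
                    c = 1.0 - a - b
                    if a < 0 or b < 0 or c < 0:
                        continue           # pixel outside the triangle
                    depth = a * d0 + b * d1 + c * d2
                    if depth < zbuf[y, x]:  # nearer segment wins: occlusion
                        zbuf[y, x] = depth
                        labels[y, x] = seg_id
    return labels, zbuf
```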

2.2 Estimating Pixel Displacement

Since we use pixel displacement between frames to estimate the 3D pose change, we are not dependent on specific optical flow algorithms. Figure 4 illustrates how we obtain the pixel displacement of a single body segment, the example being the left forearm shown in Figure 3. We use a robust parametric model for the motion of the rigid objects, so that the displacement $\Delta x_i$ at pixel $x_i$ is given by $\Delta(x_i, \phi)$, where $\phi = [u, v, \theta, s]$. The elements of $\phi$ are the displacements along the x and y axes, the rotation, and the scale, respectively. We find that this parametric representation is more intuitive and more robust than an affine model. We obtain the value of $\phi \in [\phi_0 - \phi_B, \phi_0 + \phi_B]$ that minimises the residue $e^T e$, where $[e]_j = I_t(x_{ij}) - I_{t+1}(x_{ij} + \Delta(x_{ij}, \phi))$ and $\{x_{ij} : j = 1, 2, \ldots\}$ is the set of all points in the mask obtained in Section 2.1 and illustrated in Figure 4 (a). Here $\phi_0$ denotes zero motion and $\phi_B$ denotes the bounds that we impose on the motion. Figure 4 (a) shows the mask over the smoothed intensity image at time t. Figure 4 (b) is the difference between the images at times t and t + 1, i.e., with zero motion; it has large values in the mask region, signifying that there is some motion. Figure 4 (c) is the difference between the image at time t and the image at time t + 1 warped according to the estimated motion, and is called the motion residue for the optimal $\phi$. The value of the pixels in the mask region is close to zero where the estimated pixel displacement agrees with the actual pixel displacement. The motion residue provides us with a rough delineation of the location of the body segment, even when the original mask does not exactly match the body segment.

Fig. 4. Pixel displacement and motion residue: (a) mask; (b) image difference; (c) motion residue; (d) flow.
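A minimal sketch of the bounded residue minimisation just described, assuming float-valued grey images and a SciPy optimiser. The $[u, v, \theta, s]$ parameterisation follows the text, while the warp about the mask centre, the bilinear sampling, and the single-scale search are simplifying assumptions (the paper uses a robust multi-scale optimisation).

```python
import numpy as np
from scipy.ndimage import map_coordinates
from scipy.optimize import least_squares

def warp(points, phi, centre):
    """Apply phi = [u, v, theta, s] to 2D points about the mask centre."""
    u, v, theta, s = phi
    c, sn = np.cos(theta), np.sin(theta)
    rot = s * np.array([[c, -sn], [sn, c]])
    return (points - centre) @ rot.T + centre + np.array([u, v])

def estimate_phi(img_t, img_t1, mask_points, phi_bounds):
    """Find phi minimising the residue e^T e over the segment mask.

    img_t, img_t1 : float grey images at times t and t+1
    mask_points   : K x 2 array of (x, y) pixel coordinates of the mask
    phi_bounds    : positive half-widths [u_B, v_B, theta_B, s_B]
    """
    centre = mask_points.mean(axis=0)

    def residue(phi):
        warped = warp(mask_points, phi, centre)
        # Bilinear sampling at sub-pixel locations (rows = y, cols = x).
        i_t = map_coordinates(img_t, [mask_points[:, 1], mask_points[:, 0]],
                              order=1, mode='nearest')
        i_t1 = map_coordinates(img_t1, [warped[:, 1], warped[:, 0]],
                               order=1, mode='nearest')
        return i_t - i_t1    # [e]_j = I_t(x_ij) - I_{t+1}(x_ij + D(x_ij, phi))

    phi0 = np.array([0.0, 0.0, 0.0, 1.0])  # zero motion: no shift, unit scale
    bounds = (phi0 - phi_bounds, phi0 + phi_bounds)
    return least_squares(residue, phi0, bounds=bounds).x
```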

2.3 Pose Prediction

The pose parameter we need to estimate is the vector $\varphi$, which consists of the 6-DoF parameters of the base body and the 3-DoF joint angles for each of the remaining body segments. The state vector in our state-space formulation is $\varphi_t$, with state update (1) and observation (2):

State update: $\varphi_{t+1} = h(\varphi_t) + \delta_t$   (1)

Observation: $f(\dot{x}_t, \varphi_t, \varphi_{t+1}) = 0$   (2)

In our case the function $h(\cdot)$ is linear (3), and the pixel position $x(\cdot)$ in (4) is a non-linear function of the pose $\varphi$ and the incremental pose $\delta$; however, it is well approximated by a linear function locally:

$\varphi_{t+1} = \varphi_t + \delta_t$   (3)

$f(\dot{x}_t, \varphi_t, \delta_t) = \dot{x}_t - \left( x(\varphi_t + \delta_t) - x(\varphi_t) \right)$   (4)

Consider the observation, the measured (noisy) pixel displacement $\tilde{x}_t = \dot{x}_t + \eta$, where $\eta$ is the measurement noise and $\dot{x}_t$ is the true pixel displacement. We expand $f(\dot{x}_t, \varphi_t, \delta_t)$ in a Taylor series about $f(\tilde{x}_t, \hat{\varphi}_t, \hat{\delta}_t)$:

$f(\dot{x}_t, \varphi_t, \delta_t) = f(\tilde{x}_t, \hat{\varphi}_t, \hat{\delta}_t) + \frac{\partial f}{\partial \dot{x}_t}(\dot{x}_t - \tilde{x}_t) + \frac{\partial f}{\partial \varphi_t}(\varphi_t - \hat{\varphi}_t) + \frac{\partial f}{\partial \delta_t}(\delta_t - \hat{\delta}_t) + O(\cdot)$   (5)

The left-hand side, $f(\dot{x}_t, \varphi_t, \delta_t)$, is 0. The first term, $f(\tilde{x}_t, \hat{\varphi}_t, \hat{\delta}_t)$, is given by $\tilde{x}_t - \left( x(\hat{\varphi}_t + \hat{\delta}_t) - x(\hat{\varphi}_t) \right)$. The second term can be simplified as $\frac{\partial f}{\partial \dot{x}_t}(\dot{x}_t - \tilde{x}_t) = 1 \cdot (-\eta) = -\eta$. The third term in (5), $\frac{\partial f}{\partial \varphi_t}(\varphi_t - \hat{\varphi}_t)$, is negligible because the function $f(\cdot)$ is not very sensitive to the current pose, $\varphi_t$, and we expect the term $\varphi_t - \hat{\varphi}_t$ to be negligible as well. We assume, without loss of generality, that $\delta_t$ is a linear function of time $t$, so that $\delta_t = \delta \cdot t$, where $\delta$ is a constant. We note that (6) follows from the fact that the pixel velocity $\partial x(\varphi_t)/\partial t$ at a given point is a linear function of the rate of change of pose, $\delta$ [30]:

$\frac{\partial f(\dot{x}, \varphi, \delta_t)}{\partial \delta_t} = \frac{\partial x(\varphi + \delta t)/\partial t}{\partial \delta_t/\partial t} = F(\varphi + \delta t)\,\delta/\delta = F(\varphi_t + \delta_t)$   (6)

The fourth term's coefficient is therefore $\frac{\partial f}{\partial \delta_t}(\tilde{x}_t, \hat{\varphi}_t, \hat{\delta}_t) = F(\hat{\varphi}_t + \hat{\delta}_t)$. We neglect the higher-order terms in (5) and obtain the following linearised observation equation (7):

$\left( \tilde{x}_t - \left( x(\hat{\varphi}_t + \hat{\delta}_t) - x(\hat{\varphi}_t) \right) \right) + \eta = F(\hat{\varphi}_t + \hat{\delta}_t)\left( \delta_t - \hat{\delta}_t \right)$   (7)

We solve (7) for $\delta_t$ iteratively. We set $\hat{\delta}_t^{(0)} = 0$ and repeat the following steps until we obtain numerical convergence, which happens in a few iterations:

1. Set $F^{(i)} = F(\hat{\varphi}_t + \hat{\delta}_t^{(i)})$.
2. Set $x_t^{(i)} = \tilde{x}_t - \left( x(\hat{\varphi}_t + \hat{\delta}_t^{(i)}) - x(\hat{\varphi}_t) \right)$.
3. Update the pose increment: $\hat{\delta}_t^{(i+1)} = \hat{\delta}_t^{(i)} + \left( F^{(i)T} F^{(i)} \right)^{-1} F^{(i)T} x_t^{(i)}$.

We finally set $\hat{\varphi}_{t+1} = \hat{\varphi}_t + \hat{\delta}_t^{(N)}$.
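The iteration above is a Gauss-Newton style least-squares solve of (7). A compact sketch, assuming the projection $x(\cdot)$ and the Jacobian $F(\cdot)$ are supplied as functions (they are not reproduced here):

```python
import numpy as np

def predict_pose_increment(x_obs, x_of_pose, F_of_pose, phi_t,
                           n_iter=10, tol=1e-6):
    """Iteratively solve the linearised observation equation (7).

    x_obs     : stacked measured pixel displacements (the tilde-x_t vector)
    x_of_pose : function phi -> stacked pixel positions x(phi)
    F_of_pose : function phi -> Jacobian F(phi) of x with respect to the pose
    phi_t     : pose estimate at time t
    Returns the pose estimate at time t+1.
    """
    delta = np.zeros_like(phi_t)          # delta-hat^(0) = 0
    x_t = x_of_pose(phi_t)
    for _ in range(n_iter):
        F = F_of_pose(phi_t + delta)      # F^(i)
        # x_t^(i): displacement not yet explained by the current delta.
        r = x_obs - (x_of_pose(phi_t + delta) - x_t)
        # Least-squares step: (F^T F)^-1 F^T x_t^(i).
        step = np.linalg.solve(F.T @ F, F.T @ r)
        delta = delta + step
        if np.linalg.norm(step) < tol:    # numerical convergence
            break
    return phi_t + delta                  # phi-hat_{t+1}
```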

2.4 Computing Spatial Energy Function

We combine different types of spatial cues into an energy image for each body segment. This allows us to use the framework irrespective of which spatial cues are available. In our work we use silhouette information as well as the motion residue obtained during motion estimation. Figure 4 (c) is the motion residue for a segment and provides us with the region that agrees with the motion of the mask. We combine the motion residue with the silhouette as shown in Figure 5. We can form energy images even if the quality of the silhouette is not very good: though the resulting outliers may affect other silhouette-based algorithms, they do not affect our algorithm much.

Fig. 5. Obtaining a unified energy image for the forearm: (a) silhouette; (b) silhouette; (c) motion residue; (d) energy; (e) object mask; (f) 2D pose (original position, displaced, and displaced and rotated).

Once we have the pixel-wise energy image for each camera and a given body segment, we compute the energy for different values of 2D configuration parameters such as displacement and rotation. We have a mask for the body segment in a given image, as illustrated in Figure 5 (e). We can move this mask by a translation (dx, dy) or a rotation $\theta$, as illustrated in Figure 5 (f). We find the energy of the mask in each position by summing the energy of all the pixels that belong to the mask. Thus we can express the energy as a function of $(dx, dy, \theta)$ in the neighbourhood of $(dx, dy, \theta) = (0, 0, 0)$. When the body segment moves in 3D space by a translation and rotation, we can project the new axis onto each image and find the corresponding 2D configuration parameters in each of the images. We can then find the energy of the 3D pose by summing the energies of the mask in the corresponding 2D configurations over all images. We minimise this energy function in the local neighbourhood using a Levenberg-Marquardt optimisation technique initialised at the current 3D position. We show the new position of the axis of the body segment after optimisation in Figure 6: the red line represents the initial position of the axis of the body segment and the cyan line represents the new position. We thus correct the pose using spatial cues.
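To make the energy evaluation concrete, here is a hypothetical sketch of scoring a segment mask against per-camera energy images under 2D moves and minimising over a small 3D offset. A generic local optimiser stands in for the Levenberg-Marquardt step, and project_to_2d is an assumed placeholder for projecting the 3D motion to per-image $(dx, dy, \theta)$.

```python
import numpy as np
from scipy.ndimage import map_coordinates
from scipy.optimize import minimize

def mask_energy(energy_img, mask_points, dx, dy, theta, pivot):
    """Sum energy-image values under the mask moved by (dx, dy, theta)."""
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s], [s, c]])
    moved = (mask_points - pivot) @ rot.T + pivot + np.array([dx, dy])
    # Bilinear lookup of the energy at the moved pixel positions.
    return map_coordinates(energy_img, [moved[:, 1], moved[:, 0]],
                           order=1, mode='nearest').sum()

def pose_energy(offset_3d, cameras, masks, energy_imgs, project_to_2d):
    """Energy of a 3D segment offset = sum of per-camera mask energies.

    offset_3d : 6-vector (3D translation + rotation) about the predicted pose
    masks     : per-camera (mask_points K x 2, pivot 2) pairs
    """
    total = 0.0
    for cam, (points, pivot), e_img in zip(cameras, masks, energy_imgs):
        dx, dy, theta = project_to_2d(offset_3d, cam)
        total += mask_energy(e_img, points, dx, dy, theta, pivot)
    return total

def correct_segment(cameras, masks, energy_imgs, project_to_2d):
    """Local correction: minimise the fused energy around the prediction."""
    res = minimize(pose_energy, x0=np.zeros(6),
                   args=(cameras, masks, energy_imgs, project_to_2d),
                   method='Nelder-Mead')
    return res.x
```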

Fig. 6. Minimum energy configuration (energy images for cameras 1, 2, 3, 4, 6, and 8).

3 Experimental Results and Conclusions

In our experiments, we use grey-scale images from eight cameras with a spatial resolution of 648 × 484. Calibration is performed using Tomas Svoboda's algorithm [31] and a simple calibration device to compute the scale. We use images that have been undistorted based on the radial calibration parameters of the cameras, and we use a perspective projection model for the cameras. Experiments were conducted on different kinds of sequences, and we present the results of two such experiments in which the subject performs motions that exercise several joint angles in the body. Our results show that using only motion cues for tracking causes the pose estimator to lose track eventually: since we estimate only the difference in the pose, the error accumulates. This underlines the need for correcting the pose estimated using motion cues, and we show that the correction step of the algorithm prevents drift in the tracking. In Figure 7, we present results in which the model, in the estimated pose, is superimposed on the images obtained from two cameras. The length of the first sequence is 10 seconds (300 frames), during which there is considerable movement and bending of the arms, and occlusions occur at various times in different cameras. The second sequence is that of the subject walking. The body parts are successfully tracked in both cases.

Fig. 7. Tracking results using both motion and spatial cues.

We note that the method is fairly accurate and robust despite the fact that the human body model used is not very accurate, given that it was obtained manually using visual feedback. Specifically, the method is sensitive to joint locations, and it is important to estimate the joint locations accurately during the model acquisition stage. We also note that the method scales with the accuracy of the human body model, and that while we use super-quadrics to represent body segments, we could easily use triangular meshes instead, provided they can be obtained. We need to consider more flexible models that allow the locations of certain joints, such as the shoulder joints, to vary with respect to the trunk, to better model the human body.

References

1. Gavrila, D.M.: The visual analysis of human movement: A survey. Computer Vision and Image Understanding 73 (1999) 82-98
2. Aggarwal, J., Cai, Q.: Human motion analysis: A review. Computer Vision and Image Understanding 73 (1999) 428-440
3. Moeslund, T., Granum, E.: A survey of computer vision-based human motion capture. Computer Vision and Image Understanding 81 (2001) 231-268
4. Yamamoto, M., Koshikawa, K.: Human motion analysis based on a robot arm model. In: CVPR. (1991) 664-665
5. Yamamoto, M., Sato, A., Kawada, S., Kondo, T., Osaki, Y.: Incremental tracking of human actions from multiple views. In: CVPR. (1998) 2-7
6. Gavrila, D., Davis, L.: 3-D model-based tracking of humans in action: A multi-view approach. In: CVPR. (1996) 73-80
7. Bregler, C., Malik, J.: Tracking people with twists and exponential maps. In: CVPR. (1998) 8-15
8. Rehg, J.M., Morris, D.: Singularity analysis for articulated object tracking. In: CVPR. (1998) 289-296
9. Rehg, J., Morris, D.D., Kanade, T.: Ambiguities in visual tracking of articulated objects using two- and three-dimensional models. International Journal of Robotics Research 22 (2003) 393-418
10. Cham, T.J., Rehg, J.M.: A multiple hypothesis approach to figure tracking. In: CVPR. Volume 2. (1999)
11. Sidenbladh, H., Black, M.J., Fleet, D.J.: Stochastic tracking of 3D human figures using 2D image motion. In: ECCV. (2000) 702-718
12. Kakadiaris, I., Metaxas, D.: Model-based estimation of 3D human motion. IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (2000) 1453-1459
13. Plänkers, R., Fua, P.: Articulated soft objects for video-based body modeling. In: ICCV. (2001) 394-401
14. Theobalt, C., Carranza, J., Magnor, M.A., Seidel, H.P.: Combining 3D flow fields with silhouette-based human motion capture for immersive video. Graphical Models 66 (2004) 333-351
15. Delamarre, Q., Faugeras, O.: 3D articulated models and multi-view tracking with silhouettes. In: ICCV. (1999) 716-721
16. Cheung, K.M., Baker, S., Kanade, T.: Shape-from-silhouette of articulated objects and its use for human body kinematics estimation and motion capture. In: CVPR. (2003) 77-84

17. Chu, C.W., Jenkins, O.C., Mataric, M.J.: Markerless kinematic model and motion capture from volume sequences. In: CVPR (2). (2003) 475-482
18. Wachter, S., Nagel, H.H.: Tracking persons in monocular image sequences. Computer Vision and Image Understanding 74 (1999) 174-192
19. Moeslund, T., Granum, E.: Multiple cues used in model-based human motion capture. In: International Conference on Face and Gesture Recognition. (2000)
20. Sigal, L., Isard, M., Sigelman, B.H., Black, M.J.: Attractive people: Assembling loose-limbed models using non-parametric belief propagation. In: NIPS. (2003)
21. Sigal, L., Bhatia, S., Roth, S., Black, M.J., Isard, M.: Tracking loose-limbed people. In: CVPR. (2004) 421-428
22. Lan, X., Huttenlocher, D.P.: A unified spatio-temporal articulated model for tracking. In: CVPR (1). (2004) 722-729
23. Demirdjian, D., Ko, T., Darrell, T.: Constraining human body tracking. In: ICCV. (2003) 1071-1078
24. Rohr, K.: Human Movement Analysis Based on Explicit Motion Models. Kluwer Academic (1997)
25. Krahnstoever, N., Sharma, R.: Articulated models from video. In: CVPR. (2004) 894-901
26. Mikic, I., Trivedi, M., Hunter, E., Cosman, P.: Human body model acquisition and tracking using voxel data. International Journal of Computer Vision 53 (2003) 199-223
27. Ramanan, D., Forsyth, D.A.: Finding and tracking people from the bottom up. In: CVPR (2). (2003) 467-474
28. Sminchisescu, C., Triggs, B.: Covariance scaled sampling for monocular 3D body tracking. In: CVPR, Kauai, Hawaii, USA. Volume 1. (2001) 447-454
29. Sminchisescu, C., Triggs, B.: Kinematic jump processes for monocular 3D human tracking. In: CVPR. (2003) I-69-76
30. Sundaresan, A., RoyChowdhury, A., Chellappa, R.: Multiple view tracking of human motion modelled by kinematic chains. In: International Conference on Image Processing, Singapore. (2004)
31. Svoboda, T., Martinec, D., Pajdla, T.: A convenient multi-camera self-calibration for virtual environments. PRESENCE: Teleoperators and Virtual Environments 14 (2005) To appear.