Visual Tracking of Human Body with Deforming Motion and Shape Average


Alessandro Bissacco
UCLA Computer Science
Los Angeles, CA
UCLA CSD-TR #

Abstract

In this work we present a novel approach for tracking the human body in video sequences. We model the human skeleton as a kinematic chain of body parts which undergo a transformation composed of rigid motion and shape variation. Tracking is formulated as the problem of minimizing a cost functional with respect to the unknown position and shape of the body parts.

1 Introduction

The problem of tracking humans in video streams has received great attention in recent years. The recent availability of commercial hardware for the capture, transmission and processing of full-resolution video data has opened a wide range of new applications in this domain. Image-based human tracking might play a prominent role in the next generation of surveillance systems and human-computer interfaces. Systems for measuring human motion from video data can also provide valuable support for various fields, ranging from kinesiology and rehabilitation in biomechanics to technical training in sports and performing arts.

Estimating the pose of the human body in a video stream is a difficult problem because of the significant variations in the appearance of the object throughout the sequence. Illumination, viewing conditions, relative position and orientation, and self-occlusions all contribute to making the task of matching human body parts between image frames remarkably difficult.

In this work we propose an approach to visual tracking that models the human skeleton as a kinematic chain of rigid links. We formulate the problem as the minimization of a cost functional with respect to the rigid motion and shape of the body parts.

2 Related work

Different approaches have been proposed for the problem of visual tracking of human motion (see [1, 9] for a survey).
They can be classified into two main types, depending on whether a priori models of the shape of the human body are used. Approaches that do not use shape models, such as [21] and [14], usually rely on heuristic procedures to find correspondences of body parts between frames of video sequences. In [18] a variational approach exploiting motion information is used for detection and tracking of arbitrary objects. Model-based approaches can be divided into single-view [12, 22, 25] and multi-view [7, 26, 8] methods, and into those using 2D motion models [15] and those using 3D models [20, 5]. Most of these approaches require manual initialization in the first frame.

Soatto et al. [23] propose a framework for modeling the motion of deforming objects. A nonrigid transformation is seen as the composition of a group action g on a particular object, on top of which a local deformation is applied. In this setting the notion of average shape is defined as the one that minimizes the deformations. Bregler et al. [24] have proposed a variety of methods to model non-rigid motions. Such models are built as linear combinations of a collection of key poses, learned using principal component analysis from motion capture data.

Local representations of motion based on optical flow have been exploited in [3, 16], and view-based methods are proposed in [2, 10]. Other approaches are based on principal component analysis [27]. In [6] a mixed-state statistical model for the representation of motion has been proposed. In this Switching Linear Dynamic Model a stochastic finite-state automaton at the highest level switches between local linear Gaussian models. Estimation and recognition are performed

with expectation maximization approaches, using particle filters [17, 4] or structured variational inference techniques [19].

3 Modeling Human Body Motion

In this paper we focus on the problem of estimating the pose of a human body in a video sequence. The ultimate goal is to build a system that, if properly initialized, can reliably track the configuration of an articulated object such as a human body from a sequence of monocular images. We do not consider the issue of model initialization; instead, we assume that the configuration of the object in the first frame is given, for example, by manual initialization.

We model the human skeleton as a kinematic chain of rigid bodies. Each body segment is represented by a rigid link with ellipsoidal or conic support, and the links are connected together by joints. To restrict the set of admissible motions and reduce the ambiguities in the estimation, we assume that each joint allows a single degree of freedom, a rotation around its axis. This constraint is justified by the fact that the motion of the limbs in a walking gait can typically be approximated as planar, around an axis perpendicular to the direction of walking.

Obviously, body parts do not satisfy the rigid-object assumption. Various factors contribute to changing the appearance of the limbs in the sequence, such as illumination, viewing angle, occlusions, and so forth. Because of these large variations, any standard template matching technique is doomed to fail if directly applied to this problem. Our solution is to model the transformations undergone by body limbs as a composition of rigid motions and shape variations. We formulate a cost functional written in terms of the position of the links of the kinematic chain and their shape, represented by a weight function.

3.1 Exponential maps for motion representation

Among the possible representations of rigid motion, a sensible choice for our application is the one based on exponential maps.
In this parameterization, an arbitrary 3D motion is encoded in a 6-dimensional vector $\xi$, called a twist:

$$\xi = [\, v_x \;\; v_y \;\; v_z \;\; \omega_x \;\; \omega_y \;\; \omega_z \,]^T$$

The twist describes a rotation around an arbitrary axis in space: the axis and amount of rotation are given by the vector $\omega = [\, \omega_x \;\; \omega_y \;\; \omega_z \,]^T$, while the location of the rotation axis and the amount of translation along this axis are given by the remaining three components $v = [\, v_x \;\; v_y \;\; v_z \,]^T$. The matrix form $G \in SE(3)$ of the rigid motion represented by the twist $\xi$ is given by:

$$G = e^{\hat{\xi}}, \qquad \hat{\xi} = \begin{bmatrix} 0 & -\omega_z & \omega_y & v_x \\ \omega_z & 0 & -\omega_x & v_y \\ -\omega_y & \omega_x & 0 & v_z \\ 0 & 0 & 0 & 0 \end{bmatrix}$$

Exponential maps have several advantages over other parameterizations of 3D rotations. They do not suffer from the singular configurations that affect Euler angles, and, as opposed to quaternion or matrix representations, they do not require constraining a number of parameters larger than the degrees of freedom to a set of admissible values. Derivatives of exponential maps with respect to their parameters can be computed in closed form but do not have simple expressions. We refer the reader to [11] for the details.

3.2 From kinematic chains to images

Given the coordinates of a point on a link of the kinematic chain, we want to compute its projection on the image plane. If the variation in depth of the points on the articulated object is small compared to the distance from the camera, we can approximate the camera projection with a scaled orthographic projection. These conditions are generally met in video sequences of walking people.

In the following we consider the reference frames associated with links to be centered on the joints, with the z axis oriented along the direction of the joint axis. Let $p_o = [\, x_o \;\; y_o \;\; z_o \;\; 1 \,]^T$ be the homogeneous coordinates of a point relative to the reference frame of link $l$. The coordinates $(x, y)$ of its projection onto the image plane are given by:

$$\begin{bmatrix} x \\ y \end{bmatrix} = g_l(p_o, \Theta) = g_l(p_o, \xi_1, \ldots, \xi_L, s) \tag{1}$$

$$= s\,\Pi\, e^{\hat{\xi}_1} g_{12}\, e^{\hat{\xi}_2} g_{23}\, e^{\hat{\xi}_3} \cdots e^{\hat{\xi}_l}\, p_o \tag{2}$$

where $\Pi = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix}$ is the orthographic projection, $s$ is the scale, $\xi_1 = [\, v_x \;\; v_y \;\; v_z \;\; \omega_x \;\; \omega_y \;\; \omega_z \,]^T$ gives the position and orientation of the first link, $g_{ij}$ is the transformation from the reference frame of link $i$ to the reference frame of link $j$, and $\xi_i = [\, 0 \;\; 0 \;\; 0 \;\; 0 \;\; 0 \;\; \omega_z \,]^T$ gives the rotation of link $i$ around the axis of the associated joint.

3.3 Matching deforming regions

Consider the problem of tracking the position of a rigid object in a sequence of images. A standard approach is to minimize the sum of squared differences between the intensity of the pixels in the model $M$ and the corresponding pixels in the images $I_i$. Here the model can be the first frame, where the position of the object is given. Let $\Omega$ be the set of points on the model that belong to the object and $g(\cdot, \Theta_i)$ the transformation that maps points on the model to corresponding points on the image $I_i$. Then the function to minimize with respect to the motion parameters $\Theta_i$ over $n_f$ frames is:

$$E = \sum_{i=1}^{n_f} \int_{\Omega} \big( I_i(g(p, \Theta_i)) - M(p) \big)^2 \, dp \tag{3}$$

As previously mentioned, this simple solution cannot cope with the significant changes in appearance that occur when body parts move. Our approach is to allow for deviations from the original template by introducing a weight map $W(p)$ that defines the shape of the object. $W(p)$ ranges between 0 and 1, being 1 for points inside the object and 0 for points outside. We can then estimate the position of the object and its shape by minimizing a cost functional written in terms of the motion $\Theta_i$ and the weight map $W$:

$$E = \sum_{i=1}^{n_f} \frac{\int_{\Omega} \big( I_i(g(p, \Theta_i)) - M(p) \big)^2 \, W(p) \, dp}{\int_{\Omega} W(p) \, dp} \tag{4}$$

Notice that we have introduced a term in the denominator of the cost function. This term prevents $W \equiv 0$ from being a solution to the minimization problem.

3.4 Modeling Self-Occlusions

Visual tracking of complex articulated objects such as the human body demands explicit modeling of self-occlusions. This can be done by introducing a visibility function $V$ in the cost functional.
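To make the role of the normalization in equation (4) concrete, the following is a minimal sketch of the single-frame weighted cost on a discrete set of sample points. It assumes the image has already been warped, so that `image_vals[k]` stands for $I_i(g(p_k, \Theta_i))$; the function name `weighted_ssd` and the sample values are illustrative and do not come from the report.

```python
def weighted_ssd(image_vals, model_vals, weights):
    """Discrete single-frame version of equation (4):
    sum_k W_k * (I_k - M_k)^2 / sum_k W_k."""
    numerator = sum(
        w * (i - m) ** 2 for i, m, w in zip(image_vals, model_vals, weights)
    )
    denominator = sum(weights)
    if denominator == 0:
        # W identically zero is excluded: the normalized cost is undefined there.
        raise ValueError("weight map is identically zero")
    return numerator / denominator


# Hypothetical data: the last sample changed appearance between model and image.
model_vals = [10.0, 10.0, 10.0, 10.0]
image_vals = [10.0, 10.0, 10.0, 50.0]

cost_full = weighted_ssd(image_vals, model_vals, [1.0, 1.0, 1.0, 1.0])
cost_masked = weighted_ssd(image_vals, model_vals, [1.0, 1.0, 1.0, 0.0])
cost_scaled = weighted_ssd(image_vals, model_vals, [0.01, 0.01, 0.01, 0.01])
```

In this sketch, scaling every weight by the same factor leaves the cost unchanged, so uniformly shrinking $W$ toward zero brings no gain. The cost only drops when the weight is removed from the samples that actually disagree with the model.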
Consider the case of an articulated object with $L$ links. We have:

$$E = \sum_{i=1}^{n_f} E_i(\Theta_i, W) = \sum_{i=1}^{n_f} \frac{\sum_{l=1}^{L} \int_{\Omega_l} \big( I_i(g_l(p, \Theta_i)) - M_l(p) \big)^2 \, W_l(p) \, V_l(p, \Theta_i) \, dp}{\sum_{l=1}^{L} \int_{\Omega_l} W_l(p) \, V_l(p, \Theta_i) \, dp} \tag{5}$$

where $\Omega_l$ is the domain of link $l$ on the model, $W = (W_1, W_2, \ldots, W_L)$ and:

$$V_l(p, \Theta) = \begin{cases} 1 & \text{if } p \in \text{link } l \text{ is visible in pose } \Theta \\ 0 & \text{otherwise} \end{cases}$$

In order to compute derivatives of the energy with respect to the motion parameters $\Theta$, we need an analytical expression for the visibility function $V_l(p, \Theta)$. For this purpose we can use the signed distance function of a point $p$ from a closed curve. It is defined as the minimum distance of $p$ from a point on the curve, with a plus sign if $p$ is outside the curve and a minus sign if $p$ is inside. In the case at hand the shapes are simple and the distance can be computed analytically from their parameters. Let $d_l(p, \Theta)$ be the signed distance of the projection of $p$ from the contour of link $l$ on the image. Then we can write $V$ as:

$$V_l(p, \Theta) = \prod_{j \in F(l)} H(d_j(p, \Theta))$$

where $F(l) = \{ j : \text{link } j \text{ is in front of link } l \}$ and $H(\cdot)$ is the Heaviside function:

$$H(x) = \begin{cases} 0 & \text{if } x < 0 \\ 1 & \text{if } x \geq 0 \end{cases}$$

Assuming that the order of visibility defined by $F(\cdot)$ does not change during the motion, we can compute the derivatives of $V$ as:

$$\frac{\partial V_l}{\partial \Theta}(p, \Theta) = \sum_{k \in F(l)} \delta(d_k(p, \Theta)) \, \frac{\partial d_k}{\partial \Theta}(p, \Theta) \prod_{j \in F(l),\, j \neq k} H(d_j(p, \Theta)) \tag{6}$$

4 Tracking Algorithm

The first step of the algorithm is to build a model of the appearance of the human body in the sequence. We use a kinematic chain manually initialized to match the pose of the subject in the first frame. We extract $M_l$, the appearance model of link $l$, and its domain by picking the region in the first frame corresponding to the projection of link $l$ on the image.
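The visibility function of Section 3.4 can be sketched in a few lines. The example below is only an illustration under simplifying assumptions: each occluding link is replaced by a circle, for which the signed distance has an obvious closed form, whereas the report uses projected ellipsoids and trapezoids. The names `signed_distance_circle`, `heaviside` and `visibility` are hypothetical.

```python
import math


def signed_distance_circle(p, center, radius):
    """Signed distance of point p from a circle: positive outside, negative inside."""
    return math.hypot(p[0] - center[0], p[1] - center[1]) - radius


def heaviside(x):
    """H(x) = 0 for x < 0 and 1 for x >= 0, as defined in the text."""
    return 1.0 if x >= 0 else 0.0


def visibility(p, occluders):
    """V_l(p) as the product of H(d_j(p)) over the links j in front of link l.

    `occluders` is a list of (center, radius) pairs, one circle per link in F(l).
    """
    v = 1.0
    for center, radius in occluders:
        v *= heaviside(signed_distance_circle(p, center, radius))
    return v


# One occluding link, modeled as a unit circle at the origin (assumed values).
occluders = [((0.0, 0.0), 1.0)]
visible_point = visibility((2.0, 0.0), occluders)   # outside the occluder
hidden_point = visibility((0.5, 0.0), occluders)    # inside the occluder
```

A point is visible only if it lies outside every link in front of it, which is what the product of Heaviside terms encodes. For gradient-based optimization one would typically replace `heaviside` with a smooth approximation so that the delta function in equation (6) becomes a finite bump; that substitution is an assumption of this note, not something stated in the report.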

Then we perform tracking by minimizing the energy functional in (5) with respect to $\Theta_i$ and $W$. We use an alternating minimization scheme: given an initial guess for $W$, minimize with respect to the motions $\Theta_i$; then fix $\Theta_i$ at these optimal values and minimize with respect to $W$. The minimizations are performed using a gradient descent scheme. The gradient of (5) with respect to $\Theta_i$ is:

$$\frac{\partial E}{\partial \Theta_i} = \frac{1}{A} \Bigg( \sum_{l=1}^{L} \int_{\Omega_l} 2\, \Delta I_i(p, \Theta_i, l)\, \nabla I_i^T(g_l(p, \Theta_i))\, \frac{\partial g_l}{\partial \Theta_i}(\Theta_i)\, p \; V_l(p, \Theta_i)\, W_l(p)\, dp \;+\; \sum_{l=1}^{L} \int_{\Omega_l} \Delta I_i(p, \Theta_i, l)^2\, \frac{\partial V_l}{\partial \Theta_i}(p, \Theta_i)\, W_l(p)\, dp \;-\; E_i(\Theta_i, W) \sum_{l=1}^{L} \int_{\Omega_l} \frac{\partial V_l}{\partial \Theta_i}(p, \Theta_i)\, W_l(p)\, dp \Bigg) \tag{7}$$

where $\Omega_l$ denotes the domain of link $l$ on the model,

$$A = \sum_{l=1}^{L} \int_{\Omega_l} W_l(p)\, V_l(p, \Theta_i)\, dp$$

and:

$$\Delta I_i(p, \Theta, l) = I_i(g_l(p, \Theta)) - M_l(p)$$

The derivative with respect to the support map $W(\cdot)$ is:

$$\nabla_{W_l} E(p_k) = \sum_{i} \delta(d_l(p_k))\, \frac{V_l(p_k, \Theta_i)}{A} \Big( \Delta I_i(p_k, \Theta_i, l)^2 - E_i(\Theta_i, W) \Big) \tag{8}$$

This procedure is applied to blocks of $n_f$ frames of the sequence to be tracked. $n_f$ is a parameter of the algorithm and determines how many frames are used for estimating the shapes $W_l$. Setting $n_f = 1$ is inadvisable, because the shape of the object would then be updated from the match on a single image. Using large values, such as the number of frames in the sequence, is also inadvisable, because the appearance of body parts in distant frames can vary considerably and this would negatively affect the shape estimation.

It must be pointed out that in this formulation we do not exploit the temporal continuity of the motions $\Theta_i$. That is, we would obtain the same results if we applied the algorithm to a sequence where the order of the frames is scrambled. In practice, since the derivative of (5) with respect to $\Theta_i$ is independent of $\Theta_j$ for $j \neq i$, we exploit continuity by performing the minimization separately for each $\Theta_i$ and by using the optimal value of $\Theta_i$ to initialize the minimization over $\Theta_{i+1}$.

Also, we do not constrain the parameters of the kinematic chain to represent configurations that are physically feasible for the human body.
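The alternating scheme described in this section can be illustrated on a toy one-dimensional problem. The sketch below is not the tracker itself. It assumes a single frame, a single link with no occlusion, a motion that is a pure translation `t`, a hypothetical smooth intensity profile `image(x) = sin(x)`, numerical derivatives for the motion step, and a projected gradient step that clips the weights to [0, 1]. All names and step sizes are illustrative.

```python
import math


def image(x):
    """Hypothetical smooth 1-D intensity profile standing in for a frame."""
    return math.sin(x)


def energy(t, weights, points, model):
    """Normalized weighted cost for translation t (one frame, one link)."""
    num = sum(w * (image(p + t) - m) ** 2
              for w, p, m in zip(weights, points, model))
    return num / sum(weights)


def minimize_motion(t, weights, points, model, lr=0.5, steps=50, h=1e-5):
    """Gradient descent on t with W fixed, using a central-difference derivative."""
    for _ in range(steps):
        grad = (energy(t + h, weights, points, model)
                - energy(t - h, weights, points, model)) / (2 * h)
        t -= lr * grad
    return t


def minimize_weights(t, weights, points, model, lr=2.0, steps=20):
    """Projected gradient descent on W with t fixed.

    Uses dE/dW_k = (r_k^2 - E) / sum(W), then clips each weight to [0, 1].
    """
    w = list(weights)
    for _ in range(steps):
        e = energy(t, w, points, model)
        total = sum(w)
        for k, (p, m) in enumerate(zip(points, model)):
            r2 = (image(p + t) - m) ** 2
            w[k] = min(1.0, max(0.0, w[k] - lr * (r2 - e) / total))
    return w


# Model sampled at a true shift of 0.3; sample 5 has a changed appearance.
points = [0.2 * k for k in range(10)]
model = [image(p + 0.3) for p in points]
model[5] += 1.0

t, weights = 0.0, [1.0] * 10
for _ in range(3):  # alternate: motion with W fixed, then W with motion fixed
    t = minimize_motion(t, weights, points, model)
    weights = minimize_weights(t, weights, points, model)
```

In this toy setting the first motion step is biased by the corrupted sample, the weight step then suppresses that sample, and the next motion step recovers the true shift. Note that minimizing over the weights alone, run to convergence, tends to concentrate the weight on the best-matching samples, so the number of weight steps per round is deliberately kept small here.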
Adding terms to the cost function that model spatial constraints or joint dynamics would possibly improve the results.

Figure 1: Example of kinematic chain model of the human body used for tracking

5 Experimental results

In our preliminary experiments we tracked the body of two subjects performing a walking gait. The sequences are 100 and 43 frames long. In both sequences we tracked the position of the torso, one leg and one arm using a kinematic chain model with 5 links, pictured in figure 1. It can be seen that the chain model has links of two different shapes: ellipsoids and conic sections. For computational efficiency the projection on the image of a conic section is approximated with a trapezoid.

The model has been manually initialized in the first frame of the sequence by specifying the geometry of the links and by clicking on their position in the image. Once this initialization step is complete, the system performs the tracking task by minimizing (5), alternating between the motion parameters $\Theta$ and the shapes $W_l$. In the minimization over the motion parameters, numerical derivatives have been used. The results are obtained by performing the minimization on blocks of $n_f = 5$ images.

The results are shown in figures 5. In the first and fourth rows we can see some keyframes of the original sequences, in the second and fifth the same frames with the estimated kinematic chain superimposed, and in the third and sixth the estimated shape maps $W_l$ superimposed. We have also included movie clips of the original and tracked sequences.

Figure: Results of the tracking. The first and fourth rows show frames from the original sequences. The second and fifth rows display the estimated pose of the kinematic chain. The third and last show the estimated weight maps.

References

[1] J. K. Aggarwal and Q. Cai. Human Motion Analysis: A Review. 1999.
[2] M. J. Black. Eigentracking: robust matching and tracking of articulated objects
[3] M. J. Black. Explaining optical flow events with parameterized spatio-temporal models. In Proc. of Conference on Computer Vision and Pattern Recognition, volume 1, pages 326-332, 1999.
[4] M. J. Black and A. D. Jepson. A Probabilistic framework for matching temporal trajectories: Condensation-based recognition of gestures and expressions. In Proc. of European Conference on Computer Vision, volume 1.
[5] C. Bregler. Tracking people with twists and exponential maps. In Proc. International Conference on Computer Vision and Pattern Recognition.
[6] C. Bregler. Learning and recognizing human dynamics in video sequences. In Proc. of the Conference on Computer Vision and Pattern Recognition.
[7] J. Deutscher, A. Blake and I. Reid. Articulated motion capture by annealing particle filtering. In Proc. CVPR.
[8] D. M. Gavrila and L. S. Davis. Tracking of humans in action: a 3-d model-based approach
[9] D. M. Gavrila. The visual analysis of human movement: A survey. In Computer Vision and Image Understanding, volume 73, pages 82-98.
[10] M. A. Giese and T. Poggio. Morphable models for the analysis and synthesis of complex motion patterns. In International Journal of Computer Vision, volume 38(1).
[11] F. S. Grassia. Practical parameterization of rotations using the exponential map. Journal of Graphics Tools, 3(3):29-48.
[12] N. R. Howe, M. E. Leventon and W. T. Freeman. Bayesian reconstruction of 3D human motion from single-camera video. In Proc. of NIPS, 12, 2000.
[13] M. Isard and A. Blake. Condensation - conditional density propagation for visual tracking. International Journal of Computer Vision, 29(1), pp. 5-28.
[14] I. A. Kakadiaris and D. Metaxas. Model based estimation of 3D human motion with occlusion based on active multi-viewpoint selection. In Proc. of CVPR, 1995.
[15] M. K. Leung and Y. H. Yang. First sight: A human body outline labeling system. In IEEE Trans. on PAMI, 17(4).
[16] J. J. Little and J. E. Boyd. Recognizing people by their gait: the shape of motion.
[17] B. North, A. Blake, M. Isard and J. Rittscher. Learning and classification of complex dynamics. In IEEE Transactions on Pattern Analysis and Machine Intelligence, volume 22(9).
[18] N. Paragios and R. Deriche. Geodesic active contours and level sets for the detection and tracking of moving objects. In PAMI, 22(3), 2000.
[19] V. Pavlovic, J. Rehg and J. MacCormick. Impact of Dynamic Model Learning on Classification of Human Motion. In Proc. International Conference on Computer Vision and Pattern Recognition.
[20] J. M. Rehg and T. Kanade. Model-based tracking of self-occluding articulated objects. In ICCV.
[21] A. Shio and J. Sklansky. Segmentation of people in motion. In Proc. of IEEE Workshop on Visual Motion, 1991.
[22] H. Sidenbladh, M. Black and D. Fleet. Stochastic tracking of 3d human figures using 2d image motion. In Proc. of ECCV, II.
[23] S. Soatto and A. Yezzi. Deformotion: deforming motions and shape averages. In Proc. of the ECCV, LNCS, Springer Verlag, May
[24] L. Torresani, D. Yang, G. Alexander and C. Bregler. Tracking and Modelling Non-Rigid Objects with Rank Constraints. In Proc. International Conference on Computer Vision and Pattern Recognition.
[25] S. Wachter and H. Nagel. Tracking of persons in monocular image sequences. In CVIU, 74(3), 1999.
[26] C. Wren. Dynamic models of human motion.
[27] Y. Yacoob and M. J. Black. Parameterized modeling and recognition of activities. In Computer Vision and Image Understanding, volume 73(2).
