Uncalibrated Motion Capture Exploiting Articulated Structure Constraints

Size: px

Start display at page:

Download "Uncalibrated Motion Capture Exploiting Articulated Structure Constraints"

Linette Campbell
5 years ago
Views:

1 International Journal of Computer Vision 51(3), , 2003 c 2003 Kluwer Academic Publishers. Manufactured in The Netherlands. Uncalibrated Motion Capture Exploiting Articulated Structure Constraints DAVID LIEBOWITZ AND STEFAN CARLSSON Computational Vision and Active Perception Laboratory, Royal Institute of Technology, Sweden Received March 14, 2001; Revised December 19, 2001; Accepted July 30, 2002 Abstract. We present an algorithm for 3D reconstruction of dynamic articulated structures, such as humans, from uncalibrated multiple views. The reconstruction exploits constraints associated with a dynamic articulated structure, specifically the conservation over time of length between rotational joints. These constraints admit reconstruction of metric structure from at least two different images in each of two uncalibrated parallel projection cameras. As a by product, the calibration of the cameras can also be computed. The algorithm is based on a stratified approach, starting with affine reconstruction from factorization, followed by rectification to metric structure using the articulated structure constraints. The exploitation of these specific constraints admits reconstruction and selfcalibration with fewer feature points and views compared to standard self-calibration. The method is extended to pairs of cameras that are zooming, where calibration of the cameras allows compensation for the changing scale factor in a scaled orthographic camera. Results are presented in the form of stick figures and animated 3D reconstructions using pairs of sequences from broadcast television. The technique shows promise as a means of creating 3D animations of dynamic activities such as sports events. Keywords: motion capture, calibration, multiple views, 3D reconstruction 1. Introduction Systems for motion capture, i.e. dynamic 3D reconstruction of articulated structures such as humans in motion, typically employ a large set of cameras in a laboratory environment in order to maximise the stability of the reconstruction and handle the problem of self-occlusion. These systems are pre-calibrated, so the issue of motion capture is essentially data collection and reconstruction by triangulation. In real world situations such as sports events, on the other hand, the number of cameras viewing a specific situation is often limited to around two or three. The necessity of obtaining both an overview of the situation and close up details further limits the number of cameras useful for motion capture. These cameras are esentially never precalibrated, which in itself would present problems since they are often hand-held and moving around. In general these cameras are placed at very different positions in order to get complete coverage. This is beneficial for the problem of multi-view reconstruction since it implies large baselines between cameras. On the other hand it aggravates the problem of occlusion since cameras are essentially looking at different parts of the human body in motion. Considering the fact that events such as team sports naturally imply occlusions due to multiple actors, the limited number of visible features in multiple views can be a serious problem. A system for motion capture from real world sports events therefore has to face the problems of self-calibration and reconstruction from a limited number of views together with limitations on the number of simultaneously visible feature points in multiple views. Camera self-calibration from multiple views of general unconstrained structures requires at least 3 views for perspective cameras (Maybank and Faugeras, 1992). Sporting events are most often monitored by zooming cameras from relatively long distances, which implies that the camera geometry is very close to parallel projection. This further aggravates the problem of calibration, increasing the minimum number of cameras from 3 to 4 (Quan, 1996). It is therefore critical

2 172 Liebowitz and Carlsson Figure 1. One view of subset of selected points from a walking sequence in 3D. The length of body segments remain constant over time. This constraint is used for rectification of affine reconstructions from two views. to exploit all available scene constraints for calibration and reconstruction. Human bodies have traditionally been modelled as rigid link articulated structures. Knees, shoulders and elbows are rotational joints connected with (generally) rigid links. The lengths of body segments between rotational joints therefore remain constant over time (Fig. 1). We call this the fixed length or rigid link constraint, and it is the purpose of this paper to demonstrate that this constraint can be exploited in an efficient way in order to obtain metric reconstruction and camera calibration from two affine views of human motion. The method we propose uses a stratified approach for deriving metric from affine structure, computed from two uncalibrated views (Fig. 2), and proceeds as follows: Body locations at specific rotational joints are selected interactively from recorded image sequences (see Fig. 3). The 3D positions of the dynamic point sets over time are treated as a single static 3D structure. The point coordinates project to each camera as one view, as in Fig. 1. Two cameras will be considered for reconstruction in all cases. Affine 3D sequence structure is derived from two uncalibrated cameras. The rigid link or fixed body segment length constraint is used to upgrade the affine structure to metric. Figure 2. Stratified approach to sequence reconstruction and calibration. At each step, linear methods are used. This greatly simplifies the computations and effectively avoids the pitfalls of non-linear equation solving such as multiple solutions and iterative search. The specific constraints associated with articulated structures and motion capture has received some attention in earlier work. Much of the research in motion capture has focused on monocular sequences. Leventon and Freeman (1998), for example, take a statistical approach by using a motion capture database to build a probability model for motion sequences. Pose of the body in a novel sequence is then inferred by finding the most probable 3D configuration given the imaged co-ordinates. Taylor (2000) assumes known size body segments imaged by a scaled orthographic camera, and computes 3D orientation of the segments from foreshortening. Webb and Aggarwal (1983) considers the rotation axis of relative joints to be fixed for short times and is able to recover 3D structure from monocular sequences of articulated motion. In Bregler and Malik (1998) the constraint of having a kinematic chain is exploited in a single and multiple view system for automatic moption capture. This uses general minimisation procedures for finding structure and motion. The specific problems of having uncalibrated cameras are not discussed. Due to the general ambiguity in self calibration from limited number of views (Quan, 1996), (disussed in the next section) we therefore expect this approach to have problems of convergence unless extra constraints are introduced.

3 Uncalibrated Motion Capture Exploiting Articulated Structure Constraints 173 Figure 3. Body segments, such as the upper arm between the shoulder and elbow, and the upper leg, between the hip and knee, have a constant length when a person moves. The paper begins with a brief look at affine cameras and reconstruction in Section 2. Section 3 describes the general affine rectification transformation that defines a metric reconstruction of a scene This is followed by the calibration constraints that are available from knowledge of internal camera structure in Section 4. Section 5 introduces the relative length constraint, including degenerate configurations, and describes how it is used in motion capture. Finally, Section 6 presents the rigid link constraint applied to motion capture from broadcast video for both fixed and zooming cameras. 2. Affine Cameras and Reconstruction General affine cameras are characterised by an optic centre at infinity (Mundy and Zisserman, 1992). Using the affine camera model, world points are imaged by parallel projection. The rays through the camera centre, the world points and the image points are a parallel pencil. The imaged co-ordinates of a world point are thus independent of its depth in the scene. The affine camera projects a 3D point X to an image point x via ( ) r T ( ) x = K 1 t1 a X + = GX + t (1) r T 2 t 2 where the camera pose is described by the 3D translation vector t = (t 1, t 2, t 3 ) T and rotation matrix R with rows r T 1, rt 2 and rt 3. The matrix K a is associated with the internal parameters of the camera, and has the general form ( ) s k K a = 0 rs The parameter s is an isotropic scaling of image coordinates. The isotropic scaling can account for the changes in the imaged size of an object as a physical camera zooms or translates. Skew k and aspect ratio r model the skew and stretch of the imaging optic array. Cameras often have square pixels, with zero skew (k = 0) and unit aspect ratio (r = 1). K a can then be appropriately simplifed and the camera is termed a scaled orthographic camera. This property will be used in Section 6.3. Given two affine views of scene, world geometry can be reconstructed up to an affine transformation. This result, originating with Koenderink and van Doorn (1991), requires four points matched across views to define an affine basis for reconstruction. Further points may then be assigned 3D co-ordinates with respect to these basis vectors. A minimum of four points are also required to compute the affine fundamental matrix (Hartley and Zisserman, 2000) and thus affine epipolar geometry. The most widely used affine reconstruction method is the factorization algorithm of Tomasi and Kanade (1992). Factorization provides a robust and computationally efficient method of computing affine structure from point correspondences using singular value decomposition. It has also been shown to give a Maximum Likelihood reconstruction under Gaussian noise assumptions for image points (Reid and Murray, 1996). Using factorization, affine 3D structure is computed by the SVD of a measurement matrix W which, for two views, has the form ( ) x1 x 2 x n W = x 1 x 2 x n for n inhomogeneous matched points x i in the first view and x i in the second. The singular value decomposition

4 174 Liebowitz and Carlsson of W returns three matrices from which a pair of cameras G and G as well as 3D points A i can be extracted. Note that image co-ordinates are normalised in W to place the origin at the centroid of the points in each image. Since in parallel projection the 3D centroid of the point cloud is imaged at the 2D centroid, this centres the reconstruction coordinate system at the 3D centroid. In consequence, the translation parameters of the cameras are eliminated in (1), and G and G are 2 3 inhomogenous projection matrices. Attention in this paper is focused on affine cameras because they are useful to the motion capture application. However, an affine reconstruction can also be obtained from perspective cameras. A scene viewed by two or more uncalibrated perspective cameras can be reconstructed up to an arbitrary projective transformation (Faugeras, 1992). Numerous methods of upgrading the projective reconstruction to affine exist (Pollefeys et al., 1996; Hartley, 1992; Liebowitz and Zisserman, 1999). The relative length constraint described in Section 5 applies to any affine reconstruction. 3. From Affine Reconstruction to Metric Using Scene Constraints The general stratified approach to scene reconstruction from multiple perspective cameras distinguishes projective, affine and metric reconstructions (Faugeras, 1995). It is assumed here that an affine reconstruction is known, and must be rectified to metric. This section describes the decomposition of the affine transformation into similarity and pure affine components, and the simplification that results. Formally, a world point W and point A in an affine reconstruction are related by an affine transformation of 3D: W = HA + t (2) where H is a full rank 3 3 matrix and t is a translation 3-vector. In reconstruction for graphics purposes, similarity transformations may often be neglected. The overall scale, rotation and translation between the world and reconstruction can be chosen arbitrarily. The significant transformation is that which rectifies the reconstruction to within a similarity transformation of the world a metric reconstruction. To this end, the affine rectifying transformation, which has 12 degrees of freedom, is decomposed into two transformations. First, we neglect the translation vector, and then apply RQ decomposition (Hartley and Zisserman, 2000) to H. H may thus be written as a product of an orthonormal (rotation) matrix S and an upper triangular matrix U: H = SU (3) In addition, U may be regarded as a homogeneous matrix in the sense that its overall scale factor may be ignored. This scale factor is the isotropic scaling of the similarity transformation ss. U is thus a five degree of freedom purely affine transformation. Determining U and applying it to the affine structure results in metric structure. Any metric constraints on the world coordinates will therefore automatically induce constraints on the rectification U. To see this consider the rectified vector: W i = SUA i + t (4) Metric properties of a set of points can always be expressed in terms of inner products: (W i W j ) T (W k W l ) = (A i A j ) T U T U(A i A j ) (5) A similarity structure invariant can likewise always be expressed as a ratio of inner products. Any such ratio implies a linear constraint on the matrix U T U (W i W j ) T (W k W l )(A i A j )T U T U(A i A j ) (W i W j )T (W k W l ) (A i A j ) T U T U(A i A j ) = 0 (6) We will write = U T U (7) for this matrix. This is closely related to the absolute conic of projective geometry which has the property that it is invariant to 3D similarity transformations. It thus encodes only the affine distortion of the reconstruction, not the similarity component (Faugeras and Maybank, 1990). From the rectification U can easily be computed by Cholesky decomposition (Kreysig, 1993). Typical metric constraints involve lengths and angles for point configurations. In the following chapters we will exploit constraints such as conservation of

5 Uncalibrated Motion Capture Exploiting Articulated Structure Constraints 175 relative length over time that are valid for dynamic articulated structures. Internal parameters and autocalibration constraints may also be included in the computation of the rectification. This is addressed next. 4. Affine Camera Auto-calibration and Known Parameters Camera matrices in the affine reconstruction are upgraded to metric by U as well, with a metric camera given by G M = GU 1 Metric constraints on affine cameras therefore induce linear constraints on the inverse rectification 1 in a very similar way as in the previous section. In the original work on factorization (Tomasi and Kanade, 1992), rectification of affine structure relies on the orthographic projection camera model. For orthographic projection, the rows of the metric rectified cameras are orthonormal. This constraint applies equally to scaled orthography: the rows of each rectified camera are orthogonal and of equal length. Writing this for the camera G, a rectified camera is given by Hence ( ) ( ) g T GU 1 = 1 sr T U 1 = 1 g T 2 sr T 2 g T 1 U 1 (g T 2 U 1 ) T = g T 1 U 1 U T g 2 = g T 1 1 g 2 = 0 (8) and g T 1 U 1 (g T 1 U 1 ) T g T 2 U 1 (g T 2 U 1 ) T = g T 1 1 g 1 g T 2 1 g 2 = 0 (9) Both (8) and (9) are linear in the elements of 1. Given three or more cameras, 1 can be identified and metric structure obtained. Extensions of this approach to other camera models may be found (Poelman and Kanade, 1994; Shapiro et al., 1994; Weinshall and Tomasi, 1993), but scaled orthography is of most interest here since this model can be used to describe a camera with square pixels at some distance from the object of interest. A method for auto-calibration of a number of affine cameras with unknown but fixed internal parameters is described by Quan (1996). The auto-calibration constraint assumes unchanging camera parameters, and can be expressed in terms of the projection of G i 1 G T i = G j 1 G T j (10) for all camera pairs i, j. Quadratic constraints on the elements of 1 follow from dehomogenising (10) by dividing each side through by one of the entries of the matrix. Each pair of cameras provides two constraints, so a minimum of four cameras are required to solve for 1. With knowledge of some internal parameters or more views, the constraint can be linearised. The fixed camera autocalibration constraint is not generally applicable to motion capture, since the two cameras viewing the scene are different. The scaled orthographic camera model and constraints (8) and (9), however, will be used. 5. Rectification Constraints from Relative Length This section describes a constraint on the metric rectification of an affine reconstruction derived from a known ratio of world 3D lengths. It will be shown that the metric length of a line segment in an affine reconstruction can be expressed in terms of, and that this leads to a linear constraint on the elements of from two segments. Degenerate conditions are described and the application of the constraint to motion capture is introduced The Relative Length Constraint Consider two 3D line segments in an affine reconstruction, segment L A with endpoints A 1 and A 2, and segment L B with endpoints B 1 and B 2. The metric lengths of L A and L B, l A and l B, may be expressed by transforming the endpoints with U and taking inner products: la 2 = (UX A) T (UX A ) = X T A X A and lb 2 = (UX B) T (UX B ) = X T B X B (11) where X A = A 1 A 2 and X B = B 1 B 2. If the metric length ratio of L A and L B is λ, from (11): l 2 A = λ2 l 2 B X T A X A = λ 2 X T B X B (12)

6 176 Liebowitz and Carlsson where is the symmetric 3 3 matrix with elements = (13) Expanding (12) in terms of the elements of, X A = (X A1, X A2, X A3 ) T and X B = (X B1, X B2, X B3 ) T Figure 4. Parallel line segments provide no constraint on. X T A X A λ 2 X T B X B = 0 ( X 2 A1 λ 2 XB1 2 ) 1 + 2(X A1 X A2 λ 2 X B1 X B2 ) 2 + ( X 2 A2 λ2 X 2 B2) 3 + 2(X A1 X A3 λ 2 X B1 X B3 ) 4 + 2(X A2 X A3 λ 2 X B2 X B3 ) 5 + ( X 2 A3 λ2 X 2 B3) 6 = 0 This is a linear equation in the elements of, which we may write in vector form as c T Ω v = 0 (14) where c = (c 1,...,c 6 ) T is the vector of coefficients and Ω v = ( 1,..., 6 ) T is the vector of elements of. As before, is a homogeneous matrix with five degrees of freedom. Five independent constraints are thus required to solve for. Givenfive or more independent constraints, the coefficient vectors c 1 to c n can be combined in a constraint matrix. It follows from (14) that c T 1. c T n Ω v = CΩ v = 0 (15) and Ω v is the null vector of C, which is of rank five. In the presence of noise and with more than five constraints, C will generally be of rank six. In this case, singular value decomposition (SVD) provides an estimate of the null vector. The singular vector associated with the smallest singular value is the estimate of the null vector of C that minimises CΩ v. There are also degenerate configurations of lines segments which do not yield independent constraints. These are examined in the following section Degeneracies Constraint degeneracy occurs when a particular constraint fails to provide information about, or when a set of constraints are linearly dependent in (14). Three such conditions are enumerated below Parallel Line Segments. Length ratio in parallel directions is an affine invariant, so parallel line segments in an affine reconstruction preserve their metric length ratio. It is therefore unsurprising that no constraint on is obtained from the relative length of parallel segments. Explicitly, if L A and L B are parallel with length ratio λ, X A = λx B (see Fig. 4) and (12) becomes X T A X A λ 2 X T B X B = λ 2 X T B X B λ 2 X T B X B = 0 and is satisfied for all. The elements of the coefficient vector c will all be zero Pairs of Parallel Line Segments. Two constraints obtained from the relative lengths of segment pairs that are parallel are not independent. Suppose we have L A and L B with metric length ratio λ, and L C and L D with ratio κ, shown in Fig. 5, we can write two Figure 5. on. Parallel pairs of line segments provide one constraint

7 Uncalibrated Motion Capture Exploiting Articulated Structure Constraints 177 linear constraints on : X T A X A λ 2 X T B X B = 0 X T C X C κ 2 X T D X D = 0 However, if L A is parallel to L C and L B is parallel to L D, X C = αx A and X D = βx B. Additionally, for parallel directions length ratios are preserved, so α β = κ λ. The pair of constraints becomes X T A X A λ 2 X T B X B = 0 α 2 X T A X A α 2 λ 2 X T B X B = 0 These two equations are linearly dependent and only one constraint on is obtained Co-Planar or Parallel Plane Line Segments. Any number of constraints originating with coplanar line segments or line segments in parallel planes, as in Fig. 6, provide only two independent constraints on. Geometrically, the constraints fully define metric structure on a plane (Liebowitz and Zisserman, 1998), and indeed on the pencil of parallel planes, but provide no information about other directions. As always with degenerate conditions, care must be taken in implementing the constraints. Theoretically degenerate constraints may appear to provide valid constraints due to measurement noise, and near degenerate conditions provide unstable results The Rigid Link Constraint The relative length constraint can be applied in practice without quantitative knowledge of length ratios. This is the case when the qualitative observation can be made x x A B x E Figure 6. Coplanar or parallel plane line segments provided at most two constraints on. x x F C x D that objects of equal length are present in a scene, determining a unit length ratio constraint. Accordingly, human motion is treated below as the movement of an articulated, rigid link structure. Two cases will be considered (Fig. 7): 1. two fixed general affine cameras with different parameters observing an entire motion sequence, and 2. n pairs of scaled orthographic cameras with changing parameters, each imaging one frame of the motion. In the first case, sequences captured by a pair of unchanging cameras are equivalent to two images of the motion. Each view is regarded as having viewed the entire set of sampled postures from the motion in time in one image. The postures are treated as a static scene, requiring a single reconstruction and rectification transformation. Considered as a static 3D object, the set of postures then contains a number of fixed length objects: each body segment, such as the upper arm, remains a constant length through the set. This is the rigid link constraint. When the cameras are changing, however, the postures cannot be regarded as a static set in a single pair of images. Each frame must be reconstructed and rectified separately. In this case, rigid links are used for each reconstruction as well as between reconstructions to constrain all the rectifications. 6. Motion Capture The application of the rigid link constraint to motion capture is shown here in the context of broadcast footage of sports. In sports broadcasts, interesting events are typically shown from a second camera in the replay, providing two views of the action. The cameras at sports events are also often placed far from the participants, at the edge of the field or mounted on the stands, and use long focal lengths. The affine camera model is thus applicable, and an affine reconstruction can be created directly from matched points. The following sections describe the manual matching process, reconstruction and calibration steps used to rectify this reconstruction using rigid links. The initial method described assumes a pair of fixed cameras. This is a somewhat naive assumption for sports broadcasts, since cameras often pan and zoom while following a player. Despite this, however, a fixed camera assumption is valid in some situations, and can

8 178 Liebowitz and Carlsson 1. Fixed general affine cameras, sequence of n frames Affine structure from factorization 1 2 n Metric structure by rectification 1 2 n rigid link constraint 2. Scaled orthographic cameras with changing parameters 1 2 Affine structure from factorization n (1) (2) (n) 1 Metric structure by rectification 2 n rigid link constraint Figure 7. Factorization and rectification for (1) fixed cameras and (2) changing cameras. also result in realistic animation even when violated. This is followed by a method of jointly calibrating a number of pairs of cameras using rigid links. It is shown that, using the scaled orthography camera model, the change in scale factor resulting from a zooming camera can be eliminated, although rotation between cameras remains unknown Affine Reconstruction The first step in affine reconstruction from two views is to select the corresponding image points defining the body parts of interest. A 15 point model of the human skeleton is used, specifying the dominant joints: shoulders, elbows, hips, knees and ankles, as well as the tips of the feet and the head or neck. An example, two views of a triplejump event, appears in Fig. 8, with the skeleton points shown in two images. This footage is captured by a pair of cameras that are moving with the athlete (probably on tracks). The net effect is that the athlete remains in approximately the same position in the images throughout the sequence. The cameras are thus treated as a pair of fixed affine cameras viewing a motion as if on a treadmill.

Uncalibrated Motion Capture Exploiting Articulated Structure Constraints 179 Figure 8. The triplejump. Frames 13, 26, 30 and 51 of an 93 frame sequence, shown in side view above and front view below.

9 Uncalibrated Motion Capture Exploiting Articulated Structure Constraints 179 Figure 8. The triplejump. Frames 13, 26, 30 and 51 of an 93 frame sequence, shown in side view above and front view below. Manually selected joints are shown in the first frame of each set. Translation of the athlete is lost, but the relative movement of the body parts can be reconstructed. The skeleton points are selected by hand in the images, frame by frame. Two difficulties typically occur. The first is occlusion, either self-occlusion of joints as a body moves, or occlusion from other people or objects in the scene. Self-occlusion is relatively benign, since the position of the occluded joint can often be inferred (by a human operator) from posture and the other view. Occlusion by other objects can be a severe problem, depending on the extent. A number of football players clustered in a goal area, for example, can obscure a significant part of the player of interest. The second difficulty is image dropout. Sequences are digitised from interlaced broadcast video, and are of poor quality. Rapidly moving features, like the hands, often smear in motion blur, making precise joint location difficult. All these features are present in the triplejump sequence. The athlete s hands move very fast and blur considerably in the side view, and the left shoulder, arm and hip are self-occluded in much of that sequence. Humans are, however, good at inferring the position of self-occluded joints, so reasonable guesses can be made about the occluded co-ordinates. Motion blur remains a source of inaccuracy of the order of a few pixels, but this is unavoidable given the nature of the data. Uncertain points in either view are flagged in the clicking process. Having selected the corresponding point pairs, an initial affine reconstruction is computed using the factorization algorithm. Points flagged as uncertain in the selection process can be neglected from this reconstruction. The affine fundamental matrix can then be computed from the cameras and used to recompute the uncertain points by projecting them to the epipolar line defined by the fundamental matrix and the corresponding point in the other view (Shapiro et al., 1995). The system thus tolerates points uncertain in one view, but clearly visible in the other. An affine reconstruction of the triplejump sequence is depicted in Fig. 9. Wireframe models for four frames corresponding to Fig. 8 are shown, with horizontal Figure 9. Four postures from the affine reconstruction of the triplejump sequences. The same four postures are shown from three viewpoints. Compare to the corresponding frames in Fig. 8.

10 180 Liebowitz and Carlsson translation introduced. Note the effect of affine distortion, particularly on the aspect ratio of the body. The affine reconstruction can now be metric rectified using the rigid link constraint Metric Rectification The rigid link constraint is present in two forms in human motion. This first is the constant length of body segments the upper and lower arms and legs and the distance between the hip joints through the motion. There are nine constraints on from each view pair for these particular segments. The second source of constraints is symmetry, the equal length of left and right arm and leg segments, providing four constraints per frame. Constraints are obtained both these sources using (12) with a length ratio of 1, resulting in a constraint matrix of the form in (15). In principal, two frames are sufficient to over constrain, but for interesting motions there are typically tens or hundreds of frames, so is significantly over constrained. Note also that, in applying (12), the rigid link constraints between frames can be written for the reconstruction of a single body segment between any two postures. There are, for a single segment appearing in n frames, ( n 2 ) constraints, of which n 1 are independent. The approach taken here is to write the n 1 constraints for each segment in successive pairs of frames in time. SVD of the constraint matrix and Cholesky decomposition of the resulting estimate of yield a rectification matrix U, and metric 3D structure follows. Metric rectified structure for the triplejump example appears in Fig. 10. The final step in reconstructing the motion is regularization of segment sizes. Since there is noise in the system, the metric humanoid figure does not have exactly fixed length body segments throughout the reconstructed sequence. The model for each frame of the sequence is thus replaced by a fixed size humanoid with median segment lengths while preserving the angles of body segments. The results appear in Fig. 11. Introducing the hard constraint of same length in this ad hoc way gives a reasonable appearance of the 3D motion. This should eventually be replaced by a bundle adjustment process involving a minimisation of reprojection errors imposing the fixed length constraints. Figure 10. Four postures from the metric reconstruction of the triplejump sequences. The same four postures are shown from three viewpoints. Compare to the corresponding frames in Fig. 9. Figure 11. The reconstruction of the triplejump sequence, with fixed length segments fitted to body parts.

Uncalibrated Motion Capture Exploiting Articulated Structure Constraints 181 Figure 12. Five frames from the triplejump, equally spaced in time.

11 Uncalibrated Motion Capture Exploiting Articulated Structure Constraints 181 Figure 12. Five frames from the triplejump, equally spaced in time. In each row, frames from the two input image sequences appear alongside snapshots of an animated generic biped from various viewpoints. Stick figures representations of motion have limited usefullness on their own. A solid skeletal model provides a far more informative representation of the motion. Ultimately, of course, the movement can be mapped to a realistic, textured humanoid model. Accordingly, the motion capture data is exported to a format understood by a commercial animation package, the Character Studio plugin to 3D Studio Max.

12 182 Liebowitz and Carlsson Figure 13. Three images from the animated VRML model. Character Studio is able to take 3D positional motion capture data and convert it to a joint rotation hierarchy using an inverse kinematic system. The motion is mapped onto a generic biped figure, allow further processing including manual editing, frame decimation and animation of a more complex mesh figure. The generic biped triplejumper appears in Fig. 12. The motion data has also been applied to a generic VRML mesh based humanoid, shown in Fig. 13. The humanoid was created using Blaxxun Avatar Studio, a product allowing limited customization of generic humanoid meshes for use in online chat spaces with a simple set of animated gestures. The result is a low polygon mesh suitable for online use, although with limited mobility. The quality of the reconstruction of motion is difficult to evaluate. Clearly, the accuracy of calibrated, marker based, multi-camera systems cannot be replicated. However, when viewed as an animation, the motion appears to be a faithful reproduction of the action. The animation is smooth and duplicates not only the coarse motion of the arms and legs, but also captures some of the subtle elements of the movement, such as the asymmetry in the hip and shouder positions through the jump. Note that degeneracy plays an important role in human motion capture. A pure translation of limbs, for example, leaves them parallel, leading to degenerate constraints. This might occur for a person walking with arms held in a fixed position holding something. Furthermore, when walking or running in a straight line, human arms and legs tend to move roughly in a pair of parallel vertical planes. The calibration constraints from rigid links in the arms and legs provide little information outside of these planes. The use of the constant distance between the hip joints is thus important (the shoulders tend to be too independently mobile to provide a similar constraint) Changing Cameras Given two sequences acquired with changing cameras, the matched points in the pair of images at each time instance should, strictly speaking, be reconstructed separately. The affine reconstructions obtained from each frame pair are related to the world by a general affine transformation, including a similarity component. There is thus an arbitrary rotation, translation and scaling in addition to the affine distortion encoded by the rectification relating the metric rectifications of each frame. It is, however, still possible to write rigid link constraints on the rectification of each reconstruction. These are constraints on the rectifications for each reconstruction, denoted ( j), j = 1,...,m for a pair of sequences of length m. Before going on to describe the constraints, it is worthwhile considering how the fixed camera assumption is violated. The cameras filming sports events like football generally commonly zoom and pan (rotate) to follow a player or the ball. In many situations, the rotation is necessarily limited to keep the player in the field of view, since the camera is some distance away. There is, however, a significant translation in the image co-ordinates of body points. The effect of zooming is to increase the imaged size of the player. Consider the example of Fig. 14, which shows four frames from a sequence of a football goal. The affine, metric and final reconstructions assuming fixed cameras appear in Figs. 15, 16 and 17. Observe that in the intial affine reconstruction there is, in addition to the affine distortion, an overall increase in body size. This is due to the increase in zoom in the two sequences over time. Also, since there is some rotation of the cameras, a rotation of the body from posture to posture is introduced. It is interesting to note that despite this, the final motion reconstruction nevertheless appears to be quite accurate. This is because, while there is considerable zoom in the sequence, the rotation is small. Zoom introduces errors

Uncalibrated Motion Capture Exploiting Articulated

Frames 1, 20, 30 and 40 of a 42 frame sequence, shown

Four postures from the affine reconstruction of the

Metric reconstruction of the football sequence assuming

13 Uncalibrated Motion Capture Exploiting Articulated Structure Constraints 183 Figure 14. A football goal. Frames 1, 20, 30 and 40 of a 42 frame sequence, shown in two views. Figure 15. Four postures from the affine reconstruction of the sequences in Fig. 14 asssuming fixed cameras. Figure 16. Metric reconstruction of the football sequence assuming fixed cameras. Figure 17. Fixed length body segment reconstruction of the football sequence.

14 184 Liebowitz and Carlsson Camera scale factor Camera 2 Camera Frame number Figure 18. Scale factors of the two (scaled orthographic) cameras filming the football sequence. The scales are normalised to the first frame of camera 1. in scale which are ameliorated by both the application of the rigid link constraint to successive frames rather than frames widely spaced in time, and the regularization of structure to fixed length segments. Nevertheless, it will now be shown how the effect of zooming can be eliminated from the images. The approach taken is to jointly calibrate all the pairs of different cameras up to a common global scale factor. Regarding the cameras as scaled orthographies, the camera scale factors are then known. The image points can be scaled accordingly, compensating for the changing zoom. The effects of panning cannot be dealt with in this way, since the orientation between different pairs of cameras is not recovered. This would require some static scene object to be visible. In the changing camera case, the rigid link constraint can be applied in two contexts. Firstly, symmetry constraints that the right and left arm and leg segments are the same length are valid in each reconstruction. There are thus at most four constraints on each ( j) of the form (14). Secondly, the rigid link constraint may be applied across reconstructions: (X j ) T ( j) (X j ) (X k ) T (k) (X k ) = 0 where X j and X k represent the same rigid body segment in reconstructions j and k. This leads to a linear constraint on the vector of elements of both ( j) and (k) ( j) 2 ( j) X ( X j 1 X j ) ( j) ( X j ) 2 ( j) ( X j 1 X j ) ( j) ( X j 2 X j ) ( j) ( X j ) 2 ( j) 3 6 ) 2 (k) ) (k) ) 2 (k) ( X k ( X k 1 X k ( X1 k X 2 k ) (k) ( X2 k X 3 k 2 + ( X2 k ) (k) 5 + ( X3 k 3 ) 2 (k) 6 = 0 (16) The constraint can be written in vector form by constructing a vector of the elements of all the ( j) s ˆΩ v = ( (1) 1, (1) 2, (1) 3, (1) 4,..., (m) 4, (m) 5, (m) ) T 6 and writing the coefficients of these elements in (16) as c ( jk) = ( 0,...,0, c ( j) j) j) 1, c( 2,...,c( 6, 0,...,0, c (k) 1, c(k) 2,...c(k) 6, 0...) T pixels Frame 1 Frame 30 Frame 40 Figure 19. Zoom elimination in images from frames 1, 30 and 40 from the second camera in the football example. The selected image points for frame 1 appear on the left. In the middle and on the right (for frames 30 and 40) are large figures which are plots of the selected image points in those frames, and smaller figures obtained by scaling according to the recovered camera parameters.

15 Uncalibrated Motion Capture Exploiting Articulated Structure Constraints 185 The linear constraint in vector notation is then ( c ( jk) )T ˆΩ v = 0 Up to nine constraints of this form can be written for each pair of reconstructions from the arms, legs and hips. Each ( j) is over constrained, and rectification of each construction can be computed from the SVD of a constraint matrix whose rows are c ( jk) vectors C ˆΩ v = 0 Note that the ( j) s computed are not homogenous matrices in this context. The set of rectifications that are computed are such that the relative lengths of segments in the metric reconstructions are the same, and thus include a scale factor for each reconstruction. In fact it is the vector of elements of all the rectifications in the set, ˆΩ v, that is homogenous. The common scale of this vector is a global scale factor applicable to all the reconstructions. In practice, since there are far fewer constraints on each rectification than is the case for the single rectification with fixed cameras, it is useful to introduce camera parameters. Assuming that the cameras may be modelled as scaled orthographies, four equations on the dual of each rectification ( ( j) ) 1 can be obtained from (8) and (9) for each camera pair. These constraints are linear, but since they are in the inverse rectification, they cannot be simply combined with the rigid link constraints. A solution is to use an iterative method to minimize a cost function over all the rectification affinities U ( j). The cost function has two parts: C = C struc + C cam The first term expresses the orthogonality and equal scale of the camera rows for cameras G ( j) M = ( g j M1, g j ) T M2 = G ( j) ( U ( j)) 1 ( j) and G M = ( g j M1, g j ) T M2 = G ( j) ( U ( j)) 1 and is thus m ( C cam = gm1 j g j ) 2 ( M2 + g j M1 g j ) 2 M2 j=1 + (( g j ) Tg j M1 M2 ) 2 + (( g j M1 ) Tg j ) 2 M2 The second term, C struc, is composed of terms of the form ( U ( j) X j U (k) X k ) 2 for all applicable rigid links in the structure. The minimization can be initialized by a linear solution of ˆΩ v or from the fixed camera method. The set of resulting U ( j) s returns m metric reconstructions with a common scale factor. However, since each initial affine reconstruction is a general affine transformation away from world structure, each has its own unknown Euclidean rectification component. That is, there remains a set of unknown rotations and translations between metric reconstructions of individual frame pairs. But, since the scaling is global to all the reconstructions, the relative scaling between pairs of cameras is known. The effect of a zooming camera can thus be eliminated by scaling the image points in each sequence relative to the first camera. Assuming negligable rotation of the cameras through the sequence, it is possible to return to the fixed camera method having removed the effect of zooming, and thus compute a metric reconstruction of the entire motion. Returning to the football example, consider Fig. 18, a plot of the scale factors of the two cameras calibrate as above. The cameras appear to be zooming in unison, with a ratio of (The fixed ratio between the camera scale factors suggests that there is global control of the zoom on the cameras filming the match.) The selected image points can now be scaled proportionately to negate the effect of the changing scale factor, as in Fig. 19. Applying the fixed camera algorithm to the scaled image data yields the results in Figs Figure 20. Affine reconstruction of the football sequence after zoom elimination. Note that the large scale increase seen in Fig. 15 is gone.

186 Liebowitz and Carlsson Figure 21. Final reconstruction of the football sequence. Figure 22. Six frames from an animation of the biped representation of the football example. 7.

16 186 Liebowitz and Carlsson Figure 21. Final reconstruction of the football sequence. Figure 22. Six frames from an animation of the biped representation of the football example. 7. Conclusion This paper has introduced a new constraint on the rectification of affine reconstructions from a known metric ratio of lengths of a pair of line segments. The constraint has been successfuly applied to motion capture by modelling a moving human as an articulated rigid link structure. Examples of reconstructed motion from uncalibrated broadcast footage of sports events have been presented, with both fixed and changing cameras. A number of extensions are possible. As a system, the motion capture approach would benefit greatly from automated tracking to replace, or complement, the manual selection of joints. Completely automated tracking is unlikely to be completely successful given the quality of some digitised broadcast video. However, even partial tracking in conjunction with operator interaction will significantly speed up the process. The rigid link constraint could also be applied to marker data, where reflective markers are attached to the body to facilitate tracking. A partial solution to the occlusion problem can be found when a section of semi-occluded frames can be omitted from a sequence, but the body can still be reconstructed from the remaining frames. An occluded point, now a known distance from an unoccluded 3D point, can lie anywhere on a sphere centred on the unoccluded point. If the occluded point is visible in one view, the back-projected ray defined by the image point intersects the sphere in two points, one of which is the occluded 3D point. Future work will incorporate this feature in the algorithm.

17 Uncalibrated Motion Capture Exploiting Articulated Structure Constraints 187 Finally, bundle adjustment is desirable to fit a truly rigid linked model to the image data. This is a complex non-linear optimisation problem, also currently under investigation. References Bregler, C. and Malik, J Tracking people with twists and exponential maps. In Proc. IEEE Conference on Computer Vision and Pattern Recognition. Faugeras, O What can be seen in three dimensions with an uncalibrated stereo rig? In Proc. European Conference on Computer Vision, LNCS, vol. 588, Springer-Verlag: Berlin, pp Faugeras, O.D Stratification of three-dimensional vision: Projective, affine, and metric representation. Journal of the Optical Society of America A, 12: Faugeras, O.D. and Maybank, S.J Motion from point matches: Multiplicity of solutions. International Journal of Computer Vision, 4: Hartley, R.I Estimation of relative camera positions for uncalibrated cameras. In Proc. European Conference on Computer Vision, LNCS, vol. 588, Springer-Verlag: Berlin, pp Hartley, R.I. and Zisserman, A Multiple View Geometry in Computer Vision. Cambridge University Press; Cambridge, UK. Koenderink, J.J. and van Doorn, A.J Affine structure from motion. J. Opt. Soc. Am. A, 8(2): Kreysig, E Advanced Engineering Mathematics. John Wiley and Sons: New York. Leventon, M.E. and Freeman, W.T Bayesian estimation of 3-D human motion from an image sequence. Technical Report TR-98-06, MERL. Liebowitz, D. and Zisserman, A Metric rectification for perspective images of planes. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp Liebowitz, D. and Zisserman, A Combining scene and autocalibration constraints. In Proc. 7th International Conference on Computer Vision, Kerkyra, Greece. Maybank, S. and Faugeras, O.D A theory of self-calibration of a moving camera. International Journal of Computer Vision, 8(2): Mundy, J. and Zisserman, A Geometric Invariance in Computer Vision. MIT Press: Cambridge, MA. Poelman, C. and Kanade, T A paraperspective factorization method for shape and motion recovery. In Proc. 3rd European Conference on Computer Vision, Stockholm, vol. 2, pp Pollefeys, M., Van Gool, L., and Oosterlinck, A The modulus constraint: A new constraint for self-calibration. In Proc. International Conference on Pattern Recognition, pp Quan, L Self-calibration of an affine camera from multiple views. International Journal of Computer Vision, 19(1): Reid, I.D. and Murray, D.W Active tracking of foveated feature clusters using affine structure. International Journal of Computer Vision, 18(1): Shapiro, L., Zisserman, A., and Brady, M Motion from point matches using affine epipolar geometry. In Proc. European Conference on Computer Vision, LNCS, vol. 800/801, Springer- Verlag: Berlin. Shapiro, L.S., Zisserman, A., and Brady, M D motion recovery via affine epipolar geometry. International Journal of Computer Vision, 16(2): Taylor, C.J Reconstruction of articulated objects from point correspondences in a single uncalibrated image. In Proc. IEEE Conference on Computer Vision and Pattern Recognition. Tomasi, C. and Kanade, T Shape and motion from image streams under orthography: A factorization approach. International Journal of Computer Vision, 9(2): Webb, J.A. and Aggarwal, J.K Structure from motion of rigid and jointed object. Artificial Intelligence, 19: Weinshall, D. and Tomasi, C Linear and incremental acquisition of invariant shape models from image sequences. In Proc. 4th International Conference on Computer Vision, Berlin, IEEE Computer Society Press: Los Alamitos, CA, pp

calibrated coordinates Linear transformation pixel coordinates

1 calibrated coordinates Linear transformation pixel coordinates 2 Calibration with a rig Uncalibrated epipolar geometry Ambiguities in image formation Stratified reconstruction Autocalibration with partial