Perception and Action using Multilinear Forms

Anders Heyden, Gunnar Sparr, Kalle Åström
Dept of Mathematics, Lund University
Box 118, S-221 00 Lund, Sweden
email: {heyden,gunnar,kalle}@maths.lth.se

Abstract

In this paper it is shown how multilinear forms can be used in the perception-action cycle. Firstly, these forms can be used to reconstruct an unknown (or partially known) scene from image sequences only. Secondly, from this reconstruction the movement of the camera with respect to the scene can be calculated, which solves the so-called hand-eye calibration problem. Action can then be carried out once this relative orientation is known. The results are that it is sufficient to use either bilinear forms between every successive pair of images plus bilinear forms between every second image, or trilinear forms between successive triplets of images. We also present a robust and accurate method, based on multilinear forms, to obtain reconstruction and hand-eye calibration from a sequence of images taken by uncalibrated cameras. This algorithm requires no initialisation and gives a generic solution in a sense that is clearly specified. Finally, the algorithms are illustrated on real image sequences, showing that the reconstruction can be made and that the hand-eye calibration is obtained.

1 Introduction

The problem of reconstructing an unknown scene from a number of its projective images has been studied by many researchers. Firstly, calibrated cameras were considered, see [13], making it possible to reconstruct the scene up to an unknown similarity transformation (Euclidean plus scale). The drawback of this approach is that the camera needs to be calibrated before the image sequence is captured, and once this has been done it is not possible to change focus or to zoom. Another drawback is that the algebra becomes complicated, because orthogonal matrices have to be used, giving algebraic constraints on each such matrix.

Another approach is to model the projective transformation as an affine transformation, giving an affine reconstruction of the scene, see [21]. This approach gives a very simple algebraic formulation, and the reconstruction and hand-eye calibration can be obtained using linear methods. The drawback is that the affine approximation of the camera is often inaccurate, e.g. when perspective effects are present. When some action has to be carried out, the camera often has to move closer to the scene, giving large perspective effects, which makes this approximation very poor in perception-action tasks. Other similar approaches use orthographic projection, see [21], and paraperspective projection, see [15].

This work has been supported by the Swedish Research Council for Engineering Sciences (TFR), project 95-64-222.

Recently, methods requiring no camera calibration have become popular, see [2,14,19]. These methods are based on projective geometry and give a reasonable algebra. The first result obtained is that it is only possible to reconstruct the object up to an unknown projective transformation, see [18,2,14,5]. This drawback is not always important, especially in perception-action tasks, because this kind of reconstruction, together with the hand-eye calibration, is often sufficient in order to take some action, e.g. obstacle avoidance, positioning, picking up objects, navigation, etc.

The first algorithms used the so-called fundamental matrix, see [2], which is obtained from the bilinear form. Later on, higher order multilinear forms were introduced: trilinearities between triplets of views, see [16,6,3], and quadrilinearities between quadruples of views, see [22,8,3]. It soon became apparent that there was no need to go beyond the quadrilinearities, and after a while it was discovered that the trilinearities were sufficient. Recently, the algebraic and geometric properties of the bilinearities and trilinearities have been explored, see [10], and it has been shown that, both geometrically and algebraically, the bilinearities are sufficient. These multilinear forms can be used both to reconstruct the scene and to obtain the hand-eye calibration.

There has also been intensive research on more robust methods based on the multilinear forms, see [23,20,11]. These iterative methods require no initialisation step and give an optimal reconstruction in a sense that will be specified later. The advantage of these methods is that they use all images and all available point matches with equal priority, i.e. no image is selected as a reference image. They can easily be extended to recursive algorithms, taking more and more images into account as they become available, see [1].

In this paper we present reconstruction methods based on multilinear forms together with iterative robust methods. It is shown, both theoretically and in experiments, that an accurate reconstruction and hand-eye calibration are obtained. In this way the perception-action problem is treated.

2 Camera Model

The image formation system (the camera) is modeled by the equation

$\lambda \mathbf{x} = K [\, R \mid t \,] \mathbf{X}, \qquad K = \begin{bmatrix} f_x & s & x_0 \\ 0 & f_y & y_0 \\ 0 & 0 & 1 \end{bmatrix}.$  (1)

Here $\mathbf{X} = (X, Y, Z, 1)^T$ denotes object coordinates in extended form and $\mathbf{x} = (x, y, 1)^T$ denotes extended image coordinates. The scale factor $\lambda$, called the depth, accounts for perspective effects, and $[\, R \mid t \,]$ represents a rigid transformation of the object, i.e. $R$ denotes a $3 \times 3$ rotation matrix and $t$ a $3 \times 1$ translation vector. Finally, the parameters in $K$ represent intrinsic properties of the image formation system: $f_x$ and $f_y$ represent magnifications in the $x$- and $y$-directions in the light sensitive area, $s$ represents the skew, i.e. nonrectangular pixel arrays can be modelled, and $(x_0, y_0)$ is called the principal point and is interpreted as the orthogonal projection of the focal point onto the image plane. The parameters in $R$ and $t$ are called the extrinsic parameters, and the parameters in $K$ are called the intrinsic parameters.
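As a minimal sketch of the camera model in (1), assuming Python with numpy and purely hypothetical parameter values, the following composes $P = K[\, R \mid t \,]$ from the five intrinsic and six extrinsic parameters and projects an object point, returning the image coordinates and the depth $\lambda$.

```python
# Sketch of eq. (1): P = K [R | t]; all numeric values below are hypothetical.
import numpy as np

def camera_matrix(fx, fy, s, x0, y0, R, t):
    """Compose the 3x4 camera matrix from intrinsic and extrinsic parameters."""
    K = np.array([[fx, s,  x0],
                  [0., fy, y0],
                  [0., 0., 1.]])
    return K @ np.hstack([R, t.reshape(3, 1)])

def project(P, X):
    """Project a 3D point X; return affine image coordinates and the depth."""
    x = P @ np.append(X, 1.0)       # extended image coordinates, up to scale
    return x[:2] / x[2], x[2]       # (x, y) and the depth lambda

R, t = np.eye(3), np.array([0.1, 0.0, 2.0])      # rigid transformation
P = camera_matrix(800., 780., 0.5, 320., 240., R, t)
print(project(P, np.array([0.2, -0.1, 3.0])))
```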

Observe that there are six extrinsic and five intrinsic parameters, eleven in total, the same number as in an arbitrary $3 \times 4$ matrix defined up to a scale factor. If the extrinsic as well as the intrinsic parameters are unknown, (1) can compactly be written

$\lambda \mathbf{x} = P \mathbf{X},$  (2)

where $P = K [\, R \mid t \,]$ denotes the camera matrix. Since there is a freedom in the choice of coordinate system in the object, it is general practice to choose $P_1 = [\, I \mid 0 \,]$, which will be done in the sequel. However, three degrees of freedom remain, since a projective change of coordinates by a matrix of the form $H = \begin{bmatrix} I & 0 \\ v^T & 1 \end{bmatrix}$ does not change $P_1$. This ambiguity in the determination of the camera matrices corresponds to the fact that it is only possible to reconstruct the scene up to an unknown projective transformation. This ambiguity in representation will be called the projective ambiguity. It is furthermore convenient to use the notation

$P_1 = [\, I \mid 0 \,], \quad P_2 = [\, A \mid a \,], \quad P_3 = [\, B \mid b \,]$  (3)

for the first three camera matrices. In the sequel we will assume that we have $n$ points (with known correspondences) in $m$ different images and that the intrinsic parameters are allowed to vary between the different imaging instants. Image coordinates in image $i$ are denoted by $\mathbf{x}_i$ and the camera matrix for image number $i$ is denoted by $P_i$, i.e.

$\lambda_i \mathbf{x}_i = P_i \mathbf{X}, \quad i = 1, \dots, m.$  (4)

3 Multilinear Forms

Consider the equations in (4). These equations can be written

$M \mathbf{u} = 0$  (5)

with

$M = \begin{bmatrix} P_1 & \mathbf{x}_1 & 0 & \cdots & 0 \\ P_2 & 0 & \mathbf{x}_2 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ P_m & 0 & 0 & \cdots & \mathbf{x}_m \end{bmatrix}, \qquad \mathbf{u} = \begin{bmatrix} \mathbf{X} \\ -\lambda_1 \\ \vdots \\ -\lambda_m \end{bmatrix}.$  (6)

Since $M$ has a nontrivial nullspace, it follows that

$\operatorname{rank} M < m + 4.$  (7)
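To make the rank condition (7) concrete, here is a small numerical sketch, assuming numpy and synthetic data: for noise-free projections of a single point, the matrix $M$ of (6) built from three views is $9 \times 7$, yet its rank is only $m + 3 = 6$.

```python
# Sketch of eqs. (5)-(7): M has the nontrivial null vector (X, -lambda_1, ...).
import numpy as np

def build_M(Ps, xs):
    """Stack camera matrices and image points into the 3m x (m+4) matrix M."""
    m = len(Ps)
    M = np.zeros((3 * m, 4 + m))
    for i, (P, x) in enumerate(zip(Ps, xs)):
        M[3 * i:3 * i + 3, :4] = P
        M[3 * i:3 * i + 3, 4 + i] = x        # one image-point column per view
    return M

rng = np.random.default_rng(0)
X = np.append(rng.normal(size=3), 1.0)       # one object point
Ps = [rng.normal(size=(3, 4)) for _ in range(3)]
xs = [P @ X for P in Ps]                     # noise-free projections
print(np.linalg.matrix_rank(build_M(Ps, xs)))  # 6, i.e. m + 3, not full rank 7
```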

The matrix $M$ in (6) contains one block of three rows for each image. Observe that all determinants of $(m+4) \times (m+4)$ submatrices of $M$ in (6) are multihomogeneous of degree $(1, 1, \dots, 1)$, that is, of the same degree in every triplet of image coordinates.

Definition 1. The subdeterminants of size $(m+4) \times (m+4)$ from $M$ in (6) are called the multilinear constraints.

The multilinear constraints obtained from submatrices containing all rows corresponding to two images and one row from each of the other images are called the bilinear constraints. The bilinear constraints between image $i$ and image $j$ can be written as a product of $x$-, $y$- and $z$-coordinates in the other images and

$\det \begin{bmatrix} P_i & \mathbf{x}_i & 0 \\ P_j & 0 & \mathbf{x}_j \end{bmatrix} = 0.$  (8)

Since the first factors consist of projective coordinates, some combination of these projective coordinates has a nonvanishing product, and the bilinear constraints are equivalent to the constraint in (8), which is sometimes called the epipolar constraint. This constraint can be written

$\mathbf{x}_j^T F_{ij} \mathbf{x}_i = 0,$  (9)

where $F_{ij}$ is called the fundamental matrix between images $i$ and $j$. Given at least 8 corresponding points in 2 images, it is possible to solve linearly for the entries of $F_{ij}$ by using (9). Then it is possible to calculate the camera matrices $P_i$ and $P_j$ up to the projective ambiguity.
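As an illustration of (9), the sketch below (assuming numpy and synthetic correspondences) solves linearly for the entries of $F$: each correspondence gives one homogeneous linear equation in the nine entries, and the common null vector is found by SVD. This is the bare linear estimate; in practice one would add coordinate normalisation and enforce rank 2.

```python
# Sketch of the linear solution of eq. (9) from n >= 8 correspondences.
import numpy as np

def estimate_F(x1s, x2s):
    """x1s, x2s: n x 3 arrays of corresponding extended image points."""
    A = np.stack([np.outer(x2, x1).ravel() for x1, x2 in zip(x1s, x2s)])
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1].reshape(3, 3)          # null vector of A, reshaped to F

rng = np.random.default_rng(1)
P1, P2 = rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
Xs = np.hstack([rng.normal(size=(10, 3)), np.ones((10, 1))])
x1s, x2s = Xs @ P1.T, Xs @ P2.T          # synthetic correspondences
F = estimate_F(x1s, x2s)
print(np.abs(np.einsum('ni,ij,nj->n', x2s, F, x1s)).max())   # ~ 0
```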

The multilinear constraints obtained from submatrices containing all rows corresponding to one image, two rows each from two other images and one row from each of the other images are called the trilinear constraints. The trilinear constraints between images $i$, $j$ and $k$ can be written as a product of $x$-, $y$- and $z$-coordinates in the other images and the $7 \times 7$ subdeterminants of

$\begin{bmatrix} P_i & \mathbf{x}_i & 0 & 0 \\ P_j & 0 & \mathbf{x}_j & 0 \\ P_k & 0 & 0 & \mathbf{x}_k \end{bmatrix}.$  (10)

Again the first factors consist of projective coordinates, and some combination of these projective coordinates has a nonvanishing product; thus the trilinear constraints are equivalent to the constraints expressed by the $7 \times 7$ subdeterminants of (10). The trilinear constraints are often expressed using the so-called trilinear tensor $T$, with components defined by

$T_i^{jk} = A_{ji}\, b_k - a_j\, B_{ki},$  (11)

using the notations in (3), see [17,6,8]. There turn out to be four linearly independent (in the image coordinates) trilinear constraints, obtained from the subdeterminants of (10). This makes it possible to obtain four linear constraints on the coefficients of the trilinear tensor for each corresponding point in 3 images. Thus having at least 7 corresponding points in 3 images makes it possible to linearly recover the 27 components of the trilinear tensor, and from them the camera matrices (up to the previously mentioned ambiguity).

The multilinear constraints obtained from submatrices containing two rows corresponding to each of four images and one row from each of the other images are called the quadrilinear constraints. The quadrilinear constraints between images $i$, $j$, $k$ and $l$ can be written as a product of $x$-, $y$- and $z$-coordinates in the other images and the $8 \times 8$ subdeterminants of

$\begin{bmatrix} P_i & \mathbf{x}_i & 0 & 0 & 0 \\ P_j & 0 & \mathbf{x}_j & 0 & 0 \\ P_k & 0 & 0 & \mathbf{x}_k & 0 \\ P_l & 0 & 0 & 0 & \mathbf{x}_l \end{bmatrix}.$  (12)

Again the first factors consist of projective coordinates, and some combination of these projective coordinates has a nonvanishing product; thus the quadrilinear constraints are equivalent to the constraints expressed by the $8 \times 8$ subdeterminants of (12). The quadrilinear constraints are usually expressed using the quadrifocal tensor $Q$, whose components are expressed in the entries of the camera matrices in (3). There turn out to be 16 linearly independent (in the image coordinates) quadrilinear constraints, see [22], obtained from the subdeterminants of (12). This makes it possible to obtain 16 linear constraints on the coefficients of the quadrifocal tensor for each corresponding point in 4 images. Thus having sufficiently many corresponding points in 4 images makes it possible to linearly recover the 81 components of the quadrifocal tensor, and from them the camera matrices (up to the previously mentioned ambiguity).

Because of the large number of components of the trifocal and quadrifocal tensors (27 and 81, respectively), simplifications that reduce these numbers are often needed in calculations. One such simplification is the use of reduced affine coordinates in the images and in the object, see [9]. In this case three corresponding points are chosen as an affine basis in each image, and all other coordinates used are affine coordinates with respect to these three basis points. Using this simplification one obtains the reduced fundamental matrix, the reduced trifocal tensor and the reduced quadrifocal tensor, with correspondingly fewer components; the change to such an affine basis is sketched below.
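The change to such an affine basis is easy to state in code. A small sketch, assuming numpy; the basis points below are hypothetical, while in the experiments of Sec. 6 they are chosen among the extracted corners.

```python
# Sketch of reduced affine image coordinates with respect to three basis points.
import numpy as np

def to_affine_basis(pts, b0, b1, b2):
    """Express 2D points in the affine frame with origin b0, axes b1-b0 and b2-b0."""
    B = np.column_stack([b1 - b0, b2 - b0])     # 2 x 2 basis matrix
    return np.linalg.solve(B, (pts - b0).T).T   # affine coordinates (xi, eta)

pts = np.array([[3., 4.], [5., 2.], [1., 1.]])
print(to_affine_basis(pts, np.array([0., 0.]),
                      np.array([2., 0.]), np.array([0., 2.])))
```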

4 Dependencies between Multilinear Forms

It is obvious that (7) describes all available constraints on the camera matrices in an image sequence. These constraints can be expressed by bilinear, trilinear and quadrilinear constraints; all higher order multilinear constraints are just products of these three types.

Theorem 2. All constraints on the camera matrices in a sequence of images can be expressed by the bilinear, trilinear and quadrilinear constraints between all combinations of 2, 3 and 4 images, respectively.

Obviously, this large number of multilinear constraints is unsatisfactory to work with. In order to reduce the number, assume that the trilinear constraints between every successive triplet of images are known, in the form of the trilinear tensors. Then each successive triplet of camera matrices can be calculated up to the projective ambiguity, and it is then a simple task to adapt the whole sequence of camera matrices to each other so that all multilinear constraints are obeyed.

Theorem 3. All constraints on the camera matrices in a sequence of images can be expressed by the trilinear constraints between every successive triplet of images.

The interesting question now is whether we can reduce the representation further, to bilinear constraints. Introduce the notation

$\hat{v}\, w = v \times w$  (13)

for the skew-symmetric matrix $\hat{v}$ built from a 3-vector $v$. Then the three bilinear constraints between images 1, 2 and 3 can be expressed by

$\mathbf{x}_2^T F_{12} \mathbf{x}_1 = 0, \quad \mathbf{x}_3^T F_{13} \mathbf{x}_1 = 0, \quad \mathbf{x}_3^T F_{23} \mathbf{x}_2 = 0$  (14)

with

$F_{12} = \hat{a} A, \quad F_{13} = \hat{b} B, \quad F_{23} = (\widehat{b - B A^{-1} a})\, B A^{-1},$  (15)

where the notation in (13) and the camera matrices in (3) have been used. Thus from the three bilinearities, $(A, a)$ and $(B, b)$ can each be recovered up to an unknown scale, and since $F_{23}$ involves both pairs, the relative scale between them can be obtained (provided the three camera centres are not collinear). Then the relative scales between $P_1$, $P_2$ and $P_3$ can be calculated, and the only ambiguity left in the determination of the camera matrices $P_1$, $P_2$ and $P_3$ is the projective ambiguity, see also [12]. Note that this relative scale is also naturally obtained from the trilinearities in (10), which can be written

$\hat{\mathbf{x}}_2 \Big( \sum_{i=1}^{3} x_1^i\, T_i \Big) \hat{\mathbf{x}}_3 = 0,$  (16)

where $T_i$ denotes the $3 \times 3$ slice of the trilinear tensor with entries $T_i^{jk}$ and $x_1^i$ the $i$:th coordinate of $\mathbf{x}_1$. This shows that, geometrically, the bilinearities between views $i$ and $i+1$ and between views $i$ and $i+2$ are sufficient to represent the multiple view geometry.

Theorem 4. All constraints on the camera matrices in a sequence of images can be expressed geometrically by the bilinear constraints between images $i$ and $i+1$ and between images $i$ and $i+2$.

Observe that this theorem is true under the hypothesis of general motion, i.e. that three successive camera centres are never collinear.
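The composition in (15) can be checked numerically. The sketch below assumes numpy and random camera data; note that the expression used here for $F_{23}$ assumes that $A$ is invertible.

```python
# Sketch of eq. (15) with P1 = [I | 0], P2 = [A | a], P3 = [B | b].
import numpy as np

def skew(v):
    """The matrix of eq. (13), so that skew(v) @ w == np.cross(v, w)."""
    return np.array([[0., -v[2], v[1]],
                     [v[2], 0., -v[0]],
                     [-v[1], v[0], 0.]])

rng = np.random.default_rng(2)
A, a = rng.normal(size=(3, 3)), rng.normal(size=3)
B, b = rng.normal(size=(3, 3)), rng.normal(size=3)

F12 = skew(a) @ A
F13 = skew(b) @ B
F23 = skew(b - B @ np.linalg.inv(A) @ a) @ B @ np.linalg.inv(A)

Y = rng.normal(size=3)                      # a random object point (Y, 1)
x1, x2, x3 = Y, A @ Y + a, B @ Y + b        # its three projections
print(x2 @ F12 @ x1, x3 @ F13 @ x1, x3 @ F23 @ x2)   # all ~ 0, cf. (14)
```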

Turning to the algebraic point of view, the problem is not so easy. The first difficulty is that the ideal $I_t$ defined by the trilinearities is not the same as the ideal $I_b$ defined by the bilinearities, as observed in [4]. For instance, for three views, three arbitrary points on the trifocal lines (the lines in the images connecting the epipoles of the other two images), one in each image, obey all bilinear constraints but not necessarily all trilinear ones. This corresponds to the fact that the trilinearities can distinguish between different points on the trifocal plane, but the bilinearities can not. However, even if the ideals are different, it is possible to calculate $I_t$ from $I_b$ in the following algebraic sense, see [10].

Theorem 5. The bilinear ideal $I_b$ is reducible and has a primary decomposition $I_b = I_t \cap I_p$, where $I_p$ is the ideal corresponding to an arbitrary point on each trifocal line.

This theorem, together with the previous one, shows that both geometrically and algebraically the bilinearities are sufficient to describe image sequences. However, this is not the best way to treat these problems numerically.

5 Reconstruction using a Subspace Approach

In order to reconstruct a scene from an image sequence, one can use the bilinear (or even trilinear) constraints to build up the sequence of camera matrices and then find the reconstruction by triangulation. A numerically better way is to use all available multilinear constraints in (7) simultaneously. Such a reconstruction algorithm is outlined in this section.

5.1 Notations

Introduce the notation $\mathbf{x}_{ij} = (x_{ij}, y_{ij}, 1)^T$ for the extended coordinates of point number $j$ in image number $i$ and

$X_i = \begin{bmatrix} \mathbf{x}_{i1} & \mathbf{x}_{i2} & \cdots & \mathbf{x}_{in} \end{bmatrix}, \quad i = 1, \dots, m,$  (17)

$X = \begin{bmatrix} \mathbf{X}_1 & \mathbf{X}_2 & \cdots & \mathbf{X}_n \end{bmatrix},$  (18)

where $\mathbf{X}_j = (X_j, Y_j, Z_j, 1)^T$ denotes the extended coordinates of object point number $j$. Describe the depths $\lambda_{ij}$ in (1), of point $j$ in image $i$, by the diagonal matrices

$\Lambda_i = \operatorname{diag}(\lambda_{i1}, \lambda_{i2}, \dots, \lambda_{in}), \quad i = 1, \dots, m.$  (19)

5.2 Subspace Formulation

Using these notations, (4) can be written

$X_i \Lambda_i = P_i X, \quad i = 1, \dots, m.$  (20)
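The notation in (17)-(20) is easy to exercise numerically. A sketch assuming numpy, with synthetic cameras and points chosen so that all depths are positive: with the true depths on the diagonals of $\Lambda_i$, the scaled data satisfy $X_i \Lambda_i = P_i X$ exactly.

```python
# Sketch of eqs. (17)-(20) on synthetic data.
import numpy as np

rng = np.random.default_rng(3)
m, n = 3, 6                                            # images and points
X = np.vstack([rng.uniform(0.5, 2.0, (3, n)), np.ones((1, n))])  # 4 x n object
Ps = [rng.uniform(0.5, 2.0, (3, 4)) for _ in range(m)]

Xi_list, Lam_list = [], []
for P in Ps:
    proj = P @ X                    # 3 x n, unnormalised projections
    lam = proj[2]                   # the depths lambda_ij (positive here)
    Xi_list.append(proj / lam)      # normalised image data X_i
    Lam_list.append(np.diag(lam))   # Lambda_i as in (19)

for Xi, Lam, P in zip(Xi_list, Lam_list, Ps):
    print(np.allclose(Xi @ Lam, P @ X))   # True: eq. (20) holds
```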

Denote the linear subspace of $\mathbb{R}^n$ spanned by the rows of $X_i$ by $D_i$ and the subspace spanned by the rows of $X$ by $S$. Then (20) can be interpreted as

$\Lambda_i(D_i) \subseteq S, \quad i = 1, \dots, m,$  (21)

where the diagonal matrix $\Lambda_i$ is interpreted as an operator that acts on a linear subspace by componentwise multiplication by its diagonal elements. One advantage of the formulation in (21) is that it is independent of the chosen coordinate systems in the images, since $D_i$ is the same subspace for every choice of affine coordinates in image $i$.

The advantage of the notations introduced in (17), (18) and (19) is that (1) can be written, for $m$ images of $n$ points,

$\begin{bmatrix} X_1 \Lambda_1 \\ \vdots \\ X_m \Lambda_m \end{bmatrix} = \begin{bmatrix} P_1 \\ \vdots \\ P_m \end{bmatrix} X.$  (22)

For convenience, introduce the notation $P = [\, P_1^T \ \cdots \ P_m^T \,]^T$ for the stacked camera matrices. Then, multiplying each side of (22) by its transpose and dividing by the number of images, (22) can be written

$\frac{1}{m} \sum_{i=1}^{m} \Lambda_i X_i^T X_i \Lambda_i = X^T \Big( \frac{1}{m} P^T P \Big) X.$  (23)

The corresponding equation in the subspace analogy follows directly from (21):

$\Lambda_1(D_1) + \Lambda_2(D_2) + \cdots + \Lambda_m(D_m) \subseteq S,$  (24)

where $+$ denotes the sum of linear subspaces. Our goal now is to use (22) and (24) to design an algorithm for calculating $\Lambda_i$, $S$ and $X$ from the image data. Then $X$ gives us the reconstruction (structure, perception) and the $\Lambda_i$ give us the camera matrices (motion, obtained from (22); action). Introduce the matrices

$T = \frac{1}{m} \sum_{i=1}^{m} M_i, \quad M_i = \Pi_{\Lambda_i(D_i)},$  (25)

where $\Pi_{\Lambda_i(D_i)}$ denotes the orthogonal projection matrix onto $\Lambda_i(D_i)$. The purpose of the factor $1/m$ in (25) is just to make the size of the entries of $T$ more independent of the number of images. Observe that $M_i$ is independent of the chosen basis for $\Lambda_i(D_i)$: with $W = X_i \Lambda_i$,

$M_i = W^T (W W^T)^{-1} W,$

and replacing $W$ by $CW$ for any nonsingular $3 \times 3$ matrix $C$ leaves $M_i$ unchanged, which implies that $M_i$ is independent of the chosen coordinate system in the image. Now (24) implies that

$\operatorname{rank} T \le 4,$  (26)

since $S$ is a 4-dimensional subspace.
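A sketch of (25)-(26), assuming numpy: $T$ is formed as the average of the orthogonal projections onto the scaled row spaces $\Lambda_i(D_i)$, and with the correct depths only four of its singular values are nonzero.

```python
# Sketch of eqs. (25)-(26): with correct depths, rank T <= 4.
import numpy as np

def row_space_projection(W):
    """Orthogonal projection matrix onto the row space of W."""
    Q = np.linalg.qr(W.T)[0]        # orthonormal basis as columns
    return Q @ Q.T

rng = np.random.default_rng(3)
m, n = 3, 6
X = np.vstack([rng.uniform(0.5, 2.0, (3, n)), np.ones((1, n))])
Ps = [rng.uniform(0.5, 2.0, (3, 4)) for _ in range(m)]
scaled = [P @ X for P in Ps]        # X_i Lambda_i with the true depths

T = sum(row_space_projection(W) for W in scaled) / m
print(np.round(np.linalg.svd(T, compute_uv=False), 6))  # four nonzero values
```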

This can be viewed as another way of describing the multilinear constraints, one which is more suitable for numerical calculations.

Let $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_n$ be the singular values of $T$ and introduce the proximity measure $\rho = \sigma_5$, which measures the degree of 4-dimensionality of the sum of the subspaces. Then the reconstruction problem can be formulated as a variational problem:

$\min_{\Lambda_1, \dots, \Lambda_m} \rho.$  (27)

In the noise-free case the minimum value is equal to 0. When noise is present in the measurements, minimising this variational formula gives a reconstruction that is independent of the chosen coordinate systems and of the ordering of the points. The reconstruction can be obtained from the range of $T$, which is equal to $S$ in the noise-free case. We remark that the proximity measure cannot be interpreted in terms of distances in the images, since distances have no meaning in this coordinate independent approach. Instead, it is an abstract measure of the dimension of a subspace.

5.3 An Algorithm

Reconstruction can be made by an iterative algorithm consisting of the following steps (a sketch in code is given after the remarks below):

1. Start by putting $\Lambda_i = I$, $i = 1, \dots, m$.
2. Calculate $T$ from (25).
3. Calculate the singular value decomposition of $T$, i.e. $T = U \Sigma V^T$, and the proximity measure $\rho$. If $\rho$ is sufficiently small, stop.
4. Let $\hat{X}$ denote the first four rows of $U^T$, which will be used as an approximation of the object.
5. Use (20) to estimate $\Lambda_i$ from $\hat{X}$ and $X_i$.
6. Go to 2.

The criterion on the proximity measure used for terminating has to be chosen appropriately. Step 5 above can be done in a way that gives a result independent of both the coordinate systems and the ordering of images and points. Let $\Pi$ and $\Pi^{\perp}$ denote the projection matrices onto the row space of $\hat{X}$ and onto its orthogonal complement, respectively. Then we use (21) in the form

$\min_{\Lambda_i} \| X_i \Lambda_i \Pi^{\perp} \|,$  (28)

where the rows of $\hat{X}$ form an orthonormal basis for the current estimate of $S$, so that $\Pi = \hat{X}^T \hat{X}$. For details, see [11].

We remark that the main difference between this algorithm and the one presented in [23] is that the former gives a result that is independent of both the ordering of images and points and the choice of coordinate systems in the images, whereas the latter does not. This is due to the use in [23] of Euclidean distances between corresponding points and epipolar lines, as well as an asymmetry in the use of the different images.
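Below is a compact sketch of the whole algorithm, assuming numpy. The depth update in step 5 is one natural reading of (28): each diagonal of $\Lambda_i$ is chosen, up to scale, as the eigenvector minimising the part of $\Lambda_i(D_i)$ falling outside the estimated object subspace; the exact scheme of [11] may differ in detail.

```python
# Sketch of the iterative subspace algorithm of Sec. 5.3.
import numpy as np

def reconstruct(Xi_list, iters=100, tol=1e-12):
    """Xi_list: the 3 x n normalised image-data matrices X_i of eq. (17)."""
    m, n = len(Xi_list), Xi_list[0].shape[1]
    lams = [np.ones(n) for _ in range(m)]            # step 1: all depths = 1
    for _ in range(iters):
        T = np.zeros((n, n))                         # step 2: T as in (25)
        for Xi, lam in zip(Xi_list, lams):
            Q = np.linalg.qr((Xi * lam).T)[0]        # basis of Lambda_i(D_i)
            T += Q @ Q.T / m
        U, s, _ = np.linalg.svd(T)                   # step 3: SVD of T
        Xhat = U[:, :4].T                            # step 4: object estimate
        rho = s[4]                                   # proximity measure sigma_5
        if rho < tol:
            break
        Pperp = np.eye(n) - U[:, :4] @ U[:, :4].T    # projector onto complement
        for i, Xi in enumerate(Xi_list):             # step 5: depths from (28)
            G = np.zeros((n, n))
            for r in range(3):                       # one row of X_i at a time
                D = np.diag(Xi[r])
                G += D @ Pperp @ D
            lams[i] = np.linalg.eigh(G)[1][:, 0]     # smallest eigenvector of G
    return Xhat, lams, rho
```

By (27), the returned proximity measure should approach zero on noise-free data such as that of the previous sketches, and the rows of the returned object estimate then span the same subspace as the rows of the true $X$.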

Figure 1. Illustration of the trilinear approach. a)-i) show nine images of a simple scene. In j), some of the extracted points are shown in the affine coordinate system defined by three basis points. In k), the reconstructed camera motion is shown together with a reconstruction of the extracted corner points. l) highlights the reconstructed object.

6 Experiments

In this section two different experiments are presented.

Reconstruction using Trilinear Forms. We illustrate the trilinear approach with an experiment in which nine images have been taken of a simple scene, shown in Fig. 1. Some of the corners are extracted, and three corners, one from each box on the floor, are used as an affine basis in each image in order to simplify the algebra (see [9,8]). The corner positions have been used to estimate the trilinear forms and, from them, the reconstruction and the motion of the camera. This motion is presented in Fig. 1 together with the reconstruction of some of the corner points.

Reconstruction using the Subspace Approach. Consider the four images of a toy block scene in Figure 2. Corresponding points (corners of blocks) were picked out manually and used as input to the algorithm. The results are shown in Figure 2, together with the obtained reconstruction. In order to display the reconstruction, a Euclidean representative of all the projectively equivalent reconstructions obtained from the algorithm has to be chosen. This has been done using a three-dimensional model of the object and selecting the closest of the projective reconstructions, in a least squares sense. However, this is the only stage where the model has been used.
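One way to sketch such a least squares choice of Euclidean representative, assuming numpy and synthetic data (an illustration, not necessarily the authors' exact procedure), is a DLT-style linear fit of a $4 \times 4$ homography $H$ mapping the projective reconstruction onto the model, $\mathbf{X}^{\mathrm{model}}_j \sim H \mathbf{X}^{\mathrm{proj}}_j$.

```python
# Sketch: fit a 3D homography H between two projectively equivalent point sets.
import numpy as np

def fit_homography_3d(Xp, Xm):
    """Xp, Xm: n x 4 homogeneous points; returns H with Xm ~ H Xp, up to scale."""
    rows = []
    for xp, xm in zip(Xp, Xm):
        for k in range(4):
            for l in range(k + 1, 4):
                r = np.zeros((4, 4))     # xm_k (H xp)_l - xm_l (H xp)_k = 0
                r[l] = xm[k] * xp
                r[k] -= xm[l] * xp
                rows.append(r.ravel())
    _, _, Vt = np.linalg.svd(np.array(rows))
    return Vt[-1].reshape(4, 4)          # null vector, reshaped to H

rng = np.random.default_rng(4)
H_true = rng.normal(size=(4, 4))
Xp = np.hstack([rng.normal(size=(8, 3)), np.ones((8, 1))])
Xm = Xp @ H_true.T                       # "model" points, exactly related
H = fit_homography_3d(Xp, Xm)
H, Ht = H / np.linalg.norm(H), H_true / np.linalg.norm(H_true)
print(np.abs(H * np.sign(np.sum(H * Ht)) - Ht).max())   # ~ 0
```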

Figure 2. Illustration of the performance of the algorithm. Above: four images of the toy block scene. Below: the logarithm of the proximity measure, the estimated standard deviation and the obtained reconstruction.

7 Conclusions

In this paper we have shown that multilinear forms can be used in perception and action tasks in computer vision. Firstly, multilinear forms can be used directly to reconstruct the scene, giving the perception, and to calculate the camera motion, giving a necessary input to action tasks. Secondly, a generic algorithm, based on multilinear forms, for reconstruction and calculation of camera motion has been described. The algorithm is generic in the sense that all corresponding points and all images are treated equally, i.e. no points or images are used as base points or as a reference image, respectively. Furthermore, the algorithm gives a result that is independent of the chosen coordinate systems in the images, which is the natural way to deal with uncalibrated cameras. Although there is no guarantee that this iterative algorithm converges, experiments show that convergence is usually very fast. The algorithms have been illustrated on real images, showing that it is possible to obtain both reconstruction and camera motion. Further investigations would be directed towards specific action tasks, such as picking up an object or positioning.

References

1. Berthilsson, R., Heyden, A., Sparr, G., Recursive Structure and Motion from Image Sequences using Shape and Depth Spaces, to be presented at CVPR 97, 1996.
2. Faugeras, O. D., What can be seen in three dimensions with an uncalibrated stereo rig?, ECCV 92, Lecture Notes in Computer Science, Vol. 588, Ed. G. Sandini, Springer-Verlag, 1992, pp. 563-578.

3. Faugeras, O. D., Mourrain, B., On the geometry and algebra of the point and line correspondences between N images, Proc. ICCV 95, IEEE Computer Society Press, 1995, pp. 951-956.
4. Faugeras, O. D., Mourrain, B., About the correspondences of points between N images, Proc. IEEE Workshop on Representation of Visual Scenes, 1995.
5. Hartley, R. I., Projective Reconstruction and Invariants from Multiple Images, IEEE Trans. Pattern Anal. Machine Intell., vol. 16, no. 10, pp. 1036-1041, 1994.
6. Hartley, R. I., A linear method for reconstruction from lines and points, Proc. ICCV 95, IEEE Computer Society Press, 1995, pp. 882-887.
7. Heyden, A., Reconstruction and Prediction from Three Images of Uncalibrated Cameras, Proc. 9th Scandinavian Conference on Image Analysis, Ed. Gunilla Borgefors, Uppsala, Sweden, 1995, pp. 57-66.
8. Heyden, A., Reconstruction from Image Sequences by means of Relative Depths, Proc. ICCV 95, IEEE Computer Society Press, 1995, pp. 1058-1063. An extended version to appear in IJCV, International Journal of Computer Vision, 1996.
9. Heyden, A., Åström, K., A Canonical Framework for Sequences of Images, Proc. IEEE Workshop on Representation of Visual Scenes, 1995.
10. Heyden, A., Åström, K., Algebraic Varieties in Multiple View Geometry, ECCV 96, Lecture Notes in Computer Science, Vol. 1065, Ed. B. Buxton, R. Cipolla, Springer-Verlag, 1996, pp. 671-682.
11. Heyden, A., Projective Structure and Motion from Image Sequences using Subspace Methods, submitted to SCIA 97, 1996.
12. Luong, Q.-T., Viéville, T., Canonic Representations for the Geometries of Multiple Projective Views, ECCV 94, Lecture Notes in Computer Science, Vol. 800, Ed. J.-O. Eklundh, Springer-Verlag, 1994, pp. 589-599.
13. Maybank, S., Theory of Reconstruction from Image Motion, Springer-Verlag, Berlin, Heidelberg, New York, 1993.
14. Mohr, R., Arbogast, E., It can be done without camera calibration, Pattern Recognition Letters, vol. 12, no. 1, 1991, pp. 39-43.
15. Poelman, C. J., Kanade, T., A Paraperspective Factorization Method for Shape and Motion Recovery, ECCV 94, Lecture Notes in Computer Science, Vol. 801, Ed. J.-O. Eklundh, Springer-Verlag, 1994, pp. 97-108.
16. Shashua, A., Trilinearity in Visual Recognition by Alignment, ECCV 94, Lecture Notes in Computer Science, Vol. 800, Ed. J.-O. Eklundh, Springer-Verlag, 1994, pp. 479-484.
17. Shashua, A., Werman, M., Trilinearity of Three Perspective Views and its Associated Tensor, Proc. ICCV 95, IEEE Computer Society Press, 1995, pp. 920-925.
18. Sparr, G., An algebraic-analytic method for affine shapes of point configurations, Proc. 7th Scandinavian Conference on Image Analysis, 1991, pp. 274-281.
19. Sparr, G., A Common Framework for Kinetic Depth, Reconstruction and Motion for Deformable Objects, ECCV 94, Lecture Notes in Computer Science, Vol. 801, Ed. J.-O. Eklundh, Springer-Verlag, 1994, pp. 471-482.
20. Sparr, G., Simultaneous Reconstruction of Scene Structure and Camera Locations from Uncalibrated Image Sequences, Proc. 13th International Conference on Pattern Recognition, 1996, pp. 328-333.
21. Tomasi, C., Kanade, T., Shape and Motion from Image Streams under Orthography: A Factorization Method, IJCV, 9(2):137-154, 1992.
22. Triggs, B., Matching Constraints and the Joint Image, Proc. ICCV 95, IEEE Computer Society Press, 1995, pp. 338-343.
23. Sturm, P., Triggs, B., A Factorization Based Algorithm for Multi-Image Projective Structure and Motion, ECCV 96, Lecture Notes in Computer Science, Vol. 1065, Ed. B. Buxton, R. Cipolla, Springer-Verlag, 1996, pp. 709-720.