PRELIMINARY RESULTS ON REAL-TIME 3D FEATURE-BASED TRACKER¹

Tak-keung CHENG (derek@cs.mu.oz.au)
Leslie KITCHEN (ljk@cs.mu.oz.au)

Computer Vision and Pattern Recognition Laboratory, Department of Computer Science, University of Melbourne, Parkville, Victoria, Australia 3052

ABSTRACT

We present some preliminary results on a system for tracking 3D motion using input from a calibrated stereo pair of video cameras, and on a new stereo camera calibration technique. The system is organized as a collection of intercommunicating "agents", which can run concurrently, using a cycle of prediction and verification of 3D motion. This system architecture is designed for real-time operation, but as yet only parts of the system are running in real time.

1 INTRODUCTION

Real-time 3D motion tracking is one of the important problems in computer vision research. In general, our approach to 3D motion tracking can be decomposed into four components: motion parameter estimation, feature extraction, prediction-verification, and segmentation. The definition of real time can vary. For us, it means that the system is able to obtain the information it needs in time for that information still to be useful.

Figure 1: 3D Feature-Based Tracker. The left (L) and right (R) camera inputs feed the Camera Calibration, Feature Matching, Stereo Feature, 3D Motion Cluster, and Display agents.

Cooper and Kitchen [1, 2] have developed a 2D motion tracking system which can run in real time. In their system, they subdivide the problem into several subproblems and distribute these subproblems to different agents. These agents run as concurrent processes in a distributed architecture so as to exploit parallelism to minimize the time consumed.

¹ This research was supported by the Australian Research Council, Small Grants Scheme.

The system presented here is a further development of this work, to handle 3D motion. We give a brief review of the 2D motion tracking system in Section 2. The overall design of the proposed system (3DFBT, for "3D Feature-Based Tracker") is shown in Figure 1. This set of agents can be subdivided into two major parts. The first includes the initialization processes: the Calibration and Matching agents. The rest of the system (only partially implemented as yet) is intended to perform the continuous tracking and segmentation processes. The detailed system design for 3DFBT is described in Section 3.

Our system uses stereo views of interest-point features for the tracking. A stereo calibration technique solves for the mapping from 2D to 3D, and vice versa. After this relationship has been established, we can use the 3D point positions directly to do motion prediction and motion parameter estimation. In the rest of this paper, we present the details of the two agents (Calibration, Stereo Feature) which have been implemented, and some preliminary results based on using vector velocity prediction.

2 REVIEW OF 2D FEATURE-BASED TRACKER

The major components of the 2D Feature-Based Tracker (2DFBT) are three separate agents. Figure 2 shows the overall structure of 2DFBT.

Figure 2: 2D Feature-Based Tracker, comprising the FEATURE, CLUSTER, and DISPLAY agents.

Each agent runs on a separate machine, and they communicate via Ethernet. At the moment, FEATURE runs on an IBM-compatible with an Intel 80486/33 CPU, CLUSTER runs on a Silicon Graphics IRIS Indigo, and DISPLAY runs on a Silicon Graphics Personal IRIS. The functions of the agents are as follows:

Feature: This agent extracts interest points (feature points) and uses the "Early Jump-Out Method" [1, 2] to do the feature matching between two images. This agent understands the world only through feature points, in terms of statements such as "feature 99 has position (x, y) and velocity (u, v)". In addition, it can use the current estimate of a feature's image velocity to predict the position of the feature in the new data (next frame); a minimal sketch of this predict/update step is given at the end of this section. In each frame time, the current state of the known features is sent to CLUSTER for further processing.

Cluster: This agent groups the feature points reported by FEATURE into clusters such that the motion of each feature is consistent with rigid 2D motion of the group to which it belongs. It is able to guess the locations and velocities of features that have been lost by FEATURE. Finally, it sends the information to DISPLAY.

Display: The major purpose of this agent is to give a graphical representation of the model being maintained by CLUSTER. Each frame time, CLUSTER sends to DISPLAY a summary of the scene as recorded in CLUSTER's model. DISPLAY presents the features, their velocities, and their group memberships graphically.

The system can track about 70 features at 5 frames per second.
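To make the Feature agent's per-feature state concrete, here is a minimal sketch of the constant-velocity predict/update step described above. It is illustrative only: the names are ours, and the Early Jump-Out matching itself is not shown.

```python
from dataclasses import dataclass

@dataclass
class Feature2D:
    x: float   # image position (pixels)
    y: float
    u: float   # image velocity (pixels per frame)
    v: float

def predict(f: Feature2D) -> tuple[float, float]:
    # Where to look for this feature in the next frame,
    # assuming constant image velocity.
    return f.x + f.u, f.y + f.v

def update(f: Feature2D, x_new: float, y_new: float) -> None:
    # A match was found near the prediction: refresh the
    # velocity estimate, then the position.
    f.u, f.v = x_new - f.x, y_new - f.y
    f.x, f.y = x_new, y_new
```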

3 OVERVIEW OF THE 3D FEATURE-BASED TRACKER

The major components of 3DFBT are shown in Figure 1. This set of agents can be split into two parts: the first comprises the initialization processes, and the second comprises the continuous tracking and segmentation processes. As with 2DFBT, all the agents except the initialization processes run as concurrent processes in a distributed architecture. All the agents communicate via WORLD, the communication program developed by the University of Melbourne Computer Vision and Pattern Recognition Laboratory. Stereo Feature runs on an IBM-compatible with an Intel 80486/33 CPU under OS/2; all the other agents run on SGI Personal IRIS or SGI IRIS Indigo machines. The major functions of the agents are as follows:

Camera Calibration Agent: As its name implies, this agent establishes the relationship between the 2D images and the 3D world. It is the first agent to run in the system. A more detailed description of this agent can be found in Section 4.

Feature Matching Agent: One of the initialization processes, this agent extracts the images from the left and right stereo cameras and detects possible feature points in each image. After the feature points in each image are found, it tries to find all the pairs of corresponding feature points between the left and right images. At the moment, the matching of corresponding feature points is done manually, since the initial correspondence problem is not the main focus of our work.

Stereo Feature Agent: One of the continuous tracking processes, this agent controls the two cameras to extract images of the scene, detects the stereo feature points, and calculates the vector velocity of each feature point. A more detailed description of this agent can be found in Section 5.

3D Motion Cluster Agent: One of the tracking and segmentation processes, this agent maintains all information about the objects known to the system. It receives the 3D vector velocity and 3D position of each feature point from the Stereo Feature agent in each stereo frame time. Initially, the agent assumes all the feature points belong to the same (stationary) object. After it detects movement in a set of feature points, it tries to use a general motion estimation method to determine the motion parameters. Once the motion parameters are found, it tries to find which feature points move consistently with this set of parameters (see the sketch at the end of this section). This agent also uses a prediction/verification cycle to predict the position and motion of the known objects.

Display Agent: The function of this agent is exactly the same as that of Display in 2DFBT: it gives a graphical representation of the model being maintained by the 3D Motion Cluster agent.

In the continuous tracking process, we try to run the system at about 5 frames per second. Because of the overhead of the 2D and 3D information exchange, we plan to track only about 30-35 features at the 5-frames-per-second rate.
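The consistency test applied by the 3D Motion Cluster agent can be sketched as follows, under the assumption that the estimated motion parameters take the form of a rotation R and translation t applied per stereo frame time. This is our illustration of the idea only; the paper does not specify the estimation method beyond "general motion estimation", and the function name and tolerance are ours.

```python
import numpy as np

def consistent_features(prev_pts, curr_pts, R, t, tol=0.05):
    """Indices of features whose observed motion between two stereo
    frames agrees with the rigid motion P' = R @ P + t, to within
    tol (same units as the 3D coordinates, here cm; illustrative)."""
    prev_pts = np.asarray(prev_pts, dtype=float)   # shape (N, 3)
    curr_pts = np.asarray(curr_pts, dtype=float)   # shape (N, 3)
    predicted = prev_pts @ np.asarray(R).T + np.asarray(t)
    err = np.linalg.norm(predicted - curr_pts, axis=1)
    return np.nonzero(err < tol)[0]
```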

4 CALIBRATION

The major purpose of the Calibration agent is to establish the relationship between 3D world coordinates and the corresponding 2D image coordinates. More specifically, it takes as input two sets of corresponding control points from the two cameras and computes two sets of camera parameters. Because we need to calibrate two cameras frequently, we have developed a new calibration technique which calibrates both cameras at the same time. We simply call it Simultaneous Camera Calibration (SCC). This section describes the idea behind this calibration method and some related experimental results. Currently this agent runs as a separate off-line process.

4.1 Pinhole Camera Model

Camera calibration is the process of representing a camera in terms of a mathematical model. The model used here is the pinhole camera model. Let (X, Y, Z) be an object point in a real-world coordinate system, and (x, y) the corresponding image point projected onto the image plane. In general, we can use the two equations below for the calibration process:

x = C_x + f M_x \frac{m_{11}(X - T_x) + m_{12}(Y - T_y) + m_{13}(Z - T_z)}{m_{31}(X - T_x) + m_{32}(Y - T_y) + m_{33}(Z - T_z)}

y = C_y + f M_y \frac{m_{21}(X - T_x) + m_{22}(Y - T_y) + m_{23}(Z - T_z)}{m_{31}(X - T_x) + m_{32}(Y - T_y) + m_{33}(Z - T_z)}

where (T_x, T_y, T_z) are the translation parameters, (m_{ij}) is the rotation matrix of the camera, M_x and M_y are the scale factors, f is the focal length, and (C_x, C_y) are the coordinates of the image center. These are the so-called collinearity equations. A full-scale search is needed to solve these nonlinear equations for the 11 independent camera parameters. The method that establishes the calibration using these two equations is therefore called the Nonlinear Camera Calibration Method [3, 5, 4].
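As a concrete reading of the collinearity equations, here is a minimal sketch of the forward projection, assuming the rotation matrix is given directly; all names are illustrative, not taken from the system.

```python
import numpy as np

def project(P, R, T, f, Mx, My, Cx, Cy):
    """Project a 3D world point P = (X, Y, Z) to image coordinates
    (x, y) via the collinearity equations.

    R: 3x3 rotation matrix (the m_ij); T: translation (Tx, Ty, Tz);
    f: focal length; Mx, My: scale factors; (Cx, Cy): image center.
    """
    # Rotate the point into the camera frame, relative to the camera.
    d = np.asarray(R) @ (np.asarray(P, dtype=float) - np.asarray(T, dtype=float))
    x = Cx + f * Mx * d[0] / d[2]
    y = Cy + f * My * d[1] / d[2]
    return x, y
```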

4.2 Simultaneous Camera Calibration (SCC)

The idea of this method comes directly from the nonlinear camera calibration method. In the case of two stereo cameras, four equations are involved, so 22 camera parameters need to be established (11 for each camera). We could use the Newton-Raphson method to solve this problem. In our system, we simply combine the residuals of the four equations, giving an objective function E to be minimized:

E = \left( C_{x1} + f_1 M_{x1} \frac{m_{11}(X - T_{x1}) + m_{12}(Y - T_{y1}) + m_{13}(Z - T_{z1})}{m_{31}(X - T_{x1}) + m_{32}(Y - T_{y1}) + m_{33}(Z - T_{z1})} - x_1 \right)^2
  + \left( C_{y1} + f_1 M_{y1} \frac{m_{21}(X - T_{x1}) + m_{22}(Y - T_{y1}) + m_{23}(Z - T_{z1})}{m_{31}(X - T_{x1}) + m_{32}(Y - T_{y1}) + m_{33}(Z - T_{z1})} - y_1 \right)^2
  + \left( C_{x2} + f_2 M_{x2} \frac{n_{11}(X - T_{x2}) + n_{12}(Y - T_{y2}) + n_{13}(Z - T_{z2})}{n_{31}(X - T_{x2}) + n_{32}(Y - T_{y2}) + n_{33}(Z - T_{z2})} - x_2 \right)^2
  + \left( C_{y2} + f_2 M_{y2} \frac{n_{21}(X - T_{x2}) + n_{22}(Y - T_{y2}) + n_{23}(Z - T_{z2})}{n_{31}(X - T_{x2}) + n_{32}(Y - T_{y2}) + n_{33}(Z - T_{z2})} - y_2 \right)^2

where (x_i, y_i) are the image coordinates of the control point's projection in camera i, (T_{xi}, T_{yi}, T_{zi}) are the translation parameters for camera i, (m_{ij}) is the rotation matrix for camera 1, (n_{ij}) is the rotation matrix for camera 2, (C_{xi}, C_{yi}) is the image center for camera i, and (M_{xi}, M_{yi}, f_i) are the scale factors and focal length for camera i.

The intended advantage of this function is that we can use the relationship between the 3D coordinates and the two sets of 2D feature coordinates to improve calibration accuracy. With the normal calibration method, we can only calibrate one camera at a time, so for a stereo system we must calibrate the two cameras in two separate processes. This is time-consuming. More importantly, when the two cameras are close together, and hence the baseline is short, the ambiguity in 3D recovery becomes large. This source of error can reduce our experimental accuracy. With the SCC method, we can reduce the dependency between parameters, and hence the ambiguity in their solution. Even if the two cameras are close together, we can still use their dependency in the above function to obtain better parameters and hence a better approximation. Other advantages are that the process start-up time is reduced (one process for two cameras) and that the number of control points required to solve for two cameras remains the same as for a single one. The major disadvantage of this method is that it greatly increases the number of unknowns to be solved for simultaneously. In the following section we present some experimental results and compare them with results obtained from the normal nonlinear calibration method.
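A minimal sketch of this joint minimization follows, assuming each rotation is parameterized by three Euler angles (giving 11 parameters per camera) and using scipy.optimize.least_squares as a stand-in for the Newton-Raphson solver mentioned above. All names are illustrative.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def scc_residuals(params, world_pts, img_pts_left, img_pts_right):
    """Stacked reprojection residuals for both cameras.  params holds
    22 unknowns: for each camera, three rotation angles, (Tx, Ty, Tz),
    then f, Mx, My, Cx, Cy."""
    res = []
    for cam, img_pts in enumerate((img_pts_left, img_pts_right)):
        p = params[cam * 11:(cam + 1) * 11]
        R = Rotation.from_euler('xyz', p[0:3]).as_matrix()
        T, (f, Mx, My, Cx, Cy) = p[3:6], p[6:11]
        for P, (x, y) in zip(world_pts, img_pts):
            d = R @ (np.asarray(P, dtype=float) - T)
            res.append(Cx + f * Mx * d[0] / d[2] - x)
            res.append(Cy + f * My * d[1] / d[2] - y)
    return np.asarray(res)

# Solve for both cameras at once from the 32 stereo control points:
# fit = least_squares(scc_residuals, initial_guess,
#                     args=(world_pts, img_pts_left, img_pts_right))
```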

4.3 Experimental Results

In our experiment, two images are extracted from the left and right stereo cameras. For normal nonlinear calibration, 32 control points are used to calibrate each camera. For SCC, 32 stereo control points are used. The measurement error for the control points is about 0.1 cm. Table 1 shows the two sets of parameters found by SCC, and Table 2 shows the two sets found by the normal calibration method.

Camera  M_x   M_y   C_x     C_y     T_x    T_y    T_z    omega  phi    kappa
Right   4.47  6.62  217.19  218.29  89.45  62.35  21.38  1.62   -0.13  -0.73
Left    4.94  7.31  327.63  236.51  71.62  94.57  21.78  1.64    0.12  -0.83

Table 1: Two sets of camera parameters found by SCC. The distance between the camera and the object is fixed at 110.5 cm.

Camera  M_x   M_y   C_x     C_y     T_x    T_y    T_z    omega  phi    kappa
Right   4.67  6.91  219.72  216.14  93.17  64.61  21.87  1.63   -0.11  -0.76
Left    4.81  7.10  331.22  238.95  69.69  91.96  21.67  1.64    0.12  -0.83

Table 2: Two sets of camera parameters found by the normal calibration method. The distance between the camera and the object is fixed at 110.5 cm. (The last three columns are the rotation angles.)

Given these two sets of parameters, we randomly chose some known 3D points and pairs of 2D points to test both. The RMS error in the image projection of the 3D points is about 0.306 pixels using the parameters found by the normal calibration method, and about 0.317 pixels using the parameters found by our method. The RMS error in the 3D positions calculated by back-projection of the 2D features is about 0.145 cm using the parameters found by the normal calibration method, and about 0.144 cm using the parameters found by our method.

We have also done considerable analysis of the performance of SCC. The results showed that the improvement is not significant. This contradicts our original expectation that using the 3D coordinates and simultaneous 2D feature coordinates from two cameras would improve calibration accuracy. We are still working on this problem, and so do not yet have a final explanation for this unexpected behaviour. However, we suspect it may be caused by the dimensionality of the objective function being too high (20 parameters). If so, the problem might be solved by reducing the dimensionality of the function, or by splitting the parameters into two parts: one part could be solved by a linear method, and the rest by a nonlinear method [6].

5 STEREO FEATURE

The basic structure of this agent is similar to that of the Feature agent in [1, 2]. We use the same feature matching technique (the Early Jump-Out Method) for matching features between two frame times. At the moment, correspondence of image features between the two stereo views is done manually in an initialization process. Therefore, for now, we do not handle new feature points entering the view once the tracking process has started.

After the initial stage, we have a set of corresponding 2D feature positions. We can then use the camera parameters from the Calibration agent to construct a set of 3D feature point positions, as in the sketch below.
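Back-projection of a matched 2D pair to a 3D point can be sketched as a small linear least-squares problem: each collinearity equation rearranges into one linear constraint on (X, Y, Z), so the two views give four constraints on three unknowns. This is our illustration under that formulation, not necessarily the system's exact method; names are illustrative.

```python
import numpy as np

def triangulate(pt_left, pt_right, cam_left, cam_right):
    """Recover the 3D point seen at pt_left and pt_right.  Each camera
    is a dict with keys R (3x3 rotation), T, f, Mx, My, Cx, Cy.
    Rearranging x = Cx + f*Mx * (R[0] @ (P-T)) / (R[2] @ (P-T)) gives
    the linear constraint ((x-Cx)*R[2] - f*Mx*R[0]) @ P = (same) @ T,
    and similarly for y; the four rows are solved by least squares."""
    rows, rhs = [], []
    for (x, y), c in ((pt_left, cam_left), (pt_right, cam_right)):
        R, T = np.asarray(c['R']), np.asarray(c['T'])
        a1 = (x - c['Cx']) * R[2] - c['f'] * c['Mx'] * R[0]
        a2 = (y - c['Cy']) * R[2] - c['f'] * c['My'] * R[1]
        rows += [a1, a2]
        rhs += [a1 @ T, a2 @ T]
    P, *_ = np.linalg.lstsq(np.asarray(rows), np.asarray(rhs), rcond=None)
    return P
```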

When tracking features, the Stereo Feature agent uses the current estimate of the 3D feature velocity (assumed zero initially) to predict the future 3D position of the feature. Given this prediction, we convert the 3D position into the two 2D image positions in the stereo views. Correlation values are then computed in a small region near each predicted location and thresholded to produce a set of possible match points. The mean position of the possible match points is used as the new 2D feature position. From the 2D feature positions in the two new frames, we compute the new 3D position and then update the 3D feature velocity. The updated 3D velocity can then be used to make further predictions, in an ongoing prediction/verification cycle, sketched below.
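Putting the pieces together, one stereo frame of this cycle might look like the following sketch. It reuses the illustrative project and triangulate helpers above; correlate_near stands in for the correlation search and is assumed to return the mean of the thresholded match points, or None when no match survives. The single-view fallback described in Section 5.1 is omitted for brevity.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Feature3D:
    position: np.ndarray   # 3D position (cm)
    velocity: np.ndarray   # 3D velocity (cm per stereo frame)
    misses: int = 0        # consecutive stereo frames not found

def track_one_frame(feat, cam_left, cam_right, img_left, img_right):
    """Advance one tracked feature by one stereo frame time."""
    # 1. Predict the new 3D position from the current velocity.
    predicted = feat.position + feat.velocity
    # 2. Project the prediction into both stereo views.
    guess_l = project(predicted, **cam_left)
    guess_r = project(predicted, **cam_right)
    # 3. Correlation search in a small window near each prediction
    #    (correlate_near is an assumed helper, not shown here).
    obs_l = correlate_near(img_left, guess_l)
    obs_r = correlate_near(img_right, guess_r)
    if obs_l is not None and obs_r is not None:
        # 4. Triangulate the new 3D position, then update the velocity.
        new_pos = triangulate(obs_l, obs_r, cam_left, cam_right)
        feat.velocity = new_pos - feat.position
        feat.position = new_pos
        feat.misses = 0
    else:
        # Feature not verified in either view: trust the prediction.
        feat.position = predicted
        feat.misses += 1   # declared lost after 5 consecutive misses
```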

5.1 Experimental Results

At this stage, we have run the 3DFBT on a number of simple examples, of which the one presented here is typical. The 3DFBT is not yet running in real time, because of some technical difficulties with networking; currently, it completes processing on each stereo pair before requesting the next stereo pair in the sequence.

The motion in this experiment was generated by manually moving a target object 1 mm to the right per stereo frame time, for 20 frames. The camera set-up had been previously calibrated using the SCC method on a separate test object marked with 32 control points. Figure 3 shows a stereo image pair taken during the experiment. The corresponding feature point positions are shown in Table 3. We used about 20 feature points for this experiment, but show only 6 in Table 3.

Figure 3: A stereo pair taken at frame number 10 of the sequence. Features detected are superimposed as small black squares on the top pair of images. The bottom pair of images have superimposed on them the feature positions predicted from frame number 9.

Feature Number  Predicted Position      Actual Position
1               (2.13, -3.57, 21.95)    (2.13, -3.56, 21.95)
2               (-3.64, 3.55, 21.94)    (-3.63, 3.54, 21.93)
3               (-2.16, 0.73, 18.92)    (-2.16, 0.73, 18.91)
4               (0.62, -0.36, 15.54)    (0.62, -0.35, 15.54)
5               (-3.75, 3.29, 12.56)    (-3.74, 3.29, 12.55)
6               (2.27, -3.49, 12.50)    (2.26, -3.48, 12.49)

Table 3: Comparison of the 3D predicted positions and the actual positions found by the system, for tracked features in frame number 10 of the example sequence.

During the experiment, the system may give a wrong prediction for some feature points, or there may be errors in the image input; either can cause the system to fail to detect a feature point in a frame. If the system fails to find the feature in one image (left or right) but not both, it uses the found one as a reference to determine the missing one's position and velocity. If the system cannot find the feature in either image, we assume the prediction is correct, use it as the feature's new position, and update its velocity accordingly. If we cannot find a feature in either image for 5 consecutive stereo frames, the system declares that feature lost. Using this strategy, we can avoid losing features through transient input errors or occlusion.

6 REFERENCES

[1] J. Cooper and L. Kitchen. Multi-agent motion segmentation for real-time task-directed vision. In Australian Joint Conference on Artificial Intelligence, Perth, Australia, November 1990.

[2] J. Cooper and L. Kitchen. A region-based object tracker. In Third National Conference on Robotics, Melbourne, Australia, June 1990.

[3] W. Faig. Calibration of close-range photogrammetry systems: Mathematical formulation. Photogrammetric Engineering and Remote Sensing, 41:1479-1486, 1975.

[4] D. B. Gennery. Stereo-camera calibration. In Proc. Image Understanding Workshop, pages 101-108, 1979.

[5] I. Sobel. On calibrating computer controlled cameras for perceiving 3-D scenes. Artificial Intelligence, 5:185-188, 1974.

[6] Roger Y. Tsai. A versatile camera calibration technique for high-accuracy 3D machine vision metrology using off-the-shelf TV cameras and lenses. IEEE Journal of Robotics and Automation, RA-3(4):323-344, August 1987.