3D Tracking Using Two High-Speed Vision Systems

Yoshihiro NAKABO (1), Idaku ISHII (2), Masatoshi ISHIKAWA (3)

(1) University of Tokyo, Tokyo, Japan, nakabo@k2.t.u-tokyo.ac.jp
(2) Tokyo University of Agriculture and Technology, Tokyo, Japan, iishii@cc.tuat.ac.jp
(3) University of Tokyo, Tokyo, Japan, ishikawa@k2.t.u-tokyo.ac.jp

(Y. Nakabo is presently with the Bio-Mimetic Control Research Center, RIKEN, Nagoya, Japan, nakabo@bmc.riken.go.jp.)

Abstract

When considering real-world applications of robot control with visual servoing, both 3D information and a high feedback rate are required. We have developed a 3D target tracking system with a 1 ms feedback rate using two high-speed vision systems called Column Parallel Vision (CPV) systems. To obtain 3D information such as the position, orientation, and shape parameters of the target object, a feature-based algorithm is introduced that uses moment feature values extracted by the vision systems for a spheroidal object model. We also propose a new 3D self windowing method to extract the target in 3D space, which is an extension of the conventional self windowing method in 2D images.

1 Introduction

To control a robot dynamically by direct visual feedback [1, 2], a servo rate of around 1 kHz is required. Conventional vision systems using CCD cameras cannot realize such a fast feedback rate because of the slow transmission rates of video standards. To solve this problem, we developed a 1 ms high-speed vision system called the Column Parallel Vision (CPV) system [3] and demonstrated a high-speed grasping task [4] using this vision system in our previous work. However, our previous system used only one vision system and could not extract 3D information, so we had to assume a constant distance between the camera and the target. In many real applications, 3D information such as the position, motion, orientation, and/or shape of the target is strongly required. In this research, we have developed a 3D tracking algorithm and a system for extracting 3D information within a 1 ms cycle time. Our goal is to apply them to high-speed grasping tasks in the 3D real world (see Fig. 1).

Figure 1: 3D tracking and grasping task. The first and second active vision systems, each carrying a CPV system, perform high-speed target tracking and 2D image feature extraction; feature-based 3D reconstruction provides the object model (position, orientation, shape parameters, and motion) used to decide how the high-speed hand-arm should reach and grasp the target. All processes are executed in a 1 ms cycle time.

We have developed a 3D tracking system using two high-speed vision systems, in which bottleneck-free processing is realized by massively parallel image processing in the CPV systems and feature-based 3D reconstruction in a DSP system. Active vision systems called AVS-II [3] enable fast gaze control to track the target at a 1 ms visual servo rate. We also propose a new method for extracting the target in 3D space, called 3D self windowing, since conventional self windowing [5] operates only in the 2D image plane. In the next section, we describe the details of the proposed algorithms. The system and the implementation of these algorithms are described in Section 3. Experimental results are given in Section 4.

2 3D Tracking Algorithms

2.1 Task-based object model

Our final goal, as shown in Fig. 1, is to catch a target that moves fast and irregularly in 3D space with a dynamically controlled hand-arm using direct feedback of 3D visual information. For this task, the target position and motion in 3D Cartesian coordinates must be obtained for tracking, and the shape parameters and orientation of the target are required for guiding the arm to the target and for deciding the finger trajectories of the hand for preshaping. In this research, we choose a spheroidal object model, whose parameters are the centroid, the radial length of rotation, and the direction and length of the rotation axis. Using these parameters, the approach of the arm can be determined from the direction of the rotation axis, and preshaping can be organized from the size and length of the spheroid in each direction. These parameters contain sufficient information for our task, so we can focus on the algorithms for extracting them using two high-speed vision systems.

2.2 Reconstruction of elliptical shape model (in 2D)

In this section, we introduce the moment feature values and compute the parameters of an elliptical shape model from them. Let [u, v]^T be the image coordinates and I(u, v) be an input image. The (i+j)-th order moments m_{ij} are defined as:

m_{ij} = \sum_{u,v} u^i v^j I(u, v)    (1)

These moment feature values are often used in visual servoing. In our vision system, we can extract them at high speed, as shown in Section 3. The center of gravity (\bar{u}, \bar{v}), the variances \sigma_u^2, \sigma_v^2, and the covariance C_{uv} of an image pattern are calculated from the moments as:

\bar{u} = m_{10}/m_{00}, \quad \bar{v} = m_{01}/m_{00}    (2)
\sigma_u^2 = m_{20} - \bar{u}^2 m_{00}    (3)
\sigma_v^2 = m_{02} - \bar{v}^2 m_{00}    (4)
C_{uv} = m_{11} - \bar{u}\bar{v} m_{00}    (5)

The parameters of the elliptical shape model are computed from these moment feature values. First, we consider the basic ellipse pattern described by:

u^2/a_e^2 + v^2/b_e^2 \le 1,    (6)

where a_e and b_e are the lengths of the major and minor (a_e > b_e) axes of the ellipse. We obtain the general ellipse S by rotating the basic ellipse (6) by \theta_e and translating it by [u_e, v_e]^T, so that a point [u', v']^T of the basic ellipse maps to:

\begin{bmatrix} u \\ v \end{bmatrix} =
\begin{bmatrix} \cos\theta_e & -\sin\theta_e \\ \sin\theta_e & \cos\theta_e \end{bmatrix}
\begin{bmatrix} u' \\ v' \end{bmatrix} +
\begin{bmatrix} u_e \\ v_e \end{bmatrix}    (7)

We consider a binary image, where I(u, v) = 1 inside the ellipse and I(u, v) = 0 outside, and calculate the moment feature values of the ellipse pattern S. Clearly, the center of the ellipse is u_e = \bar{u}, v_e = \bar{v}. The second-order moments are:

\sigma_u^2 = \iint_S u^2\,du\,dv = \frac{a_e b_e \pi}{4}(a_e^2 \cos^2\theta_e + b_e^2 \sin^2\theta_e)    (8)
\sigma_v^2 = \iint_S v^2\,du\,dv = \frac{a_e b_e \pi}{4}(a_e^2 \sin^2\theta_e + b_e^2 \cos^2\theta_e)    (9)
C_{uv} = \iint_S uv\,du\,dv = \frac{a_e b_e \pi}{8}(a_e^2 - b_e^2)\sin 2\theta_e

From these equations, \theta_e is calculated as:

\theta_e = \frac{1}{2}\arctan\!\left(\frac{2C_{uv}}{\sigma_u^2 - \sigma_v^2}\right)    (10)

Solving (8) and (9), a_e and b_e are obtained as:

a_e = \left[\frac{16\,(\sigma_u^2\cos^2\theta_e - \sigma_v^2\sin^2\theta_e)^3}{\pi^2 \cos^2 2\theta_e\,(\sigma_v^2\cos^2\theta_e - \sigma_u^2\sin^2\theta_e)}\right]^{1/8}

b_e = \left[\frac{16\,(\sigma_v^2\cos^2\theta_e - \sigma_u^2\sin^2\theta_e)^3}{\pi^2 \cos^2 2\theta_e\,(\sigma_u^2\cos^2\theta_e - \sigma_v^2\sin^2\theta_e)}\right]^{1/8}

We have now shown how to calculate all the parameters of the ellipse from the moment feature values.
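To make the computation in Section 2.2 concrete, here is a minimal NumPy sketch that recovers the ellipse parameters of a binary pattern from its image moments using Eqs. (1)-(10) and the expressions for a_e and b_e above. It is only an illustration: the function name and the array-indexing convention are assumptions, and in the actual system the moments are accumulated by the column-parallel summation circuit of the CPV system, not on a host CPU.

```python
import numpy as np

def ellipse_from_moments(img):
    """Recover (u_e, v_e, theta_e, a_e, b_e) of a binary ellipse pattern from its
    image moments, following Eqs. (1)-(10).  `img` is a 2D array indexed as
    img[v, u], with 1 (or True) inside the target and 0 outside."""
    v, u = np.nonzero(img)                          # pixels where I(u, v) = 1
    u, v = u.astype(float), v.astype(float)
    m00 = float(u.size)                             # 0th order moment
    u_bar, v_bar = u.mean(), v.mean()               # Eq. (2): center of gravity
    su2 = np.sum(u * u) - u_bar**2 * m00            # Eq. (3): variance in u
    sv2 = np.sum(v * v) - v_bar**2 * m00            # Eq. (4): variance in v
    cuv = np.sum(u * v) - u_bar * v_bar * m00       # Eq. (5): covariance
    theta = 0.5 * np.arctan2(2.0 * cuv, su2 - sv2)  # Eq. (10), atan2 for quadrant safety
    c, s = np.cos(theta), np.sin(theta)
    A = su2 * c**2 - sv2 * s**2                     # = (pi a^3 b / 4) cos(2 theta)
    B = sv2 * c**2 - su2 * s**2                     # = (pi a b^3 / 4) cos(2 theta)
    cos2t = np.cos(2.0 * theta)
    a_e = (16.0 * A**3 / (np.pi**2 * cos2t**2 * B)) ** 0.125
    b_e = (16.0 * B**3 / (np.pi**2 * cos2t**2 * A)) ** 0.125
    return u_bar, v_bar, theta, a_e, b_e
```

On a synthetic axis-aligned ellipse this sketch returns \theta_e \approx 0 and approximately the generating axis lengths, which is a convenient sanity check of the formulas; note that the closed-form expressions degenerate as \theta_e approaches 45 degrees, where \cos 2\theta_e \to 0.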
2.3 Reconstruction of spheroidal shape model (in 3D)

Next, we describe the reconstruction of a spheroidal shape model from the feature values extracted by one vision system. Let x = [x, y, z]^T be the Cartesian coordinates in the object frame. We first consider the basic spheroid, whose center is at the origin and whose axis of symmetry is aligned with the x axis. The spheroid is described as:

x^T \Sigma x = 1, \quad \text{where } \Sigma = \mathrm{diag}(1/a_s^2,\ 1/b_s^2,\ 1/b_s^2).    (11)

We assume the weak-perspective camera model:

\begin{bmatrix} u \\ v \end{bmatrix} = \frac{f}{T_z}\, M (R x + T),    (12)

where M = [\,I_2\ \ 0\,], R is a 3x3 rotation matrix, T = [T_x, T_y, T_z]^T is the translation vector from the object frame to the camera frame, and f denotes the focal length and scale factor of the camera. Let R_i denote the rotation by \theta_i around axis i, and introduce the following equations:

R = R_z R_y R_x    (13)
x' = R_y R_x x    (14)

With these equations, we divide the camera projection into two phases:

1. An orthographic projection from the 3D object frame onto the x'y' plane.
2. A similarity transformation from the x'y' plane to the uv image plane.
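To illustrate the weak-perspective model (12) and the two-phase decomposition just described, the following small sketch projects object-frame points in exactly those two steps. The rotation helpers and the function name are illustrative assumptions, not part of the original system.

```python
import numpy as np

def rot_x(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

def rot_y(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

def weak_perspective(points, theta, T, f):
    """Project Nx3 object-frame points with Eq. (12):
    [u, v]^T = (f / T_z) M (R x + T),  M = [I_2  0],  R = R_z R_y R_x,
    implemented as the two phases described above."""
    theta_x, theta_y, theta_z = theta
    Tx, Ty, Tz = T
    # Phase 1: orthographic projection of x' = R_y R_x x onto the x'y' plane.
    xp = (rot_y(theta_y) @ rot_x(theta_x) @ np.asarray(points, float).T)[:2]
    # Phase 2: similarity transform to the image plane (rotate by theta_z,
    # translate by [T_x, T_y], scale by f / T_z).
    c, s = np.cos(theta_z), np.sin(theta_z)
    Rz2 = np.array([[c, -s], [s, c]])
    uv = (f / Tz) * (Rz2 @ xp + np.array([[Tx], [Ty]]))
    return uv.T
```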

Now we consider the first transformation. First, the relation R_x \Sigma R_x^T = \Sigma shows that the parameter \theta_x need not be known in this case. Substituting equation (14) into equation (11), we obtain:

x'^T \Sigma' x' = 1, \quad \text{where } \Sigma' = R_y \Sigma R_y^T.    (15)

In general, the projection of the spheroid is an ellipse. Here, the ellipse formed by the orthographic projection of the spheroid onto the x'y' plane is the envelope generated by the intersections of the x'y' plane with the tangent planes of the spheroid that are parallel to the z' axis. We now calculate this projection explicitly. The tangent plane at a point x'_0 on the spheroid (15) is described as:

x'^T \Sigma' x'_0 = 1.    (16)

If the plane is parallel to the z' axis, the normal vector of the plane is orthogonal to the z' axis, that is:

x'^T_0 \Sigma' [0, 0, 1]^T = 0.    (17)

Substituting equation (15) into equation (17), we can eliminate z', and rewriting x'_0 \to x', we obtain the following equation, which describes an ellipse on the x'y' plane:

\frac{x'^2}{a_s^2 \cos^2\theta_y + b_s^2 \sin^2\theta_y} + \frac{y'^2}{b_s^2} = 1.    (18)

Now we consider the second transformation. Substituting equations (13) and (14) into equation (12), we obtain:

\begin{bmatrix} u \\ v \end{bmatrix} = \frac{f}{T_z}\left(
\begin{bmatrix} \cos\theta_z & -\sin\theta_z \\ \sin\theta_z & \cos\theta_z \end{bmatrix}
\begin{bmatrix} x' \\ y' \end{bmatrix} +
\begin{bmatrix} T_x \\ T_y \end{bmatrix} \right)    (19)

Finally, comparing equations (18), (19) with (6), (7), we obtain the following relations between the ellipse parameters and the spheroid parameters:

\theta_z = \theta_e
T_x = u_e T_z / f    (20)
T_y = v_e T_z / f    (21)
b_s = b_e T_z / f
\sqrt{a_s^2 \cos^2\theta_y + b_s^2 \sin^2\theta_y} = a_e T_z / f

Note that T_x, T_y, and \theta_z have now been computed, but T_z and \theta_y remain unknown, and \theta_x is irrelevant to the position and orientation of the target.

2.4 Computing position and orientation (in 3D)

In the previous section, we calculated the parameters available from one camera. Now we integrate the information from the two cameras. Let R_b and T_b be, respectively, the rotation matrix and translation vector from the first camera to the second. They are derived from the initial setup of the two cameras and the encoder data of the active vision systems. Let R_i and T_i be the rotation and translation from the target to the i-th camera. Then:

R_b T_1 + T_b = T_2,    (22)
R_b R_1 = R_2.    (23)

Substituting equations (20) and (21) into equation (22), we have:

A \begin{bmatrix} T_{z1} \\ T_{z2} \end{bmatrix} = T_b, \quad \text{where }
A = \begin{bmatrix} -R_b \begin{bmatrix} u_{e1}/f \\ v_{e1}/f \\ 1 \end{bmatrix} &
\begin{bmatrix} u_{e2}/f \\ v_{e2}/f \\ 1 \end{bmatrix} \end{bmatrix}.

We solve these equations by minimizing the least-squares error of the solution:

[T_{z1}, T_{z2}]^T = A^+ T_b = (A^T A)^{-1} A^T T_b.

Next, we compute the parameter \theta_y. Suppose R_i = R_{zi} R_{yi} R_{xi}. Multiplying equation (23) from the right by the vector [1, 0, 0]^T eliminates R_{xi}. Finally we obtain:

n \cos\theta_{y1} - a \sin\theta_{y1} = [\cos\theta_{y2},\ 0,\ -\sin\theta_{y2}]^T,    (24)

where:

[\,n\ \ o\ \ a\,] = R_{z2}^T R_b R_{z1}.

Solving (24), we obtain:

\theta_{y1} = \arctan(n_2 / a_2)
\theta_{y2} = \arctan\!\left(\frac{a_3 n_2 - a_2 n_3}{a_2 n_1 - a_1 n_2}\right)

All the algorithms for extracting the 3D information from the moment feature values have now been shown.
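The two-view integration of Section 2.4 reduces to a small linear-algebra routine. The sketch below, assuming NumPy and illustrative argument names, solves Eq. (22) for the depths T_z1, T_z2 in the least-squares sense and evaluates the closed-form angles \theta_{y1}, \theta_{y2} from Eq. (24); in the actual system this step runs on the DSP.

```python
import numpy as np

def rot_z(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def integrate_two_views(ue1, ve1, ue2, ve2, theta_z1, theta_z2, Rb, Tb, f):
    """Two-view integration of Sec. 2.4: least-squares depths T_z1, T_z2 from
    Eq. (22) and the out-of-plane angles theta_y1, theta_y2 from Eq. (24).
    (ue_i, ve_i, theta_z_i) are the per-camera ellipse center and tilt;
    Rb, Tb give the pose of the second camera with respect to the first."""
    p1 = np.array([ue1 / f, ve1 / f, 1.0])
    p2 = np.array([ue2 / f, ve2 / f, 1.0])
    A = np.column_stack((-Rb @ p1, p2))                # A [T_z1, T_z2]^T = T_b
    tz1, tz2 = np.linalg.lstsq(A, Tb, rcond=None)[0]   # pseudo-inverse solution

    noa = rot_z(theta_z2).T @ Rb @ rot_z(theta_z1)     # [n o a] = R_z2^T R_b R_z1
    n, a = noa[:, 0], noa[:, 2]
    theta_y1 = np.arctan2(n[1], a[1])                  # theta_y1 = arctan(n_2 / a_2)
    theta_y2 = np.arctan2(a[2] * n[1] - a[1] * n[2],   # theta_y2 from Eq. (24)
                          a[1] * n[0] - a[0] * n[1])
    return tz1, tz2, theta_y1, theta_y2
```

From T_z1 and the relations (20) and (21), the remaining translation components follow as T_x1 = u_e1 T_z1 / f and T_y1 = v_e1 T_z1 / f (and likewise for the second camera).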

Figure 2: Epipolar geometry in 3D self windowing. The possible area of the target pattern in Image 1, known from Image 2, and the possible area of the target in 3D space are bounded by the epipolar pencil through the frontier points; an obstacle outside this area is excluded.

2.5 Conventional self windowing (in 2D)

We now focus on the method for extracting the target pattern from the input images of the vision sensors. In our previous work, we proposed the self windowing method [5], in which the self-windowing mask is created from the target pattern of the previous frame cycle, so that the target can be tracked continuously provided the frame rate of the vision system is high enough. In principle, this method can be applied to the task considered here. However, when an occlusion occurs, it can be detected by an abrupt increase in the area of the target pattern, but the target pattern cannot be distinguished from the obstacle until the two patterns separate again in the images.

2.6 3D extended self windowing

To solve this problem, we propose 3D self windowing, which uses the epipolar geometry of the two vision systems. As shown in Fig. 2, taking the pattern obtained by conventional self windowing as a tentative target pattern, we consider the area enclosed by the pencil of lines through the contour of the tentative pattern and take the intersection of these areas from the two views, which gives the possible area in which the target can lie in 3D space. This area is used as a 3D mask to separate the target object from the obstacle, and the target can be tracked continuously thanks to the high frame rate of the vision systems.

2.7 3D self windowing algorithm

In this section, we describe the 3D self windowing algorithm. Let the index i (i = 1, 2) denote each of the vision systems. We assume binary (1, 0) patterns.

Step 1. Suppose the target pattern T^i_{t-1} of the last frame has been obtained. Let the dilated (D) pattern of T^i_{t-1} be the self window W^i:

W^i = D(T^i_{t-1}).    (25)

Step 2. Create a tentative target pattern P^i by masking the raw image S^i from the vision sensor with the self-windowing mask W^i:

P^i = W^i \cap S^i.    (26)

Step 3. Find the tangent lines l^i_{max} and l^i_{min} of the contour of the pattern P^i passing through the epipole e = [e_u, e_v]^T, and call the tangent points f^i_{max} and f^i_{min} the tentative frontier points. The sets of points L^i_{max} and L^i_{min} on the lines l^i_{max} and l^i_{min} are described as:

c_{max} = \max(c), \quad \text{subject to } L \cap P^i \neq \emptyset    (27)
c_{min} = \min(c), \quad \text{subject to } L \cap P^i \neq \emptyset    (28)
L^i_{max} = \{\, [u, v]^T : |c_{max}(u - e_u) - (v - e_v)| < \epsilon \,\}    (29)
L^i_{min} = \{\, [u, v]^T : |c_{min}(u - e_u) - (v - e_v)| < \epsilon \,\},    (30)

where \epsilon is a sufficiently small value and, in (27) and (28), L denotes the set defined as in (29) for slope c. The tentative frontier points f^i_{max} and f^i_{min} are chosen from the sets of points F^i_{max} and F^i_{min}:

F^i_{max} = L^i_{max} \cap P^i    (31)
F^i_{min} = L^i_{min} \cap P^i.    (32)

Step 4. Exchange the tentative frontier points between the two vision systems and calculate the epipolar lines m^i_{max} and m^i_{min} from these points as:

[m^{jT}_{max}, 1]^T = F [f^{iT}_{max}, 1]^T, \quad (i \neq j),

where the 3x3 matrix F is the fundamental matrix calculated from R_b, T_b, and the known camera parameters.

Step 5. Although it cannot be guaranteed that the calculated epipolar lines m^i_{max} and m^i_{min} pass through the true frontier points, the target object must lie inside the area M^i clipped by these epipolar lines. Using M^i as the mask for extracting the target, T^i_t is given by:

T^i_t = P^i \cap M^i.    (33)

In conventional self windowing, the algorithm stops at step 2, with T^i_t = P^i. In the proposed algorithm, the mask M^i in (33) narrows the possible area of the target, so that occlusion-free recognition is realized.
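The Boolean structure of steps 1, 2, and 5 can be summarized in a few lines. The following host-side NumPy/SciPy sketch is only an illustration of those masking operations (steps 3 and 4, which build the epipolar mask M from the other camera's frontier points, are omitted, and the dilation amount is an arbitrary choice); in the real system the same operations run pixel-parallel on the CPV processing-element array.

```python
import numpy as np
from scipy.ndimage import binary_dilation

def self_window_update(T_prev, S_raw, M_epipolar=None, dilate_iters=2):
    """Steps 1, 2 and 5 of the 3D self windowing algorithm for one camera.
    T_prev     : binary target pattern of the previous frame, T_{t-1}
    S_raw      : binary input image of the current frame, S
    M_epipolar : optional 3D self-window mask M built in steps 3-4 from the
                 other camera's frontier points (None = conventional 2D case)."""
    W = binary_dilation(T_prev, iterations=dilate_iters)   # Eq. (25): W = D(T_{t-1})
    P = W & np.asarray(S_raw, bool)                        # Eq. (26): P = W ∩ S
    if M_epipolar is None:
        return P                              # conventional self windowing: T_t = P
    return P & np.asarray(M_epipolar, bool)   # Eq. (33): T_t = P ∩ M
```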
3 System and Implementation

3.1 Dual CPV systems and DSP system

In the following, we briefly describe our system and demonstrate the high-speed computation of the proposed algorithms.

Figure 3: CPV system. The image is input on a 128 x 128 photodetector (PD) array, digitized by 128 column-parallel 8-bit ADCs, and transferred column-parallel to a 128 x 128 processing-element (PE) array with a controller and a summation circuit; column-parallel data I/O sends the extracted image features to the DSP network. Cycle time: 1 ms.

Table 1: Processing time on the CPV system

processing contents            | steps   | time
2D self windowing              | 7       | 2.3 µs
search frontier points (max)   | 240 x 2 | 158.4 µs
3D self window mask            | 130     | 42.9 µs
0th order moment (m_00)        | 36      | 11.9 µs
1st order moments (m_10, m_01) | 39 x 2  | 25.7 µs
2nd order moments (m_20, m_02) | 120 x 2 | 79.2 µs
2nd order moment (m_11)        | 315     | 104.0 µs
total                          | 1286    | 424.4 µs

Figure 4: Block diagram of the 3D tracking system. Each CPV system (CPV-1, CPV-2) receives the object image from its active vision system (AVS-II-1 or AVS-II-2), exchanges 3D self-window mask parameters with the other CPV system, and sends moment feature values to the parallel DSP, which performs the 3D reconstruction of the object model (shape parameters, position, and orientation) and the servo control of both active vision systems. Cycle time: 1 ms.

Figure 5: Photo of the experimental setup, showing the first CPV and AVS (left camera), the second CPV and AVS (right camera), the target object, and the obstacle.

The system consists of two independent vision systems and a DSP system. Each vision system is a CPV system [3], which has 128 x 128 photodetectors, an all-pixel parallel processing array based on the S3PE architecture, and a dedicated summation circuit for calculating moment values, as shown in Fig. 3. The architecture of the CPV system is optimized for high-speed visual feedback, so it realizes a 1 ms feedback rate while executing various kinds of image processing algorithms. The vision sensor of each CPV system is mounted on an active vision system called AVS-II, which enables high-speed gaze control for tracking the target independently. In the DSP system, parallel DSPs (TI C6721 x 4) are used for PD servo control of both actuators of each AVS-II and for integrating the feature values from the two vision systems. The block diagram of the entire system is shown in Fig. 4.

3.2 Implementation of algorithms

The 3D self windowing algorithm frequently applies identical operations to large sets of pixels. Such operations can be executed extremely fast by pixel-parallel processing in the CPV system. For example, the input image is first binarized at each pixel, and the operations in (25), (26), and (31)-(33) are executed in a few steps. There are also search processes in (27)-(30), but only a limited range of parameters needs to be searched, because the frame rate is sufficiently high. The processing time of each part of the proposed 3D tracking on the CPV system is shown in Table 1. Note that the total processing time is less than half a millisecond, well within the 1 ms cycle time of the tracking. The calculation of the epipolar geometry and the 3D reconstruction in the DSP system, and the exchange of feature values between the CPV systems and the DSP system [3], are also sufficiently fast that, in total, bottleneck-free processing for the goal task is realized in our system.
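To show how the pieces fit into the 1 ms cycle, here is a schematic, host-side outline of one tracking cycle that strings together the sketches given earlier (self_window_update, ellipse_from_moments, integrate_two_views). The camera and DSP interfaces (grab_binary, epipolar_mask, servo_towards) are hypothetical names used only to indicate where each stage of Fig. 4 would run; in the real system the per-pixel work runs on the two CPV systems and the integration and servo control run on the parallel DSP.

```python
def tracking_cycle(cams, dsp, f, Rb, Tb):
    """Schematic outline of one 1 ms tracking cycle (illustration only)."""
    feats = []
    for cam in cams:                                    # done in parallel on CPV-1 / CPV-2
        S = cam.grab_binary()                           # binarized 128x128 input image
        cam.target = self_window_update(cam.target, S, cam.epipolar_mask)  # Secs. 2.5-2.7
        feats.append(ellipse_from_moments(cam.target))  # Sec. 2.2 (summation circuit)
    (u1, v1, t1, a1, b1), (u2, v2, t2, a2, b2) = feats
    tz1, tz2, ty1, ty2 = integrate_two_views(u1, v1, u2, v2, t1, t2, Rb, Tb, f)  # Sec. 2.4
    # Remaining spheroid parameters follow from Sec. 2.3, e.g. T_x = u_e T_z / f,
    # T_y = v_e T_z / f and b_s = b_e T_z / f for each camera.
    dsp.servo_towards(u1, v1, u2, v2)                   # gaze (PD) servo update on AVS-II
    return tz1, tz2, ty1, ty2
```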

4 Experimental Results

The experimental setup is shown in Fig. 5. The distance between the two vision systems is 100 cm, and the distance from the cameras to the object is about 80 cm. The size of the spheroid is 20 cm by 10 cm. Although the target and the obstacle overlap in the image in Fig. 5, the left camera shows them to be separated.

First, the results of the calculation of the ellipse parameters are shown in Fig. 6. The left-hand image in the figure is the binarized input image, and the right-hand image is the ellipse reconstructed from the parameters calculated by the proposed algorithm; the shape appears to be well estimated.

Figure 6: Results of reconstruction of the ellipse. Left: binarized input image. Right: ellipse reconstructed from the moment feature values extracted by the CPV system (0th order moment m_00 = 1222; center (57.2, 54.7); tilt θ = -0.669 rad = -38.3 deg; fitted axis lengths of 14.1 and 27.6 pixels).

Next, the results of the 3D self windowing are shown in Fig. 7. Shown on the left are the trajectories of the target. Without the 3D self-window mask, the trajectory is biased to the left and downwards by the disturbing obstacle. The images on the right are the results of 3D masking, in which only the target patterns are extracted.

Figure 7: Results of 3D self windowing. Left: target trajectories from the left image (in pixels) without the obstacle, with the obstacle, and with the 3D self-window mask. Right: left and right camera images with the obstacle masked out.

Last, an example of the result of the 3D reconstruction is shown in Table 2 and Fig. 8. Some of the parameters are calculated close to the true values, but some (for example θ_y) are not sufficiently accurate even for a task such as rough grasping of the target by the hand-arm. This is probably caused by inaccurate calibration of the initial rotations of the camera coordinate frames. A demonstration of the target tracking with two cameras can be viewed in a video clip on the CD-ROM of the proceedings.

Table 2: Results of 3D reconstruction of the spheroid

parameter            | calculated | truth  | error rate
distance to CPV-1    | 71 cm      | 65 cm  | 9.2%
distance to CPV-2    | 60 cm      | 54 cm  | 11.1%
rotation θ_z         | 42 deg     | 40 deg | 5.0%
rotation θ_y         | 28 deg     | 40 deg | 30.0%
length of axis a_s   | 21 cm      | 20 cm  | 5.0%
length of radius b_s | 10 cm      | 10 cm  | 0.0%

Figure 8: Result of the 3D reconstruction shown in a 3D graph: the reconstructed position and direction of the spheroid relative to the left and right cameras (axes in meters).

5 Conclusion

A 3D tracking system consisting of two CPV systems and a DSP system has been developed, and a moment-feature-based 3D reconstruction algorithm and a new 3D self windowing method have been introduced. The processing time of the 3D tracking is less than 1 ms, which is the speed required for real-time, dynamic control of the robot. The accuracy of the present system will be improved by more accurate calibration of the camera coordinate frames.

References

[1] K. Hashimoto, editor. Visual Servoing. World Scientific, 1993.

[2] S. Hutchinson, G. D. Hager, and P. I. Corke. A Tutorial on Visual Servo Control. IEEE Trans. on Robotics and Automation, Vol. 12, No. 5, pp. 651-670, 1996.

[3] Y. Nakabo, M. Ishikawa, H. Toyoda, and S. Mizuno. 1ms Column Parallel Vision System and its Application of High Speed Tracking. In Proc. IEEE Int. Conf. on Robotics and Automation, pp. 650-655, 2000.

[4] A. Namiki, Y. Nakabo, I. Ishii, and M. Ishikawa. 1-ms Sensory-Motor Fusion System. IEEE/ASME Trans. on Mechatronics, Vol. 5, No. 3, pp. 244-252, 2000.

[5] I. Ishii, Y. Nakabo, and M. Ishikawa. Tracking Algorithm for 1ms Visual Feedback System Using Massively Parallel Processing Vision. In Proc. IEEE Int. Conf. on Robotics and Automation, pp. 2309-2314, 1996.