DYNAMIC STEREO VISION FOR INTERSECTION ASSISTANCE

FISITA 2008 World Automotive Congress, Munich, Germany, September 2008.

1 Franke, Uwe*, 2 Rabe, Clemens, 1 Gehrig, Stefan, 3 Badino, Hernan, 1 Barth, Alexander
1 Daimler AG, Group Research, Germany, 2 University of Kiel, 3 University of Frankfurt

KEYWORDS
environment perception, driver assistance, intersection assistance, stereo vision, space-time stereo.

ABSTRACT
More than one third of all traffic accidents with injuries occur in urban areas, especially at intersections. Therefore, a driver assistance system supporting the driver in cities is highly desirable and has a tremendous potential of reducing the number of collisions at intersections. A suitable system for such complex situations requires a comprehensive understanding of the scene. This implies a precise estimation of the free space and the reliable detection and tracking of other moving traffic participants. Since the goal of accident-free traffic requires a sensor with high spatial and temporal resolution, stereo vision will play an important role in future driver assistance systems. Most known stereo systems concentrate on single image pairs. However, in intelligent vehicle applications image sequences have to be analyzed. The contribution shows that a smart fusion of stereo vision and motion analysis (optical flow) gives much better results than classical frame-by-frame reconstructions. The basic idea is to track points with depth known from stereo vision over two or more consecutive frames and to fuse the spatial and temporal information using Kalman filters. The result is an improved accuracy of the 3D-position and an estimation of the 3D-motion of the considered point at the same time. This approach, called 6D-Vision, enables a detection of moving objects even if they are partially hidden. From static points very accurate occupancy grids are built. A global optimization technique delivers a robust estimation of the free space.
Pixels moving in the world are clustered to objects which are then tracked over time in order to estimate their motion state and to predict their paths. This allows for powerful collision avoidance systems: pedestrians crossing the street are detected before they enter the lane; the same holds for vehicles from the sides which are not detectable by common radar systems. Since we are able to estimate the yaw rate of oncoming traffic, the prediction is not restricted to straight motion but can detect potential collisions with turning traffic, especially at intersections. Urban vision asks for a large field of view. Within the German project AKTIV a fisheye stereo camera system is under development with a field of view of up to 150 degrees. If the 6D-Vision principle is applied to these images, laterally entering vehicles are also detectable.

INTRODUCTION

Stereo vision is a research topic with a long history; see (1) for an overview. For a long time, correlation-based techniques were commonly used. They deliver precise and reliable measurements in real time on a PC or on dedicated hardware. Recently, much progress has been achieved in dense stereo. In particular, the work of Hirschmueller (2) paves the road towards real-time solutions. His so-called Semi-Global Matching algorithms deliver near-optimum solutions at the computational expense of a classical correlation scheme. New subpixel algorithms (3) reduce the distance noise significantly and further push the limits of stereo for a given camera system. Fig. 1 compares the results obtained by a common correlation-based scheme with a modern dense stereo algorithm. The colors encode distance: the warmer the color, the closer the point. The results do not only differ in density. Note the differences in low-contrast areas such as the building and the road surface.

Using stereo vision, the three-dimensional structure of the scene is easily obtained. The standard approach for free space analysis and obstacle detection is as follows: After rectification, the stereo correspondences are computed. Then, all 3D points are projected onto an occupancy grid. In a third step, this grid is segmented and potential obstacles are tracked over time in order to verify their existence and to estimate their motion state. This strategy ignores the strong correlation of successive frames and the information contained within.

This paper describes an efficient exploitation of the correlation in time. It leads to more precise and stable results, and allows estimating the motion state of single image points even before the objects are detected. This "track-before-detect" approach distinguishes between static and moving pixels before any segmentation has been performed. Using static points, very accurate occupancy grids are generated while moving points can be easily grouped.
The paper is organized as follows: First we sketch the problems in stereo vision and show that the uncertainties of occupancy grids are significantly reduced if the stereo information is integrated over time. Then, we introduce a Kalman filter based integration of stereo and optical flow allowing for the direct measurement of 3D-position and 3D-motion of all tracked image points (6D-Vision). The following section describes the motion state estimation of oncoming vehicles at intersections. Finally, we highlight the potential of fisheye cameras for intersection assistance and give results.

Fig. 1: Correlation based stereo (left) vs. dense stereo (right). Red encodes close, green encodes far points. Note the higher density especially in low-contrast areas like the road or the building on the right side.

STEREO VISION AND FREE SPACE ANALYSIS

Given a carefully rectified stereo image pair (i.e. all lens distortions have been corrected and the epipolar lines coincide with the image rows), stereo vision aims to find corresponding features in the left and right images along the epipolar lines. From the disparities, i.e. the image distances between corresponding points, the world position can be easily derived. Nevertheless, the task is not as simple as it sounds: periodic structures can cause false correspondences, hidden points are hard to identify, areas with low or even no contrast are difficult to evaluate, and illumination differences require a robust similarity measure. Besides the mentioned epipolar constraint, other constraints like the ordering constraint, the uniqueness constraint, the smoothness constraint or the recently introduced gravitational constraint (3) help to solve those problems.

Since the relative orientation of a stereo camera system cannot be assumed to be constant over time, a slow on-line calibration is necessary. Recently, Dang (4) proposed a scheme that solves this task robustly.

As mentioned in the introduction, it is common to accumulate all 3D points above ground in a stochastic occupancy grid. Figure 2 shows such a grid obtained for the urban situation considered in the sequel. The origin of the coordinate system is centered in our own vehicle. Our standard stereo camera system has a base line of 30cm and an angle of view of 42 deg. The imagers have VGA resolution.

Fig. 2: Occupancy grids of an urban situation. Left: stereo image pair with enlarged bicyclist. Middle: the stochastic occupancy grid based on a single image pair. Right: the improved accuracy obtained by the temporal integration described in this section. Note the decreased uncertainty especially at larger distances.

It becomes obvious that the uncertainty of stereo depth measurements increases quadratically with distance. Therefore, the bicyclist (enlarged in the left image) at around 60m is highly blurred in the occupancy grid. Free space analysis of those occupancy grids is not very reliable, thus we are looking for strategies to reduce the uncertainty.
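The quadratic growth of the depth uncertainty can be checked with a short sketch. It assumes an ideal pinhole stereo model, Z = f b / d, that the 42 deg angle of view spans the 640 pixel width of a VGA image, and a disparity noise of 0.25 pixels. The last value and all function names are illustrative assumptions, not figures from the paper.

```python
import math


def focal_length_px(image_width_px: float, fov_deg: float) -> float:
    """Focal length in pixels of an ideal pinhole camera with the given horizontal field of view."""
    return 0.5 * image_width_px / math.tan(math.radians(fov_deg) / 2.0)


def depth_from_disparity(disparity_px: float, focal_px: float, baseline_m: float) -> float:
    """Depth Z = f * b / d for a rectified stereo pair."""
    return focal_px * baseline_m / disparity_px


def depth_sigma(depth_m: float, focal_px: float, baseline_m: float,
                disparity_sigma_px: float) -> float:
    """First-order depth standard deviation: |dZ/dd| * sigma_d = Z^2 / (f * b) * sigma_d."""
    return depth_m ** 2 / (focal_px * baseline_m) * disparity_sigma_px


if __name__ == "__main__":
    f = focal_length_px(640, 42.0)   # assumed: 42 deg across 640 pixels
    b = 0.30                         # base line from the text, in metres
    sigma_d = 0.25                   # assumed disparity noise in pixels
    for z in (15.0, 30.0, 60.0):
        d = f * b / z
        print(f"Z = {z:5.1f} m  disparity = {d:5.2f} px  "
              f"sigma_Z = {depth_sigma(z, f, b, sigma_d):5.2f} m")
```

Under these assumptions, doubling the distance quadruples the depth standard deviation, and an object at 60m has a disparity of only about 4 pixels.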

One way to reduce the disparity noise is the tracking of features in the images over multiple frames. If the disparity measurements are uncorrelated, the variance decreases with 1/N, where N is the number of images. The 6D-Vision algorithm described below exploits this fact. Fortunately, tracking becomes redundant in static scenes when the ego-motion of the camera is known a priori. This is beneficial since it allows working with dense stereo disparity maps despite the real-time constraint. Disparity measurements which are consistent over time are considered as belonging to the same world point, and therefore the disparity variance is reduced accordingly. This stereo integration requires three main steps:

Prediction: the current integrated disparity and a variance image are predicted. This is equivalent to computing the expected optical flow and disparity based on ego-motion. Our prediction of the variance image includes the addition of a driving noise parameter that models the uncertainties of the system, such as ego-motion inaccuracy.
Measurement: disparity and variance images are computed based on the current left and right images.
Update: if the measured disparity confirms its prediction, then both are fused together, reducing the variance of the estimation. The verification of the disparity consistency is performed using a standard 3-sigma test.

Figure 2 shows an example of the improvement achieved. The occupancy grid shown at the right was computed with an integrated disparity image. Note the significantly reduced uncertainties of the registered 3D points. A bicyclist approximately 60 meters away is marked in the images.

The occupancy grids shown above are in Cartesian coordinates. However, Cartesian space is not a suitable space to compute the free space because the search must be done in the direction of rays leaving the camera. The set of rays must span the whole grid. This leads to discretization problems. A more appropriate space is the polar space.
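The update step of the disparity integration can be sketched per pixel as a scalar Kalman-style fusion. This is a minimal illustration, not the authors' implementation: it assumes the predicted disparity and variance images have already been warped by the ego-motion, and it assumes that a pixel failing the 3-sigma test is simply re-initialized from the new measurement (the text does not say what happens in that case). The function name and the default driving noise are hypothetical.

```python
import numpy as np


def integrate_disparity(d_pred, var_pred, d_meas, var_meas, drive_noise=0.01):
    """Fuse predicted and measured disparity images pixel by pixel.

    d_pred, var_pred: predicted disparity and variance (already ego-motion compensated).
    d_meas, var_meas: disparity and variance measured from the current image pair.
    drive_noise:      variance added in the prediction to model system uncertainty.
    """
    var_pred = var_pred + drive_noise                 # prediction inflates the variance
    innovation = d_meas - d_pred
    # 3-sigma consistency test on the innovation
    consistent = innovation ** 2 <= 9.0 * (var_pred + var_meas)
    gain = var_pred / (var_pred + var_meas)
    # Consistent pixels are fused; inconsistent ones restart from the measurement (assumption).
    d_fused = np.where(consistent, d_pred + gain * innovation, d_meas)
    var_fused = np.where(consistent, (1.0 - gain) * var_pred, var_meas)
    return d_fused, var_fused
```

With zero driving noise and N consistent measurements of equal variance, the fused variance in this sketch falls as 1/N, which is the behavior stated in the text.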
In polar coordinates every grid column is, by definition, already in the direction of a ray. Therefore, searching for obstacles in the ray direction is straightforward. For the computation of free space, the first step is to transform the Cartesian grid to a polar grid by applying a remapping operation. The polar representation we use is a Column/Disparity occupancy grid; for a detailed discussion see (5). A result of this remapping is shown in the middle image of Figure 3.

Fig. 3: Free space computation. The green carpet shows the computed available free space. The free space is obtained by applying dynamic programming on a Column/Disparity occupancy grid, which is a remapping of the Cartesian depth map, shown at the right. The free space resulting from the dynamic programming is shown over the grids.
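The dynamic programming search on a Column/Disparity grid can be illustrated with a small sketch. It is a simplified stand-in under stated assumptions: the data cost is invented for illustration (occupancy in front of the chosen cell is penalized, occupancy at the chosen cell is rewarded), only the spatial smoothness term is modeled, and the temporal term of the method in (5) and (6) is omitted. Rows are assumed to be ordered from near to far.

```python
import numpy as np


def free_space_boundary(occ, smooth_weight=0.2, max_jump=2):
    """Return, for each column, the row index of the first obstacle cell.

    occ:           2D float array (rows = depth cells from near to far, columns = rays),
                   values in [0, 1] giving occupancy evidence.
    smooth_weight: cost per row of difference between neighboring columns.
    max_jump:      saturation of the smoothness cost, so that real depth
                   discontinuities remain affordable.
    """
    n_rows, n_cols = occ.shape
    # Hypothetical data cost: occupancy strictly in front of the boundary cell
    # plus missing occupancy at the boundary cell itself.
    in_front = np.cumsum(occ, axis=0) - occ
    data = in_front + (1.0 - occ)

    rows = np.arange(n_rows)
    # jump[r_prev, r]: saturated smoothness cost between neighboring columns
    jump = smooth_weight * np.minimum(np.abs(rows[:, None] - rows[None, :]), max_jump)

    cost = data[:, 0].copy()
    back = np.zeros((n_rows, n_cols), dtype=int)
    for c in range(1, n_cols):
        total = cost[:, None] + jump          # total[r_prev, r]
        back[:, c] = np.argmin(total, axis=0)
        cost = total.min(axis=0) + data[:, c]

    # Backtrack the globally optimal path from right to left.
    path = np.empty(n_cols, dtype=int)
    path[-1] = int(np.argmin(cost))
    for c in range(n_cols - 1, 0, -1):
        path[c - 1] = back[path[c], c]
    return path
```

In this sketch, a column without any measurement inherits the boundary of its neighbors through the smoothness term, while a large depth step between two groups of columns is kept because the cost of the step saturates.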

In polar representation, the task is to find the first visible obstacle in the positive direction of depth. All the space found in front of the occupied cell is considered free space. The desired solution forms a path from left to right segmenting the polar grid into two regions. Instead of simply thresholding each column, dynamic programming is used. The method based on dynamic programming has the following properties:

Global optimization: no column is considered independently; each is part of a global optimization problem which is optimally solved.
Spatial and temporal smoothness of the solution: the spatial smoothness is imposed by penalizing depth discontinuities, while temporal smoothness is imposed by penalizing the deviation of the current solution from the prediction.
Preservation of spatial and temporal discontinuities: the saturation of the spatial and temporal costs allows the preservation of discontinuities.

Figure 3 shows the result of the dynamic programming applied to the considered scene. For more details on this analysis see (6).

6D-VISION

Until now, we assumed the world to be static and showed how to combine successive stereo image pairs to reduce the variance of the free space estimation. This information can be used for obstacle detection and obstacle avoidance in a straightforward manner, since all non-free space is considered an obstacle. However, the world is not completely static, and a system for obstacle detection has to cope with moving objects and precisely estimate their movements to predict potential collisions. A common approach is to analyze the occupancy grid and to track isolated objects over time. The major disadvantage of this approach is that the segmentation of isolated objects is difficult in scenes consisting of multiple nearby objects.

Fig. 4: Dangerous traffic scene. The left image shows a pedestrian appearing behind a standing car. The corresponding stereo reconstruction is shown in the center image. Red encodes close, green encodes far points. The optical flow field is shown in the right image. Here red lines encode large image displacements, green small displacements.

This problem is illustrated in Figure 4: here the pedestrian appears behind the standing car and runs towards the street. In the center image, the reconstructed stereo information is shown using the red to green color encoding scheme. Here, the points belonging to the pedestrian can hardly be distinguished from the points on the standing car. A segmentation based on this information only will therefore merge the pedestrian and the standing car into a single static object. In the right image, the optical flow between the last and the current frame is shown. The color encodes the length of the displacement vector: red encodes large image displacements, green small displacements. Here the pedestrian and the standing car can easily be distinguished.

This leads to the main idea of the 6D-Vision algorithm: Track an image point in one camera from frame to frame and calculate its stereo disparity. Together with the known motion of the ego-vehicle, the movement of the corresponding world point can be calculated. In practice, a direct motion calculation based on two consecutive frames is extremely noisy. Therefore the obtained measurements are filtered by a Kalman filter. Since we allow the observer to move, we fix the origin of the coordinate system to the car. The state vector of the Kalman filter consists of the world point in the car coordinate system and its corresponding velocity vector. The six-dimensional state vector $(X, Y, Z, \dot{X}, \dot{Y}, \dot{Z})$ gives this algorithm its name: 6D-Vision. The mathematical details are given in (7). The measurement vector used in the update step of the Kalman filter is $(u, v, d)$, with u and v being the current image coordinates of the tracked image point and d its corresponding disparity. As the perspective projection formulae are non-linear, we have to apply the Extended Kalman filter.

Fig. 5: 6D-Vision block diagram.

A block diagram of the algorithm is shown in Figure 5. In every cycle, a new stereo image pair is obtained. In the left image, appropriate features (e.g. edges, corners) are detected and tracked over time. In the current application we use a version of the Kanade-Lucas-Tomasi tracker (8) which provides sub-pixel accuracy and tracks the features robustly over a long image sequence. The disparities for all tracked features are determined in the stereo module.
After this step the estimated 3D-position of each feature is known. Together with the ego-motion the measurements of the tracking and the stereo modules are given to the Kalman filter system that updates the state estimation. For the next image pair analysis, the acquired 6D information is used to predict the image position of the tracked features. This yields a better tracking performance with respect to speed and robustness. In addition, the predicted depth information is used to improve the stereo calculation. The motion of a vehicle is not at all straight but exhibits strong pitch and roll motion. In order to compensate for these disturbances, a precise ego-motion analysis is advisable. If stereo tracks are available, the full motion state (6 degrees of freedom) can be obtained from vision. The powerful real-time algorithm we use is described in (9).
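The filter at the core of 6D-Vision can be sketched for a single tracked point as follows. This is a minimal illustration under explicit assumptions, not the formulation of (7): the camera is stationary (the ego-motion compensation is left out), the principal point lies at the image origin, the point moves with constant velocity, and the focal length, noise levels and initial covariance are hypothetical values chosen for the example.

```python
import numpy as np

F_PX = 800.0   # assumed focal length in pixels
BASE = 0.30    # stereo base line in metres (value from the text)


def project(state):
    """Pinhole stereo projection of (X, Y, Z, ...) to the measurement (u, v, d)."""
    X, Y, Z = state[:3]
    return np.array([F_PX * X / Z, F_PX * Y / Z, F_PX * BASE / Z])


def jacobian(state):
    """Jacobian of project() with respect to the 6D state (velocity columns are zero)."""
    X, Y, Z = state[:3]
    H = np.zeros((3, 6))
    H[0, 0] = F_PX / Z
    H[0, 2] = -F_PX * X / Z ** 2
    H[1, 1] = F_PX / Z
    H[1, 2] = -F_PX * Y / Z ** 2
    H[2, 2] = -F_PX * BASE / Z ** 2
    return H


def ekf_step(x, P, z, dt, Q, R):
    """One predict/update cycle of an Extended Kalman filter with a constant velocity model."""
    A = np.eye(6)
    A[:3, 3:] = dt * np.eye(3)            # position += velocity * dt
    x = A @ x
    P = A @ P @ A.T + Q
    H = jacobian(x)
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x = x + K @ (z - project(x))
    P = (np.eye(6) - K @ H) @ P
    return x, P


def track_point(measurements, dt=0.04):
    """Filter a sequence of (u, v, d) measurements; returns the final 6D state."""
    u, v, d = measurements[0]
    z0 = F_PX * BASE / d
    # Initialize the position from the first measurement, the velocity with zero.
    x = np.array([u * z0 / F_PX, v * z0 / F_PX, z0, 0.0, 0.0, 0.0])
    P = np.diag([0.1, 0.1, 1.0, 25.0, 25.0, 25.0])     # assumed initial uncertainty
    Q = np.diag([1e-4] * 3 + [1e-2] * 3)               # assumed process noise
    R = np.diag([0.25 ** 2] * 3)                       # assumed measurement noise (pixels^2)
    for z in measurements[1:]:
        x, P = ekf_step(x, P, np.asarray(z, dtype=float), dt, Q, R)
    return x
```

Feeding the sketch with the projections of a point that moves laterally at constant speed yields a velocity estimate after a few frames, which mirrors the pedestrian example: the position comes from stereo, the velocity from the temporal integration.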

Fig. 6: Estimation results for the pedestrian from Figure 4. The time between the images is (from left to right) 0, 80, 160 and 240 ms. The vectors point to the predicted position of the corresponding world point in 0.5s. The color encodes the distance of the points.

The result of this algorithm is shown in Figure 6. From left to right the estimation results for the pedestrian from Figure 4 are shown at 0, 80, 160 and 240 ms relative to the first appearance of the pedestrian. The estimated velocity vectors point to the predicted position of the corresponding world point in 0.5s. The colors encode the distance of the points. It can be seen that this rich information helps to detect the moving pedestrian and provides a first prediction of its movement at the same time.

OBJECT TRACKING

6D-Vision is a powerful method to extract linear point motion in the 3D world. A group of 6D vectors corresponding to adjacent 3D points with similar 3D motion vectors is likely to belong to the same object and, thus, can be used to generate object hypotheses. However, due to the linear motion model of the single points, predicting the motion of such an object hypothesis without any further constraints is also limited to linear motion. With respect to vehicles, especially in turning maneuvers, such a prediction of the driving path cannot be very precise and may lead to misinterpretations.

In (10), a vision-based approach for estimating the nonlinear motion state of vehicles from a moving platform is proposed. Objects are represented by a 3D point cloud combined with a state vector including object pose and dynamics. It is assumed that it is possible to distinguish vehicles from other moving objects such as pedestrians, e.g. based on dimension and velocity of a cluster of 6D vectors.

Fig. 7: Turning vehicle at an intersection. The orange box indicates the estimated position and orientation. The red arrow indicates the predicted driving path assuming constant motion.
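The difference between a linear prediction and one that also uses the yaw rate can be made concrete with a short sketch. It assumes constant speed and constant yaw rate over the prediction horizon, so the vehicle follows a circular path. The function name, the planar coordinate convention and the numbers are illustrative and not taken from (10).

```python
import math


def predict_turn(x, y, heading, speed, yaw_rate, horizon):
    """Predict a planar pose after `horizon` seconds at constant speed and yaw rate.

    x, y:     current position in metres
    heading:  current driving direction in radians
    speed:    velocity along the heading in m/s
    yaw_rate: turn rate in rad/s (zero means straight motion)
    """
    if abs(yaw_rate) < 1e-6:
        # Straight-line motion as the limit of a vanishing yaw rate.
        return (x + speed * horizon * math.cos(heading),
                y + speed * horizon * math.sin(heading),
                heading)
    radius = speed / yaw_rate
    new_heading = heading + yaw_rate * horizon
    return (x + radius * (math.sin(new_heading) - math.sin(heading)),
            y - radius * (math.cos(new_heading) - math.cos(heading)),
            new_heading)


if __name__ == "__main__":
    # Oncoming car at 10 m/s: linear prediction versus a turn with 0.5 rad/s.
    print(predict_turn(0.0, 0.0, 0.0, 10.0, 0.0, 1.0))
    print(predict_turn(0.0, 0.0, 0.0, 10.0, 0.5, 1.0))
```

For a turning car the two predictions differ by meters after one second, which is why a linear model can misjudge a potential collision at an intersection.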
The dynamics of a vehicle are approximated by a coordinated turn motion model, which restricts lateral movements to a circular path based on velocity and yaw rate. Moving the point cloud in the world induces changes in the image plane, which are observed in terms of optical flow and disparity changes. An Extended Kalman Filter is used to solve the inverse problem, i.e. relating these observations in the image to a movement of the point cloud in the world. All points are referred to a local object coordinate system defined for each tracked vehicle. It is assumed that the real position of a point within the object coordinate system does not change over time (rigid body assumption) and that the object's structure is well described by the point cloud. In practice one has to deal with noisy observations of these points, and neither the real position of a point nor the overall structure is known. However, it is possible to refine the object point cloud over time based on a number of noisy observations of the single points.

Fig. 7 shows a typical situation at an intersection. The orange box shows the current position of the oncoming car. The complete motion state of the turning vehicle has been estimated based on the stereo tracks. Assuming constant motion, the green arrow in front of the car indicates the expected circular driving path for the next second.

The proposed system is able to estimate the motion state of vehicles at urban intersections including velocity, yaw rate, and acceleration as well as position and orientation, and currently runs at 25Hz on our demonstrator car UTA. The filter can be easily extended by adding additional measurements, for example radar sensor information such as relative velocity or distance.

FISHEYE STEREO FOR INTERSECTION ASSISTANCE

Common stereo camera systems have opening angles around 40 degrees. Simple investigations reveal that this angle must be increased to about 150 degrees if dangerous situations at intersections are to be recognized, e.g. vehicles coming from the side. Fisheye lenses, in contrast to standard wide angle lenses, have the advantage of a constant resolution over the whole field of view. Currently, we use 150 degree lenses. A typical image is shown in fig. 8. The computation can be limited to 400 lines of the 1628x1236 imager.
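The rectification in this section uses a cylindrical rather than a pinhole camera model. A small sketch shows why the choice matters at such opening angles. It assumes the usual textbook definitions: in a pinhole image the column offset grows with the tangent of the viewing angle, while in a cylindrical image it is proportional to the angle itself. The function names and the unit focal length are illustrative only, and the actual rectification of (11) may differ in detail.

```python
import math


def pinhole_half_width_px(focal_px: float, half_fov_deg: float) -> float:
    """Half image width needed by a pinhole model: f * tan(theta)."""
    return focal_px * math.tan(math.radians(half_fov_deg))


def cylindrical_half_width_px(focal_px: float, half_fov_deg: float) -> float:
    """Half image width needed by a cylindrical model: f * theta (theta in radians)."""
    return focal_px * math.radians(half_fov_deg)


if __name__ == "__main__":
    for half_fov in (20.0, 45.0, 75.0, 85.0):
        p = pinhole_half_width_px(1.0, half_fov)
        c = cylindrical_half_width_px(1.0, half_fov)
        print(f"half angle {half_fov:4.1f} deg: pinhole {p:6.2f} f, cylindrical {c:5.2f} f")
```

At a half angle of 75 degrees (a 150 degree lens) the pinhole image is already almost three times as wide as the cylindrical one, and its width grows without bound as the half angle approaches 90 degrees.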
In the first step, the images are rectified, based on calibration data obtained in an offline process. For details see (11). This allows using the free space analysis and 6D-Vision as described above without any changes. The rectification step works with a cylindrical camera model as opposed to the pinhole model in order to obtain a bounded image size. Figure 8 shows a situation at a pedestrian crossing. The computed free space is overlaid in green.

Fig. 8: Free space analysis for the pedestrian crossing situation.

Figure 9 shows a second intersection scene where a vehicle approaches quickly from the right having the right-of-way. Note the position of the vehicle at initial detection. It is first detected at 15m longitudinal and 22m lateral distance, yielding roughly 26m Euclidean distance. An earlier detection was impossible due to occlusion by a wall visible at the right edge.

Fig. 9: 6D-Vision result for a scene with a vehicle approaching fast from the right side having the right-of-way. The significant lateral motion is detected within 4 frames. The arrow length shows the predicted position in 0.5s. The arrow color encodes distance: red is near and green is far away.

Fig. 10: Situation shown in fig. 9 two seconds later, after the stop of our own vehicle.

The actual object detection is done via direction and position analysis of the 6D vectors (see previous section). Figure 10 shows the same scene two seconds later. The ego-vehicle has almost stopped while the vehicle from the right was able to continue.

SUMMARY

Vehicles acting in a dynamic environment must be able to detect any static or moving obstacle. This implies that an optimal stereo vision algorithm has to seek an optimal exploitation of the spatial and temporal information contained in the image sequence. As shown in the paper, precision and robustness of 3D reconstructions are significantly improved if the stereo information is appropriately integrated over time. This requires knowledge of the ego-motion, which in turn can be efficiently computed from 3D-tracks. It turns out that the obtained ego-motion data outperforms commonly used inertial sensors. The obtained depth maps show less noise and uncertainty than those generated by simple frame-by-frame analysis. A dynamic programming approach allows for determining the free space without any susceptible obstacle threshold. The algorithm runs in real-time on a PC and has proven robust in daily traffic including night-time driving and heavy rain.

The need to detect small or partly hidden moving objects from a moving observer calls for a fusion of stereo and optical flow.
This leads to the 6D-Vision approach, which allows the simultaneous estimation of position and motion of each observed image point. Since the fusion is based on Kalman filters, the information contained in multiple frames is integrated. This leads to a more robust and precise estimation than differential approaches like pure evaluation of the optical flow on consecutive image pairs. Grouping this 6D-information is very reliable and enables fast detection of moving objects, which can be further tracked using appropriate dynamic models. The same concept is applied to cameras with fisheye lenses. Practical tests confirm that a crossing cyclist at an intersection is detected within 4-5 frames. The implementation on a 3.2GHz Pentium 4 proves real-time capability. Currently, we select and track about 2000 image points at 25Hz (the images have VGA resolution).

REFERENCES

(1) D. Scharstein, R. Szeliski: A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. IJCV 47(1) (2002) pp. 7-42.
(2) H. Hirschmueller: Accurate and efficient stereo processing by semi-global matching and mutual information. CVPR 2005, San Diego, CA. Volume 2. (June 2005) pp. 807-814.
(3) S. Gehrig, U. Franke: Improving Stereo Sub-Pixel Accuracy for Long Range Stereo. Workshop on Virtual Representations and Modeling of Large-Scale Environments VRML@ICCV 07, Rio, 2007.
(4) T. Dang, C. Hoffmann: Tracking Camera Parameters of an Active Stereo Rig. In 28th Annual Symposium of the German Association for Pattern Recognition (DAGM 2006), Berlin, September 12-14, 2006.
(5) H. Badino, U. Franke, R. Mester: Free space computation using stochastic occupancy grids and dynamic programming. Workshop on Dynamical Vision, ICCV 07, Rio, 2007.
(6) U. Franke, S. Gehrig, H. Badino, C. Rabe: Towards Optimal Stereo Analysis of Image Sequences. RobotVision 2008, 18.-20. February 2008, Auckland.
(7) U. Franke, C. Rabe, H. Badino, S. Gehrig: 6D-Vision: Fusion of Stereo and Motion for Robust Environment Perception. 27th DAGM Symposium 2005, pp. 216-223, ISBN 3-540-28703-5.
(8) J. Shi, C. Tomasi: Good Features to Track. IEEE Conference on Computer Vision and Pattern Recognition, pp. 593-600, 1994.
(9) H. Badino, U. Franke, C. Rabe, S. Gehrig: Stereo-vision based detection of moving objects under strong camera motion. VisApp, Setubal (Portugal), February 2006.
(10) A. Barth, U. Franke: Where will the Oncoming Vehicle be the Next Second? IEEE Intelligent Vehicles Symposium IV 2008, Eindhoven, 4.-6. June 2008.
(11) S. Gehrig, C. Rabe, L. Krüger: 6D Vision Goes Fisheye for Intersection Assistance. Canadian Robot Vision, Windsor, May 2008.