Evaluation of Stereo Matching Algorithms for Occupant Detection
Sidharta Gautama
TELIN, Universiteit Gent, St.Pietersnieuwstr. 41, B-9000 Gent (Belgium)

Simon Lacroix, Michel Devy
LAAS-CNRS, 7 avenue Colonel Roche, Toulouse Cedex (France)
simon@laas.fr, devy@laas.fr

(This work is supported by the PREDIT program of the French Ministry of National Education and Research.)

Abstract

Vision systems within vehicles offer new opportunities to the automobile industry. The detection and classification of passenger and driver seat occupancy open up new ways to improve the safety and comfort of the passengers. In this paper we present results of a stereo system designed to observe the cockpit scene, in order to provide information about passenger presence and location within the vehicle and so improve the control of airbag firing. We compare different techniques and examine the effect of random and systematic errors on the performance in precision, robustness and processing speed. These results establish a foundation for ongoing work on occupant detection for vehicle safety.

1 Introduction

Improving the safety and comfort of the vehicle occupant is an important research domain in the automobile industry. New, sophisticated functions are being developed which improve and automate a variety of vehicle components (e.g. intelligent cruise control, airbag firing). The development of these functions is tied to the introduction of new sensors in the automotive field (radar, lidar, video sensing devices). These sensors provide information to decision algorithms which process the raw data into an interpretable analysis of the observed scene.

In this paper we discuss occupant detection using stereo vision, in relation to safety issues. The massive introduction of airbags in cars has generated new problems: airbag systems operate in open-loop conditions; when an impact is detected, the airbag is automatically
inflated without any feedback about the nature of the seat occupant. This mode of operation is not without problems and has caused a number of dramatic accidents: passengers in unusual positions are severely injured by the airbag, and babies installed in a rear-facing child seat are thrown to the back of the vehicle when the airbag inflates. Determining the presence of a person or object on the seat, and evaluating the position of the passenger within the cockpit, can be used to improve the operational conditions of the airbag. All these developments are supported by improvements in sensor technology and the falling cost of electronic components.

In order to analyze the seat occupancy, we developed a 3D vision system based on a pair of stereo cameras. Stereo vision offers the potential to produce detailed results within real-time constraints and is suited to irregular environments. The application of occupant detection demands a level of accuracy beyond the mere detection of presence: a fine location of the occupant's boundaries is needed to accurately position the person within the vehicle. This puts strong constraints on the performance of the 3D vision system:

1. Real-time demands: in order for the system to react quickly enough in the case of an accident, 3D reconstruction at 10 Hz or higher is required.
2. Dense reconstruction: the large variety of situations (e.g. person with child, person turned around, different objects) puts high demands on the density of the 3D reconstruction, so that the form can be sufficiently characterized in later stages.
3. Accurate reconstruction: of particular importance is the positioning of the head within the cockpit. The application aims at accurate positioning within centimetre range.

In this paper, we evaluate the performance of different stereo algorithms within this context. Section 2 introduces
area-based matching. Matching is a fundamental problem in stereo vision, where the aim is to identify corresponding points in the left and right image. We examine two approaches to this problem. In section 3, the performance in precision, robustness and computational complexity of these two approaches is examined. Section 4 outlines the application of the system to head detection.

2 Area-based matching

Area-based matching algorithms in stereo vision rely on the comparison of regularly sized regions of pixel values in the two images, in order to find the best match. Matching measures provide a numerical measure of the similarity between a window of pixels in one image and a window in the other image; based on these measures, an optimal match is determined.

2.1 Zero-mean Normalised Cross-Correlation (ZNCC)

ZNCC belongs to the class of correlation-based measures. These measures are based on the cross-correlation function R between two images I_1 and I_2:

    R(x, y) = \sum_{(u,v) \in W} I_1(u, v) \, I_2(x+u, y+v)

where I_1(x, y) and I_2(x, y) denote the intensity values at pixel (x, y). ZNCC deals with a possible gain factor and bias between the two images (radiometric distortion) by dividing by the variances of each correlation window and by first subtracting the mean of the correlation window from each pixel value:

    Score_{ZNCC}(x, y) = \frac{\sum_{(u,v) \in W} (I_1(u,v) - \bar{I}_1)\,(I_2(x+u, y+v) - \bar{I}_2)}{\sqrt{\sum_{(u,v) \in W} (I_1(u,v) - \bar{I}_1)^2 \; \sum_{(u,v) \in W} (I_2(u,v) - \bar{I}_2)^2}}

2.2 Census Transform (CT)

CT belongs to the class of non-parametric techniques [1][2], which are based on the relative ordering of pixel intensities within a window, rather than the intensity values themselves. Consequently, these techniques are robust with respect to radiometric distortion, since differences in gain and bias between two images do not affect the ordering of pixels within a window. The CT maps the window surrounding the centre pixel to a bit string.
If a particular pixel's value is less than the centre pixel's, the corresponding position in the bit string is set to 1; otherwise it is set to 0. Two census-transformed images are compared using a similarity measure based on the Hamming distance, i.e. the number of bits that differ in the two bit strings:

    Score_{CT}(x, y) = \sum_{(u,v) \in W} \mathrm{Hamming}\bigl(I'_1(u, v),\; I'_2(x+u, y+v)\bigr)

where I'_1 and I'_2 represent the census transforms of I_1 and I_2. We have introduced a modification to the census transform to make it more robust to image noise and rectification errors: in correspondence with ZNCC, we perform the census transform by comparing each pixel's value with the mean of the correlation window instead of the intensity of the centre pixel.

3 Experimental evaluation

For stereo vision, performance depends on many factors, including the contrast and spatial distribution of texture in the intensity signal, the noise level in the images, artifacts introduced by the matching algorithms, and the true range to the objects in the scene [3]. In our experiments, we characterize both the random errors caused by noise and the systematic errors resulting from artifacts. To evaluate the performance of the ZNCC and CT algorithms within the framework of occupant detection, we collected different sets of stereo pairs of a flat board, with and without an object of known geometry, on an optical bench. We used a set of cameras (2.6 mm objectives) with a 5 cm stereo baseline. A third-order distortion model is used to calibrate the cameras and estimate the epipolar geometry. The range of interest is limited to 2 metres. Sets of 50 stereo pairs were taken under constant illumination and processed to generate sample statistics (fig. 1). Besides these images, we mounted an experimental system within a vehicle cockpit, which is used to evaluate the algorithms in the context of occupant detection.
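The two matching scores defined in section 2 can be summarized in code. The following is a minimal sketch, not the authors' implementation; function names are ours, and the windows are assumed to be equally sized numpy arrays.

```python
import numpy as np

def zncc(w1, w2):
    """Zero-mean normalised cross-correlation of two equally sized windows."""
    a = w1 - w1.mean()
    b = w2 - w2.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return (a * b).sum() / denom if denom > 0 else 0.0

def census_mean(w):
    """Modified census transform: a bit is 1 where the pixel lies below the
    window mean (the original CT compares against the centre pixel)."""
    return (w < w.mean()).astype(np.uint8).ravel()

def hamming_score(w1, w2):
    """Hamming distance between census-transformed windows (lower = better)."""
    return int(np.count_nonzero(census_mean(w1) != census_mean(w2)))
```

In a full matcher, the ZNCC score would be maximised and the Hamming score minimised over the candidate disparities along the epipolar line. Note that zncc(w, 2*w + 5) equals 1, illustrating the invariance to gain and bias discussed above.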
3.1 Analysis of random errors

At subpixel resolution, the precision (variance) of a disparity estimate d reflects statistical fluctuations around the sub-pixel mean. This precision depends on the variance of the noise in the image, the image intensity, and the matching algorithm used. We are interested in the uncertainty in the 3-D measurements computed from the disparity. If the estimate d follows a Gaussian distribution (which has been verified by Matthies [3]), the estimate of the range Z can be modelled as Gaussian with mean Z = k/d and variance

    \sigma_Z^2 \approx \left(\frac{\partial Z}{\partial d}\right)^2 \sigma_d^2 = \frac{k^2}{d^4}\,\sigma_d^2        (1)
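Equation (1) is easy to exercise numerically. The sketch below uses illustrative values for the calibration constant k and the disparity noise, not the paper's; it only shows the quadratic growth of range uncertainty with range that fig. 2 plots.

```python
def range_stats(d, sigma_d, k):
    """Mean range Z = k/d and its standard deviation (k/d**2) * sigma_d,
    following equation (1). k and sigma_d are illustrative values."""
    z = k / d
    sigma_z = (k / d**2) * sigma_d
    return z, sigma_z

# Halving the disparity doubles the range but quadruples the uncertainty:
z1, s1 = range_stats(10.0, 0.1, 100.0)   # Z = 10, sigma_Z = 0.1
z2, s2 = range_stats(5.0, 0.1, 100.0)    # Z = 20, sigma_Z = 0.4
```

Equivalently, sigma_Z = (Z**2 / k) * sigma_d, which is why precision degrades quickly beyond the 2-metre range of interest.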
Figure 1. Test images: (a) flat board, (b) book.

where k is a calibration constant. Using equation (1), the standard deviation of the range can be plotted against the range (fig. 2). We have analysed this for a set of 50 images taken of the same scene (a flat board, fig. 1(a)). All images were processed with ZNCC and both CT algorithms using a 9x9-pixel correlation window, and statistics were collected. Plotting the standard deviation of the range, we see that the precision remains within the centimetre range for both the ZNCC and CT algorithms. ZNCC provides a slightly more stable result, which is due to the stronger smoothing effect of the correlation window. The modification of the CT algorithm has no effect on precision compared with the original CT.

To allow a qualitative evaluation of the robustness of the different algorithms, we averaged the number of matched pixels found for each image pair over the total dataset. This comparison shows the immediate benefit of the modification made to the original CT: introducing an averaging process in combination with the census transform stabilises the transform against image noise and errors due to the rectification process (fig. 3). Table 1 summarizes the statistics. Based on these results, we opted to continue with the modified version of the CT algorithm; in the following discussion, CT stands for the modified census transform.

Table 1. Matched pixels per image pair, averaged over the total dataset (100% = total image):

    ZNCC: …%    CT: …%    CT (modified): …%

Figure 2. Standard deviation of the range estimates: (a) ZNCC, (b) CT, (c) CT modified.

3.2 Quality of the disparity estimate

The disparity at each pixel is estimated using area-based matching with a correlation window around each pixel. Several criteria can be used to filter false matchings: strength, uniqueness and form of the correlation peak (fig. 4), and inverse validation.
This last test inverts the roles of the left and right images, and considers as valid only those matches for which the reverse correlation falls on the initial point in the left image [4]. Sub-pixel disparity estimates are obtained by fitting a parabola to the three correlation values surrounding the optimum. We assess quality not only by the correctness of the disparity estimate, but also by the ability to filter out false matchings using the validity criteria.

Figure 3. Disparity images of a sample: (a) ZNCC, (b) CT, (c) CT modified.

Figure 4. Thresholds on the correlation peak: correlation signal, difference, range variance, strength.

To evaluate the matching algorithms, we examined the disparity estimate at various points within the images shown in fig. 5; the dots show the locations where samples were taken, and some example textures are shown for the correlation window used. For these textures, the matching strength of the texture over a given range is plotted in fig. 6, comparing the correlation signals for ZNCC and CT.

Figure 5. Texture samples: (a), (b), (c).

We examine the evolution of the peak value, the difference with the second-largest peak, the variance and the inverse validation, which are used by the filter criteria, for varying correlation window sizes. Since small correlation windows tend to give an unreliable disparity estimate, this gives a rough idea of which criteria are useful for filtering out unwanted responses. Fig. 7 shows the plot of these measures for a given point (texture b, fig. 5). The inverse validation is plotted as a grey bar where the inverse test is satisfied (i.e. the reciprocal match falls within one pixel of the original point). This test proves to be the most reliable: estimates which fail the inverse validity test are false or ambiguous. The converse does not hold, however, and false estimates can still filter through. To use the other measures in conjunction with the inverse validity test, the plot for these measures should show a monotonically increasing curve. For the different points we examined, only the difference measure proves useful; the peak value and the variance generally show a fluctuating behaviour.

The difference measure proves to be a useful attribute with CT, in contrast with ZNCC. To show this, we classified the disparity estimates at our sample points as reliable or unreliable, by checking each estimate against the observed scene for consistency.
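The inverse validity test and the sub-pixel parabola fit used here can be sketched as follows. This is a schematic, not the authors' code: disparity maps are taken as integer arrays indexed by column, and the one-pixel tolerance matches the reciprocal-match criterion above.

```python
def inverse_valid(d_left, d_right, x, tol=1.0):
    """Left-to-right check: x maps to xr = x - d_left[x] in the right image;
    the match is kept only if the right-to-left disparity maps xr back to
    within `tol` pixels of x."""
    xr = x - d_left[x]
    return abs(x - (xr + d_right[xr])) <= tol

def subpixel(c_prev, c_best, c_next):
    """Sub-pixel offset in (-0.5, 0.5): vertex of the parabola through the
    correlation scores at disparities d-1, d and d+1."""
    denom = c_prev - 2.0 * c_best + c_next
    return 0.0 if denom == 0 else 0.5 * (c_prev - c_next) / denom
```

For a symmetric peak the offset is zero; an asymmetric triple shifts the estimate toward the larger neighbour, which is what makes the centimetre-level precision of section 3.1 attainable from integer disparities.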
Only the points which passed the inverse validity test are considered. ZNCC and CT give the same consistency evaluation for each point (i.e. both techniques give approximately the same disparity estimate). However, using the difference measure, CT is able to filter out a large part of the inconsistent points (cf. fig. 8). With CT, the difference measure generally shows a high response (a unique peak) for the good estimates and a low response (an ambiguous peak) for the unreliable ones; ZNCC, on the other hand, tends to give a strong response even in the case of unreliable estimates.

Figure 6. Correlation signals for textures a, b and c: (left) ZNCC, (right) CT.

Figure 7. Evolution of the validity measures.

3.3 Analysis of systematic errors

Systematic errors are due to processing artifacts. One of these artifacts is the so-called object growing effect, a bias introduced by the finite size of the correlation window. When the correlation window encounters an object edge, the mean calculated within this window (as ZNCC does) reflects both the high- and low-contrast texture. This causes confusion in the matching measure and results in a bias towards high contrast: a halo is seen around the object, within which the range estimates are nearly the same as the range to the object instead of the range to the background. The size of the halo is half the size of the correlation window; in 3D, this size grows with the distance to the cameras. The problem has been investigated by a number of authors (cf. [5]), but the proposed solutions are not always well adapted to real-time systems.

To characterize the object growing effect, we take the 3D reconstruction of a book (fig. 1(b)) and transform the points into a frame following the principal axes of the book. After this transformation, the 3D points are thresholded by keeping all points above a certain height off the board plane. In this way we are able to segment the points belonging to the object. Using this segmentation, we take the histogram of the 3D points along the horizontal axis, which allows us to estimate the absolute horizontal precision of the system. The book has a width of 22 cm.
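The width measurement just described can be sketched as follows, assuming the points have already been transformed into the board frame. The function name, height threshold and percentile trimming are ours; the percentiles stand in for reading the extent off the horizontal histogram while discarding stray outliers.

```python
import numpy as np

def object_width(points_board, height_thresh=0.02, lo=1.0, hi=99.0):
    """Estimate object width along the horizontal (x) axis.
    points_board: (N, 3) array in the board frame, with x horizontal
    and z the height off the board plane; units are metres."""
    # Keep only points above the board plane (the segmented object).
    obj = points_board[points_board[:, 2] > height_thresh]
    if obj.shape[0] == 0:
        return 0.0
    x = obj[:, 0]
    # Trimmed extent of the horizontal distribution of object points.
    return float(np.percentile(x, hi) - np.percentile(x, lo))
```

With a synthetic 22 cm object over a flat board, the routine recovers roughly 0.22 m; applied to a real reconstruction, the halo of the object growing effect inflates this figure, which is exactly what the measurement is designed to expose.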
Both ZNCC and CT overestimate the width, but CT proves to be more accurate than ZNCC (24.9 cm; fig. 9(a)). CT avoids the object growing effect not by a better estimate but by more efficient filtering, as shown in the previous section. This answer of "unknown" is much preferred over erroneous estimates.

Figure 8. Difference and peak values: (a) ZNCC, (b) CT.

The size of the correlation window also influences the directional accuracy: increasing the correlation window smooths out the normal directions of the 3D points, as shown in fig. 9(b). The normal direction is computed at each 3D point using

    \vec{n} = \frac{\vec{x}_u \times \vec{x}_v}{\lvert \vec{x}_u \times \vec{x}_v \rvert}        (2)

where \vec{x}_u and \vec{x}_v are the partial derivatives calculated by differencing in a 4-neighbourhood. We take the histogram of the normal direction along the range axis, which in the ideal case (a perfectly flat surface) should be a delta peak; here too we note a better performance by the CT algorithm.

Figure 9. Accuracy of range estimates: (a) positional accuracy, (b) directional accuracy.

3.4 Performance within the cockpit model

A prototype system, consisting of a pair of stereo cameras combined with infrared illumination, is being developed for installation within a cockpit model. The images produced by the cameras are subsampled. Table 2 shows the timing results for the matching and reconstruction processes, performed on a Sun Ultra SPARC 10. The range scanned for each pixel is 1/4 of the image width; the correlation window size is taken as … and 5x5 pixels for the respective resolutions.

Table 2. Timing results on an Ultra SPARC 10:

    ZNCC    CT        3D reconstruction
    … ms    550 ms    20 ms
    … ms    100 ms    10 ms

Figure 11 shows an extract of the reconstructed scene. It illustrates the performance of CT compared to ZNCC with respect to the characterisation of form and the object growing effect.
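The normal computation of equation (2) can be sketched over the reconstruction grid. This is an illustrative version, not the authors' code: it uses numpy's central differencing rather than the paper's 4-neighbourhood scheme, on an (H, W, 3) array of 3D points.

```python
import numpy as np

def normals(pts):
    """Per-point unit normals of an (H, W, 3) grid of 3D points, via the
    cross product of the partial derivatives, as in equation (2)."""
    xu = np.gradient(pts, axis=1)   # derivative along image columns
    xv = np.gradient(pts, axis=0)   # derivative along image rows
    n = np.cross(xu, xv)
    norm = np.linalg.norm(n, axis=2, keepdims=True)
    return n / np.where(norm == 0, 1.0, norm)   # avoid division by zero
```

Histogramming one component of these normals over a reconstructed flat board gives the distribution whose spread fig. 9(b) uses to quantify directional accuracy: for a perfect plane every normal is identical and the histogram collapses to a delta peak.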
4 Head detection using data fusion

Although the prototype system's main goal is the classification of the seat occupancy (e.g. child seat, object, passenger), the results obtained with the system allow the exploration of head detection inside the vehicle. Within the context of passenger safety, the head plays a central role: locating the head with respect to the dashboard is an important issue, and fast, robust techniques need to be developed. We are currently exploring a system which fuses 3D and intensity information. Since the vehicle cockpit offers a well-defined environment, a narrow region of interest (ROI) can be defined in which head detection based on skin tone is performed. To eliminate false alarms, we envisage a curvature estimation on the detected points. For this phase, we rely on the better precision of CT over ZNCC, which makes a finer characterisation of form possible. The flow chart of the system is outlined in fig. 10.

Figure 10. Flow chart (left/right camera, rectification, skin tone detection, ROI, disparity estimation, 3D reconstruction, data fusion, shape).

5 Conclusions

In this paper, we compared the ZNCC and CT matching algorithms for stereo vision. The data indicates precision within the centimetre range for distances up to 2 metres, and processing rates of 1-10 Hz on our system. CT performs better in terms of accuracy and speed. We examined different validity tests to filter out false matches. The inverse validity test proved to be the most efficient, but does not filter out all unreliable matches. While examining the signal (peak, difference and variance) offers no further improvement for ZNCC, the difference test proves reliable for further suppressing unwanted matches in the case of CT. In particular, the object growing effect is handled much better in this case: CT avoids it not by a better estimate but by more efficient filtering, and the response of "unknown" is much preferred over erroneous estimates.
Compared with other solutions [5], CT requires no extra processing time and is well suited for real-time applications. A strong interest exists among car manufacturers in the concept of occupant detection using video sensors. Smart airbag concepts with partial inflation capabilities should take into account the information provided by such a perception system. Our prototype system shows encouraging results for the problem of head detection within the vehicle cockpit. Future work is planned to further exploit the results of stereo using CT in order to accurately position the passenger's head with respect to the airbag.

References

[1] R. Zabih, J. Woodfill, "Non-parametric Local Transforms for Computing Visual Correspondence", Third European Conference on Computer Vision, Stockholm, Sweden, 1994.
[2] J. Banks, M. Bennamoun, P. Corke, "Fast and robust stereo matching algorithms for mining automation", Proc. of the Int. Workshop on Image Analysis and Information Fusion, Adelaide, Australia.
[3] L. Matthies, P. Grandjean, "Stochastic Performance Modeling and Evaluation of Obstacle Detectability with Imaging Range Sensors", IEEE Trans. Robotics and Automation, Vol. 10, No. 6, 1994.
[4] P. Lassere, P. Grandjean, "Stereo Vision Improvement", 7th Int. Conf. on Advanced Robotics, Guixols, Spain.
[5] L. Robert, R. Deriche, "Dense Depth Map Reconstruction: A Minimization and Regularization Approach which Preserves Discontinuities", Fourth European Conference on Computer Vision, Cambridge, UK, 1996.
Figure 11. Reconstruction examples: (a) disparity map ZNCC, (b) correlated pixels ZNCC, (c) disparity map CT, (d) correlated pixels CT, (e) ZNCC, (f) CT.
More informationModel-Based Stereo. Chapter Motivation. The modeling system described in Chapter 5 allows the user to create a basic model of a
96 Chapter 7 Model-Based Stereo 7.1 Motivation The modeling system described in Chapter 5 allows the user to create a basic model of a scene, but in general the scene will have additional geometric detail
More informationTowards Autonomous Vehicle. What is an autonomous vehicle? Vehicle driving on its own with zero mistakes How? Using sensors
7 May 2017 Disclaimer Towards Autonomous Vehicle What is an autonomous vehicle? Vehicle driving on its own with zero mistakes How? Using sensors Why Vision Sensors? Humans use both eyes as main sense
More information3D Computer Vision. Depth Cameras. Prof. Didier Stricker. Oliver Wasenmüller
3D Computer Vision Depth Cameras Prof. Didier Stricker Oliver Wasenmüller Kaiserlautern University http://ags.cs.uni-kl.de/ DFKI Deutsches Forschungszentrum für Künstliche Intelligenz http://av.dfki.de
More informationCentre for Digital Image Measurement and Analysis, School of Engineering, City University, Northampton Square, London, ECIV OHB
HIGH ACCURACY 3-D MEASUREMENT USING MULTIPLE CAMERA VIEWS T.A. Clarke, T.J. Ellis, & S. Robson. High accuracy measurement of industrially produced objects is becoming increasingly important. The techniques
More informationCOMPARATIVE STUDY OF DIFFERENT APPROACHES FOR EFFICIENT RECTIFICATION UNDER GENERAL MOTION
COMPARATIVE STUDY OF DIFFERENT APPROACHES FOR EFFICIENT RECTIFICATION UNDER GENERAL MOTION Mr.V.SRINIVASA RAO 1 Prof.A.SATYA KALYAN 2 DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING PRASAD V POTLURI SIDDHARTHA
More informationA virtual tour of free viewpoint rendering
A virtual tour of free viewpoint rendering Cédric Verleysen ICTEAM institute, Université catholique de Louvain, Belgium cedric.verleysen@uclouvain.be Organization of the presentation Context Acquisition
More informationColorado School of Mines. Computer Vision. Professor William Hoff Dept of Electrical Engineering &Computer Science.
Professor William Hoff Dept of Electrical Engineering &Computer Science http://inside.mines.edu/~whoff/ 1 Stereo Vision 2 Inferring 3D from 2D Model based pose estimation single (calibrated) camera > Can
More informationRevising Stereo Vision Maps in Particle Filter Based SLAM using Localisation Confidence and Sample History
Revising Stereo Vision Maps in Particle Filter Based SLAM using Localisation Confidence and Sample History Simon Thompson and Satoshi Kagami Digital Human Research Center National Institute of Advanced
More informationOutline. ETN-FPI Training School on Plenoptic Sensing
Outline Introduction Part I: Basics of Mathematical Optimization Linear Least Squares Nonlinear Optimization Part II: Basics of Computer Vision Camera Model Multi-Camera Model Multi-Camera Calibration
More informationparco area delle Scienze, 181A via Ferrata, , Parma 27100, Pavia
Proceedings of the IEEE Intelligent Vehicles Symposium 2000 Dearbon (MI), USA October 3-5, 2000 Stereo Vision-based Vehicle Detection M. Bertozzi 1 A. Broggi 2 A. Fascioli 1 S. Nichele 2 1 Dipartimento
More informationCHAPTER 3 DISPARITY AND DEPTH MAP COMPUTATION
CHAPTER 3 DISPARITY AND DEPTH MAP COMPUTATION In this chapter we will discuss the process of disparity computation. It plays an important role in our caricature system because all 3D coordinates of nodes
More informationRobust color segmentation algorithms in illumination variation conditions
286 CHINESE OPTICS LETTERS / Vol. 8, No. / March 10, 2010 Robust color segmentation algorithms in illumination variation conditions Jinhui Lan ( ) and Kai Shen ( Department of Measurement and Control Technologies,
More informationCS 4495/7495 Computer Vision Frank Dellaert, Fall 07. Dense Stereo Some Slides by Forsyth & Ponce, Jim Rehg, Sing Bing Kang
CS 4495/7495 Computer Vision Frank Dellaert, Fall 07 Dense Stereo Some Slides by Forsyth & Ponce, Jim Rehg, Sing Bing Kang Etymology Stereo comes from the Greek word for solid (στερεο), and the term can
More informationDigital Image Processing COSC 6380/4393
Digital Image Processing COSC 6380/4393 Lecture 21 Nov 16 th, 2017 Pranav Mantini Ack: Shah. M Image Processing Geometric Transformation Point Operations Filtering (spatial, Frequency) Input Restoration/
More informationLecture 6: Edge Detection
#1 Lecture 6: Edge Detection Saad J Bedros sbedros@umn.edu Review From Last Lecture Options for Image Representation Introduced the concept of different representation or transformation Fourier Transform
More informationEdge detection. Goal: Identify sudden. an image. Ideal: artist s line drawing. object-level knowledge)
Edge detection Goal: Identify sudden changes (discontinuities) in an image Intuitively, most semantic and shape information from the image can be encoded in the edges More compact than pixels Ideal: artist
More informationRecognition of Object Contours from Stereo Images: an Edge Combination Approach
Recognition of Object Contours from Stereo Images: an Edge Combination Approach Margrit Gelautz and Danijela Markovic Institute for Software Technology and Interactive Systems, Vienna University of Technology
More informationEmbedded real-time stereo estimation via Semi-Global Matching on the GPU
Embedded real-time stereo estimation via Semi-Global Matching on the GPU Daniel Hernández Juárez, Alejandro Chacón, Antonio Espinosa, David Vázquez, Juan Carlos Moure and Antonio M. López Computer Architecture
More informationCS 4495 Computer Vision A. Bobick. Motion and Optic Flow. Stereo Matching
Stereo Matching Fundamental matrix Let p be a point in left image, p in right image l l Epipolar relation p maps to epipolar line l p maps to epipolar line l p p Epipolar mapping described by a 3x3 matrix
More informationVehicle Dimensions Estimation Scheme Using AAM on Stereoscopic Video
Workshop on Vehicle Retrieval in Surveillance (VRS) in conjunction with 2013 10th IEEE International Conference on Advanced Video and Signal Based Surveillance Vehicle Dimensions Estimation Scheme Using
More informationDISTANCE MEASUREMENT USING STEREO VISION
DISTANCE MEASUREMENT USING STEREO VISION Sheetal Nagar 1, Jitendra Verma 2 1 Department of Electronics and Communication Engineering, IIMT, Greater Noida (India) 2 Department of computer science Engineering,
More informationStereo. 11/02/2012 CS129, Brown James Hays. Slides by Kristen Grauman
Stereo 11/02/2012 CS129, Brown James Hays Slides by Kristen Grauman Multiple views Multi-view geometry, matching, invariant features, stereo vision Lowe Hartley and Zisserman Why multiple views? Structure
More informationComputer Vision Lecture 17
Computer Vision Lecture 17 Epipolar Geometry & Stereo Basics 13.01.2015 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Announcements Seminar in the summer semester
More informationProceedings of the 6th Int. Conf. on Computer Analysis of Images and Patterns. Direct Obstacle Detection and Motion. from Spatio-Temporal Derivatives
Proceedings of the 6th Int. Conf. on Computer Analysis of Images and Patterns CAIP'95, pp. 874-879, Prague, Czech Republic, Sep 1995 Direct Obstacle Detection and Motion from Spatio-Temporal Derivatives
More informationComputer Vision Lecture 17
Announcements Computer Vision Lecture 17 Epipolar Geometry & Stereo Basics Seminar in the summer semester Current Topics in Computer Vision and Machine Learning Block seminar, presentations in 1 st week
More information3D Computer Vision. Structured Light II. Prof. Didier Stricker. Kaiserlautern University.
3D Computer Vision Structured Light II Prof. Didier Stricker Kaiserlautern University http://ags.cs.uni-kl.de/ DFKI Deutsches Forschungszentrum für Künstliche Intelligenz http://av.dfki.de 1 Introduction
More informationAccurate 3D Face and Body Modeling from a Single Fixed Kinect
Accurate 3D Face and Body Modeling from a Single Fixed Kinect Ruizhe Wang*, Matthias Hernandez*, Jongmoo Choi, Gérard Medioni Computer Vision Lab, IRIS University of Southern California Abstract In this
More informationKinect Cursor Control EEE178 Dr. Fethi Belkhouche Christopher Harris Danny Nguyen I. INTRODUCTION
Kinect Cursor Control EEE178 Dr. Fethi Belkhouche Christopher Harris Danny Nguyen Abstract: An XBOX 360 Kinect is used to develop two applications to control the desktop cursor of a Windows computer. Application
More informationImage Processing Fundamentals. Nicolas Vazquez Principal Software Engineer National Instruments
Image Processing Fundamentals Nicolas Vazquez Principal Software Engineer National Instruments Agenda Objectives and Motivations Enhancing Images Checking for Presence Locating Parts Measuring Features
More informationComplex Sensors: Cameras, Visual Sensing. The Robotics Primer (Ch. 9) ECE 497: Introduction to Mobile Robotics -Visual Sensors
Complex Sensors: Cameras, Visual Sensing The Robotics Primer (Ch. 9) Bring your laptop and robot everyday DO NOT unplug the network cables from the desktop computers or the walls Tuesday s Quiz is on Visual
More informationNEW CONCEPT FOR JOINT DISPARITY ESTIMATION AND SEGMENTATION FOR REAL-TIME VIDEO PROCESSING
NEW CONCEPT FOR JOINT DISPARITY ESTIMATION AND SEGMENTATION FOR REAL-TIME VIDEO PROCESSING Nicole Atzpadin 1, Serap Askar, Peter Kauff, Oliver Schreer Fraunhofer Institut für Nachrichtentechnik, Heinrich-Hertz-Institut,
More informationMinimizing Noise and Bias in 3D DIC. Correlated Solutions, Inc.
Minimizing Noise and Bias in 3D DIC Correlated Solutions, Inc. Overview Overview of Noise and Bias Digital Image Correlation Background/Tracking Function Minimizing Noise Focus Contrast/Lighting Glare
More informationLocal Image preprocessing (cont d)
Local Image preprocessing (cont d) 1 Outline - Edge detectors - Corner detectors - Reading: textbook 5.3.1-5.3.5 and 5.3.10 2 What are edges? Edges correspond to relevant features in the image. An edge
More informationCompression of RADARSAT Data with Block Adaptive Wavelets Abstract: 1. Introduction
Compression of RADARSAT Data with Block Adaptive Wavelets Ian Cumming and Jing Wang Department of Electrical and Computer Engineering The University of British Columbia 2356 Main Mall, Vancouver, BC, Canada
More informationRobotics Programming Laboratory
Chair of Software Engineering Robotics Programming Laboratory Bertrand Meyer Jiwon Shin Lecture 8: Robot Perception Perception http://pascallin.ecs.soton.ac.uk/challenges/voc/databases.html#caltech car
More informationSensor Modalities. Sensor modality: Different modalities:
Sensor Modalities Sensor modality: Sensors which measure same form of energy and process it in similar ways Modality refers to the raw input used by the sensors Different modalities: Sound Pressure Temperature
More informationReal-time Stereo Vision for Urban Traffic Scene Understanding
Proceedings of the IEEE Intelligent Vehicles Symposium 2000 Dearborn (MI), USA October 3-5, 2000 Real-time Stereo Vision for Urban Traffic Scene Understanding U. Franke, A. Joos DaimlerChrylser AG D-70546
More informationDigital Image Forgery Detection Based on GLCM and HOG Features
Digital Image Forgery Detection Based on GLCM and HOG Features Liya Baby 1, Ann Jose 2 Department of Electronics and Communication, Ilahia College of Engineering and Technology, Muvattupuzha, Ernakulam,
More informationStructured Light. Tobias Nöll Thanks to Marc Pollefeys, David Nister and David Lowe
Structured Light Tobias Nöll tobias.noell@dfki.de Thanks to Marc Pollefeys, David Nister and David Lowe Introduction Previous lecture: Dense reconstruction Dense matching of non-feature pixels Patch-based
More informationHand-Eye Calibration from Image Derivatives
Hand-Eye Calibration from Image Derivatives Abstract In this paper it is shown how to perform hand-eye calibration using only the normal flow field and knowledge about the motion of the hand. The proposed
More informationOne category of visual tracking. Computer Science SURJ. Michael Fischer
Computer Science Visual tracking is used in a wide range of applications such as robotics, industrial auto-control systems, traffic monitoring, and manufacturing. This paper describes a new algorithm for
More informationSolution: filter the image, then subsample F 1 F 2. subsample blur subsample. blur
Pyramids Gaussian pre-filtering Solution: filter the image, then subsample blur F 0 subsample blur subsample * F 0 H F 1 F 1 * H F 2 { Gaussian pyramid blur F 0 subsample blur subsample * F 0 H F 1 F 1
More informationDense 3D Reconstruction. Christiano Gava
Dense 3D Reconstruction Christiano Gava christiano.gava@dfki.de Outline Previous lecture: structure and motion II Structure and motion loop Triangulation Wide baseline matching (SIFT) Today: dense 3D reconstruction
More informationKeywords: Thresholding, Morphological operations, Image filtering, Adaptive histogram equalization, Ceramic tile.
Volume 3, Issue 7, July 2013 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Blobs and Cracks
More informationOn Board 6D Visual Sensors for Intersection Driving Assistance Systems
On Board 6D Visual Sensors for Intersection Driving Assistance Systems S. Nedevschi, T. Marita, R. Danescu, F. Oniga, S. Bota, I. Haller, C. Pantilie, M. Drulea, C. Golban Sergiu.Nedevschi@cs.utcluj.ro
More informationTo Appear in IEEE International Symposium on Intelligent Vehicles in Parma, Italy, June 2004.
To Appear in IEEE International Symposium on Intelligent Vehicles in Parma, Italy, June 2004. Occupant Posture Analysis using Reflectance and Stereo Images for Smart Airbag Deployment Stephen J. Krotosky
More informationEstimating the wavelength composition of scene illumination from image data is an
Chapter 3 The Principle and Improvement for AWB in DSC 3.1 Introduction Estimating the wavelength composition of scene illumination from image data is an important topics in color engineering. Solutions
More informationMotion Estimation. There are three main types (or applications) of motion estimation:
Members: D91922016 朱威達 R93922010 林聖凱 R93922044 謝俊瑋 Motion Estimation There are three main types (or applications) of motion estimation: Parametric motion (image alignment) The main idea of parametric motion
More information3D Modeling of Objects Using Laser Scanning
1 3D Modeling of Objects Using Laser Scanning D. Jaya Deepu, LPU University, Punjab, India Email: Jaideepudadi@gmail.com Abstract: In the last few decades, constructing accurate three-dimensional models
More informationOverview. Related Work Tensor Voting in 2-D Tensor Voting in 3-D Tensor Voting in N-D Application to Vision Problems Stereo Visual Motion
Overview Related Work Tensor Voting in 2-D Tensor Voting in 3-D Tensor Voting in N-D Application to Vision Problems Stereo Visual Motion Binary-Space-Partitioned Images 3-D Surface Extraction from Medical
More informationPresented at the FIG Congress 2018, May 6-11, 2018 in Istanbul, Turkey
Presented at the FIG Congress 2018, May 6-11, 2018 in Istanbul, Turkey Evangelos MALTEZOS, Charalabos IOANNIDIS, Anastasios DOULAMIS and Nikolaos DOULAMIS Laboratory of Photogrammetry, School of Rural
More informationHuman Motion Detection and Tracking for Video Surveillance
Human Motion Detection and Tracking for Video Surveillance Prithviraj Banerjee and Somnath Sengupta Department of Electronics and Electrical Communication Engineering Indian Institute of Technology, Kharagpur,
More informationImage Processing
Image Processing 159.731 Canny Edge Detection Report Syed Irfanullah, Azeezullah 00297844 Danh Anh Huynh 02136047 1 Canny Edge Detection INTRODUCTION Edges Edges characterize boundaries and are therefore
More information