IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, VOL. 8, NO. 4, DECEMBER 2007

On Color-, Infrared-, and Multimodal-Stereo Approaches to Pedestrian Detection

Stephen J. Krotosky and Mohan Manubhai Trivedi

Abstract—This paper presents an analysis of color-, infrared-, and multimodal-stereo approaches to pedestrian detection. We design a four-camera experimental testbed consisting of two color and two infrared cameras for capturing and analyzing various configuration permutations for pedestrian detection. We incorporate this four-camera system in a test vehicle and conduct comparative experiments of stereo-based approaches to obstacle detection using unimodal color and infrared imageries. A detailed analysis of the color and infrared features used to classify detected obstacles into pedestrian regions is used to motivate the development of a multimodal solution to pedestrian detection. We propose a multimodal trifocal framework consisting of a stereo pair of color cameras coupled with an infrared camera. We use this framework to combine multimodal-image features for pedestrian detection and to demonstrate that the detection performance is significantly higher when color, disparity, and infrared features are used together. This result motivates experiments and discussion toward achieving multimodal-feature combination using a single color and a single infrared camera arranged in a cross-spectral stereo pair. We demonstrate an approach to registering multiple objects across modalities and provide an experimental analysis that highlights issues and challenges of pursuing the cross-spectral approach to multimodal and multiperspective pedestrian analysis.

Index Terms—Active safety, collision avoidance, intelligent vehicles, person detection, tracking.

I. INTRODUCTION

PEDESTRIAN safety is a problem of global significance. Of the 1.17 million yearly worldwide traffic fatalities, 65% are pedestrian-related [1].
In fully industrialized nations, pedestrian safety remains a high priority, with pedestrian fatalities accounting for 10.9% of all traffic deaths in the United States [2] and fatalities in Britain twice as likely for pedestrians as for vehicle occupants [3]. In rapidly industrializing countries, pedestrian fatalities are overwhelmingly more costly in both proportion and sheer volume. Studies have found that pedestrian fatalities accounted for over half of all traffic deaths in both China [4] and India [5]. Naturally, an issue of this impact has received significant attention from all aspects of the research community. Ongoing computer-vision research is making strides to detect and to track pedestrians from both moving vehicles and transportation infrastructure. These approaches to pedestrian detection use visual or infrared imagery [6] in both monocular and stereo-camera configurations. The choice of visual or infrared imagery is significant, as each provides disparate, yet complementary information about a scene. Visual cameras capture the reflective light properties of objects in a scene, whereas infrared cameras are sensitive to the thermal emissivity properties of the same objects. Features extracted from each modality can be used to determine the presence of pedestrians in a scene. Additionally, their combination can provide a level of feature robustness beyond what is readily obtained from a single camera type.

Manuscript received January 15, 2007; revised April 23, 2007 and June 21, 2007. This work was supported in part by the Technical Support Working Group and in part by the U.C. Discovery Grant. The Associate Editor for this paper was U. Nunes. The authors are with the Computer Vision and Robotics Research Laboratory, University of California, San Diego, CA USA. Color versions of one or more of the figures in this paper are available online. Digital Object Identifier /TITS
Additionally, multiple-camera systems have been incorporated into pedestrian detection in order to extract depth estimates, which are crucial to the tasks of collision mitigation and occlusion handling. To register unimodal stereo imagery, correspondence-matching techniques [7] are often sufficient. However, in a multimodal multiperspective system, the different appearance of objects in the visual and infrared imagery makes finding a robust correspondence technique challenging [8]. This paper presents research toward the development of a multimodal multiperspective system that can extract the features that are necessary for robust pedestrian detection. We design an experimental testbed consisting of two color and two infrared cameras for comparing multicamera approaches to pedestrian detection. We perform comparative experiments of stereo-based detection approaches using unimodal imagery, demonstrate the high obstacle-detection rate achievable with both color and infrared imageries, and analyze the features and properties of the color and infrared imageries that are useful in classifying the detected obstacles into pedestrian regions. From this analysis, we propose a multimodal trifocal framework consisting of a stereo pair of color cameras coupled with a single infrared camera. Using a calibrated three-camera setup allows accurate and robust registration of color, disparity, and infrared features using the properties of the trifocal tensor. We demonstrate that the combination of color, disparity, and infrared information can yield significant gains in pedestrian detection compared with detectors trained on only unimodal or stereo features. This result motivates experiments and discussion of a cross-spectral stereo framework for pedestrian detection.
Using a single color and a single infrared camera arranged in a stereo pair, we demonstrate an approach to registering color and infrared features and discuss the issues and challenges of pursuing the cross-spectral framework to multimodal and multiperspective pedestrian analysis.
II. RELATED RESEARCH

Our focus on pedestrian detection is concerned with the methodologies and challenges of conventional camera systems. Specifically, we will review studies that utilize the color and infrared imageries in single- and multicamera configurations. For a more comprehensive review of computer-vision-based approaches to pedestrian detection, we refer the reader to a recent survey paper by Gandhi and Trivedi [9]. Typically, to find pedestrians in crowded and varied scenes with a single camera, a trained set of features used to identify pedestrian regions is extracted. In color imagery, common features include Haar wavelet [10] or Gabor filter [11] responses, component-based gradient responses [12], image contours with mean field models [13], implicit shape models [14], and local receptive fields [15]. Similarly, features are extracted in monocular infrared approaches. Typically, the features extracted from the infrared imagery are selected for their relation to the unique thermal signature of humans that enables straightforward segmentation. Such features include thermal hotspots [16], body-model templates [17], shape-independent multidimensional histograms, inertial and contrast-based features [18], and histograms of oriented gradients [19]. The features extracted from monocular imagery are then typically used in a classification scheme using many positive and negative examples. The most common approach to classification is to use a support vector machine (SVM) [10], [12], [13], [15], [16], [19], [20]. Additional approaches to classification include template matching [17], [21], convolutional neural networks [22], and Chamfer distance matching [14]. While good pedestrian detection in monocular imagery can be achieved, a single-camera approach is limited in one critical area: accurate and reliable depth estimation.
To achieve this, a multicamera system is necessary, typically arranged in a stereo-vision configuration. Visual stereo-camera systems [23]-[25] utilized dense-stereo matching to identify candidate pedestrian regions and to determine their distance from the camera. Infrared-stereo-camera systems have followed, which combine the benefits of infrared features with the powerful depth estimation inherent in stereo vision [21], [26]. Additionally, a four-camera system that separately combines color-stereo and infrared-stereo systems has been investigated [27]. In typical stereo approaches to pedestrian detection, depth estimates yield an initial set of obstacle regions that can then be classified as pedestrians using monocular-image features.

III. STEREO-BASED PEDESTRIAN DETECTION

A fundamental step to analyzing pedestrians in stereo imagery is to detect obstacles and to localize their position in 3-D space. We adapt a classical approach to obstacle detection in stereo imagery proposed by Labayrade et al. [28], which utilizes the concept of v-disparity to identify obstacles in a scene. The v-disparity is a histogram of the stereo disparity image that accumulates the disparity values present in each row of the image. This histogram has been shown to be useful in identifying obstacles when the camera is relatively parallel to the imaged scene, so that objects appear at distinct planes in the disparity domain [24], [25], [27].

Fig. 1. Flowchart of the stereo disparity-based obstacle-detection algorithm.

A. Disparity-Based Obstacle Detection

Our goal is to provide a comparative analysis of color-stereo and infrared-stereo imageries for pedestrian detection. We use the v-disparity approach to obstacle detection so that it can be implemented for both color-stereo and infrared-stereo imageries without modification. We examine each approach's ability to generate robust stereo disparities for determining obstacle areas in a scene.
This comparison of low-level detection accuracy will lead to an evaluation of each camera type's potential for higher level pedestrian classification and analysis. Fig. 1 shows a flowchart of the obstacle-detection algorithm.

1) Dense-Stereo Matching: We first perform dense-stereo matching to yield disparity estimates of the imaged scene. We select the correspondence-matching algorithm by Konolige [29] for its ease of use and reliable disparity generation for both color-stereo and infrared-stereo imageries. Example disparity images from each approach are shown in Fig. 2.

2) u- and v-Disparity Image Generation: The u- and v-disparity images are histograms that bin the disparity values d for each column or row in the image, respectively. The resulting v-disparity histogram image indicates the density of disparities for each image row v, whereas the u-disparity image shows the density of disparities for each image column u. Fig. 3 shows an example of u-disparity images, and Fig. 4 shows the corresponding v-disparity images generated from the color-stereo and infrared-stereo disparity maps in Fig. 2. Notice that the u-disparity images in Fig. 3 show three distinct horizontal regions corresponding to the three pedestrians in the scene. It is these regions that we wish to detect in order to build candidate pedestrian areas. The region spanning
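Since the u- and v-disparity images are plain histograms over the disparity map, their construction can be sketched in a few lines. This is an illustrative numpy version of the idea, not the authors' implementation; the explicit loops favor clarity over speed:

```python
import numpy as np

def uv_disparity(disp, d_max=64):
    """Accumulate disparity histograms per column (u) and per row (v).

    disp: H x W integer disparity map; invalid pixels (< 0) are ignored.
    Returns (u_disp, v_disp): d_max x W and H x d_max histogram images.
    """
    h, w = disp.shape
    u_disp = np.zeros((d_max, w), dtype=np.int32)
    v_disp = np.zeros((h, d_max), dtype=np.int32)
    for v in range(h):
        for u in range(w):
            d = disp[v, u]
            if 0 <= d < d_max:
                u_disp[d, u] += 1   # disparity density per image column
                v_disp[v, d] += 1   # disparity density per image row
    return u_disp, v_disp
```

Horizontal runs at a fixed d in u_disp then mark obstacle spans, while in v_disp an upright obstacle appears as a vertical peak and the ground as a sloped line.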
the entire length at the top of the u-disparity image indicates the background plane and can be filtered from processing. Similarly, the v-disparity images in Fig. 4 show vertical peaks of high density for both the background plane and the range of disparities containing pedestrians. These regions also need to be detected to build pedestrian candidates. Additionally, the downward-sloping trend for each row in the v-disparity image is exploited to estimate the ground plane in the scene [28].

Fig. 2. Example disparity images from color- and infrared-stereo images. (a) Color. (b) Infrared.

Fig. 3. Example u-disparity images from color- and infrared-stereo images. (a) Color. (b) Infrared.

Fig. 4. Example v-disparity images from color- and infrared-stereo images along with the detected ground plane. (a) Color with ground plane. (b) Infrared with ground plane.

Fig. 5. ROI generation in u- and v-disparity images with color- and infrared-stereo images. (a) Color u-disparity. (b) Infrared u-disparity. (c) Color v-disparity. (d) Infrared v-disparity.

Fig. 6. Bounding-box candidates with color- and infrared-stereo images. (a) Color. (b) Infrared.

3) Ground-Plane Estimation: To estimate the ground plane, we extract candidate points in the v-disparity image. For each column corresponding to a disparity d in the v-disparity image, we select the lowest pixel location whose value is above a threshold as a candidate ground-plane point. The ground plane is estimated by fitting these candidate points to a line with a robust linear-regression scheme that uses weighted least squares, iteratively reweighted with the bisquare weighting function. Fig. 5(b) and (d) shows the v-disparity images for color-stereo and infrared-stereo imageries with the candidate ground-plane points in red and the fitted ground-plane estimate plotted in cyan.
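The candidate extraction and bisquare-reweighted line fit described above can be sketched as follows. This is an illustrative reconstruction; the threshold, iteration count, and scale constant are our choices, not values from the paper:

```python
import numpy as np

def fit_ground_plane(v_disp, thresh=5, iters=10):
    """Robust line fit v = a*d + b to ground-plane candidate points.

    For each disparity column d, the lowest image row whose v-disparity
    count exceeds `thresh` is a candidate point. The line is fit by
    iteratively reweighted least squares with bisquare weights.
    """
    h, n_d = v_disp.shape
    ds, vs = [], []
    for d in range(n_d):
        rows = np.nonzero(v_disp[:, d] > thresh)[0]
        if rows.size:
            ds.append(d)
            vs.append(rows[-1])          # lowest pixel = largest row index
    ds, vs = np.array(ds, float), np.array(vs, float)
    A = np.stack([ds, np.ones_like(ds)], axis=1)
    w = np.ones_like(vs)
    for _ in range(iters):
        Aw = A * w[:, None]              # weighted least squares
        coef, *_ = np.linalg.lstsq(Aw, vs * w, rcond=None)
        r = vs - A @ coef
        s = 6.0 * np.median(np.abs(r)) + 1e-9   # robust residual scale
        u = np.clip(r / s, -1, 1)
        w = (1 - u**2) ** 2              # bisquare weights; outliers -> 0
    return coef                          # (slope a, intercept b)
```

Outlying candidates (e.g., an obstacle touching the bottom of a column) receive near-zero weight after a few iterations, so the fit locks onto the dominant ground line.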
Using dense stereo with robust point-candidate generation and iterative line fitting, we obtain robust ground-plane estimates in both color- and infrared-stereo imageries.

4) Candidate-Bounding-Box Generation: Bounding-box candidates can be extracted from regions of interest (ROIs) in the u- and v-disparity images. The ROIs in the u-disparity image are extracted by scanning the rows of the image for continuous spans where the histogram value exceeds the given threshold. Fig. 5(a) and (b) overlays the extracted regions in green on the u-disparity image. The ROIs are extracted from the v-disparity image by selecting columns where the sum of the histogram values above the ground plane is greater than the threshold. The ROI spans from the ground plane to the highest point in the column that exceeds the given threshold. Fig. 5(c) and (d) shows the extracted regions in green on the v-disparity image. The candidate bounding boxes are selected from the ROIs in the u- and v-disparity images based on their disparity values. For a given disparity d, the widths of the bounding boxes are determined by the ROIs found in the u-disparity image, and the heights are derived from the ROIs in the v-disparity image. Large bounding boxes associated with background regions are filtered, and the remaining candidates are shown in Fig. 6.

5) Candidate Filtering and Merging: As shown in Fig. 6, there are often multiple overlapping candidate bounding boxes
generated. This occurs because the disparities associated with a single pedestrian span a range of values, particularly as the pedestrian moves closer to the camera. We merge significantly overlapping candidates if the disparities that are associated with the bounding boxes are close. The final pedestrian-candidate bounding boxes are shown in Fig. 7. Notice how the overlapping candidates have merged into the correct bounding boxes corresponding to the pedestrians in the scene.

Fig. 7. Example of the final selection of pedestrian candidates after bounding-box merging with color- and infrared-stereo images. (a) Color. (b) Infrared.

Fig. 8. Experimental testbed: Two color cameras and two infrared cameras arranged in stereo pairs and mounted to the front of the LISA-P testbed.

B. Experimental Framework and Testbed

We establish a framework for experimenting with and analyzing pedestrian-detection approaches to facilitate a direct side-by-side comparison of the data coming from color-stereo and infrared-stereo imageries. A custom rig was designed, consisting of a matched color-stereo pair and a matched infrared-stereo pair. The two pairs share identical baselines and are aligned in pitch, roll, and yaw to maximize the similarities in the field of view. Calibration data were obtained by illuminating a checkerboard pattern with high-intensity halogen bulbs so that the checks would be visible in both color and infrared imageries, and standard calibration techniques could be applied to obtain the intrinsic and extrinsic parameters of the cameras. The calibrated rig was mounted on the grill of the Laboratory for Intelligent and Safe Automobiles (LISA)-P testbed [30], [31], a Volkswagen Passat equipped with the computing, power, and cabling requirements necessary to synchronously capture and save the four simultaneous camera streams. Fig.
8 shows the four-camera rig properly arranged and mounted on the LISA-P.

Fig. 9. Merge and miss errors from pedestrian-candidate generation. (a) Color merged. (b) IR merged. (c) Color missed. (d) IR missed.

C. Experimental Analysis of Disparity-Based Obstacle Detection in Color- and Infrared-Stereo Imageries

Experiments were conducted so that multiple pedestrians walk in front of the LISA-P testbed with varying degrees of depth, complexity, and occlusion. To allow for direct comparison, color and infrared videos were captured synchronously and were analyzed using the disparity-based obstacle-detection algorithm in Section III-A. Successful detection was indicated by a bounding box that is correctly overlaid on a corresponding pedestrian region. If our merging process combined two separate pedestrian regions, we consider the detection correct, yet note it as a merge error [Fig. 9(a) and (b)]. We reason that errors associated with the lack of sophistication in our chosen merging algorithm should not adversely affect the detection rate, as the desire is to evaluate the effectiveness of identifying pedestrian regions and not the robustness of the merging procedure. This is also a fair assessment for collision mitigation, as finding all the critical areas in the scene is given priority over discerning the merged bounding boxes. Therefore, false negatives were counted only if a pedestrian region was missed [Fig. 9(c) and (d)], and false positives were counted when a bounding box selected a nonpedestrian region. Still, had we incorporated the merge errors, the total detection rate would decrease by only 1% for the color and 1.4% for the infrared. Table I shows the compiled results of the comparative experiments, and Fig. 10 shows additional examples of detection.

IV. STEREO-BASED PEDESTRIAN-DETECTION ANALYSIS

Our comparative experiments in Section III with stereo-based pedestrian detection for the color and infrared imageries indicate a very high level of detection accuracy and a low
false-positive rate in both modalities. However, we provide a deeper analysis to help understand and evaluate the success of these experiments.

TABLE I. COMPARISON BETWEEN COLOR- AND INFRARED-STEREO IMAGERIES FOR DISPARITY-BASED OBSTACLE DETECTION

We note that the difference in the pedestrian counts in Table I comes from the position and view differences of the color-stereo and infrared-stereo cameras. As only pedestrians that are fully visible in the image are considered, there are frames where a pedestrian is only visible in one modality. However, given the high number of examples, the detection rates can be directly compared despite the different tallies. The experiments yielded such a high rate of detection because the captured images did not include nonpedestrian obstacles, such as other vehicles or bicyclists, so any detected obstacle region is assumed to be a pedestrian. For our experiments, this assumption is appropriate, as we are interested in evaluating how color and infrared dense-stereo correspondences can be used in low-level pedestrian detection. In that respect, our experiments demonstrate that both achieve high rates of low-level obstacle detection, which is an imperative first step toward robust pedestrian detection and collision mitigation. However, in real-world driving scenarios, this is not sufficient for pedestrian detection. Detected obstacles can include a variety of objects found in common driving scenes other than pedestrians, and additional processing is necessary to filter the detected obstacles to identify pedestrians. For example, bounds on pedestrian bounding-box features, such as size, disparity, and aspect ratio, can be learned or heuristically selected to filter out bounding boxes associated with other objects in the scene [27].
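As an illustration of such heuristic bounds, a size/aspect filter can use the stereo depth Z = f·B/d and the pinhole model to convert a box's pixel height into meters. All numeric values here (focal length, baseline, height and aspect ranges) are hypothetical, chosen for the example rather than taken from the paper:

```python
def plausible_pedestrian(box, d, f=800.0, baseline=0.3,
                         h_range=(1.0, 2.2), ar_range=(0.2, 0.7)):
    """Heuristic filter on a candidate bounding box (illustrative bounds).

    box: (u0, v0, u1, v1) in pixels; d: disparity in pixels.
    Depth from stereo: Z = f * baseline / d; the metric height follows
    from the pinhole model: H = pixel_height * Z / f.
    """
    if d <= 0:
        return False
    z = f * baseline / d                          # depth in metres
    height_m = (box[3] - box[1]) * z / f          # metric height of the box
    aspect = (box[2] - box[0]) / max(box[3] - box[1], 1)  # width / height
    return h_range[0] <= height_m <= h_range[1] and \
           ar_range[0] <= aspect <= ar_range[1]
```

At d = 24 px (i.e., 10 m away under these assumed parameters), a 54 × 136 px box maps to a 1.7-m-tall, upright shape and passes, while a wide, car-like box fails on aspect ratio.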
However, such size-based filtering techniques will have difficulty with nonpedestrian bounding boxes that fall within the selected bounds of pedestrian candidates. Additionally, the selection of appropriately robust bounds is a challenging task, as bounding-box sizes can vary significantly with changes in pedestrian pose and disparity fidelity. To achieve a more reliable detection of pedestrian candidates, it is necessary to use discriminant image features in a learning framework, such as those discussed in Section II. While justification can be made for selecting either color or infrared features for pedestrian detection, a more interesting proposition is to use both, so that a system can incorporate a much larger set of discriminant image features to improve detection. For example, the thermal hotspots of humans that often make pedestrians easily segmentable can be combined with the color-segmentation features common to challenging tasks, such as detecting articulated poses for classifying human interactions [32]. Although stereo color and infrared analyses can be separately combined [27], a more economical and desirable solution would be to combine the color, disparity, and infrared features in an integrated detection framework. In Section V, we propose a multimodal trifocal framework consisting of a stereo pair of color cameras coupled with a single infrared camera. Such a setup allows for accurate and robust registration of the color and infrared imageries using the trifocal tensor. We use this registration framework to design a pedestrian detector that integrates color, disparity, and infrared features and yields higher detection rates than using separate features. In Section VI, we investigate the feasibility of this integrated detection framework using a minimum-camera cross-spectral stereo system with a single color and a single infrared camera.
The challenge is to register image features in cross-spectral stereo, where conventional and state-of-the-art stereo-correspondence algorithms fail due to the disparate nature of the color and infrared imageries. As a step toward a dense-stereo algorithm for cross-spectral stereo imagery, we propose a stereo-registration algorithm for multimodal imagery [8], evaluate its applicability to pedestrian detection, and highlight the challenges of achieving robustness in this framework.

V. MULTIMODAL TRIFOCAL FRAMEWORK FOR PEDESTRIAN DETECTION

The benefits of color-, disparity-, and infrared-image features can be incorporated using a three-camera approach consisting of a standard color-stereo rig paired with a single infrared camera. The trifocal framework, shown in Fig. 11, uses disparity estimates from the stereo imagery to register corresponding pixels in the infrared imagery. This can be done quickly and efficiently with the trifocal tensor, the set of matrices relating the correspondences between the three images. The trifocal tensor can be estimated by minimizing the algebraic error of point correspondences [33]. The point correspondences can be obtained for trifocal imagery using the same calibration techniques used for stereo calibration, where the calibration board is visible in each trifocal view. While only seven point-point-point correspondences are required to compute the trifocal tensor, in practice, we use many more correspondences to smooth errors in the point estimates. The resulting trifocal tensor is written as T = [T_1, T_2, T_3], where T_i is a 3 × 3 matrix for the ith image in the set. From this tensor notation, standard two-view geometry parameters, such as the fundamental matrices F, the epipoles e, and the projection matrices P, can be determined.
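The tensor slices T_i and the point transfer they support can be exercised in a short sketch. This is our illustration, not the paper's implementation; it uses the standard canonical-camera construction T_i = a_i b_4^T − a_4 b_i^T and the point-line-point transfer from multiple-view geometry texts, assuming the first camera is P_1 = [I | 0]:

```python
import numpy as np

def trifocal_from_cameras(P2, P3):
    """Trifocal tensor slices for canonical cameras P1 = [I | 0].

    T_i = a_i b4^T - a4 b_i^T, with a_i, b_i the columns of P2, P3.
    """
    a, b = P2, P3                                 # 3x4 camera matrices
    return [np.outer(a[:, i], b[:, 3]) - np.outer(a[:, 3], b[:, i])
            for i in range(3)]

def transfer_point(T, x1, x2):
    """Transfer a correspondence x1 <-> x2 to the third view.

    x3^k = x1^i l2_j T_i^{jk}, where l2 is any line through x2 other
    than the epipolar line of x1 (a generic choice is used here).
    """
    M = sum(x1[i] * T[i] for i in range(3))       # 3x3 contraction over i
    l2 = np.cross(x2, [1.0, 0.7, -0.3])           # some line through x2
    return M.T @ l2                               # homogeneous point x3
```

The transferred point is only defined up to scale, as usual for homogeneous coordinates, so agreement is checked by collinearity with the true projection.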
Additionally, given a point correspondence x ↔ x′, we can estimate the point transfer to the third-image point x″ as

[x′]_× ( Σ_i x^i T_i ) [x″]_× = 0_{3×3}.  (1)

The dense-stereo matching gives the x ↔ x′ correspondences, and the infrared point transfer is estimated and aligned
to the color reference image. Fig. 12 shows an example set of the registered trifocal imagery.

Fig. 10. Example of the final selection of pedestrian candidates with color- and infrared-stereo input images.

Fig. 11. Flowchart of the trifocal-tensor approach to pedestrian detection for the color-stereo and infrared framework.

Fig. 12. Registered color, disparity, and infrared imageries using the trifocal tensor. (a) Color. (b) Disparity. (c) Aligned infrared.

A. Experimental Evaluation of Pedestrian Detection Using Color-, Disparity-, and Infrared-Image Features

To determine the effect of using multimodal features for pedestrian detection, we use the trifocal framework to register the color, disparity, and infrared imageries into a single five-channel multispectral image, allowing for the comparison of pedestrian detectors that make use of different combinations of image features. To train the detectors, positive pedestrian samples are manually annotated, and for each positive sample, ten negative samples are generated by moving the positive bounding box to a random nonoverlapping position in the image. All samples are resized to a common size (24 × 60 pixels), as shown in Fig. 13. We elect to extract histogram-of-oriented-gradient features similar to those proposed by Dalal and Triggs [34]. For each of the color, disparity, and infrared images, we compute an X × Y × Θ-element histogram, where X, Y, and Θ are the numbers of histogram bins in width, height, and gradient orientation, respectively. For our experiments, we use a histogram resulting in a 128-element feature vector for each image type. This descriptor was selected on the notion that gradient information can discriminate a pedestrian from other objects.
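The per-channel descriptor can be sketched as follows. Note that the 4 × 4 × 8 bin split is our assumption (the text states only the 128-element total, and 4 · 4 · 8 = 128), and this simplified, unnormalized histogram merely stands in for the HOG-style descriptor of [34]:

```python
import numpy as np

def grad_orientation_hist(img, nx=4, ny=4, nth=8):
    """X*Y*Theta magnitude-weighted histogram of gradient orientations.

    Bin counts nx=ny=4, nth=8 are assumed, matching the stated
    128-element vector; the paper's exact binning is not given.
    """
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)       # unsigned orientation
    h, w = img.shape
    hist = np.zeros((ny, nx, nth))
    ys = np.minimum((np.arange(h) * ny) // h, ny - 1)   # row -> cell
    xs = np.minimum((np.arange(w) * nx) // w, nx - 1)   # col -> cell
    ts = np.minimum((ang * nth / np.pi).astype(int), nth - 1)
    for v in range(h):
        for u in range(w):
            hist[ys[v], xs[u], ts[v, u]] += mag[v, u]
    return hist.ravel()                           # 128-element vector

def multispectral_feature(color_gray, disparity, infrared):
    """Concatenate per-modality descriptors into one feature vector."""
    return np.concatenate([grad_orientation_hist(c)
                           for c in (color_gray, disparity, infrared)])
```

Concatenating the three 128-element vectors yields the 384-element color + disparity + infrared representation that a single SVM can then be trained on.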
While we make no claims of feature optimality, gradient-based features are common in the pedestrian-detection literature, and we feel that their use is sufficient for evaluating the effect of multispectral-image features on detection accuracy. We train pedestrian detectors for all combinations of the color, disparity, and infrared features using an SVM with a radial basis function as the kernel type [35]. We train each SVM using 865 annotated positive samples (and 8650 negative samples) collected from video obtained while driving the LISA-P testbed in store parking lots and local roads in La Jolla, California. Similarly, we evaluate using a test set of 641 positive samples and 6410 negative samples from a separate set of videos obtained while driving the LISA-P. Pedestrians in the training and testing sets range from approximately 3 to 30 m from the vehicle. The resulting receiver-operating-characteristic (ROC) curves are plotted in Fig. 14, and detection rates for a 5% false-positive rate are shown in Table II. The pedestrian detector that combines the color, disparity, and infrared features outperforms the other detectors by a significant margin. By integrating the features, we exploit the complementary nature of multimodal imagery to yield more than a 5% increase in detection for a 5% false-positive rate. We also note that the combinations of color + infrared and color + disparity do not outperform the detector that is trained only on color. We suspect that this is because gradient-based features are not suitable for discriminating pedestrians in the low-contrast disparity and infrared images. This drop in performance is evident in the detectors that are trained only on disparity or infrared. Given the relatively low number of positive samples, the addition of only disparity or infrared seems only to add noise. It is then all the more interesting
that the color + disparity + infrared-trained detector performs so well. The discriminant gains from combining all the features greatly outweigh the noise added from nonideal gradient features. We anticipate that greater gains in accuracy could be achieved by using more discriminant features in each image spectrum.

Fig. 13. Selection of positive and negative samples used for training pedestrian detectors. Each sample consists of color, disparity, and infrared images. (a) Positive samples. (b) Negative samples.

Fig. 14. ROC for pedestrian detection. The combination of color, disparity, and infrared features performs the best.

TABLE II. PEDESTRIAN-DETECTION RATE FOR 5% FALSE-POSITIVE RATE

VI. CROSS-SPECTRAL STEREO-CORRESPONDENCE MATCHING FOR PEDESTRIAN DETECTION

The multimodal trifocal framework demonstrates the benefit of integrating the color, disparity, and infrared features for pedestrian detection. While an attractive framework, its requirement of two color cameras for the stereo-correspondence matching is redundant from a feature perspective. We investigate achieving the stereo-correspondence matching using cross-spectral stereo: a single color and a single infrared camera. While a cross-spectral stereo system has the potential to integrate the color, disparity, and infrared detail, the nontrivial problem of accurate and robust stereo registration must first be resolved. Toward achieving this, we have developed an algorithm for matching regions in cross-spectral stereo images [8]. This approach gives a robust disparity estimation with statistical confidence values for images that have an initial object segmentation. Fig. 15 shows the algorithmic framework of the region-based stereo algorithm. The acquired and rectified image pairs are denoted as I_L, for the left color image, and I_R, for the right infrared image. Due to the high differences in the imaging characteristics, the matching is focused on the foreground pixels from an initial-segmentation estimate. To obtain the segmentation in a moving vehicle, we use an optical-flow-based approach to detect moving pedestrians in the scene [36]. Our experiments have shown that this approach is relatively robust at low speeds (<10 mi/h) and could be adapted for higher speeds with egomotion estimation. Low-speed analysis is useful in a variety of driving scenarios, including parking lots, residential and shopping areas, and starting or stopping at a traffic signal. Additionally, while stationary pedestrians pose a segmentation issue for optical-flow techniques, we expect that static objects above the ground can be identified through long-term tracking of the scene. Given the optical-flow estimates for motion in the horizontal m_u and vertical m_v directions, as well as occluded regions m_occ, we estimate the foreground regions F where there is motion in either the horizontal or vertical direction and no occlusion:

F = ((|m_u| > 0) ∨ (|m_v| > 0)) ∧ (m_occ = 0).  (2)

Morphological operations smooth the estimate. We denote the color and infrared foreground images as F_L and F_R, respectively, which are shown in Fig. 16. The color
The color image is also converted to grayscale for mutual-information-based matching. The matching is performed by fixing a window in one foreground image and sliding a correspondence window along the second image. Given the height h and width w of the image, for each column i ∈ {0, ..., w}, let W_L,i be a reference window in the left image of height h and width M. The width M is experimentally determined for a given scene and is typically less than the width of the target object in the scene; in our case, M = 31 pixels. The height h is the largest vertical span of the foreground within the reference window. The correspondence window W_R,i,d in the right image also has height h but is located at column i + d, where d is a disparity offset. For a given column i, a reference window is determined, and correspondence values are found for all d ∈ {d_min, ..., d_max}.

Given the two correspondence windows W_L,i and W_R,i,d, we first linearly quantize the image to N levels such that N ≤ √(Mh/8) [37], as this has been shown to give a good number of levels for maximizing the mutual information between image regions. The similarity between the two image patches can be measured by the mutual information between them, which is defined as

I(L, R) = Σ_{l,r} P_{L,R}(l, r) log [ P_{L,R}(l, r) / (P_L(l) P_R(r)) ]   (3)

where P_{L,R}(l, r) is the joint probability mass function (pmf), and P_L(l) and P_R(r) are the marginal pmfs of the left and right image patches, respectively. P_{L,R}(l, r) is computed as the normalized 2-D histogram of the image intensities, and the marginal pmfs are determined by summing along one dimension of this histogram. We define the mutual information between the two correspondence windows as I_{i,d}, where i is the center of the reference window and i + d is the center of the moving window. For each column i, we compute I_{i,d} for d ∈ {d_min, ..., d_max} and choose the best disparity d_i as the one that maximizes the mutual information

d_i = arg max_d I_{i,d}.   (4)

Fig. 15. Flowchart of region-based correspondence matching in cross-spectral stereo for pedestrian detection.
Fig. 16. Outlined foreground extraction for color and infrared images. (a) Color segmentation. (b) Infrared segmentation.

Fig. 17 shows example correspondence windows and a plot of the mutual information over the range of disparities. The red box in the color image is the reference window, and the green boxes in the infrared image are the candidate match windows. We assign a vote for d_i to all the foreground pixels in the reference window. Define a disparity voting matrix D_L of size (h, w, d_max − d_min + 1) over the range of disparities. Then, for each foreground pixel in a given reference window W_L,i, (u, v) ∈ (W_L,i ∩ F_L), we accumulate the disparity voting matrix at D_L(u, v, d_i). Since the correspondence windows are M pixels wide, each column in the disparity voting matrix will receive M votes. For each pixel (u, v), D_L can be thought of as a distribution of matching disparities from the correspondence windows. Since it is assumed that a single person is at a single distance from the camera, a good match should concentrate a large number of votes on a single disparity value, whereas a poor match will be distributed across the range of disparity values. The best disparity value and its corresponding confidence at each pixel are then found as

D*_L(u, v) = arg max_d D_L(u, v, d)   (5)
C*_L(u, v) = max_d D_L(u, v, d).   (6)

For a pixel (u, v), the value of C*_L(u, v) is the number of votes for the best disparity value D*_L(u, v). A higher confidence value indicates that the disparity maximized the mutual information for a large number of correspondence windows and, in turn, is more likely to be accurate.
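A minimal sketch of the mutual-information matching of (3) and (4), estimating the joint pmf with a 2-D histogram and sweeping the candidate disparities for one reference column. The function names, the fixed bin count, and the treatment of out-of-bounds windows are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def mutual_information(a, b, n_levels=32):
    """I(A;B) from the normalized joint histogram of two equal-size patches, as in (3)."""
    hist, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=n_levels)
    p_ab = hist / hist.sum()
    p_a = p_ab.sum(axis=1, keepdims=True)   # marginal pmf of patch A
    p_b = p_ab.sum(axis=0, keepdims=True)   # marginal pmf of patch B
    nz = p_ab > 0                           # only nonzero joint entries contribute
    return float((p_ab[nz] * np.log(p_ab[nz] / (p_a @ p_b)[nz])).sum())

def best_disparity(left, right, i, M, d_range):
    """Sketch of (4): slide the window over candidate disparities, keep the argmax."""
    h, w = left.shape
    ref = left[:, i:i + M]
    scores = []
    for d in d_range:
        if 0 <= i + d and i + d + M <= w:
            scores.append(mutual_information(ref, right[:, i + d:i + d + M]))
        else:
            scores.append(-np.inf)          # window falls outside the image
    return d_range[int(np.argmax(scores))]
```

Because mutual information depends only on the co-occurrence statistics of the two patches, the same score can be maximized across a grayscale and an infrared window even though their intensities are unrelated pixel-for-pixel.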
Fig. 17. Mutual information for finding corresponding windows in cross-spectral stereo imagery. (a) Color image. (b) Infrared image. (c) Mutual information for the correspondence window.
TABLE III. CROSS-SPECTRAL STEREO REGISTRATION OF PEDESTRIAN REGIONS
Fig. 18. Resulting disparity image D* from combining the left and right disparity images D*_L and D*_R, as defined in (7). (a) Disparity image. (b) Unaligned. (c) Aligned.

Values for D*_R and C*_R are similarly determined by making the right image the reference. The values of D*_R and C*_R are then shifted by their disparities so that they align to the left image. The aligned disparity images are then combined using an AND operation, which experimentally gives the most robust results. For all pixels (u, v) such that C*_L(u, v) > 0 and C*_R(u, v) > 0

D*(u, v) = D*_L(u, v), if C*_L(u, v) ≥ C*_R(u, v)
           D*_R(u, v), if C*_L(u, v) < C*_R(u, v).   (7)

The resulting disparity image D*(u, v) can be used to register multiple objects in the scene, even at very different depths from the camera. Fig. 18 shows the registration result for the images carried throughout the algorithmic derivation. Fig. 18(a)-(c) shows the disparity image D*, the initial alignment of the color and infrared images, and the alignment after shifting the foreground pixels by the resulting disparity image, respectively. The infrared foreground pixels are overlaid (in green) on the color foreground pixels (in purple). The cross-spectral stereo-correspondence matching successfully aligns the foreground areas of the three people in the scene.

A. Experimental Analysis of Cross-Spectral Stereo-Correspondence Matching for Pedestrian Detection

Using the same experiments performed in Section III-C, we analyze the cross-spectral stereo-correspondence matching of pedestrian regions in an outdoor environment.
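The vote-reduction and left-right combination steps of (5)-(7) can be sketched as follows. The voting matrices are assumed to be already accumulated, with the right-image votes shifted into the left image's coordinates; the sentinel value -1 for pixels without a match on both sides is our convention for this sketch:

```python
import numpy as np

def combine_votes(D_L, D_R_aligned):
    """Sketch of (5)-(7): reduce each voting matrix to a best-disparity map plus
    a confidence (vote count), then keep the higher-confidence side per pixel.

    D_L, D_R_aligned : (h, w, n_disp) vote accumulators; the third axis indexes
    disparities d_min..d_max, so an index k corresponds to disparity d_min + k.
    """
    d_best_L = D_L.argmax(axis=2)            # (5): disparity index with most votes
    conf_L = D_L.max(axis=2)                 # (6): number of votes it received
    d_best_R = D_R_aligned.argmax(axis=2)
    conf_R = D_R_aligned.max(axis=2)

    both = (conf_L > 0) & (conf_R > 0)       # AND of the two confidence maps
    d = np.where(conf_L >= conf_R, d_best_L, d_best_R)  # (7)
    d[~both] = -1                            # no agreed match at this pixel
    return d, np.maximum(conf_L, conf_R)
```

The per-pixel vote count acts as a self-diagnosed reliability measure: a tall, single-peaked vote distribution means many overlapping windows agreed on the same disparity.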
The goal was to demonstrate successful matching for configurations of people at different positions, distances from the camera, and levels of occlusion. We evaluate the registration by visually inspecting the alignment of the corresponding color and infrared pedestrian regions: visually well-aligned regions are considered correct, and misaligned, missing, or partially aligned regions are deemed incorrect. Table III summarizes our experimental analysis, and Fig. 19 shows examples of correct correspondence matching. Additional experiments [38] demonstrate robustness to different capture devices and environmental conditions.

Fig. 19. Cross-spectral stereo-registration results for pedestrian detection. (a) Color. (b) Infrared. (c) Unaligned. (d) Aligned.
Fig. 20. Disparity discontinuity errors in cross-spectral stereo analysis due to artifacting arising from windowed correspondence matching. (a) Color. (b) Infrared. (c) Disparity. (d) Aligned.

One challenge associated with this approach to cross-spectral stereo lies in the vertical artifacts from the multiple voting windows, which give the resulting registration hard vertical edges at disparity discontinuities. This is most evident when the inherent disparity discontinuity of occluding pedestrians is forced to a vertical edge, as shown in Fig. 20. Despite these artifacts, we still identify the two distinct obstacle regions. Additionally, incorporating subpixel interpolation would improve the registration, as the integer-based disparity matching of our approach can easily be off by a pixel in either direction of the correct match.

The initial segmentation, while necessary for the success of this algorithm, is limiting in several respects. First, segmentation is challenging, and the result can often be noisy, easily over- or underestimating the true object boundaries. We motivated the initial segmentation as a way of providing appropriately sized regions for matching the features in the color and infrared imageries. However, the very idea of an initial segmentation precludes registration estimates for regions that are not within the segmentation boundaries. Clearly, a better approach would be to register features from the entire image. Achieving this is an open research challenge that we are actively pursuing. We feel that a multifeature-matching approach that can integrate structural feature matching, such as edges, with pixel- or area-based matching is promising.

VII. DISCUSSION AND CONCLUDING REMARKS

The depth estimates obtained from the vehicle-mounted stereo imagery give rise to a v-disparity-based approach for extracting the obstacle regions from the scene.
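The v-disparity representation [28] accumulates one histogram of disparity values per image row, so that the ground plane appears as a slanted line and upright obstacles as near-vertical lines. A minimal sketch, where the integer disparity map layout and the use of a negative sentinel for invalid pixels are our assumptions:

```python
import numpy as np

def v_disparity(disp, n_disp):
    """Accumulate a v-disparity image: a per-row histogram of disparity values.

    disp   : (h, w) integer disparity map (negative entries treated as invalid)
    n_disp : number of disparity levels (columns of the v-disparity image)
    """
    h, w = disp.shape
    vdisp = np.zeros((h, n_disp), dtype=np.int32)
    for v in range(h):
        row = disp[v]
        valid = row >= 0
        # Count how many pixels in this row take each disparity value.
        vdisp[v] = np.bincount(row[valid], minlength=n_disp)[:n_disp]
    return vdisp
```

Obstacle extraction then reduces to detecting strong vertical line segments in this much smaller h x n_disp image rather than reasoning over the full disparity map.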
We have outlined such an algorithm and have provided comparative experiments indicating that color- and infrared-based stereo disparities are both capable of highly accurate pedestrian detection (> 98%) with low false positives (< 1%). Given these high detection rates, the selection of an appropriate camera system for pedestrian detection turns on each modality's ability to classify the detected obstacles as pedestrians. Because of the disparate physical processes that yield color and thermal images, extractable features are largely unique to each modality. As previous approaches have demonstrated that both color- and infrared-image features can be used for classifying pedestrians, we propose a multimodal trifocal framework that integrates color, depth, and infrared features for pedestrian detection.

The multimodal trifocal solution pairs a color-stereo rig with a single infrared camera to accurately register pixels in each image. We use this framework to demonstrate that integrating color, disparity, and infrared features for training a pedestrian detector yields improved accuracy over detectors that use only unimodal or stereo features. From a cost-benefit perspective, we suggest that the multimodal trifocal framework is likely the best approach, as it achieves the benefits of multimodality seen in solutions with more cameras, yet maintains a robustness not yet achieved by two-camera cross-spectral solutions.

Future areas for investigation include a more extensive evaluation of the color, disparity, and infrared features. Additionally, an integrated object-candidate generation and pedestrian-detection algorithm using the multimodal trifocal framework would be useful for evaluating robustness to various lighting and environmental conditions.

In cross-spectral stereo analysis, the disparate nature of the multimodal imagery that we hope to exploit in feature extraction makes correspondence matching challenging.
We have established an object-level registration scheme for correspondences and have experimentally demonstrated successful registration of object regions across the color and infrared imageries. The 87% registration rate shows the feasibility of creating a multimodal-feature set in a cross-spectral stereo framework. Although the initial-segmentation requirement places limits on the generality and robustness of the approach, we feel that this is a good first step toward the development of a cross-spectral stereo-correspondence algorithm that generates disparity images similar to those of conventional stereo algorithms for unimodal imagery. We believe that advancement may be obtained by exploring multiple-feature or hierarchical matching schemes that can integrate structural feature matching, such as edges, with pixel- or area-based matching.

These multimodal and multiperspective approaches provide insight into the overall active-safety paradigm. Pedestrian safety is one of the many aspects of the driving environment that needs to be monitored to ensure safety in the vehicle and the surrounding areas [31]. The multimodal-feature set that is extractable from a multimodal trifocal or cross-spectral stereo solution could provide a robust and unified framework for analyzing the vehicular environment [39], as well as higher level driver-intent analysis, such as lane changing [40], turning [20], or braking [41].

REFERENCES
[1] [Online]. Available: safety.htm
[2] "Traffic safety facts 2004: A compilation of motor vehicle crash data from the fatality analysis reporting system and the general estimates system," Nat. Highway Traffic Safety Assoc., U.S. Dept. Transp. [Online]. Available: TSFAnn/TSF2004.pdf
[3] J. R. Crandall, K. S. Bhalla, and N. J. Madeley, "Designing road vehicles for pedestrian protection," Brit. Med. J., vol. 324, no. 7346, pp. , May
[4] D. Mohan, "Traffic safety and health in Indian cities," J. Transp. Infrastruct., vol. 9, no. 1, pp. ,
[5] S. K. Singh, "Review of urban transportation in India," J. Public Transp., vol. 8, no. 1, pp. ,
[6] Y. Fang, K. Yamada, Y. Ninomiya, B. Horn, and I. Masaki, "Comparison between infrared-image-based and visible-image-based approaches for pedestrian detection," in Proc. IEEE Intell. Veh. Symp., 2003, pp.
[7] D. Scharstein and R. Szeliski, "Middlebury College stereo vision research page," [Online]. Available:
[8] S. J. Krotosky and M. M. Trivedi, "Mutual information based registration of multimodal stereo videos for person tracking," Comput. Vis. Image Underst., vol. 106, no. 2/3, pp. , May/Jun.
[9] T. Gandhi and M. M. Trivedi, "Pedestrian protection systems: Issues, survey, and challenges," IEEE Trans. Intell. Transp. Syst., vol. 8, no. 3, pp. , Sep.
[10] L. Andreone, F. Bellotti, A. De Gloria, and R. Laulette, "SVM-based pedestrian recognition on near-infrared images," in Proc. 4th Int. Symp. Image Signal Process. Anal., 2005, pp.
[11] H. Cheng, N. Zheng, and J. Qin, "Pedestrian detection using sparse Gabor filter and support vector machine," in Proc. IEEE Conf. Intell. Veh., 2005, pp.
[12] A. Shashua, Y. Gdalyahu, and G. Hayun, "Pedestrian detection for driving assistance systems: Single-frame classification and system level performance," in Proc. IEEE Conf. Intell. Veh., 2004, pp.
[13] Y. Wu, T. Yu, and G. Hua, "A statistical field model for pedestrian detection," in Proc. Comput. Vis. Pattern Recog., 2005, pp.
[14] B. Leibe, E. Seemann, and B. Schiele, "Pedestrian detection in crowded scenes," in Proc. Comput. Vis. Pattern Recog., 2005, pp.
[15] S. Munder and D. Gavrila, "An experimental study on pedestrian classification," IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 11, pp. , Nov.
[16] F. Xu, X. Liu, and K. Fujimura, "Pedestrian detection and tracking with night vision," IEEE Trans. Intell. Transp. Syst., vol. 6, no. 1, pp. , Mar.
[17] A. Broggi, A. Fascioli, P. Grisleri, T. Graf, and M. Meinecke, "Model-based validation approaches and matching techniques for automotive vision based pedestrian detection," in Proc. Comput. Vis. Pattern Recog., 2005, p. 1.
[18] Y. Fang, K. Yamada, Y. Ninomiya, B. K. P. Horn, and I. Masaki, "A shape-independent method for pedestrian detection with far-infrared images," IEEE Trans. Veh. Technol., vol. 53, no. 6, pp. , Nov.
[19] F. Suard, A. Rakotomamonjy, A. Bensrhair, and A. Broggi, "Pedestrian detection using infrared images and histograms of oriented gradients," in Proc. IEEE Conf. Intell. Veh., 2006, pp.
[20] S. Cheng and M. M. Trivedi, "Turn-intent analysis using body pose for intelligent driver assistance," Pervasive Comput., vol. 5, no. 4, pp. , Oct.-Dec.
[21] M. Bertozzi, A. Broggi, C. Caraffi, M. D. Rose, M. Felisa, and G. Vezzoni, "Pedestrian detection by means of far-infrared stereo vision," Comput. Vis. Image Underst., vol. 106, no. 2/3, pp. , May/Jun.
[22] M. Szarvas, A. Yoshizawa, M. Yamamoto, and J. Ogata, "Pedestrian detection with convolutional neural networks," in Proc. IEEE Intell. Veh. Symp., 2005, pp.
[23] L. Zhao and C. Thorpe, "Stereo- and neural network-based pedestrian detection," IEEE Trans. Intell. Transp. Syst., vol. 1, no. 3, pp. , Sep.
[24] G. Grubb, A. Zelinsky, L. Nilsson, and M. Rilbe, "3D vision sensing for improved pedestrian safety," in Proc. IEEE Conf. Intell. Veh., 2004, pp.
[25] P. Alfonso, D. F. Llorca, M. A. Sotelo, L. M. Bergasa, P. Revenga de Toro, J. Nuevo, M. Ocana, and M. A. G. Garrido, "Combination of feature extraction methods for SVM pedestrian detection," IEEE Trans. Intell. Transp. Syst., vol. 8, no. 2, pp. , Jun.
[26] X. Lie and K. Fujimura, "Pedestrian detection using stereo night vision," IEEE Trans. Veh. Technol., vol. 53, no. 6, pp. , Nov.
[27] M. Bertozzi, A. Broggi, M. Felias, G. Vezzoni, and M. Del Rose, "Low-level pedestrian detection by means of visible and far infra-red tetravision," in Proc. IEEE Conf. Intell. Veh., 2006, pp.
[28] R. Labayrade, D. Aubert, and J.-P. Tarel, "Real time obstacle detection in stereovision on non flat road geometry through v-disparity representation," in Proc. IEEE Conf. Intell. Veh., 2002, pp.
[29] K. Konolige, "Small vision systems: Hardware and implementation," in Proc. 8th Int. Symp. Robot. Res., 1997, pp.
[30] M. M. Trivedi, S. Y. Cheng, E. M. C. Childers, and S. J. Krotosky, "Occupant posture analysis with stereo and thermal infrared video: Algorithms and experimental evaluation," IEEE Trans. Veh. Technol., vol. 53, no. 6, pp. , Nov.
[31] M. M. Trivedi, T. Gandhi, and J. McCall, "Looking-in and looking-out of a vehicle: Computer-vision-based enhanced vehicle safety," IEEE Trans. Intell. Transp. Syst., vol. 8, no. 1, pp. , Mar.
[32] S. Park and M. M. Trivedi, "Multi-person interaction and activity analysis: A synergistic track- and body-level analysis framework," Mach. Vis. Appl., Special Issue on Novel Concepts and Challenges for the Generation of Visual Surveillance Systems, vol. 18, no. 3/4, pp. , Aug.
[33] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision. Cambridge, U.K.: Cambridge Univ. Press,
[34] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Proc. Comput. Vis. Pattern Recog., 2005, pp.
[35] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," [Online]. Available: cjlin/libsvm
[36] A. S. Ogale and Y. Aloimonos, "A roadmap to the integration of early visual modules," Int. J. Comput. Vis., Special Issue on Early Cognitive Vision, vol. 72, no. 1, pp. 9-25, Apr.
[37] P. Thevenaz and M. Unser, "Optimization of mutual information for multiresolution image registration," IEEE Trans. Image Process., vol. 9, no. 12, pp. , Dec.
[38] S. J. Krotosky and M. M. Trivedi, "Multimodal stereo image registration for pedestrian detection," in Proc. IEEE Conf. Intell. Transp. Syst., 2006, pp.
[39] T. Gandhi and M. M. Trivedi, "Vehicle surround capture: Survey of techniques and a novel omni-video-based approach for dynamic panoramic surround maps," IEEE Trans. Intell. Transp. Syst., vol. 7, no. 3, pp. , Sep.
[40] J. McCall, D. Wipf, M. Trivedi, and B. Rao, "Lane change intent analysis using robust operators and sparse Bayesian learning," IEEE Trans. Intell. Transp. Syst., vol. 8, no. 3, pp. , Sep.
[41] J. McCall and M. Trivedi, "Driver behavior and situation aware brake assistance for intelligent vehicles," Proc. IEEE, Special Issue on Advanced Automobile Technologies, vol. 95, no. 2, pp. , Feb.

Stephen J. Krotosky received the B.S. degree in computer engineering from the University of Delaware, Newark, in 2001, and the M.S. and Ph.D. degrees in electrical and computer engineering from the University of California, San Diego, in 2004 and 2007, respectively, specializing in signal and image processing. He is currently an Algorithm Development Engineer with the Advanced Multimedia and Signal Processing Division, Science Applications International Corporation, San Diego, CA.

Mohan Manubhai Trivedi received the Ph.D. degree in electrical engineering from Utah State University, Logan. He is a Professor with the Department of Electrical and Computer Engineering and the Founding Director of the Computer Vision and Robotics Research Laboratory, University of California, San Diego. His research interests include computer vision, intelligent vehicles and transportation systems, and human-machine interfaces. Dr. Trivedi is a member of the IEEE Computer Society, from which he received both the Pioneer Award and the Meritorious Service Award, and a Fellow of the International Society for Optical Engineering.
A Summary of Projective Geometry Copyright 22 Acuity Technologies Inc. In the last years a unified approach to creating D models from multiple images has been developed by Beardsley[],Hartley[4,5,9],Torr[,6]
More informationHigh-Level Fusion of Depth and Intensity for Pedestrian Classification
High-Level Fusion of Depth and Intensity for Pedestrian Classification Marcus Rohrbach 1,3, Markus Enzweiler 2 and Dariu M. Gavrila 1,4 1 Environment Perception, Group Research, Daimler AG, Ulm, Germany
More informationVision-based Lane Analysis: Exploration of Issues and Approaches for Embedded Realization
2013 IEEE Conference on Computer Vision and Pattern Recognition Workshops Vision-based Lane Analysis: Exploration of Issues and Approaches for Embedded Realization R. K. Satzoda and Mohan M. Trivedi Computer
More informationAn Evaluation of the Pedestrian Classification in a Multi-Domain Multi-Modality Setup
Sensors 205, 5, 385-3873; doi:0.3390/s506385 OPEN ACCESS sensors ISSN 424-8220 www.mdpi.com/journal/sensors Article An Evaluation of the Pedestrian Classification in a Multi-Domain Multi-Modality Setup
More informationEE795: Computer Vision and Intelligent Systems
EE795: Computer Vision and Intelligent Systems Spring 2012 TTh 17:30-18:45 FDH 204 Lecture 14 130307 http://www.ee.unlv.edu/~b1morris/ecg795/ 2 Outline Review Stereo Dense Motion Estimation Translational
More informationProf. Fanny Ficuciello Robotics for Bioengineering Visual Servoing
Visual servoing vision allows a robotic system to obtain geometrical and qualitative information on the surrounding environment high level control motion planning (look-and-move visual grasping) low level
More informationVehicle Detection Method using Haar-like Feature on Real Time System
Vehicle Detection Method using Haar-like Feature on Real Time System Sungji Han, Youngjoon Han and Hernsoo Hahn Abstract This paper presents a robust vehicle detection approach using Haar-like feature.
More informationImproving Latent Fingerprint Matching Performance by Orientation Field Estimation using Localized Dictionaries
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 11, November 2014,
More informationDetecting motion by means of 2D and 3D information
Detecting motion by means of 2D and 3D information Federico Tombari Stefano Mattoccia Luigi Di Stefano Fabio Tonelli Department of Electronics Computer Science and Systems (DEIS) Viale Risorgimento 2,
More informationPedestrian detection from traffic scenes based on probabilistic models of the contour fragments
Pedestrian detection from traffic scenes based on probabilistic models of the contour fragments Florin Florian, Ion Giosan, Sergiu Nedevschi Computer Science Department Technical University of Cluj-Napoca,
More informationA Quantitative Approach for Textural Image Segmentation with Median Filter
International Journal of Advancements in Research & Technology, Volume 2, Issue 4, April-2013 1 179 A Quantitative Approach for Textural Image Segmentation with Median Filter Dr. D. Pugazhenthi 1, Priya
More informationBiometric Security System Using Palm print
ISSN (Online) : 2319-8753 ISSN (Print) : 2347-6710 International Journal of Innovative Research in Science, Engineering and Technology Volume 3, Special Issue 3, March 2014 2014 International Conference
More informationMultiview Pedestrian Detection Based on Online Support Vector Machine Using Convex Hull
Multiview Pedestrian Detection Based on Online Support Vector Machine Using Convex Hull Revathi M K 1, Ramya K P 2, Sona G 3 1, 2, 3 Information Technology, Anna University, Dr.Sivanthi Aditanar College
More informationPeople detection in complex scene using a cascade of Boosted classifiers based on Haar-like-features
People detection in complex scene using a cascade of Boosted classifiers based on Haar-like-features M. Siala 1, N. Khlifa 1, F. Bremond 2, K. Hamrouni 1 1. Research Unit in Signal Processing, Image Processing
More informationReal-Time Detection of Road Markings for Driving Assistance Applications
Real-Time Detection of Road Markings for Driving Assistance Applications Ioana Maria Chira, Ancuta Chibulcutean Students, Faculty of Automation and Computer Science Technical University of Cluj-Napoca
More informationHuman detection using local shape and nonredundant
University of Wollongong Research Online Faculty of Informatics - Papers (Archive) Faculty of Engineering and Information Sciences 2010 Human detection using local shape and nonredundant binary patterns
More informationTowards Practical Evaluation of Pedestrian Detectors
MITSUBISHI ELECTRIC RESEARCH LABORATORIES http://www.merl.com Towards Practical Evaluation of Pedestrian Detectors Mohamed Hussein, Fatih Porikli, Larry Davis TR2008-088 April 2009 Abstract Despite recent
More informationDeep Tracking: Biologically Inspired Tracking with Deep Convolutional Networks
Deep Tracking: Biologically Inspired Tracking with Deep Convolutional Networks Si Chen The George Washington University sichen@gwmail.gwu.edu Meera Hahn Emory University mhahn7@emory.edu Mentor: Afshin
More informationMultiview Image Compression using Algebraic Constraints
Multiview Image Compression using Algebraic Constraints Chaitanya Kamisetty and C. V. Jawahar Centre for Visual Information Technology, International Institute of Information Technology, Hyderabad, INDIA-500019
More informationHuman Detection with a Multi-sensors Stereovision System
Human Detection with a Multi-sensors Stereovision System Y. Benezeth 1,P.M.Jodoin 2, B. Emile 3,H.Laurent 4,andC.Rosenberger 5 1 Orange Labs, 4 rue du Clos Courtel, 35510 Cesson-Sévigné -France 2 MOIVRE,
More informationAn Object Detection System using Image Reconstruction with PCA
An Object Detection System using Image Reconstruction with PCA Luis Malagón-Borja and Olac Fuentes Instituto Nacional de Astrofísica Óptica y Electrónica, Puebla, 72840 Mexico jmb@ccc.inaoep.mx, fuentes@inaoep.mx
More informationPerson Detection in Images using HoG + Gentleboost. Rahul Rajan June 1st July 15th CMU Q Robotics Lab
Person Detection in Images using HoG + Gentleboost Rahul Rajan June 1st July 15th CMU Q Robotics Lab 1 Introduction One of the goals of computer vision Object class detection car, animal, humans Human
More informationSOME stereo image-matching methods require a user-selected
IEEE GEOSCIENCE AND REMOTE SENSING LETTERS, VOL. 3, NO. 2, APRIL 2006 207 Seed Point Selection Method for Triangle Constrained Image Matching Propagation Qing Zhu, Bo Wu, and Zhi-Xiang Xu Abstract In order
More informationExploitation of GPS-Control Points in low-contrast IR-imagery for homography estimation
Exploitation of GPS-Control Points in low-contrast IR-imagery for homography estimation Patrick Dunau 1 Fraunhofer-Institute, of Optronics, Image Exploitation and System Technologies (IOSB), Gutleuthausstr.
More informationVideo Alignment. Literature Survey. Spring 2005 Prof. Brian Evans Multidimensional Digital Signal Processing Project The University of Texas at Austin
Literature Survey Spring 2005 Prof. Brian Evans Multidimensional Digital Signal Processing Project The University of Texas at Austin Omer Shakil Abstract This literature survey compares various methods
More informationThe Pennsylvania State University. The Graduate School. College of Engineering ONLINE LIVESTREAM CAMERA CALIBRATION FROM CROWD SCENE VIDEOS
The Pennsylvania State University The Graduate School College of Engineering ONLINE LIVESTREAM CAMERA CALIBRATION FROM CROWD SCENE VIDEOS A Thesis in Computer Science and Engineering by Anindita Bandyopadhyay
More informationFast and Stable Human Detection Using Multiple Classifiers Based on Subtraction Stereo with HOG Features
2011 IEEE International Conference on Robotics and Automation Shanghai International Conference Center May 9-13, 2011, Shanghai, China Fast and Stable Human Detection Using Multiple Classifiers Based on
More informationCOSC160: Detection and Classification. Jeremy Bolton, PhD Assistant Teaching Professor
COSC160: Detection and Classification Jeremy Bolton, PhD Assistant Teaching Professor Outline I. Problem I. Strategies II. Features for training III. Using spatial information? IV. Reducing dimensionality
More informationTraffic Signs Recognition using HP and HOG Descriptors Combined to MLP and SVM Classifiers
Traffic Signs Recognition using HP and HOG Descriptors Combined to MLP and SVM Classifiers A. Salhi, B. Minaoui, M. Fakir, H. Chakib, H. Grimech Faculty of science and Technology Sultan Moulay Slimane
More informationHistogram of Oriented Gradients for Human Detection
Histogram of Oriented Gradients for Human Detection Article by Navneet Dalal and Bill Triggs All images in presentation is taken from article Presentation by Inge Edward Halsaunet Introduction What: Detect
More informationVehicle Dimensions Estimation Scheme Using AAM on Stereoscopic Video
Workshop on Vehicle Retrieval in Surveillance (VRS) in conjunction with 2013 10th IEEE International Conference on Advanced Video and Signal Based Surveillance Vehicle Dimensions Estimation Scheme Using
More informationIntegrated Vehicle and Lane Detection with Distance Estimation
Integrated Vehicle and Lane Detection with Distance Estimation Yu-Chun Chen, Te-Feng Su, Shang-Hong Lai Department of Computer Science, National Tsing Hua University,Taiwan 30013, R.O.C Abstract. In this
More informationNIH Public Access Author Manuscript Proc Int Conf Image Proc. Author manuscript; available in PMC 2013 May 03.
NIH Public Access Author Manuscript Published in final edited form as: Proc Int Conf Image Proc. 2008 ; : 241 244. doi:10.1109/icip.2008.4711736. TRACKING THROUGH CHANGES IN SCALE Shawn Lankton 1, James
More informationAutomatic Fatigue Detection System
Automatic Fatigue Detection System T. Tinoco De Rubira, Stanford University December 11, 2009 1 Introduction Fatigue is the cause of a large number of car accidents in the United States. Studies done by
More informationLOW-DENSITY PARITY-CHECK (LDPC) codes [1] can
208 IEEE TRANSACTIONS ON MAGNETICS, VOL 42, NO 2, FEBRUARY 2006 Structured LDPC Codes for High-Density Recording: Large Girth and Low Error Floor J Lu and J M F Moura Department of Electrical and Computer
More informationOn Road Vehicle Detection using Shadows
On Road Vehicle Detection using Shadows Gilad Buchman Grasp Lab, Department of Computer and Information Science School of Engineering University of Pennsylvania, Philadelphia, PA buchmag@seas.upenn.edu
More informationMeasurement of Pedestrian Groups Using Subtraction Stereo
Measurement of Pedestrian Groups Using Subtraction Stereo Kenji Terabayashi, Yuki Hashimoto, and Kazunori Umeda Chuo University / CREST, JST, 1-13-27 Kasuga, Bunkyo-ku, Tokyo 112-8551, Japan terabayashi@mech.chuo-u.ac.jp
More informationTHE development of in-vehicle assistance systems dedicated
1666 IEEE TRANSACTIONS ON VEHICULAR TECHNOLOGY, VOL. 53, NO. 6, NOVEMBER 2004 Pedestrian Detection for Driver Assistance Using Multiresolution Infrared Vision Massimo Bertozzi, Associate Member, IEEE,
More informationA Laplacian Based Novel Approach to Efficient Text Localization in Grayscale Images
A Laplacian Based Novel Approach to Efficient Text Localization in Grayscale Images Karthik Ram K.V & Mahantesh K Department of Electronics and Communication Engineering, SJB Institute of Technology, Bangalore,
More informationFundamental Matrices from Moving Objects Using Line Motion Barcodes
Fundamental Matrices from Moving Objects Using Line Motion Barcodes Yoni Kasten (B), Gil Ben-Artzi, Shmuel Peleg, and Michael Werman School of Computer Science and Engineering, The Hebrew University of
More informationA Fast Moving Object Detection Technique In Video Surveillance System
A Fast Moving Object Detection Technique In Video Surveillance System Paresh M. Tank, Darshak G. Thakore, Computer Engineering Department, BVM Engineering College, VV Nagar-388120, India. Abstract Nowadays
More informationColor Local Texture Features Based Face Recognition
Color Local Texture Features Based Face Recognition Priyanka V. Bankar Department of Electronics and Communication Engineering SKN Sinhgad College of Engineering, Korti, Pandharpur, Maharashtra, India
More informationMotion Tracking and Event Understanding in Video Sequences
Motion Tracking and Event Understanding in Video Sequences Isaac Cohen Elaine Kang, Jinman Kang Institute for Robotics and Intelligent Systems University of Southern California Los Angeles, CA Objectives!
More informationINTELLIGENT transportation systems have a significant
INTL JOURNAL OF ELECTRONICS AND TELECOMMUNICATIONS, 205, VOL. 6, NO. 4, PP. 35 356 Manuscript received October 4, 205; revised November, 205. DOI: 0.55/eletel-205-0046 Efficient Two-Step Approach for Automatic
More informationEye Detection by Haar wavelets and cascaded Support Vector Machine
Eye Detection by Haar wavelets and cascaded Support Vector Machine Vishal Agrawal B.Tech 4th Year Guide: Simant Dubey / Amitabha Mukherjee Dept of Computer Science and Engineering IIT Kanpur - 208 016
More information