2016 IEEE 19th International Conference on Intelligent Transportation Systems (ITSC), Windsor Oceanico Hotel, Rio de Janeiro, Brazil, November 1-4, 2016

A Study of Vehicle Detector Generalization on U.S. Highway

Rakesh N. Rajaram, Eshed Ohn-Bar, and Mohan M. Trivedi
Laboratory for Intelligent and Safe Automobiles, University of California, San Diego
{rnattoji, eohnbar, mtrivedi}@ucsd.edu

Abstract — Vehicle detection is an essential task for an intelligent vehicle. Despite being a well-studied vision problem, it is unclear how well vehicle detectors generalize to new settings. Specifically, this paper studies the generalization capability of vehicle detectors on a U.S. highway dataset. Two types of models are employed in the experimental analysis: a subcategory aggregate channel features model and a region-based convolutional neural network model. The experiments demonstrate limited generalization capability of pre-trained models when evaluated on a dataset captured in new settings. This observation motivates technical modifications in order to improve generalization to the new dataset. By exploring novel training techniques, we significantly improve detection performance, by up to 10%, demonstrating the importance of studying cross-dataset generalization.

I. INTRODUCTION

A robust object detector must handle appearance variations arising for a variety of reasons, including different scenes (urban roads vs. highways), camera viewing angle, occlusion by other objects, truncation due to movement in and out of the camera field of view, as well as variations in the object itself (e.g., SUV vs. sedan). Studying the performance impact of such appearance variations amounts to a generalization study of the object detector's limitations. Our study is motivated by the need to train object detectors which generalize well to new vehicles and settings. Several related research studies propose object detectors with improved generalization capabilities.
Notably, the Deformable Parts Model (DPM) [2] models object-part relationships to achieve increased flexibility in detection. The DPM also trains multiple aspect-ratio models to better handle variations in object aspect ratio. The SubCat [3] vehicle detector extends this idea further by training models for varying vehicle types (sub-categories), at varying aspect ratio, orientation, or occlusion, or by clustering visual descriptors. Current state-of-the-art object detectors such as R-CNN [4], [5] achieve generalization by modeling hierarchies of increasingly higher-level representations and employing large amounts of data. Although the field has seen significant improvement in object detection performance in recent years, analysis of the performance of the aforementioned models when training on one scene setting and testing on a different setting is lacking.

Vision-based vehicle detection has been a widely studied research topic over the past decade. Literature until 2013 has been carefully surveyed by Sivaraman et al. in [6], [7]. The recent success of vehicle detectors on the KITTI [1] dataset could be attributed to robust handling of appearance variation. Regionlets [8] handles variation in the location of object parts by a flexible feature extraction scheme within a region of an object proposal. A variant of DPM, OC-DPM [9], proposes a DPM with occlusion-specific model components. 3DVP [10] clusters samples into voxel patterns by occlusion and truncation, and consequently trains a cluster-specific vehicle detector.

Fig. 1. Difference in scene composition between highway and urban drives: (a) a typical drive on a highway; (b) a typical drive on an urban road. The urban images are from the KITTI [1] dataset and the highway images are from a video dataset collected by us. Vehicle ground truth annotation boxes are shown in green.
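As an illustration of the subcategorization idea above (a sketch of ours, not the authors' code), the snippet below partitions vehicle bounding boxes into aspect-ratio clusters with a simple 1-D k-means; each resulting cluster would then train its own detector model. The helper name and toy data are hypothetical.

```python
def kmeans_1d(values, k, iters=50):
    """Cluster scalar values (e.g., box aspect ratios) into k groups.
    Deterministic init: centers evenly spaced between min and max."""
    lo, hi = min(values), max(values)
    centers = [lo + (hi - lo) * i / max(k - 1, 1) for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iters):
        # Assignment step: each value goes to its nearest center.
        labels = [min(range(k), key=lambda c: (v - centers[c]) ** 2)
                  for v in values]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [v for v, l in zip(values, labels) if l == c]
            if members:
                centers[c] = sum(members) / len(members)
    return centers, labels

# Toy vehicle boxes as (w, h); aspect ratio = w / h.
boxes = [(60, 30), (90, 44), (40, 38), (42, 40)]
ratios = [w / h for (w, h) in boxes]
centers, labels = kmeans_1d(ratios, k=2)
# Wide (side-view-like) boxes and square (rear-view-like) boxes fall into
# different clusters, each of which would get a cluster-specific detector.
```

In practice the clustering can also use orientation or occlusion attributes, as the papers cited above do; aspect ratio is simply the easiest attribute to compute from the annotations alone.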
Current state-of-the-art region-based deep convolutional neural network (DCNN) models [4] train multilayer architectures on large amounts of data to implicitly handle object variations in classification and detection. We note that when on-road vehicle detection is concerned,

978-1-5090-1889-5/16/$31.00 ©2016 IEEE
most existing approaches and datasets involve training and testing within similar scenes and geographical locations. Furthermore, although our study focuses on vehicles, similar issues are expected for other types of road occupants [11], [12].

Fig. 2. Overview of the (a) training pipeline (training data → clustering → model learning) and (b) testing pipeline (test image → pixel lookup features → detections). This work studies different clustering techniques derived from aspect ratio and scale, similar to [3], for learning a highway detector. The detector uses AdaBoost with color and gradient-based pixel lookup features to learn models that provide fast detection at test time.

In this work, we perform a study of object detector generalization when training is done on the KITTI [1] dataset, collected in urban settings in Europe (Karlsruhe, Germany), and testing is done on a U.S. (San Diego) highway dataset collected by us. The domain application of intelligent vehicles and autonomous driving requires object detection models to generalize over such variations in settings and geographical locations. We study the impact of parameter choices during training on generalization capability to the new dataset. Furthermore, as generalization capability is influenced by both the training procedure and the dataset used, our experiments can study the impact of dataset bias. Specifically, the motivations for this paper are as follows.

1) Initial experiments with detectors pre-trained on the KITTI dataset generated sub-optimal results on our new highway dataset.
2) Most of the previous work in highway vehicle detection is restricted to a narrow back view of the vehicles. Our study is reported on data collected with a wider-view camera and annotations of the entire view of the vehicles (as opposed to just the rear part).
3) Most of the previous work in highway vehicle detection reports evaluation metrics using the 50% PASCAL overlap criterion.
We include analysis of detector performance at different overlap thresholds; performance at a higher overlap threshold implies better localization. These motivating points guide our study of the generalization of vehicle detectors in highway settings. Fig. 1 highlights some key differences between the urban road and highway data collected at the two geographically different locations. The contributions of the paper are as follows.

1) We evaluate two state-of-the-art vehicle detectors on a new highway dataset.
2) We explore the impact of different clustering options on the detector's performance and show an improvement of more than 10% over pre-trained models. This demonstrates the importance of studying cross-dataset, cross-settings generalization.
3) A fine-grained analysis of missed detections provides insight into the future scope of research.

For the comparative analysis, we employ the Fast R-CNN [4] and SubCat [3] approaches. The study of the impact of the training procedure is mostly done on SubCat due to its fast training and testing times, with the aim of highlighting some of the improvements that can be helpful when adapting an object detector from one setting to another.

II. SUBCAT DETECTOR FOR ANALYZING GENERALIZATION

The key components of the SubCat [3] detector are shown in Fig. 2. The crux of this method is to cluster objects into different categories based on features which can be visual (e.g., colorspace, gradient magnitude), geometric (aspect ratio, height, 3D orientation, etc.), or semantic (occluded, truncated, etc.). Then, for each of these clusters, a model is generated by training a cluster-specific detector (we chose the ACF [13], [14] detector for our generalization studies). At test time, detection boxes from all the cluster-specific detectors are joined to produce the final vehicle detection boxes. Formally, let O = {o_ij} be the set of all vehicles, with i as the image index and j indexing each vehicle in image i. In our clustering process, each cluster provides a subset,
C_k ⊆ O for k = 1, …, N, where N is the total number of clusters. The cluster set satisfies the following constraints: (1) C_x ∩ C_y = ∅ for x ≠ y, and (2) O = ∪_{x=1}^{N} C_x.

Fig. 3. Distribution of vehicle aspect ratio (W/H) and object height in pixels. These distributions play a significant role in deciding the number of clusters and their properties. Our highway dataset has diverse aspect ratios and heights, motivating learning cluster-specific models suitable for handling such challenges. Ultimately, the goal is to study elements which impact generalization when training on KITTI and testing on the highway dataset.

III. SAN DIEGO HIGHWAY DATASET

The dataset used in this paper was captured using the testbed with a front-facing PointGrey color camera at 28x396, 5fps. The annotated frames used in the analysis correspond to a drive on the morning of December 24, 2015 (PST) on a San Diego interstate highway. It was a sunny morning, leading to some bright reflections from the surrounding vehicle surfaces. We chose 1220 semi-contiguous frames and annotated 2536 cars that are bigger than 30×30 pixels and at least 50% visible. Since the clustering process heavily depends on the vehicle aspect ratio (the ratio of bounding box width to height) and the height distribution, Fig. 3 helps in deciding optimal clustering strategies.

IV. EXPERIMENTAL ANALYSIS

The LISA-T highway dataset is split equally, with the first 610 images going into the training set and the remaining images going into the validation set. Data augmentation in the form of horizontally flipped images is added to the respective sets. This results in 1220 images in each set, with 2856 vehicles in the training set and 2216 vehicles in the testing set, respectively. This experimental setup allows contrasting the impact of training either on KITTI or on highway settings. Let d be any detection bounding box and o be any ground truth bounding box in the same image. Then, the PASCAL overlap (η) is calculated as η = |d ∩ o| / |d ∪ o|, i.e., the area of intersection over the area of union of the two boxes.
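For concreteness, the PASCAL overlap η can be computed directly from box coordinates; a minimal sketch (boxes in (x1, y1, x2, y2) form; the function name is ours):

```python
def pascal_overlap(d, o):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    # Intersection rectangle (empty if the boxes do not overlap).
    iw = max(0.0, min(d[2], o[2]) - max(d[0], o[0]))
    ih = max(0.0, min(d[3], o[3]) - max(d[1], o[1]))
    inter = iw * ih
    area_d = (d[2] - d[0]) * (d[3] - d[1])
    area_o = (o[2] - o[0]) * (o[3] - o[1])
    union = area_d + area_o - inter
    return inter / union if union > 0 else 0.0

# At threshold 0.7, a detection d is credited to a ground truth box o
# only if pascal_overlap(d, o) >= 0.7.
```

Note how quickly the measure drops with misalignment: two identical 10×10 boxes offset horizontally by half their width already overlap by only 1/3, well below the 0.7 threshold used in this paper.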
Unless otherwise specified, we report the area under the precision-recall curve (AUC) at η = 0.7 (70%).

Baseline: We run 3 different experiments with the objective of providing baseline performance on the highway dataset's test set. The resulting precision-recall (PR) curves are shown in Fig. 4.

Fig. 4. Precision-recall curves for the baseline methods (pre-trained SubCat, retrained SubCat, and Fast-RCNN) on the LISA highway test set.

Pre-trained: SubCat [3] models trained on the entire KITTI [1] object training set. It is an ensemble of 75 models, trained for 25 different orientation clusters at 3 different scales. This detector achieved 75.46% AUC on the KITTI benchmark [15].

Fast-RCNN: Out-of-the-box Fast-RCNN [4] using the VGG16 [16] model fine-tuned on VOC07 [17]. About 2000 object proposals per image are generated using EdgeBoxes [18]. Detections are generated at the default 5-scale multi-resolution setting.

Retrained: New models trained on the KITTI split proposed in [10], with tree depth 2 and a capped number of negative samples. All other parameters are the same as for the pre-trained models.

Fig. 5. Distribution of cluster mean aspect ratios generated by k-means on the LISA-T highway training set, for different numbers of clusters.

Next, we consider several strategies for improving the performance of the detector trained on KITTI for the U.S. settings. The aim is to gain insight into the types of factors impacting an object detector when applied to new settings.

Strategy 1: We train N ACF [13], [14] models, each with a different aspect ratio as obtained by k-means clustering of the samples in the highway dataset. The model height (h_k)
is fixed to 30 pixels. The model width is set as w_k = h_k · µ_k, where µ_k is the mean aspect ratio of cluster k. We train up to 2048 depth-2 decision trees with AdaBoost, using all positive samples from C_k, a capped number of negative samples, and 4 rounds of hard negative mining. Object locations from other clusters are ignored during hard negative mining. Cluster centers are shown in Fig. 5. The resulting performance curves are shown in Fig. 6.

Fig. 6. Performance curves for different numbers of clusters (N) under strategy 1.

Strategy 2: Strategy 1 is modified such that, during the training of each model, we allow mining hard negatives from positive samples in other clusters. The resulting performance curves are shown in Fig. 7.

Fig. 7. Performance curves for different numbers of clusters (N) under strategy 2.

Strategy 3: Strategy 1 is modified such that all positive samples from O are used to train each model. Effectively, we are training multiple models on the same data, but at different aspect ratios. We note that this is not the same as the original procedure described in Section II, which parses the training set into disjoint training clusters. Strategy 3 can leverage additional data within each cluster, and the resulting curves are shown in Fig. 8.

Fig. 8. Performance curves for different numbers of clusters (N) under strategy 3.

Fig. 9. AUC as a function of the overlap threshold (η) for selected experiments (retrained SubCat, Fast-RCNN, and the best model from each strategy).

V.
DISCUSSION

When comparing the performance curves for the three strategies of adapting a detector to the new settings, including the baselines, we see that the pre-trained model performs sub-optimally compared to models trained specifically on highway data: SubCat [3] models trained only on data from urban areas of Germany fail to generalize to our highway dataset. On the other hand, Fast-RCNN [4] in particular, and deep convolutional neural networks in general, are trained on vast amounts of data across multiple object classes and hence are expected to generalize well to appearance and scene variation. This is found to be true for highway vehicle detection, but the reason for the lower AUC is explained in the subsequent section.

A. Overlap Threshold

The interplay of AUC and overlap threshold is insightful for understanding the localization performance of any detector. In Fig. 9, we take the best performing method from each strategy and plot AUC as a function of the overlap threshold
η. While all of the models follow a similar trend, Fast-RCNN shows a tremendous decline for η ≥ 0.65. This is a well-known issue with DCNN-based approaches: they are very good at classification tasks but may fail at localizing objects. In fact, Fast-RCNN performs slightly better than all other models, with 88.54% AUC at η = 0.5. This is a useful insight for the intelligent vehicles domain, and one which may not have been clearly visible in general object detection (e.g., on the PASCAL [17] dataset).

Fig. 10. The distribution of missed vehicles with respect to object position in the image. Each dot is the center location of a missed vehicle. Red corresponds to η = 0.5, with additional misses shown in green when η increases to 0.7. This result is generated using the detector trained with strategy 3 (N = 8).

Fig. 11. Example detection results generated using the detector trained with strategy 3 (N = 8). Green boxes are ground truth. Red boxes are detections, with the confidence score printed above each box.

B. Missed Detections

The locations and sizes of missed detections are useful for understanding the limitations of the detector. We limit this analysis to the best performing detector (i.e., strategy 3 with N = 8). In Fig. 10, each missed vehicle's center location is plotted as a dot on the image plane. Red dots correspond to η = 0.5, and green dots are added when η increases to 0.7. Fig. 10 helps us draw two important conclusions: (1) most of the truncated vehicles are detected, but poorly localized; (2) small vehicles are almost never detected. This can be seen quantitatively in Fig. 12, where we plot the fraction of missed detections vs. bounding box height. Selected detection boxes, along with ground truth, are visualized in Fig. 11.

VI. CONCLUDING REMARKS

In this work, we studied the impact of model parameters and the dataset training procedure on vehicle detection in highway settings.
The study shows that a vehicle detector trained on a highway dataset performs significantly better than one trained on a geographically different dataset. In particular, improved performance could be obtained even with a fraction of the modeling complexity of an off-the-shelf, pre-trained detector (for SubCat, we use 8 as opposed to the 75 components in [3]). This conveys how new evaluation settings can render elements of a model redundant. Furthermore, the detector performance (AUC) evaluated at different overlap thresholds suggests that highway vehicle detection is still an open challenge, especially when localization accuracy is important.

Fig. 12. Fraction of missed vehicles as a function of vehicle height in pixels. Blue bars correspond to η = 0.7 and red bars correspond to η = 0.5. This plot is generated using the detector trained with strategy 3 (N = 8).

VII. ACKNOWLEDGMENTS

The authors would like to acknowledge the support of our sponsors and associated industry partners. We also thank our colleagues at the Laboratory for Intelligent and Safe Automobiles (LISA), University of California, San Diego, for encouragement and assistance.

REFERENCES

[1] A. Geiger, P. Lenz, and R. Urtasun, "Are we ready for autonomous driving? The KITTI vision benchmark suite," in IEEE Conference on Computer Vision and Pattern Recognition, 2012.
[2] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, "Object detection with discriminatively trained part-based models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 9, pp. 1627–1645, 2010.
[3] E. Ohn-Bar and M. M. Trivedi, "Learning to detect vehicles by clustering appearance patterns," IEEE Transactions on Intelligent Transportation Systems, 2015.
[4] R. Girshick, "Fast R-CNN," in IEEE International Conference on Computer Vision (ICCV), 2015.
[5] R. N. Rajaram, E. Ohn-Bar, and M. M. Trivedi, "RefineNet: Iterative refinement for accurate object localization," in IEEE Intelligent Transportation Systems Conference, 2016.
[6] S. Sivaraman and M. M. Trivedi, "Looking at vehicles on the road: A survey of vision-based vehicle detection, tracking, and behavior analysis," IEEE Transactions on Intelligent Transportation Systems, 2013.
[7] S. Sivaraman and M. M. Trivedi, "A general active learning framework for on-road vehicle recognition and tracking," IEEE Transactions on Intelligent Transportation Systems, 2010.
[8] X. Wang, M. Yang, S. Zhu, and Y. Lin, "Regionlets for generic object detection," in IEEE International Conference on Computer Vision (ICCV), pp. 17–24, 2013.
[9] B. Pepikj, M. Stark, P. Gehler, and B. Schiele, "Occlusion patterns for object class detection," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3286–3293, 2013.
[10] Y. Xiang, W. Choi, Y. Lin, and S. Savarese, "Data-driven 3D voxel patterns for object category recognition," in IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[11] R. N. Rajaram, E. Ohn-Bar, and M. M. Trivedi, "An exploration of why and when pedestrian detection fails," in IEEE Conference on Intelligent Transportation Systems, September 2015.
[12] R. N. Rajaram, E. Ohn-Bar, and M. M. Trivedi, "Looking at pedestrians at different scales: A multiresolution approach and evaluations," IEEE Transactions on Intelligent Transportation Systems, 2016.
[13] P. Dollár, R. Appel, S. Belongie, and P. Perona, "Fast feature pyramids for object detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014.
[14] P. Dollár, "Piotr's Computer Vision Matlab Toolbox (PMT)," http://vision.ucsd.edu/~pdollar/toolbox/doc/index.html.
[15] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, "Vision meets robotics: The KITTI dataset," The International Journal of Robotics Research, 2013.
[16] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," CoRR, vol. abs/1409.1556, 2014.
[17] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The PASCAL visual object classes (VOC) challenge," International Journal of Computer Vision, vol. 88, pp. 303–338, June 2010.
[18] C. L. Zitnick and P. Dollár, "Edge boxes: Locating object proposals from edges," in European Conference on Computer Vision (ECCV), pp. 391–405, Springer, 2014.