Detection and Handling of Occlusion in an Object Detection System


R.M.G. Op het Veld,a R.G.J. Wijnhoven,b Y. Bondarau,c and Peter H.N. de With,d
a,b ViNotion B.V., Horsten 1, 5612 AX, Eindhoven, The Netherlands; a,c,d Eindhoven University of Technology, Den Dolech 2, 5612 AZ, Eindhoven, The Netherlands

ABSTRACT
Object detection is an important technique for video surveillance applications. Although many detection algorithms have been proposed, they all have problems in detecting occluded objects. In this paper, we propose a novel system for occlusion handling and integrate it in a sliding-window detection framework using HOG features and linear classification. Occlusion handling is obtained by applying multiple classifiers, each covering a different level of occlusion and focusing on the non-occluded object parts. Experiments show that our approach, based on 17 classifiers, obtains an increase of 8% in detection performance. To limit computational complexity, we propose a cascaded implementation that increases the computational cost by only 3.4%. Although the paper presents results for pedestrian detection, our approach is not limited to this object class. Finally, our system does not require an additional training dataset that covers all possible types of occlusion.

Keywords: Object Detection, Occlusion Handling, Histogram of Oriented Gradients (HOG)

1. INTRODUCTION
In this paper we focus on object detection, in particular but not exclusively, the detection of humans in the domain of video surveillance. Detection of objects is a challenging task because of large variations in lighting (sun, shadows), object position and size, object deformations (shape) and large intra-class variations in object and background. Although the quality of detection algorithms is constantly improving and partially solves the previous challenges, state-of-the-art methods still struggle to detect objects that are occluded or in unusual poses.
Occlusion is a particular problem that differs from the previous challenges, since it takes away part of the object information. The variation and amount of occlusion form a problem of their own, which has not been broadly studied, so we specifically investigate the handling of occlusions in this paper. Some typical occlusions are visualized in Figure 1. Popular object detection algorithms use a sliding-window detection stage, where a classification window is evaluated at different positions in the image. At each search position, the local image region is classified into object/background. To remove variations in contrast and lighting conditions, the raw intensity values of the image pixels are typically first transformed into an invariant feature space. A popular feature descriptor for object characterization is the Histogram of Oriented Gradients (HOG).2 The obtained feature description is then classified by a linear Support Vector Machine (SVM),3 which is selected for its simplicity and good performance. A well-known dataset for occlusion experiments is the Caltech Pedestrian dataset, which focuses on pedestrian detection in an urban environment. Here, over 70% of all pedestrians are occluded in at least a single video frame. Statistics on these occlusions show that 95% of all occlusions in this dataset occur from the bottom, the right and the left of the pedestrians.4,5 This aspect will be specifically addressed later in this paper. Our work concentrates on improving an existing real-time sliding-window object detection system that uses linear classification (SVM). To this end, we explore the detection of occluded regions and compare this with the detection of regions without occlusion. The first approach focuses on the detection of occlusions using the classification score, whereas the second approach focuses on the detection of non-occluded regions using multiple classifiers in parallel, each dealing with different partial occlusions.
We will evaluate both approaches and show that the latter approach is best suited. The remainder of the paper is organized as follows. We introduce related work in Section 2. Our existing object detection system is described in Section 3. Then, Section 4 describes our implementations of the two evaluated approaches. Section 5 outlines the applied datasets and experimental results. The results are discussed in Section 6, followed by the conclusions in Section 7.

Figure 1. Typical occlusions in crowded scenes with pedestrians.

2. RELATED WORK AND DETAILED PROBLEM STATEMENT
Approaches from literature to handle such occlusions are mainly divided into two groups. The first group focuses on the detection of occlusions, whereas the second group concentrates on the detection of non-occluded regions. Detection of occlusions is proposed in the following studies. Pixel-level segmentation methods6-10 obtain a good detection performance, but result in high computational cost. Such methods distinguish different objects based on the segmentation outcome and use the pixel data of the segmented areas to classify the objects. To reduce the computational cost, the segmentation resolution can be reduced from pixel level to e.g. cell-level segmentation.11 These techniques require pixel-level annotated data for training, which is preferably avoided. In a more detailed approach, Monroy and Ommer7 propose a model-driven method to learn object shapes without requiring segmented training data. Object models for detection are learned by explicit representations of object shapes and their segregation from the background. All learned shape models have to be matched with every detection window, which is not feasible in real time. Wang et al.12 propose a technique that exploits classification scores per cell in a sliding-window detection system. They integrate the concept of object segmentation within the detection window and ignore the influence of occluded areas in the classification score for this window. Based on the localization of the occluded region, either an upper-body or a lower-body classifier is evaluated on the non-occluded regions. This method is feasible under real-time requirements and only requires bounding-box annotated training data, instead of pixel-segmented training data. Their approach will be addressed further in Section 4.1. Marín et al.13 expand upon the work of Wang et al.12 Instead of using a partial classifier for ambiguous detections, these detections are evaluated using an ensemble of random subspace classifiers. A validation set is used to select the best subset of classifiers, which forms a considerable drawback, as this dataset needs to cover all possible types of occlusions. The detection of non-occluded regions forms the second type of approach. An evident first step is to extend the image description with more informative features. As an example, HOG has already been extended with Local Binary Patterns (LBP),12 Color Self-Similarity (CSS)14 and Histograms Of Flow (HOF).14,15 Although this approach is useful, it emphasizes the object itself, not the occlusions, so that it has limited or no impact on handling the occlusions. As an alternative, part-based detectors are partially robust against occlusions, because non-occluded parts are detected normally and occluded parts are missed. Tang et al.21 propose training multiple occlusion-aware classifiers for pairs of specific combinations of occluding and occluded objects (e.g. person-to-person occlusion). Although this approach provides good results for pairs of pedestrians, different classifiers have to be trained for every other possible type of occlusion, which is not feasible for a real-world application. Mathias et al.22 also use multiple classifiers that are evaluated for different sizes and positions of occlusions. Summarizing, we have adopted the work of Wang et al.12 as a starting point, because it is based on the generic sliding-window detection approach and requires only limited additional complexity. To evaluate the suitability, we have also designed and implemented a system that focuses on the detection of non-occluded regions using multiple classifiers, which is in line with Mathias et al.22 This second method is chosen to have an alternative approach that is based on the same detection architecture, but uses a conceptually different solution.

Figure 2. Schematic overview of our object detection system. Training is performed offline in the top-left part (Blocks A and B) and normal (online) detection is depicted in the lower part (Blocks C-G).

3. BASELINE OBJECT DETECTION SYSTEM
The baseline object detection system is based on the Histogram of Oriented Gradients (HOG) by Dalal and Triggs.2 The input image is first transformed into an invariant feature space that models object shape using orientation histograms. Object detection is performed by sliding a detection window over the image and classifying the feature description of this window into object/background. To detect objects of different sizes, the detection process is repeated for scaled versions of the input image. Finally, the window-level detections belonging to the same object are merged into final detections. The total system is depicted in Figure 2. Prior to object detection, the system is trained using example images of the object (positive) and the background (negative). The images are transformed into the HOG feature space (Block A) and the resulting feature descriptors are used to train a classifier (Block B). After training (normal system operation), input images are first converted into the HOG feature space (Block C). The trained classifier is then used to detect objects in the sliding-window detection stage (Block D), which evaluates the image features. At each image search position, the classifier returns a confidence score, which is thresholded (t_class) to obtain the detections (Block E). Because the object will be detected at several search positions and at several scales, all detections are merged by spatial clustering (Block F).
Finally, these merged detections are also thresholded (t_final) to obtain the final detections (Block G). To describe the input image, we use the HOG feature transform.2 The image is divided into a spatial grid of cells of size 8×8 pixels. Gradient orientation information is calculated for each pixel and combined in a histogram for each cell. Orientation information is quantized into 9 bins and weighted by the gradient magnitude. Pixel gradients are calculated by filtering with [-1, 0, 1] filters in both spatial dimensions. To normalize for image contrast, the histograms are normalized using L2 energy normalization. Each cell histogram is normalized multiple times, using the energy of all blocks of size 2×2 cells of which the cell is part. The feature vectors for all blocks belonging to the same detection window are concatenated and form the feature descriptor for this window. A visual representation of the complete process is shown in Figure 3.

Figure 3. Visual illustration of the HOG feature calculation process.

To obtain object detections, the feature space is searched in a sliding fashion and the feature vector at each position is classified into object/background. During training, the linear classification decision boundary is created using a Support Vector Machine (SVM) classifier. After training, the resulting linear classifier is used for object detection, where the linear classification function is f(x) = ω^T·x + b, where b is the bias, ω the weight vector (the normal of the hyperplane) and x the concatenation of all features for the detection window (the feature vector).
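To make the sliding-window stage concrete, the following Python sketch evaluates the linear decision function f(x) = ω^T·x + b at every position of a cell-level feature map. The shapes, names and the simple double loop are our own illustration; the actual system operates on HOG descriptors and is heavily optimized.

```python
import numpy as np

# Sketch of the sliding-window classification of Section 3 (hypothetical
# sizes; the real system uses HOG descriptors over 8x8-pixel cells).
def score_window(features, w, b):
    """Linear SVM decision function f(x) = w^T x + b."""
    return float(np.dot(w, features) + b)

def sliding_window_scores(feature_map, w, b, win_h, win_w):
    """Evaluate the classifier at every window position of a cell-level
    feature map of shape (H, W, D); returns one score per position."""
    H, W, D = feature_map.shape
    scores = np.zeros((H - win_h + 1, W - win_w + 1))
    for y in range(scores.shape[0]):
        for x in range(scores.shape[1]):
            # Concatenate the cell descriptors covered by the window.
            window = feature_map[y:y + win_h, x:x + win_w].ravel()
            scores[y, x] = score_window(window, w, b)
    return scores
```

In the full system, this evaluation is repeated on rescaled versions of the image so that objects of different sizes are detected with one fixed-size window.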

4. APPROACH
The previously described sliding-window-based detection system is now extended with occlusion handling, where we focus on two different approaches. The first occlusion-handling approach detects occlusions using the cell scores of a full-window classifier. The second approach introduces a novel algorithm that focuses on designing a more occlusion-robust system by combining detections from multiple different classifiers.

4.1 Approach 1: occlusion detection
The first approach is based on the work of Wang et al.12 In case the score of the full-window classifier is ambiguous (t_lower < score < t_higher), an occlusion may be present and the system uses the scores of the individual cells to segment the detection window into positive and negative regions. Negative regions typically do not contain an object and are ignored. Positive regions, possibly containing an object, are evaluated using a second, partial classifier. Based on the score of the partial classifier, either the score of the full-window classifier (s_glob), the score of the partial classifier (s_part), or a combination of both is used. If the score of the partial classifier is not sufficiently reliable (< t_conf), the global and partial classifiers are combined by a weighted combination of the scores, according to score = w_part·s_part + w_glob·s_glob. Negative cells decrease the score for the complete window, which can result in a misclassified sample. To separate negative cells from positive cells, first the score per cell needs to be computed. The score for a window is computed using the linear classification function from Section 3. The weight vector ω consists of weights for each individual cell. This vector is multiplied with the feature vector for the detection window, resulting in a score per cell. In order to obtain the full-window classification score, a bias b is added, either to the full window or per cell.12 The computed cell bias values are shown in Figure 4(b).
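The per-cell score computation described above can be sketched as follows. The array shapes and the per-cell bias vector are illustrative assumptions, following the decomposition of Wang et al.12

```python
import numpy as np

# Sketch of the per-cell score decomposition of Section 4.1 (hypothetical
# shapes; cell_bias distributes the window bias b over the cells).
def per_cell_scores(features, w, cell_bias):
    """Split the window score w^T x + b into one score per HOG cell.

    features, w: arrays of shape (n_cells, dims_per_cell)
    cell_bias:   array of shape (n_cells,) summing to the window bias b
    """
    return (features * w).sum(axis=1) + cell_bias

def combined_score(s_part, s_glob, w_part=0.3, w_glob=0.7):
    # Weighted combination used when the partial score is unreliable.
    return w_part * s_part + w_glob * s_glob
```

Summing the per-cell scores recovers the full-window score, so positive and negative regions can be formed without re-evaluating the classifier.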
Using these bias values per cell, the classification scores per cell can be calculated. Based on these scores, cells can be merged into different positive or negative regions. Negative regions most likely indicate occlusion or clutter, and it is expected that ignoring these regions will improve the detection score. The merging of the noisy per-cell classification scores is implemented with a Mean-shift23 segmentation algorithm. For this purpose, two kernels are employed, a Gaussian spatial kernel and a linear kernel, as specified by

G(d) = exp(-d^2 / (2σ^2)),    G(s_1, s_2) = { w_ms·s_2 if s_1 > 0;  0.5·w_ms·s_2 if s_1 ≤ 0 }.    (1)

Here, d is the Euclidean distance between two cells, σ = 1, and s_1 and s_2 are the scores of the compared cells.

4.2 Approach 2: multiple classifiers for non-occluded regions
In this approach, we pursue the novel concept of assigning a classifier to each different non-occluded region. Each classifier is trained for a certain occlusion pattern, as shown in Figure 6(a). During detection, all classifiers are evaluated at each sliding-window search position. We assume that in case of occlusion, there will be at least one classifier in the total set that matches the current type of occlusion, so that a feasible detection is computed. It is important that the occlusion patterns and the related classifiers are representative of typical occlusion cases. We have designed 29 different classifiers, as shown in Figure 6(a). It seems straightforward to always choose the classifier that covers the smallest object region (largest occlusion region), but we will show in our experiments that such a classifier, despite being more robust to occlusion, always decreases the detection performance. For this reason, once the individual classifiers are designed and selected, we combine individual classifiers to create more robust classifiers for large occlusions. Since each classifier is trained independently, the margins of the linear classifiers are different.
Classifiers covering a larger object region incorporate more object information, so that they have better-defined margins and will perform better. For combining multiple classifiers, several metrics can be used. To calibrate each individual classifier, classifiers can be normalized using the maximum achievable detection score, as described by Mathias et al.,22 which is rather dataset-dependent. Therefore, we propose an alternative: normalizing each classifier by scaling the weight vector ω (and its bias) of the linear classifier to unity energy. Furthermore, we calibrate t_class for all classifiers using a fixed false-positive rate, to statistically equalize the number of false detections. Our normalization is not only attractive for individual classifier calibration, but also paves the way for combining classifiers at a later stage. Apart from individual normalization, the influence of each individual classifier on the combined result can be varied. This can be performed by imposing a weight of unity minus the occlusion level22 for each classifier. As in Mathias et al.,22 we assume that a small-object-region classifier always performs worse than a large-object-region classifier, so that the weight is adapted accordingly. In initial experiments, we have compared the four possible combinations of selectively applying individual classifier normalization and applying weighting by the occlusion level when combining the individual classifiers. These initial experiments have revealed that it is always preferred to apply both measures simultaneously.

4.3 Merging of detections
After we have obtained multiple detections per object during the sliding-window detection stage, these detections have to be merged. Dalal24 has proposed an elegant but computationally expensive Mean-shift procedure. A relatively simple method is proposed by Mathias et al.,22 which we refer to as NMS+Merging. We have implemented this two-stage approach because of its low computational requirements. First, all detections are sorted by their scores. Then, Non-Maximum Suppression (NMS) is applied to all detections belonging to the same classifier and the remaining detections are merged together. The criteria applied by Dollár5 are used for non-maximum suppression (see Equation (2), left). If the overlap score is larger than the threshold t_NMS, the detections are merged by ignoring the detection with the lowest score. In the second step, all remaining overlapping detections that satisfy the second overlap criterion are merged (see Equation (2), right). The NMS and merging overlap criteria are defined, respectively, as:

area(B_a ∩ B_b) / min(area(B_a), area(B_b)) > t_NMS,    area(B_a ∩ B_b) / area(B_a ∪ B_b) > t_merging.
(2)

If the overlap score is larger than the threshold t_merging, the detections are merged and their detection scores are accumulated. We have empirically determined both threshold values, with t_NMS = 0.7.

5. EVALUATION
We will evaluate the performance and efficiency of the two implemented approaches: the occlusion detection system of Wang et al.12 and our system based on multiple classifiers, where each classifier detects different non-occluded regions. More specifically, we evaluate (1) whether it is possible to use per-cell scores to detect occlusions, and (2) whether combining multiple, different classifiers can increase the detection performance. We first introduce the evaluation measures in Section 5.1 and then present the two datasets for our experiments in Section 5.2. For a better understanding of occlusion aspects related to the actual object class, we consider the importance of the region of occlusion in Section 5.3. The occlusion detection system is evaluated in Section 5.4 and our proposed multiple-classifier system is extensively discussed and evaluated in Section 5.5.

5.1 Evaluation measures
To quantify detection performance, we plot Detection Error Trade-off (DET)24 curves on a log-log scale, i.e., the miss-rate (1 - Recall, or FalseNeg/(TruePos + FalseNeg)) versus the False Positives Per Window (FPPW). Evidently, low values of the miss-rate are desirable. The chosen parameters present the same information as the Receiver Operating Characteristic (ROC), but allow small probabilities to be distinguished more easily. We will use miss-rates at 10^-4 FPPW as a reference point for the comparison of different results, since this is a realistic point of operation for a detection system. This implies that 1 out of 10,000 negative windows is misclassified as an object.
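A minimal sketch of these window-level measures (illustrative counts only):

```python
# Sketch of the window-level evaluation measures of Section 5.1.
def miss_rate(true_pos, false_neg):
    """Miss-rate = 1 - recall = FalseNeg / (TruePos + FalseNeg)."""
    return false_neg / (true_pos + false_neg)

def fppw(false_pos, num_windows):
    """False Positives Per Window over a set of negative windows."""
    return false_pos / num_windows
```

For example, 15 missed objects out of 100 gives a miss-rate of 0.15, and 5 false detections over 50,000 negative windows corresponds to the 10^-4 FPPW operating point.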
For the INRIA dataset, the images contain on average 50,000 windows per image when processing scales 1.0 and higher with a fixed scale-factor step. To measure the performance of the complete system, we use ROC curves, in which we plot the miss-rate versus the False Positives Per Image (FPPI), and compare results using the Area Under the Curve (AUC) measure, where a lower AUC indicates a better detector. We have followed the same evaluation details as described by Dollár et al.,4,5 in order to obtain comparable results.
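For completeness, the two overlap measures of Equation (2) in Section 4.3, which also underlie the detection-to-ground-truth matching in such evaluations, can be sketched as follows (boxes as (x1, y1, x2, y2) tuples; the function names are our own):

```python
# Sketch of the overlap criteria of Equation (2), Section 4.3.
def area(b):
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])

def intersection(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)

def nms_overlap(a, b):
    """area(Ba ∩ Bb) / min(area(Ba), area(Bb)), compared against t_NMS."""
    return intersection(a, b) / min(area(a), area(b))

def merging_overlap(a, b):
    """area(Ba ∩ Bb) / area(Ba ∪ Bb) (IoU), compared against t_merging."""
    inter = intersection(a, b)
    return inter / (area(a) + area(b) - inter)
```

Note that the NMS criterion divides by the smaller box area and is therefore always at least as large as the intersection-over-union criterion.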

5.2 Datasets
We determine our results on the publicly available INRIA2 dataset, but have also employed our own, more crowded Dancefestival dataset. The INRIA2 dataset contains 288 images with 1,132 person crops. The negative test set contains 453 images. The training set contains 1,208 cropped object images (doubled by horizontal mirroring) and 1,218 negative images. All results are reported for the 288 positive test images, while detecting pedestrians of 96 pixels or more in height. The negative images are used for the DET curves. Additionally, our Dancefestival dataset consists of 864 annotated persons in 38 high-resolution images, containing 323 occluded persons. This dataset is added to evaluate performance in a crowded real-world scenario with many occlusions. Evaluation results are reported for pedestrians of height 48 pixels and up. For all experiments, we have trained our classifiers on the INRIA Person training set. Since the bounding-box annotations have different aspect ratios, we normalize all boxes to have a width of 0.41 times the height, as in Dollár et al.5 We search up to 3 cells outside the image borders to find people at the image boundaries.

5.3 Importance of region of occlusion
In a first synthetic experiment, we evaluate the effect of occlusions at different object positions on the detection performance. We occlude each object image with a black rectangle at eight different vertical positions (height: 2 cells/16 pixels; width: image width/64 pixels), as shown in Figure 4(a).

Figure 4. Different synthetic image occlusions in (a) and the evaluated classification performance in (b): miss-rates at 10^-4 FPPW for occlusions vs. cell cancelling (lower is better). The average image of all test images is depicted in (b1) and the negative bias values per cell in (b2).
The images are evaluated using both our baseline system without occlusion handling and the perfect occlusion-handling method that cancels the contribution of the occluded cells in the final classification result. During canceling, the corresponding feature dimensions are set to zero, while compensating for the bias of these cells (see Section 4.1) by subtracting the corresponding values (visualized in Figure 4(b)). Note that more negative cell bias values represent more important cells. The occluded area belonging to Position 1 is indicated by the striped pattern. From the results shown in Figure 4(b), we observe that the detection performance is position-dependent and deteriorates most when adding occlusions at Positions 1 (head) and 5 (knees), which indicates that these regions have the highest importance. In general, the miss-rates are higher than the performance on the non-occluded images (15%). When adding occlusion to the bottom image part (Position 7, the area below the feet), the performance without occlusion handling increases compared to the baseline, which is caused by the absence of gradient information in this region. We can conclude that in nearly all cases, occlusion decreases the performance of the baseline system. Furthermore, even in the case of perfect occlusion detection, cell cancellation still results in a decreased performance.
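The cell-cancelling procedure of this experiment can be sketched as follows, under the same per-cell decomposition as in Section 4.1 (shapes and names are illustrative):

```python
import numpy as np

# Sketch of the cell-cancelling of Section 5.3: occluded cells are removed
# from the window score by zeroing their features and subtracting their
# share of the bias, i.e. by summing only the non-occluded cell scores.
def cancel_cells(features, w, cell_bias, occluded):
    """Window score with the contribution of occluded cells removed.

    features, w: (n_cells, dims_per_cell); cell_bias: (n_cells,)
    occluded:    boolean mask of shape (n_cells,)
    """
    keep = ~np.asarray(occluded)
    cell_scores = (features * w).sum(axis=1) + cell_bias
    return float(cell_scores[keep].sum())
```

This models a detector with perfect knowledge of the occlusion mask; the experiment above shows that even this idealized cancelling does not fully recover the non-occluded performance.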

5.4 Approach 1: occlusion detection
We will now evaluate the effect of detecting occlusions at the HOG cell level, as proposed by Wang et al.12 and introduced in Section 4.1. After an extensive study of this approach, we have found that the output of the cell-based occlusion detection (after Mean-shift segmentation) is only used to activate the two partial classifiers (upper/lower body), provided that sufficient positive object information is present. The occlusion handling in this case is only covered by the fact that a partial classifier is activated on the image part that is not occluded. Therefore, we try to answer the following three questions. (1) What is the contribution of the segmentation compared to always applying the partial classifiers? (2) More specifically, is the occlusion detection stage necessary, or is it sufficient to always evaluate the partial classifiers? (3) How is the decision made to activate the partial classifiers? To answer these questions, we evaluate the influence of the segmentation and the amount of non-occluded information (positive cells) that is required to activate the partial classifiers. Two partial classifiers of size 8×8 cells are used, one for the upper body and one for the lower body. Both are trained using 50% of the original annotations and with one round of bootstrapping. We use the following parameters: t_lower = -1, t_higher = 1, t_conf = 1.5, w_part = 0.3 and w_glob = 0.7, which are similar to the settings of Wang et al. We evaluate the effect of the weighting function from Section 4.1 by comparing with w_part = 1.0 and w_glob = 0.0. We use the number of positive cells to enable the partial classifiers and experimentally sweep this number from 0 to 64, in steps of 4 cells. Note that a minimum of 0 cells is equal to always applying the partial classifiers and a minimum of 64 cells is equal to never applying them.
When both partial classifiers are activated, s_part is equal to the maximum of both partial classifier scores. None of the classifiers is normalized. The obtained results are depicted in Figure 5, using window-level classification results. Unfortunately, we have not been able to reproduce the results published by Wang et al.12 and expect that they have performed additional processing steps not described in their paper.

Figure 5. Minimum number of positive cells in each region vs. miss-rates (at 10^-4 FPPW). Lower values are better.

The use of only one full-object classifier (a minimum of 64 positive cells) is always outperformed by the addition of partial classifiers. The best results without Mean-shift are obtained by activating the partial classifiers when at least 24 positive cells are found in each region. When enabling Mean-shift, the optimum is obtained when at least 4 positive cells are found. However, this miss-rate is equal to the lowest miss-rate obtained without Mean-shift. Overall, adding Mean-shift performs worse, or the improvement in detection performance is negligible (0.3%). Weighting (w_part = 0.3, w_glob = 0.7) is always required to compensate for the high number of false detections from the partial classifiers. Disabling weighting (w_part = 1.0, w_glob = 0.0) always results in miss-rates well above 15%; these results are therefore not shown. We have also experimented with Mean-shift on the binary cell scores, with comparable results. Although this measure shows an increase in window-level classification performance, our integration in the complete system results in a decrease in performance when measuring False Positives Per Image (FPPI) (including merging). Summarizing, using segmentation information from occlusion detection results in a negligible increase in performance compared to always applying the partial classifiers. Moreover, the performance is even lower when the method is embedded within the complete detection system (including merging).
Using the segmentation information works best when only a few positive cells are required to enable the partial classifiers, showing that the noisy cell-based classification output cannot be employed directly. From the above, we therefore conclude that occlusion detection is not necessary and that the addition of partial classifiers is the main source of the performance improvement.

5.5 Approach 2: multiple classifiers for non-occluded regions
In this section, we will evaluate the effect of detecting the non-occluded object regions. In the previous approach, we have already found that the main performance improvement originates from the application of two partial object classifiers, rather than from detecting the occluded region. We will now evaluate the concept of applying multiple classifiers that cover different non-occluded regions in more detail. First, we evaluate which classifiers should be combined to obtain optimal results. Second, we examine a method to combine multiple classifiers, while calibrating each individual classifier. Finally, we propose a cascaded implementation to lower the computational cost for a real-time implementation.

5.5.1 Classifier design and evaluation of effective object region
We already know that occlusions typically occur at certain object regions (bottom, right and left).5 In the experiment of Section 5.3, we have found that for persons, the most informative visual information is concentrated around the head area. We use these statistics and our findings to manually design 29 different classifiers, shown in Figure 6(a). The region that models the hypothesized occlusion area is ignored by the classifier and is drawn in black. To demonstrate the importance of the effective object region covered by the classifier (the classifier size), we compare 9 differently-sized classifiers, where the size of the occluded region is incrementally increased (Classifiers 0-7 and 28 from Figure 6(a)). Each classifier is independently trained on the INRIA set and the performance is evaluated by measuring the AUC. The 9 different classifiers and their detection performances are shown in Figure 6(b). Overall, we conclude that classifiers covering a larger region perform better. Classifier 1 obtains the lowest overall miss-rate, which we attribute to noise in the lower part of the region, which is ignored by this classifier.
This finding is supported by our findings from Section 5.3. We have obtained comparable results for right-to-left and left-to-right occlusions.

Figure 6. Multiple classifiers with different occlusion patterns in (a) and the detection performance (AUC, lower is better) of classifiers with increasingly smaller object area in (b). Black areas represent occlusions, where the classifier weights have value zero.

5.5.2 Selection of classifier combinations
When applying multiple classifiers using different effective object regions, each classifier will focus on different visual properties. We assume that the combination of these different classifiers will result in an improved detection performance. However, it is difficult to predict which combination of classifiers performs best. Therefore, we evaluate all 29 classifiers from Figure 6(a) and measure their effective detection performance. First, all classifiers are applied independently to the images and the detections (after merging) from all classifiers are evaluated to measure the contribution of each individual classifier. The number of occurrences of each classifier is then used to select a combination of classifiers. A classifier is counted when it has the largest contribution (highest detection score) to a correct detection (ground truth). Classifiers are not weighted by the occlusion level, so that large-region classifiers are not prioritized. Note that weighting is enabled when the selected classifiers are applied in the final detection system. The classifiers are evaluated on both the positive INRIA test set and the Dancefestival dataset. We compare normalized classifiers and set the threshold for each classifier at a fixed false-positive rate.

Table 1. Different combinations of classifiers.

  Dataset         Selection   Classifier numbers
  -               Manual      , 15 3, 9, , 28
  INRIA           Auto        1, 3, 14, 2, 10, 12, 27, 7, 8, 9, 13, 4, 26, 11, 5, 15, 19
  Dancefestival   Auto        9, 13, 27, 12, 14, 3, 15, 26, 1, 2, 10, 8, 4, 11, 5, 7, 6

The relative number of detections for each classifier is shown in Figure 7(a). Note that a low contribution of a classifier means that another classifier gives higher detection scores on the same detections, thereby making the non-contributing classifier inferior. From Figure 7(a), it can be seen that several classifiers are inferior for both datasets. The Dancefestival dataset contains more occluded persons, leading to a higher preference for small-region classifiers and left/right occlusions. We now combine a selection of classifiers based on this occurrence histogram and evaluate the performance of the classifier combination on the INRIA dataset. To assess the quality of the automated selection process, we also evaluate the performance of a manual selection of classifiers. The classifier numbers of the selections are listed in Table 1. The first row shows the manually selected classifiers, while rows two and three depict the selections generated from the INRIA and Dancefestival datasets, respectively. Note that the classifier numbers correspond to the numbers from Figure 6(a). Increasing the number of combined classifiers increases the detection performance, which converges at around 11 classifiers; beyond this point, adding more classifiers decreases the performance. In all three considered cases, a clear optimum occurs and from that point onwards, the performance always decreases when adding more classifiers. This implies that there is an optimal set of classifiers: additional classifiers only add already-found detections (redundancy) and therefore only generate false detections. With automated selection, the best results are obtained with 11 classifiers selected from the Dancefestival dataset.
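The occurrence-based selection can be sketched as follows. The data layout and function name are hypothetical, but the rule matches the text: each correctly detected ground-truth object is credited to the classifier with the highest detection score, without weighting by the occlusion level, and the most frequently credited classifiers are selected.

```python
from collections import Counter

def select_classifiers(matched_detections, num_select):
    """Rank classifiers by how often they contribute the best score.

    matched_detections: one inner list per ground-truth object, holding
    (classifier_id, score) pairs of detections that matched it. Scores
    are assumed calibrated (normalized classifiers with thresholds set
    at a common false positive rate).
    """
    credits = Counter()
    for candidates in matched_detections:
        if not candidates:
            continue
        best_id, _ = max(candidates, key=lambda c: c[1])
        credits[best_id] += 1   # unweighted: large regions get no priority
    return [cid for cid, _ in credits.most_common(num_select)]

# Toy example with three ground-truth objects and three classifiers.
matches = [
    [(0, 1.2), (9, 0.8)],   # classifier 0 gives the highest score
    [(9, 1.5), (13, 1.1)],  # classifier 9 wins
    [(9, 0.9)],             # classifier 9 wins again
]
print(select_classifiers(matches, 2))  # prints [9, 0]
```

An iterative variant that removes intermediate detections before re-counting, as suggested in the discussion, would replace the single counting pass with repeated rounds.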
However, an even better performance is obtained with the combinations of 7 and 17 manually selected classifiers. Note that the selection of the first classifier is most critical. This is clearly seen from the Dancefestival selection, where Classifier 9, modeling a significant amount of occlusion, is selected as the first classifier.

Figure 7. Combining multiple classifiers: the influence of individual classifiers on the total number of detections in (a) and the detection performance (AUC) of several classifier combinations in (b).

Figure 8. Example detections when applying 1 classifier in (a) vs. 17 classifiers in (b). Note that multiple classifiers enable the detection of significantly occluded persons.

5.5.3 Real-time implementation

Although this advanced occlusion handling increases the detection performance, adding more classifiers increases the computational cost linearly with the number of classifiers. To reduce this cost, we propose a cascaded implementation that limits the number of comparisons at each sliding-window search position. At each position, the largest-region classifier is evaluated first and only when its classification score is above a threshold, all other classifiers are evaluated. This already discards many search positions after applying the first classifier. We have evaluated both the computational complexity and the detection performance for a system with 1, 2, 4, 7 and 17 classifiers, when operated with different threshold values (t_class). The results for the manually selected classifiers are depicted in Figure 9. This figure visualizes both the computational cost and the detection performance, relative to the baseline system with one classifier. The computational cost is shown by the bars and is linked to the left vertical axis, where the value 100% represents the cost of the single-classifier baseline system. The detection performances of the different systems are shown by the lines and are linked to the right vertical axis, where the value 100% represents the performance of the single-classifier baseline system. Combining up to 17 classifiers, the performance always increases; however, combining more than 7 classifiers does not improve the performance significantly. Using 7 classifiers, the computational cost increases by 1.3%, while the associated detection performance increases by 7.6%. A suitable trade-off between performance and cost is to adopt 17 classifiers with an initial threshold of 0.2, leading to an improvement of 8% in detection performance for a 3.4% higher cost. This combination of classifiers detects more occluded objects, as shown in the example in Figure 8.
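A minimal sketch of this cascade, assuming a per-window scoring interface (the scoring functions below are stand-ins; only the evaluation order and the threshold t_class follow the text):

```python
def cascaded_scores(window_features, classifiers, t_class=0.2):
    """Evaluate the largest-region classifier first; the remaining
    classifiers run only when its score exceeds the threshold.

    classifiers: list of scoring functions, largest-region classifier
    first. Returns the (classifier_index, score) pairs evaluated.
    """
    first_score = classifiers[0](window_features)
    results = [(0, first_score)]
    if first_score <= t_class:
        return results                 # search position discarded early
    for idx, clf in enumerate(classifiers[1:], start=1):
        results.append((idx, clf(window_features)))
    return results

# Toy usage with constant-score stand-ins for trained classifiers:
weak = [lambda f, s=s: s for s in (0.1, 0.5, 0.9)]
print(len(cascaded_scores(None, weak)))    # prints 1: position discarded
strong = [lambda f, s=s: s for s in (0.5, 0.9, 0.3)]
print(len(cascaded_scores(None, strong)))  # prints 3: all classifiers run
```

Because most search positions contain background and fail the first test, the remaining classifiers run only on a small fraction of positions, which is why the measured cost increase stays at a few percent.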
Figure 9. Performance and computational costs for combinations of different numbers of classifiers. Detection performance is indicated by the lines (refer to axis at the right) and computational costs are indicated by the bars (refer to axis at the left). All classifier combinations are manually selected, as in Table 1.

6. DISCUSSION

We have evaluated two conceptual approaches for occlusion handling: detecting occluded object regions and detecting non-occluded regions. Although our experiments have shown that the detection of occlusions is not preferable, these results can be improved by using multiple image features. We have found that although Wang et al.12 describe the use of only HOG features for occlusion detection, their implementation combines HOG with LBP. Marín et al.13 show that extending HOG with LBP results in a significant improvement of the detection performance. Furthermore, our method for automated classifier selection is not optimal; a more elaborate selection procedure should select mutually exclusive classifiers to improve the combined detection performance.

We also want to place a critical note with the evaluation measures and parameters. Not all publications apply the same evaluation criteria and often, individual algorithms are based on different parameters (such as scales, dataset details and other algorithm settings). In order to obtain comparable evaluation results, Dollár et al.4,5 have made a framework for the objective evaluation of detection results. Unfortunately, comparing different implementations of the same algorithm is still influenced by the applied algorithmic parameters. The state-of-the-art detection performance on the INRIA dataset is obtained by Mathias et al.,22 who obtain a miss-rate of 16.62%. By adding occlusion handling, their miss-rate decreases to 13.70%, while the computational cost increases by 330%. In our system with 17 classifiers, the miss-rate is lowered from 33.88% to 31.05%, while the computations increase by only 3.4%. This performance difference is caused by the difference between our simple HOG features and the more discriminative features from Mathias et al.22 However, those features are specific to the object class and introduce additional computational complexity.
Finally, we want to remark that we discovered the work of Mathias et al.22 in the literature only when our own work was already being completed. Although this resulted in several similarities, it also shows that the concept of applying multiple classifiers to detect non-occluded object regions provides a suitable solution, now supported by two relatively independent investigations.

7. CONCLUSION

In this paper, we have proposed a novel system for occlusion handling and have integrated this in a sliding-window detection framework, using simple HOG features and linear classification. Occlusion handling is obtained by the combination of multiple classifiers, each covering a different level of occlusion. For real-time detection, our approach with 17 classifiers obtains an increase of 8% in detection performance with respect to the baseline system. We have proposed a cascaded implementation that increases the computational cost by only 3.4%. Although we only present results for pedestrian detection, our approach is not limited to this object class. Moreover, the fixed HOG feature transformation allows for an extension towards other object classes without additional class-specific feature calculation. Pre-defining the types of occlusions prior to training has the advantage that we do not need an additional training dataset covering all possible types of occlusions.

We have revealed that the effect of occlusion on the detection performance is position-dependent; for pedestrian detection, performance deteriorates mostly for occlusions around the head and knees. After implementing and evaluating the method by Wang et al.,12 we conclude that the largest contribution of the proposed occlusion handling is not caused by the cell-based occlusion detection and region merging, but originates from the addition of partial classifiers (upper/lower body). We have found that simply applying small-region classifiers that cover only a part of the object (e.g. a head-only detector), and therefore can handle more occlusion, strongly decreases the detection performance. Combining multiple classifiers increases the detection performance up to a certain optimal number of classifiers; adopting more classifiers beyond this point only adds already-found detections (redundancy) and generates false detections, thereby effectively decreasing the detection performance. We have proposed an automated selection method for classifiers, using statistics based on the occurrences of occlusions. Although this method performs better in some cases, the automatic selection is strongly dependent on the dataset and is regularly outperformed by the manual selection of classifiers. Besides this, the selection of the first classifier has been found to be most critical for the final system operation. Automated selection can be further improved by an iterative classifier selection method that removes intermediate detections. The combination of multiple classifiers enables the detection of persons that are strongly occluded.

REFERENCES

[1] Hoiem, D., Chodpathumwan, Y., and Dai, Q., Diagnosing error in object detectors, in [Proc. European Conference on Computer Vision (ECCV)], Springer (2012).
[2] Dalal, N. and Triggs, B., Histograms of oriented gradients for human detection, in [Proc. Conference on Computer Vision and Pattern Recognition (CVPR)], vol. 1 (2005).
[3] Cortes, C. and Vapnik, V., Support-vector networks, Machine Learning 20(3) (1995).
[4] Dollár, P., Wojek, C., Schiele, B., and Perona, P., Pedestrian detection: A benchmark, in [Proc. Conference on Computer Vision and Pattern Recognition (CVPR)], IEEE (2009).
[5] Dollár, P., Wojek, C., Schiele, B., and Perona, P., Pedestrian detection: An evaluation of the state of the art, Trans. Pattern Analysis and Machine Intelligence (PAMI) 34(4) (2012).
[6] Winn, J. and Shotton, J., The layout consistent random field for recognizing and segmenting partially occluded objects, in [Proc. Conference on Computer Vision and Pattern Recognition (CVPR)], vol. 1, 37-44, IEEE (2006).
[7] Monroy, A. and Ommer, B., Beyond bounding-boxes: Learning object shape by model-driven grouping, in [Proc. European Conference on Computer Vision (ECCV)], Springer (2012).
[8] Yang, Y., Hallman, S., Ramanan, D., and Fowlkes, C., Layered object detection for multi-class segmentation, in [Proc. Conference on Computer Vision and Pattern Recognition (CVPR)], IEEE (2010).
[9] Gould, S., Fulton, R., and Koller, D., Decomposing a scene into geometric and semantically consistent regions, in [Proc. International Conference on Computer Vision (ICCV)], 1-8, IEEE (2009).
[10] Gould, S., Gao, T., and Koller, D., Region-based segmentation and object detection, in [Advances in Neural Information Processing Systems] (2009).
[11] Gao, T., Packer, B., and Koller, D., A segmentation-aware object detection model with occlusion handling, in [Proc. Conference on Computer Vision and Pattern Recognition (CVPR)], IEEE (2011).
[12] Wang, X., Han, T., and Yan, S., An HOG-LBP human detector with partial occlusion handling, in [Proc. International Conference on Computer Vision (ICCV)] (2009).
[13] Marín, J., Vázquez, D., López, A. M., Amores, J., and Kuncheva, L., Occlusion handling via random subspace classifiers for human detection, Trans. Systems, Man, and Cybernetics (Part B) 44(3) (2014).
[14] Walk, S., Majer, N., Schindler, K., and Schiele, B., New features and insights for pedestrian detection, in [Proc. Conference on Computer Vision and Pattern Recognition (CVPR)], IEEE (2010).
[15] Dalal, N., Triggs, B., and Schmid, C., Human detection using oriented histograms of flow and appearance, in [Proc. European Conference on Computer Vision (ECCV)], Springer (2006).
[16] Felzenszwalb, P., Girshick, R., McAllester, D., and Ramanan, D., Object detection with discriminatively trained part-based models, Trans. Pattern Analysis and Machine Intelligence (PAMI) 32(9) (2010).
[17] Leibe, B., Seemann, E., and Schiele, B., Pedestrian detection in crowded scenes, in [Proc. Conference on Computer Vision and Pattern Recognition (CVPR)], vol. 1, IEEE (2005).
[18] Mikolajczyk, K., Schmid, C., and Zisserman, A., Human detection based on a probabilistic assembly of robust part detectors, in [Proc. European Conference on Computer Vision (ECCV)], 69-82, Springer (2004).
[19] Wu, B. and Nevatia, R., Detection of multiple, partially occluded humans in a single image by Bayesian combination of edgelet part detectors, in [Proc. International Conference on Computer Vision (ICCV)], vol. 1, 90-97, IEEE (2005).
[20] Fergus, R., Perona, P., and Zisserman, A., Object class recognition by unsupervised scale-invariant learning, in [Proc. Conference on Computer Vision and Pattern Recognition (CVPR)], vol. 2, II-264, IEEE (2003).
[21] Tang, S., Andriluka, M., and Schiele, B., Detection and tracking of occluded people, International Journal of Computer Vision, 1-12 (2012).
[22] Mathias, M., Benenson, R., Timofte, R., and Van Gool, L., Handling occlusions with franken-classifiers, in [Proc. International Conference on Computer Vision (ICCV)] (2013).
[23] Cheng, Y., Mean shift, mode seeking, and clustering, Trans. Pattern Analysis and Machine Intelligence (PAMI) 17(8) (1995).
[24] Dalal, N., Finding people in images and videos, PhD thesis, Institut National Polytechnique de Grenoble (INPG) (2006).


More information

Fast Human Detection Algorithm Based on Subtraction Stereo for Generic Environment

Fast Human Detection Algorithm Based on Subtraction Stereo for Generic Environment Fast Human Detection Algorithm Based on Subtraction Stereo for Generic Environment Alessandro Moro, Makoto Arie, Kenji Terabayashi and Kazunori Umeda University of Trieste, Italy / CREST, JST Chuo University,

More information

Visual Detection and Species Classification of Orchid Flowers

Visual Detection and Species Classification of Orchid Flowers 14-22 MVA2015 IAPR International Conference on Machine Vision Applications, May 18-22, 2015, Tokyo, JAPAN Visual Detection and Species Classification of Orchid Flowers Steven Puttemans & Toon Goedemé KU

More information

Histograms of Oriented Gradients

Histograms of Oriented Gradients Histograms of Oriented Gradients Carlo Tomasi September 18, 2017 A useful question to ask of an image is whether it contains one or more instances of a certain object: a person, a face, a car, and so forth.

More information

Articulated Pose Estimation with Flexible Mixtures-of-Parts

Articulated Pose Estimation with Flexible Mixtures-of-Parts Articulated Pose Estimation with Flexible Mixtures-of-Parts PRESENTATION: JESSE DAVIS CS 3710 VISUAL RECOGNITION Outline Modeling Special Cases Inferences Learning Experiments Problem and Relevance Problem:

More information

The Pennsylvania State University. The Graduate School. College of Engineering ONLINE LIVESTREAM CAMERA CALIBRATION FROM CROWD SCENE VIDEOS

The Pennsylvania State University. The Graduate School. College of Engineering ONLINE LIVESTREAM CAMERA CALIBRATION FROM CROWD SCENE VIDEOS The Pennsylvania State University The Graduate School College of Engineering ONLINE LIVESTREAM CAMERA CALIBRATION FROM CROWD SCENE VIDEOS A Thesis in Computer Science and Engineering by Anindita Bandyopadhyay

More information

the relatedness of local regions. However, the process of quantizing a features into binary form creates a problem in that a great deal of the informa

the relatedness of local regions. However, the process of quantizing a features into binary form creates a problem in that a great deal of the informa Binary code-based Human Detection Yuji Yamauchi 1,a) Hironobu Fujiyoshi 1,b) Abstract: HOG features are effective for object detection, but their focus on local regions makes them highdimensional features.

More information

Recognition of Animal Skin Texture Attributes in the Wild. Amey Dharwadker (aap2174) Kai Zhang (kz2213)

Recognition of Animal Skin Texture Attributes in the Wild. Amey Dharwadker (aap2174) Kai Zhang (kz2213) Recognition of Animal Skin Texture Attributes in the Wild Amey Dharwadker (aap2174) Kai Zhang (kz2213) Motivation Patterns and textures are have an important role in object description and understanding

More information

Histogram of Oriented Gradients (HOG) for Object Detection

Histogram of Oriented Gradients (HOG) for Object Detection Histogram of Oriented Gradients (HOG) for Object Detection Navneet DALAL Joint work with Bill TRIGGS and Cordelia SCHMID Goal & Challenges Goal: Detect and localise people in images and videos n Wide variety

More information

Graph Matching Iris Image Blocks with Local Binary Pattern

Graph Matching Iris Image Blocks with Local Binary Pattern Graph Matching Iris Image Blocs with Local Binary Pattern Zhenan Sun, Tieniu Tan, and Xianchao Qiu Center for Biometrics and Security Research, National Laboratory of Pattern Recognition, Institute of

More information

2D Image Processing Feature Descriptors

2D Image Processing Feature Descriptors 2D Image Processing Feature Descriptors Prof. Didier Stricker Kaiserlautern University http://ags.cs.uni-kl.de/ DFKI Deutsches Forschungszentrum für Künstliche Intelligenz http://av.dfki.de 1 Overview

More information

Object Recognition. Computer Vision. Slides from Lana Lazebnik, Fei-Fei Li, Rob Fergus, Antonio Torralba, and Jean Ponce

Object Recognition. Computer Vision. Slides from Lana Lazebnik, Fei-Fei Li, Rob Fergus, Antonio Torralba, and Jean Ponce Object Recognition Computer Vision Slides from Lana Lazebnik, Fei-Fei Li, Rob Fergus, Antonio Torralba, and Jean Ponce How many visual object categories are there? Biederman 1987 ANIMALS PLANTS OBJECTS

More information

CS229: Action Recognition in Tennis

CS229: Action Recognition in Tennis CS229: Action Recognition in Tennis Aman Sikka Stanford University Stanford, CA 94305 Rajbir Kataria Stanford University Stanford, CA 94305 asikka@stanford.edu rkataria@stanford.edu 1. Motivation As active

More information

String distance for automatic image classification

String distance for automatic image classification String distance for automatic image classification Nguyen Hong Thinh*, Le Vu Ha*, Barat Cecile** and Ducottet Christophe** *University of Engineering and Technology, Vietnam National University of HaNoi,

More information

SURF. Lecture6: SURF and HOG. Integral Image. Feature Evaluation with Integral Image

SURF. Lecture6: SURF and HOG. Integral Image. Feature Evaluation with Integral Image SURF CSED441:Introduction to Computer Vision (2015S) Lecture6: SURF and HOG Bohyung Han CSE, POSTECH bhhan@postech.ac.kr Speed Up Robust Features (SURF) Simplified version of SIFT Faster computation but

More information

Histogram of Oriented Gradients for Human Detection

Histogram of Oriented Gradients for Human Detection Histogram of Oriented Gradients for Human Detection Article by Navneet Dalal and Bill Triggs All images in presentation is taken from article Presentation by Inge Edward Halsaunet Introduction What: Detect

More information

Recent Researches in Automatic Control, Systems Science and Communications

Recent Researches in Automatic Control, Systems Science and Communications Real time human detection in video streams FATMA SAYADI*, YAHIA SAID, MOHAMED ATRI AND RACHED TOURKI Electronics and Microelectronics Laboratory Faculty of Sciences Monastir, 5000 Tunisia Address (12pt

More information

SUMMARY: DISTINCTIVE IMAGE FEATURES FROM SCALE- INVARIANT KEYPOINTS

SUMMARY: DISTINCTIVE IMAGE FEATURES FROM SCALE- INVARIANT KEYPOINTS SUMMARY: DISTINCTIVE IMAGE FEATURES FROM SCALE- INVARIANT KEYPOINTS Cognitive Robotics Original: David G. Lowe, 004 Summary: Coen van Leeuwen, s1460919 Abstract: This article presents a method to extract

More information

Efficient Acquisition of Human Existence Priors from Motion Trajectories

Efficient Acquisition of Human Existence Priors from Motion Trajectories Efficient Acquisition of Human Existence Priors from Motion Trajectories Hitoshi Habe Hidehito Nakagawa Masatsugu Kidode Graduate School of Information Science, Nara Institute of Science and Technology

More information

Face Detection and Alignment. Prof. Xin Yang HUST

Face Detection and Alignment. Prof. Xin Yang HUST Face Detection and Alignment Prof. Xin Yang HUST Many slides adapted from P. Viola Face detection Face detection Basic idea: slide a window across image and evaluate a face model at every location Challenges

More information

Robust PDF Table Locator

Robust PDF Table Locator Robust PDF Table Locator December 17, 2016 1 Introduction Data scientists rely on an abundance of tabular data stored in easy-to-machine-read formats like.csv files. Unfortunately, most government records

More information

Pairwise Threshold for Gaussian Mixture Classification and its Application on Human Tracking Enhancement

Pairwise Threshold for Gaussian Mixture Classification and its Application on Human Tracking Enhancement Pairwise Threshold for Gaussian Mixture Classification and its Application on Human Tracking Enhancement Daegeon Kim Sung Chun Lee Institute for Robotics and Intelligent Systems University of Southern

More information

Pedestrian and Part Position Detection using a Regression-based Multiple Task Deep Convolutional Neural Network

Pedestrian and Part Position Detection using a Regression-based Multiple Task Deep Convolutional Neural Network Pedestrian and Part Position Detection using a Regression-based Multiple Tas Deep Convolutional Neural Networ Taayoshi Yamashita Computer Science Department yamashita@cs.chubu.ac.jp Hiroshi Fuui Computer

More information

Person Detection in Images using HoG + Gentleboost. Rahul Rajan June 1st July 15th CMU Q Robotics Lab

Person Detection in Images using HoG + Gentleboost. Rahul Rajan June 1st July 15th CMU Q Robotics Lab Person Detection in Images using HoG + Gentleboost Rahul Rajan June 1st July 15th CMU Q Robotics Lab 1 Introduction One of the goals of computer vision Object class detection car, animal, humans Human

More information

Modern Object Detection. Most slides from Ali Farhadi

Modern Object Detection. Most slides from Ali Farhadi Modern Object Detection Most slides from Ali Farhadi Comparison of Classifiers assuming x in {0 1} Learning Objective Training Inference Naïve Bayes maximize j i logp + logp ( x y ; θ ) ( y ; θ ) i ij

More information

A Boosted Multi-Task Model for Pedestrian Detection with Occlusion Handling

A Boosted Multi-Task Model for Pedestrian Detection with Occlusion Handling Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence A Boosted Multi-Task Model for Pedestrian Detection with Occlusion Handling Chao Zhu and Yuxin Peng Institute of Computer Science

More information

Ensemble of Bayesian Filters for Loop Closure Detection

Ensemble of Bayesian Filters for Loop Closure Detection Ensemble of Bayesian Filters for Loop Closure Detection Mohammad Omar Salameh, Azizi Abdullah, Shahnorbanun Sahran Pattern Recognition Research Group Center for Artificial Intelligence Faculty of Information

More information

Find that! Visual Object Detection Primer

Find that! Visual Object Detection Primer Find that! Visual Object Detection Primer SkTech/MIT Innovation Workshop August 16, 2012 Dr. Tomasz Malisiewicz tomasz@csail.mit.edu Find that! Your Goals...imagine one such system that drives information

More information

Human detection solution for a retail store environment

Human detection solution for a retail store environment FACULDADE DE ENGENHARIA DA UNIVERSIDADE DO PORTO Human detection solution for a retail store environment Vítor Araújo PREPARATION OF THE MSC DISSERTATION Mestrado Integrado em Engenharia Eletrotécnica

More information

Region-based Segmentation and Object Detection

Region-based Segmentation and Object Detection Region-based Segmentation and Object Detection Stephen Gould Tianshi Gao Daphne Koller Presented at NIPS 2009 Discussion and Slides by Eric Wang April 23, 2010 Outline Introduction Model Overview Model

More information

Sketchable Histograms of Oriented Gradients for Object Detection

Sketchable Histograms of Oriented Gradients for Object Detection Sketchable Histograms of Oriented Gradients for Object Detection No Author Given No Institute Given Abstract. In this paper we investigate a new representation approach for visual object recognition. The

More information

OCCLUSION BOUNDARIES ESTIMATION FROM A HIGH-RESOLUTION SAR IMAGE

OCCLUSION BOUNDARIES ESTIMATION FROM A HIGH-RESOLUTION SAR IMAGE OCCLUSION BOUNDARIES ESTIMATION FROM A HIGH-RESOLUTION SAR IMAGE Wenju He, Marc Jäger, and Olaf Hellwich Berlin University of Technology FR3-1, Franklinstr. 28, 10587 Berlin, Germany {wenjuhe, jaeger,

More information

Fast Human Detection Using a Cascade of Histograms of Oriented Gradients

Fast Human Detection Using a Cascade of Histograms of Oriented Gradients MITSUBISHI ELECTRIC RESEARCH LABORATORIES http://www.merl.com Fast Human Detection Using a Cascade of Histograms of Oriented Gradients Qiang Zhu, Shai Avidan, Mei-Chen Yeh, Kwang-Ting Cheng TR26-68 June

More information

Bootstrapping Boosted Random Ferns for Discriminative and Efficient Object Classification

Bootstrapping Boosted Random Ferns for Discriminative and Efficient Object Classification Pattern Recognition Pattern Recognition (22) 5 Bootstrapping Boosted Random Ferns for Discriminative and Efficient Object Classification M. Villamizar, J. Andrade-Cetto, A. Sanfeliu and F. Moreno-Noguer

More information

Part-based and local feature models for generic object recognition

Part-based and local feature models for generic object recognition Part-based and local feature models for generic object recognition May 28 th, 2015 Yong Jae Lee UC Davis Announcements PS2 grades up on SmartSite PS2 stats: Mean: 80.15 Standard Dev: 22.77 Vote on piazza

More information

Pedestrian Detection with Occlusion Handling

Pedestrian Detection with Occlusion Handling Pedestrian Detection with Occlusion Handling Yawar Rehman 1, Irfan Riaz 2, Fan Xue 3, Jingchun Piao 4, Jameel Ahmed Khan 5 and Hyunchul Shin 6 Department of Electronics and Communication Engineering, Hanyang

More information

Face detection and recognition. Detection Recognition Sally

Face detection and recognition. Detection Recognition Sally Face detection and recognition Detection Recognition Sally Face detection & recognition Viola & Jones detector Available in open CV Face recognition Eigenfaces for face recognition Metric learning identification

More information

CS 223B Computer Vision Problem Set 3

CS 223B Computer Vision Problem Set 3 CS 223B Computer Vision Problem Set 3 Due: Feb. 22 nd, 2011 1 Probabilistic Recursion for Tracking In this problem you will derive a method for tracking a point of interest through a sequence of images.

More information

Pedestrian Detection Using Structured SVM

Pedestrian Detection Using Structured SVM Pedestrian Detection Using Structured SVM Wonhui Kim Stanford University Department of Electrical Engineering wonhui@stanford.edu Seungmin Lee Stanford University Department of Electrical Engineering smlee729@stanford.edu.

More information

Object Detection Design challenges

Object Detection Design challenges Object Detection Design challenges How to efficiently search for likely objects Even simple models require searching hundreds of thousands of positions and scales Feature design and scoring How should

More information

[2008] IEEE. Reprinted, with permission, from [Yan Chen, Qiang Wu, Xiangjian He, Wenjing Jia,Tom Hintz, A Modified Mahalanobis Distance for Human

[2008] IEEE. Reprinted, with permission, from [Yan Chen, Qiang Wu, Xiangjian He, Wenjing Jia,Tom Hintz, A Modified Mahalanobis Distance for Human [8] IEEE. Reprinted, with permission, from [Yan Chen, Qiang Wu, Xiangian He, Wening Jia,Tom Hintz, A Modified Mahalanobis Distance for Human Detection in Out-door Environments, U-Media 8: 8 The First IEEE

More information

Histograms of Oriented Gradients for Human Detection p. 1/1

Histograms of Oriented Gradients for Human Detection p. 1/1 Histograms of Oriented Gradients for Human Detection p. 1/1 Histograms of Oriented Gradients for Human Detection Navneet Dalal and Bill Triggs INRIA Rhône-Alpes Grenoble, France Funding: acemedia, LAVA,

More information

Close-Range Human Detection for Head-Mounted Cameras

Close-Range Human Detection for Head-Mounted Cameras D. MITZEL, B. LEIBE: CLOSE-RANGE HUMAN DETECTION FOR HEAD CAMERAS Close-Range Human Detection for Head-Mounted Cameras Dennis Mitzel mitzel@vision.rwth-aachen.de Bastian Leibe leibe@vision.rwth-aachen.de

More information

Previously. Part-based and local feature models for generic object recognition. Bag-of-words model 4/20/2011

Previously. Part-based and local feature models for generic object recognition. Bag-of-words model 4/20/2011 Previously Part-based and local feature models for generic object recognition Wed, April 20 UT-Austin Discriminative classifiers Boosting Nearest neighbors Support vector machines Useful for object recognition

More information