High Performance Object Detection by Collaborative Learning of Joint Ranking of Granules Features


Chang Huang and Ram Nevatia
University of Southern California, Institute for Robotics and Intelligent Systems
Los Angeles, CA 90089, USA
{huangcha

Abstract

Object detection remains an important but challenging task in computer vision. We present a method that combines high accuracy with high efficiency. We adopt simplified forms of APCF features [3], which we term Joint Ranking of Granules (JRoG) features; each feature takes a discrete value obtained by uniting the binary ranking results of pairwise granules in the image. We propose a novel collaborative learning method for JRoG features, which consists of a Simulated Annealing (SA) module and an incremental feature selection module. The two complementary modules collaborate to efficiently search the formidably large JRoG feature space for discriminative features, which are fed into a boosted cascade for object detection. To cope with occlusions in crowded environments, we employ the strategy of part-based detection, as in [19], but propose a new dynamic search method to improve the Bayesian combination of the part detection results. Experiments on several challenging data sets show that our approach achieves not only considerable improvement in detection accuracy but also major improvements in computational efficiency: on a Xeon 3 GHz computer, with only a single thread, it can process a million scanning windows per second, which suffices for many practical real-time detection tasks.

1. Introduction

Object detection is a fundamental task in computer vision. Although considerable progress has been achieved in recent years, detection of objects in real-life images remains a challenging task. We focus on pedestrian detection examples in this paper, though our methods should apply to other objects as well.
To illustrate the difficulty of the task, consider the images shown in Fig. 1: the one on the left is taken from the i-LIDS subway data set [1], which includes considerable inter-occlusion between pedestrians, and the one on the right is from the Zurich mobile pedestrian data set (ETHZ) [4], in which the camera is on a continuously moving platform, people are crowded, illumination changes significantly and the background is rather cluttered.

Figure 1. Sample images of two challenging data sets (left: i-LIDS Subway Set; right: Zurich Mobile Set).

There is a vast literature on techniques of object detection, and specifically pedestrian detection. Learning-based methods have come to be dominant; the key issues here are the features and the learning algorithms that are used. Features can be global or local. Global features, such as edge templates [6] and shape models [5], can be highly discriminative but are sensitive to changes in overall shape due to occlusions and articulations. Local features, such as wavelet descriptors [12, 11], SIFT-like features [10] and Histogram of Oriented Gradient (HOG) [2], are more flexible, but a set of them needs to be selected and combined in some way, typically via a learning algorithm. There have also been efforts to combine a variety of complementary features, such as Wu and Nevatia's Heterogeneous Local Features [21], Schwartz et al.'s edge-based features augmented by texture and color [14], and Wang et al.'s [18] HOG features + Local Binary Pattern (LBP) approach. Leibe et al. [9] combine both local and global cues via a probabilistic top-down segmentation.

One thread among the successful approaches has been to build on the pioneering work of Viola and Jones for face detection [16]. Enhancements include the development of new features, such as motion-enhanced Haar-like features [17], Edgelet features [19] and covariance matrix descriptors [15]. Classifier structure has also been enhanced beyond the original cascade, for example the trees in [20].
We present a new method that also follows this paradigm but uses a very different kind of feature set which, in turn, requires very different learning techniques. The resulting detectors exhibit considerable improvements in accuracy while also reducing the computational costs substantially.

In a recent workshop paper, Duan et al. [3] introduced a novel class of features called Associated Pairing Comparison Features (APCF), which are built on earlier granule features that were demonstrated for face detection [7]. APCF features comprise simple comparisons of simple image properties, such as color or gradient, in small regions (called granules) of the detection window. As an APCF feature is defined by a sequence of unrestricted granules, the feature space can be very large; hence normal AdaBoost learning techniques, which require exhaustive enumeration of features, are inapplicable. Instead, Duan et al. use a heuristic algorithm for feature selection.

Our work builds on the concept of APCF but considers a simpler, special case of APCF. We term these features Joint Ranking of Granules because our comparison thresholds are set to zero (the name will become more obvious when we introduce further details of the features). Based on Duan et al.'s heuristic algorithm, we propose a collaborative learning algorithm enhanced by a simulated annealing (SA) step in combination with a Real AdaBoost algorithm, and demonstrate the advantage of the new method.

Duan et al. [3] show impressive results for the task of pedestrian detection on some standard data sets. However, their method does not explicitly account for inter-object occlusions, so it may fail in more crowded environments. We incorporate a part-based approach in our work: we learn detectors for selected parts of the human body, and the final detection of pedestrians is based on a joint inference of part combination. This step of our method builds on earlier work of Wu and Nevatia [19]; however, we design a new dynamic search approach instead of the static search used in previous work.
While our method builds on the earlier work described above, we demonstrate that it achieves considerable improvements in accuracy compared to state-of-the-art detectors, which are already very good, and this gain does not come at the cost of increased computation; in fact, the detector is significantly more efficient than the others. We provide a detailed analysis of performance later in the paper.

The rest of this paper is organized as follows: Section 2 outlines our approach; Section 3 introduces the JRoG feature; Section 4 elaborates the collaborative learning algorithm for JRoG features; Section 5 describes our part-based detection method; Section 6 presents experimental results on several challenging testing sets; Section 7 concludes the paper.

2. Outline of Our Approach

In this paper, Joint Ranking of Granules (JRoG) features are adopted as the descriptors. They are in fact APCF features [3] simplified by excluding gradient granules and setting all comparison thresholds to zero. As illustrated in Fig. 2, a JRoG feature unites the binary ranking results of several granule pairs, which are selected among thousands of grey-level granules in the granular space [7]. Such unrestricted combination endows JRoG features with remarkable flexibility, but it also makes the conventional exhaustive search method inapplicable due to the tremendous size of the entire feature set.

Figure 2. Computation of a 3-bit JRoG feature. An image patch is converted into the grey-level granular space, in which three pairs of granules are ranked respectively. A ranking result of 1 indicates that the brightness of a granule (the solid square) is higher than that of its opponent (the hollow square), and 0 otherwise. Finally, the three bits of ranking results constitute the output of this JRoG feature.
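The joint ranking step of Fig. 2 can be illustrated with a minimal sketch. It assumes the granule values have already been computed, and the function name `jrog_index`, the sample values and the choice of the first bit as the most significant one are illustrative assumptions, not details fixed by the text.

```python
from typing import Sequence, Tuple


def jrog_index(granule_values: Sequence[float],
               pairs: Sequence[Tuple[int, int]]) -> int:
    """Unite the binary ranking results of granule pairs into one index.

    granule_values: precomputed granule brightness values.
    pairs: (solid, hollow) positions into granule_values, one pair per bit.
    """
    index = 0
    for solid, hollow in pairs:
        # A bit is 1 only when the first granule is strictly brighter.
        bit = 1 if granule_values[solid] > granule_values[hollow] else 0
        # Assumption: the first pair becomes the most significant bit.
        index = (index << 1) | bit
    return index


# Three pairs give a 3-bit feature with 2**3 = 8 possible outputs.
values = [120.0, 95.5, 40.0, 77.25, 60.0, 60.0]
pairs = [(0, 1), (2, 3), (4, 5)]
print(jrog_index(values, pairs))  # bits 1, 0, 0
```

Because the comparison is strict, a tie between two granules yields a 0 bit, which matches the "0 otherwise" convention of the caption.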
Inspired by the heuristic algorithm used in [3] for APCF features, we propose a collaborative learning approach to alleviate the difficulty of learning such combinatorial features. It comprises an SA module and an incremental feature selection module. The former samples the vast feature space in a probabilistic way, while the latter progressively filters out ineffective features in an enumerated set and finally selects an optimal one in a deterministic way. The two complementary modules collaborate to efficiently search the formidably large JRoG feature space and select discriminative JRoG features for the training of the domain-partitioning weak classifiers proposed by Schapire and Singer [13]. Moreover, we improve Wu and Nevatia's work [19] on Bayesian combination of part detection results by means of a more effective partition of body parts in crowded environments and a dynamic search method for optimal combination results.

3. JRoG Features

JRoG features are a type of combinatorial feature whose outputs are discrete index numbers. Their elementary features come from the granular space of grey-level image instances. The granular space [7], denoted by $G$, is a computationally efficient multi-resolution space extended from the image instance space $X$, whose bases are granules that observe an image instance at different locations and scales. Denoting the intensity of pixel $(u, v)$ in an instance $x \in X$ by $x(u, v)$, a granule is defined by a triplet $(u, v, s)$ as

$$g_{u,v,s}(x) = \frac{1}{2^s \times 2^s} \sum_{i=0}^{2^s-1} \sum_{j=0}^{2^s-1} x(u+i, v+j), \tag{1}$$

which is the intensity average of the pixels in a square whose top-left point is $(u, v)$ and whose width is $2^s$. Fig. 3 illustrates four granules of different scales on a 16 × 16 image instance. In this paper, we choose granules of scale 0 to 3 to constitute the granular space. Notice that the granular space of a $w \times h$ image instance has $(w - (2^s - 1)) \times (h - (2^s - 1))$ granules of scale $s$.

Figure 3. Four granules of scale from 0 to 3 on a 16 × 16 image.

By ranking two arbitrary granules, an efficient bi-partition of the instance space $X$ can be obtained as

$$r\big(g_i(x), g_j(x)\big) = \begin{cases} 1, & \text{if } g_i(x) > g_j(x) \\ 0, & \text{otherwise,} \end{cases} \tag{2}$$

and $k$ such bi-partitions jointly define a $k$-bit JRoG feature that provides a $2^k$-partition of $X$:

$$J(x : \mathbf{g}) = [b_0 b_1 \cdots b_{k-1}] \in \{0, 1, \ldots, 2^k - 1\}, \quad \mathbf{g} = (g_0, g_1, \ldots, g_{2k-1}), \quad b_i = r\big(g_{2i}(x), g_{2i+1}(x)\big), \tag{3}$$

in which $\mathbf{g}$ is a $2k$-dimensional subspace of $G$ that defines the JRoG feature by its $2k$ basis granules, and the $i$-th bi-partition $b_i$ is given by the ranking result between two consecutive granules in $\mathbf{g}$. In this way, the instance space $X$ is divided into $2^k$ disjoint blocks $\{X_0, \ldots, X_{2^k-1}\}$, and an instance $x$ falls into block $X_i$ if and only if $J(x) = i$.

Essentially, JRoG features are special decision trees in which all nodes at the same level share one bi-partition. They are derived from APCF features [3] by setting the comparison threshold to zero and excluding gradient granules. This simplification helps JRoG features achieve even higher computational efficiency. In the following sections, the JRoG feature $J(x : \mathbf{g})$ is sometimes abbreviated to $\mathbf{g}$, since it is defined by this granular subspace.

4. Collaborative Learning of JRoG Features for Real AdaBoost

Let $S = \{(x_i, y_i, w_i) : x_i \in X, y_i = \pm 1, w_i \in \mathbb{R}\}$ be the training sample set, where $x_i$ is an instance, $y_i$ is its label and $w_i$ is the sample weight. Schapire and Singer [13] point out that in the Real AdaBoost algorithm, given a domain partition function such as the JRoG feature $J(x : \mathbf{g})$, the corresponding optimal weak classifier $h(x)$ receives a normalization factor of sample weights given by

$$Z(S, h(x)) = 2 \sum_j \sqrt{W_j^{+} W_j^{-}}, \tag{4}$$

where $W_j^{b}$ is the weight sum of all samples labeled as $b$ and falling into the $j$-th block. In other words, the learning of domain-partitioning weak classifiers for Real AdaBoost is reduced to selecting optimal JRoG features that minimize this normalization factor (hereafter, we abbreviate $Z(S, h(x))$ to $Z(S, \mathbf{g})$, as $h(x)$ is determined by the JRoG feature $\mathbf{g}$ once $S$ is given).

However, this is a nontrivial problem due to the tremendous size of the JRoG feature set. Take a training instance image as an example: it has 4725 granules from scale 0 to 3. As a typical 6-bit JRoG feature used in this paper consists of 12 granules (Eq. 3), the number of different 6-bit JRoG features that can be formed from these granules is combinatorially large. Therefore, the conventional exhaustive search method used for Haar-like features or HOG features is inapplicable to the selection of discriminative JRoG features. To alleviate this problem, we propose a novel collaborative learning method comprising an incremental feature selection module and an SA module.

Before describing this method, we give two distance definitions for JRoG features. Let $\mathbf{g}_p = (g_{p0}, \ldots, g_{pm})$ and $\mathbf{g}_q = (g_{q0}, \ldots, g_{qn})$ be two JRoG features. The first distance is the number of granules that differ at the same position between them:

$$D_1(\mathbf{g}_p, \mathbf{g}_q) = \begin{cases} \sum_{i=0}^{n} [\![ g_{pi} \neq g_{qi} ]\!], & \text{if } m = n \\ +\infty, & \text{otherwise,} \end{cases} \tag{5}$$

where $[\![\cdot]\!]$ outputs 1 if the inner condition is true and 0 otherwise. This distance is infinite if the two subspaces have different dimensions.
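To make Eqs. 1 and 4 and the granule count concrete, here is a minimal sketch in plain Python. The function names, the `image[row][column]` indexing and the 24 × 58 window are assumptions for illustration: the window size is not legible in the text here, and 24 × 58 is simply one size that reproduces the stated count of 4725 granules.

```python
import math
from typing import Sequence


def granule(image: Sequence[Sequence[float]], u: int, v: int, s: int) -> float:
    """Eq. 1: mean intensity of the 2^s x 2^s square with top-left pixel (u, v)."""
    side = 1 << s
    total = 0.0
    for i in range(side):
        for j in range(side):
            # Assumption: u indexes columns, v indexes rows.
            total += image[v + j][u + i]
    return total / (side * side)


def granule_count(width: int, height: int, scales=range(4)) -> int:
    """Number of granules of the given scales that fit in a width x height window."""
    total = 0
    for s in scales:
        side = 1 << s
        total += (width - side + 1) * (height - side + 1)
    return total


def normalization_factor(block_of: Sequence[int], labels: Sequence[int],
                         weights: Sequence[float], num_blocks: int) -> float:
    """Eq. 4: Z = 2 * sum_j sqrt(W_j^+ * W_j^-).

    block_of[i] is the JRoG output (block index) of sample i.
    """
    pos = [0.0] * num_blocks
    neg = [0.0] * num_blocks
    for j, y, w in zip(block_of, labels, weights):
        if y > 0:
            pos[j] += w
        else:
            neg[j] += w
    return 2.0 * sum(math.sqrt(p * n) for p, n in zip(pos, neg))


# Hypothetical 24 x 58 window: 1392 + 1311 + 1155 + 867 granules.
print(granule_count(24, 58))
```

A partition that separates the two classes perfectly gives Z = 0, while a single block holding equal positive and negative weight gives Z = 1 for weights that sum to one, which is why a smaller Z marks a more discriminative feature. In practice the granular space would be computed once per image (for example with an integral image) rather than by the nested loops shown here.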
The second distance is the largest granule-to-granule Euclidean distance:

$$D_2(\mathbf{g}_p, \mathbf{g}_q) = \max_i \{ d(g_{pi}, g_{qi}) \}, \tag{6}$$

in which

$$d(g_{pi}, g_{qi}) = \sqrt{(\bar{u}_{pi} - \bar{u}_{qi})^2 + (\bar{v}_{pi} - \bar{v}_{qi})^2 + (e_{pi} - e_{qi})^2}, \tag{7}$$

where $e = 2^{s-1}$ is the half side length of granule $g_{u,v,s}$ (Eq. 1), and $\bar{u} = u + e$ and $\bar{v} = v + e$ are its center coordinates. Based on these two distances, the neighbors of a JRoG feature $\mathbf{g}$ can be formulated as

$$B_{\theta_1,\theta_2}(\mathbf{g}) = \{ \mathbf{g}' : D_1(\mathbf{g}', \mathbf{g}) \le \theta_1 \wedge D_2(\mathbf{g}', \mathbf{g}) \le \theta_2 \}, \tag{8}$$

where $\theta_1$ and $\theta_2$ are thresholds for the two distances. These neighbors are used as candidates in the search for discriminative features, since a good feature is likely to have a better neighbor nearby.
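The two distances and the neighborhood test can be sketched as follows. A granule is represented as a `(u, v, s)` tuple and a feature as a tuple of granules; these representations and the function names are illustrative assumptions. Returning infinity from `d2` for features of different length is also an assumption, since the text only states that case for the first distance.

```python
import math
from typing import Sequence, Tuple

Granule = Tuple[int, int, int]  # (u, v, s)


def granule_distance(a: Granule, b: Granule) -> float:
    """Eq. 7: Euclidean distance over center coordinates and half side length."""
    def center_and_half(g: Granule):
        u, v, s = g
        e = (1 << s) / 2.0  # half side length 2^(s-1)
        return u + e, v + e, e

    ua, va, ea = center_and_half(a)
    ub, vb, eb = center_and_half(b)
    return math.sqrt((ua - ub) ** 2 + (va - vb) ** 2 + (ea - eb) ** 2)


def d1(p: Sequence[Granule], q: Sequence[Granule]) -> float:
    """Eq. 5: number of positions whose granules differ, or infinity."""
    if len(p) != len(q):
        return math.inf
    return sum(1 for a, b in zip(p, q) if a != b)


def d2(p: Sequence[Granule], q: Sequence[Granule]) -> float:
    """Eq. 6: largest granule-to-granule distance over matching positions."""
    if len(p) != len(q):
        return math.inf  # assumption, mirroring the first distance
    return max(granule_distance(a, b) for a, b in zip(p, q))


def is_neighbor(candidate: Sequence[Granule], feature: Sequence[Granule],
                theta1: float, theta2: float) -> bool:
    """Eq. 8: membership test for the neighborhood of a feature."""
    return d1(candidate, feature) <= theta1 and d2(candidate, feature) <= theta2


feature = ((0, 0, 0), (4, 4, 1))
shifted = ((0, 0, 0), (6, 4, 1))  # second granule moved two pixels to the right
print(d1(shifted, feature), d2(shifted, feature))
print(is_neighbor(shifted, feature, theta1=1, theta2=8))
```

Including the half side length in Eq. 7 means that two granules at the same center but with different scales are still a nonzero distance apart.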

4.1. Incremental Feature Selection Module

This module adopts the incremental feature selection method proposed by Huang et al. [7] to quickly select an optimal JRoG feature from a large feature set. As formalized in Fig. 4, it starts with a small sample subset $S_0$ and the entire JRoG feature set $G_0$, and computes the normalization factor (Eq. 4) of every candidate feature in $G_0$ with respect to the training samples in $S_0$. The mean $Z$ value of all candidate features is chosen as the threshold $\beta$ to filter out inferior candidates in the current feature set, so that the remaining ones constitute a shrunken feature subset. This process repeats until all training samples are employed for evaluation. Since features usually retain similar discriminability in every training sample subset, discriminative ones are very likely to be preserved in the reduced feature subset. If the numbers of training samples and candidate features are $N$ and $M$ respectively, computing $Z(S, \mathbf{g})$ takes $O(N)$ time and the conventional exhaustive search method takes $O(MN)$ to select an optimal JRoG feature from $G$, while the time required by this incremental feature selection method is remarkably reduced to $O(M \ln N)$.

Figure 4. Incremental feature selection method.
Given: Training sample set $S = \{(x_i, y_i, w_i)\}_{i=1}^{N}$ and JRoG feature set $G = \{\mathbf{g}_i\}_{i=1}^{M}$;
Init: Initial feature set $G_0 = G$, sample subsets $\{S_0, \ldots, S_K\}$, where $S_K = S$ and $|S_i| = \frac{1}{2}|S_{i+1}|$;
For $r = 0, 1, \ldots, K-1$:
  $\beta = \frac{1}{|G_r|} \sum_{\mathbf{g} \in G_r} Z(S_r, \mathbf{g})$;
  $G_{r+1} = \{ \mathbf{g} : \mathbf{g} \in G_r \text{ and } Z(S_r, \mathbf{g}) < \beta \}$;
Output: JRoG feature $\mathbf{g}^* = \arg\min_{\mathbf{g} \in G_K} Z(S_K, \mathbf{g})$.

4.2. SA Module

Simulated Annealing [8] is a generic probabilistic meta-heuristic for global optimization problems. Fig. 5 describes our way of applying this method to search for discriminative JRoG features. The candidate set for the new state is composed of the neighbors of the current state $\mathbf{g}$. The best feature $\mathbf{g}^*$ is maintained throughout the whole search process and finally output as the optimal state.
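The two modules can be sketched generically as below. This is an illustration under stated assumptions rather than the authors' implementation: features are opaque objects, the normalization factor is supplied as a callback `z(samples, feature)`, and `random_neighbor` is a hypothetical callback that draws one neighbor of the current feature. The guard that keeps the survivor list when no candidate falls below the mean is an addition that Fig. 4 does not specify.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple, TypeVar

F = TypeVar("F")  # a feature, treated as an opaque object


def incremental_select(features: Sequence[F], sample_subsets: Sequence[Sequence],
                       z: Callable[[Sequence, F], float]) -> F:
    """Fig. 4: filter features on growing sample subsets, pick the best on the last."""
    survivors: List[F] = list(features)
    for subset in sample_subsets[:-1]:
        scores = [z(subset, g) for g in survivors]
        beta = sum(scores) / len(scores)  # mean Z is the filtering threshold
        kept = [g for g, score in zip(survivors, scores) if score < beta]
        # Assumption: if all scores are equal nothing is below the mean, so keep all.
        survivors = kept or survivors
    return min(survivors, key=lambda g: z(sample_subsets[-1], g))


def simulated_annealing(samples: Sequence, g0: F,
                        z: Callable[[Sequence, F], float],
                        random_neighbor: Callable[[F, random.Random], F],
                        t0: float, gamma: float, iterations: int,
                        rng: random.Random) -> Tuple[F, float]:
    """Geometric-cooling SA that returns the best feature and the Jump/Keep ratio."""
    current, e_cur = g0, z(samples, g0)
    best, e_best = current, e_cur
    jumps = 0
    for r in range(iterations):
        temperature = t0 * gamma ** r
        candidate = random_neighbor(current, rng)
        e_new = z(samples, candidate)
        delta = e_new - e_cur
        # Improvements are always accepted (exp(-delta/T) >= 1); worse moves
        # are accepted with probability exp(-delta/T).
        if delta <= 0 or rng.random() <= math.exp(-delta / temperature):
            current, e_cur = candidate, e_new
            jumps += 1
            if e_cur < e_best:
                best, e_best = current, e_cur
    keeps = iterations - jumps
    ratio = jumps / keeps if keeps else math.inf
    return best, ratio


# Toy usage: integer "features" with a made-up energy minimized at 7.
toy_z = lambda samples, g: float((g - 7) ** 2)
step = lambda g, rng: g + rng.choice((-1, 1))
steps = 2000
best, ratio = simulated_annealing([], 0, toy_z, step, t0=1.0,
                                  gamma=0.01 ** (1.0 / steps),
                                  iterations=steps, rng=random.Random(0))
print(best, ratio)
```

Choosing `gamma = 0.01 ** (1 / steps)` makes the temperature fall to roughly 1% of its starting value by the last iteration. The toy energy stands in for Z only to exercise the control flow.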
In practice, we set $\theta_1 = 1$ and $\theta_2 = 8$, so that each candidate is generated by replacing one granule of $\mathbf{g}$ with a nearby granule at a distance of no more than 8. Besides, for simplicity and efficiency, we heuristically set the maximum iteration number $R = 1000 \cdot \dim(\mathbf{g}_0)$ and $\gamma = 0.01^{1/R}$, so that each granule can be changed 1000 times on average and the SA process ends at temperature $0.01\,T_0$.

Figure 5. Simulated Annealing for Searching JRoG Features.
Given: Training sample set $S$, initial JRoG feature $\mathbf{g}_0$, starting temperature $T_0$, temperature decreasing step $\gamma$, maximum iteration number $R$, and two thresholds for neighbors, $\theta_1$ and $\theta_2$;
Init: $\mathbf{g}^* = \mathbf{g} = \mathbf{g}_0$, $E^* = E = Z(S, \mathbf{g}_0)$, $M = 0$;
For $r = 0, 1, \ldots, R-1$:
  $T = T_0 \gamma^r$;
  Randomly select $\mathbf{g}' \in B_{\theta_1,\theta_2}(\mathbf{g})$;
  $E' = Z(S, \mathbf{g}')$, and $P_{acc} = \exp\left(-\frac{E' - E}{T}\right)$;
  Generate a uniform random number $\lambda \in [0, 1]$;
  If $\lambda \le P_{acc}$, then $\mathbf{g} = \mathbf{g}'$, $E = E'$, $M$++;
  If $E^* > E$, then $\mathbf{g}^* = \mathbf{g}$, $E^* = E$;
Output: The best JRoG feature $\mathbf{g}^*$ and the Jump/Keep ratio $\eta = \frac{M}{R - M}$.

The choice of starting temperature $T_0$ is critical in the SA process: if $T_0$ is too high, the search becomes a random walk and hardly converges; if $T_0$ is too low, the search is likely to be trapped in a local minimum at the very beginning. Moreover, in the AdaBoost algorithm, a series of weak classifiers is learned sequentially with respect to varying training sample weights, so it is difficult to find a universal starting temperature that is appropriate for every round. The Jump/Keep ratio $\eta$, defined as the number of iterations in which the feature is changed divided by the number in which it is kept, reflects how well the temperature cooling schedule fits. Based on this observation, an adaptive temperature tuning method is introduced in the following section, which aims for a relatively stable Jump/Keep ratio of the SA in every round of classifier learning.

4.3. Collaborative Learning of JRoG Features

Formalized in Fig. 6, the collaborative learning method constructs a $k$-bit JRoG feature in $k$ iterations.
In each iteration, it grows the current feature $\mathbf{g}$ by adding a pair of granules to it, takes the grown feature as the initial seed of SA, and then searches the neighbors of the updated feature for refinement. The incremental feature selection method is employed both in the growth of the feature and in the search among its neighbors. In our experiments, we define the granule pair candidate set by $C = \{ \mathbf{g} : \dim(\mathbf{g}) = 2, d(g_0, g_1) \le 4 \}$, so that the granules of any candidate pair are close enough to each other. Similarly, the neighbors to be searched are restricted to within 2 and 4 in terms of the first and second distances respectively.

Generally speaking, increasing or decreasing the starting temperature $T_0$ will raise or lower the Jump/Keep ratio $\eta$ in the SA process. This causal relationship enables a negative feedback from $\eta$ back to $T_0$. Denote the preferred target Jump/Keep ratio by $\hat{\eta}$. Once a JRoG feature is learned and the corresponding $\eta$ is computed, a compensation function

is defined as

$$T_0' = f_T(T_0, \eta, \hat{\eta}) = \frac{\hat{\eta}}{\eta}\, T_0, \tag{9}$$

which adjusts the starting temperature for the coming round so that the resulting Jump/Keep ratio approaches the target $\hat{\eta}$. This provides a more controllable parameter, $\hat{\eta}$.

Figure 6. Collaborative Learning of a k-bit JRoG Feature.
Given: Training sample set $S$, granule pair candidate set $C = \{ \mathbf{g} : \dim(\mathbf{g}) = 2 \}$, starting temperature $T_0$, and target Jump/Keep ratio $\hat{\eta}$;
Init: Set the initial feature empty, $\mathbf{g}_0 = \emptyset$, $\dim(\mathbf{g}_0) = 0$;
For $r = 1, \ldots, k$:
  (Grow): Call the incremental feature selection module to select an optimal JRoG feature $\mathbf{g}_r$ from $C_r = \{ \mathbf{g} : \mathbf{g} = (\mathbf{g}_{r-1}, \mathbf{g}'), \mathbf{g}' \in C \}$;
  (Simulated Annealing): Call the SA module with $T_0$ to update $\mathbf{g}_r$;
  (Search Neighbors): Call the incremental feature selection module to select an optimal neighbor from $B_{2,4}(\mathbf{g}_r)$ to refine $\mathbf{g}_r$;
Temperature Tuning: $T_0' = f_T(T_0, \eta, \hat{\eta})$, where $\eta$ is the Jump/Keep ratio in the SA process.
Output: the learned JRoG feature $\mathbf{g}_k$ and the suggested starting temperature $T_0'$.

Figure 7. Flow Chart of Collaborative Learning for the AdaBoost algorithm. The two yellow blocks are incremental feature selection modules, the green one is the SA module, and the red one is the adaptive temperature tuning. This procedure repeats until the boosted strong classifier achieves the required accuracy. (Blocks in the chart: Grow g, Simulated Annealing, Refine g, temperature tuning $f_T(T_0, \eta, \hat{\eta})$, Add g into Strong Classifier; decisions: dim(g) = 2k? and Accuracy is enough?)

To sum up, on the one hand, SA is capable of escaping from local minima but hardly converges within limited time; on the other hand, incremental feature selection significantly reduces the time required to find a discriminative feature in a large enumerated feature set but is still insufficient in the enormous JRoG feature space. The collaborative learning method integrates the two complementary methods to alleviate the difficulty of searching the combinatorial feature space. An adaptive tuning method adjusts the starting temperature for the successive round of feature learning to stabilize the SA process. As illustrated in Fig. 7, the collaborative learning method serves the Real AdaBoost algorithm [13] by providing a series of discriminative JRoG features. If the SA module is removed, collaborative learning becomes similar to the heuristic learning algorithm used by Duan et al. [3]; we term this simplified version solo learning (SL). The experiment in Section 6.1 shows that collaborative learning consistently improves upon solo learning, which justifies the use of the SA module.

5. Dynamic Search for Bayesian Combination of Part Detection Results

To address partial occlusion problems in crowded scenes, Wu and Nevatia [19] partition the full human body into three parts (head-shoulder, torso and leg, shown in the upper half of Fig. 8), train three independent detectors for them respectively, and fuse the part detection results with full-body detections by means of a Bayesian combination method. Let $Z$ be the detection responses and $S$ be the state of multiple humans; the joint likelihood is formulated as

$$p(Z \mid S) = \prod_{\alpha} p(Z_\alpha \mid S_\alpha), \quad \alpha \in \{F, H, T, L\}, \tag{10}$$

where $\alpha$ is the index for full-body (F), head-shoulder (H), torso (T) and leg (L), and $Z_\alpha$ and $S_\alpha$ are the detection responses and states for part $\alpha$. Based on this MAP formulation, Wu and Nevatia adopt a naive greedy method to seek the optimal state $S^*$ that best explains the observation $Z$. They initialize the state $S$ with all available hypotheses from full-body and head-shoulder detection responses, and test these hypotheses one by one in descending Y-axis order. In each test, a human hypothesis is removed if the joint likelihood increases by taking it out of $S$.
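The greedy test just described can be sketched as below. The hypothesis representation as an `(x, y)` tuple, the function names and the toy scoring function are all hypothetical; in particular `toy_score` is only a stand-in for the joint likelihood of Eq. 10, which depends on the part detectors.

```python
import math
from typing import Callable, List, Sequence, Tuple

Hypothesis = Tuple[float, float]  # hypothetical (x, y) image position


def static_greedy(hypotheses: Sequence[Hypothesis],
                  joint_likelihood: Callable[[List[Hypothesis]], float]
                  ) -> List[Hypothesis]:
    """Test each hypothesis once, in descending y order, and drop it if that helps."""
    state = sorted(hypotheses, key=lambda h: h[1], reverse=True)
    for h in list(state):
        trial = list(state)
        trial.remove(h)
        if joint_likelihood(trial) > joint_likelihood(state):
            state = trial  # removal is final: h is never reconsidered
    return state


def toy_score(state: List[Hypothesis]) -> float:
    """Stand-in score: reward each hypothesis, penalize pairs closer than 10 px."""
    score = float(len(state))
    for i in range(len(state)):
        for j in range(i + 1, len(state)):
            if math.dist(state[i], state[j]) < 10.0:
                score -= 1.5
    return score


candidates = [(0.0, 100.0), (3.0, 98.0), (50.0, 40.0)]
print(static_greedy(candidates, toy_score))
```

In this toy run the first of the two nearly coincident hypotheses is removed simply because it is tested first, which illustrates how a fixed testing order determines the outcome.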
A key problem of this method is that each hypothesis is tested only once and the testing order is predetermined, so each error in $S$ has only one opportunity to be rectified.

We follow Wu and Nevatia's Bayesian approach of combining part detection responses, but choose a different decomposition of the human body and use a dynamic search method to obtain the optimal state. The new decomposition, defined in the lower half of Fig. 8, divides the human body into four parts: upper body, lower body, left body and right body. Compared to the three-part decomposition, dividing the full body into four parts is more suitable for humans on the periphery of a crowd, whose left or right halves are often occluded. In contrast to Wu and Nevatia's method, which starts from the full hypothesis set, our dynamic search method initializes the multiple-human state $S$ to be empty and increases the joint likelihood by iteratively adding or removing hypotheses (Fig. 9). In each round, either the best hypothesis in a candidate set generated from all part detection responses is added to the current state $S$, or an existing hypothesis is removed if that achieves a higher likelihood. Such a dynamic search process evaluates each hypothesis multiple times, which improves the robustness against part detection errors.

Figure 8. Part definition of Wu and Nevatia's approach [19] (upper half) and ours (lower half).

Figure 9. Dynamic Search for the Optimal Multiple-Human State.
Given: Part detection response set $Z$;
Init: $S \leftarrow \emptyset$, and generate candidate set $C$ from $Z$;
Loop:
  $s_a = \arg\max_{s \in C} p(Z \mid S \cup \{s\})$; $L_a = p(Z \mid S \cup \{s_a\}) - p(Z \mid S)$;
  $s_r = \arg\max_{s \in S} p(Z \mid S \setminus \{s\})$; $L_r = p(Z \mid S \setminus \{s_r\}) - p(Z \mid S)$;
  IF $L_a \ge L_r$ and $L_a > 0$, $S \cup \{s_a\} \rightarrow S$;
  ELSE IF $L_r > 0$, $S \setminus \{s_r\} \rightarrow S$;
  ELSE quit Loop;
Output: the optimal multiple-human state $S$ for $Z$.

6. Experiments

In our experiments, pedestrian training samples have a fixed size, and all granules are computed on grey-level images. Testing images are scanned at 6 scales to detect pedestrians over a range of sizes. The rest of this section is made up of five parts: the first analyzes the convergence of boosting a strong classifier by collaborative learning with different parameter settings; the second compares our approach with previous work on the popular INRIA data set [2]; the third evaluates our method on the challenging ETHZ data set [4]; the fourth presents the improvement in Bayesian part combination [19] brought by the dynamic search method; and the last discusses the computational complexity of our approach.

6.1. Collaborative Learning for Boosting a Strong Classifier

In this section, we design a cross-validation experiment to choose a proper target Jump/Keep ratio for collaborative learning. The sample set includes 20,000 positive samples and 20,000 negative ones collected from the internet, of which 70% are selected for training and the rest for testing. The initial starting temperature and the bit number of JRoG features are fixed at 0.03 and 6.
The target Jump/Keep ratio is set to 1.0, 0.5 and 0.25 respectively, denoted by CL 1.0, CL 0.5 and CL 0.25. To validate the effectiveness of adaptive temperature tuning, we remove the corresponding module (the red one in Fig. 7) from collaborative learning and keep using the same starting temperature. This setting is denoted by CL. The SA module (the green one in Fig. 7) can also be removed, and this degenerate version is termed solo learning (SL). With each setting, a strong classifier is trained by Real AdaBoost served by collaborative learning (Fig. 7). Two scores are calculated to evaluate the performance of the classifiers: Equal Error Rate (EER) and False Positive Rate (FPR) when the false negative rate is 0.01 (the latter score is even more important for the cascade detector due to its bias in favor of classifiers with a low false negative rate). The experiment is repeated 10 times; Table 1 lists the two scores of each setting and highlights the best one after 10, 20, 50 and 100 weak classifiers are learned.

Table 1. Convergence of boosted classifiers with different settings (EER and FPR of the four CL settings and SL, by number of weak classifiers).

In this comparison, CL 1.0 performs best when 10 weak classifiers are learned; CL 0.5 takes this position afterward. On the one hand, collaborative learning is not very sensitive to the target Jump/Keep ratio, since CL 1.0, CL 0.5 and CL 0.25 have relatively close EER and FPR scores; all of them outperform SL, which justifies the use of SA in collaborative learning. On the other hand, the ranking of CL varies as the number of weak classifiers increases, indicating that removing the negative feedback provided by the adaptive temperature tuning may decrease the stability of the SA module.
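The two scores can be computed from classifier outputs along the following lines. This is a sketch under assumptions: a sample is predicted positive when its score is at least the threshold, thresholds are taken from the observed scores, and the EER is approximated at the threshold where the two error rates are closest. The text does not state how the scores were computed, and the function names are illustrative.

```python
from typing import Sequence, Tuple


def error_rates(scores: Sequence[float], labels: Sequence[int],
                threshold: float) -> Tuple[float, float]:
    """Return (false positive rate, false negative rate) at a threshold."""
    positives = sum(1 for y in labels if y > 0)
    negatives = len(labels) - positives
    fp = sum(1 for s, y in zip(scores, labels) if y <= 0 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y > 0 and s < threshold)
    return fp / negatives, fn / positives


def equal_error_rate(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Approximate EER: average of FPR and FNR where they are closest."""
    best_gap, best_rate = None, None
    for t in sorted(set(scores)):
        fpr, fnr = error_rates(scores, labels, t)
        gap = abs(fpr - fnr)
        if best_gap is None or gap < best_gap:
            best_gap, best_rate = gap, (fpr + fnr) / 2.0
    return best_rate


def fpr_at_fnr(scores: Sequence[float], labels: Sequence[int],
               max_fnr: float) -> float:
    """Lowest FPR among thresholds whose FNR does not exceed max_fnr."""
    rates = [error_rates(scores, labels, t) for t in sorted(set(scores))]
    # The lowest score as threshold always gives FNR = 0, so this is never empty.
    return min(fpr for fpr, fnr in rates if fnr <= max_fnr)


scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 1, -1, -1, -1, -1]
print(equal_error_rate(scores, labels), fpr_at_fnr(scores, labels, 0.01))
```

Fixing a low false negative rate and reporting the FPR mirrors how a cascade layer is tuned: each layer must pass nearly all positives, so what distinguishes classifiers is how many negatives they reject at that operating point.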
Consequently, we choose 0.5 as the target Jump/Keep ratio for the next set of experiments.

6.2. INRIA Data Set

The INRIA data set [2] has become a standard to compare results on; it contains 2,478 positive samples and 1,218 negative images for training, and ,28 positive samples and 453 negative images for testing. We generate positive training samples by slightly rotating and scaling the original ones, and train a 16-layer cascade detector by collecting false alarms in negative training images. We empirically set the bit number of JRoG features to 6 for the first three layers, 5 for the next 6 layers and 4 for the rest. The learned detector contains 2533 weak classifiers and 2772 granules. Fig. 10 shows ROC curves of state-of-the-art methods and ours on the INRIA testing set, in which our method is among the best, especially in the region between False Positive Per Window (FPPW) $10^{-3}$ and $10^{-5}$. Notably, owing to the collaborative learning method, our approach is still comparable to Duan et al.'s, although their APCF features additionally use gradient information. Some detection results on un-cropped INRIA testing images are shown in Fig. 11.

Figure 10. ROC curves of different methods on multiple data sets. Panels: INRIA (miss rate vs. false positives per window; Dalal et al. [2], Tuzel et al. [15], Wu and Nevatia [22], Schwartz et al. [14], Duan et al. [3], ours); ETHZ SEQ 1, SEQ 2 and SEQ 3 (detection rate vs. False Positive Per Image (FPPI); Ess et al. [4], Wu and Nevatia [23], Schwartz et al. [14], ours); i-LIDS (detection rate vs. FPPI; Wu's Full Body [20], Wu's Static Combination [20], Our Full Body, Our Static Combination, Our Dynamic Combination).

6.3. ETHZ Data Set

The ETHZ data set [4] includes four videos (one for training and three for testing) captured on a moving platform in very cluttered environments. To cope with this challenging data set, we collected about 23,000 negative images and labeled more than 20,000 pedestrians from the internet, and trained another 16-layer cascade detector on these training data, which has the same number of weak classifiers and granules as the one for the INRIA test. Only images from the left camera are used for testing.
The first sequence contains 999 frames with 5,193 humans; the second contains 450 frames with 2,359 humans; the third contains 354 frames with 1,828 humans. These sequences are processed frame by frame, without using any temporal information. To compare with the method of Ess et al. [4], which uses scene knowledge, we use the simple ground plane estimation method of Wu et al. [22] to facilitate detection. The method of Schwartz et al. [14] is also included in this experiment. Following the same evaluation metric used in [4], we obtain the ROC curves of our method shown in Fig. 10. Our method outperforms the others in all three videos: compared to the second best method, it increases the detection rate by 9%, 6% and 6% respectively when False Positive Per Frame (FPPF) is 2. Fig. 11 gives some detection results of our method on this challenging data set.

6.4. i-LIDS Data Set

The i-LIDS data set [1] features a busy subway station where people are frequently occluded by each other. We selected 257 frames from this data set and annotated 33 pedestrians in them. Four part detectors, as defined in Fig. 8, are learned from this training set. One sample size is used for the upper body and lower body detectors, and another for the left body and right body detectors. We compare our method with Wu and Nevatia's Boosted Edgelet approach [19]. Here, the combination method proposed in [19], which makes sequential tests, is denoted by "static combination"; the dynamic search method of this paper is denoted by "dynamic combination". For a fair comparison, we implemented the static combination method and applied it to our part detection results. As shown in Fig. 10, static combination is superior to the full body detection results. Fig. 11 shows some differences between the combination results of the two methods: dynamic combination successfully removes a false alarm (the red arrow) and recovers a missed detection (the yellow arrow).
Besides, the collaborative learning of JRoG features significantly improves the detection accuracy compared to [19], reducing missed detections by half at the same FPPF in both full body detection and combined results.

6.5. Computational Complexity

Given the granular space, computing a $k$-bit JRoG feature requires only $2k$ memory accesses and $k$ subtractions. On a Xeon 3 GHz computer, the detector learned for ETHZ testing can process about one million scanning windows per second on a single processor. With the simple ground plane estimation, this detector takes only 70 ms to scan an ETHZ test image at 6 scales from 1.0 to 0.25; this includes the time spent in computing the granular space. This performance may be adequate for many real-time processing systems and can be scaled up by use of multiple processors or GPUs. The training of the 16-layer cascade takes about two days on the same computer.

7. Conclusion

We described a novel collaborative learning method for JRoG features and a dynamic search method for Bayesian combination of part detection results. This approach achieves considerable improvements in both detection accuracy and computational efficiency on challenging real-life pedestrian detection problems. The collaborative learning is a general learning method, which can adapt to other combinatorial features and be used in the detection of other objects such as faces and cars.

Figure 11. Detection results of our method on the INRIA, ETHZ (SEQ 1, SEQ 2, SEQ 3) and i-LIDS data sets. The first row of the i-LIDS block shows the static combination results, and the second and third rows show the dynamic combination results. The red arrow points at a false alarm given by the static combination, and the yellow arrow points at a detection given by the dynamic combination which is missed by the static combination.

References

[1] d.html.
[2] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. CVPR.
[3] G. Duan, C. Huang, H. Ai, and S. Lao. Boosting associated pairing comparison features for pedestrian detection. Ninth IEEE International Workshop on Visual Surveillance.
[4] A. Ess, B. Leibe, and L. V. Gool. Depth and appearance for mobile scene analysis. ICCV.
[5] P. Felzenszwalb. Learning models for object recognition. CVPR, 2001.
[6] D. Gavrila. Pedestrian detection from a moving vehicle. ECCV.
[7] C. Huang, H. Ai, Y. Li, and S. Lao. Learning sparse features in granular space for multi-view face detection. Proc. Seventh Intl Conf. Automatic Face and Gesture Recognition.
[8] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by simulated annealing. Science, 1983.
[9] B. Leibe, E. Seemann, and B. Schiele. Pedestrian detection in crowded scenes. CVPR.
[10] K. Mikolajczyk, C. Schmid, and A. Zisserman. Human detection based on a probabilistic assembly of robust part detectors. ECCV.
[11] A. Mohan, C. Papageorgiou, and T. Poggio. Example-based object detection in images by components. PAMI, 2001.
[12] C. Papageorgiou, T. Evgeniou, and T. Poggio. A trainable pedestrian detection system. In Proceeding of Intelligent Vehicles, 1998.
[13] R. E. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 1999.
[14] W. R. Schwartz, A. Kembhavi, D. Harwood, and L. S. Davis. Human detection using partial least squares analysis. ICCV.
[15] O. Tuzel, F. Porikli, and P. Meer. Human detection via classification on Riemannian manifolds. CVPR.
[16] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. CVPR, 2001.
[17] P. Viola, M. Jones, and D. Snow. Detecting pedestrians using patterns of motion and appearance. ICCV.
[18] X. Wang, T. X. Han, and S. Yan. An HOG-LBP human detector with partial occlusion handling. ICCV.
[19] B. Wu and R. Nevatia. Detection of multiple, partially occluded humans in a single image by Bayesian combination of edgelet part detectors. ICCV.
[20] B. Wu and R. Nevatia. Cluster boosted tree classifier for multi-view, multi-pose object detection. ICCV.
[21] B. Wu and R. Nevatia. Optimizing discrimination-efficiency tradeoff in integrating heterogeneous local features for object detection. CVPR.
[22] B. Wu, R. Nevatia, and Y. Li. Segmentation of multiple, partially occluded objects by grouping, merging, assigning part detection responses. CVPR.
