Mobile Human Detection Systems Based on the Sliding Window Approach: A Review

Seminar: Mobile Human Detection Systems
Njieutcheu Tassi Cedrique Rovile
Department of Computer Engineering, University of Heidelberg, Germany, 68131 Mannheim
Email: njieutcheu@stud.uni-heidelberg.de

Abstract
This paper presents a review of human detection based on the sliding window technique. The image data generated by a fixed camera mounted on a mobile agent is densely sampled, generating a large number of detection windows, which are separately presented to a linear Support Vector Machine (SVM) for classification. The classifier input vectors are Histogram of Oriented Gradients (HOG) features extracted from each detection window and used as human descriptor. To reduce the high computational cost of processing a large number of detection windows, magnitude and entropy filters are employed to discard unlikely windows, that is, parts of the image with no information relevant to the human detection task. The reviewed experiments show that a large number of detection windows can be discarded by the proposed filters. However, a slight reduction in recall is observed, because some of the remaining windows cover only parts of a person, which diminishes the accuracy.

Keywords
Sliding window; HOG features; SVM classifier; magnitude and entropy filters.

I. INTRODUCTION
In recent years, the large number of available cameras has generated a huge amount of data, and the applications processing this data are required to understand the scene. A useful question to ask is whether the generated image data contains one or more instances of a certain object: a car, a dog, a person, and so forth. Algorithms that answer this question for humans are called human detectors and are crucial for diverse application areas, including pedestrian detection, video surveillance and monitoring, person tracking, action and activity recognition, person re-identification and human-machine interaction.
Therefore, human detection has become one of the most active and attractive research topics in the area of computer vision and pattern recognition. In this study, we focus on detecting humans and do not consider the recognition of their activities. This paper is organized as follows: In section II, we formulate the human detection problem and present our solution approach. We briefly discuss related work in section III and give a detailed description of each stage of the detection process chain in section IV. Experimental results are evaluated in section V, and in section VI we summarize the main conclusions.

II. PROBLEM FORMULATION AND SOLUTION APPROACH

A. Problem formulation
Detecting humans in images is one of the important challenges in computer vision. This is due to factors such as the large variation in appearance, changes in illumination, the low quality of the acquired data and the different sizes of humans in the image. In addition, human bodies are non-rigid and highly articulated, which implies that we have to deal with different poses and postures. Furthermore, it is not possible to take advantage of specific texture and colour information, because of the variability of the clothes worn. Despite all these challenges, low computational cost, a high detection rate and reliable detection are needed to fulfill the requirements of most applications.

B. Solution approach
Here, an overview of our single detection window method is given. A flow chart of the human detection process is illustrated in Figure 1. The idea is to generate a set of images (S) with different resolutions by down sampling. All generated images are then densely scanned with a sliding window, ensuring that all humans are covered. The scanning stage provides a set of windows (W) containing a large number of detection windows. In order to reduce the number of detection windows, a filtering stage based on a magnitude or entropy filter is applied.
This stage reduces the search space and keeps only potential Regions Of Interest (ROI) to be presented to the classifier. The output of the filtering stage is a set of selected windows (M), with |M| << |W|. Each selected window is classified separately. To this end, a Histogram of Oriented Gradient (HOG) features vector (V) is extracted from each window and passed to the linear Support Vector Machine (SVM) as input. The linear SVM classifies each window as belonging to the human or the non-human class, which are the members of the set Y.
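The scanning stage described above can be illustrated with a minimal sketch that counts how many detection windows a scale pyramid produces. The function names, the stride of 8 pixels and the reading of the 128x64 window as 128 pixels high and 64 pixels wide are assumptions made for illustration; they are not taken from the reviewed papers.

```python
def pyramid_sizes(width, height, k, win_w=64, win_h=128):
    """Yield (width, height) of each pyramid level.

    Every level is the original size divided by k**level. The pyramid stops
    as soon as the detection window no longer fits into the level.
    """
    if k <= 1.0:
        raise ValueError("scale factor k must be greater than 1")
    level = 0
    while True:
        w = int(width / k ** level)
        h = int(height / k ** level)
        if w < win_w or h < win_h:
            break
        yield w, h
        level += 1


def count_windows(width, height, k, stride=8, win_w=64, win_h=128):
    """Count all sliding window positions over every pyramid level.

    The stride (in pixels) is a hypothetical parameter of this sketch.
    """
    total = 0
    for w, h in pyramid_sizes(width, height, k, win_w, win_h):
        cols = (w - win_w) // stride + 1
        rows = (h - win_h) // stride + 1
        total += cols * rows
    return total


if __name__ == "__main__":
    # Smaller scale factors give more pyramid levels and therefore more windows.
    for k in (1.15, 1.05, 1.01):
        print(k, count_windows(640, 480, k))
```

Running the sketch for a 640x480 image shows how the size of the set W grows as k approaches 1, which is what motivates the filtering stage.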
Figure 1: Overview of proposed methods.

III. RELATED WORK
Several human detection approaches have been proposed in the past years to address the problems mentioned above. The key purpose of this paper is to provide a review of studies conducted in the area of human detection based on sliding windows. The first need is a feature set that allows the human form to be discriminated cleanly. Therefore, the Histogram of Oriented Gradient (HOG) feature presented by Dalal and Triggs [1] is used as human descriptor. Methods that reduce the computational cost are also desirable. One way of achieving this is to apply a filtering stage before the features extraction stage, as proposed by Artur et al. [2]. This study is based on these two papers.

IV. METHODS
This section gives details of our methods and highlights the need for each of them.

A. Downsampling
In order to remain scale-invariant, human detection algorithms rescale the input image frame. Applying this technique allows us to deal with the different heights of humans in the image data, which result from their distance to the camera. By decreasing the sampling rate with a fixed scale factor k, the number of samples that represent the original signal, and thus the size of the input image frame, is reduced. However, when a signal is down sampled, its high-frequency portion is aliased onto the low-frequency portion. To avoid this, the original image needs to be preprocessed with an anti-aliasing (low-pass) filter that removes the high-frequency portion, so that aliasing does not occur. The process of image down sampling is illustrated in Figure 2, where I(m,n) denotes the input image data matrix, f(m,n) the preprocessed image data matrix, d(m,n) the down sampled image data matrix, m the number of rows and n the number of columns of the image data matrix. Keep in mind that d(m,n) is smaller than I(m,n).
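The two steps of Figure 2 (low-pass filtering, then sample reduction) can be sketched as follows. This is a simplified illustration under stated assumptions: the scale factor is an integer k >= 2, whereas the reviewed experiments use non-integer factors such as 1.15, and a k x k box average stands in for the low-pass filter. The names `downsample` and `build_pyramid` are hypothetical.

```python
def downsample(image, k):
    """Average each k x k block (box low-pass filter) and keep one sample per block.

    `image` is a list of rows of grey values. Rows and columns that do not fill
    a complete block are dropped.
    """
    if k < 2:
        raise ValueError("this sketch assumes an integer scale factor k >= 2")
    rows = len(image) // k
    cols = len(image[0]) // k
    out = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            block = [image[r * k + i][c * k + j]
                     for i in range(k) for j in range(k)]
            out_row.append(sum(block) / (k * k))
        out.append(out_row)
    return out


def build_pyramid(image, k, min_rows, min_cols):
    """Down sample repeatedly while the next level is at least min_rows x min_cols."""
    levels = [image]
    while (len(levels[-1]) // k >= min_rows
           and len(levels[-1][0]) // k >= min_cols):
        levels.append(downsample(levels[-1], k))
    return levels


if __name__ == "__main__":
    # A checkerboard is the highest frequency an image can hold. Keeping every
    # second pixel without filtering would return a constant image of zeros;
    # averaging first returns the mean grey value instead.
    checker = [[255 if (r + c) % 2 else 0 for c in range(4)] for r in range(4)]
    print(downsample(checker, 2))
```

The checkerboard case shows why the filter comes first: the averaged output reflects the mean intensity of the pattern rather than an arbitrary subset of its pixels.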
Successively down sampling the input image, and using each output as the input for the next image of lower resolution, yields a scale pyramid. This pyramid forms the set of images that is segmented in the next step, as explained below.

B. Image segmentation
The first step of any detector based on sliding windows consists of generating a set of detection windows with the sliding window approach, which is widely used in object recognition tasks. The sliding window searches for humans at all scales by sampling every image of the scale pyramid with a moving window of variable or fixed size, according to the requirements of the application under consideration. For the human detection task, a moving window of fixed size (128x64 pixels) is used to densely scan the image, as proposed by Dalal and Triggs [1], ensuring that all humans are covered. As these windows are generated over a wide range of scales and strides, the result is a set of overlapping, redundant windows, which highlights the need for the next method.

C. Detection windows Filtering
In order to reduce the amount of data processed by the human detection system, a filtering stage is applied. Here, candidate windows are presented to an optional filter, which selects a subset of the generated detection windows to be presented to the classifier and discards the remaining windows. No features extraction is performed on the discarded windows, which reduces the computational cost. Two filtering approaches are used in our evaluation: the entropy and the magnitude filter [2].

1) Entropy filter
The main idea behind this filter is to extract a histogram of gradient orientations from each detection window. Windows whose histogram has low entropy are rejected, and those with high entropy are selected for further processing. The flow diagram of this filter is illustrated in Figure 3.
The threshold value is set experimentally.

Figure 2: Image down sampling flow.
Figure 3: Flow diagram of the entropy filter.
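A minimal sketch of the entropy filter described above is given below. It assumes that the orientation histogram is weighted by gradient magnitude, uses 9 unsigned orientation bins and measures entropy in bits; the reviewed text does not specify these details, and the function names and the threshold are hypothetical.

```python
import math


def orientation_histogram(window, bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations (0 to 180 degrees).

    `window` is a list of rows of grey values. Border pixels are skipped because
    central differences need both neighbours.
    """
    hist = [0.0] * bins
    width = 180.0 / bins
    for r in range(1, len(window) - 1):
        for c in range(1, len(window[0]) - 1):
            gx = window[r][c + 1] - window[r][c - 1]
            gy = window[r - 1][c] - window[r + 1][c]
            mag = math.hypot(gx, gy)
            if mag == 0.0:
                continue  # flat pixels carry no orientation
            theta = math.degrees(math.atan2(gy, gx)) % 180.0
            # min() guards against theta rounding up to exactly 180.0
            hist[min(int(theta / width), bins - 1)] += mag
    return hist


def entropy(hist):
    """Shannon entropy (in bits) of a histogram with non-negative entries."""
    total = sum(hist)
    if total == 0:
        return 0.0
    return -sum((h / total) * math.log2(h / total) for h in hist if h > 0)


def keep_window(window, threshold, bins=9):
    """Select a window only if its orientation histogram has enough entropy."""
    return entropy(orientation_histogram(window, bins)) >= threshold
```

A uniform window or a window with a single straight edge concentrates its votes in at most one bin and yields an entropy of zero, so it is rejected for any positive threshold.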
2) Magnitude Filter
This filter computes the average gradient magnitude within a detection window and uses it as the cue for selection, as illustrated in Figure 4. The threshold value of this filter differs from that of the entropy filter and is also set experimentally. Because the gradient magnitude is also needed to build the Histogram of Oriented Gradient (HOG) features, it can be reused in the features extraction stage, so this filter adds no extra computational cost for the selected windows.

Figure 4: Flow diagram of the magnitude filter.

D. Features Extraction
Before the selected detection windows are presented to a classifier, the first need is a feature set that allows the human form to be discriminated cleanly, even in cluttered backgrounds and under different illumination. A robust feature makes the classifier's job as easy as possible. Therefore, the Histogram of Oriented Gradient (HOG) is used as human descriptor. Details of the HOG extraction step and the effects of parameter choices on detector performance are covered by Dalal and Triggs [1]. Further details on the computation of each step of the feature processing chain can be found in the notes by Tomasi [3]. The following description of the features extraction chain uses parameter values from the works cited above, keeping in mind that different imaging situations may need different parameter values. The input is assumed to be a window I from the set of selected detection windows (M).

1) Gradient computation
Detector performance is sensitive to the way in which gradients are computed. The gradient components are given by

$I_X(R,C) = I(R,C+1) - I(R,C-1)$ (1)
$I_Y(R,C) = I(R-1,C) - I(R+1,C)$ (2)

The gradient is then transformed to polar coordinates, with the angle constrained between 0 and 180 degrees. The magnitude µ and positive orientation θ are obtained as stated in equations (3) and (4), where $\tan_2^{-1}$ is the four-quadrant inverse tangent, which yields values between $-\pi$ and $+\pi$.

$\mu = \sqrt{I_X^2 + I_Y^2}$ (3)
$\theta = \frac{180}{\pi} \left( \tan_2^{-1}(I_Y, I_X) \bmod \pi \right)$ (4)
Here, we approximate the two components I_X and I_Y of the gradient at pixel I(R,C) of the sub-image represented by a selected detection window I by central differences, as stated in equations (1) and (2), where R denotes the row index and C the column index of the corresponding pixel in the image data matrix. For colour images, we calculate separate gradients for each of the three colour channels and take the one with the largest magnitude as the pixel's gradient, together with its orientation.

2) Cell orientation histograms
This step is the fundamental nonlinearity of the descriptor. The gradient image is divided into adjacent, non-overlapping cells of C x C pixels (C = 8). In each cell, a histogram of gradient orientations is computed, in which every pixel of the cell casts a vote weighted by the magnitude of the gradient element centred on it. The votes are accumulated into B orientation bins (B = 9), which are evenly spaced over 0 to 180 degrees (unsigned gradient). The vote is a function of the gradient magnitude and represents the soft presence or absence of an edge at the pixel. To reduce aliasing, votes are interpolated bilinearly between the neighbouring bin centres in both orientation and position. Specifically, the bins are numbered 0 through B-1 and have width w = 180/B. The bin with index i has boundaries [w i, w (i+1)) and centre c_i = w (i + 1/2). A pixel with magnitude µ and orientation θ contributes a vote v_j to the bin with index j, as stated in equations (5) and (6). The remaining share of the magnitude, µ - v_j, goes to the neighbouring bin (j+1) mod B. The resulting cell histogram is a vector of B non-negative elements.

$v_j = \mu \, \frac{c_{j+1} - \theta}{w}$ (5)
$j = \left\lfloor \frac{\theta}{w} - \frac{1}{2} \right\rfloor \bmod B$ (6)

3) Block normalization
Because of local variations in illumination and foreground-background contrast, cells are grouped into overlapping blocks of C x C cells (here C = 2), and each block is contrast-normalized separately. This reduces the effect of changes in contrast between images of the same object, while preserving some of the information carried by the gradient magnitude in cells within the same block.
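The gradient and voting rules of equations (1) to (6) can be sketched as follows. This is an illustration only: it interpolates votes between orientation bins but omits the spatial interpolation between cells, it handles grey images only, and all function names are hypothetical. The bin index is kept unwrapped until the very end, so that the centre c_{j+1} is evaluated correctly for orientations below the first bin centre.

```python
import math


def polar_gradient(img, r, c):
    """Return (mu, theta) at pixel (r, c); theta is in degrees within [0, 180]."""
    gx = img[r][c + 1] - img[r][c - 1]                   # equation (1)
    gy = img[r - 1][c] - img[r + 1][c]                   # equation (2)
    mu = math.hypot(gx, gy)                              # equation (3)
    theta = math.degrees(math.atan2(gy, gx)) % 180.0     # equation (4)
    return mu, theta


def vote(mu, theta, bins=9):
    """Split mu between the two orientation bins whose centres enclose theta.

    Returns ((j, v_j), (j_next, v_next)) with v_j + v_next == mu.
    """
    w = 180.0 / bins
    j0 = math.floor(theta / w - 0.5)    # unwrapped index, can be -1
    c_next = w * (j0 + 1.5)             # centre of bin j0 + 1
    v_j = mu * (c_next - theta) / w     # equation (5)
    return (j0 % bins, v_j), ((j0 + 1) % bins, mu - v_j)   # wrap as in equation (6)


def cell_histogram(img, top, left, cell=8, bins=9):
    """Accumulate the votes of one cell.

    The cell must lie at least one pixel inside the image border, because the
    central differences read the four neighbours of every pixel.
    """
    hist = [0.0] * bins
    for r in range(top, top + cell):
        for c in range(left, left + cell):
            mu, theta = polar_gradient(img, r, c)
            (j, v_j), (j_next, v_next) = vote(mu, theta, bins)
            hist[j] += v_j
            hist[j_next] += v_next
    return hist
```

Because every pixel distributes exactly its magnitude over two bins, the sum of a cell histogram equals the sum of the gradient magnitudes in that cell.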
To normalize a block, its four cell histograms are concatenated into a single block feature b, which is normalized by its Euclidean norm as stated in equation (7), where ɛ is a small positive constant that prevents division by zero.

$b \leftarrow \frac{b}{\sqrt{\lVert b \rVert^2 + \epsilon}}$ (7)

The final features vector is then the vector of all components of the normalized cell responses from all of the blocks within a selected detection window. The HOG features vector h can also be normalized according to equation (8), in order to make the features vector independent of the overall sub-image contrast.

$h \leftarrow \frac{h}{\sqrt{\lVert h \rVert^2 + \epsilon}}$ (8)

E. Features Classification
After the features extraction stage, a linear Support Vector Machine (SVM) is used to classify the features vector of each selected detection window. The classification problem is formulated as finding the mapping of each features vector to the human or the non-human class. The linear SVM algorithm learns by example to assign labels to features vectors. It was therefore trained offline, in a supervised manner, with data from the INRIA person dataset.

V. EXPERIMENTAL EVALUATION
In this section, we evaluate the effect of the scaling factor on detector performance. In addition, we evaluate several aspects of the proposed filters and present their results. The plots discussed below are based on the experiments performed by Dalal and Triggs [1] and Artur et al. [2].

A. Scaling factor evaluation
In their first experiment, Artur et al. [2] evaluate the impact of the scaling factor on the number of detection windows generated, as well as on the miss rate obtained by the detector. As shown in Figure 5, the number of detection windows grows quickly as the scale factor decreases. For instance, decreasing the scale factor k from 1.15 to 1.01 increases the number of detection windows about fifteenfold. At the same time, decreasing the scaling factor also decreases the miss rate achieved by the detector at 10^0 False Positives Per Image (FPPI), as presented in Figure 6.

Figure 5: Tradeoff between scale factor and number of windows generated for a 640x480 image.
This result indicates that denser sampling yields a lower miss rate, at the price of a large number of generated detection windows. Using a small scaling factor therefore calls for a filtering stage that discards a large share of the generated detection windows and reduces the computational cost of the features extraction stage.

Figure 6: Miss-rate of detector at 10^0 False Positives Per Image (FPPI) with different scale factor.

B. Windows filtering evaluation
In their experiment, Artur et al. [2] also evaluate the results achieved by the entropy and magnitude filters, using a scale factor k = 1.15 and assuming that an ideal detector is used after the filtering stage. Both filters were able to reject nearly 30% of the generated detection windows while preserving approximately the same recall as obtained without window rejection, as shown in Figure 8. As presented in Figure 7, when the entropy filter is used and the percentage of discarded windows is increased from around 30% to approximately 50%, the miss rate achieved by the detector at 10^0 False Positives Per Image (FPPI) increases. With the magnitude filter, however, the miss rate remains nearly constant. The analysis of Figure 7 and Figure 8 shows that the best result is achieved with the magnitude filter, which is able to discard about 54% of the detection windows with only a slight increase in miss rate at 10^0 FPPI and a slight reduction in recall.

Figure 7: Miss-rate at 10^0 False Positives Per Image (FPPI) by applying a filter on the detector.
Figure 8: Relationship between rejection percentage and recall achieved by filters.

C. Detectors performance evaluation
We would like to compare the performance of the detector with and without a filtering stage. To this end, the miss rates at 10^0 False Positives Per Image (FPPI) obtained by Artur et al. [2] using a magnitude or entropy filter are presented in Figure 9. The number of detection windows discarded by the magnitude or entropy filter was not specified, which makes an evaluation of the detector performance difficult. In Figure 10, the detector performance obtained by Dalal and Triggs [1] without a filtering stage is presented. Because different metrics were used to plot the performance of the two detectors, a direct comparison between them is hard. Nevertheless, the detector with a filtering stage performed poorly when evaluating the selected detection windows, due to the random nature of the filter. Some selected windows may be slightly displaced from a person's body, which needs to be fixed before presenting them to the linear Support Vector Machine (SVM) classifier.

Figure 9: Performance obtained by detector using magnitude or entropy filter.
Figure 10: Performance obtained by detector without filter.
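The quantities used in this evaluation can be written down compactly. The sketch below uses the standard definitions of recall, miss rate and FPPI, together with the rejection percentage of a filter; the function names are hypothetical and the numbers in the usage example are made up for illustration, not taken from the reviewed experiments.

```python
def recall(true_positives, false_negatives):
    """Fraction of annotated persons that the detector found."""
    total = true_positives + false_negatives
    if total == 0:
        raise ValueError("no annotated persons to evaluate against")
    return true_positives / total


def miss_rate(true_positives, false_negatives):
    """Fraction of annotated persons that the detector missed (1 - recall)."""
    return 1.0 - recall(true_positives, false_negatives)


def false_positives_per_image(false_positives, n_images):
    """Average number of false detections per image (FPPI)."""
    if n_images <= 0:
        raise ValueError("n_images must be positive")
    return false_positives / n_images


def rejection_percentage(n_generated, n_selected):
    """Percentage of generated detection windows discarded by a filter."""
    if n_generated <= 0:
        raise ValueError("n_generated must be positive")
    return 100.0 * (n_generated - n_selected) / n_generated


if __name__ == "__main__":
    # Illustrative numbers only.
    print(miss_rate(90, 10))
    print(false_positives_per_image(50, 50))
    print(rejection_percentage(1000, 460))
```

An operating point of 10^0 FPPI therefore corresponds to one false detection per image on average, and the miss rate reported at that point is the share of persons still undetected.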
VI. CONCLUSION AND FUTURE WORK
This study presented a review of mobile human detection systems based on the sliding window approach, in which the image data captured by a fixed camera mounted on a mobile agent is densely scanned at all scales and locations in order to cover all humans in the image. As the reviewed experiments show, using a small scaling factor improves detector performance but increases the number of detection windows, which in turn increases the computational cost of the Histogram of Oriented Gradient (HOG) features extraction process. Hence, a filtering stage based on a magnitude or entropy filter is used to reduce the number of generated windows, which reduces the computational cost at the price of a slight reduction in recall. A quantitative analysis of the number of windows rejected by both filters, and of its influence on recall, showed that the magnitude filter performs better than the entropy filter. Compared to a detector without a filtering stage, the experimental evaluation showed that the detector with a filtering stage performs poorly, since some positive selected windows were wrongly classified by the linear Support Vector Machine (SVM) because of a slight displacement of the selected windows from the person's body. As future work, we intend to evaluate the detector performance when the detection window is adjusted to the person's location. We also intend to apply a filtering stage based on motion detection and to employ the proposed filters after classification, in order to remove possible false positives generated by the linear SVM.

References
[1] N. Dalal and B. Triggs, "Histograms of Oriented Gradients for Human Detection," INRIA Rhône-Alpes, 655 avenue de l'Europe, Montbonnot 38334, France, 2005.
[2] A. J. L. Correia, V. H. C. de Melo and W. R. Schwartz, "A Study of Filtering Approaches for Sliding Window Pedestrian Detection," Universidade Federal de Minas Gerais, Belo Horizonte, Brazil, 2015.
[3] C. Tomasi, "Histograms of Oriented Gradients."