Supplementary material: Strengthening the Effectiveness of Pedestrian Detection with Spatially Pooled Features

Sakrapee Paisitkriangkrai, Chunhua Shen, Anton van den Hengel
The University of Adelaide, Australia
June 2014

In this supplementary document we provide complete details of the structural SVM problem presented in the main paper. We also report the detection performance obtained with different feature parameters and analyze the selected features.

1 Optimizing the partial area under the ROC curve

Notation. Vectors are denoted by lower-case bold letters, e.g., $x$; matrices by upper-case bold letters, e.g., $X$; and sets by calligraphic upper-case letters, e.g., $\mathcal{X}$. All vectors are assumed to be column vectors. The $(i, j)$ entry of $X$ is $x_{ij}$. Let $\{x_i^+\}_{i=1}^{m}$ be a set of pedestrian training examples and $\{x_j^-\}_{j=1}^{n}$ be a set of non-pedestrian training examples. The tuple of all training samples is written as $S = (S^+, S^-)$, where $S^+ = (x_1^+, \dots, x_m^+) \in \mathcal{X}^m$ and $S^- = (x_1^-, \dots, x_n^-) \in \mathcal{X}^n$. In this paper we are interested in the partial AUC (area under the ROC curve) within a specific false positive range $[\alpha, \beta]$. Given $n$ negative training samples, we let $j_\alpha = \lfloor n\alpha \rfloor$ and $j_\beta = \lceil n\beta \rceil$. Let $Z_\beta = \binom{S^-}{j_\beta}$ denote the set of all subsets of negative training instances of size $j_\beta$. We define $\zeta = \{x_{k_j}^-\}_{j=1}^{j_\beta} \in Z_\beta$ as a given subset of negative instances, where $k = [k_1, \dots, k_{j_\beta}]$ is a vector indicating which elements of $S^-$ are included. Our goal is to learn a scoring function $f(x) = w^\top x$ such that the final detector achieves the optimal area under the ROC curve in the false positive rate (FPR) range $[\alpha, \beta]$.

Approach. The pAUC risk of a scoring function $f(\cdot)$ between two pre-specified FPRs $[\alpha, \beta]$ can be defined [5] as
$$\hat{R}_\zeta(f) = \sum_{i=1}^{m} \sum_{j=j_\alpha+1}^{j_\beta} \mathbf{1}\big( f(x_i^+) < f(x^-_{(j)_{f,\zeta}}) \big). \quad (1)$$
Here $x_i^+$ denotes the $i$-th positive training instance and $x^-_{(j)_{f,\zeta}}$ denotes the $j$-th negative training instance in the set $\zeta \in Z_\beta$ when sorted by $f$.
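As a concrete illustration of (1), the empirical pAUC risk can be computed directly from detector scores. The sketch below is our own, not the paper's code: it works with scores $f(x)$ rather than the weak-classifier output vectors, and takes $\zeta$ to be the top-ranked negatives of the full negative set.

```python
import numpy as np

def pauc_risk(scores_pos, scores_neg, alpha, beta):
    """Empirical pAUC risk of Eq. (1): the number of (positive, negative)
    pairs in which a positive sample scores below a negative sample whose
    rank falls inside the false positive range [alpha, beta]."""
    n = len(scores_neg)
    j_alpha, j_beta = int(np.floor(n * alpha)), int(np.ceil(n * beta))
    # Sort negatives by descending score; ranks j_alpha+1 .. j_beta
    # (1-based) are the negatives inside the prescribed FPR range.
    ranked_neg = np.sort(scores_neg)[::-1][j_alpha:j_beta]
    # Indicator 1(f(x_i^+) < f(x_(j)^-)), summed over all pairs (i, j).
    return int((np.asarray(scores_pos)[:, None] < ranked_neg[None, :]).sum())
```

For a perfectly ranked sample, where every pedestrian scores above every non-pedestrian in the range, the risk is zero.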
Both $x_i^+$ and $x^-_{(j)_{f,\zeta}}$ represent output vectors of the weak classifiers learned by AdaBoost. Clearly, (1) is minimal when all positive samples $\{x_i^+\}_{i=1}^{m}$ are ranked above $\{x^-_{(j)_{f,\zeta}}\}_{j=j_\alpha+1}^{j_\beta}$, the negative samples in our prescribed false positive range $[\alpha, \beta]$ (in this case, the log-average miss rate would be zero). The structural SVM framework can be adopted to optimize the pAUC risk by considering a classification problem over all $m \cdot j_\beta$ pairs of positive and negative samples. We define a new label matrix $Y \in \mathcal{Y}_{m,j_\beta} = \{0,1\}^{m \times j_\beta}$ whose value for the pair $(i, j)$ is defined as
$$y_{ij} = \begin{cases} 0 & \text{if } f(x_i^+) > f(x_j^-), \\ 1 & \text{otherwise}. \end{cases} \quad (2)$$
The true pair-wise label matrix is defined as $Y^*$, where $y_{ij}^* = 0$ for all pairs $(i, j)$. The pAUC loss of $Y$ with respect to $Y^*$ can then be written as
$$\Delta(Y^*, Y) = \frac{1}{mn(\beta - \alpha)} \sum_{i=1}^{m} \sum_{j=j_\alpha+1}^{j_\beta} y_{i,(j)_Y}, \quad (3)$$
where $(j)_Y$ denotes the index of the negative instance in $S^-$ ranked in the $j$-th position by any fixed ordering consistent with the matrix $Y$. We define a joint feature map, $\phi_\zeta : (\mathcal{X}^m \times \mathcal{X}^n) \times \mathcal{Y}_{m,j_\beta} \to \mathbb{R}^t$, which takes a set of training instances ($m$ positive and $n$ negative samples) together with an ordering matrix of dimension $m \times j_\beta$, and produces a vector output in $\mathbb{R}^t$:
$$\phi_\zeta(S, Y) = \sum_{i=1}^{m} \sum_{j=1}^{j_\beta} (1 - y_{ij})\,(x_i^+ - x_{k_j}^-). \quad (4)$$
This feature map ensures that the variable $w$ that optimizes $w^\top \phi_\zeta(S, Y)$ will also produce the optimal pAUC score for $w^\top x$. We can summarize the above problem as the following convex optimization problem [5]:
$$\min_{w, \xi} \ \frac{1}{2}\|w\|_2^2 + C\xi \quad \text{s.t.} \quad w^\top\big(\phi_\zeta(S, Y^*) - \phi_\zeta(S, Y)\big) \ge \Delta(Y^*, Y) - \xi, \quad (5)$$
$$\forall \zeta \in Z_\beta,\ \forall Y \in \mathcal{Y}_{m,j_\beta},\ \text{and}\ \xi \ge 0.$$
Here $Y^*$ denotes the correct relative ordering, $Y$ denotes any arbitrary ordering, and $C$ controls the amount of regularization. A cutting-plane method can be used to solve (5). The cutting-plane algorithm begins with an empty constraint set and adds the most violated constraint at each iteration. For our problem, finding the most violated constraint involves a combinatorial search over an exponential number of orderings of positive and negative training instances, $\mathcal{Y}_{m,j_\beta}$. The algorithm continues until no constraint is violated by more than $\epsilon$. Since the cutting-plane method converges in a constant number of iterations [4] and (5) is solved only once during training, most of the computation time in our experiments is spent on bootstrapping hard negative samples and on weak-learner training.

2 Experiments on spatially pooled covariance

In this section we conduct additional experiments on sp-Cov with different subsets of low-level features, multi-scale patches and spatial pooling parameters.
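Before turning to the experiments, the cutting-plane training of Section 1 can be made concrete with a toy sketch. This is our own schematic, not the paper's implementation: it covers only the full-AUC special case ($\alpha = 0$, $\beta = 1$), where the loss decomposes over pairs so the most violated $Y$ has a closed form, and it stands in a few subgradient steps for an exact QP solver. The general $[\alpha, \beta]$ case requires the efficient constraint search of [5].

```python
import numpy as np

def most_violated(w, X_pos, X_neg):
    """Margin-rescaling argmax over Y for the full-AUC case: the objective
    decomposes over pairs, so y_ij = 1 exactly when pair (i, j) would
    contribute positive slack."""
    m, n = len(X_pos), len(X_neg)
    margins = (X_pos @ w)[:, None] - (X_neg @ w)[None, :]  # w'(x_i^+ - x_j^-)
    return (margins < 1.0 / (m * n)).astype(float)

def solve_qp(constraints, C, d, steps=300):
    """Approximate argmin of 0.5||w||^2 + C max_k max(0, b_k - w'a_k)
    by subgradient descent -- a stand-in for an exact QP solver."""
    w = np.zeros(d)
    for t in range(1, steps + 1):
        slacks = [b - w @ a for a, b in constraints]
        k = int(np.argmax(slacks))
        grad = w - (C * constraints[k][0] if slacks[k] > 0 else 0.0)
        w = w - grad / t  # 1/t step: the objective is 1-strongly convex
    return w

def cutting_plane_train(X_pos, X_neg, C=10.0, eps=1e-3, max_iter=20):
    m, n = len(X_pos), len(X_neg)
    d = X_pos.shape[1]
    w = np.zeros(d)
    constraints = []  # working set of (a_k, b_k) with w'a_k >= b_k - xi
    for _ in range(max_iter):
        Y = most_violated(w, X_pos, X_neg)
        # a = phi(S, Y*) - phi(S, Y) = sum_ij y_ij (x_i^+ - x_j^-)
        a = Y.sum(axis=1) @ X_pos - Y.sum(axis=0) @ X_neg
        b = Y.sum() / (m * n)  # Delta(Y*, Y) with alpha = 0, beta = 1
        xi = max([max(0.0, bk - w @ ak) for ak, bk in constraints],
                 default=0.0)
        if b - w @ a <= xi + eps:  # no constraint violated by more than eps
            break
        constraints.append((a, b))
        w = solve_qp(constraints, C, d)
    return w
```

On linearly rankable toy data the loop terminates after the working set certifies that no ordering is violated by more than $\epsilon$, and the learned $w$ ranks every positive above every negative.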
2.1 Feature representation

In the main paper we extract a covariance matrix from nine low-level visual features, $[x, y, |I_x|, |I_y|, |I_{xx}|, |I_{yy}|, M, O_1, O_2]$, where $x$ and $y$ are the pixel location, and $I_x$ and $I_{xx}$ are the first and second intensity derivatives along the $x$-axis. The last three terms are the gradient magnitude, $M = \sqrt{I_x^2 + I_y^2}$, the edge orientation as in [6], $O_1 = \arctan(|I_x| / |I_y|)$, and an additional edge orientation
$$O_2 = \begin{cases} \mathrm{atan2}(I_y, I_x) & \text{if } \mathrm{atan2}(I_y, I_x) > 0, \\ \mathrm{atan2}(I_y, I_x) + \pi & \text{otherwise}. \end{cases}$$

In this supplementary material we measure the performance improvement of our detectors with sp-Cov features. The baseline detector is trained with LUV and the seven low-level visual features described above (we exclude the pixel locations $x$ and $y$). We then combine the baseline features with covariance features extracted from different combinations of low-level features. We categorize the low-level features used to extract covariance features into four groups: pixel locations $[x, y]$ (LOC); magnitude and orientations $[M, O_1, O_2]$ (ORI); first-order intensity derivatives $[|I_x|, |I_y|]$ (G1); and second-order intensity derivatives $[|I_{xx}|, |I_{yy}|]$ (G2). We train these detectors (with the shrinkage parameter set to 0.1 and the depth of the decision trees set to 3) on the INRIA training set and evaluate them on the INRIA, ETH and TUD-Brussels test sets. Log-average miss rates on the three benchmark data sets are reported in Table 1. Comparing BASELINE + LOC + ORI, BASELINE + LOC + G1 and BASELINE + LOC + G2, we observe that covariance features extracted from LOC + ORI perform best, and those extracted from LOC + G2 perform worst, on all three benchmark data sets. This indicates that first-order derivative features are more discriminative than second-order derivative features for the task of human detection. Interestingly, applying a non-linear transformation to the first-order derivatives can further improve detection performance: ORI is derived from G1 (via $M = \sqrt{I_x^2 + I_y^2}$ and $O_1 = \arctan(|I_x|/|I_y|)$), yet it outperforms G1. The best detection performance is observed when we extract sp-Cov from all low-level visual features.

Table 1: Improvement in log-average miss rate from combining BASELINE features with sp-Cov features extracted from different low-level visual features: LOC (pixel locations), ORI (magnitude and orientations), G1 (first-order intensity derivatives) and G2 (second-order intensity derivatives). BASELINE features consist of LUV, $|I_x|$, $|I_y|$, $|I_{xx}|$, $|I_{yy}|$, $M$, $O_1$ and $O_2$ (no variance or correlation information).

Features                     INRIA    ETH      TUD-Brussels   Average
BASELINE                     24.5%    49.5%    61.4%          45.1%
BASELINE + LOC + ORI         13.7%    43.6%    53.1%          36.8%
BASELINE + LOC + G1          14.9%    45.4%    55.7%          38.7%
BASELINE + LOC + G2          18.9%    47.7%    58.5%          41.7%
BASELINE + LOC + ORI + G1    13.8%    42.3%    51.0%          35.7%
BASELINE + LOC + ORI + G2    12.9%    42.8%    50.2%          35.3%
BASELINE + LOC + G1 + G2     16.8%    47.5%    53.2%          39.2%
ALL                          12.8%    42.0%    47.8%          34.2%

2.2 Multi-scale patches

In the main paper we independently extract sp-Cov features from multi-scale patches of size 8×8, 16×16 and 32×32 pixels. Each scale generates a different set of visual descriptors. In this experiment we compare the performance of pedestrian detectors trained with sp-Cov features extracted from single-scale patches. We set the patch spacing stride (step size) to 1 pixel, the pooling region to 4×4 pixels and the pooling spacing stride to 4 pixels. Log-average miss rates are reported in Table 2. We observe that a patch size of 16×16 pixels performs best among single-scale patches.
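The dense multi-scale extraction described above can be sketched as follows. This is our own illustration (numpy finite-difference derivatives, one full 9×9 covariance matrix per patch), not the paper's implementation, and it omits the subsequent spatial pooling step.

```python
import numpy as np

def channels(img):
    """Nine low-level channels [x, y, |Ix|, |Iy|, |Ixx|, |Iyy|, M, O1, O2]
    for a grayscale image, stacked as a (9, H, W) array."""
    Iy, Ix = np.gradient(img)        # first intensity derivatives
    Iyy = np.gradient(Iy, axis=0)
    Ixx = np.gradient(Ix, axis=1)    # second intensity derivatives
    M = np.sqrt(Ix**2 + Iy**2)       # gradient magnitude
    O1 = np.arctan2(np.abs(Ix), np.abs(Iy))  # arctan(|Ix|/|Iy|), as in [6]
    theta = np.arctan2(Iy, Ix)
    O2 = np.where(theta > 0, theta, theta + np.pi)
    ys, xs = np.mgrid[0:img.shape[0], 0:img.shape[1]]
    return np.stack([xs, ys, np.abs(Ix), np.abs(Iy),
                     np.abs(Ixx), np.abs(Iyy), M, O1, O2]).astype(float)

def multiscale_cov(img, sizes=(8, 16, 32), stride=1):
    """One 9x9 covariance descriptor per patch, densely over every scale."""
    F = channels(img)
    descriptors = []
    for s in sizes:
        for r in range(0, img.shape[0] - s + 1, stride):
            for c in range(0, img.shape[1] - s + 1, stride):
                patch = F[:, r:r + s, c:c + s].reshape(9, -1)
                descriptors.append(np.cov(patch))  # channels as variables
    return descriptors
```

Each descriptor captures the pairwise correlations of the nine channels inside one patch; coarser scales reuse the same channel stack, so the channels are computed only once per image.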
The best detection performance is observed when we extract sp-Cov from multi-scale patches.

Table 2: Log-average miss rate of sp-Cov using patches at multiple scales

Patch size (pixels)   INRIA    ETH      TUD-Brussels   Average
8×8                   15.1%    44.8%    52.2%          37.4%
16×16                 13.7%    43.3%    47.8%          34.9%
32×32                 20.7%    45.5%    52.2%          39.5%
8×8 and 16×16         13.8%    42.5%    49.1%          35.1%
8×8 and 32×32         13.7%    43.6%    49.3%          35.5%
16×16 and 32×32       15.0%    43.9%    47.5%          35.5%
ALL                   12.8%    42.0%    47.8%          34.2%

2.3 Spatial pooling parameters

Spatial pooling has been shown to be invariant to various image transformations and to offer better robustness to noise. Several empirical results indicate that a pooling operation can greatly improve recognition performance [1, 2, 3]. Two pooling strategies are common in the literature: average pooling and max-pooling. Table 3 compares the detection results of both pooling operations. Similar to the results reported in [3, 1], we observe that max-pooling slightly outperforms average pooling on average. In the next experiment, we vary the pooling region size from 4×4 pixels to 10×10 pixels and report the log-average miss rates in Table 4. From the table, a pooling size of 4×4 pixels achieves the best detection result on average.
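The two pooling operations compared here can be sketched as below; this is a minimal illustration over a single 2-D feature channel, with our own parameter names (the experiments above use a 4×4 region with a stride of 4 pixels).

```python
import numpy as np

def spatial_pool(fmap, size=4, stride=4, op="max"):
    """Pool a 2-D feature map over size x size regions spaced `stride`
    pixels apart; `op` selects max-pooling or average pooling."""
    reduce_ = np.max if op == "max" else np.mean
    H, W = fmap.shape
    return np.array([[reduce_(fmap[r:r + size, c:c + size])
                      for c in range(0, W - size + 1, stride)]
                     for r in range(0, H - size + 1, stride)])
```

Max-pooling keeps the strongest response in each region, which is what makes the pooled features tolerant of small translations of the underlying image structure; average pooling instead smooths all responses in the region.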

Table 3: Log-average miss rate of sp-Cov with average pooling and max-pooling

Pooling operation   INRIA    ETH      TUD-Brussels   Average
average pooling     16.3%    41.7%    46.8%          34.9%
max-pooling         12.8%    42.0%    47.8%          34.2%

Table 4: Log-average miss rate of sp-Cov using different pooling sizes

Pooling size (pixels)   INRIA    ETH      TUD-Brussels   Average
4×4                     12.8%    42.0%    47.8%          34.2%
6×6                     11.9%    43.1%    50.6%          35.2%
8×8                     13.5%    43.9%    49.9%          35.8%
10×10                   12.1%    45.7%    51.6%          36.5%

2.4 Selected features and their spatial distribution

In this section we illustrate the proportion of feature types selected by AdaBoost. We count the number of times each feature type is selected during weak-learner training and show the proportions in Fig. 1. From the figure, both sp-LBP and sp-Cov features are often selected by AdaBoost. Fig. 2 shows the spatial distribution of the regions selected by each feature type. White pixels indicate that a large number of features are selected in that region. From the figure, most selected regions contain human contours (especially the head and shoulders). Colour features are selected around the human face (skin colour), while edge features are mainly selected around human contours (head, shoulders and feet). sp-LBP features are selected near the head and hips, while sp-Cov features are selected around the chest and the region between the legs.

References

[1] Y. Boureau, N. Le Roux, F. Bach, J. Ponce, and Y. LeCun. Ask the locals: multi-way local pooling for image recognition. In Proc. IEEE Int. Conf. Comp. Vis., 2011.
[2] K. Chatfield, V. Lempitsky, A. Vedaldi, and A. Zisserman. The devil is in the details: an evaluation of recent feature encoding methods. In Proc. British Mach. Vis. Conf., 2011.
[3] A. Coates and A. Ng. The importance of encoding versus training with sparse coding and vector quantization. In Proc. Int. Conf. Mach. Learn., 2011.
[4] T. Joachims, T. Finley, and C.-N. J. Yu. Cutting-plane training of structural SVMs. Mach. Learn., 77(1):27-59, 2009.
[5] H. Narasimhan and S. Agarwal. SVM_pAUC^tight: a new support vector method for optimizing partial AUC based on a tight convex upper bound. In Proc. ACM Int. Conf. Knowledge Discovery and Data Mining, 2013.
[6] O. Tuzel, F. Porikli, and P. Meer. Pedestrian detection via classification on Riemannian manifolds. IEEE Trans. Pattern Anal. Mach. Intell., 30(10):1713-1727, 2008.

[Figure 1: bar chart omitted]

Figure 1: Proportion of features selected by AdaBoost. Colour represents the LUV channels (3 channels). Mag represents gradient magnitude (1 channel). Edges represents vertical, diagonal and horizontal lines (6 channels). sp-LBP represents LBP features and spatially pooled LBP features (116 channels). sp-Cov represents spatially pooled covariance features (133 channels).

[Figure 2: six panels omitted — (a) Colour, (b) Magnitude, (c) Edges, (d) sp-LBP, (e) sp-Cov, (f) ALL]

Figure 2: Spatial distribution of selected features by feature type. White pixels indicate that a large number of features are selected in that area. Frequently selected regions correspond to the human contour and body.