Histogram of Template for Pedestrian Detection


PAPER
IEICE TRANS. FUNDAMENTALS/COMMUN./ELECTRON./INF. & SYST., VOL. E85-A/B/C/D, No. xx JANUARY 20xx

Shaopeng Tang, Non Member, Satoshi Goto, Fellow

Summary
In this paper, we propose a novel feature named histogram of template (HOT) for human detection in still images. For every pixel of an image, various templates are defined, each of which contains the pixel itself and two of its neighboring pixels. If the texture and gradient values of the three pixels satisfy a predefined formula, the central pixel is regarded as meeting the corresponding template for this formula. Histograms of the pixels meeting the various templates are calculated for a set of formulas and combined to form the feature for detection. Compared to other features, the proposed feature takes texture as well as gradient information into consideration. Besides, it reflects the relationship between three pixels instead of focusing on only one. Experiments on human detection are performed on the INRIA dataset, and they show that the proposed HOT feature is more discriminative than the histogram of oriented gradient (HOG) feature under the same training method.

Key words: Human detection, Histogram of template, SVM

1. Introduction

Human detection is widely used in many applications, ranging from image analysis, smart cars and visual surveillance to behavioral analysis. In recent years, a lot of research work has focused on this field. But human detection is still a challenging task because of many difficulties. Humans vary greatly in appearance, pose and so on. Differences in clothing bring a further challenge, because features such as the skin color used in face detection cannot be used in this application. Besides, complex backgrounds, illumination, occlusions and different scales must be considered in the detection. A robust detector must be insensitive to all these variations.

Gradient information is effective for object detection. Many human descriptors contain gradient information to some degree [1-6].
Histogram of oriented gradient (HOG) [1] and the covariance matrix [3] are excellent descriptors that use gradient information. HOG is a gray-level image feature formed by a set of normalized gradient histograms. In [1-2], the HOG feature is compared with many other features, such as the Haar-like feature and the wavelet feature, and it gets the best performance. The covariance matrix integrates coordinates and intensity derivatives into a matrix. Both descriptors represent the gradient information well and get good results on the INRIA human dataset. But using only gradient information may not be enough to detect humans against complex backgrounds or in low-resolution images.

Texture information also has some discriminative ability in human detection. Some research work has been done on the local binary pattern (LBP) feature [7] and the Gabor filter [8]. The Gabor filter and LBP are widely used in texture classification and face recognition. They represent the intensity information well. But using only these features is not enough to get a good result. The original definition of LBP is not suitable for human detection. It must be combined with other features such as the Laplacian EigenMap (LEM) in [8]. In [7], two variants of LBP, Semantic-LBP and Fourier-LBP, are proposed. The modified definition of LBP makes it suitable for human detection.

Besides feature extraction, the training method is also very important for human detection. They are the two key components of the pattern classification problem. The features extracted from a large number of training samples are used to train a classifier. The support vector machine (SVM) [9] and various boosting methods [10] are efficient for training a classifier in practical applications. The SVM has some advantages. It is easy to train and the global optimum is guaranteed. Because the variance caused by suboptimal training is avoided, it allows a fair comparison. The boosting method combined with the cascade strategy is widely used in real-time applications.
The boosting method aims at producing an accurate combined classifier from a sequence of weak classifiers, which are fitted to iteratively reweighted versions of the data. The cascade strategy saves detection time and makes it possible to detect objects in real time.

So there are two research directions for human detection: finding more discriminative local features [1, 3], and developing more efficient training methods [11]. The main contribution of this paper is along the first direction. It focuses on building a more powerful local feature for human detection. A new feature, the histogram of template, is proposed. It extracts texture information as well as gradient information, and makes the two different types of information homologous. Compared with features that use gradient information, such as the HOG feature, the proposed feature shows more discriminative ability. Besides, this feature can encode the relationship of three pixels in one template. Compared with features that deal with each pixel independently, the HOT feature can get a higher detection rate. Last, the HOT feature has some properties of the local binary pattern, such as illumination invariance. So normalization is not so important for the detection result. This property can be used to reduce computation complexity in some circumstances.

Manuscript received January xx, 20xx. Manuscript revised March xx, 20xx. The author is with NTT, Musashino-shi, 180-8585. The author is with IEICE Office, Minato-ku, Tokyo, 105-0011 Japan.

2. Related Work

Human detection algorithms can now be separated into three groups. The first group of methods is based on local features [1-4, 6-7, 11-13]. They extract features from sub regions of the images in the training dataset and train a classifier by support vector machine (SVM) or boosting methods, such as Adaboost or Logitboost. For a new image, they extract the same features and send them to the classifier, which gives a classification result. In [14], a local receptive fields (LRF) feature is extracted using multilayer perceptrons by means of their hidden layer. In [13], the Haar wavelet is used as the human descriptor, and the SVM in [9] is used to train the classifier. [1] uses the HOG feature as the descriptor for human detection, and [2] is developed from it. It integrates the cascade-of-rejecters approach, and uses the Adaboost method in [11] to choose the best sub window in each stage. In [11], the Haar-like feature is used to detect humans. It uses the integral image to speed up the detection process, and a cascade rejection method is proposed to make real-time human detection possible. In [3], the covariance matrix feature is used as the human descriptor to represent the coordinates and the gradient information of humans. Covariance matrices can be formulated as a connected Riemannian manifold. Each matrix can be treated as a point on the Riemannian manifold and can be mapped into a vector space. An edgelet descriptor is used in [12] for human detection. Instead of just combining the orientations in the horizontal and vertical directions as in [4], it combines the orientations in edgelet-defined directions, which makes it more efficient for human detection. This group of methods performs well, and if the training dataset is enlarged, the detection rate can be improved.
The second group of methods is based on local appearance [15-18]. They detect interest points in the training images and use the patches around the interest points to construct a codebook. Given a new image, they first find the similar patches in the codebook, and all patches vote for the positions of humans.

The third group is based on chamfer matching [19-20]. These methods use human templates to find the best matching regions in the edge map of an input image. In [19], a direct template matching approach for global shape-based human detection is proposed. [20] is developed from this but uses hierarchical templates to reduce the detection time and to solve the occlusion problem to some extent. These methods may not give a good result when there are too many edge clusters in the edge map.

Our method belongs to the first group. It uses the HOT feature to extract the texture information and the gradient information for human detection. The two types of information are made homologous to increase the discriminative ability of the proposed feature. The covariance matrix feature in [3] gets a higher detection rate than the HOG feature in [1]. But the training methods are different: [3] uses the Logitboost method with sub windows of variable size, whereas [1] uses SVM training with a fixed sub window strategy. So it is hard to say whether the covariance matrix feature is more discriminative or the training method is better. For a fair comparison, the HOT feature is compared with the HOG feature using the same training method.

3. Feature Extraction

3.1 Previous Feature

HOG is developed from the SIFT algorithm [5]. To calculate the HOG feature, the image is divided into blocks. The blocks overlap with each other. Each block contains four cells. The cell is the basic unit for the feature calculation.
For each pixel $I(x, y)$, the orientation $\theta(x, y)$ and the magnitude $m(x, y)$ of the gradient are calculated by

$$ dx = I(x+1, y) - I(x-1, y) \tag{1} $$

$$ dy = I(x, y+1) - I(x, y-1) \tag{2} $$

$$ m(x, y) = \sqrt{dx^2 + dy^2} \tag{3} $$

$$ \theta(x, y) = \tan^{-1}(dy / dx) \tag{4} $$

A histogram is calculated for each cell, and the value of each bin is the sum of the magnitudes of the pixels whose orientations fall in the corresponding interval. In [4], each block contains 2×2 cells, so a block can be represented by a 36-dimensional vector.

COV calculates a vector for each pixel in a sub window:

$$ \left[\, x,\; y,\; |I_x|,\; |I_y|,\; \sqrt{I_x^2 + I_y^2},\; |I_{xx}|,\; |I_{yy}|,\; \arctan\frac{|I_x|}{|I_y|} \,\right]^T \tag{5} $$

where $x, y$ are the pixel locations and $I_x, I_y, I_{xx}, I_{yy}$ are intensity derivatives. The last term is the edge orientation. So for each sub region, we calculate a set of 8-dimensional vectors, and a covariance matrix can be obtained from these vectors:

$$ C_R = \frac{1}{S-1} \sum_{i=1}^{S} (z_i - \mu)(z_i - \mu)^T \tag{6} $$

where $\mu$ is the mean of the vectors $z_i$ and $S$ is the number of these vectors. Due to the symmetry of the covariance matrix, only the upper triangular part is stored as the feature for detection. The descriptor of a sub region is therefore a 36-dimensional vector.

3.2 Limitation

The HOG and COV features depend mainly on gradient information. Gradient-based features have some disadvantages. Sometimes the gradient information is ambiguous: the same gradient may correspond to different curves. See Fig.1 for an example. Point P is the intersection of curve A and curve B. Using only the gradient information at P is not enough to discriminate A from B. But if the template feature is used, because the degrees of smoothness are different, P on A is more likely to meet the second template and P on B is more likely to meet the first template.

Fig.1 Disadvantage of gradient-based features. They may be ambiguous in some circumstances if only gradient information is used.

Gradient-based features use almost only the gradient information for detection and drop the texture information in the original image, although the three channels of a color image are used in the gradient calculation. Texture information also shows discriminative ability in LBP-based features [7-8] and local appearance based features [15-18]. So if texture information is used together with gradient information, a more accurate detection result can be obtained.

3.3 Histogram of Template Feature

The histogram of template feature is proposed here. Some templates are given to define the spatial relationship of three pixels. See Fig.2 for an example. In Fig.2, 12 connected templates are given. In our experiment, templates (1) to (8) are used for the feature calculation. All 12 templates can be used for a more accurate result.

Fig.2 The 12 templates. Each is a combination of three pixels.

These templates are used in some formulas. The texture information and the gradient information are also used in these formulas, to give a concrete definition of the feature. The formulas are designed to capture the shape of the human body with reasonable computation complexity. For texture information, two formulas are given as follows. The first is:

$$ I(P) > I(P_1) \;\&\&\; I(P) > I(P_2) \tag{7} $$

For each template, if the intensity value of P is greater than that of the other two pixels, the pixel P is regarded as meeting this template. The formula captures the pixels that have the greatest value in a template, and the histogram of the pixels that satisfy each template in a sub window reflects the properties of a local part of the human body well.

For each sub window, the number of pixels meeting each template is counted to get a histogram. See Fig.3. For example, eight templates are used to extract the feature. The histogram then has eight bins, and each bin corresponds to one template. The value of each bin is the number of pixels in the sub region that meet the requirement of this template.

Fig.3 Example of the histogram of template for one formula. 8 templates are used, and they correspond to 8 bins. The value of each bin is the number of pixels meeting the corresponding template.

The second formula is:

$$ k = \arg\max \{\, I(P) + I(P_1) + I(P_2) \,\} \tag{8} $$

where the maximum is taken over the templates. If the sum of the intensity values of the three pixels in template k is greater than the sums for the other templates, P is regarded as meeting template k. A histogram can be calculated using formula (8). With this formula, we find the template that has the greatest sum. Such templates can be regarded as the basic units of the human body shape, so the shape of the human body can be represented well.

For the gradient magnitude information, there are similar formulas:

$$ Mag(P) > Mag(P_1) \;\&\&\; Mag(P) > Mag(P_2) \tag{9} $$

$$ k = \arg\max \{\, Mag(P) + Mag(P_1) + Mag(P_2) \,\} \tag{10} $$

Eight templates are usually used to extract the feature, so for each formula an eight-dimensional vector is obtained. These vectors are combined to form the final feature. See Fig.4.

Fig.4 Final HOT feature for a sub window. It is an m × n dimensional vector. In our experiment, m = 8 and n = 4.

The integral image can be used for feature extraction. For example, if 4 formulas and 8 templates are used, the histogram has 32 bins, and 32 additional images are used, one per bin. If a pixel in the original image satisfies one template for one formula, the value of that pixel in the corresponding additional image is 1; otherwise it is 0. Then, by constructing the integral images of the additional images, we can get the 32-bin histogram for each sub window quickly.

Compared with the HOG feature, the HOT feature has three advantages. The first is that it uses not only gradient information but also texture information. Although the HOG feature also uses the three channels of a color image for the gradient calculation, the texture information is ignored and is not treated as a cue for detection. The second is that the HOT feature captures more macrostructure. HOG is actually orientation voting: after the gradient is computed, the feature is calculated at the pixel level.
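The four formulas of Section 3.3 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the eight entries of `TEMPLATES` are hypothetical placeholders (the real templates are defined by Fig.2, which is not reproduced here), ties in formulas (8) and (10) go to the lowest template index, and the names `gradient_magnitude`, `shifted`, `template_votes` and `hot_feature` are introduced only for this example. The histogram is counted directly rather than through integral images.

```python
import numpy as np

# Hypothetical template set: each entry holds the (dy, dx) offsets of the two
# neighbors P1 and P2 of the central pixel P. Fig.2 of the paper defines the
# real templates; these eight connected ones are placeholders.
TEMPLATES = [
    ((0, -1), (0, 1)),    # horizontal line through P
    ((-1, 0), (1, 0)),    # vertical line through P
    ((-1, -1), (1, 1)),   # main diagonal
    ((-1, 1), (1, -1)),   # anti-diagonal
    ((-1, 0), (0, 1)),    # four corner shapes
    ((0, 1), (1, 0)),
    ((1, 0), (0, -1)),
    ((0, -1), (-1, 0)),
]


def gradient_magnitude(img):
    """Central-difference gradient magnitude, Eqs. (1)-(3); borders stay 0."""
    img = img.astype(np.float64)
    dx = np.zeros_like(img)
    dy = np.zeros_like(img)
    dx[:, 1:-1] = img[:, 2:] - img[:, :-2]
    dy[1:-1, :] = img[2:, :] - img[:-2, :]
    return np.sqrt(dx ** 2 + dy ** 2)


def shifted(a, dy, dx):
    """View of `a` at offset (dy, dx), restricted to interior pixels."""
    h, w = a.shape
    return a[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]


def template_votes(channel):
    """Binary maps of shape (n_templates, H-2, W-2) for one channel.

    greater[k]: P exceeds both neighbors of template k (formulas 7 and 9).
    largest[k]: template k has the largest three-pixel sum (formulas 8 and 10).
    """
    p = shifted(channel, 0, 0)
    greater, sums = [], []
    for (dy1, dx1), (dy2, dx2) in TEMPLATES:
        p1 = shifted(channel, dy1, dx1)
        p2 = shifted(channel, dy2, dx2)
        greater.append((p > p1) & (p > p2))
        sums.append(p + p1 + p2)
    best = np.argmax(np.stack(sums), axis=0)  # first index wins on ties
    largest = np.stack([best == k for k in range(len(TEMPLATES))])
    return np.stack(greater), largest


def hot_feature(window):
    """HOT vector of one sub window: 4 formulas x 8 templates = 32 bins."""
    window = window.astype(np.float64)
    maps = []
    for channel in (window, gradient_magnitude(window)):
        maps.extend(template_votes(channel))
    # Each bin counts the pixels that meet one template under one formula.
    return np.concatenate(
        [m.reshape(len(TEMPLATES), -1).sum(axis=1) for m in maps]
    )
```

Calling `hot_feature` on a 16×16 array returns 32 counts, ordered as formulas (7), (8), (9), (10) with eight bins each. Because the tests only compare values within a template, adding a constant to every pixel leaves the vector unchanged in this sketch, which is one way to read the illumination-invariance property claimed for the feature.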
The HOT feature, in contrast, is specific pattern voting, and the feature is extracted at a middle level that contains several three-pixel combinations. So the HOT feature is more discriminative, and the experiments confirm this point. The third is that the HOT feature is illumination-invariant, so normalization is not as necessary as it is for the HOG feature.

4. Training Method

The training method is also very important for the detection result. A reasonable training method improves the result considerably. So for a fair comparison of different features, the effect of the training method should be considered. The support vector machine and many boosting methods, such as Adaboost, Logitboost and Gentleboost, are widely used in many tasks. In our experiment, SVM is used for the comparison.

The SVM in [1] is effective for learning from small samples in high-dimensional spaces. The objective of SVM is to find a decision plane that maximizes the inter-class margin. The feature vectors are projected into a higher-dimensional space by a kernel function. The kernel function makes it possible to solve linearly non-separable problems, and the mapping function does not need to be known explicitly. The decision rule is given by the following formula:

$$ f(x) = \sum_{i=1}^{N_s} \beta_i K(x_i, x) + b \tag{11} $$

where $x_i$ are the support vectors, $N_s$ is the number of support vectors, and $K(x, y)$ is the kernel function. So the training process of SVM is to find the proper parameters of (11).

Compared with boosting methods, SVM needs more computational resources and is difficult to use in real-time applications. The size of the sub windows should be fixed. It is hard to adopt a variable sub-window size strategy because of the computation cost, although the variable window strategy can improve the performance considerably. But for the purpose of comparison, SVM is suitable. The training time is short and the optimization is guaranteed, so the performance difference caused by the optimization can be ignored. The parameters of SVM are controllable, and suitable parameters can be selected to avoid differences caused by parameter settings. In our experiment, LibSVM [21] is used.
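As a small illustration of decision rule (11), the sketch below evaluates f(x) for given support vectors. It is not LibSVM code: the function names and the toy support vectors, weights and bias are made up for this example, and the RBF form $\exp(-\gamma \|x - y\|^2)$ is the usual definition, assumed to match the kernel used in the experiments.

```python
import numpy as np


def rbf_kernel(x, y, gamma):
    """RBF kernel K(x, y) = exp(-gamma * ||x - y||^2)."""
    diff = np.asarray(x, dtype=np.float64) - np.asarray(y, dtype=np.float64)
    return float(np.exp(-gamma * np.dot(diff, diff)))


def linear_kernel(x, y):
    """Linear kernel K(x, y) = <x, y>."""
    return float(np.dot(x, y))


def decision_value(x, support_vectors, betas, b, kernel):
    """Evaluate f(x) = sum_i beta_i * K(x_i, x) + b over the N_s support vectors."""
    total = sum(beta * kernel(sv, x) for sv, beta in zip(support_vectors, betas))
    return total + b


# Toy values, chosen only to exercise the formula.
toy_svs = np.array([[1.0, 0.0], [0.0, 1.0]])
toy_betas = [1.0, -1.0]
toy_bias = 0.5
value = decision_value([2.0, 3.0], toy_svs, toy_betas, toy_bias, linear_kernel)
label = 1 if value >= 0 else -1  # the sign of f(x) gives the predicted class
```

With the linear kernel the sum collapses to a single dot product with a weight vector, which is one reason a linear SVM is cheaper at detection time than an RBF SVM whose cost grows with $N_s$.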
RBF and linear kernel functions are each used in our experiment.

5. Experiment

5.1 Dataset

The experiment is performed on the INRIA dataset [22], which is widely used for human detection in still images. The database contains 1774 human annotations and 1671 person-free images. The dataset is made up of a training dataset and a testing dataset. 1208 human annotations and 1218 non-human images are used for the training stage, and the remaining images for testing. For positive samples, left-right reflections are also used, so 2416 positive samples are used for training. More detail can be found in [22]. The dataset contains a variety of human poses, clothing, lighting, clutter and occlusions, so it is difficult for human detection and is suitable as a benchmark for comparing different methods. Some examples can be seen in Fig.5.

Fig.5 Selected positive samples in the INRIA dataset

5.2 Comparison with other features

In order to show the advantage of the proposed feature, we design the following four experiments. In the first experiment, we compare our feature with the HOG feature and the COV feature. We use the same strategy as [1]. The re-sample strategy and the normalization strategy are also used in our experiment. The size of the sub window and the stride between sub windows are those given in [1]. In the second experiment, a random ensemble strategy is used, so we do not have to consider the size of the sub window and the stride, and the comparison is fairer. In the third experiment, we evaluate the length of the proposed feature. Only 8 templates are used in the first experiment, and we show that if more templates are used, the performance is further improved. In the fourth experiment, we show how the results change with the parameters and the training strategies, that is, the result of our feature in different configurations.

(1) In the first experiment, we compare the HOT feature with the HOG and COV features. We use the same strategy as for the HOG feature. Re-sampling and normalization are used in our experiment, just as for HOG in [1]. The size of the sub window and the stride are also those given in [1].
In the re-sample stage, 2416 positive samples and 12800 randomly selected negative samples are used as the initial training dataset. There are also 39000 hard negative samples in our experiment. The initial training dataset and the hard negative samples are used to get the final classifier. The linear kernel and the RBF kernel are each used for training. The C value and g value of the RBF kernel are selected by cross-validation, which is a common method for SVM, and LibSVM provides a tool for it. C and g are obtained using only the training dataset. The C value is 128 and the g value is 0.00048828125. We also use the default C and g to evaluate the performance of our feature. From Fig.6, it can be seen that the result is nearly the same.

The size of a block is set to 16×16 and the stride between two blocks is 8. These are the same as for the HOG feature. The feature length of a block is 32, since the first eight templates in Fig.2 are used with four formulas. A 64×128 image contains 7 × 15 = 105 such blocks, so the length of the feature for a 64×128 image is 3360. It is shorter than HOG and COV, which means that the computation complexity and the memory consumption are lower. The comparison result can be seen in Fig.6. The data of HOG and COV are copied from [3] for comparison. The configuration for each feature can be seen in Table 1.

Table 1 The experiment configurations of the three features

                     HOG      COV          HOT
Feature dimension    36       36           32
Re-sample            Y        Y            Y
Sub window size      16×16    Variable     16×16
Stride               8        N            8
Training method      SVM      Logitboost   SVM
Normalization        Y        Y            Y
Unbalance data       N        N            N

From Fig.6, it can be seen that the HOT feature gets a higher detection rate than the HOG and COV features at $10^{-4}$ FPPW. Usually, the miss rate is compared at $10^{-4}$ FPPW [1]. Note that the COV feature uses variable sub windows, which can improve the detection rate considerably compared with fixed sub windows [2-3, 7]. We can therefore say that our feature outperforms HOG and COV.

Fig.6 Comparison with the methods of HOG [1] and COV [3]. The curves of HOG and COV are copied from [3].

(2) In the second experiment, we try to reduce the influence of block size and stride.
A random ensemble strategy n [7] s used. From an mage (64 128) n our experment, lots of sub wndows n dfferent szes and on dfferent postons

[*author name] SAKURADA :. INSTRUCTIONS FOR THE IEICE TRANS. AUTHOR S TEMPLATE 6 can be extracted. The mnmum sub wndow sze s set as K K. Ths sze s ncremented n a step of K horzontally and vertcally, or both. Fnally we can get all possble sub wndows: W = { r}. In our experment, subwn K =8. So the cardnalty of W subwn s 4896. Smaller K wll gve more sub wndows. Random ensemble means that n w sub wndows are random selected from W subwn. In our experment, n w =150. r j, j = 1, 2,... n. w A set of sub wndows can be obtaned: { } For each sub wndow, a feature s calculated. The features for all random selected sub wndows can be obtaned as: { f j, j = 1, 2,... n }. So the fnal feature for ths detecton w wndow can be represented as F = { f1, f2... f nw }. If the lengthen for each sub wndow s d, the dmenson of the fnal feature s d n. w n the above experments. If more templates are used, the detecton result wll be mproved. The performance s evaluated n ths experment. The template s actually the three pxels combnaton. There are 9 8 7 possble templates n all n a 3 3 regon. We only consder the connected one. Some connected templates contanng central pxel can be seen n Fg.2 and Fg.8. Other connected templates can be obtaned by shftng these 20 templates. Fg.8 Another 8 connected templates contanng central pxel The detecton result of 12 templates and 20 templates can be seen n Fg.9. Fg. 7 Comparson of three features Random ensemble strategy s used here. The nfluence of parameters can be gnored. By usng the random ensemble strategy, the nfluence of block sze and strde can be gnored, because all sub wndows are random selected from W subwn. HOT, HOG and COV are evaluated by usng ths strategy. The ntal tranng dataset n the frst experment s used for tranng. Ln-Ker SVM s used, so there s no C value and g value. In ths evaluaton strategy, there s no any parameter for the feature extracton. The comparson of three features can be seen n Fg.7. 
Our feature outperforms HOG and COV in this experiment, which shows the discriminative ability of our feature.

(3) In the third experiment, the length of the HOT feature is evaluated. The HOT feature is computed by using formulas and templates. The first eight templates in Fig.2 are used for the 32-dimensional feature for a sub window.

Fig.9 Detection result when more templates are used

In Fig.9, the initial training dataset and an SVM with RBF kernel are used for training. For the 8-template case, the first 8 templates in Fig.2 are used. For the 12-template case, all templates in Fig.2 are used. For the 20-template case, all corrected templates containing the central pixel in Fig.2 and Fig.8 are used. From Fig.9, it can be seen that increasing the number of templates improves the detection result, but the improvement is limited when the number of templates is increased from 12 to 20. Using more templates also increases the length of the feature, which means that more time is needed for classification. So it may not be necessary to use all the three-pixel combinations; 8 or 12 templates are suitable for human detection. It is a tradeoff between the detection rate and the computational complexity. Note that in the first and second experiments, only the first 8 templates are used.

(4) In the fourth experiment, the performance of the HOT feature under different parameter configurations and training strategies is evaluated. For the training strategy, we consider the normalization strategy and the unbalanced data strategy; for the parameters, we evaluate the size of the sub window and the stride between two neighboring sub windows. The initial training dataset from the first experiment is used for training.

Normalization schemes: In the experiment, the Non-norm, L1-norm and L2-norm strategies are compared. Let $v$ be the un-normalized feature vector. The schemes are: (a) Non-norm, $v \to v$; (b) L1-norm, $v \to v / (\|v\|_1 + \xi)$; (c) L2-norm, $v \to v / \sqrt{\|v\|_2^2 + \xi^2}$. See Fig.10 for the performance comparison. L2-norm outperforms the Non-norm and L1-norm schemes, but the difference is small, so this step is not necessary for the HOT feature: the HOT feature is itself illumination-invariant. For the HOG feature, in contrast, normalization cannot be omitted because of illumination changes. This means that feature extraction can be accelerated by abandoning the normalization, so that only integer calculation is needed in the feature extraction, which is easy to accelerate in hardware. In the first experiment, L2-norm is used for fair comparison because HOG uses L2-norm.

Fig.10 Comparison of different normalization schemes

Unbalanced data: Since the number of negative samples is larger than that of positive samples, more negative images are used for training than positive images. It is reasonable that using different penalty values for the different classes may increase the detection rate. The ratio of C for negative samples to C for positive samples is set to 3:1, 2:1, 1:1, 1:2 and 1:3 for comparison. See Fig.11. It can be seen that if the penalty value of positive samples is greater than that of negative samples, the performance can be improved, but the difference is small. In the first experiment, we do not consider it, and the same penalty is used for positive and negative samples for fair comparison.

Fig.11 Different penalty values for the different classes

Sub window size: An input detection image (64×128) is first divided into many sub windows (blocks). Sub windows can overlap with each other. Since the size of the sub window is fixed, a suitable size should be decided for training and detection. In the experiment, 12×12, 16×16 and 20×20 are compared. See Fig.12. 20×20 has the best performance, and the results of 16×16 and 12×12 are nearly the same as those of 20×20. For fair comparison, 16×16 is used in the first experiment.

Fig.12 The size of the sub window

Stride between sub windows: the distance between two neighboring blocks. The area of the overlap region of two sub windows is decided by this value. The smaller this value is, the longer the final feature is. In the experiment, 4, 8 and 16 are evaluated. See Fig.13. 8 is used in the first experiment, just as for the HOG feature.
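The effect of sub window size and stride on the feature length can be checked with a little arithmetic. The sketch below counts the sub windows in a 64×128 detection image and multiplies by the per-sub-window dimension. The 32-dimensional figure for 8 templates comes from the text; the regular grid without padding and the function names are assumptions made for illustration only.

```python
def num_sub_windows(win_w, win_h, size, stride):
    """Count square sub windows of side `size` placed every `stride` pixels.

    Assumes a regular grid with no padding, so a sub window is kept only
    if it lies fully inside the detection window.
    """
    if size > win_w or size > win_h:
        return 0
    nx = (win_w - size) // stride + 1
    ny = (win_h - size) // stride + 1
    return nx * ny


def feature_length(win_w, win_h, size, stride, dims_per_sub_window=32):
    """Length of the concatenated feature for one detection window."""
    return num_sub_windows(win_w, win_h, size, stride) * dims_per_sub_window


if __name__ == "__main__":
    # The strides evaluated in the fourth experiment, with 16x16 sub windows.
    for stride in (4, 8, 16):
        n = num_sub_windows(64, 128, 16, stride)
        print(f"stride={stride:2d}: {n} sub windows, "
              f"feature length {feature_length(64, 128, 16, stride)}")
```

Under these assumptions, the configuration used in the first experiment (16×16 sub windows, stride 8) gives 7 × 15 = 105 sub windows and a 3360-dimensional feature, and a smaller stride lengthens the feature, as stated above.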

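The three normalization schemes compared in the fourth experiment can be sketched as follows. This is a minimal illustration, not the authors' implementation: the value of the constant ξ (`xi`) is not given in the text and is an assumption here, as are the function name and the example histogram.

```python
import math


def normalize(v, scheme="l2", xi=1e-3):
    """Normalize a feature vector `v` with one of the three schemes.

    scheme: "none" (v -> v), "l1" (v -> v / (||v||_1 + xi)),
            or "l2" (v -> v / sqrt(||v||_2^2 + xi^2)).
    xi:     small constant that avoids division by zero; its value is
            an assumption, not taken from the paper.
    """
    if scheme == "none":
        return [float(x) for x in v]
    if scheme == "l1":
        denom = sum(abs(x) for x in v) + xi
    elif scheme == "l2":
        denom = math.sqrt(sum(x * x for x in v) + xi * xi)
    else:
        raise ValueError(f"unknown scheme: {scheme!r}")
    return [x / denom for x in v]


if __name__ == "__main__":
    hist = [3, 0, 4, 0]  # hypothetical integer bin counts
    for scheme in ("none", "l1", "l2"):
        print(scheme, normalize(hist, scheme))
```

The "none" scheme leaves the integer bin counts untouched, which is why dropping normalization keeps the feature extraction in integer arithmetic.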
Fig.13 Stride between two sub windows

Finally, some detection results on natural images, obtained with the classifier from the first experiment, can be seen in Fig.14.

Fig.14 Detection result of natural images

6. Conclusion

A new feature for human detection is proposed in this paper. A histogram of pixels meeting different templates is used as a feature for human detection. It integrates texture information and gradient information, and shows more discriminative ability than a purely gradient-based feature, even though the feature is shorter. Another advantage is that the HOT feature is illumination-invariant, so normalization is not a necessary step for detection, which is very useful for hardware accelerators that only support integer calculation. In our experiments, the size of the sub window is fixed. It is expected that a variable window size and the boosting method will further improve the performance of the HOT feature. The computation of the HOT feature can be carried out in parallel, so it is easy to accelerate in hardware. In addition, an integral image can be used for the computation. These factors make real-time application possible.

Acknowledgments

This research was supported by the Ambient SOC Global COE Program of Waseda University of the Ministry of Education, Culture, Sports, Science and Technology, Japan, and the CREST Program.

References

[1] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Conference on Computer Vision and Pattern Recognition, 2005.
[2] Q. Zhu, et al., "Fast human detection using a cascade of histograms of oriented gradients," in IEEE Conf. on Computer Vision and Pattern Recognition, New York, 2006, pp. 1491-1498.
[3] T. Oncel and P. Fatih, "Pedestrian detection via classification on Riemannian manifolds," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, Oct. 2008.
[4] K. Mikolajczyk, et al., "Human detection based on a probabilistic assembly of robust part detectors," in ECCV, 2004, pp. 69-82.
[5] D. Lowe, "Distinctive image features from scale-invariant keypoints," IJCV, vol. 60, pp. 91-110, 2004.
[6] Z. Lin and S. Larry, "A pose-invariant descriptor for human detection and segmentation," in ECCV, 2008.
[7] M. Yadong and Y. Shuicheng, "Discriminative local binary patterns for human detection in personal album," in CVPR, 2008.
[8] L. Nanni and A. Lumini, "Ensemble of multiple pedestrian representations," IEEE Transactions on Intelligent Transportation Systems, vol. 9, June 2008.
[9] B. Scholkopf and A. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond, 2002.
[10] J. Friedman, et al., "Additive logistic regression: a statistical view of boosting," Ann. Stat., vol. 28, pp. 337-407, 2000.
[11] P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features," in Conference on Computer Vision and Pattern Recognition, 2001.
[12] B. Wu and R. Nevatia, "Detection and tracking of multiple, partially occluded humans by Bayesian combination of edgelet based part detectors," IJCV, 2007.
[13] P. Papageorgiou and T. Poggio, "A trainable system for object detection," IJCV, vol. 38, pp. 15-33, 2000.
[14] S. Munder and D. M. Gavrila, "An experimental study on pedestrian classification," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, November 2006.
[15] B. Leibe, et al., "Pedestrian detection in crowded scenes," in IEEE Conf. on Computer Vision and Pattern Recognition, San Diego, 2005, pp. 878-885.
[16] B. Leibe, et al., "Combined object categorization and segmentation with an implicit shape model," in ECCV, 2004, pp. 17-32.
[17] S. Agarwal and D. Roth, "Learning a sparse representation for object detection," in ECCV, 2002.
[18] E. Seemann, et al., "Towards robust pedestrian detection in crowded image sequences," in CVPR, 2007.
[19] D. M. Gavrila and V. Philomin, "Real-time object detection for smart vehicles," 1999, pp. 87-93.
[20] Z. Lin, et al., "Hierarchical part-template matching for human detection and segmentation," in IEEE International Conference on Computer Vision, Rio de Janeiro, Brazil, 2007.
[21] LibSVM [Online]. Available: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
[22] INRIA Dataset [Online]. Available: http://lear.inrialpes.fr/data