Efficient Object Detection Using Cascades of Nearest Convex Model Classifiers


Hakan Cevikalp
Eskisehir Osmangazi University, Meselik Kampusu, 26480, Eskisehir, Turkey

Bill Triggs
Laboratoire Jean Kuntzmann, B.P. 53, Grenoble Cedex 9, France

Abstract

An object detector must detect and localize each instance of the object class of interest in the image. Many recent detectors adopt a sliding window approach, reducing the problem to one of deciding whether the detection window currently contains a valid object instance or background. Machine learning based discriminants such as SVM and boosting are typically used for this, often in the form of classifier cascades to allow more rapid rejection of easy negatives. We argue that one-class methods (ones that focus mainly on modelling the range of the positive class) are a useful alternative to binary discriminants in such applications, particularly in the early stages of the cascade where one-class approaches may allow simpler classifiers and faster rejection. We implement this in the form of a short cascade of efficient nearest-convex-model one-class classifiers, starting with linear distance-to-affine-hyperplane and interior-of-hypersphere classifiers and finishing with kernelized hypersphere classifiers. We show that our methods have very competitive performance on the Faces in the Wild and ESOGU face detection datasets and state-of-the-art performance on the INRIA Person dataset. As predicted, the one-class formulations provide significant reductions in classifier complexity relative to the corresponding two-class ones.

1. Introduction

Object class detection is an important computer vision task in which all instances of a given generic object class that occur in an image must be recovered and labeled with their correct image positions and scales. It is difficult owing to the highly variable shape and appearance of common object categories, changing scales, view-points and lighting conditions, complex backgrounds, occlusion and clutter.
The methods that currently dominate the field are based on scanning the image at multiple scales with window-level object / non-object classifiers that use machine learning discriminants such as Support Vector Machines over high-dimensional visual feature sets [7]. Both the feature set and the classifier are critical for obtaining good performance. Here we concentrate on the classifier. We introduce a novel short cascade approach that uses one-class component classifiers based on simple convex geometric models and graduated nonlinearity. Geometric one-class approaches prioritize the accurate and efficient approximation of the feature space regions occupied by the positive object class over the explicit discrimination of positives from negatives. Particularly in the earlier stages of the cascade, this simplifies the component classifiers and allows early rejection of easy negatives. Specifically, our method combines distance-to-affine-hyperplane, linear hypersphere and kernelized hypersphere classifiers.

We also enhance some existing one-class classification software to handle large-scale problems and introduce a new face detection dataset. The well-established MIT+CMU face dataset [21] is somewhat dated in the sense that it includes only a limited number of images, and that these are grayscale with relatively low resolutions. The majority of the faces in the newer Faces in the Wild dataset appear in the middle of the image with similar scales, limiting its value as a test set for multiscale face detectors (it is principally a face recognition dataset, not a face detection one). We therefore developed ESOGU Faces, a new frontal face detection dataset containing 285 higher-resolution color images of complex real-world scenes taken under a wide range of different illumination conditions.

2. Previous Work

Regarding feature sets, early detectors used raw pixel values [22], wavelets [19], edges [3], and Gabor filter responses [23]. More recently, histogram based features have become very popular owing to their performance and efficiency.
Many of these are based on oriented image gradients, including SIFT [16], SURF [4], Histogram of Oriented Gradients (HOG) [6], PHOG [26], Generalized Shape Context [5] and Local Edge Orientation Histograms [15]. Others are based on local patterns of qualitative graylevel differences, including Local Binary Patterns (LBP) [1,28] and Local Ternary Patterns (LTP) [24]. The best feature set depends on the application and new ones are being developed all the time. Current detectors often combine several feature sets for better results, either simply concatenating them to form an extended feature vector [10], or finding optimal combination coefficients at the learning stage [26].

Regarding the decision rule, most methods reduce the detection problem to binary classification, i.e. determining whether the detector window currently contains a correctly framed true class instance or something else (background, a partial or incorrectly framed instance, another class, etc.). Machine learning classifiers ranging from nearest neighbors to neural networks, convolutional neural networks, probabilistic methods and classification trees have been used, but two approaches have received much of the attention owing to their interesting properties: boosting based cascades, and Support Vector Machines. Viola & Jones [27] produced a very efficient face detector by using AdaBoost to train a cascade of pattern-rejection classifiers over rectangular wavelet features. Each stage of the cascade is designed to reject a considerable fraction of the negative cases that survive to that stage, so most of the windows that do not contain faces are rejected early in the cascade with comparatively little computation. As the cascade progresses, rejection typically gets harder so the single-stage classifiers grow in complexity.

Although cascades give excellent results for real-time face detection, Support Vector Machine (SVM) classifiers are currently a more common choice for more general object detection under less stringent time constraints [10,9,6,26,2]. Linear SVMs are usually preferred for their simplicity and speed, although it is well-established that kernel SVMs typically give higher accuracy at the cost of greatly increased computational complexity [26].
For this reason, several state-of-the-art methods use short cascades in which the early stages use linear SVMs to reject most of the negative windows quickly while the later stages use nonlinear SVMs to make the final decisions [10,26].

Several previous detectors have used one-class formulations such as the Support Vector Data Description (SVDD) method of Tax and Duin [25], which approximates classes with hyperballs (the interiors of hyperspheres) in feature space. Jin et al. [13] use a kernelized hypersphere model for face detection. However, to reduce the computational complexity, they first divide the putative face window into 9 blocks using heuristic rules such as the eye regions being darker than the cheeks and the bridge of the nose, etc., only applying the nonlinear classifier if the region passes all of these tests. Their method thus applies only to face detection. Mele & Maver [18] use linear hypersphere classifiers to detect specific shapes in binary segmented images, but their approach is not applicable to general object detection in color or grayscale images. In contrast, our method can be used to detect any more or less rigid object class in natural images, and it is significantly faster than nonlinear two-class SVM based detectors while maintaining comparable overall accuracy.

3. Our Approach

In object localization any sample that does not belong to the object class is considered to be background, so in feature space, the class samples typically lie in specific regions surrounded by a diffuse sea of background samples. Given that such backgrounds are defined negatively (as anything at all that is not a well-framed class instance), discriminative training methods typically need to process very large numbers of negative training samples to represent them well. This explains both their need for multiple cycles of search for hard negatives and retraining [6], and the extremely unbalanced non-class to class ratios that result from this.
We argue that this is counterproductive, and that (at least in the early stages of the cascade) it is preferable to concentrate on modeling the extent of the positive class well, discarding any sample that does not conform to this model as an easy negative and postponing the more nuanced decisions until later. To achieve this we treat each stage of the object detection cascade as a one-class nearest-convex-model classification problem, not a two-class discrimination problem. For each stage of the cascade we would ideally use the positive training samples alone to learn a convex approximation to the region occupied by the class in feature space, then classify samples as possible-class or background by thresholding their feature-space distances to the convex model. In practice, we find that it is useful to include some background information from the negative samples in each stage, while still retaining one-class style models and philosophy.

Below we will use either affine hyperplanes or bounding spheres for our convex cascade-stage models because it is very efficient to find distances to these, thus potentially allowing real-time performance. However many other models are possible at the cost of more expensive distance computations, including lower-dimensional affine subspaces, convex hulls, hyper-disks and hyper-ellipsoids, etc.

In this paper we will focus on a particular form of cascade in which each stage uses a different kind of nearest convex model classifier, as illustrated in Fig. 1. Our cascades have three stages graded by the computational complexity of the model. The first stage is a linear classifier that uses an intersection of affine hyperplanes to approximate the class. This is computationally efficient and also easy and fast to train, even for large training sets. Its goal is to reject as many of the background samples as possible, while still passing almost all of the class samples to the next stage. The second stage is a linear SVDD classifier [25], i.e. one based on a hypersphere model in the (non-kernelized) input space. This is equally fast to run, but somewhat more expensive to train as it uses a maximum-margin formulation rather than simple linear fitting. It turns out to be complementary to the first stage, rejecting most of the false positives that stage 1 passes. The third stage of the cascade makes the final decisions, using a kernelized hypersphere model to approximate the object class. This is slower, but it only needs to test the small number of positives and difficult negatives passed by the first two stages. We now present each of these three stages in detail.

Figure 1. (Best viewed in color). An illustration of pruning in our proposed three-stage cascade. (a) Input data: points from the object (positive) and background (negative) classes are shown respectively as blue triangles and black dots. (b) In the first stage of the cascade, the positive samples are bounded with a series of hyperplane shaped slabs (the dashed line and its two borders). Samples outside the slabs are classified as negatives and rejected. The background examples that survived this stage are shown as red dots. (c) The second stage of the cascade is a linear-space one-class classifier that approximates the object class region with a bounding hypersphere. Most of the false positives that survive the first stage are discarded here. (d) The final stage of the cascade is a kernelized one-class classifier that approximates the positive region more accurately and makes the final decision.

3.1. Linear Hyperplane Approximation

The first stage of our cascade tests the distance of the sample to a series of affine hyperplanes. Let $x$ be the sample's feature vector and let $w_1^\top x + b_1 = 0$ be the equation of the first hyperplane, where $w_1$ is a unit vector of feature weights and $b_1$ is a bias. We reject samples for which $|w_1^\top x + b_1| > \tau_1$, where $\tau_1$ is a threshold determined by cross-validation.
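As an illustration of this first-stage test, the following is a minimal NumPy sketch rather than the authors' implementation. It assumes a two-sided (slab) test on the absolute distance, as suggested by the slabs in Fig. 1, and it removes the component along each hyperplane normal before the next test in the series (the orthogonal-complement step of this stage). The function name `passes_hyperplane_series` and the `(w, b, tau)` tuple layout are hypothetical.

```python
import numpy as np

def passes_hyperplane_series(x, hyperplanes):
    """Return True if feature vector x survives every slab test in the series.

    hyperplanes: iterable of (w, b, tau) triples, where w is a unit normal,
    b a bias and tau a rejection threshold (hypothetical layout).
    """
    x = np.asarray(x, dtype=float)
    for w, b, tau in hyperplanes:
        w = np.asarray(w, dtype=float)
        proj = float(w @ x)            # signed length of x along the unit normal w
        if abs(proj + b) > tau:        # outside the slab: reject as background
            return False
        x = x - w * proj               # keep only the orthogonal complement
    return True
```

For example, with `planes = [(e1, 0.0, 0.5), (e2, -1.0, 0.5)]` for the first two coordinate axes `e1`, `e2`, a window is kept only if its first coordinate lies within 0.5 of 0 and its second within 0.5 of 1. Most background windows leave after one or two dot products, which is what makes this stage cheap.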
For the surviving samples, we find the orthogonal complement $x_1 = x - w_1 (w_1^\top x)$, and pass it on to the next hyperplane in the series for testing. This continues throughout the series, at each stage testing the distance of the current vector to the current hyperplane and passing on its orthogonal complement if it survives.

Let $X_+$ and $X_-$ be matrices whose rows are the training samples of respectively the object and background classes. Let $e_+$ and $e_-$ be corresponding column vectors of ones, and for convenience define extended matrices $\tilde{X}_+ = [X_+ \; e_+]$ and $\tilde{X}_- = [X_- \; e_-]$. To train the method, the simplest one-class approach would be to find the best least-squares fit to the positive data

$$\arg\min_{w_1, b_1, \|w_1\| = 1} \|X_+ w_1 + e_+ b_1\|^2 \;=\; \arg\min_{z} \frac{z^\top G z}{\|w_1\|^2} \qquad (1)$$

where $z = \begin{pmatrix} w_1 \\ b_1 \end{pmatrix}$ and $G = \tilde{X}_+^\top \tilde{X}_+$, passing the orthogonal complement $X_+ (I - w_1 w_1^\top)$ to the next stage of the series as training data. Instead we use a background-sensitive fit [17], minimizing the regularized Rayleigh quotient

$$\arg\min_{w_1, b_1, \|w_1\| = 1} \frac{\|X_+ w_1 + e_+ b_1\|^2}{\|X_- w_1 + e_- b_1\|^2 + \delta \left\| \begin{pmatrix} w_1 \\ b_1 \end{pmatrix} \right\|^2} \qquad (2)$$

which can be re-expressed as

$$\arg\min_{z} \frac{z^\top G z}{z^\top H z} \qquad (3)$$

where $H = \tilde{X}_-^\top \tilde{X}_- + \delta I$ and $\delta$ is a user-set regularization constant. Again the orthogonal complements of $X_+$ and $X_-$ are passed to the next hyperplane as training data. The solution of this problem reduces to finding the smallest-$\lambda$ eigenvector of the generalized eigenproblem $G z = \lambda H z$ and renormalizing it to find $w_1$, $b_1$.

3.2. Linear Hypersphere Approximation

The second stage of the cascade consists of a single linear SVDD classifier [25]. SVDD uses bounding hyperspheres to approximate classes. As Fig. 1 suggests, the hypersphere classifiers turn out to complement the preceding hyperplane ones well, rejecting most of the false positives that survive the first stage.

The bounding hypersphere of a point set $\{x_i\}_{i=1,\dots,n}$ is characterized by its center $c$ and radius $r$. These can be found by solving the quadratic programming problem

$$\arg\min_{c,\, r \ge 0,\, \xi \ge 0} \Big( r^2 + \gamma \sum_i \xi_i \Big) \quad \text{s.t.} \quad \|x_i - c\|^2 \le r^2 + \xi_i, \;\; i = 1, \dots, n. \qquad (4)$$
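Returning briefly to the first-stage hyperplane fit: the generalized eigenproblem $G z = \lambda H z$ can be solved with standard dense linear algebra. The sketch below is one possible way to do this with NumPy alone and is not taken from the paper. It assumes $\delta > 0$, so that $H$ is positive definite and has a Cholesky factor, and reduces the problem to an ordinary symmetric eigenproblem. The function name and the default value of `delta` are illustrative choices.

```python
import numpy as np

def fit_background_sensitive_hyperplane(X_pos, X_neg, delta=1e-3):
    """Fit one hyperplane (w, b) by the smallest generalized eigenvector of G z = lambda H z.

    X_pos, X_neg: arrays whose rows are positive / background feature vectors.
    delta: regularization constant (illustrative default, must be positive).
    """
    # Extended matrices [X e]: append a column of ones for the bias term.
    Xp = np.hstack([X_pos, np.ones((X_pos.shape[0], 1))])
    Xn = np.hstack([X_neg, np.ones((X_neg.shape[0], 1))])
    G = Xp.T @ Xp
    H = Xn.T @ Xn + delta * np.eye(Xp.shape[1])

    # With H = L L^T, the generalized problem becomes the symmetric problem
    # (L^-1 G L^-T) y = lambda y with z = L^-T y.
    L = np.linalg.cholesky(H)
    A = np.linalg.solve(L, G)            # L^-1 G
    M = np.linalg.solve(L, A.T)          # L^-1 G L^-T (G is symmetric)
    M = 0.5 * (M + M.T)                  # remove round-off asymmetry
    _, vecs = np.linalg.eigh(M)          # eigenvalues in ascending order
    z = np.linalg.solve(L.T, vecs[:, 0]) # eigenvector of the smallest eigenvalue

    # Renormalize so that w is a unit vector (assumes w is not numerically zero).
    w, b = z[:-1], z[-1]
    scale = np.linalg.norm(w)
    return w / scale, b / scale
```

To build a series of hyperplanes, the same routine would be called again on the orthogonal complements `X_pos - np.outer(X_pos @ w, w)` and `X_neg - np.outer(X_neg @ w, w)`, following the deflation described for this stage.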

The dual of the hypersphere problem (4) is

$$\arg\min_{\alpha} \; \sum_{i,j} \alpha_i \alpha_j \langle x_i, x_j \rangle - \sum_i \alpha_i \|x_i\|^2 \quad \text{s.t.} \quad \sum_i \alpha_i = 1, \;\; 0 \le \alpha_i \le \gamma, \qquad (5)$$

where $\langle \cdot, \cdot \rangle$ represents the (possibly kernelized) inner product. The $\alpha_i$ are Lagrange multipliers and $\gamma \in [1/n, 1]$ is a ceiling parameter that can be set to a value less than one to reduce the influence of outliers. The objective function is convex so a global minimum exists. In the kernelized case, the dual formulation yields a sparse solution in terms of the support vectors (the examples lying exactly on the hypersphere), which makes evaluating the model more efficient.

If we are given negative training samples, they can be used to improve the model by forcing them to lie outside of the bounding hypersphere. Suppose that we have $n_1$ training samples from the positive (object) class enumerated by indices $i, j$, and $n_2$ from the negative (background) class enumerated by $l, m$. The most compact hypersphere that includes the positive samples and excludes the negative ones can be found by solving the quadratic programming problem

$$\arg\min_{c,\, r \ge 0,\, \xi \ge 0} \Big( r^2 + \gamma_1 \sum_i \xi_i + \gamma_2 \sum_l \xi_l \Big) \quad \text{s.t.} \quad \|x_i - c\|^2 \le r^2 + \xi_i, \;\; i = 1, \dots, n_1, \qquad \|x_l - c\|^2 \ge r^2 - \xi_l, \;\; l = 1, \dots, n_2, \qquad (6)$$

or its dual

$$\arg\min_{\alpha} \; \sum_{i,j} \alpha_i \alpha_j \langle x_i, x_j \rangle + \sum_{l,m} \alpha_l \alpha_m \langle x_l, x_m \rangle - 2 \sum_{l,j} \alpha_l \alpha_j \langle x_l, x_j \rangle + \Big( \sum_l \alpha_l \|x_l\|^2 - \sum_i \alpha_i \|x_i\|^2 \Big) \quad \text{s.t.} \quad \sum_i \alpha_i - \sum_l \alpha_l = 1, \;\; 0 \le \alpha_i \le \gamma_1, \;\; 0 \le \alpha_l \le \gamma_2. \qquad (7)$$

This one-class model differs from a classical SVM in that it finds a closed hypersphere surrounding the object class, not a linear hyperplane separating it from the background. We find that the inclusion of negative training samples significantly improves the performance of our detection cascades (particularly in cases where there are relatively few positive training examples, as in the person detector below), so we use the formulation (7) above. Like (5), (7) is a quadratic program with a global minimum.

Large-scale problems can be solved using Sequential Minimal Optimization (SMO) [20]. In particular, it is not necessary to construct the full Hessian matrix: only the Hessians of the active sets of samples need to be considered in each iteration. We revised the CMP quadratic programming software to allow us to solve problems with millions of variables in a reasonable time. Given the optimal multipliers $\alpha$, the center of the bounding hypersphere can be computed as

$$c = \sum_i \alpha_i x_i - \sum_l \alpha_l x_l, \qquad (8)$$

after which its radius can be found using the constraints from (6). During object detection, we find the distance from the feature vector of each local window to the center of the hypersphere, rejecting the sample as a negative if this distance is greater than the radius. This can be done very efficiently as it only requires a vector subtraction and a norm computation.

3.3. Nonlinear One-Class Classifier

The third stage of our cascade contains a single kernelized hypersphere classifier that makes the final decisions. Kernelization allows finer discrimination than the preceding linear stages in return for increased computation for the few examples that reach this stage. The hypersphere model can be kernelized simply by replacing the inner products $\langle x_i, x_j \rangle$ with kernel evaluations $k(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$ in (7), where $\phi(\cdot)$ is the implicit feature space mapping implemented by the kernel. Training remains straightforward, but evaluating distances from incoming samples $x$ to the center of the bounding hypersphere requires kernel evaluations $k(x, x_i)$ against the support vectors $x_i$. This makes kernelized SVDD significantly more expensive than its linear counterpart as the number of support vectors can be considerable (albeit typically much smaller than the number of training samples). However in practice we find that our kernelized SVDD classifiers are an order of magnitude faster than the analogous kernelized SVMs because they have far fewer support vectors. The SVMs typically have many negative support vectors owing to the need to reject large numbers of hard negatives, whereas the SVDD support vectors come predominantly from the positive training samples. This makes kernel SVDD more suitable for use in efficient detection cascades than kernel SVM.

4. Face Detection Experiments

We will evaluate our approach on face detection and human detection tasks. First consider face detection.
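Before the experiments, the linear and kernelized hypersphere tests described above can be made concrete with a small NumPy sketch. It is illustrative only and is not the authors' code. It assumes the multipliers have already been obtained from a QP solver, stores them as signed coefficients (positive for object-class support vectors, negative for background ones), and uses a Gaussian RBF kernel parameterized as $\exp(-\|a-b\|^2 / 2\sigma^2)$, for which $k(x, x) = 1$. All function and argument names are hypothetical.

```python
import numpy as np

def linear_center(X_pos, alpha_pos, X_neg, alpha_neg):
    """Hypersphere center: sum_i alpha_i x_i - sum_l alpha_l x_l."""
    return alpha_pos @ X_pos - alpha_neg @ X_neg

def linear_reject(x, c, r):
    """Linear stage: reject when the distance to the center exceeds the radius."""
    return bool(np.linalg.norm(np.asarray(x, dtype=float) - c) > r)

def rbf_kernel(A, B, sigma):
    """Gaussian RBF kernel matrix between the rows of A and the rows of B."""
    d2 = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))

def kernel_sq_dist(x, SV, coef, sigma):
    """Squared feature-space distance from phi(x) to the kernelized center.

    SV: support vectors (rows); coef: signed multipliers, one per support vector.
    Expands ||phi(x) - c||^2 = k(x,x) - 2 sum_s coef_s k(x, x_s)
                               + sum_{s,t} coef_s coef_t k(x_s, x_t).
    """
    x = np.asarray(x, dtype=float)[None, :]
    k_xs = rbf_kernel(x, SV, sigma)[0]    # one kernel evaluation per support vector
    const = coef @ rbf_kernel(SV, SV, sigma) @ coef  # independent of x, can be cached
    return 1.0 - 2.0 * float(coef @ k_xs) + float(const)
```

The per-window cost of the kernelized test is one kernel evaluation per support vector, which is why the number of support vectors dominates the running time of the final stage.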
We tested on two datasets: the Faces in the Wild image dataset [11] and ESOGU, a new frontal face detection dataset that includes 285 high-resolution color images with 970 annotated frontal faces. The images in Faces in the Wild are somewhat idealized in the sense that they are relatively small and normalized such that most of the faces appear near the middle of the image with similar scales. To provide more realistic testing on images from real world consumer snapshot collections, we therefore developed the ESOGU (ESkisehir OsmanGazi University) dataset, whose images contain faces appearing at a wide range of image positions and scales, and also complex backgrounds, occlusions and illumination variations (c.f. Fig. 3 bottom). (For our code, see mlcv/softwares.html.)

Table 1. % Detection Rates (DR), numbers of False Positives (FP) and % Average Precision (AP) scores for our cascade detectors on the Faces in the Wild and ESOGU Faces datasets. Rows: Cascade I, Cascade II, Cascade III. Column groups: LBP+HOG, LTP+HOG, LBP+LTP and LBP+LTP+HOG features, each reporting DR, FP and AP. The Cascade I detectors include only the linear hyperplane and hypersphere stages. The Cascade II and Cascade III ones respectively add a kernelized hypersphere classifier and a kernelized SVM as the final stage. For comparison, the OpenCV Viola-Jones detector [27] has DR 95.80%, FP 1074 and AP 98.50% on Faces in the Wild and DR 75.36%, FP 103 and AP 98.60% on ESOGU, and the FDLib detector [14] has DR 59.28% and FP 5393 on Faces in the Wild and DR 63.81% and FP 344 on ESOGU. (We cannot report AP scores for the FDLib detector as it does not return a real-valued confidence measure for its detections.) Note that the Faces in the Wild results are probably biased towards Viola-Jones because a detector of this kind was used to obtain the initial detections for this dataset [11].

Figure 2. Precision-Recall curves (Viola & Jones, Cascade I, Cascade II, Cascade III) for LBP+HOG features on the Faces in the Wild (left) and ESOGU Faces (right) datasets.
Training: Given the limitations of the current publicly available face detector training datasets, we collected subimages of frontal upright faces from the web for training. Most of these are from real-world images and there is a high degree of variability in appearance and lighting conditions. The face images are rescaled and aligned to a fixed resolution (further reductions in resolution reduce the performance). For the negative set, we randomly selected windows from face-free regions with complex backgrounds. We tested several visual features including LBP [1], LTP [24], HOG [6], and combinations of these. The combinations gave better results than the individual descriptors. For LBP and LTP, we divided the images into four non-overlapping quadrants and extracted descriptors from each region using circular (8,1) neighborhoods. The resulting histograms were normalized to sum 1 and concatenated to produce the final feature vector. For HOG, we used a grid of 6×6 pixel cells with 9 bins of unsigned gradient orientation over color images, grouping each cell into overlapping 2×2 cell blocks for normalization as in [9]. Classifiers trained with the initial samples were used to scan a set of thousands of images in order to collect both false negatives and false positives. These hard examples were added to the training set, increasing the number of positive examples to about 20k and the number of negative ones to about 93k, and the methods were retrained. The final size of the training set is thus 113k.

When scanning an image the detection window is stepped by 3 pixels horizontally and 4 pixels vertically, and we scan an image pyramid whose scales are spaced by a constant factor. For nonmaximum suppression we sort the surviving windows by score, then iteratively take the first and eliminate all detections overlapping it. To penalize narrow supports, groups with fewer than 4 overlapping windows (8 in the linear case) are suppressed, and otherwise log(# participating windows)/3 is heuristically added to the score.

Figure 3. Some examples of the output of our cascade detectors on images from the Faces in the Wild (top) and ESOGU Faces (bottom) datasets. Most of the faces are correctly detected, but there are a few missed detections and false positives.

Detectors: We trained three kinds of cascades. The first includes only the linear hyperplane and linear hypersphere stages, with three linear hyperplane classifiers in the first stage. The second and third cascades are three-stage ones, respectively with kernelized one-class hypersphere classifiers and kernelized SVMs in the final stage. For these, only two linear hyperplane classifiers were included in the first stage. We used Gaussian RBF kernels as they produced better results than RBF kernels based on the χ² histogram distance. The kernelized hypersphere classifiers always had fewer support vectors than the corresponding kernelized SVMs. For example for LBP+HOG features, the hypersphere classifier had 2398 support vectors while the SVM had substantially more. On average, the final stages of cascades using kernelized hypersphere classifiers were 8 times faster than ones using kernelized SVMs.

We compared our results with those of the OpenCV Viola-Jones cascade [27] and the FDLib detector [14], using the PASCAL VOC criteria [7] to assess detection performance. Briefly, detections are considered to be true positives if the bounding box $R$ returned by the classifier overlaps the bounding box $Q$ of the ground truth annotation by more than 50%, where overlap is measured as $\mathrm{Area}(Q \cap R) / \mathrm{Area}(Q \cup R)$. We report the Detection Rate (DR) and total number of False Positives (FP) at the default detector threshold (the one chosen by the training algorithm), as well as the Average Precision (AP) (i.e. area under curve) over the whole Precision-Recall curve. DR is the ratio of the number of correctly detected faces to the total number of labeled faces in the test set.

Results: The results are given in Table 1 and Fig. 2, and Fig. 3 shows some examples of detections on the two face test sets. For Faces in the Wild, cascades using final kernel SVMs have slightly higher APs than ones using final kernel hypersphere classifiers, but this is reversed for ESOGU and in any case the differences are very small. For both datasets the Viola-Jones method comes a close third to the two cascades, while the FDLib detector gives poor results. The best feature set for Faces in the Wild is LTP+HOG whereas the best for ESOGU is LBP+HOG, but for both datasets the feature combinations LBP+HOG, LTP+HOG and LBP+LTP+HOG all give similar results, with LBP+LTP and the individual feature sets (not shown) being weaker. This suggests that HOG manages to capture useful cues (probably shape information) that LBP and LTP ignore, and conversely LBP and LTP capture cues (probably local texture) that HOG ignores. Given that there is no clear winner among LBP+HOG, LTP+HOG and LBP+LTP+HOG, we recommend LBP+HOG for this application as it has lower computational complexity than the other combinations.

To get an idea of the degree of pruning provided by the cascades, for LBP+HOG features on the ESOGU dataset, of the 23M windows scanned, 2.9M (13%) passed the first hyperplane classifier, 0.8M (3.6%) passed the second hyperplane, 64k (0.28%) passed the linear hypersphere stage and 21k (0.09%) passed the kernel hypersphere stage.
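The overlap criterion and the nonmaximum suppression heuristic described in this section can be sketched in plain Python as follows. This is a hedged illustration, not the authors' code. It assumes boxes are `(x1, y1, x2, y2)` tuples, treats any positive overlap with the leading window as "overlapping" (the text does not give a threshold, so `merge_overlap` is a hypothetical parameter), and counts the leading window as part of its group.

```python
import math

def overlap(q, r):
    """PASCAL VOC overlap: Area(Q intersect R) / Area(Q union R) for (x1, y1, x2, y2) boxes."""
    iw = min(q[2], r[2]) - max(q[0], r[0])
    ih = min(q[3], r[3]) - max(q[1], r[1])
    if min(iw, ih) > 0:
        inter = iw * ih
        union = (q[2] - q[0]) * (q[3] - q[1]) + (r[2] - r[0]) * (r[3] - r[1]) - inter
        return inter / union
    return 0.0

def is_true_positive(detection, ground_truth, threshold=0.5):
    """A detection counts as correct if it overlaps the annotation by more than 50%."""
    return overlap(detection, ground_truth) > threshold

def greedy_nms(boxes, scores, min_support=4, merge_overlap=0.0):
    """Greedy suppression: take the best window, absorb everything overlapping it.

    Groups with fewer than min_support windows are dropped; otherwise
    log(group size) / 3 is added to the leading window's score.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    used = [False] * len(boxes)
    kept = []
    for i in order:
        if used[i]:
            continue
        group = [j for j in order
                 if not used[j] and (j == i or overlap(boxes[i], boxes[j]) > merge_overlap)]
        for j in group:
            used[j] = True
        if len(group) >= min_support:
            kept.append((boxes[i], scores[i] + math.log(len(group)) / 3.0))
    return kept
```

With `min_support=4` an isolated high-scoring window is discarded, while a cluster of four mutually overlapping windows yields one detection whose score is raised by log(4)/3.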
Nonmaximum suppression then merged these into 915 detections (an average of 22.7 windows per detection), of which 897 were correct.

5. Human Detection Experiments

Training: We used the INRIA Person dataset [6] for our human ("pedestrian") detection experiments. LBP+HOG features were used, with a grid of 8×8 pixel cells for HOG and the detection window divided into a 5×3 set of rectangular regions for LBP. We artificially enlarged the positive training set by slightly perturbing the locations provided in the ground-truth annotations, and randomly sampled negative windows from the provided negative (person-free) training images. For each method tested, initial detectors trained on these examples were used to scan all of the training images to collect hard examples, followed by retraining. During detection, the search window was shifted by steps of 4 pixels horizontally and 6 pixels vertically, and the pyramid scales were spaced by a constant factor. We trained the same three kinds of cascades as in the face case. The final kernel hypersphere classifier had 2818 support vectors whereas the final kernel SVM had 10 times more (28,251).

Table 2. % Detection Rates, total numbers of False Positives, and % Average Precision scores for our detectors on the INRIA Person dataset. Rows: Cascade I, Cascade II, Cascade III, Dalal & Triggs [6], Hussain & Triggs [12], Felzenszwalb et al. [9]. Columns: Det. Rate, False Pos., Ave. Prec.

Figure 4. Precision-Recall curves (Cascade I, Cascade II, Cascade III) for our detectors on the INRIA Person dataset.

Results: The Detection Rates and Average Precision scores for our person detectors are given in Table 2 and the corresponding Precision-Recall curves are given in Figure 4. For comparison, we also include the published APs of Dalal & Triggs [6] (HOG with linear SVM), Hussain & Triggs [12] (a two-stage linear + quadratic single root latent SVM classifier using HOG+LBP+LTP) and Felzenszwalb et al. [9, 8] (a linear latent SVM classifier using multiple roots and parts over HOG).
All of our cascades give better results than these methods, despite the fact that they use only a single root and no parts. The cascade with the final kernel SVM clearly dominates, giving the best Detection Rate, False Positives and AP scores and offering about a 9% improvement in AP over the previous state of the art. The cascade with the final kernel hypersphere classifier comes second, with its final stage being about 20 times faster than that of the kernel SVM based one. The two-stage cascade based on linear classifiers also achieves very respectable results, which suggests that our strategy of bounding the region occupied by the positive class more tightly than the simple linear separator provided by an SVM is bearing fruit.

6. Summary and Conclusions

We have developed sliding window object detectors based on short cascades of linear and nonlinear nearest-convex-model classifiers, arguing that the one-class nature of the latter provides an attractive combination of accuracy and speed. Our cascades have three stages: a set of linear distance-to-hyperplane classifiers for fast pruning of easy negatives; a linear hypersphere classifier for additional pruning; and finally (and optionally) either a kernelized hypersphere classifier or a kernelized SVM. We tested our detectors on two challenging face datasets and the INRIA Person dataset, concluding that the cascade methods are very promising relative to existing approaches. In particular, the cascades with final kernelized classifiers achieve high Average Precisions, with the hypersphere ones having accuracy similar to or better than the SVM ones on the face datasets and somewhat lower on the INRIA dataset, but being an order of magnitude faster in both cases because they have far fewer support vectors. For human detection, the two-stage linear cascade already gives much better performance than well-established linear SVM detectors, which suggests that including multiple stages of linear or hypersphere pruning may be a useful strategy for improving other existing object detectors.

Future work: Unlike [8], our current detectors do not incorporate multiple roots, parts, and latent position and scale adjustments during training (although the use of perturbed training examples partially compensates for the latter). We are currently working on including these refinements.

Acknowledgments: We would like to thank Jifeng Shen for supplying some of the training images. This work was funded in part by the Scientific and Technological Research Council of Turkey (TUBİTAK) under Grant number EEEAG-109E279.

References

[1] T. Ahonen, A. Hadid, and M. Pietikainen. Face description with local binary patterns: Application to face recognition. IEEE T-PAMI, 28(12).
[2] D. Aldavert, A. Ramisa, R. L. Mantaras, and R. Toledo. Fast and robust object segmentation with the integral linear classifier. In CVPR.
[3] Y. Amit and D. Geman. A computational model for visual selection. Neural Computation, 11.
[4] H. Bay, A. Ess, T. Tuytelaars, and L. V. Gool. SURF: Speeded up robust features. CVIU, 110(3).
[5] S. Belongie, J. Malik, and J. Puzicha. Shape matching and object recognition using shape contexts. IEEE T-PAMI, 24(24).
[6] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR.
[7] M. Everingham, L. Van Gool, C. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge. IJCV, 88(2).
[8] P. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models. IEEE T-PAMI, 32(9), Sept.
[9] P. Felzenszwalb, D. McAllester, and D. Ramanan. A discriminatively trained, multiscale deformable part model. In CVPR.
[10] H. Harzallah, F. Jurie, and C. Schmid. Combining efficient object localization and image classification. In ICCV.
[11] G. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, Oct.
[12] S. Hussain and B. Triggs. Feature sets and dimensionality reduction for visual object detection. In BMVC.
[13] H. Jin, Q. Liu, and H. Lu. Face detection using one-class-based support vectors. In International Conference on Automatic Face and Gesture Recognition.
[14] W. Kienzle, G. Bakir, M. Franz, and B. Scholkopf. Face detection: efficient and rank deficient. In NIPS.
[15] K. Levi and Y. Weiss. Learning object detection from a small number of examples: the importance of good features. In CVPR.
[16] D. G. Lowe. Distinctive image features from scale invariant keypoints. IJCV, 60:91-110.
[17] O. L. Mangasarian and E. W. Wild. Multisurface proximal support vector machine classification via generalized eigenvalues. IEEE T-PAMI, 28:69-74.
[18] K. Mele and J. Maver. Object recognition using hierarchical SVMs. In Computer Vision Winter Workshop.
[19] C. Papageorgiou and T. Poggio. A trainable system for object detection. IJCV, 38:15-33.
[20] J. C. Platt. Fast training of support vector machines using sequential minimal optimization. Advances in Kernel Methods: Support Vector Learning, Cambridge, MA, MIT Press.
[21] H. Rowley, S. Baluja, and T. Kanade. Neural network-based face detection. IEEE T-PAMI, 20:22-38.
[22] H. A. Rowley, S. Baluja, and T. Kanade. Neural network-based face detection. IEEE T-PAMI, 20:23-38.
[23] L. Shams and J. Spoelstra. Learning Gabor-based features for face detection. In World Congress in Neural Networks.
[24] X. Tan and B. Triggs. Enhanced local texture feature sets for face recognition under difficult lighting conditions. IEEE Transactions on Image Processing, 19.
[25] D. M. J. Tax and R. P. W. Duin. Support vector data description. Machine Learning, 54:45-66.
[26] A. Vedaldi, V. Gulshan, M. Varma, and A. Zisserman. Multiple kernels for object detection. In ICCV.
[27] P. Viola and M. J. Jones. Robust real-time face detection. IJCV, 57(2).
[28] X. Wang, T. X. Han, and S. Yan. A HOG-LBP human detector with partial occlusion handling. In ICCV, 2009.
