Online Detection and Classification of Moving Objects Using Progressively Improving Detectors

Omar Javed, Saad Ali, Mubarak Shah
Computer Vision Lab, School of Computer Science, University of Central Florida, Orlando, FL 32816

Abstract

Boosting based detection methods have been used successfully for robust detection of faces and pedestrians. However, a very large number of labeled examples is required to train such a classifier. Moreover, once trained, the boosted classifier cannot adjust to the particular scenario in which it is employed. In this paper, we propose a co-training based approach to continuously label incoming data and use it for online updates of a boosted classifier that was initially trained from a small labeled example set. The main contribution of our approach is that it is an online procedure in which separate views (features) of the data are used for co-training, while the combined view (all features) is used to make classification decisions in a single boosted framework. The features used for classification are derived from Principal Component Analysis of the appearance templates of the training examples. In order to speed up the classification, background modeling is used to prune away stationary regions in an image. Our experiments indicate that, starting from a classifier trained on a small training set, significant performance gains can be made through online updating from unlabeled data.

1. Introduction & Related Work

The detection of moving objects, specifically people or vehicles, in a scene is of utmost importance for most surveillance systems. In recent years, considerable progress has been made in the detection of faces and pedestrians through supervised classification methods. In this context, a variety of approaches have been used, including naive Bayes classifiers [9], Support Vector Machines [7] and AdaBoost [10]. Specifically for surveillance related scenarios, AdaBoost is particularly suitable since it has been demonstrated to give high detection rates using simple Haar-like features in real time [10]. However, one problem in training such a classifier is that an extremely large number of training examples is required to ensure good performance in the test phase. For example, Zhang et al. [11] used around 11,000 positive and 100,000 negative labeled images for face detection. Another issue related to the use of boosted classifiers in the surveillance scenario is that the classifier parameters are fixed in the test stage. However, it is preferable to have a system that automatically learns from the examples in a specific scenario.

One possible way around the requirement of a large labeled training set is the co-training approach proposed by Blum and Mitchell [2]. The basic idea is to train two classifiers on two independent views (features) of the same data, using a relatively small number of examples, and then to use each classifier's predictions on the unlabeled examples to enlarge the training set of the other. Blum and Mitchell prove that co-training can find a very accurate classification rule, starting from a small quantity of labeled data, if the two feature sets are statistically independent. However, this assumption does not hold in many realistic scenarios [8]. Levin et al. [5] use the co-training framework in the context of boosted binary classifiers. Two boosted classifiers are employed for co-training. If one classifier predicts a label for a certain example with high confidence, then that labeled example is added to the training set of the other; otherwise the example is ignored. One of the two boosted classifiers employed for co-training uses background subtracted image regions, while the other classifier is trained on the image grey-levels directly. Note that the features are closely related.
However, their approach empirically demonstrates that co-training is still possible even when the independence assumption does not hold. The co-training based learning approach has also been used successfully for text retrieval and classification by Collins and Singer [3]. One important point to note is that co-training is not a classification framework; it is actually a method for training a classifier from unlabeled data.

Co-training requires two separate views of the data for labeling; however, a better classification decision can be made by combining the two views of the data in a fully trained classifier. Thus, co-training is used to train a classifier in an offline setting. Once training is complete, the combined view is used to make classification decisions.

The principal contribution of our approach is that it is an online method in which separate views (features) of the data are used for co-training, while the combined view is used to make classification decisions in a single framework. To achieve this, we exploit the fact that the boosted classifier is a linear combination of simpler base classifiers and that the adaptive boosting selection mechanism discourages high correlation among the selected features. The co-training is performed through the base classifiers, i.e., if a particular unlabeled example is labeled very confidently by a small subset of the base classifiers, then it is used to update both the base classifiers and the boosting parameters using an online variant of the multi-class AdaBoost.M1 algorithm [6]. Note that only a few of the observed examples might qualify for co-training. Meanwhile, the classification decision for each example is made by the boosted classifier, whose parameters have been updated from the labeled examples observed so far. The advantage of this approach is that the classifier is attuned to the characteristics of a particular scene. Note that a classifier trained to give the best average performance in a variety of scenarios will usually be less accurate for a particular scene than a classifier trained specifically for that scene. Obviously, the specific classifier would not perform well in other scenarios and thus would not have widespread application. Our proposed approach tackles this dilemma by using a classifier trained on a general scenario that can automatically label examples observed in a specific scene and use them to fine-tune its parameters online.

We demonstrate the performance of our classifier in the context of detection of pedestrians and vehicles observed through fixed cameras. In the first step of our detection framework, we use a background model [1] to select regions of interest in the scene. The boosted classifier searches within these regions and classifies the data into pedestrians, vehicles and non-stationary background. Co-training decisions are made at the base classifier level, and both the base classifiers and the boosting parameters are updated online.

In the next section we discuss the features used for object representation and the base classifiers learned from these features. In Section 3, we describe the co-training framework in the context of an online boosted classifier. In Section 4, we present the results, and we give concluding remarks in Section 5.

Figure 1: First row: the top three eigenvectors of the pedestrian subspace. Second row: the top eigenvectors of the vehicle subspace.

2. Feature Selection and Base Classifiers

One approach for object representation in boosted classifiers is to use local Haar-like features. The advantage of using Haar features is that they can be calculated very efficiently [10]. However, it has been shown by Zhang et al. [11], in the context of face detection, that base classifiers trained from global features are more reliable and that the resulting boosted classifier has a higher detection rate. The drawback is that global features are usually more expensive to compute. In our approach, however, background subtraction is used to discard most of the stationary regions in an image before further processing, so we can afford to use global features for classification and still meet real-time processing requirements.

We employ Principal Component Analysis (PCA) to obtain the global features. The principal component model is formed by taking m example images of dimensionality d in column-vector format, subtracting the mean, and computing the d × d covariance matrix C. The covariance matrix is then diagonalized via an eigenvalue decomposition C = Φ E Φ^T, where Φ is the eigenvector matrix and E is the corresponding diagonal matrix of eigenvalues. Only m eigenvectors, corresponding to the m largest eigenvalues, are used to form a projection matrix S_m onto a lower-dimensional subspace. We construct a pedestrian subspace with a d × m_1 dimensional projection matrix S_{m_1} and a vehicle subspace with a d × m_2 dimensional projection matrix S_{m_2} by performing PCA on the respective training images. The parameters m_1 and m_2 are chosen such that the eigenvectors account for 99% of the variance in the pedestrian and vehicle data, respectively. The top three eigenvectors for pedestrians and vehicles are shown in Figure 1. The features for the base learners are obtained by projecting each training example r into the two subspaces, yielding a feature vector v = [v_1, ..., v_{m_1}, v_{m_1+1}, ..., v_{m_1+m_2}], where [v_1, ..., v_{m_1}] = r^T S_{m_1} and [v_{m_1+1}, ..., v_{m_1+m_2}] = r^T S_{m_2}.
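
To make the subspace construction concrete, the sketch below builds a projection matrix from example images and projects a region into both subspaces. This is only an illustrative NumPy sketch, not the authors' implementation; the function names, the variance-cutoff argument, and the data layout (one flattened image per row) are assumptions made here for exposition.

import numpy as np

def build_subspace(images, variance_kept=0.99):
    """Form a PCA projection matrix from example images.

    images: (m, d) array with one flattened example image per row.
    Returns the (d, k) projection matrix whose columns are the eigenvectors
    accounting for `variance_kept` of the total variance.
    """
    centered = images - images.mean(axis=0)
    C = np.cov(centered, rowvar=False)            # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)          # eigenvalue decomposition C = Phi E Phi^T
    order = np.argsort(eigvals)[::-1]             # sort eigenvalues, largest first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    k = int(np.searchsorted(np.cumsum(eigvals) / eigvals.sum(), variance_kept)) + 1
    return eigvecs[:, :k]                         # projection matrix S_k

def extract_features(r, S_ped, S_veh):
    """Project a flattened image region r into the pedestrian and vehicle
    subspaces and concatenate the coefficients: v = [r^T S_m1, r^T S_m2]."""
    return np.concatenate([r @ S_ped, r @ S_veh])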

Online Co-Train:
- If at least k base classifiers confidently predict a label c_p for an incoming example x, where p ∈ {1, ..., numclasses}, then:
  - if ( Σ_{n: h_n(x) = c_p} log(1/β_n) ) / ( Σ_{n=1}^{N} log(1/β_n) ) < T^{ada}_{c_p}:
    - OnlineBoost(H_N, x, c_p)
    - add the example with assigned label c_p to the validation set
    - for j = 1, ..., N and i = 1, ..., numclasses: T^{base}_{j,c_i} = maximum posterior probability, for class c_i by h_j, of a negative example in the validation set
    - for i = 1, ..., numclasses: T^{ada}_{c_i} = maximum normalized H_N score, for class c_i, of a negative example in the validation set

OnlineBoost(H_N, x, label):
- Set the example's initial weight λ_x = 1.
- For each base model h_n in the boosted classifier:
  1. Set z by sampling Poisson(λ_x).
  2. Do z times: h_n ← OnlineBase(h_n, x, label).
  3. If h_n(x) is the correct label: λ^{sc}_n = λ^{sc}_n + λ_x, ε_n = λ^{sw}_n / (λ^{sc}_n + λ^{sw}_n), λ_x = λ_x · 1/(2(1 − ε_n)).
  4. Else: λ^{sw}_n = λ^{sw}_n + λ_x, ε_n = λ^{sw}_n / (λ^{sc}_n + λ^{sw}_n), λ_x = λ_x · 1/(2 ε_n).
  5. Calculate β_n = ε_n / (1 − ε_n).

Figure 2: The co-training method. Note that both T^{base} and T^{ada} are computed automatically from the validation set. The subfunction OnlineBoost() was proposed in [6]. λ^{sc}_n is the sum of weights of examples that were classified correctly by the base model at stage n, while λ^{sw}_n is the sum for incorrectly classified examples.

We construct each base classifier from a single subspace coefficient; thus we have a total of m_1 + m_2 base classifiers. We use the Bayes classifier as our base classifier. Let c_1, c_2 and c_3 represent the pedestrian, vehicle and non-stationary background classes, respectively. The classification decision by the q-th base classifier is c_i if P(c_i | v_q) > P(c_j | v_q) for all j ≠ i. The posterior is given by Bayes' rule, i.e., P(c_i | v_q) = p(v_q | c_i) P(c_i) / p(v_q). The pdf p(v_q | c_i) is approximated through a smoothed 1D histogram of the q-th subspace coefficients obtained from the training data. The denominator p(v_q) is calculated as Σ_{i=1}^{3} p(v_q | c_i) P(c_i). Note that the sum of the posterior probabilities over all classes for a particular coefficient instance is one, i.e., for the three-class case, Σ_{i=1}^{3} P(c_i | v_q) = 1.

Once the base classifiers are learned, the next step is to train the boosted classifier from the initial set of labeled data. We use the AdaBoost.M1 algorithm [4] for learning the boosted classifier. In the next section, we discuss the co-training framework for augmenting the initial training set.
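
As a concrete illustration of one such base classifier, the sketch below maintains a smoothed 1D histogram per class for a single subspace coefficient and evaluates the posterior P(c_i | v_q) by Bayes' rule. The class names, bin count, smoothing width and priors are illustrative assumptions, not values taken from the paper.

import numpy as np
from scipy.ndimage import gaussian_filter1d

class HistogramBayesBase:
    """Bayes base classifier built from one subspace coefficient v_q.
    Class indices: 0 = pedestrian, 1 = vehicle, 2 = non-stationary background."""

    def __init__(self, lo, hi, n_bins=32, smooth_sigma=1.0):
        self.edges = np.linspace(lo, hi, n_bins + 1)
        self.counts = np.ones((3, n_bins))      # one histogram per class (Laplace start)
        self.sigma = smooth_sigma

    def _bin(self, v_q):
        return int(np.clip(np.digitize(v_q, self.edges) - 1, 0, self.counts.shape[1] - 1))

    def update(self, v_q, label):
        """Online update: add the example to the class histogram and keep counts."""
        self.counts[label, self._bin(v_q)] += 1

    def posterior(self, v_q, priors=(1/3, 1/3, 1/3)):
        """P(c_i | v_q) = p(v_q | c_i) P(c_i) / sum_j p(v_q | c_j) P(c_j)."""
        smoothed = gaussian_filter1d(self.counts, self.sigma, axis=1)
        likelihoods = smoothed[:, self._bin(v_q)] / smoothed.sum(axis=1)
        post = likelihoods * np.asarray(priors)
        return post / post.sum()

    def predict(self, v_q):
        return int(np.argmax(self.posterior(v_q)))
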
3. The Co-Training Framework

Boosting is an iterative method for finding a very accurate classifier by combining many base classifiers, each of which may be only moderately accurate. In the training phase of the AdaBoost algorithm, the first step is to construct an initial distribution of weights over the training set. The boosting mechanism then selects the base classifier that gives the least error, where the error is proportional to the weights of the misclassified data. Next, the weights associated with the data misclassified by the selected base classifier are increased. Thus the algorithm encourages the selection of another classifier that performs better on the misclassified data in the next iteration. If the base classifiers are constructed such that each classifier is associated with a different feature, then the boosting mechanism will tend to select features that are not completely correlated. Note that, for co-training, we require two classifiers trained on separate features of the same data. Therefore, we propose to label the unlabeled data by using the base classifiers selected by AdaBoost.

Basically, if a base classifier selected through the boosting mechanism confidently predicts the label of an example, then we can add this example to our training set to update the rest of the classifiers. The confidence thresholds for the base classifiers can be determined from the training data or by using a small validation set. Suppose H_N is the strong classifier learned through the AdaBoost.M1 [4] algorithm, and let h_j, where j ∈ {1, ..., N}, be the base classifiers selected by the boosting algorithm. In order to set confidence thresholds on the labels given by the base classifiers, we use a validation set of labeled images. For the class c, the confidence threshold T^{base}_{j,c} is set to the highest posterior probability achieved by a negative example. This means that all examples in the validation set labeled as c by h_j with a probability higher than T^{base}_{j,c} actually belong to the class c. Thus, during the online phase of the classifier, any example which has a probability higher than T^{base}_{j,c} is very likely to belong to the class c. The thresholds for all base classifiers selected by the boosting algorithm are calculated similarly.
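
A rough sketch of how the per-class confidence thresholds T^{base}_{j,c} could be computed from the validation set is given below. The function name, data layout, and the `posterior` interface (reusing the hypothetical histogram classifier sketched above) are assumptions made for illustration, not the authors' code.

import numpy as np

def calibrate_base_thresholds(base_classifiers, validation_set, num_classes=3):
    """T_base[j, c]: highest posterior that base classifier h_j assigns to class c
    on a validation example whose true label is NOT c (a negative example for c).
    validation_set: list of (feature_vector, label) pairs, where feature_vector[j]
    is the j-th subspace coefficient."""
    T_base = np.zeros((len(base_classifiers), num_classes))
    for j, h in enumerate(base_classifiers):
        for v, label in validation_set:
            post = h.posterior(v[j])              # posterior over the classes
            for c in range(num_classes):
                if c != label:                    # negative example for class c
                    T_base[j, c] = max(T_base[j, c], post[c])
    return T_base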

Ideally, if a single base classifier confidently predicts a label with a probability higher than the established threshold, then we should assume that the label is correct and use that example for further training of the classifier. However, training on even a few wrongly labeled examples can severely degrade the performance of the classifier. Therefore, we choose to be more conservative and only select an unlabeled example if k, where k ≥ 0.1N, base classifiers confidently label the example.

It would also be inefficient to use every confidently labeled example for online training. An example labeled through co-training will improve the performance of the boosted classifier only if it has a small or negative margin, i.e., if the example lies close to the decision boundary in the solution space. If the example has been labeled unambiguously by the boosted classifier, i.e., it has a large margin, then using it for training will have little effect on the boosted classifier. Thus, we need unlabeled examples which have a small (or negative) margin and are also confidently labeled by the base classifiers. The limits on the score of the boosted classifier can also be established through the validation set. The score of an example x for the label c is computed by AdaBoost.M1 as Σ_{n: h_n(x) = c} log(1/β_n), where β_n is the coefficient of the n-th classifier selected by the algorithm. The label that gets the highest score is assigned to the example. For the class c, the threshold T^{ada}_c, which determines the usefulness of employing an example for retraining, is set to the highest normalized score achieved by a negative example. Thus, an example assigned the label c by the base classifiers should only be used for retraining if its normalized score from the boosted classifier is less than T^{ada}_c. Once an example has been labeled and found to have a small margin, the next issue is to use this example for updating the boosting parameters and the base classifiers online. The co-training and online updating algorithm is given in Figure 2.

3.1 Online Learning

Note that an online algorithm does not need to look at all the training data at once; rather, it processes each training instance without the need for storage and maintains a current hypothesis learned from the training examples encountered so far. To this end we use the online boosting algorithm proposed by Oza and Russell [6]. The inputs to the algorithm are the current boosted classifier H_N, the constituent base classifiers, and the parameters λ^{sc}_n and λ^{sw}_n, where n = 1, ..., N. λ^{sc}_n and λ^{sw}_n are the sums of the weights of the correctly classified and misclassified examples, respectively, for each of the N base classifiers. The main idea of the algorithm is to update each base classifier and the associated boosting parameter using the incoming example. The example is assigned a weight λ at the start of the algorithm. For the first iteration, the base classifier is updated z times, where z is sampled from Poisson(λ). Then, if h_1 misclassifies the example, λ^{sw}_1, the sum of the weights of all examples incorrectly classified by h_1, is updated. The weight λ of the example is increased and the example is presented to the next base classifier. Note that in the regular batch AdaBoost method the weight of the example is also increased in case of misclassification; however, there all the weights are assumed to be known at the next iteration. In the online boosting method, only the sums of the weights of correctly classified and misclassified examples (seen so far) are available. The boosting parameters β_n are also updated using these weights. Note that the algorithm also needs to update the base classifiers online.
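
The update described in this subsection can be sketched as follows, mirroring the OnlineBoost routine of Figure 2 (after Oza and Russell [6]). The base-classifier update is delegated to the histogram update described next; the function name, the numerical guard on ε_n, and the return value are assumptions of this illustration rather than the authors' code.

import numpy as np

def online_boost(base_classifiers, lam_sc, lam_sw, v, label, rng=None):
    """One online boosting step for a newly labeled example (v, label).

    base_classifiers: list of per-coefficient base classifiers exposing
        update(coeff, label) and predict(coeff).
    lam_sc, lam_sw: arrays with the running sums of weights of correctly
        and incorrectly classified examples for each base model.
    Returns the updated voting weights log(1/beta_n) of the boosted classifier.
    """
    rng = rng or np.random.default_rng()
    lam_x = 1.0                                    # initial weight of the example
    log_inv_beta = np.zeros(len(base_classifiers))
    for n, h in enumerate(base_classifiers):
        for _ in range(rng.poisson(lam_x)):        # present the example z ~ Poisson(lam_x) times
            h.update(v[n], label)
        correct = (h.predict(v[n]) == label)
        if correct:
            lam_sc[n] += lam_x
        else:
            lam_sw[n] += lam_x
        eps = lam_sw[n] / (lam_sc[n] + lam_sw[n])  # running error estimate
        eps = min(max(eps, 1e-6), 1.0 - 1e-6)      # numerical guard (assumption)
        lam_x *= 1.0 / (2.0 * (1.0 - eps)) if correct else 1.0 / (2.0 * eps)
        log_inv_beta[n] = np.log((1.0 - eps) / eps)  # log(1/beta_n), with beta_n = eps/(1-eps)
    return log_inv_beta
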
Since our base classifiers are represented as normalized histograms, they can easily be updated, i.e., the training example is added to the histogram representing the probability distribution of the feature, and the histogram is re-normalized. The online learning algorithm is shown in the bottom half of Figure 2.

4. Results

For the initial training of the multi-category classifier, we used 50 training images per class. Images of pedestrians and vehicles from a variety of poses were used. For the non-stationary background class, we selected scenarios where background modeling is likely to fail, for example sporadically moving tree branches or waves in a pond. All extracted objects were scaled to the same size (30x30 pixels). Features were obtained by projecting all the image regions into the pedestrian and vehicle subspaces. The base and boosted classifier thresholds were determined from a validation set consisting of 20 images per class, for a total of 60 images.

We evaluated our algorithm for person and vehicle detection in three different locations. In each location, the view consisted of a road with walkways nearby. The pedestrian and vehicular traffic along the paths was fairly consistent. We demonstrate the improvement gained through online co-training at each location in two different ways. First, we divided the sequences into equal-size chunks and show that classification accuracy improves with time through online learning. Figure 4 shows classification results over two-minute subsets for the three sequences. Note that, with the exception of one interval in the second sequence, the performance either consistently improves with time or remains stable. The performance measure was the classification accuracy, i.e., the percentage of valid vehicle and pedestrian detections relative to the total number of detections. For further analysis of the method, we divided each sequence into two sets.

Figure 3: Some classification results from sequence 1.

Figure 4: Change in performance over time for sequences 1, 2 and 3, respectively (one "Performance over Time" panel per sequence; horizontal axis: time interval). The performance was measured over two-minute intervals. Approximately 150 to 200 possible detections of vehicles or pedestrians were made in each time interval.

In the first set, the classification results were obtained using the multi-class AdaBoost.M1 classifier without co-training. The other set was then run with the co-trainable classifier, stopping when a pre-determined number of labeled examples had updated the classifier parameters. Once the updated parameters were obtained, the boosting algorithm was re-run on the first set with the classifier parameters frozen, and the change in performance was measured. The improvement in performance for this setup is shown in Figure 6. The horizontal axis shows the number of examples obtained through co-training from the second set, and the vertical axis shows the detection rates on the test sequence. The detection rates improve significantly even with a small number of new training examples. Since the automatically labeled training examples come from the specific scene on which the classifier is being evaluated, only a few co-trained examples are sufficient to increase the detection accuracy. Some detection results are shown in Figures 3 and 5.

Upon analysis of the examples selected for co-training by the base classifiers, we found that approximately 98% of them were correctly labeled. The small number of misclassifications was caused mainly by occlusion. One important point in using examples obtained through co-training to update the classifier parameters is that, if an example is misaligned or the target object is only partially visible, then updating the classifier parameters with that example can lower the classification accuracy. We reduce the likelihood of such a scenario by forcing the detected region to lie within the foreground regions determined by the background modeling algorithm. Moreover, we only select those examples that are at peaks of the (boosted) classifier scoring function, as suggested in [5].

Another problem that might arise during co-training is that examples of one class may be observed in much greater numbers than those of other classes. Updating the classifier parameters by training on examples of only one class can bias the classifier. This problem always occurs when the background has to be distinguished from the object by the classifier; in that case, the examples of the background class far outnumber the examples of the object class. Since we remove most of the background region by background subtraction, this scenario is less likely to occur. To avoid the problem completely, if examples of one class are being confidently labeled in much greater numbers than others, then one can store the examples and sample them in numbers comparable to the other classes, rather than using all of them for training.

Figure 5: Moving object classification results from sequence 2.

Figure 6: Performance vs. the number of co-trained examples for sequences 1, 2 and 3, respectively (one "Classifier Performance" panel per sequence; horizontal axis: number of training examples). The graphs for each sequence show the improvement in performance as more examples labeled by the co-training method are used. Note that relatively few examples are required to improve the detection rates, since these examples are from the same scene in which the classifier is being evaluated. The classification accuracy was relatively low for sequence 2 because of persistent occlusion between vehicles.

5. Concluding Remarks

In this paper, we presented a unified boosting based framework for online training and classification of objects. Examples that were confidently labeled by a small subset of the base classifiers were used to update both the boosting coefficients and the base classifiers. We have demonstrated that a classifier's performance can be significantly improved using only a small number of examples from the specific scenario in which the classifier is employed. This is because the variation in object poses, backgrounds and illumination conditions in a specific scene is far smaller than the possible variation across all detection scenarios. The use of co-training in an online classification framework allows us to focus on the specific subset of poses and backgrounds likely to be viewed in each scenario.

References

[1] Anonymous. For blind review.
[2] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In 11th Annual Conference on Computational Learning Theory, 1998.
[3] M. Collins and Y. Singer. Unsupervised models for named entity classification. In Empirical Methods in Natural Language Processing, 1999.
[4] Y. Freund and R. Schapire. Experiments with a new boosting algorithm. In International Conference on Machine Learning, 1996.
[5] A. Levin, P. Viola, and Y. Freund. Unsupervised improvement of visual detectors using co-training. In International Conference on Computer Vision, 2003.
[6] N. Oza. Online Ensemble Learning. Ph.D. dissertation, 2002.
[7] C. Papageorgiou and T. Poggio. Trainable pedestrian detection. In International Conference on Image Processing, 1999.
[8] D. Pierce and C. Cardie. Limitations of co-training for natural language learning from large datasets. In Conference on Empirical Methods in Natural Language Processing, 2001.
[9] H. Schneiderman and T. Kanade. A statistical method for 3D object detection applied to faces and cars. In International Conference on Computer Vision and Pattern Recognition, 2000.
[10] P. Viola, M. Jones, and D. Snow. Detecting pedestrians using patterns of motion and appearance. In International Conference on Computer Vision, 2003.
[11] D. Zhang, S. Z. Li, and D. Perez. Real-time face detection using boosting in hierarchical feature spaces. In Int. Conf. on Image Processing, 2004.