Improved Vehicle Classification in Long Traffic Video by Cooperating Tracker and Classifier Modules


Brendan Morris and Mohan Trivedi
University of California, San Diego
San Diego, CA 92093
{b1morris, trivedi}@ucsd.edu

Abstract

Visual surveillance systems intend to extract meaning from a scene. Two initial steps for this extraction are the detection and tracking of objects, followed by the classification of these objects. Often these are viewed as separate problems, where each is solved by an individual module. These tasks should not be done in isolation because they can help one another. This paper demonstrates the benefit gained in both tracking and classification through communication between these individual modules. This is shown on a real-time system monitoring highway traffic. The system retrieves online video at 10 frames/sec and conducts tracking and classification simultaneously. Results show an improvement from 74% to 88% classification accuracy.

1. Introduction

Video surveillance has prompted a wide variety of research, with tracking being one of the foremost topics [7]. Accurate tracking is possible even through many difficult situations such as changing lighting conditions, occlusion, or adverse weather. There is also the object recognition camp that seeks to determine the identity of an object visually [1]. Usually these are seen as two different problems, but in a scene where the objects of interest are in motion they are in fact complementary tasks [9]. Tracking and classification should both be implemented in a visual surveillance system because they are inherently linked in many higher level analyses. Accurate vehicle classification can be used for structural health monitoring [4], environmental studies on impacts from emissions [3], and road management and traffic planning. In a more general setting, tracking with classification can be particularly useful for re-identification [8] of vehicles through larger video networks with non-overlapping views or without time synchronization. Classification can also be used to provide context to systems that learn normal and abnormal behavior patterns [6]. (In a highway application one expects large trucks to travel in the slower lanes.) This paper demonstrates the benefit gained in both tracking and classification results through communication between the individual modules, demonstrated with a real-time system monitoring highway traffic. The system retrieves online video at 10 frames/sec and conducts tracking and classification simultaneously. Results show an improvement from 74% to 88% classification accuracy.

[Figure 1. Standard output frame from the system. Text accompanying each detection gives the detection number d, track number t, classification c (detection class, track class), and velocity v, displayed as {d# t# c# c# v#}.]

2. System Overview

The system presented in this paper is a general tracking system to be used as a utility for lab experiments. The goal was to develop tracking and classification software that can be used as a front end for higher level analyses.

[Figure 2. System block diagram with interconnects between classification and tracking.]

The experimental test bed consists of 10 cameras situated around campus, offering a wide variety of scenes from highway to foot traffic. Video is streamed via the internet using Axis video servers at 10 frames a second. This software can be run in real time for long periods (data for this paper was collected over 24 hours), collecting data and statistics that can be stored for future investigation [2]. A block diagram for this system, with its four main blocks, the Object Detection, Detection Classification, Tracking, and Track Classification modules, is shown in Fig. 2. The Object Detection module locates potential object pixels by constructing a background model and performing background subtraction. The Detection Classification module takes measurements on connected-component object blobs to classify the object type. The Tracking module tracks blobs using a Kalman filter and the object measurements. Finally, the Track Classification module uses the tracking information to refine the object class estimate from the Detection Classification block. A typical output frame from this system is shown in Fig. 1. The labels above each vehicle are of the form {d# t# c# c# v#}, with d being the detection number, t the track number, c the classification (detection class, track class), and finally a rough velocity in mph.

3. Object Detection

The Object Detection module quickly determines foreground pixels using an adaptive background subtraction scheme. The background model is composed of two parameters: µ, a time-averaged background image of the scene, and σ, a measure of the variability in the scene. The background model is adaptively updated as each new video frame is received by computing a running average, where the contribution of the newest frame, I_t, is controlled by the parameter α ∈ [0, 1],

    µ_t = (1 − α) µ_{t−1} + α I_t,                          (1)
    σ_t² = (1 − α) σ_{t−1}² + α (I_t − µ_t)².               (2)

The foreground pixels are extracted by background subtraction and thresholding, where the threshold is determined by the past deviations of a pixel (σ_0 is a small constant to suppress noise),

    I_foreground = (I_t − µ_t) > T (σ_t + σ_0).             (3)

The foreground is further processed to fill in any holes with morphological operations. Each blob is then labeled by connected-component analysis, generating a unique identifier for further processing.
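The adaptive background model of Eqs. (1)-(3) can be sketched in a few lines of NumPy. This is an illustrative reimplementation, not the authors' code: the function names, the threshold T, and the noise floor sigma0 are assumptions, and the morphological hole filling and connected-component labeling steps are omitted.

```python
import numpy as np

def update_background(mu, var, frame, alpha=0.01):
    """Running-average background update, Eqs. (1)-(2)."""
    mu = (1 - alpha) * mu + alpha * frame
    var = (1 - alpha) * var + alpha * (frame - mu) ** 2
    return mu, var

def foreground_mask(mu, var, frame, T=3.0, sigma0=5.0):
    """Threshold on past pixel deviations, Eq. (3)."""
    return (frame - mu) > T * (np.sqrt(var) + sigma0)

# Hypothetical usage on a grayscale float32 frame stream:
# mu, var = first_frame.astype(np.float32), np.full_like(first_frame, 25.0)
# for frame in frames:
#     mask = foreground_mask(mu, var, frame)
#     mu, var = update_background(mu, var, frame)
```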
4. Detection Classification

The Detection Classification module takes measurements of each foreground blob. The measurements are intended to characterize an object by providing a unique signature of any potential scene object. The measurement vector used here is composed of 17 simple blob features {area, breadth, compactness, elongation, perimeter, convex hull perimeter, length, long and short axis of fitted ellipse, roughness, centroid, 5 image moments}, x = [m_0, ..., m_16]^T. The object class is determined by transforming x and comparing the transformed vector with a set of training examples. The classifier is trained by collecting measurement samples and performing linear discriminant analysis (LDA) [5] to project the data onto a lower dimensional space better suited for classification. The objects are then compared in this projection space using a weighted K nearest neighbor (wKNN) classifier. The training set is chosen to have the same number of examples of each class to maintain comparison fairness. The training set is made up of prototype measurement vectors learned by clustering with fuzzy C means (FCM). The details of the classification scheme are given in the following sections.

4.1. LDA

Classification is performed in a lower dimensional space constructed using linear discriminant analysis. LDA designs a space by transforming the features in a training set to maximize the distance between classes. Let D_c = {x_1, ..., x_{N_c}} be the set of N_c training vectors for class c, each of dimension d, with mean µ_c = (1/N_c) Σ_i x_i. The full training set, D = {D_1, ..., D_C}, is composed of the training samples from all classes and has mean µ = (1/N) Σ_i x_i, where N = Σ_c N_c. The LDA projection is found by the maximization problem

    P_LDA = argmax_w (w^T S_B w) / (w^T S_W w),                        (4)

where S_B is the between-class scatter matrix and S_W is the within-class scatter matrix, given by

    S_B = Σ_{i=1}^{C} N_i (µ_i − µ)(µ_i − µ)^T,                        (5)
    S_W = Σ_{i=1}^{C} Σ_{x_k ∈ D_i} (x_k − µ_i)(x_k − µ_i)^T.          (6)

The solution to this maximization leads to the generalized eigenproblem S_B w = λ S_W w. The top M eigenvectors are retained to obtain the LDA projection matrix,

    x_LDA = P_LDA x = [w_1, ..., w_i, ..., w_M] x.                     (7)

The detection measurements are transformed by projecting them onto the LDA space using P_LDA, where classification can occur using weighted K nearest neighbors.
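A minimal sketch of the LDA projection in Eqs. (4)-(7), using NumPy. The function name and the pseudo-inverse route to the generalized eigenproblem are choices made for this illustration, not details taken from the paper.

```python
import numpy as np

def lda_projection(X, y, M=5):
    """Return the top-M LDA directions for data X (n x d) with labels y."""
    classes = np.unique(y)
    mu = X.mean(axis=0)
    d = X.shape[1]
    Sb = np.zeros((d, d))
    Sw = np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        Sb += len(Xc) * np.outer(mu_c - mu, mu_c - mu)   # Eq. (5)
        Sw += (Xc - mu_c).T @ (Xc - mu_c)                # Eq. (6)
    # Generalized eigenproblem S_B w = lambda S_W w, solved via pinv(S_W) S_B
    evals, evecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    order = np.argsort(-evals.real)
    return evecs[:, order[:M]].real.T                    # M x d projection matrix

# x_lda = lda_projection(X_train, y_train) @ x  projects one measurement, Eq. (7)
```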

4.2. wKNN

The wKNN rule [10] is a modification of the nearest neighbor (NN) classifier. The advantage of wKNN is that each sample is assigned a weight for every class, while NN only gives a binary indication of class membership. This class weight is a soft membership to each class, which builds robustness to noise and outliers. The weight for class c, w_c, is determined by adding the similarity of the K closest training samples with label c. The similarity is defined as the inverse of the Euclidean distance between vectors. The label of an individual detection, L_D, is the class that has the highest weight,

    w_c = Σ_{x_i ∈ K_c} 1 / ||x_i − x_test||,                          (8)
    L_D = argmax_c w_c,                                                (9)

where K_c denotes the K closest training samples with label c.

4.3. FCM

Using a NN derivative makes classification inherently dependent on the training set. The training set must be diverse enough to capture all desired classes and contain ample variability to distinguish between these classes. When collecting samples, the training set will be biased toward the most often occurring class. (The number of sedans far exceeds the number of semi trucks in highway surveillance.) Fairness is introduced to the wKNN classifier by normalizing each class to have the same number of training samples (N_p). These prototype training vectors are learned using Fuzzy C Means [11] to iteratively minimize the loss function

    Q = Σ_{i=1}^{N_p} Σ_{k=1}^{N} u_ik^m ||x_k − v_i||²,               (10)

with the membership constraint

    Σ_{i=1}^{N_p} u_ik = 1.                                            (11)

Here x_k is a data point, v_i a cluster prototype, u_ik ∈ [0, 1] is the membership of sample k to prototype i, and m > 1 is a fuzzification factor. This problem is solved by minimizing the objective function (10) subject to the constraint (11) using the method of Lagrange multipliers. The minimization leads to the following updates for the prototype vectors v_i and memberships u_ik,

    v_i = ( Σ_{k=1}^{N} u_ik^m x_k ) / ( Σ_{k=1}^{N} u_ik^m ),                    (12)
    u_ik = [ Σ_{j=1}^{N_p} ( ||x_k − v_i|| / ||x_k − v_j|| )^{2/(m−1)} ]^{−1}.    (13)

The prototype vectors are used as the training set for wKNN. (The training set can be adapted to new samples by using the membership score, v_j = u_ij x_i + (1 − u_ij) v_j, but this has not been implemented.)
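A compact sketch of the FCM update loop of Eqs. (12)-(13). The convergence tolerance, iteration cap, random initialization, and function name are assumptions made for this illustration rather than details from the paper.

```python
import numpy as np

def fcm_prototypes(X, n_prototypes, m=2.0, iters=100, tol=1e-5, seed=0):
    """Fuzzy C Means: return prototype vectors V (n_prototypes x d)."""
    rng = np.random.default_rng(seed)
    U = rng.random((n_prototypes, len(X)))
    U /= U.sum(axis=0, keepdims=True)                 # membership constraint, Eq. (11)
    for _ in range(iters):
        Um = U ** m
        V = (Um @ X) / Um.sum(axis=1, keepdims=True)  # prototype update, Eq. (12)
        dist = np.linalg.norm(X[None, :, :] - V[:, None, :], axis=2) + 1e-12
        U_new = dist ** (-2.0 / (m - 1.0))
        U_new /= U_new.sum(axis=0, keepdims=True)     # membership update, Eq. (13)
        if np.abs(U_new - U).max() < tol:
            return V
        U = U_new
    return V

# One balanced prototype set per class, e.g. 149 prototypes each as in the paper:
# prototypes_by_class = {c: fcm_prototypes(X_lda[y == c], 149) for c in classes}
```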

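Using the per-class prototype sets produced above, the wKNN rule of Section 4.2 (Eqs. (8)-(9)) can be sketched as below; the dictionary-of-prototypes data layout is an implementation choice for this illustration.

```python
import numpy as np

def wknn_weights(x_test, prototypes_by_class, K=5):
    """Class weight w_c: sum of inverse distances to the K closest prototypes of class c, Eq. (8)."""
    weights = {}
    for c, P in prototypes_by_class.items():
        d = np.sort(np.linalg.norm(P - x_test, axis=1))[:K]
        weights[c] = float(np.sum(1.0 / (d + 1e-12)))
    return weights

def classify_detection(x_test, prototypes_by_class, K=5):
    """Detection label L_D = argmax_c w_c, Eq. (9); also return the weights for later use."""
    w = wknn_weights(x_test, prototypes_by_class, K)
    return max(w, key=w.get), w
```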
5. Tracking

The Tracking module is based on the centroid of detected blobs. The centroid of each blob is tracked using a constant velocity model Kalman filter. The state of the filter is the centroid location and velocity, s = [c_x, c_y, v_x, v_y]^T, and the measurement is an estimate of this entire state, y = ŝ = [ĉ_x, ĉ_y, v̂_x, v̂_y]^T. The data association problem between multiple blobs is solved by comparing the predicted centroid location with the centroids of the detections in the current frame. The blob with centroid closest to the predicted location is chosen as a match for the track. In addition to the Kalman filter, each track maintains a history of the measurements of detections belonging to the track. When a new detection is associated to a track, the track history is updated,

    x_t^track = (1 − α) x_{t−1}^track + α x_t^detection.               (14)

Similar to the background update, α ∈ [0, 1], but here it controls how similar measurements from successive detections must be along the track. A larger α is used when objects have larger variability along a track. The track measurement history is used to enforce consistency between a potential detection and a track. In addition to being in the predicted location, a matched object must also have similar measurements (S_meas > T_S). The similarity between a track and a test detection is defined as

    S_meas = [ (x_track − x_test)^T Σ^{−1} (x_track − x_test) ]^{−1},  (15)

where Σ is a diagonal matrix with entries equal to the variance of each particular measurement, learned during training. Fig. 3 shows a track correctly being split into 2 new tracks because the measurement constraint was violated. Even with the measurement constraint there are still cases where tracks are difficult to disambiguate, as seen in Fig. 5. When the merged sedans are split into 3 new tracks, the Kalman filter has not had time to initialize a velocity before tracking makes the mistake of linking the wrong vehicle. This incorrect linking actually occurs twice in the last 2 frames as the middle car gets associated with the track actually belonging to the bottom sedan.

[Figure 3. Example of the track measurement consistency constraint; 3(a) and 3(b) show a track being split. (a) Frame 4: two vehicles merged from background detection. (b) Frame 5: two new tracks are instantiated because the track measurement constraint was violated.]

[Figure 5. Difficulties tracking even with measurement constraints. After the split all three of the vehicles appear quite similar, confusing the track correspondences and causing multiple track splits.]
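A minimal constant-velocity Kalman filter over the state s = [c_x, c_y, v_x, v_y]^T, with the measurement being an estimate of the full state as described above. The noise covariances Q and R are illustrative values, not taken from the paper.

```python
import numpy as np

DT = 0.1                                  # 10 frames/sec
F = np.array([[1, 0, DT, 0],              # constant-velocity transition
              [0, 1, 0, DT],
              [0, 0, 1,  0],
              [0, 0, 0,  1]], float)
H = np.eye(4)                             # measurement estimates the full state
Q = 0.1 * np.eye(4)                       # process noise (assumed)
R = 4.0 * np.eye(4)                       # measurement noise (assumed)

def kf_predict(s, P):
    return F @ s, F @ P @ F.T + Q

def kf_update(s, P, y):
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)        # Kalman gain
    s = s + K @ (y - H @ s)
    P = (np.eye(4) - K @ H) @ P
    return s, P
```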

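The per-frame association step and the track-history update of Eqs. (14)-(15) might then look roughly like the sketch below. The Track fields, the detection dictionaries, and the greedy nearest-centroid matching are assumptions for this illustration; the paper does not give implementation details beyond what is described in Section 5.

```python
import numpy as np

class Track:
    def __init__(self, centroid, measurement):
        self.predicted = np.asarray(centroid, float)   # from the Kalman predict step
        self.history = np.asarray(measurement, float)  # running feature history x_track

def associate_and_update(tracks, detections, meas_var, alpha=0.3, T_S=1.0):
    """Greedy nearest-centroid association gated by the measurement-consistency test."""
    inv_var = 1.0 / meas_var                           # diagonal Sigma^{-1}
    for track in tracks:
        if not detections:
            break
        # nearest detection centroid to the Kalman prediction
        i = min(range(len(detections)),
                key=lambda j: np.linalg.norm(detections[j]["centroid"] - track.predicted))
        diff = track.history - detections[i]["features"]
        s_meas = 1.0 / float(diff @ (inv_var * diff))  # Eq. (15)
        if s_meas > T_S:                               # consistent: update history, Eq. (14)
            track.history = (1 - alpha) * track.history + alpha * detections[i]["features"]
            detections.pop(i)
        # otherwise the detection is left to seed a new (split) track
    return tracks, detections
```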
6. Track Classification

Tracking gives a record of an object while it is in the camera field of view. Each time instant along a track is an example of the object, giving us T examples over the course of a track (T does not have to be the end of a track). Given these T samples, the Track Classification module generates the object class by maximum likelihood estimation,

    L_T = argmax_c Σ_{t=1}^{T} ln p_c(x_t)                             (16)
        = argmax_c Σ_{t=1}^{T} ln ( w_c / Σ_{c'} w_{c'} ).             (17)

The likelihood p_c(x_t) of class c is approximated by normalizing (8) to be a valid probability. The track class is refined each frame as the track is updated. The track label takes into account all the evidence throughout the entire track to make a decision on class type, rather than a single frame measurement that could potentially be corrupted by many sorts of noise. The final track label is the class assigned last, before the track ends. Fig. 4 gives examples of the track classifier overcoming incorrect detection classification results. In Fig. 4(a) the Detection Classification is 2 (SUV) but the Track Classification is 4 (Van). Fig. 4(b) is the opposite case, where the detection label is incorrect (4 - Van) but the tracking label maintains the true vehicle identity (2 - SUV).

[Figure 4. Examples of track classification correcting misclassified detections. (a) Track 40: Van misclassified as SUV (2) by the Detection Classifier but correctly labeled by the Track Classifier as Van (4). (b) Track 58: SUV misclassified as Van (4) by the Detection Classifier but correctly labeled by the Track Classifier as SUV (2).]
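The running track-level decision of Eqs. (16)-(17) can be sketched by accumulating the log of normalized wKNN weights frame by frame; reusing the hypothetical wknn_weights helper from the earlier sketch is an assumption of this illustration.

```python
import math

def update_track_class(log_likelihood, frame_weights):
    """Accumulate the log of normalized per-class weights, Eqs. (16)-(17)."""
    total = sum(frame_weights.values()) + 1e-12
    for c, w in frame_weights.items():
        p = max(w / total, 1e-12)          # p_c(x_t) from normalizing Eq. (8)
        log_likelihood[c] = log_likelihood.get(c, 0.0) + math.log(p)
    return log_likelihood

def track_label(log_likelihood):
    """Current track class L_T: argmax of the accumulated sum."""
    return max(log_likelihood, key=log_likelihood.get)

# Per track: ll = {}; each frame:
# ll = update_track_class(ll, wknn_weights(x_t, prototypes_by_class))
```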

7. Results

The proposed tracking + classification system was run for 24 hours to test the improvements of track-based classification. The video stream analyzed was a highway scene streamed over the internet and processed at between 10-15 frames/sec at 352x240 resolution. Since this system is able to run for long periods of time, it is not feasible to store all the video results. Instead, 5 minute output clips were saved every hour for evaluation. The training data consisted of 1700 vehicles divided into 7 classes. The 7 different vehicle classes were 0 - Sedan, 1 - Truck, 2 - SUV, 3 - Semi, 4 - Van, 5 - Truck+SUV+Van (TSV), and 6 - Moving Truck (MT). The LDA projection was found by retaining the top M = 5 eigenvectors of (7). Using FCM, 1043 training prototypes were generated (149 for each class). All classification results were then computed using wKNN with K = 5 with respect to these prototypes.

Tables 1 and 2 give the classification accuracy after hand labeling the true vehicle classes for two of the videos.

Table 1. Detection classification accuracy results
Class   0-Sedan  1-Truck  2-SUV  3-Semi  4-Van  5-TSV  6-MT   Total
%       81.7     76.2     63.3   62.5    62.2   62.2   100    74.4

Table 2. Track classification accuracy results
Class   0-Sedan  1-Truck  2-SUV  3-Semi  4-Van  5-TSV  6-MT   Total
%       94.3     87.5     75.0   100     90.5   0      85.71  88.4

The lower rates seen for the Detection Classifier are because of the similarity between vehicle classes. Vans and SUVs are quite similar, as are the Semi and Moving Trucks. Note that none of the TSV vehicles were properly classified after tracking. This is because the TSV class was a wrapper class for Truck, SUV, and Van, which were previously found to be the most often confused vehicles [9]. This label was used sparingly, for the rare occurrence of a vehicle that even a human could not distinguish, making it a class of hard examples. Because of its rarity and strong similarity to 3 other classes, the Track Classifier chose to label all such vehicles with a less general label: the TSV examples were placed into either the Truck, SUV, or Van class. Even with the difficulties disambiguating classes based on single frame detections, the Track Classifier performs quite well, with a total improvement of over 10%. Unfortunately this classifier did not work well at all times of the day. At night the classifier was useless because low light conditions produced incomplete detections.

8. Conclusions

Separately, tracking and classification are two important tasks of any surveillance system. Performing both operations in conjunction delivers improved performance in both. This paper demonstrates this improvement through experiments run on live video. Data was captured and processed in real time over long periods. Analysis of the system output showed an improvement of 10% over single frame classification using a track based classifier, as well as more consistent vehicle tracks.

References

[1] P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman. Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. IEEE Trans. Pattern Anal. Machine Intell., 19(7):711-720, July 1997.
[2] S. Bhonsle, M. Trivedi, and A. Gupta. Database-centered architecture for traffic incident detection, management, and analysis. In Proc. IEEE Conf. on Intell. Transport. Syst., pages 149-154, Dearborn, Michigan, Oct. 2000.
[3] C. Cardelino. Daily variability of motor vehicle emissions derived from traffic counter data. Journal of the Air and Waste Management Association, 48(7), July 1998.
[4] R. Chang, T. Gandhi, and M. M. Trivedi. Vision modules for a multi-sensory bridge monitoring approach. In Proc. IEEE Conf. on Intell. Transport. Syst., pages 971-976, Oct. 2004.
[5] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley-Interscience, New York, NY, second edition, 2001.
[6] W. Hu, X. Xiao, D. Xie, T. Tan, and S. Maybank. Traffic accident prediction using 3-D model-based vehicle tracking. IEEE Trans. Veh. Technol., 53(3):677-694, May 2004.
[7] V. Kastrinaki, M. Zervakis, and K. Kalaitzakis. A survey of video processing techniques for traffic applications. Image and Vision Computing, 21(4):359-381, Apr. 2003.
[8] G. T. Kogut and M. M. Trivedi. Maintaining the identity of multiple vehicles as they travel through a video network. In Proc. IEEE Conf. on Intell. Transport. Syst., pages 756-761, Oakland, California, Aug. 2001.
[9] B. T. Morris and M. M. Trivedi. Robust classification and tracking of vehicles in traffic video streams. In Proc. IEEE Conf. on Intell. Transport. Syst., Toronto, Canada, Sept. 2006. To be published.
[10] O. Hasegawa and T. Kanade. Type classification, color estimation, and specific target detection of moving targets on public streets. Machine Vision and Applications, 16:116-121, Feb. 2005.
[11] W. Pedrycz. Knowledge-Based Clustering: From Data to Information Granules. John Wiley, Hoboken, New Jersey, 2005.