Segmentation and Tracking of Multiple Humans in Crowded Environments


Tao Zhao, Member, IEEE, Ram Nevatia, Fellow, IEEE, and Bo Wu, Student Member, IEEE

T. Zhao is with Intuitive Surgical Inc., 950 Kifer Road, Sunnyvale, CA. Email: taozhao@alumni.usc.edu. R. Nevatia and B. Wu are with the Institute for Robotics and Intelligent Systems, University of Southern California, Los Angeles, CA. Email: {nevatia, bowu}@usc.edu. Digital Object Identifier /TPAMI, IEEE.

Abstract: Segmentation and tracking of multiple humans in crowded situations is made difficult by inter-object occlusion. We propose a model-based approach to interpret the image observations by multiple, partially occluded human hypotheses in a Bayesian framework. We define a joint image likelihood for multiple humans based on the appearance of the humans, the visibility of the body obtained by occlusion reasoning, and foreground/background separation. The optimal solution is obtained by using an efficient sampling method, data-driven Markov chain Monte Carlo (DDMCMC), which uses image observations for proposal probabilities. Knowledge of various aspects, including human shape, camera model, and image cues, is integrated in one theoretically sound framework. We present experimental results and quantitative evaluation, demonstrating that the resulting approach is effective for very challenging data.

Index Terms: Multiple Human Segmentation, Multiple Human Tracking, Markov chain Monte Carlo

I. INTRODUCTION AND MOTIVATION

Segmentation and tracking of humans in video sequences is important for a number of applications, such as visual surveillance and human-computer interaction. This has been a topic of considerable research in the recent past, and robust methods exist for tracking isolated humans or a small number of humans having only transient occlusion. However, tracking in a more crowded situation, where several people are present and exhibit persistent occlusion, remains challenging. The goal of this work is to develop a method to detect and track humans in the presence of persistent and temporarily heavy occlusion. We do not require that humans be isolated, i.e. un-occluded, when they first enter the scene. However, in order to see a person, we require that at least the head-shoulder region be visible. We assume a stationary camera so that motion can be detected by comparison with a background model. We do not require the foreground detection to be perfect, e.g. the foreground blobs may be fragmented, but we assume that there are no significant false alarms due to shadows, reflections, or other reasons. We also assume that the camera model is known and that people walk on a known ground plane.

Fig. 1(a) shows a sample frame of a crowded environment and Fig. 1(b) shows the motion blobs detected by comparison with the learned background. It is apparent that segmenting humans from such blobs is not straightforward. One blob may include multiple objects, while one object may split into multiple blobs. Blob tracking over extended periods, e.g. [20], may resolve some of these ambiguities, but such approaches are likely to fail when occlusion is persistent. Some approaches have been developed to handle occlusion, e.g. [9], but require the objects to be initialized before occlusion happens. This is usually infeasible for crowded scenes. We believe that use of a shape model is necessary to achieve individual human segmentation and tracking in crowded scenes.

Fig. 1. A sample frame, the corresponding motion blobs, and our segmentation and tracking result for a crowded situation. (a) Sample frame; (b) motion blobs; (c) our result.

In earlier related work [54], Zhao and Nevatia model the human body as a 3D ellipsoid, and human hypotheses are proposed based on head top detection from foreground boundary peaks. This method works reasonably well in the presence of partial occlusions if the number of people in the field of view is small. As the complexity of the scene grows, head tops cannot be obtained by simple foreground boundary analysis, and more complex shape models are needed to fit the observed shapes more accurately. Also, joint reasoning about the collection of objects is needed, rather than the simpler one-by-one verification method in [54]. The consequence of this joint consideration is that the optimal solution has to be computed in the joint parameter space of all the objects. To track the objects in multiple frames, temporal coherence is another desired property besides accuracy of the spatial segmentation. We adapt a data-driven Markov chain Monte Carlo approach to explore this complex solution space. To improve the computational efficiency, we use direct image features from bottom-up image analysis as importance proposal probabilities to guide the moves of the Markov chain.

The main features of this work include 1) a 3-dimensional part-based human body model, which enables segmentation and tracking of humans in 3D and inference of inter-object occlusion naturally; 2) a Bayesian framework which integrates segmentation and tracking based on a joint likelihood for the appearance of multiple objects;

3) design of efficient Markov chain dynamics, directed by proposal probabilities based on image cues; and 4) the incorporation of a color-based background model in a mean shift tracking step. Our method is able to successfully detect and track humans in scenes of the complexity shown in Fig. 1 with high detection and low false alarm rates; the tracking result for the frame in Fig. 1(a) is shown in Fig. 1(c) (the result includes integration of multiple frames during tracking). In the result section, we give graphical and quantitative results on a number of sequences. Parts of our system have been partially described in [53] and [55]; this paper provides a unified presentation of the methodology, additional results and discussions. This approach has been built on by other researchers, e.g. [41]. The same framework has also been successfully applied to vehicle segmentation and tracking in challenging cases [43].

The rest of the paper is organized as follows: Section II gives a brief review of the related work; Section III presents an overview of our method; Section IV describes the probabilistic modeling of the problem; Section V describes our MCMC-based solution; Section VI shows experimental results and evaluation; conclusions and discussions are given in the last section.

II. RELATED WORK

We summarize related work in this section; some of these works are referred to in more detail in the following sections. Due to the size of the literature in this field, it is not possible for us to provide a comprehensive survey, but we attempt to include the major trends.

The observations for human hypotheses may come from multiple cues. Many previous approaches [20], [9], [54], [37], [44], [15], [18], [40], [24], [3], [45] use motion blobs detected by comparing pixel colors in a frame to learned models of the stationary background. When the scene is not highly crowded, most parts of the humans in the scene are detected in the foreground motion blob; multiple humans may be merged into a single blob but they can be separated by rather simple processing. For example, Haritaoglu et al.
[15] use vertical projection of the blob to help segment a big blob into multiple humans. Siebel and Maybank [40] and Zhao and Nevatia [54] detect head candidates by analyzing the foreground boundaries. Since different humans have small overlapping foreground regions, they could be segmented in a greedy way. However, the utility of these methods in crowded environments such as in Fig. 1 is likely to be limited. Some methods, e.g. [50], [31], [7], [13], detect appearance- or shape-based patterns of humans directly. [50] and [31] learn human detectors from local shape features; [7] and [13] build contour templates for pedestrians. These learning-based methods need a large number of training samples and may be sensitive to imaging viewpoint variations as they learn 2D patterns. Besides motion and shape, face and skin color are also useful cues for human detection, but environments where these cues could be utilized are limited, usually indoor scenes where illumination is controlled and the objects are imaged with high resolution, e.g. [42], [12].

Without a specific model of objects, tracking methods are limited to blob tracking, e.g. [3]. The main advantage of model-based tracking is that it can solve the blob merge and split problems by enforcing a global shape constraint. The shape models could be either parametric, e.g. an ellipsoid as in [54], or non-parametric, e.g. the edge template as in [13]; either in 2D, e.g. [46], or in 3D, e.g. [54]. Parametric models are usually generative and of high dimensionality, while non-parametric models are usually learned from real samples. 2D models make the matching of hypotheses and image observations straightforward, while 3D models are more natural for occlusion reasoning. The choice of the model complexity depends on both the application and the video resolution. For human tracking from a mid-distance camera, we do not need to capture the detailed body articulation; a rough body model, such as the generic cylinder in [19], the ellipsoid in [54], or the multiple rectangles in [46], suffices.
When the body pose of humans is desired and the video resolution is high enough, more complex models could be used, such as the articulated models in [54] and [34].

Tracking of multiple objects requires matching of hypotheses with the observations both spatially and temporally. When objects are highly inter-occluded, their image observations are far from being independent, hence a joint likelihood for multiple objects is necessary [46], [27], [19], [35], [30], [51]. Smith et al. [41] use a pairwise Markov Random Field (MRF) to model the interaction between humans and define the joint likelihood. Rittscher et al. [36] include a hidden variable, which indicates a global mapping from the observed features to human hypotheses, in the state vector.

As the solution space is of high dimension, searching for the best interpretation by brute force is not feasible. Particle-filter-based methods, e.g. [19], [46], [30], [51], [27], become unsuitable when the dimensionality of the search space is high, as the number of samples needed usually grows exponentially with the dimension. [41], [21] use some variations of the MCMC algorithm to sample the solution space, while [45], [36] use an EM-style method. For efficiency, the candidate solutions could be generated from some image cues rather than purely randomly, e.g. [36] propose hypotheses from local silhouette features.

Information from multiple cameras with overlapping views can reduce the ambiguity of a single camera. Such methods usually assume that the object can be detected successfully from at least one viewpoint (e.g. [11]) or that many cameras are available for 3-dimensional reconstruction (e.g. [28]). The difficulty of segmenting multiple humans which overlap in images from a stereo camera is alleviated by analyzing in the 3-dimensional space, where they are separable [52]. In a multi-camera context, an object can be tracked even when it is fully occluded from some of the views; however, many real environments do not permit use of multiple cameras with overlapping views. In this paper, we consider situations where video from only one camera is available.
However, our approach can utilize multiple cameras with little modification.

MCMC-based methods are receiving increasing popularity for computer vision problems due to their flexibility in optimizing an arbitrary energy function, as opposed to energy functions of a specific type as in graph cut [2] or belief propagation [49].

MCMC has been used for various applications including segmenting multiple cells [38], image parsing [48], multi-object tracking [21], estimating articulated structures [23], etc. Data-driven MCMC was proposed by [48] to utilize bottom-up image cues to speed up the sampling process.

We want to point out the differences between our approach and another independently developed work [21], which also used MCMC for multi-object tracking. [21] assumes that the objects do not overlap by applying a penalty term for overlap, while our approach explicitly uses a likelihood of appearance under occlusion. Our approach focuses on the domain of tracking humans, which are the most important subject for visual surveillance. We consider the 3-dimensional perspective effect in a typical camera setting, while the ant tracking problem described in [21] is almost a 2-dimensional problem. We utilize acquired appearance models, since each object has a different appearance, whereas ants in [21] are assumed to have the same appearance. We developed a full set of effective bottom-up cues for human segmentation and hypothesis generation.

III. OVERVIEW

Our approach to segmenting and tracking multiple humans emphasizes the use of shape models. An overview diagram is given in Fig. 2. Based on a background model, the foreground blobs are extracted as the basic observation. By using the camera model and the assumption that objects move on a known ground plane, multiple 3D human hypotheses are projected onto the image plane and matched with the foreground blobs. Since the hypotheses are in 3D, occlusion reasoning is straightforward. In one frame, we segment the foreground blobs into multiple humans and associate the segmented humans with the existing trajectories. Then the tracks are used to propose human hypotheses in the next frame. The segmentation and tracking are integrated in a unified framework and inter-operate over time.

Fig. 2. Overview diagram of our approach.
We formulate the problem of segmentation and tracking as one of Bayesian inference to find the best interpretation given the image observations, the prior models, and the estimates from previous frame analysis (i.e. the maximum a posteriori, MAP, estimation). The state to be estimated at each frame includes the number of objects, their correspondences to the objects in the previous frame (if any), their parameters (e.g. positions), and the uncertainty of the parameters.

We define a color-based joint likelihood model which considers all the objects and the background together, and encodes both the constraint that the object should be different from the background and the constraint that the object should be similar to its correspondence. Using this likelihood model gracefully integrates segmentation and tracking, and avoids a separate, sometimes ad hoc, initialization step. Given multiple human hypotheses, inter-object occlusion reasoning is done before calculating the joint image likelihood. The occluded parts of a human should not have corresponding image observations.

The solution space contains subspaces of varying dimensions, each corresponding to a different number of objects. The state vector consists of both discrete and continuous variables. This disqualifies many optimization techniques. Therefore we use a highly general reversible jump/diffusion MCMC-based method to compute the MAP estimate. We design dynamics for the multi-object tracking problem. We also use various direct image features to make the Markov chain more efficient. Direct image features alone do not guarantee optimality because they are usually computed locally or using partial cues. Using them as proposal probabilities of the Markov chain results in an integrated top-down/bottom-up approach which has both the computational efficiency of image features and the optimality of a Bayesian formulation. A mean shift technique [5] is used as efficient diffusion for the Markov chain. The data-driven dynamics and the in-depth exploration of the solution space make the approach less sensitive to dimensionality compared to particle filters.
Our experiments show that the described approach works robustly in very challenging situations with affordable computation; some results are shown in Section VI.

IV. PROBABILISTIC MODELING

Let $\theta$ represent the state of the objects in the scene at time $t$; it consists of the number of objects in the scene, their 3D positions, and other parameters describing their size, shape and pose. Our goal is to estimate the state at time $t$, $\theta^{(t)}$, given the image observations $I^{(1)}, \ldots, I^{(t)}$, abbreviated as $I^{(1,\ldots,t)}$. We formulate the tracking problem as computing the maximum a posteriori (MAP) estimate $\theta^{(t)*}$:

$$\theta^{(t)*} = \arg\max_{\theta^{(t)} \in \Theta} P\left(\theta^{(t)} \mid I^{(1,\ldots,t)}\right) = \arg\max_{\theta^{(t)} \in \Theta} \left\{ P\left(I^{(t)} \mid \theta^{(t)}\right) P\left(\theta^{(t)} \mid I^{(1,\ldots,t-1)}\right) \right\} \quad (1)$$

where $\Theta$ is the solution space. Denote by $m$ the state vector of one individual object. A state containing $n$ objects can be written as $\theta = \{(k_1, m_1), \ldots, (k_n, m_n)\} \in \Theta_n$, where $k_i$ is the unique identity of the $i$-th object whose parameters are $m_i$, and $\Theta_n$ is the solution space of exactly $n$ objects. The entire solution space is $\Theta = \bigcup_{n=0}^{N_{max}} \Theta_n$, where $N_{max}$ is an upper bound on the number of objects. In practice, we compute an approximation of $P(\theta^{(t)} \mid I^{(1,\ldots,t-1)})$ (details are given later in Section IV-D).

A. 3D Human Shape Model

The parameters of an individual human, $m$, are defined based on a 3D human shape model. The human body is highly articulated; however, in our case, the human motion is mostly limited to standing or walking, and we do not attempt to


capture the detailed shape and articulation parameters of the human body. Thus we use a number of low-dimensional models to capture the gross shape of human bodies.

Fig. 3. A number of 3D human models to capture the gross shape of human bodies.

Ellipsoids fit human body parts well and have the property that their projection is an ellipse with a convenient form [16]. Therefore we model human shape by a composition of multiple ellipsoids corresponding to the head, the torso and the legs, with a fixed spatial relationship. A few such models at characteristic poses are sufficient to capture the gross shape variations of most humans in the scene for mid-resolution images. We use the multi-ellipsoid model to control the model complexity while maintaining a reasonable level of fidelity. We have used three such models (one for legs close to each other and two for legs well split) in our previous work on multi-human segmentation [53]. However, in this work we use only a single model with three ellipsoids, which we found sufficient for tracking.

The model is controlled by two parameters called size and thickness. The size parameter is the 3D height of the model; it also controls the overall scaling of the object in the three directions. The thickness parameter captures extra scaling in the horizontal directions. Besides size and thickness, the parameters also include the image position of the head,¹ the 3D orientation of the body, and the 2D inclination of the body. The orientations of the models are quantized into a few levels for computational efficiency. The origin of the rotation is chosen so that 0° corresponds to a human facing the camera. We use 0° and 90° to represent frontal/back and side views in this work. The 3D models assume that humans are perfectly upright, but there are chances that they incline their body slightly. We use one parameter to capture the inclination in 2D (as opposed to two parameters in 3D). Therefore, the parameters of the $i$-th human are $m_i = \{o_i, x_i, y_i, h_i, f_i, i_i\}$, which are orientation, position, size, thickness, and inclination respectively. We also write $(x_i, y_i)$ as $u_i$.
With a given camera model and a known ground plane, the 3D shape models automatically incorporate the perspective effect of camera projection (the change in object image size and shape due to the change in object position and/or camera viewpoint). Compared to 2D shape models (e.g. [13]) or pre-learnt 2D appearance models (e.g. [50]), the 3D models are more easily applicable to a novel viewpoint.

¹ The image head location is an equivalent parameterization of the world location on the ground plane $(x_w, y_w)$ given the human height. The two are related by $[x, y, 1]^T \sim [p_1, p_2, p_3 h + p_4][x_w, y_w, 1]^T$, where $p_i$ is the $i$-th column of the camera projection matrix and $h$ is the height of the human. For clarity of presentation, we chose the ground plane to be $z = 0$.

B. Object Appearance Model

Besides the shape model, we also use a color histogram of the object, $\hat{p} = \{\hat{p}_1, \ldots, \hat{p}_m\}$ ($m$ is the number of bins of the color histogram), defined within the object shape, as a representation of its appearance, which helps establish correspondence in tracking. We use a color histogram because it is insensitive to the non-rigidity of human motion. Furthermore, there exist efficient algorithms, e.g. the mean shift technique [5], to optimize a histogram-based objective function. When calculating the color histogram, a kernel function $K_E(\cdot)$ with Epanechnikov profile [5] is applied to weight pixel locations so that the center has a higher weight than the boundary. Such a representation has been used in [6]. Our implementation uses a single RGB histogram with 512 bins (8 for each dimension), computed over all the samples within the three elliptic regions of our object model.

C. Background Appearance Model

The background appearance model is a modified version of a Gaussian distribution. Denote by $(\bar{r}_j, \bar{g}_j, \bar{b}_j)$ and $\Sigma_j = \mathrm{diag}\{\sigma_{r_j}^2, \sigma_{g_j}^2, \sigma_{b_j}^2\}$ the mean and the covariance of the color at pixel $j$. The probability of pixel $j$ being from the background is

$$P_b(I_j) = P_b(r_j, g_j, b_j) \propto \max\left\{ \exp\left[ -\left(\frac{r_j - \bar{r}_j}{\sigma_{r_j}}\right)^2 - \left(\frac{g_j - \bar{g}_j}{\sigma_{g_j}}\right)^2 - \left(\frac{b_j - \bar{b}_j}{\sigma_{b_j}}\right)^2 \right], \epsilon \right\} \quad (2)$$

where $\epsilon$ is a small constant.
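The per-pixel background probability of Equation 2 can be illustrated with a minimal sketch. It assumes the exponent is the negative sum of squared, standard-deviation-normalised channel differences; the function name and the default value of `eps` are illustrative assumptions, since the text only calls it a small constant.

```python
import math


def background_probability(pixel, mean, sigma, eps=1e-6):
    """Unnormalised background probability of one RGB pixel (cf. Equation 2).

    pixel, mean and sigma are (r, g, b) triples; eps is the uniform floor
    (its value here is an assumption, not taken from the paper).
    """
    exponent = -sum(((c - m) / s) ** 2 for c, m, s in zip(pixel, mean, sigma))
    # The max() combines the Gaussian term with the uniform outlier floor.
    return max(math.exp(exponent), eps)


# A pixel close to the background mean, and one far from it.
near = background_probability((100, 120, 90), (101, 119, 90), (5.0, 5.0, 5.0))
far = background_probability((250, 10, 10), (101, 119, 90), (5.0, 5.0, 5.0))
```

The floor keeps the logarithm of the probability bounded for outlier pixels, which matters when the values are later summed in the log domain.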
The background model is a composition of a Gaussian distribution and a uniform distribution. The uniform distribution captures the outliers which are not modeled by the Gaussian distribution, making the model more robust. The Gaussian parameters (mean and covariance) are updated continuously from the video stream, using only the non-moving regions. A more sophisticated background model (e.g. mixture of Gaussians [44] or non-parametric [10]) could be used to account for more variations, but this is not the focus of this work; we assume that comparison with the background model yields adequate foreground blobs.

D. The Prior Distribution

The prior distribution $P(\theta^{(t)} \mid I^{(1,\ldots,t-1)})$ is decomposed into two parts given by:

$$P\left(\theta^{(t)} \mid I^{(1,\ldots,t-1)}\right) \propto P\left(\theta^{(t)}\right) P\left(\theta^{(t)} \mid I^{(1,\ldots,t-1)}\right) \quad (3)$$

$P(\theta^{(t)})$ is independent of time, and is defined by $\prod_{i=1}^{n} P(|S_i|) P(m_i)$, where $S_i$ is the projected image of the $i$-th object and $|S_i|$ is its area. The prior of the image area $P(|S_i|)$ is modeled as being proportional to $\exp(-\lambda_1 |S_i|)\,[1 - \exp(-\lambda_2 |S_i|)]$.²

² We have used a prior on the number of objects in [53] to constrain over-segmentation. However, we found that the prior on the area is more effective due to the large variation of the image sizes of the objects (due to the camera perspective effect) and therefore their different contributions to the likelihood.

The first term here penalizes large total object size to avoid situations where two hypotheses overlap a large portion of an image blob,

while the second term penalizes objects with small image sizes, as they are more likely to be due to image noise. Although the prior on 2D image size could be converted to the 3D space, defining this prior in 2D is more natural, because these properties model the reliability of image evidence independent of the camera models.

The priors on the human body parameters are considered independent. Thus we have $P(m_i) = P(o_i) P(x_i, y_i) P(h_i) P(f_i) P(i_i)$. We set $P(o_i = \text{frontal}) = P(o_i = \text{profile}) = 1/2$. $P(x_i, y_i)$ is a uniform distribution in the image region where a human head is plausible. $P(h_i)$ is a Gaussian distribution $N(\mu_h, \sigma_h^2)$ truncated to the range $[h_{min}, h_{max}]$, and $P(f_i)$ is a Gaussian distribution $N(\mu_f, \sigma_f^2)$ truncated to the range $[f_{min}, f_{max}]$. $P(i_i)$ is a Gaussian distribution $N(\mu_i, \sigma_i^2)$. In our experiments, we use $\mu_h = 1.7$ m, $\sigma_h = 0.2$ m, $h_{min} = 1.5$ m, $h_{max} = 1.9$ m; $\mu_f = 1$, $\sigma_f = 0.2$, $f_{min} = 0.8$, $f_{max} = 1.2$; $\mu_i = 0$, $\sigma_i = 3$. These parameters correspond to common adult body sizes.

We approximate the second term on the right side of Equation 3, $P(\theta^{(t)} \mid I^{(1,\ldots,t-1)})$, by $P(\theta^{(t)} \mid \theta^{(t-1)})$, assuming $\theta^{(t-1)}$ encodes the necessary information from the past observations. For convenience of expression, we rearrange $\theta^{(t)}$ and $\theta^{(t-1)}$ as $\theta^{(t)} = \{(k_i^{(t)}, m_i^{(t)})\}_{i=1}^{N}$ and $\theta^{(t-1)} = \{(k_i^{(t-1)}, m_i^{(t-1)})\}_{i=1}^{N}$, where $N$ is the overall number of objects present in the two frames, so that one of $k_i^{(t)} = k_i^{(t-1)}$, $m_i^{(t)} = \phi$, $m_i^{(t-1)} = \phi$ is true for each $i$. $k_i^{(t)} = k_i^{(t-1)}$ means object $k_i^{(t)}$ is a tracked object; $m_i^{(t)} = \phi$ means object $k_i^{(t-1)}$ is a dead object (i.e. its trajectory is terminated); and $m_i^{(t-1)} = \phi$ means object $k_i^{(t)}$ is a new object. With the rearranged state vector, we have $P(\theta^{(t)} \mid \theta^{(t-1)}) = \prod_{i=1}^{N} P(m_i^{(t)} \mid m_i^{(t-1)})$. The temporal prior of each object follows the definition

$$P\left(m_i^{(t)} \mid m_i^{(t-1)}\right) \propto \begin{cases} P_{assoc}\left(m_i^{(t)} \mid m_i^{(t-1)}\right), & k_i^{(t)} = k_i^{(t-1)} \\ P_{new}\left(m_i^{(t)}\right), & m_i^{(t-1)} = \phi \\ P_{dead}\left(m_i^{(t-1)}\right), & m_i^{(t)} = \phi \end{cases} \quad (4)$$

We assume that the position and the inclination of an object follow constant velocity models with Gaussian noise, and that the height and thickness follow a Gaussian distribution (for simplicity of presentation, we omit the velocity terms in the state). We use Kalman filters for temporal estimation; $P_{assoc}$ is therefore a Gaussian distribution. $P_{new}(m_i^{(t)}) = P_{new}(\tilde{u}_i^{(t)})$ and $P_{dead}(m_i^{(t-1)}) = P_{dead}(\tilde{u}_i^{(t-1)})$ are the likelihoods of the initialization of a new track at position $\tilde{u}_i^{(t)}$ and the termination of an existing track at position $\tilde{u}_i^{(t-1)}$ respectively. They are set empirically according to the distance of the object to the entrances/exits (the boundaries of the image and other areas where people move in or out). $P_{new}(u) \sim N(\mu(u), \Sigma_e)$, where $\mu(u)$ is the location of the closest entrance point to $u$ and $\Sigma_e$ is its associated covariance matrix, which is set manually or through a learning phase. $P_{dead}(\cdot)$ follows a similar definition.

E. Joint Image Likelihood for Multiple Objects and Background

The image likelihood $P(I \mid \theta)$ reflects the probability that we observe image $I$ (or some features extracted from $I$) given state $\theta$. Here we develop a likelihood model based on the color information of the background and the objects. Given a state vector $\theta$, we partition the image into different regions corresponding to different objects and the background. Denote by $\tilde{S}_i$ the visible part of the $i$-th object defined by $m_i$. The visible part of an object is determined by the depth order of all the objects, which can be inferred from their 3D positions and the camera model. The entire object region is $S = \bigcup_{i=1}^{n} S_i = \bigcup_{i=1}^{n} \tilde{S}_i$, since the $\tilde{S}_i$ are disjoint regions. We use $\bar{S}$ to denote the supplementary region of $S$, i.e. the non-object region. The relationship of the regions is illustrated in Fig. 4.

Fig. 4. First pane: the relationship of visible object regions and the non-object region. Rest panes: the color likelihood model. In $\tilde{S}_i$, the likelihood favors both the difference of an object hypothesis from the background and its similarity with its corresponding object in a previous frame. In $\bar{S}$, the likelihood penalizes the difference from the background model. Note that the elliptic models are used for illustration.

In the case of multiple objects which can possibly overlap in the image, the likelihood of the image given the state cannot be simply decomposed into the likelihoods of the individual objects. Instead, a joint likelihood of the whole image given all objects and the background model needs to be considered. The joint likelihood $P(I \mid \theta)$ consists of two terms corresponding to the object region and the non-object region:

$$P(I \mid \theta) = P\left(I^{S} \mid \theta\right) P\left(I^{\bar{S}} \mid \theta\right) \quad (5)$$

After obtaining $\tilde{S}_i$ by occlusion reasoning, the object region likelihood can be calculated by

$$P\left(I^{S} \mid \theta\right) = \prod_{i=1}^{n} P\left(I^{\tilde{S}_i} \mid m_i\right) \propto \exp\left( \lambda_S \sum_{i=1}^{n} |\tilde{S}_i| \Big( \underbrace{-\lambda_b B(p_i, d_i)}_{(1)} + \underbrace{\lambda_f B(p_i, \hat{p}_i)}_{(2)} \Big) \right) \quad (6)$$

where $d_i$ is the color histogram of the background image within the visibility mask of object $i$, and $\hat{p}_i$ is the color histogram of the object, both weighted by the kernel function $K_E(\cdot)$. $B(p, d) = \sum_{j=1}^{m} \sqrt{p_j d_j}$ is the Bhattacharyya coefficient, which reflects the similarity of two histograms. This likelihood favors both the difference of an object hypothesis from the background and its similarity with its corresponding object in a previous frame (Fig. 4). This enables simultaneous segmentation and tracking in the same objective function. We call the two terms background exclusion and

object attraction respectively. The background exclusion concept was also proposed by [33]. $\lambda_b$ and $\lambda_f$ weight the relative contribution of the two terms (we constrain $\lambda_b + \lambda_f = 1$). The object attraction term is the same as the likelihood function used in [6]. For an object without a correspondence, i.e. a new object, only the background exclusion part is used.

The non-object likelihood is calculated by

$$P\left(I^{\bar{S}} \mid \theta\right) = \prod_{j \in \bar{S}} \left(P_b(I_j)\right)^{\lambda_{\bar{S}}} = \exp\left( \lambda_{\bar{S}} \sum_{j \in \bar{S}} e_j \right) \quad (7)$$

where $e_j = \log(P_b(I_j))$ is the log-probability of pixel $j$ belonging to the background model, as defined in Equation 2. $\lambda_S$ in Equation 6 and $\lambda_{\bar{S}}$ in Equation 7 weight the balance of the foreground and the background, considering the different probabilistic models being used. The posterior probability is obtained by combining the prior, Equation 3, and the likelihood, Equation 5.

V. COMPUTING MAP BY EFFICIENT MCMC

Computing the MAP estimate is an optimization problem. Due to the joint consideration of an unknown number of objects, the solution space contains subspaces of varying dimensions. It also includes both discrete and continuous variables. These properties make the optimization challenging. We use a Markov chain Monte Carlo method with jump/diffusion dynamics to sample the posterior probability. Jumps cause the Markov chain to move between subspaces with different dimensions and traverse the discrete variables; diffusions make the Markov chain sample continuous variables. In the process of sampling, the best solution is recorded and the uncertainty associated with the solution is also obtained.

Fig. 5 gives a block diagram of the computation process. The MCMC-based algorithm is an iterative process, starting from an initial state. In each iteration, a candidate is proposed from the state in the previous iteration, assisted by image features. The candidate is accepted probabilistically according to the Metropolis-Hastings rule [17]. The state corresponding to the maximum posterior value is recorded and becomes the solution.

Suppose we want to design a Markov chain with stationary distribution $P(\theta) = P(\theta^{(t)} \mid I^{(t)}, \theta^{(t-1)})$.
At the $g$-th iteration, we sample a candidate state $\theta'$ according to $\theta_{g-1}$ from a proposal distribution $q(\theta_g \mid \theta_{g-1})$. The candidate state $\theta'$ is accepted with the probability

$$p = \min\left\{ 1, \frac{P(\theta')\, q(\theta_{g-1} \mid \theta')}{P(\theta_{g-1})\, q(\theta' \mid \theta_{g-1})} \right\}.^{3}$$

If the candidate state $\theta'$ is accepted, $\theta_g = \theta'$; otherwise, $\theta_g = \theta_{g-1}$. It can be proven that the Markov chain constructed in this way has its stationary distribution equal to $P(\cdot)$, independent of the choice of the proposal probability $q(\cdot)$ and the initial state $\theta_0$ [47]. However, the choice of the proposal probability $q(\cdot)$ can affect the efficiency of the MCMC significantly. Random proposal probabilities will lead to a very slow mixing rate. Using more informed proposal probabilities, e.g. as in data-driven MCMC [48], will make the Markov chain traverse the solution space more efficiently. Therefore the proposal distribution is written as $q(\theta_g \mid \theta_{g-1}, I)$. If the proposal probability is informative enough so that each sample can be thought of as a hypothesis, then the MCMC approach becomes a stochastic version of the hypothesize-and-test approach.

In general, the original version of MCMC has a dimension matching problem for a solution space with varying dimensionality. A variation of MCMC, called trans-dimensional MCMC [14], has been proposed to solve this problem. However, with some appropriate assumptions and simplifications, trans-dimensional MCMC can be reduced to the standard MCMC. We address this issue later in this section.

A. Markov Chain Dynamics

We design the following reversible dynamics for the Markov chain to traverse the solution space. The dynamics correspond to a proposal distribution with a mixture density $q(\theta' \mid \theta_{g-1}, I) = \sum_{a \in A} p_a q_a(\theta' \mid \theta_{g-1}, I)$, where $A = \{add, remove, establish, break, exchange, diff\}$ is the set of all dynamics. The mixing probabilities $p_a$ are the chances of selecting the different dynamics, and $\sum_{a \in A} p_a = 1$. We assume that we have the sample in the $(g-1)$-th iteration, $\theta_{g-1}^{(t)} = \{(k_1, m_1), \ldots, (k_n, m_n)\}$, and now propose a candidate $\theta'$ for the $g$-th iteration ($t$ is omitted where there is no ambiguity).
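The acceptance rule above can be sketched generically as follows. This is a minimal illustration rather than the tracker itself: the function names, the log-domain formulation, and the toy one-dimensional posterior are assumptions made for the example, and the `propose` callback stands in for whichever dynamics would be selected from the mixture.

```python
import math
import random


def metropolis_hastings(initial, log_posterior, propose, iterations, rng):
    """Run a Metropolis-Hastings chain and return the best state visited.

    propose(state, rng) must return (candidate, log_q_forward, log_q_reverse),
    i.e. log q(candidate | state) and log q(state | candidate).
    """
    state, state_lp = initial, log_posterior(initial)
    best, best_lp = state, state_lp
    for _ in range(iterations):
        candidate, log_q_fwd, log_q_rev = propose(state, rng)
        cand_lp = log_posterior(candidate)
        # log of  P(cand) q(state | cand) / (P(state) q(cand | state))
        log_ratio = (cand_lp + log_q_rev) - (state_lp + log_q_fwd)
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            state, state_lp = candidate, cand_lp
            if state_lp > best_lp:
                # Record the maximum-posterior sample seen so far.
                best, best_lp = state, state_lp
    return best, best_lp


# Toy usage (hypothetical): integer state with a posterior peaked at 3.
def toy_log_posterior(x):
    return -float((x - 3) ** 2)


def toy_propose(x, rng):
    step = rng.choice((-1, 1))  # symmetric random walk, so q cancels
    return x + step, math.log(0.5), math.log(0.5)


best_state, best_value = metropolis_hastings(
    0, toy_log_posterior, toy_propose, iterations=2000, rng=random.Random(0))
```

Working with log-probabilities avoids numerical underflow when the posterior is a product of many small terms, as it would be for a likelihood defined over image regions.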
Object hypothesis addition: Sample the parameters of a new human hypothesis $(k_{n+1}, m_{n+1})$ and add it to $\theta_{g-1}$. $q_{add}(\theta_{g-1} \cup \{(k_{n+1}, m_{n+1})\} \mid \theta_{g-1}, I)$ is defined in a data-driven way whose details will be given later.

Object hypothesis removal: Randomly select an existing human hypothesis $r \in [1, n]$ with a uniform distribution and remove it. $q_{remove}(\theta_{g-1} \setminus \{(k_r, m_r)\} \mid \theta_{g-1}) = 1/n$. If $k_r$ has a correspondence in $\theta^{(t-1)}$, then that object becomes dead.

Establish correspondence: Randomly select a new object $r$ in $\theta_{g-1}^{(t)}$ and a dead object $r'$ in $\theta^{(t-1)}$, and establish their temporal correspondence. $q_{establish}(\theta' \mid \theta_{g-1}) \propto \|u_r - u_{r'}\|^{-2}$ for all the qualified pairs.

Fig. 5. The block diagram of the MCMC tracking algorithm.

³ Based on our experiments, we find that approximating the ratio in the second term with just the posterior probability ratio, $P(\theta') / P(\theta_{g-1})$, gives almost the same results as the complete computation; hence we use this approximation in our implementation.

Break correspondence: Randomly select an object $r$ where


7 7 k r θ t 1) wth a unform dstrbuton and change k r to a new object and same object n θ t 1) becomes dead). q break θ θ g 1 )=1/n, where n s the number of objects n θ t) g 1 that have correspondences n the prevous frame. Exchange dentty Exchange the IDs of two close-by objects. Randomly select two objects r 1,r 2 [1,n] and exchange ther IDs. q exchange r 1,r 2 ) u r1 u r2 2. Identtes exchange can also be replaced by the composton of breakng and establshng correspondence. It s used to ease the traversal snce breakng and establshng correspondences may lead to a bg decrease n the probablty and are less lkely to be accepted. Parameter update Update the contnuous parameters of an object. Randomly select an exstng human hypothess r [1,n] wth a unform dstrbuton, and update ts contnuous parameters q dff θ θ g 1 )=1/n)q d m r m r ). Among the above, addton and removal are a par of reverse moves, as are the establshng and breakng correspondences; exchangng dentty, and parameter updatng are the ther own reverse moves. B. Informed Proposal Probablty In theory, the proposal probablty q) does not affect the statonary dstrbuton. However, dfferent q) lead to dfferent performance. The number of samples needed to get a good soluton strongly depends on the proposal probabltes. In ths applcaton, the proposal probablty of addng a new object, and the update of the object parameters, are the two most mportant ones. We use the followng nformed proposal probabltes to make the Markov chan more ntellgent and thus have a hgher acceptance rate. Object addton We add human hypotheses from three cues, foreground boundares, ntensty edges, and foreground resdue foreground wth the exstng objects carved out). In [54] a method to detect the heads whch are on the boundary of the foreground s descrbed. The basc dea s to fnd the local vertcal peaks of the boundary. The peaks are further verfed by checkng f there are enough foreground pxels below t accordng to a human heght range and the camera model. 
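The boundary-based head detection described above (local vertical peaks of the foreground boundary, verified by the amount of foreground below them) can be sketched as follows. This is a simplified illustration, not the method of [54]: the foreground mask is a plain list of rows, the expected human height is passed as a fixed pixel count `height_px` instead of being derived from the camera model, and the parameters `min_fill` and `half_window` are hypothetical.

```python
def top_profile(mask):
    """For each column, the row index of the topmost foreground pixel (or None)."""
    rows, cols = len(mask), len(mask[0])
    return [next((r for r in range(rows) if mask[r][x]), None) for x in range(cols)]


def head_candidates(mask, height_px, min_fill=0.6, half_window=2):
    """Return (x, y) head-top candidates on the upper foreground boundary.

    A column is a local vertical peak if no column within `half_window`
    has a higher boundary (smaller row index). The peak is kept only if at
    least `min_fill` of the `height_px` pixels below it are foreground.
    """
    rows = len(mask)
    prof = top_profile(mask)
    cands = []
    for x, y in enumerate(prof):
        if y is None:
            continue
        lo, hi = max(0, x - half_window), min(len(prof), x + half_window + 1)
        # Columns without foreground count as "infinitely low" boundary.
        neigh = [prof[i] if prof[i] is not None else rows for i in range(lo, hi)]
        if y > min(neigh):
            continue  # a neighbouring column reaches higher
        if any(prof[i] == y for i in range(lo, x)):
            continue  # flat top: keep only the leftmost column
        bottom = min(rows, y + height_px)
        fill = sum(1 for r in range(y, bottom) if mask[r][x]) / height_px
        if fill >= min_fill:
            cands.append((x, y))
    return cands
```

In a full system the expected height at each image position would come from the calibrated camera and the assumed human height range, so `height_px` would vary with the candidate's location.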
The boundary-based detector has a high detection rate and is also effective when the human is small and image edges are not reliable; however, it cannot detect the heads in the interior of the foreground blobs. Fig. 6(a) shows an example of head detection from foreground boundaries.

The second head detection method is based on an Ω-shape head-shoulder model (this term was first introduced in [53]). This detector matches the Ω-shape edge template with the image intensity edges to find the head candidates. First, a Canny edge detector is applied to the foreground region of the input image. A distance transformation [1] is computed on the edge map. Fig. 6(b) shows the exponential edge map $E(x, y) = \exp(-\lambda D(x, y))$, where $D(x, y)$ is the distance to the closest edge point and $\lambda$ is a factor that controls the response field depending on the object scale in the image (we use $\lambda = 0.25$). In addition, the coordinates of the closest edge pixel are recorded as $C(x, y)$. The unit image gradient vector $O(x, y)$ is computed only at edge pixels. The Ω-shape model, see Fig. 6(c), is derived by projecting a generic 3D human model to the image and taking the contour of the whole head and the upper quarter of the torso as the shoulder. The normals of the contour points are also computed. The size of the human model is determined by the camera calibration, assuming an average human height. Denote by $\{u_1, \ldots, u_k\}$ and $\{v_1, \ldots, v_k\}$ the positions and the unit normals of the model points, respectively, when the head top is at $(x, y)$. The model is matched with the image as

$$S(x, y) = \frac{1}{k} \sum_{i=1}^{k} e^{-\lambda D(u_i)} \left( v_i \cdot O(C(u_i)) \right).$$

A head candidate map is constructed by evaluating $S(x, y)$ on every pixel in the dilated foreground region. After smoothing it, we find all the peaks above a threshold chosen to give a very high detection rate, which may also result in a high false alarm rate.

Fig. 6. Head detection. (a) Head detection from foreground blob boundaries; (b) distance transformation on the Canny edge detection result; (c) the Ω-shape head-shoulder model (black: head-shoulder shape, white: normals); and (d) head detection from intensity edges.
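The matching score S(x, y) can be sketched in a few lines. This is a brute-force illustration under stated assumptions, not the authors' implementation: the names `nearest_edge` and `omega_score` and the dictionary `edge_orient` (edge pixel to unit gradient) are hypothetical, the nearest edge pixel is found by exhaustive search instead of a distance transform, and the orientation term is taken to be a plain dot product.

```python
import math


def nearest_edge(point, edge_pixels):
    """Return (distance D, closest edge pixel C) for one model point."""
    best = min(edge_pixels,
               key=lambda e: (e[0] - point[0]) ** 2 + (e[1] - point[1]) ** 2)
    return math.hypot(best[0] - point[0], best[1] - point[1]), best


def omega_score(model_pts, model_normals, edge_orient, lam=0.25):
    """Score S = (1/k) * sum_i exp(-lam * D(u_i)) * (v_i . O(C(u_i))).

    model_pts     : template contour points u_i, already placed in the image
    model_normals : unit normals v_i of the template points
    edge_orient   : dict {(x, y): (ox, oy)}, unit gradient at each edge pixel
    """
    edge_pixels = list(edge_orient)
    total = 0.0
    for u, v in zip(model_pts, model_normals):
        dist, closest = nearest_edge(u, edge_pixels)
        o = edge_orient[closest]
        total += math.exp(-lam * dist) * (v[0] * o[0] + v[1] * o[1])
    return total / len(model_pts)
```

A template that lies exactly on edges with matching orientation scores 1, and the score decays exponentially as the template moves away from the edges. A practical version would precompute D and C once per frame with a distance transform and evaluate the score only inside the dilated foreground.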
An example of head detection from intensity edges is shown in Fig. 6(d). The false alarms tend to happen in areas of rich texture, where there are abundant edges of various orientations.

Finally, after some human objects obtained from the first two methods are hypothesized and removed from the foreground, the foreground residue map $R = F - S$ is computed. A morphological open operation with a vertically elongated structuring element is applied to remove thin bridges and small or thin residues. From each connected component $c$, human candidates can be generated assuming 1) the centroid of $c$ is aligned with the center of the human body; 2) the top center point of $c$ is aligned with the human head; or 3) the bottom center point of $c$ is aligned with the human feet.

The proposal probability for addition combines these three head detection methods: $q_a(k, m) = \sum_{i=1}^{3} \lambda_{ai}\, q_{ai}(k, m)$, where $\lambda_{ai}$, $i = 1, 2, 3$, are the mixing probabilities of the three methods; we use $\lambda_{ai} = 1/3$. $q_{ai}(\cdot)$ samples $m$ first and then $k$: $q_{ai}(k, m) = q_{ai}(m)\, q_{ai}(k \mid m)$, and $q_{ai}(m) = q_o(o)\, q_{ai}(u)\, q_h(h)\, q_f(f)\, q_i(i)$. $q_{ai}(u)$ answers the question of where to add a new human hypothesis. In practice, $q_o(o)$, $q_h(h)$, $q_f(f)$, and $q_i(i)$ use their respective prior distributions, and $q_{ai}(u)$ is a mixture of Gaussians based on the bottom-up detection results. For example, denote by $HC_1 = \{(x_i, y_i)\}_{i=1}^{N_1}$ the head candidates obtained by the first method; then

$$q_{a1}(u) = q_{a1}(x, y) \propto \sum_{i=1}^{N_1} N\left((x_i, y_i),\ \mathrm{diag}\{\sigma_x^2, \sigma_y^2\}\right).$$

The definitions of $q_{a2}(u)$ and $q_{a3}(u)$ are similar. After $u'$ is sampled, $q(k \mid m') \propto q(k \mid u')$ is used to sample $k$ from $\{k^{(t-1)}_{d,1}, \ldots, k^{(t-1)}_{d,n_d}, new\}$ according to $P(u' \mid u^{(t-1)}_{d,i})$ (see Equation 4), $i = 1, \ldots, n_d$, and $P_{new}(u')$, where $n_d$ is the number of dead objects.

The addition and removal actions change the dimension of the state vector. When calculating the acceptance probability, we need to compute the ratio of probabilities from spaces with different dimensions. Smith et al. [41] use an explicit strategy of trans-dimensional MCMC [14] to deal with the dimension-matching problem. We do not need an explicit strategy to match the dimension. Since the trans-dimensional actions only add or remove one object at one iteration, leaving the other objects unchanged, the Jacobian in [14] is unit, as in [41]. So our formulation is just a special case of the more general theory.

Parameter update: We use two ways to update the model parameters: $q_{diff}(m_r \to m'_r) = \lambda_{d1}\, q_{d1}(m_r \to m'_r) + \lambda_{d2}\, q_{d2}(m_r \to m'_r)$, with $\lambda_{di} = 1/2$. $q_{d1}(\cdot)$ uses stochastic gradient descent to update the object parameters: $q_{d1}(m_r \to m'_r) \propto N\left(m_r - k\, \frac{dE}{dm},\ w\right)$, where $E = -\log P(\theta^{(t)} \mid I^{(t)}, \theta^{(t-1)})$ is the energy function, $k$ is a scalar that controls the step size, and $w$ is random noise that helps avoid local maxima of the posterior. A mean shift vector computed in the visible region provides an approximation of the gradient of the object likelihood with respect to the position: $q_{d2}(m_r \to m'_r) \propto N(m^{ms}_r, w)$, where $m^{ms}_r$ is the new location computed from the mean shift procedure (details are given in a separate Appendix). We assume that the change of the posterior probability caused by other components and by occlusion can be absorbed in the noise term. The mean shift has an adaptive step size and better convergence behavior than numerically computed gradients. The rest of the parameters follow their numerically computed gradients. Compared to the original color-based mean shift tracking, the background exclusion term in Equation 6 can utilize a known background model, which is available for a stationary camera.
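The gradient-based diffusion proposal can be sketched as a noisy gradient step on the energy. This is a minimal sketch under stated assumptions, not the authors' code: the helper names `numeric_gradient` and `propose_gradient_step` are hypothetical, the gradient is taken by central differences for every parameter (the text uses a mean shift vector for the position), and the noise is isotropic Gaussian with standard deviation `noise_std`.

```python
import random


def numeric_gradient(energy, m, eps=1e-4):
    """Central-difference gradient of energy(m) with respect to each parameter."""
    grad = []
    for i in range(len(m)):
        hi, lo = list(m), list(m)
        hi[i] += eps
        lo[i] -= eps
        grad.append((energy(hi) - energy(lo)) / (2.0 * eps))
    return grad


def propose_gradient_step(energy, m, step, noise_std, rng):
    """Draw m' ~ N(m - step * dE/dm, noise_std^2 I): descend the energy, add noise."""
    grad = numeric_gradient(energy, m)
    return [mi - step * gi + rng.gauss(0.0, noise_std)
            for mi, gi in zip(m, grad)]


def quadratic_energy(m):
    """Toy energy with its minimum at (3, -1); stands in for -log posterior."""
    return (m[0] - 3.0) ** 2 + (m[1] + 1.0) ** 2
```

With `noise_std=0` and `step=0.5`, one call on `quadratic_energy` from `[0.0, 0.0]` lands on the minimum `(3, -1)`. With nonzero noise the proposals scatter around the descent target, which is what lets the chain leave shallow local optima.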
In our experiments, tracking with the likelihood that includes the background exclusion term is more robust to changes in the appearance of the object, e.g. when going into shadow, than tracking with the object attraction term alone.

Theoretically, the Markov chain designed should be irreducible and reversible; however, the use of the above data-driven proposal probabilities makes the approach not conform to the theory exactly. First, irreducibility requires that the Markov chain be able to reach any possible point in the solution space. However, in practice, the proposal probabilities of some points are very small, close to zero. For example, the proposal probability of adding a hypothesis at a position where no head candidate is detected nearby is extremely low. With a finite number of iterations, a state including such a hypothesis will, in practice, never be sampled. Although this breaks the completeness of the Markov chain, we argue that skipping the parts of the solution space where no sign of objects is observed brings no harm to the quality of the final solution and makes the search process more efficient. Second, the use of mean shift, which is a non-parametric method, makes the chain irreversible. Mean shift can be seen as an approximation of the gradient, while stochastic gradient descent is essentially a Gibbs sampler [39], which is a special case of the Metropolis-Hastings sampler with acceptance ratio always equal to one [25]. However, mean shift is much faster than a random walk at estimating the parameters of the object. We choose to use these techniques at the loss of some theoretical beauty, because experimentally they make our method much more efficient and the results are good.

C. Incremental Computation

As the MCMC process may need hundreds or more samples to approximate the distribution, we need an efficient method to compute the likelihood for each proposed state. In one iteration of the algorithm, at most two objects may change.
Such a change affects the likelihood only locally; therefore the new likelihood can be computed more efficiently by updating it incrementally, only within the neighborhood of the change (the area associated with the changed objects and those overlapping with them). Take the addition action as an example. When a new human hypothesis is added to the state vector, for the likelihood of the non-object region $P(I^{\bar{S}} \mid \theta)$, we only need to remove those background pixels taken by the new hypothesis. For the likelihood of the object region $P(I^{S} \mid \theta)$, as the new hypothesis may overlap with some existing hypotheses, we need to recompute the visibility of the object regions connected to the new hypothesis and then update the likelihood of these neighboring objects. The incremental computations of the likelihood for the other actions are similar. Although a joint state and a joint likelihood are used, the computation in each iteration is greatly reduced through the incremental computation. This is in contrast to the particle filter, where the evaluation of each particle (joint state) needs the computation of the full joint likelihood.

The appearance models of the tracked objects are updated after processing each frame to adapt to the change in object appearance. We update the object color histogram using an IIR filter, $\bar{p}^{(t)} = \lambda_p\, p^{(t)} + (1 - \lambda_p)\, \bar{p}^{(t-1)}$. We choose to update the appearance conservatively: we use a small $\lambda_p = 0.01$ and stop updating if the object is occluded by more than 25% or its position covariance is too big.

VI. EXPERIMENTAL RESULTS

We have experimented with the system on many types of data and show only some representative results here. We first show results on an outdoor scene video and then on a standard evaluation dataset of indoor scene videos. Video results are submitted as supplementary material. Among all the parameters of our approach, many are natural, meaning that they correspond to measurable physical quantities (e.g. 3D human height); therefore setting their values is straightforward. We use the same set of parameters for all the sequences.
This suggests that our approach is not sensitive to the choice of parameter values. We list here the values of the parameters which are not mentioned in the previous sections. For the size prior (in Sec. IV-D), $\lambda_1 = 0.04$ and $\lambda_2$ = [...]. For the likelihood, $\lambda_f = 0.5$ and $\lambda_b = 0.5$ in Equation 6, $\lambda_S = 25$ in Eqn. 6, and $\lambda_S = 0.005$ in Eqn. 7. For the mixing probabilities of the different types of dynamics, we use $P_{add} = 0.1$, $P_{remove} = 0.1$, $P_{establish} = 0.1$, $P_{break} = 0.1$, $P_{exchange} = 0.1$, and $P_{diff} = 0.5$. We also apply a hard constraint of 25 pixels on the minimum image height of a human.

We also want to comment here on the choice of parameters related to the peakedness of a distribution in sampling algorithms. The image likelihood is usually a combination of a number of components (sites, e.g. pixels). Inevitable simplifications (e.g. independence assumptions) in probabilistic modeling may result in excessive peakedness of the distribution. This affects the performance of sampling algorithms such as MCMC and the particle filter: the samples concentrate at one location (i.e. the highest peak) of the state space, so the algorithms degenerate into greedy algorithms. Eliminating the dependences between different components can be extremely difficult or infeasible. From an engineering point of view, one should set the values of the parameters (e.g. the two $\lambda_S$ values, while keeping their ratio constant) so that the likelihood ratios of different hypotheses are reasonable, so that Markov chains can traverse the solution space efficiently and particle filters can maintain multiple hypotheses. In a similar fashion, simulated annealing has been used in the sampling process to reduce the effect of the peakedness and force convergence [48], [8]; however, the varying temperature means that the samples are not drawn from a single posterior distribution.

A. Evaluation on an Outdoor Scene

We show results on an outdoor video sequence, which we call the Campus Plaza sequence, containing 900 frames. This sequence is captured from a camera above a building gate with a 40° camera tilt angle. The frame size is [...] pixels, and the sampling rate is 30 FPS. In this sequence, 33 humans pass by the scene, with 23 going out of the field of view and 10 going inside a building. The inter-human occlusions in this sequence are large. There are overall 20 occlusion events, 9 of which are heavy occlusions (over 50% of the object is occluded). For MCMC sampling, we use 500 iterations per frame.
We show in Fig. 7 some sample frames from the result on this sequence. The identities of the objects are shown by their ID numbers displayed on the heads. We evaluate the results by trajectory-based errors. Trajectories whose lengths are less than 10 frames are discarded. Among the 33 human objects, the trajectories of 3 objects are broken once (ID 28 → ID 35, ID 31 → ID 32, ID 30 → ID 41, all between frame 387 and frame 447, as marked with arrows in Fig. 7); the rest of the trajectories are correct. Usually the trajectories are initialized once the humans are fully in the scene; some start when the objects are only partially inside. Only the initializations of three objects (objects 31, 50, and 52) are noticeably delayed (by 50, 55, and 60 frames, respectively, after they are fully in the scene). Partial occlusion and/or lack of contrast with the background are the causes of the delays.

To justify our approach of integrated segmentation and tracking, we compare the tracking result with the result of frame-by-frame segmentation as in [53], using frame-based evaluation metrics. With tracking, the detection rate and the false alarm rate are 98.13% and 0.27%, respectively. The detection rate and the false alarm rate on the same sequence using segmentation alone are 92.82% and 0.18%. With tracking, not only are the temporal correspondences obtained, but the detection rate is also increased by a large margin while the false alarm rate is kept low.

B. Evaluation on Indoor Scene Sequences

Fig. 8. Tracking evaluation criteria.

Next, we describe the results of our method on an indoor video set, the CAVIAR video corpus [56] (see footnote 4). We test our system on the 26 "shopping center corridor view" sequences, overall 36,292 frames, captured by a camera looking down towards a corridor. The frame size is [...] pixels, and the sampling rate is 25 FPS. Some 2D-3D point correspondences are given from which the camera can be calibrated. However, we compute the camera parameters by an interactive method [26]. The inter-object occlusion in this set is also intensive.
There are overall 96 occlusion events in this set; 68 of the 96 are heavy occlusions, and 19 of the 96 are almost full occlusions (more than 90% of the object is occluded). Many interactions between humans, such as talking and hand shaking, make this set very difficult for tracking. For MCMC sampling, we again use 500 iterations per frame.

For such a big data set, it is infeasible to enumerate the errors as we did for the Campus Plaza sequence. Instead we define five statistical criteria: 1) the number of mostly tracked trajectories; 2) the number of mostly lost trajectories; 3) the number of trajectory fragments; 4) the number of false trajectories (a result trajectory corresponding to no object); and 5) the frequency of identity switches (identity exchanges between a pair of result trajectories). Fig. 8 illustrates their definitions. These five categories are by no means a complete classification; however, they cover most of the typical errors observed on this set. Other performance measures have been proposed in recent evaluations, such as the Multiple Object Tracking Precision and Accuracy in the CLEAR 2006 evaluation [57]. We do not use these measures because they are less intuitive, as they try to integrate multiple factors into one scalar-valued measure.

Footnote 4: In the provided ground truth, there are 232 trajectories overall. However, 5 of these are mostly out of sight, e.g. only one arm or the head top is visible; we set these as "don't care".

Fig. 7. Selected frames of the tracking results from Campus Plaza (frames 42, 59, 250, 311, 387, 447, 560, and 661). The numbers on the heads show identities. (Please note that the two people who are sitting on the two sides are in the background model and are therefore not detected.)

Table I gives the performance of our method. We developed evaluation software to count the number of mostly tracked trajectories, mostly lost trajectories, false alarms, and fragments automatically. Denote a ground-truth trajectory by $\{G^{(i)}, \ldots, G^{(i+n)}\}$, where $G^{(t)}$ is the object state at the t-th frame; denote a hypothesized trajectory by $\{H^{(j)}, \ldots, H^{(j+m)}\}$. The overlap ratio of the ground-truth object and the hypothesized object at the t-th frame is defined by

$$Overlap(G^{(t)}, H^{(t)}) = \frac{\left|Reg(G^{(t)}) \cap Reg(H^{(t)})\right|}{\left|Reg(G^{(t)}) \cup Reg(H^{(t)})\right|} \qquad (8)$$

where $Reg(\cdot)$ is the image region of the object. If $Overlap(G^{(t)}, H^{(t)}) > 0.5$, we say $\{G^{(t)}, H^{(t)}\}$ is a potential match. The overlap ratio of the ground-truth trajectory and the hypothesized trajectory is defined by

$$Overlap(G^{(i:i+n)}, H^{(j:j+m)}) = \frac{\sum_{t=\max(i,j)}^{\min(i+n,\,j+m)} \delta\left(Overlap(G^{(t)}, H^{(t)}) > 0.5\right)}{\max(i+n,\,j+m) - \min(i,j) + 1} \qquad (9)$$

where $\delta(\cdot)$ is an indicator function. Given that one sequence has $N_G$ ground-truth trajectories $\{G_k\}_{k=1}^{N_G}$ and $N_H$ hypothesized trajectories $\{H_k\}_{k=1}^{N_H}$, we compute the overlap ratios for all ground-truth/hypothesis pairs $\{G_k, H_l\}$; the pairs whose overlap ratios are larger than 0.8 are considered to be potential matches. Then the Hungarian matching algorithm [22] is used to find the best matches, which are considered to be mostly tracked. To count the mostly lost trajectories, we define a recall ratio by replacing the denominator of Eq. 9 with $n + 1$. If for $G_k$ there is no $H_l$ such that the recall ratio between them is larger than 0.2, we consider $G_k$ to be mostly lost. To count the false alarms and fragments, we define a precision ratio by replacing the denominator of Eq. 9 with $m + 1$. If for $H_l$ there is no $G_k$ such that the precision ratio between them is larger than 0.2, we consider $H_l$ a false alarm; if there is a $G_k$ such that the precision between them is larger than 0.8 but the overlap ratio is smaller than 0.8, we consider $H_l$ to be a fragment of $G_k$.

We first count the mostly tracked trajectories and remove the matched parts of the ground-truth tracks. Second, we count the trajectory fragments with a greedy, iterative algorithm. At each round, the fragment with the highest overlap ratio is found, and then the matched part of the ground-truth track is removed; this procedure is repeated until there are no more valid fragments. Lastly, we count the mostly lost trajectories and the false alarms.
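The per-frame overlap of Eq. 8 and the three trajectory ratios (overlap of Eq. 9, recall, precision) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' evaluation software: object regions are approximated by axis-aligned boxes `(x0, y0, x1, y1)`, a trajectory is a dictionary from frame index to box covering a contiguous frame range, and the names `box_iou` and `trajectory_ratios` are hypothetical. The Hungarian matching and the greedy fragment counting are not shown.

```python
def box_iou(a, b):
    """Intersection over union of two boxes (x0, y0, x1, y1), as in Eq. 8."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    iw = max(0, min(ax1, bx1) - max(ax0, bx0))
    ih = max(0, min(ay1, by1) - max(ay0, by0))
    inter = iw * ih
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


def trajectory_ratios(gt, hyp, thr=0.5):
    """Overlap (Eq. 9), recall and precision ratios of two trajectories.

    gt, hyp : dict {frame: box}, each assumed to cover contiguous frames,
              so len(gt) = n + 1 and len(hyp) = m + 1.
    """
    # Frames where both trajectories exist and the boxes match.
    matched = sum(1 for t in gt.keys() & hyp.keys()
                  if box_iou(gt[t], hyp[t]) > thr)
    # Length of the union of the two frame ranges.
    span = max(max(gt), max(hyp)) - min(min(gt), min(hyp)) + 1
    return {
        "overlap": matched / span,
        "recall": matched / len(gt),
        "precision": matched / len(hyp),
    }
```

For example, a ground-truth track over frames 0 to 9 and a hypothesis over frames 5 to 14 that match on their five common frames give an overlap ratio of 5/15, a recall of 0.5, and a precision of 0.5. Under the thresholds in the text, such a hypothesis is neither a mostly-tracked match (overlap not above 0.8) nor a false alarm (precision above 0.2).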
The counting procedure cannot classify all ground-truth and hypothesized tracks; the unlabeled ones are mainly due to identity switches. We count the frequency of identity switches visually. Some sample frames and results are shown in Fig. 9. Most of the missed detections are due to humans wearing clothing with color very similar to that of the background, so that some part of the object is misclassified as background; see frame 1413 of Fig. 9(b) for an example. The fragmentation of trajectories and the ID switches are mainly due to full occlusions; see frame 496 of Fig. 9(a) and frame 316 of Fig. 9(b) for examples. Our method can deal with partial occlusion well. For full occlusion, classifying an object as going into an occluded state and associating it when it reappears could potentially improve the performance. The false alarms are mainly due to shadows, reflections, and sudden brightness changes which are misclassified as foreground; see frame 563 of Fig. 9(a). A more sophisticated background model and shadow model (e.g. [32]) could be used to improve the result. In general, our method performs reasonably well on the CAVIAR set, though not as well as on the Campus Plaza sequence, mainly due to the above-mentioned difficulties. The running speed of the system is about 2 FPS with a 2.8 GHz Pentium IV CPU. The implementation is in C++ without any special optimization.

TABLE I. Results of performance evaluations on the CAVIAR set (227 trajectories). MT: mostly tracked, ML: mostly lost, Fgmt: fragment, FA: false alarm, IDS: identity switch.

             MT      ML      Fgmt    FA      IDS
Number       [...]   [...]   [...]   [...]   [...]
Percentage   62.1%   5.3%    [...]   [...]   [...]

VII. CONCLUSION AND FUTURE WORK

We have presented a principled approach to simultaneously detect and track humans in a crowded scene acquired from a single stationary camera.
We take a model-based approach and formulate the problem as a Bayesian MAP estimation problem: we compute the best interpretation of the image observations collectively by the 3D human shape model, the acquired human appearance model, the background appearance model, the camera model, the assumption that humans move on a known ground plane, and the object priors. The image is modeled as a composition of an unknown number of possibly overlapping objects and a background. The inference is performed by an MCMC-based approach that explores the joint solution space. Data-driven proposal probabilities are used to direct the Markov chain dynamics. Experiments and evaluations on challenging real-life data show promising results.

The success of our approach mainly lies in the integration of the top-down Bayesian formulation, which follows the image formation process, and the bottom-up features that are directly extracted from images. The integration has the benefit of both the computational efficiency of image features and the optimality of a Bayesian formulation.

This work could be improved or extended in several ways. 1) It could be extended to track multiple classes of objects (e.g. humans and cars) by adding model switching to the MCMC dynamics. 2) Tracking that operates in a 2-frame interval has a very local view; therefore ambiguities inevitably exist, especially in the case of tracking fully occluded objects. Analysis at the level of trajectories may resolve the local ambiguities (e.g. [29]). The analysis may take into account prior knowledge of the valid object trajectories, including their starting and ending points.

APPENDIX I
SINGLE OBJECT TRACKING WITH BACKGROUND KNOWLEDGE USING MEAN SHIFT

Denote by $\bar{\mathbf{p}}$, $\mathbf{p}(\mathbf{u})$, and $\mathbf{b}(\mathbf{u})$ the color histogram of the object learnt online, the color histogram of the object at location $\mathbf{u}$, and the color histogram of the background at the corresponding region, respectively. Let $\{\mathbf{x}_i\}_{i=1,\ldots,n}$ be the pixel locations in the region with the object center at $\mathbf{u}$. A kernel with profile $k(\cdot)$ is used to assign smaller weights to the pixels farther away from the center.
An m-bin color histogram $\mathbf{p}(\mathbf{u}) = \{p_j(\mathbf{u})\}_{j=1,\ldots,m}$ is constructed as

$$p_j(\mathbf{u}) = \sum_{i=1}^{n} k\left(\left\|\frac{\mathbf{u} - \mathbf{x}_i}{h}\right\|^2\right) \delta\left[b_f(\mathbf{x}_i) - j\right],$$

where $h$ is the kernel bandwidth, the function $b_f(\cdot)$ maps the pixel location to the corresponding histogram bin, and $\delta$ is the delta function. $\bar{\mathbf{p}}$ and $\mathbf{b}$ are constructed similarly, with $b_b(\cdot)$ denoting the bin mapping on the background image. We would like to optimize

$$L(\mathbf{u}) = -\lambda_b \underbrace{B(\mathbf{p}(\mathbf{u}), \mathbf{b}(\mathbf{u}))}_{L_1(\mathbf{u})} + \lambda_f \underbrace{B(\mathbf{p}(\mathbf{u}), \bar{\mathbf{p}})}_{L_2(\mathbf{u})} \qquad (10)$$

where $B(\cdot)$ is the Bhattacharyya coefficient. By applying a Taylor expansion at $\mathbf{p}(\mathbf{u}_0)$ and $\mathbf{b}(\mathbf{u}_0)$ ($\mathbf{u}_0$ is a predicted position of the object), we have

$$L_1(\mathbf{u}) = B(\mathbf{p}(\mathbf{u}), \mathbf{b}(\mathbf{u})) = \sum_{u=1}^{m} \sqrt{p_u(\mathbf{u})\, b_u(\mathbf{u})} \approx B(\mathbf{u}_0) + B'_{\mathbf{p}}(\mathbf{u}_0)\left(\mathbf{p}(\mathbf{u}) - \mathbf{p}(\mathbf{u}_0)\right) + B'_{\mathbf{b}}(\mathbf{u}_0)\left(\mathbf{b}(\mathbf{u}) - \mathbf{b}(\mathbf{u}_0)\right)$$

$$= c_1 + \frac{1}{2}\sum_{u=1}^{m} \sqrt{\frac{b_u(\mathbf{u}_0)}{p_u(\mathbf{u}_0)}}\; p_u(\mathbf{u}) + \frac{1}{2}\sum_{u=1}^{m} \sqrt{\frac{p_u(\mathbf{u}_0)}{b_u(\mathbf{u}_0)}}\; b_u(\mathbf{u}) = c_1 + \sum_{i=1}^{n} w_i^b\, k\left(\left\|\frac{\mathbf{u} - \mathbf{x}_i}{h}\right\|^2\right) \qquad (11)$$

where

$$w_i^b = \frac{1}{2} \sum_{u=1}^{m} \left( \sqrt{\frac{b_u(\mathbf{u}_0)}{p_u(\mathbf{u}_0)}}\; \delta\left[b_f(\mathbf{x}_i) - u\right] + \sqrt{\frac{p_u(\mathbf{u}_0)}{b_u(\mathbf{u}_0)}}\; \delta\left[b_b(\mathbf{x}_i) - u\right] \right).$$

Similarly, as in [6],

$$L_2(\mathbf{u}) = B(\mathbf{p}(\mathbf{u}), \bar{\mathbf{p}}) \approx c_2 + \frac{1}{2}\sum_{u=1}^{m} \sqrt{\frac{\bar{p}_u}{p_u(\mathbf{u}_0)}}\; p_u(\mathbf{u}) = c_2 + \sum_{i=1}^{n} w_i^f\, k\left(\left\|\frac{\mathbf{u} - \mathbf{x}_i}{h}\right\|^2\right) \qquad (12)$$

where $w_i^f = \frac{1}{2}\sum_{u=1}^{m} \sqrt{\bar{p}_u / p_u(\mathbf{u}_0)}\; \delta\left[b_f(\mathbf{x}_i) - u\right]$; therefore

$$L(\mathbf{u}) \approx c + \sum_{i=1}^{n} \underbrace{\left(\lambda_f w_i^f - \lambda_b w_i^b\right)}_{w_i}\, k\left(\left\|\frac{\mathbf{u} - \mathbf{x}_i}{h}\right\|^2\right) \qquad (13)$$

where $c = \lambda_f c_2 - \lambda_b c_1$ collects the constant terms. The last term of $L(\mathbf{u})$ is the density estimate computed with kernel profile $k(\cdot)$ at $\mathbf{u}$. The mean shift algorithm with negative weights [4] applies. By using the Epanechnikov profile [6], $L(\mathbf{u})$ will be increased with the new location moved to

$$\mathbf{u}' = \frac{\sum_{i=1}^{n} \mathbf{x}_i\, w_i}{\sum_{i=1}^{n} w_i}. \qquad (14)$$

Fig. 9. Selected frames of the tracking results from the CAVIAR set. (a) Sequence ThreePastShop2cor; (b) Sequence TwoEnterShop2cor.

ACKNOWLEDGMENT

This research was funded, in part, by the U.S. Government VACE program.

REFERENCES

[1] G. Borgefors. Distance transformations in digital images. Computer Vision, Graphics, and Image Processing, 34(3).
[2] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. IEEE Trans. Pattern Analysis and Machine Intelligence, 23(11).
[3] I. Cohen and G. Medioni. Detecting and tracking moving objects for video surveillance. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. II.
[4] R. T. Collins. Mean-shift blob tracking through scale space. In Proc. Conf. Computer Vision and Pattern Recognition, vol. II.
[5] D. Comaniciu and P. Meer. Mean shift: A robust approach toward feature space analysis. IEEE Trans. Pattern Analysis and Machine Intelligence, 24(5).
[6] D. Comaniciu and P. Meer. Kernel-based object tracking. IEEE Trans. Pattern Analysis and Machine Intelligence, 25(5).
[7] L. Davis, V. Philomin, and R. Duraiswami. Tracking humans from a moving platform. In Proc. Int'l Conf. Pattern Recognition, vol. IV.
[8] J. Deutscher, A. Blake, and I. Reid. Articulated body motion capture by annealed particle filtering. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. II.
[9] A. Elgammal and L. Davis. Probabilistic framework for segmenting people under occlusion. In Proc. Int'l Conf. Computer Vision, vol. II.
[10] A. Elgammal, R. Duraiswami, D. Harwood, and L. Davis. Background and foreground modeling using non-parametric kernel density estimation for visual surveillance. Proc. IEEE, 90(7).
[11] F. Fleuret, R. Lengagne, and P. Fua. Fixed point probability field for complex occlusion handling. In Proc. Int'l Conf. Computer Vision, vol. I.
[12] D. Gatica-Perez, J.-M. Odobez, S. Ba, K. Smith, and G. Lathoud. Tracking people in meetings with particles. In Proc. Int'l Workshop on Image Analysis for Multimedia Interactive Service.
[13] D. Gavrila and V. Philomin. Real-time object detection for smart vehicles. In Proc.
Int'l Conf. Computer Vision, I:87-93.
[14] P. Green. Trans-dimensional Markov chain Monte Carlo. Oxford University Press.
[15] S. Haritaoglu, D. Harwood, and L. Davis. W4: Real-time surveillance of people and their activities. IEEE Trans. Pattern Analysis and Machine Intelligence, 22(8).
[16] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press.
[17] W. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97-109.
[18] S. Hongeng and R. Nevatia. Multi-agent event recognition. In Proc. Int'l Conf. Computer Vision, II:84-91.
[19] M. Isard and J. MacCormick. BraMBLe: A Bayesian multiple-blob tracker. In Proc. Int'l Conf. Computer Vision, II:34-41.
[20] J. Kang, I. Cohen, and G. Medioni. Continuous tracking within and across camera streams. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. I.
[21] Z. Khan, T. Balch, and F. Dellaert. MCMC-based particle filtering for tracking a variable number of interacting targets. IEEE Trans. Pattern Analysis and Machine Intelligence, 27(11).
[22] H. W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, II:83-87.
[23] M.-W. Lee and I. Cohen. A model-based approach for estimating human 3D poses in static images. IEEE Trans. Pattern Analysis and Machine Intelligence, 28(6).
[24] A. Lipton, H. Fujiyoshi, and R. Patil. Moving target classification and tracking from real-time video. In Proc. DARPA Image Understanding Workshop.
[25] J. Liu. Metropolized Gibbs sampler. In Monte Carlo Strategies in Scientific Computing. Springer-Verlag NY Inc.
[26] F. Lv, T. Zhao, and R. Nevatia. Self-calibration of a camera from video of a walking human. IEEE Trans. Pattern Analysis and Machine Intelligence, 28(9).
[27] J. MacCormick and A. Blake. A probabilistic exclusion principle for tracking multiple objects. In Proc. Int'l Conf. Computer Vision, vol. I.
[28] A. Mittal and L. Davis. M2Tracker: A multi-view approach to segmenting and tracking people in a cluttered scene using region-based stereo. In Proc. European Conf. Computer Vision, II:18-33.
[29] P. Nillius, J. Sullivan, and S. Carlsson. Multi-target tracking - linking identities using Bayesian network inference. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. II.
[30] K. Okuma, A. Taleghani, N. de Freitas, J. Little, and D. Lowe. A boosted particle filter: Multitarget detection and tracking. In Proc. European Conf. Computer Vision, I:28-39.
[31] C. Papageorgiou, T. Evgeniou, and T. Poggio. A trainable pedestrian detection system. In Proc. of Intelligent Vehicles.
[32] A. Prati, I. Mikic, M. Trivedi, and R. Cucchiara. Detecting moving shadows: Algorithms and evaluation. IEEE Trans. Pattern Analysis and Machine Intelligence, 25(7).
[33] P. Pérez, C. Hue, J. Vermaak, and M. Gangnet. Color-based probabilistic tracking. In Proc. European Conf. Computer Vision, vol. I.
[34] D. Ramanan, D. Forsyth, and A. Zisserman. Strike a pose: Tracking people by finding stylized poses. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. I.
[35] C. Rasmussen and G. D. Hager. Probabilistic data association methods for tracking complex visual objects. IEEE Trans. Pattern Analysis and Machine Intelligence, 23(6).
[36] J. Rittscher, P. Tu, and N. Krahnstoever. Simultaneous estimation of segmentation and shape. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. II.
[37] R. Rosales and S. Sclaroff. 3D trajectory recovery for tracking multiple objects and trajectory guided recognition of actions. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. II.
[38] H. Rue and M. A. Hurn. Bayesian object identification. Biometrika, 86(3).
[39] S. Geman and C. R. Hwang. Diffusion for global optimization. SIAM J. on Control and Optimization, 24(5).
[40] N. Siebel and S. Maybank. Fusion of multiple tracking algorithm for robust people tracking. In Proc. European Conf. Computer Vision, vol. IV.
[41] K. Smith, D. Gatica-Perez, and J.-M. Odobez. Using particles to track varying numbers of interacting people. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. I.
[42] X. Song and R. Nevatia. Combined face-body tracking in indoor environment. In Proc. Int'l Conf. Pattern Recognition, vol. IV.
[43] X. Song and R. Nevatia. A model-based vehicle segmentation method for tracking. In Proc. Int'l Conf. Computer Vision, vol. II.
[44] C. Stauffer and E. Grimson. Learning patterns of activity using real-time tracking. IEEE Trans. Pattern Analysis and Machine Intelligence, 22(8).
[45] C. Tao, H. Sawhney, and R. Kumar. Object tracking with Bayesian estimation of dynamic layer representations. IEEE Trans. Pattern Analysis and Machine Intelligence, 24(1):75-89.
[46] H. Tao, H. Sawhney, and R. Kumar. A sampling algorithm for tracking multiple objects. In Proc. Workshop of Vision Algorithms.
[47] L. Tierney. Markov chain concepts related to sampling algorithms. In Markov Chain Monte Carlo in Practice, pp. 59-74.
[48] Z. W. Tu and S. C. Zhu. Image segmentation by data-driven Markov chain Monte Carlo. IEEE Trans. Pattern Analysis and Machine Intelligence, 24(5).
[49] Y. Weiss. Correctness of local probability propagation in graphical models with loops. Neural Computation, 12(1):1-41.
[50] B. Wu and R. Nevatia. Detection of multiple, partially occluded humans in a single image by Bayesian combination of edgelet part detectors. In Proc. Int'l Conf. Computer Vision, I:90-97.
[51] T. Yu and Y. Wu. Collaborative tracking of multiple targets. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. I.
[52] T. Zhao, M. Aggarwal, R. Kumar, and H. Sawhney. Real-time wide area multi-camera stereo tracking. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. I.
[53] T. Zhao and R. Nevatia. Bayesian human segmentation in crowded situations. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. II.
[54] T. Zhao and R. Nevatia. Tracking multiple humans in complex situations. IEEE Trans. Pattern Analysis and Machine Intelligence, 26(9).
[55] T. Zhao and R. Nevatia. Tracking multiple humans in crowded environment. In Proc. IEEE Conf. Computer Vision and Pattern Recognition,

14 II:406-413, 2004. [56] The CAVIAR data set. http://homepages.nf.ed.ac.uk/ rbf/caviar/ [57] CLEAR06 Evaluaton Campagn and Workshop. http://sl.ra. uka.


Tao Zhao received the BEng degree from the Department of Computer Science and Technology, Tsinghua University, China. He received the MSc and the PhD degrees from the Department of Computer Science at the University of Southern California in 2001 and 2003, respectively. He was with Sarnoff Corporation, Princeton, New Jersey, starting in 2003. He is currently with Intuitive Surgical Incorporated, Sunnyvale, California, working on computer vision applications for medicine and surgery. His research interests include computer vision, machine learning, and pattern recognition. His experience has been in visual surveillance, human motion analysis, aerial image analysis, and medical image analysis. He is a member of the IEEE and the IEEE Computer Society.

Ram Nevatia received his Ph.D. degree from Stanford University with specialty in the area of computer vision. He has been with the University of Southern California since 1975, where he is currently a Professor of Computer Science and Electrical Engineering. He is also Director of the Institute for Robotics and Intelligent Systems. He has been principal investigator of major Government funded computer vision research programs for over 25 years. Dr. Nevatia has made important contributions to several areas of computer vision including the topics of shape description, object recognition, stereo analysis, aerial image analysis, tracking of humans, and event recognition. Dr. Nevatia is a Fellow of the Institute of Electrical and Electronics Engineers (IEEE) and of the American Association for Artificial Intelligence (AAAI). He is an associate editor for the Pattern Recognition and the Computer Vision and Image Understanding journals. Dr. Nevatia is author of two books, several book chapters, and over 100 refereed technical papers.

Bo Wu received the B.Eng and M.Eng degrees from the Department of Computer Science and Technology, Tsinghua University, Beijing, China, in 2002 and 2004 respectively. He is currently a PhD candidate at the Computer Science Department, University of Southern California, Los Angeles. His research interests include computer vision, machine learning, and pattern recognition. He is a student member of the IEEE Computer Society.
