Modeling inter-camera space time and appearance relationships for tracking across non-overlapping views


Computer Vision and Image Understanding 109 (2008)

Modeling inter-camera space time and appearance relationships for tracking across non-overlapping views

Omar Javed a,*, Khurram Shafique b, Zeeshan Rasheed a, Mubarak Shah b

a ObjectVideo, Sunrise Valley Dr., Reston, VA 20171, USA
b University of Central Florida, Orlando, FL 32816, USA

Received 7 December 2005; accepted 22 January 2007
Available online 27 February 2007

Abstract

Tracking across cameras with non-overlapping views is a challenging problem. Firstly, the observations of an object are often widely separated in time and space when viewed from non-overlapping cameras. Secondly, the appearance of an object in one camera view might be very different from its appearance in another camera view due to the differences in illumination, pose and camera properties. To deal with the first problem, we observe that people or vehicles tend to follow the same paths in most cases, i.e., roads, walkways, corridors etc. The proposed algorithm uses this conformity in the traversed paths to establish correspondence. The algorithm learns this conformity and hence the inter-camera relationships in the form of a multivariate probability density of space time variables (entry and exit locations, velocities, and transition times) using kernel density estimation. To handle the appearance change of an object as it moves from one camera to another, we show that all brightness transfer functions from a given camera to another camera lie in a low dimensional subspace. This subspace is learned by using probabilistic principal component analysis and used for appearance matching. The proposed approach does not require explicit inter-camera calibration; rather, the system learns the camera topology and subspace of inter-camera brightness transfer functions during a training phase. Once the training is complete, correspondences are assigned using the maximum likelihood (ML) estimation framework using both location and appearance cues. Experiments with real world videos are reported which validate the proposed approach. © 2007 Elsevier Inc.
All rights reserved.

Keywords: Multi-camera appearance models; Non-overlapping cameras; Scene analysis; Multi-camera tracking; Surveillance

1. Introduction

There is a major effort underway in the vision community to develop fully automated surveillance and monitoring systems [3,1]. Such systems have the advantage of providing continuous 24 h active warning capabilities and are especially useful in the areas of law enforcement, national defence, border control and airport security. One important requirement for an automated surveillance system is the ability to determine the location of each object in the environment at each time instant. This problem of estimating the trajectory of an object as the object moves around in a scene is known as tracking, and it is one of the major topics of research in computer vision. In most cases, it is not possible for a single camera to observe the complete area of interest because sensor resolution is finite, and the structures in the scene limit the visible areas. Therefore, surveillance of wide areas requires a system with the ability to track objects while observing them through multiple cameras. Moreover, it is usually not feasible to completely cover large areas with cameras having overlapping views due to economic and/or computational reasons. Thus, in realistic scenarios, the system should be able to handle multiple cameras with non-overlapping fields of view. Also, it is preferable that the tracking system does not require camera calibration or complete site modeling, since the luxury of fully calibrated cameras or site models is not available in most situations. In this paper,

* Corresponding author. E-mail address: omar.javed@gmail.com (O. Javed).

we present an algorithm that caters for all these constraints by employing inter-camera appearance and space time relationships to track people across non-overlapping fields of view.

1.1. An overview of the proposed approach

Our focus is on the problem of multi-camera tracking in a system of non-overlapping cameras. We assume that the single camera tracking problem is solved. The task of a multi-camera tracker is to establish correspondence between the observations across cameras, i.e., given a set of tracks in each camera, we want to find which of these tracks belong to the same object in the real world. We accomplish this by first using the observations of objects, passing through the system of cameras in a training phase, to discover the relationships between the cameras. For example, suppose two cameras A and B are successively arranged alongside a walkway, see Fig. 1. Suppose people moving along one direction of the walkway that are initially observed in camera A are also observed entering camera B after a certain time interval. People can take many paths across A and B. However, due to physical and practical constraints, people will follow some paths more often than others. Thus, the locations of exits and entrances between cameras, the direction of movement and the average time taken to reach from A to B can be used as cues to constrain correspondences. We refer to these cues as space time cues and exploit them to learn the inter-camera relationships. The inter-camera relationships are learned in the form of a probability density function (pdf) of space time parameters (i.e., the probability of an object entering a certain camera at a certain time given the location, time and velocity of its exit from another camera) from the training data. Instead of imposing assumptions about the form of this pdf, we let the data speak for itself [20] by estimating the pdf using kernel density estimators. A commonly used cue for tracking in a single camera is the appearance of the objects.
The appearance of an object can be modeled by its color or brightness histograms, and it is a function of scene illumination, object geometry, object surface material properties (e.g., surface albedo) and the camera parameters. Among all these, only the object surface material properties remain constant as an object moves across cameras. Thus, the color distribution of an object can be fairly different when viewed from two different cameras. One way to match appearances in different cameras is by finding a transformation that maps the appearance of an object in one camera image to its appearance in the other camera image. However, for a given pair of cameras, this transformation is not unique and also depends upon the scene illumination and camera parameters. In this paper, we show that despite depending upon a large number of parameters, all such transformations lie in a low dimensional subspace for a given pair of cameras. The proposed method learns this subspace of mappings for each pair of cameras from the training data by using probabilistic principal component analysis. Thus, given appearances in two different cameras, and the subspace of brightness transfer functions learned during the training phase, we can estimate the probability that the transformation between the appearances lies in the learnt subspace. We present an ML estimation framework to use these cues in a principled manner for tracking. The correspondence probability, i.e., the probability that two observations originate from the same object, depends on both the space time information and the appearance. Track assignment is achieved by maximizing the correspondence likelihood. This is done by converting the ML estimation problem into a problem of finding the path cover of a directed graph, for which an optimal solution can be efficiently obtained. In Section 2, we discuss related research. In Section 3, a probabilistic formulation of the problem is presented. Learning of inter-camera spatio-temporal and appearance relationships is discussed in Sections 4 and 5, respectively.
In Section 6, a maximum likelihood solution to find correspondences is given. Results are presented in Section 7.

2. Related work

In general, multi-camera tracking methods differ from each other on the basis of their assumption of overlapping or non-overlapping views, explicit calibration vs learning the inter-camera relationship, type of calibration, use of 3D position of objects, and/or features used for establishing correspondences. In this paper, we organize the multi-camera tracking literature into two major categories based on the requirement of overlapping or non-overlapping views.

2.1. Multi-camera tracking methods requiring overlapping views

Fig. 1. The figure shows two possible paths an object can take from Camera A to B.

A large amount of work on multi-camera surveillance assumes overlapping views. Jain and Wakimoto [14] used calibrated cameras and an environmental model to obtain the 3D location of a person. The fact that multiple views of

the same person are mapped to the same 3D location was used for establishing correspondence. Cai and Aggarwal [2] used multiple calibrated cameras for surveillance. Geometric and intensity features were used to match objects for tracking. These features were modeled as multivariate Gaussians and the Mahalanobis distance measure was used for matching. Chang and Gong [33] used the topmost point on an object detected in one camera to compute its associated epipolar line in other cameras. The distance between the epipolar line and the object detected in the other camera was used to constrain correspondence. In addition, height and color were also used as features for tracking. The correspondences were obtained by combining these features using a Bayesian network. Dockstader and Tekalp [6] also employed Bayesian networks for tracking and occlusion reasoning across calibrated cameras with overlapping views. Sparse motion estimation and appearance were used as features. Mittal and Davis [25] used a region-based stereo algorithm to estimate the depth of points potentially lying on foreground objects and projected them on the ground plane. The objects were located by examining the clusters of the projected points. Kang et al. [17] presented a method for tracking in stationary and pan-tilt-zoom cameras. The ground planes in the moving and stationary cameras were registered. The moving camera sequences were stabilized by using affine transformations. The location of each object was then projected into a global coordinate frame for tracking. The object appearance was modeled by partitioning the object region into its polar representation. In each partition a Gaussian distribution modeled the color variation. Lee et al. [21] proposed an approach for tracking in cameras with overlapping FOVs that did not require explicit calibration. The camera calibration information was recovered by matching motion trajectories obtained from different views, and plane homographies were computed from the most frequent matches.
Khan and Shah [19] avoided explicit calibration by using the field of view (FOV) line constraints to hand off labels from one camera to another. The FOV information was learned during a training phase. Using this information, when an object was viewed in one camera, all the other cameras in which the object was visible could be predicted. Tracking in individual cameras needed to be resolved before handoff could occur. Most of the above mentioned tracking methods require a large overlap in the FOVs of the cameras. This requirement is usually prohibitive in terms of cost and computational resources for surveillance of wide areas.

2.2. Multi-camera tracking methods for non-overlapping views

To track people in an environment not fully covered by the camera fields of view, Collins et al. [4] developed a system consisting of multiple calibrated cameras and a site model. Normalized cross correlation of detected objects and their location on the 3D site model were used for tracking. Huang and Russell [13] presented a probabilistic approach for tracking vehicles across two cameras on a highway. The solution presented was application specific, i.e., assumption of vehicles traveling in one direction, vehicles being in one of three lanes, and solution formulation for only two calibrated cameras. The appearance was modeled by the mean of the color of the whole object, which is not enough to distinguish between multi-colored objects like people. Transition times were modeled as Gaussian distributions and the initial transition probabilities were assumed to be known. The problem was transformed into a weighted assignment problem for establishing correspondence. Huang and Russell also provided an online version of their correspondence algorithm. The online algorithm trades off correct correspondence accuracy with solution space coverage, which forced them to commit early and possibly make erroneous correspondences. Kettnaker and Zabih [18] used a Bayesian formulation of the problem of reconstructing the paths of objects across multiple cameras.
Their system required manual input of the topology of allowable paths of movement and the transition probabilities. The appearances of objects were represented by using histograms. In Kettnaker and Zabih's formulation, the positions, velocities and transition times of objects across cameras were not jointly modeled. However, this assumption does not hold in practice, as these features are usually highly correlated. Ellis et al. [23] determined the topology of a camera network by using a two stage algorithm. First the entry and exit zones of each camera were determined, then the links between these zones across cameras were found using the co-occurrence of entry and exit events. The proposed method assumes that correct correspondences will cluster in the feature space (location and time) while the wrong correspondences will generally be scattered across the feature space. The basic assumption is that if an entry and exit at a certain time interval are more likely than random chance, then they should have a higher likelihood of being linked. Recently, Stauffer [31] proposed an improved linking method which tested the hypothesis that the correlation between exit and entry events that may or may not contain valid object transitions is similar to the expected correlation when there are no valid transitions. This allowed the algorithm (unlike [23]) to handle the case where exit-entrance events may be correlated, but the correlation is not due to valid object transitions. Rahimi and Darrell [28] proposed a method to reconstruct the complete path of an object as it moved in a scene observed by non-overlapping cameras and to recover the ground plane calibration of the cameras. They modeled the dynamics of the moving object as a Markovian process. Given the location and velocity of the object from the multiple cameras, they estimated the trajectory most compatible with the object dynamics using a non-linear minimization scheme. The authors assumed that the objects moved on a ground plane and that all trajectory data of the object was available.

Therefore, the proposed approach was not suitable for an online implementation. Porikli [27] proposed a method to match object appearances over non-overlapping cameras. In his approach, a brightness transfer function (BTF) is computed for every pair of cameras, such that the BTF maps an observed color value in one camera to the corresponding observation in the other camera. Once such a mapping is known, the correspondence problem is reduced to the matching of transformed histograms or appearance models. However, this mapping, i.e., the BTF, varies from frame to frame depending on a large number of parameters that include illumination, scene geometry, exposure time, focal length, and aperture size of each camera. Thus, a single pre-computed BTF cannot usually be used to match objects for moderately long sequences. Recently, Shan et al. [30] presented an unsupervised approach to learn edge measures for appearance matching between non-overlapping views. The matching was performed by computing the probability of two observations from two cameras being generated by the same or different objects. Gaussian pdfs were used to compute the same/different probabilities. The proposed solution required the edge images of vehicles to be registered together. Note that the requirement of registering object images might not be satisfiable for non-rigid objects like pedestrians. Moreover, this requirement also constrains the views of the objects in the different cameras to be somewhat similar. In this paper, we propose inter-camera appearance and space time relationship models for tracking that do not assume explicit camera calibration, a site model, presence of a single ground plane across cameras, a particular non-overlapping camera topology, constant illumination, or constant camera parameters, for example, focal length or exposure. In the next section we present a probabilistic formulation of the multi-camera tracking problem.
3. Formulation of the multi-camera tracking problem

Suppose that we have a system of r cameras C_1, C_2, ..., C_r with non-overlapping views. Further, assume that there are n objects p_1, p_2, ..., p_n in the environment (the number of objects is not assumed to be known). Each of these objects is viewed from different cameras at different time instants. Assume that the task of single camera tracking is already solved, and let O be the set of all observations. Moreover, let O_j = {O_{j,1}, O_{j,2}, ..., O_{j,m_j}} be the set of m_j observations that were observed by the camera C_j. Each observation O_{j,a} is generated by an object in the field of view of camera C_j. The observations consist of two features, the appearance of the object O_{j,a}(app) and the space time features of the object O_{j,a}(st) (location, velocity, time etc.). It is reasonable to assume that O_{j,a}(app) and O_{j,a}(st) are independent of each other. The problem of multi-camera tracking is to find which of the observations in the system of cameras belong to the same object. It is helpful to view the set of observations of each object as a chain of observations, with earlier observations preceding the later ones. The task of grouping the observations of each object can then be seen as linking the consecutive observations in each such chain. Since we have assumed that the single camera tracking problem is solved, the multi-camera tracking task is to link the observations of an object exiting one camera to its observations entering another camera, as the object moves through the system of cameras. For a formal definition of the problem, let a hypothesized correspondence between two consecutive observations, i.e., exit from one camera and entrance into another, O_{i,a} and O_{j,b} respectively, be denoted as k_{i,a}^{j,b}. Moreover, let \phi_{k_{i,a}^{j,b}} be a binary random variable which is true if and only if k_{i,a}^{j,b} is a valid hypothesis, i.e., O_{i,a} and O_{j,b} are consecutive observations of the same object. We need to find a set of correspondences K = {k_{i,a}^{j,b}, ...} such that k_{i,a}^{j,b} \in K if and only if \phi_{k_{i,a}^{j,b}} is true.
Let R be the solution space of the multi-camera tracking problem. From the above discussion, we know that each observation of an object is preceded or succeeded by at most one observation (of the same object). Hence, if K is a candidate solution in R, then for all {k_{i,a}^{j,b}, k_{p,c}^{r,e}} \subseteq K, (i,a) \neq (p,c) \wedge (j,b) \neq (r,e). In addition, let \Phi_K be a random variable which is true if and only if K represents a valid set of correspondences, i.e., all correspondences are correctly established. We want to find a feasible solution in the space R of all feasible solutions that maximizes the likelihood, i.e.,

K' = \arg\max_{K \in R} P(O \mid \Phi_K = \text{true}).

Assuming that each correspondence, i.e., a matching between two observations, is conditionally independent of other observations and correspondences, we have:

P(O \mid \Phi_K = \text{true}) = \prod_{k_{i,a}^{j,b} \in K} P(O_{i,a}, O_{j,b} \mid \phi_{k_{i,a}^{j,b}} = \text{true}).   (1)

Using the above equation along with the independence of the observations O_{j,a}(app) and O_{j,a}(st), for all a and j, we have:

P(O \mid \Phi_K = \text{true}) = \prod_{k_{i,a}^{j,b} \in K} P(O_{i,a}(app), O_{j,b}(app) \mid \phi_{k_{i,a}^{j,b}} = \text{true}) P(O_{i,a}(st), O_{j,b}(st) \mid \phi_{k_{i,a}^{j,b}} = \text{true}).   (2)

Thus the following term gives us the solution:

K' = \arg\max_{K \in R} \prod_{k_{i,a}^{j,b} \in K} P(O_{i,a}(app), O_{j,b}(app) \mid \phi_{k_{i,a}^{j,b}} = \text{true}) P(O_{i,a}(st), O_{j,b}(st) \mid \phi_{k_{i,a}^{j,b}} = \text{true}).
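To make the maximization concrete: when attention is restricted to a single camera pair, choosing the correspondence set that maximizes the summed log-likelihoods reduces to a maximum-weight assignment between exit and entry observations. The sketch below is an illustration only, with hypothetical names and a brute-force search; it assumes every exit has a matching entry, whereas the paper solves the general multi-camera case optimally as a path cover of a directed graph (Section 6).

```python
# Illustrative sketch (not the paper's path-cover algorithm): pick the
# one-to-one pairing of exits from C_i with entries into C_j that
# maximizes the total correspondence log-likelihood. Brute force over
# permutations is fine for small n and keeps the example dependency-free.
from itertools import permutations
import math

def best_correspondences(log_lik):
    """log_lik[a][b] = log P(O_{i,a}, O_{j,b} | same object).
    Returns the list of (exit, entry) index pairs maximizing the sum."""
    n = len(log_lik)
    best, best_score = None, float("-inf")
    for perm in permutations(range(n)):
        score = sum(log_lik[a][perm[a]] for a in range(n))
        if score > best_score:
            best, best_score = perm, score
    return list(enumerate(best))

# Toy likelihoods for 3 exits vs 3 entries (rows: exits, cols: entries).
probs = [[0.70, 0.20, 0.10],
         [0.15, 0.60, 0.25],
         [0.05, 0.15, 0.80]]
L = [[math.log(p) for p in row] for row in probs]
print(best_correspondences(L))  # → [(0, 0), (1, 1), (2, 2)]
```

In practice the paper's directed-graph formulation also handles objects that leave the system (unmatched exits), which this toy assignment does not.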

This is equivalent to maximizing the log-likelihood:

K' = \arg\max_{K \in R} \sum_{k_{i,a}^{j,b} \in K} \log\left[ P(O_{i,a}(app), O_{j,b}(app) \mid \phi_{k_{i,a}^{j,b}} = \text{true}) P(O_{i,a}(st), O_{j,b}(st) \mid \phi_{k_{i,a}^{j,b}} = \text{true}) \right].   (3)

In order to obtain the ML estimate we need to know the space time and appearance probability density functions. This issue is discussed in the next two sections.

4. Learning inter-camera space time probabilities

Learning is carried out by assuming that the correspondences are known. One way to achieve this is to use only appearance matching for establishing correspondence, since the space time relationships between cameras are unknown. Note that during training it is not necessary to correspond all objects across cameras. Only the best matches can be used for learning. Suppose we have a sample S consisting of n, d dimensional, data points x_1, x_2, ..., x_n from a multivariate distribution p(x). If the data is continuous, then the Parzen windows technique [7,35] can be used to estimate its density. In our case, the position/time feature vector x, used for learning the space time pdf from camera C_i to C_j, i.e., P(O_{i,a}(st), O_{j,b}(st) \mid \phi_{k_{i,a}^{j,b}} = \text{true}), is a vector consisting of the exit and entry locations in the cameras, the indices of the entry and exit cameras, the exit velocities, and the time interval between the exit and entry events. The camera indices are treated as discrete features, while the rest of the vector components are treated as continuous data. Since we have mixed, i.e., continuous and discrete, data, Parzen windows cannot be used directly to estimate the pdf. We have used a mixed density estimator proposed by Li and Racine [22] to obtain the space time pdf. Let x = (x', x''), where x' is a d' dimensional vector representing the continuous components of x, and x'' is a d'' dimensional vector representing the discrete components, with d = d' + d''. In addition, let x''_t be the t-th component of x'' and suppose that x''_t can assume c_t \geq 2 different values, where t = 1, 2, ..., d''.
The mixed density estimator is defined as

\hat{p}(x) = \frac{1}{n} |H|^{-1/2} \sum_{i=1}^{n} \kappa\left( H^{-1/2} (x' - x'_i) \right) w(x'', x''_i; \zeta),   (4)

where the d'-variate kernel \kappa(x'), for the continuous components, is a bounded function satisfying \int \kappa(x') \, dx' = 1, and H is the symmetric d' \times d' bandwidth matrix. w is a multivariate kernel function for the discrete components, defined as

w(x'', x''_i; \zeta) = c_0 (1 - \zeta)^{d'' - d_{\zeta,x_i}} \zeta^{d_{\zeta,x_i}},   (5)

where d_{\zeta,x_i} = d'' - \sum_{t=1}^{d''} \mathbb{1}(x''_t = x''_{i,t}) is the number of mismatched discrete components, \mathbb{1} is the indicator function, \zeta is the scalar discrete bandwidth parameter, and c_0 = \prod_{t=1,...,d''} (c_t - 1)^{-1} is a normalization constant. The multivariate kernel \kappa(x') can be generated from a product of symmetric univariate kernels \kappa_u, i.e., \kappa(x') = \prod_{j=1}^{d'} \kappa_u(x'_j). We use univariate Gaussian kernels to generate \kappa(x'). Moreover, to reduce the complexity, H is assumed to be a diagonal matrix, i.e., H = diag[h_1^2, h_2^2, ..., h_{d'}^2], and the smoothing parameter \zeta is chosen to be the same for both discrete components. The value of \zeta is chosen to be extremely small (approaching zero) because we do not want transitions across one pair of cameras being smoothed over and affecting the transition probabilities between other cameras. Each time a correspondence is made during the training phase, the observed feature is added to the sample S. The observations of an object exiting from one camera and entering into another are separated by a certain time interval. We refer to this interval as the inter-camera travel time. The following are some key observations that are modeled by the proposed system:

- The dependence of the inter-camera travel time on the magnitude and direction of motion of the object.
- The dependence of the inter-camera travel time on the location of exit from one camera and the location of entrance into the other.
- The correlation among the locations of exits and entrances in cameras.

Since the correspondences are known in the training phase, the likely time intervals and exit/entrance locations are learned by estimating the pdf.
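The mixed estimator of Eqs. (4) and (5) can be sketched directly. The following is a minimal illustration with assumed bandwidths and hypothetical helper names: a product of univariate Gaussian kernels for the continuous features (diagonal H) and the Li-Racine discrete kernel for the camera indices.

```python
# Minimal sketch of the Li-Racine mixed kernel density estimator (Eqs. 4-5).
# Bandwidths h and zeta are assumptions chosen for illustration.
import math

def mixed_kde(x_cont, x_disc, samples, h, zeta, c0=1.0):
    """Estimate p(x) at x = (x_cont, x_disc) from training samples.

    samples: list of (continuous_tuple, discrete_tuple) feature vectors
    h:       per-dimension continuous bandwidths (diagonal H)
    zeta:    discrete smoothing parameter (near 0 in the paper)
    """
    d_cont, d_disc = len(x_cont), len(x_disc)
    total = 0.0
    for s_cont, s_disc in samples:
        # Product of univariate Gaussian kernels over continuous dims.
        k = 1.0
        for t in range(d_cont):
            u = (x_cont[t] - s_cont[t]) / h[t]
            k *= math.exp(-0.5 * u * u) / (math.sqrt(2 * math.pi) * h[t])
        # Discrete kernel of Eq. (5): near-indicator when zeta -> 0.
        mismatches = sum(1 for t in range(d_disc) if x_disc[t] != s_disc[t])
        w = c0 * (1 - zeta) ** (d_disc - mismatches) * zeta ** mismatches
        total += k * w
    return total / len(samples)

# Toy usage: two training samples from different camera pairs. With
# zeta = 0, only the sample whose camera index matches contributes.
samples = [((0.0,), (1,)), ((10.0,), (2,))]
print(mixed_kde((0.0,), (1,), samples, h=[1.0], zeta=0.0))  # → ~0.1995
```

With zeta exactly 0 the discrete kernel becomes a hard indicator, which matches the paper's intent of keeping the transition statistics of different camera pairs from bleeding into one another.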
The reason for using the kernel density estimation approach is that, rather than imposing assumptions, the nonparametric technique allows us to directly approximate the d dimensional density describing the joint pdf. It is also guaranteed to converge to any density function with enough training samples [7]. Moreover, it does not impose any restrictions on the shape of the function, nor does it assume independence between the features.

5. Estimating change in appearances across cameras

In addition to the space time information, we want to model the changes in the appearance of an object from one camera to another. The idea here is to learn the change in the color of objects, as they move between the cameras, from the training data, and to use this as a cue for establishing correspondences. One possible way of doing this was proposed by Porikli [27]. In his approach, a brightness transfer function (BTF) f_ij is computed for every pair of cameras C_i and C_j, such that f_ij maps an observed brightness value in camera C_i to the corresponding observation in camera C_j. Once such a mapping is known, the correspondence problem is reduced to the matching of transformed histograms or appearance models. Note that a necessary condition for the existence of a one-to-one mapping of brightness values from one camera to another is that the objects are planar and only have diffuse reflectance. Moreover, this mapping is not unique, and it varies from frame to frame depending on a large number of parameters that include illumination, scene geometry, exposure time, focal

length, and aperture size of each camera. Thus, a single pre-computed mapping cannot usually be used to match objects for any moderately long sequence. In the following subsections, we show that despite a large number of unknown parameters, all BTFs from a given camera to another camera lie in a low dimensional subspace. Moreover, we present a method to learn this subspace from the training data and use this information to determine how likely it is for observations in different cameras to belong to the same object. In other words, given observations O_{i,a}(app) and O_{j,b}(app) from cameras C_i and C_j, respectively, and given all possible brightness transfer functions from camera C_i to camera C_j, we want to estimate the probability that the observations O_{i,a}(app) and O_{j,b}(app) belong to the same object.

5.1. The space of brightness transfer functions

Let L_i(p,t) denote the scene reflectance at a (world) point p of an object that is illuminated by white light, when viewed from camera C_i at time instant t. By the assumption that the objects do not have specular reflectance, we may write L_i(p,t) as a product of (a) material related terms, M_i(p,t) = M(p) (for example, albedo), and (b) illumination/camera geometry and object shape related terms, G_i(p,t), i.e.,

L_i(p,t) = M(p) G_i(p,t).   (6)

The above model is valid for commonly used Bidirectional Reflectance Distribution Functions (BRDFs), such as the Lambertian model and the generalized Lambertian model [26] (see Table 1). By the assumption of planarity, G_i(p,t) = G_i(q,t) = G_i(t), for all points p and q on a given object. Hence, we may write L_i(p,t) = M(p) G_i(t). The image irradiance E_i(p,t) is proportional to the scene radiance L_i(p,t) [12], and is given as:

E_i(p,t) = L_i(p,t) Y_i(t) = M(p) G_i(t) Y_i(t),   (7)

where Y_i(t) = (\pi/4) (d_i(t)/h_i(t))^2 \cos^4 \alpha_i(p,t) = (\pi/4) (d_i(t)/h_i(t))^2 c is a function of the camera parameters at time t.
h_i(t) and d_i(t) are the focal length and diameter (aperture) of the lens, respectively, and \alpha_i(p,t) is the angle that the principal ray from point p makes with the optical axis. The falloff in sensitivity due to the term \cos^4 \alpha_i(p,t) over an object is considered negligible [12] and may be replaced with a constant c.

Table 1. Commonly used BRDF models that satisfy Eq. (6)

Model | M | G
Lambertian | \rho | (I/\pi) \cos\theta_i
Generalized Lambertian | \rho | (I/\pi) \cos\theta_i \left[ 1 - \frac{0.5\sigma^2}{\sigma^2 + 0.33} + \frac{0.15\sigma^2}{\sigma^2 + 0.09} \cos(\phi_i - \phi_r) \sin\alpha \tan\beta \right]

The subscripts i and r denote the incident and the reflected directions measured with respect to the surface normal. I is the source intensity, \rho is the albedo, \sigma is the surface roughness, \alpha = \max(\theta_i, \theta_r) and \beta = \min(\theta_i, \theta_r). Note that for the generalized Lambertian model to satisfy Eq. (6), we must assume that the surface roughness \sigma is constant over the plane.

If X_i(t) is the time of exposure, and g_i is the radiometric response function of the camera C_i, then the measured (image) brightness of point p, B_i(p,t), is related to the image irradiance as

B_i(p,t) = g_i( E_i(p,t) X_i(t) ) = g_i( M(p) G_i(t) Y_i(t) X_i(t) ),   (8)

i.e., the brightness B_i(p,t) of the image of a world point p at time instant t is a nonlinear function of the product of its material properties M(p), geometric properties G_i(t), and camera parameters Y_i(t) and X_i(t). Consider two cameras C_i and C_j, and assume that a world point p is viewed by cameras C_i and C_j at time instants t_i and t_j, respectively. Since the material properties M of a world point remain constant, we have

M(p) = \frac{g_i^{-1}(B_i(p,t_i))}{G_i(t_i) Y_i(t_i) X_i(t_i)} = \frac{g_j^{-1}(B_j(p,t_j))}{G_j(t_j) Y_j(t_j) X_j(t_j)}.   (9)

Hence, the brightness transfer function from the image of camera C_i at time t_i to the image of camera C_j at time t_j is given by

B_j(p,t_j) = g_j\left( \frac{G_j(t_j) Y_j(t_j) X_j(t_j)}{G_i(t_i) Y_i(t_i) X_i(t_i)} g_i^{-1}(B_i(p,t_i)) \right) = g_j\left( w(t_i,t_j) g_i^{-1}(B_i(p,t_i)) \right),   (10)

where w(t_i,t_j) is a function of the camera parameters and the illumination/scene geometry of cameras C_i and C_j at time instants t_i and t_j, respectively. Since Eq.
(10) is valid for any point p on the object visible in the two cameras, we may drop the argument p from the notation. Also, since it is implicit in the discussion that the BTF is different for any two pairs of frames, we will also drop the arguments t_i and t_j for the sake of simplicity. Let f_ij denote a BTF from camera C_i to camera C_j; then,

B_j = g_j( w g_i^{-1}(B_i) ) = f_ij(B_i).   (11)

In this paper we use a non-parametric form of the BTF by sampling f_ij at a set of fixed increasing brightness values B_i(1) < B_i(2) < ... < B_i(n), and representing it as a vector. That is, (B_j(1), ..., B_j(n)) = (f_ij(B_i(1)), ..., f_ij(B_i(n))). We denote the space of brightness transfer functions (SBTF) from camera C_i to camera C_j by \Gamma_ij. It is easy to see that the dimension of \Gamma_ij can be at most d_max, where d_max is the number of discrete brightness values (for most imaging systems, d_max = 256). However, the following theorem shows that the BTFs actually lie in a small subspace of the d_max dimensional space (please see Appendix A for the proof).

Theorem 1. The subspace of brightness transfer functions \Gamma_ij has dimension at most m if, for all a, x \in R, g_j(ax) = \sum_{u=1}^{m} r_u(a) s_u(x), where g_j is the radiometric response function of camera C_j, and for all u, 1 \leq u \leq m, r_u and s_u are arbitrary but fixed 1D functions.

From Theorem 1, we see that the upper bound on the dimension of the subspace depends on the radiometric response function of camera C_j. Though the radiometric response functions are usually nonlinear and differ from

one camera to another, they do not have exotic forms and are well approximated by simple parametric models. Many authors have approximated the radiometric response function of a camera by a gamma function [8,24], i.e., g(x) = kx^\gamma + l. Then, for all a, x \in R,

g(ax) = k(ax)^\gamma + l = k a^\gamma x^\gamma + l = r_1(a) s_1(x) + r_2(a) s_2(x),

where r_1(a) = a^\gamma, s_1(x) = kx^\gamma, r_2(a) = 1, and s_2(x) = l. Hence, by Theorem 1, if the radiometric response function of camera C_j is a gamma function, then the SBTF \Gamma_ij has dimension at most 2. As compared to gamma functions, polynomials are a more general approximation of the radiometric response function. Once again, for a degree q polynomial g(x) = \sum_{u=0}^{q} k_u x^u and for any a, x \in R, we can write g(ax) = \sum_{u=0}^{q} r_u(a) s_u(x) by putting r_u(a) = a^u and s_u(x) = k_u x^u, for all 0 \leq u \leq q. Thus, the dimension of the SBTF \Gamma_ij is bounded by one plus the degree of the polynomial that approximates g_j. It is stated in [10] that most real world response functions are sufficiently well approximated by a low degree polynomial, e.g., a polynomial of degree less than or equal to 10. Thus, given our assumptions, the space of inter-camera BTFs has dimension at most 11. In Fig. 2, we show empirically that the assertions made in this subsection remain valid for real world radiometric response functions. In the next subsection, we will give a method for estimating the BTFs and their subspace from training data in a multi-camera tracking scenario.

Fig. 2. Plots of the percentage of total variance (y-axis) accounted for by the first m principal components (x-axis) of the subspace of brightness transfer functions from synthetic camera C_1 to camera C_i. (Legend: response curves of real cameras and films, including Agfa Color Futura 100 Red, Agfacolor HDC 100 plus Green, Agfacolor Ultra 050 plus Green, Agfachrome RSX2 050 Blue, Kodak Ektachrome 100 plus Green, a gamma curve with \gamma = 0.4, Canon Optura, Sony DXC 950, Fuji F400 Green, and Kodak Max Zoom 800 Green.)
Note that each synthetic camera was assigned a radiometric response function of a real world camera/film, and a collection of BTFs was generated between pairs of synthetic cameras by varying w in Eq. (11). PCA was performed on this collection of BTFs. The plot confirms that a very high percentage of the total variance is accounted for by the first 3 or 4 principal components of the subspace. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this paper.)

5.2. Estimation of inter-camera BTFs and their subspace

Consider a pair of cameras C_i and C_j. Corresponding observations of an object across this camera pair can be used to compute an inter-camera BTF. One way to determine this BTF is to estimate the pixel to pixel correspondence between the object views in the two cameras (see Eq. (11)). However, self occlusion, change of scale and geometry, and different object poses can make finding pixel to pixel correspondences from views of the same object in two different cameras impossible. Thus, we employ normalized histograms of object brightness values for the BTF computation. Such histograms are relatively robust to changes in object pose [32]. In order to compute the BTF, we assume that the percentage of image points on the observed object O_{i,a}(app) with brightness less than or equal to B_i is equal to the percentage of image points in the observation O_{j,b}(app) with brightness less than or equal to B_j. Note that a similar strategy was adopted by Grossberg and Nayar [9] to obtain a BTF between images taken from the same camera of the same view but in different illumination conditions. Now, if H_i and H_j are the normalized cumulative histograms of object observations I_i and I_j, respectively, then H_i(B_i) = H_j(B_j) = H_j(f_ij(B_i)). Therefore, we have

f_ij(B_i) = H_j^{-1}(H_i(B_i)), (12)

where H_j^{-1} is the inverted cumulative histogram. As discussed in the previous subsection, the BTF between two cameras changes with time due to illumination conditions, camera parameters, etc. We use Eq.
(12) to estimate the brightness transfer function f_ij for every pair of observations in the training set. Let F_ij be the collection of all the brightness transfer functions obtained in this manner, i.e., {f_(ij)^n}, n ∈ {1,...,N}. To learn the subspace of this collection we use probabilistic Principal Component Analysis (PPCA) [34]. According to this model, a d_max dimensional BTF f_ij can be written as

f_ij = W y + f̄_ij + ε. (13)

Here y is a normally distributed q dimensional latent (subspace) variable, q < d_max, W is a d_max × q dimensional projection matrix that relates the subspace variables to the observed BTF, f̄_ij is the mean of the collection of BTFs, and ε is isotropic Gaussian noise, i.e., ε ~ N(0, σ²I). Given that y and ε are normally distributed, the distribution of f_ij is given as

f_ij ~ N(f̄_ij, Z), (14)

where Z = WW^T + σ²I. Now, as suggested in [34], the projection matrix W is estimated as

W = U_q (E_q − σ²I)^{1/2} R, (15)

where the q column vectors in the d_max × q dimensional matrix U_q are the eigenvectors of the sample covariance matrix of F_ij, E_q is a q × q diagonal matrix of the corresponding eigenvalues

λ_1, ..., λ_q, and R is an arbitrary orthogonal rotation matrix, which can be set to the identity matrix for computational purposes. The value of σ², which is the variance of the information lost in the projection, is calculated as

σ² = (1 / (d_max − q)) Σ_{v=q+1}^{d_max} λ_v. (16)

Once the values of σ² and W are known, we can compute the probability of a particular BTF belonging to the learned subspace of BTFs by using the distribution in Eq. (14). Note that till now we have been dealing with only the brightness values of images and computing the brightness transfer functions. To deal with color images we treat each channel, i.e., R, G, and B, separately. The transfer function for each color channel is computed exactly as discussed above. The subspace parameters W and σ² are also computed separately for each color channel. Also note that we do not assume knowledge of any camera parameters or response functions for the computation of these transfer functions and their subspace.

5.3. Computing object color similarity across cameras using the BTF subspace

The observed color of an object can vary widely across multiple non-overlapping cameras due to changes in scene illumination or in any of the different camera parameters, such as gain and focal length. Note that the training phase provides us the subspace of color transfer functions between the cameras, which models how the colors of an object can change across the cameras. During the test phase, if the mapping between the colors of two observations is well explained by the learned subspace, then it is likely that these observations were generated by the same object. Specifically, for two observations O_{i,a} and O_{j,b} with color transfer functions f^R_ij, f^G_ij and f^B_ij (whose distributions are given by Eq. (14)), we define the probability of the observations belonging to the same object as

P_ij(O_{i,a}(app), O_{j,b}(app) | k_{i,a}^{j,b}) = Π_{ch ∈ {R,G,B}} (2π)^{-d/2} |Z^ch|^{-1/2} exp( −(1/2) (f^ch_ij − f̄^ch_ij)^T (Z^ch)^{-1} (f^ch_ij − f̄^ch_ij) ), (17)

where Z^ch = W^ch (W^ch)^T + (σ^ch)² I. The ch superscript denotes the color channel for which the values of Z and f̄_ij were calculated.
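The training computations of Eqs. (12)-(16) and the subspace likelihood implied by Eq. (14) can be sketched as follows. This is a minimal NumPy sketch, not the authors' implementation; all function names are ours.

```python
import numpy as np

def estimate_btf(hist_i, hist_j, n_levels=256):
    """Eq. (12): f_ij(B) = H_j^{-1}(H_i(B)), using normalized cumulative
    histograms of the two object observations."""
    H_i = np.cumsum(hist_i) / np.sum(hist_i)
    H_j = np.cumsum(hist_j) / np.sum(hist_j)
    levels = np.arange(n_levels)
    # For each brightness level in camera i, look up the level in camera j
    # whose cumulative frequency matches (numerical inverse of H_j).
    return np.interp(H_i, H_j, levels)

def fit_btf_subspace(F, q=4):
    """PPCA subspace of a collection of training BTFs (Eqs. (15)-(16)).
    F is an (N, d_max) array, one BTF per row."""
    mean = F.mean(axis=0)
    C = np.cov(F - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(C)            # ascending order
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]
    d_max = F.shape[1]
    sigma2 = eigvals[q:].sum() / (d_max - q)        # Eq. (16): variance lost in projection
    # Eq. (15) with the rotation R set to the identity.
    W = eigvecs[:, :q] * np.sqrt(np.maximum(eigvals[:q] - sigma2, 0.0))
    return W, mean, sigma2

def btf_log_likelihood(f, W, mean, sigma2):
    """Log-density of Eq. (14): f ~ N(mean, Z) with Z = W W^T + sigma2 I."""
    d = f.size
    Z = W @ W.T + sigma2 * np.eye(d)
    diff = f - mean
    _, logdet = np.linalg.slogdet(Z)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + diff @ np.linalg.solve(Z, diff))
```

A BTF that lies near the learned subspace scores a much higher log-likelihood than an arbitrary mapping, which is exactly the cue Eq. (17) multiplies across the three color channels.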
For each color channel, the values of W and σ² are computed from the training data using Eqs. (15) and (16), respectively.

6. Establishing correspondences

Recall from Section 3 that the problem of multi-camera tracking is to find a set of correspondences K′ such that each observation is preceded or succeeded by a maximum of one observation, and that maximizes the likelihood, i.e.,

K′ = arg max_{K ∈ R} Σ_{k_{i,a}^{j,b} ∈ K} log [ P(O_{i,a}(st), O_{j,b}(st) | k_{i,a}^{j,b} = true) P(O_{i,a}(app), O_{j,b}(app) | k_{i,a}^{j,b} = true) ].

The problem of finding the ML solution can be modeled as a graph theoretic problem as follows. We construct a directed graph such that for each observation O_{i,a} there is a corresponding vertex, while each hypothesized correspondence k_{i,a}^{j,b} is modeled by an arc from the vertex of observation O_{i,a} to the vertex of observation O_{j,b}. The weight of the arc of the hypothesized correspondence k_{i,a}^{j,b} is computed from the space-time and appearance probability terms in the summation in Eq. (3). Note that these probabilities are computed using the methods described in Sections 4 and 5. With the constraint that an observation can correspond to at most one succeeding and one preceding observation, it is easy to see that each candidate solution is a set of directed paths (of length 0 or more) in this graph. Also, since each observation corresponds to a single object, each vertex of the graph must be in exactly one path of the solution. Hence, each candidate solution in the solution space is a set of directed paths in the constructed graph such that each vertex of the graph is in exactly one path of this set. Such a set is called a vertex disjoint path cover of a directed graph. The weight of a path cover is defined as the sum of the weights of all the edges in the path cover. Hence, a path cover with the maximum weight corresponds to the solution of the ML problem as defined in Eq. (3).
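The weighted path-cover formulation can be prototyped with an off-the-shelf assignment solver. The sketch below is ours, not the paper's implementation: it uses SciPy's Hungarian-method solver on the split (exit/entry) vertices, and adds per-vertex dummy "no successor" columns (with an assumed log-probability threshold `no_match_logp`) so that a vertex may remain the end of a path.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def max_weight_path_cover(times, log_prob, no_match_logp=-10.0):
    """times[i]: end time of observation i; log_prob[i][j]: summed log space-time
    and appearance likelihood that j is the next observation of i's object.
    Returns the list of (i, j) arcs in a maximum-weight vertex-disjoint path cover."""
    n = len(times)
    NEG = -1e9                                   # stands in for an absent arc
    W = np.full((n, 2 * n), NEG)
    for i in range(n):
        for j in range(n):
            if times[j] > times[i]:              # arcs only forward in time: acyclic
                W[i, j] = log_prob[i][j]
        W[i, n + i] = no_match_logp              # option: i has no successor
    # Matching exit vertices (rows) to entry vertices (columns) enforces
    # at most one successor and one predecessor per observation.
    rows, cols = linear_sum_assignment(W, maximize=True)
    return sorted((i, j) for i, j in zip(rows, cols) if j < n)
```

`no_match_logp` plays the role of a rejection threshold: an arc whose combined log-likelihood falls below it is dropped, leaving the observation as the tail of its path.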
The problem of finding a maximum weight path cover can be solved optimally in polynomial time if the directed graph is acyclic [29]. Recall that k_{i,a}^{j,b} defines the hypothesis that the observations O_{i,a} and O_{j,b} are consecutive observations of the same object in the environment, with the observation O_{i,a} preceding the observation O_{j,b}. Thus, by the construction of the graph, all the arcs are in the direction of increasing time, and hence the graph is acyclic. The maximum weight path cover of an acyclic directed graph can be found by reducing the problem to finding the maximum matching of an undirected bipartite graph. This bipartite graph is obtained by splitting every vertex v of the directed graph into two vertices v− and v+, such that each arc coming into the vertex v is substituted by an edge incident to the vertex v−, while the vertex v+ is connected to an edge for every arc going out of the vertex v in the directed graph (the bipartite graph obtained from the directed graph is shown in Fig. 3). The edges in the maximum matching of the constructed bipartite graph correspond to the arcs in the maximum weight path cover of the original directed graph. The maximum matching of a bipartite graph can be found with the O(n^2.5) algorithm of Hopcroft and Karp [11], where n is the total number of vertices in graph G, i.e., the total number of observations in the system. The method described above assumes that the entire set of observations is available and hence cannot be used in real time applications. One approach to handle this type of

problem in real time applications is to use a sliding window of a fixed time interval. This approach, however, involves a tradeoff between the quality of results and the timely availability of the output. In order to avoid making erroneous correspondences, we adaptively select the size of the sliding window in the online version of our algorithm. This is achieved by examining the space-time pdfs for all observations (tracks) in the environment that are not currently visible in any of the cameras in the system, and finding the time interval after which the probability of reappearance of all these observations in any camera is nearly zero. The size of the sliding window is taken to be the size of this time interval, and the correspondences are established by selecting the maximum weight path cover of the graph within the window.

Fig. 3. An example of the split graph (constructed from the directed graph) that formulates the multi-camera tracking problem. Each vertex of the directed graph is split into + (exit) and − (entry) vertices, such that the + vertex is adjacent to an edge for each arc going out of the vertex and the − vertex is adjacent to an edge for each arc coming into the vertex. The weight of an edge is the same as the weight of the corresponding arc. The graph is bipartite, since no + vertex is adjacent to a + vertex and no − vertex is adjacent to a − vertex.

Fig. 4. Tracking results: tracking accuracy for each of the three sequences computed for three different cases: (1) using only the space-time model, (2) using only the appearance model, and (3) using both models. The results improve greatly when both the space-time and appearance models are employed for establishing correspondence. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this paper.)

7. Results

In this section, we present the results of the proposed method in three different multi-camera scenarios.
The scenarios differ from each other both in terms of camera topologies and scene illumination conditions, and include both indoor and outdoor settings. Each experiment consists of a supervised training phase and a testing phase. In both phases, the single camera object detection and tracking information is obtained using the method proposed in [15]. In the training phase, the known correspondence information is used to compute the kernel density of the space-time features (entry and exit locations, exit velocity, and inter-camera time interval) and the subspaces of transfer functions for each color channel (red, blue, and green). In the testing phase, these correspondences are computed using the proposed multi-camera correspondence algorithm. The performance of the algorithm is analyzed by comparing the resulting tracks to the ground truth. We say that an object in the scene is tracked correctly if it is assigned a single unique label for the complete duration of its presence in the area of interest. The tracking accuracy is defined as the ratio of the number of objects tracked correctly to the total number of objects that passed through the scene. In order to determine the relative significance of each model, and to show the importance of combining the space-time information with the appearance matching scheme, the correspondences in the testing phase are computed for three different cases for each multi-camera scenario: by using (i) only the space-time model, (ii) only the appearance model, and (iii) both models. The results of each of these cases are analyzed using the tracking accuracy defined above as the evaluation measure. These results are summarized in Fig. 4 and are explained below for each of the experimental setups. The first experiment was conducted with two cameras, Camera 1 and Camera 2, in an outdoor setting. The camera topology is shown in Fig. 5(a). The scene viewed by Camera 1 is a covered area under shade, whereas Camera 2 views an open area illuminated by sunlight (please see Fig. 7).
It can be seen from the figure that there is a significant difference between the global illumination of the two scenes, and matching the appearances is considerably difficult without accurate modeling of the changes in appearance across the cameras. Training was performed using a 5 min sequence. The marginal of the space-time density for exit velocities from Camera 2 and the inter-camera travel time interval is shown in Fig. 5(b). The marginal density shows a strong anti-correlation between the two space-time features and complies with the intuitive notion that for higher velocities there is a greater probability that the time interval will be smaller, whereas a longer time interval is more likely for slower objects. In Fig. 6, the transfer functions obtained from the first five correspondences from Camera 1 to Camera 2 are shown. Note that lower color values from Camera 1 are being mapped to higher color values in Camera 2, indicating that the same object appears much brighter in Camera 2 as compared to Camera 1.

Fig. 5. (a) Two camera configuration for the first experiment. The field-of-view of each camera is shown with triangles. The cameras were mounted approximately 10 yards apart. It took 7-12 seconds for a person walking at normal speed to exit from the view of Camera 1 and enter Camera 2. The green region is the area covered by grass; most people avoid walking over it. (b) The marginal of the inter-camera space-time density (learned from the training data) for exit velocities of objects from Camera 2 and the time taken by the objects to move from Camera 2 to Camera 1. Note that if the object velocity is high, a smaller inter-camera travel time is more likely, while for objects moving with lower velocities a longer inter-camera travel time is more likely. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this paper.)

Fig. 6. The transfer functions for the R, G and B color channels from Camera 1 to Camera 2 (axes: Camera 1 channel range vs. Camera 2 channel range), obtained from the first five correspondences from the training data. Note that mostly lower color values from Camera 1 are being mapped to higher color values in Camera 2, indicating that the same object appears much brighter in Camera 2 as compared to Camera 1. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this paper.)

The test phase consisted of a 12 minute long sequence. In this phase, a total of 68 tracks were recorded in the individual cameras and the algorithm detected 32 transitions across the cameras. Tracking accuracy for the test phase is shown in Fig. 4. Our second experimental setup consists of three cameras, Camera 1, Camera 2, and Camera 3, as shown in Fig. 8(a).
The field-of-view of each camera is also shown in the figure. It should be noted that there are several paths from one camera to the other, which make the sequence more complex. Training was done on a 10 min sequence in the presence of multiple persons. Fig. 8(b) shows the probabilities of entering Camera 2 from Camera 1 that were obtained during the training phase. Note that people like to take the shortest possible path between two points. This fact is clearly demonstrated by the space-time pdf, which shows a correlation between the y-coordinates of the entry and exit locations of the two cameras. That is, if an object exits Camera 1 from point A, it is more probable that it will enter Camera 2 at point C rather than point D. The situation is reversed if the object exits Camera 1 from point B. Testing was carried out on a 15 min sequence. A total of 71 tracks in individual cameras were obtained and the algorithm detected 45 transitions within the cameras. The trajectories of the people moving through the scene in the testing phase are shown in Fig. 9. Note that

Fig. 7. Frames from sequence 1. Note that multiple persons are simultaneously exiting from Camera 2 and entering at irregular intervals in Camera 1. The first camera is overlooking a covered area while the second camera view is under direct sunlight, therefore the observed color of objects is fairly different in the two views (also see Fig. 12). Correct labels are assigned in this case due to accurate color modeling. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this paper.)

people did not stick to a narrow path between Camera 1 and Camera 2, but this did not affect the tracking accuracy, and all the correspondences were established correctly when both space-time and appearance models were used (see Fig. 4). Fig. 15 shows some tracking instances in this sequence. In the third experiment, three cameras, Camera 1, Camera 2, and Camera 3, were used for an indoor/outdoor setup. Camera 1 was placed indoors while the other two cameras were placed outdoors. The placements of the cameras along with their fields of view are shown in Fig. 10. Training was done on an 8 min sequence in the presence of multiple persons. Testing was carried out on a 15 min sequence. Fig. 11 shows some tracking instances for the test sequence. The algorithm detected 49 transitions among the total of 99 individual tracks that were obtained during this sequence, out of which only two correspondences were incorrect. One such error was caused by a person staying, for a much longer than expected duration, in an unobserved region. That is, the person stood in an unobserved region for a long time and then entered another camera, but the time constraint (due to the space-time model) forced the assignment of a new label to the person. Such a scenario could have been handled if there had been similar examples in the training phase. The aggregate tracking results for the sequence are given in Fig. 4. It is clear from Fig.
4 that both the appearance and space-time models are important sources of information, as the tracking results improve significantly when both models are used jointly. In Table 2, we show the number of principal components that account for 99% of the total variance in the inter-camera BTFs computed during the training phase.

Fig. 8. (a) Camera setup for sequence 2. Camera 2 and Camera 3 were mounted approximately 30 yards apart, while the distance between Camera 1 and Camera 2 was approximately 20 yards. It took 8-14 seconds for a person walking at normal speed to exit from the view of Camera 1 and enter Camera 2. The walking time between Camera 2 and Camera 3 was between 10 and 18 s. The green regions are patches of grass. The points A to D are locations where people exited and/or entered the camera fields of view. (b) The marginal of the inter-camera space-time density for exit locations of objects from Camera 1 and entry locations in Camera 2. In the graph, the y coordinates of the right boundary of Camera 1 and the left boundary of Camera 2 are plotted. Since most people walked in a straight path from Camera 1 to Camera 2, i.e., from location A to C and from B to D as shown in (a), the corresponding locations had a higher probability of being the exit/entry locations of the same person. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this paper.)

Fig. 9. Trajectories of people for camera setup 2. Trajectories of the same person are shown in the same color. There were a total of 27 people who walked through the environment. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this paper.)

Fig. 10. Camera setup for sequence 3. It is an indoor/outdoor sequence. Camera 3 is placed indoors while Cameras 1 and 2 are outdoors. The distance between Camera 3 and the other two cameras is around 20 m. About 40 correspondences were used for training for each camera pair between which there was a direct movement of people, i.e., without going through an intermediate camera view.

Even though the experimental setup does not follow the assumptions of Section 5, such as planarity of objects, the small number of principal components indicates that the inter-camera BTFs lie in a low dimensional subspace even in more general conditions. In order to demonstrate the superiority of the subspace based method, we compare it with the direct use of colors for tracking. For direct color based matching, instead of using Eq. (17) for the computation of the appearance probabilities P_ij(O_{i,a}(app), O_{j,b}(app) | k_{i,a}^{j,b}), we define them in terms of the Bhattacharyya distance between the normalized histograms of the observations O_{i,a} and O_{j,b}, i.e.,

P_ij(O_{i,a}(app), O_{j,b}(app) | k_{i,a}^{j,b}) = c e^{−c D(h_i, h_j)}, (18)

where h_i and h_j are the normalized histograms of the observations O_{i,a} and O_{j,b}, and D is the modified Bhattacharyya distance [5] between two histograms, given as

D(h_i, h_j) = sqrt( 1 − Σ_{v=1}^{m} sqrt( ĥ_{i,v} ĥ_{j,v} ) ), (19)

where m is the total number of bins. The Bhattacharyya distance ranges between zero and one and is a metric.

Fig. 11. Frames from sequence 3 test phase. A person is assigned a unique label as he moves through the camera views.
(For interpretation of the references to color in this figure legend, the reader is referred to the web version of this paper.)
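The histogram baseline of Eqs. (18) and (19) is straightforward to compute. Below is a minimal NumPy sketch of it (the function names and the single scale constant c are ours):

```python
import numpy as np

def bhattacharyya_distance(h_i, h_j):
    """Eq. (19): D(h_i, h_j) = sqrt(1 - sum_v sqrt(h_i[v] * h_j[v]))
    for normalized histograms h_i and h_j (each sums to 1)."""
    bc = np.sum(np.sqrt(h_i * h_j))      # Bhattacharyya coefficient, in [0, 1]
    return np.sqrt(max(1.0 - bc, 0.0))   # clamp guards tiny negative round-off

def color_match_probability(h_i, h_j, c=1.0):
    """Eq. (18): baseline appearance probability c * exp(-c * D(h_i, h_j))."""
    return c * np.exp(-c * bhattacharyya_distance(h_i, h_j))
```

Identical histograms give D = 0 (the best possible match), while histograms with disjoint support give D = 1; the exponential then maps smaller distances to larger match probabilities.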

Table 2
The number of principal components that account for 99% of the variance in the BTFs

Sequence No. | Camera pair | No. of principal components (Red) | No. of principal components (Green) | No. of principal components (Blue)

Note that for all camera pairs, a maximum of 7 principal components was sufficient to account for the subspace of the BTFs.

Once again, the tracking accuracy was computed for all three multi-camera scenarios using the color histogram based model (Eq. (18)). The comparison of the proposed appearance modeling approach with the direct color based appearance matching is presented in Fig. 13, and clearly shows that the subspace based appearance model performs significantly better. For further comparison of the two methods, we consider two observations, O_a and O_b, in the testing phase, with histograms H(O_a) and H(O_b), respectively. We first compute a BTF, f, between the two observations and reconstruct the BTF, f*, from the subspace estimated from the training data, i.e., f* = W W^T (f − f̄) + f̄. Here W is the projection matrix obtained in the training phase.

Fig. 13. Tracking accuracy: comparison of the BTF subspace based tracking method to simple color matching. A much improved matching is achieved in the transformed color space relative to direct color comparison of objects. The improvement is greater in the first sequence due to the large difference in scene illumination in the two camera views. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this paper.)

The first observation O_a is then transformed using f*, and the histogram of the object O_b is matched with the histograms of both O_a and f*(O_a) using the Bhattacharyya distance. When both the observations O_a and O_b belong to the same object, the transformed histogram gives a much better match as compared to direct histogram matching, as shown in Figs. 12 and 14.
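This transform-and-match step can be sketched as follows (a minimal NumPy sketch; function names are ours, and W and f̄ are the subspace parameters obtained in training):

```python
import numpy as np

def reconstruct_btf(f, W, mean):
    """Project an observed BTF onto the learned subspace and back:
    f* = W W^T (f - mean) + mean, as used in the comparison above."""
    return W @ (W.T @ (f - mean)) + mean

def reconstruction_error(f, f_star, s):
    """Normalized reconstruction error ||f - f*|| / s,
    for a normalizing constant s."""
    return np.linalg.norm(f - f_star) / s
```

A BTF from a genuine correspondence lies close to the subspace, so f* barely changes it and the transformed histogram matches well; a BTF between different objects reconstructs poorly, which is what the reconstruction error quantifies.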
However, if the observations O_a and O_b belong to different objects, then the BTF is reconstructed poorly (since it does not lie in the subspace of valid BTFs), and the Bhattacharyya distance for the transformed observation either increases or does not change significantly.

Fig. 12. (a) Observations O_a and O_b of the same object from Camera 1 and Camera 2, respectively, from camera setup 1. (b) Histogram of observation O_a (all histograms are of the red color channel). (c) Histogram of observation O_b. The Bhattacharyya distance between the two histograms of the same object is 0.537. (d) The histogram of O_a after undergoing color transformation using the BTF reconstruction f* from the learned subspace. Note that after the transformation, the histogram of f*(O_a) looks fairly similar to the histogram of O_b. The Bhattacharyya distance reduces to 0.212 after the transformation. (e) Observation from Camera 1 matched to an observation of a different object in Camera 2. (f and g) Histograms of the observations. The distance between the histograms of the two different objects is 0.278; note that this is less than the distance between the histograms of the same object. (h) Histogram after transforming the colors using the BTF reconstructed from the subspace. The Bhattacharyya distance increases to 0.301. Simple color matching gives a better match for the wrong correspondence; however, in the transformed space the correct correspondence gives the least Bhattacharyya distance. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this paper.)

Fig. 14. Row 1: observations from camera setup 3. The observations are of the same object from Camera 1 and Camera 2, respectively. Their blue channel histograms are also shown. The last histogram is obtained after transforming the colors with a BTF f* reconstructed from the subspace. Note that the Bhattacharyya distance (shown at the top of the histograms) improves significantly after the transformation. Row 2: observations of different objects and their blue channel histograms. Here there is no significant change in the Bhattacharyya distance after the transformation. Rows 3 and 4: observations from camera setup 2. Here the direct use of color histograms results in a better match for the wrong correspondence; however, after the color transformation, histograms of the same objects have the lesser Bhattacharyya distance between them. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this paper.)

The normalized reconstruction error of the BTF, f*-Reconstruction Error = ||f − f*|| / s, where s is a normalizing constant, is also shown in the figures. The aggregate results for the reconstruction error, for the BTFs between the same object and also between different objects, are given in Table 3. The above discussion suggests the applicability of the BTF subspace for the improvement of any multi-camera appearance matching scheme that uses color as one of its components. Our multi-camera tracking system uses a client-server architecture, in which a client processor is associated with each camera. The advantage of this architecture is that the computationally expensive tasks of object detection and single camera tracking are performed at the client side, while the server only performs the multi-view correspondence. The communication between the client and server consists of the histogram and trajectory information of objects, and this information is sent only when objects exit or enter the field of view of a camera.
Note that the server does not use the images directly and thus the communication overhead is low. In our experiments, there was no significant difference in frame rates between the two camera

Fig. 15. Consistent labelling for camera setup 2. Rows 1 and 2: people enter Camera 3. Row 3: a new person enters Camera 2; note that he is given a previously unassigned label. Rows 4-8: people keep moving across cameras. All persons retain unique labels. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this paper.)


12/2/2009. Announcements. Parametric / Non-parametric. Case-Based Reasoning. Nearest-Neighbor on Images. Nearest-Neighbor Classification Introducton to Artfcal Intellgence V22.0472-001 Fall 2009 Lecture 24: Nearest-Neghbors & Support Vector Machnes Rob Fergus Dept of Computer Scence, Courant Insttute, NYU Sldes from Danel Yeung, John DeNero

More information

FEATURE EXTRACTION. Dr. K.Vijayarekha. Associate Dean School of Electrical and Electronics Engineering SASTRA University, Thanjavur

FEATURE EXTRACTION. Dr. K.Vijayarekha. Associate Dean School of Electrical and Electronics Engineering SASTRA University, Thanjavur FEATURE EXTRACTION Dr. K.Vjayarekha Assocate Dean School of Electrcal and Electroncs Engneerng SASTRA Unversty, Thanjavur613 41 Jont Intatve of IITs and IISc Funded by MHRD Page 1 of 8 Table of Contents

More information

MOTION PANORAMA CONSTRUCTION FROM STREAMING VIDEO FOR POWER- CONSTRAINED MOBILE MULTIMEDIA ENVIRONMENTS XUNYU PAN

MOTION PANORAMA CONSTRUCTION FROM STREAMING VIDEO FOR POWER- CONSTRAINED MOBILE MULTIMEDIA ENVIRONMENTS XUNYU PAN MOTION PANORAMA CONSTRUCTION FROM STREAMING VIDEO FOR POWER- CONSTRAINED MOBILE MULTIMEDIA ENVIRONMENTS by XUNYU PAN (Under the Drecton of Suchendra M. Bhandarkar) ABSTRACT In modern tmes, more and more

More information

Parallel matrix-vector multiplication

Parallel matrix-vector multiplication Appendx A Parallel matrx-vector multplcaton The reduced transton matrx of the three-dmensonal cage model for gel electrophoress, descrbed n secton 3.2, becomes excessvely large for polymer lengths more

More information

Outline. Discriminative classifiers for image recognition. Where in the World? A nearest neighbor recognition example 4/14/2011. CS 376 Lecture 22 1

Outline. Discriminative classifiers for image recognition. Where in the World? A nearest neighbor recognition example 4/14/2011. CS 376 Lecture 22 1 4/14/011 Outlne Dscrmnatve classfers for mage recognton Wednesday, Aprl 13 Krsten Grauman UT-Austn Last tme: wndow-based generc obect detecton basc ppelne face detecton wth boostng as case study Today:

More information

Wishing you all a Total Quality New Year!

Wishing you all a Total Quality New Year! Total Qualty Management and Sx Sgma Post Graduate Program 214-15 Sesson 4 Vnay Kumar Kalakband Assstant Professor Operatons & Systems Area 1 Wshng you all a Total Qualty New Year! Hope you acheve Sx sgma

More information

Recognizing Faces. Outline

Recognizing Faces. Outline Recognzng Faces Drk Colbry Outlne Introducton and Motvaton Defnng a feature vector Prncpal Component Analyss Lnear Dscrmnate Analyss !"" #$""% http://www.nfotech.oulu.f/annual/2004 + &'()*) '+)* 2 ! &

More information

Range images. Range image registration. Examples of sampling patterns. Range images and range surfaces

Range images. Range image registration. Examples of sampling patterns. Range images and range surfaces Range mages For many structured lght scanners, the range data forms a hghly regular pattern known as a range mage. he samplng pattern s determned by the specfc scanner. Range mage regstraton 1 Examples

More information

Unsupervised Learning

Unsupervised Learning Pattern Recognton Lecture 8 Outlne Introducton Unsupervsed Learnng Parametrc VS Non-Parametrc Approach Mxture of Denstes Maxmum-Lkelhood Estmates Clusterng Prof. Danel Yeung School of Computer Scence and

More information

Mathematics 256 a course in differential equations for engineering students

Mathematics 256 a course in differential equations for engineering students Mathematcs 56 a course n dfferental equatons for engneerng students Chapter 5. More effcent methods of numercal soluton Euler s method s qute neffcent. Because the error s essentally proportonal to the

More information

Computer Animation and Visualisation. Lecture 4. Rigging / Skinning

Computer Animation and Visualisation. Lecture 4. Rigging / Skinning Computer Anmaton and Vsualsaton Lecture 4. Rggng / Sknnng Taku Komura Overvew Sknnng / Rggng Background knowledge Lnear Blendng How to decde weghts? Example-based Method Anatomcal models Sknnng Assume

More information

Lecture 4: Principal components

Lecture 4: Principal components /3/6 Lecture 4: Prncpal components 3..6 Multvarate lnear regresson MLR s optmal for the estmaton data...but poor for handlng collnear data Covarance matrx s not nvertble (large condton number) Robustness

More information

LECTURE : MANIFOLD LEARNING

LECTURE : MANIFOLD LEARNING LECTURE : MANIFOLD LEARNING Rta Osadchy Some sldes are due to L.Saul, V. C. Raykar, N. Verma Topcs PCA MDS IsoMap LLE EgenMaps Done! Dmensonalty Reducton Data representaton Inputs are real-valued vectors

More information

Ecient Computation of the Most Probable Motion from Fuzzy. Moshe Ben-Ezra Shmuel Peleg Michael Werman. The Hebrew University of Jerusalem

Ecient Computation of the Most Probable Motion from Fuzzy. Moshe Ben-Ezra Shmuel Peleg Michael Werman. The Hebrew University of Jerusalem Ecent Computaton of the Most Probable Moton from Fuzzy Correspondences Moshe Ben-Ezra Shmuel Peleg Mchael Werman Insttute of Computer Scence The Hebrew Unversty of Jerusalem 91904 Jerusalem, Israel Emal:

More information

Lecture 5: Multilayer Perceptrons

Lecture 5: Multilayer Perceptrons Lecture 5: Multlayer Perceptrons Roger Grosse 1 Introducton So far, we ve only talked about lnear models: lnear regresson and lnear bnary classfers. We noted that there are functons that can t be represented

More information

Video Object Tracking Based On Extended Active Shape Models With Color Information

Video Object Tracking Based On Extended Active Shape Models With Color Information CGIV'2002: he Frst Frst European Conference Colour on Colour n Graphcs, Imagng, and Vson Vdeo Object rackng Based On Extended Actve Shape Models Wth Color Informaton A. Koschan, S.K. Kang, J.K. Pak, B.

More information

User Authentication Based On Behavioral Mouse Dynamics Biometrics

User Authentication Based On Behavioral Mouse Dynamics Biometrics User Authentcaton Based On Behavoral Mouse Dynamcs Bometrcs Chee-Hyung Yoon Danel Donghyun Km Department of Computer Scence Department of Computer Scence Stanford Unversty Stanford Unversty Stanford, CA

More information

Online Detection and Classification of Moving Objects Using Progressively Improving Detectors

Online Detection and Classification of Moving Objects Using Progressively Improving Detectors Onlne Detecton and Classfcaton of Movng Objects Usng Progressvely Improvng Detectors Omar Javed Saad Al Mubarak Shah Computer Vson Lab School of Computer Scence Unversty of Central Florda Orlando, FL 32816

More information

MOTION BLUR ESTIMATION AT CORNERS

MOTION BLUR ESTIMATION AT CORNERS Gacomo Boracch and Vncenzo Caglot Dpartmento d Elettronca e Informazone, Poltecnco d Mlano, Va Ponzo, 34/5-20133 MILANO boracch@elet.polm.t, caglot@elet.polm.t Keywords: Abstract: Pont Spread Functon Parameter

More information

Machine Learning: Algorithms and Applications

Machine Learning: Algorithms and Applications 14/05/1 Machne Learnng: Algorthms and Applcatons Florano Zn Free Unversty of Bozen-Bolzano Faculty of Computer Scence Academc Year 011-01 Lecture 10: 14 May 01 Unsupervsed Learnng cont Sldes courtesy of

More information

Multiple Frame Motion Inference Using Belief Propagation

Multiple Frame Motion Inference Using Belief Propagation Multple Frame Moton Inference Usng Belef Propagaton Jang Gao Janbo Sh The Robotcs Insttute Department of Computer and Informaton Scence Carnege Mellon Unversty Unversty of Pennsylvana Pttsburgh, PA 53

More information

Compiler Design. Spring Register Allocation. Sample Exercises and Solutions. Prof. Pedro C. Diniz

Compiler Design. Spring Register Allocation. Sample Exercises and Solutions. Prof. Pedro C. Diniz Compler Desgn Sprng 2014 Regster Allocaton Sample Exercses and Solutons Prof. Pedro C. Dnz USC / Informaton Scences Insttute 4676 Admralty Way, Sute 1001 Marna del Rey, Calforna 90292 pedro@s.edu Regster

More information

5 The Primal-Dual Method

5 The Primal-Dual Method 5 The Prmal-Dual Method Orgnally desgned as a method for solvng lnear programs, where t reduces weghted optmzaton problems to smpler combnatoral ones, the prmal-dual method (PDM) has receved much attenton

More information

Cluster Analysis of Electrical Behavior

Cluster Analysis of Electrical Behavior Journal of Computer and Communcatons, 205, 3, 88-93 Publshed Onlne May 205 n ScRes. http://www.scrp.org/ournal/cc http://dx.do.org/0.4236/cc.205.350 Cluster Analyss of Electrcal Behavor Ln Lu Ln Lu, School

More information

What are the camera parameters? Where are the light sources? What is the mapping from radiance to pixel color? Want to solve for 3D geometry

What are the camera parameters? Where are the light sources? What is the mapping from radiance to pixel color? Want to solve for 3D geometry Today: Calbraton What are the camera parameters? Where are the lght sources? What s the mappng from radance to pel color? Why Calbrate? Want to solve for D geometry Alternatve approach Solve for D shape

More information

TN348: Openlab Module - Colocalization

TN348: Openlab Module - Colocalization TN348: Openlab Module - Colocalzaton Topc The Colocalzaton module provdes the faclty to vsualze and quantfy colocalzaton between pars of mages. The Colocalzaton wndow contans a prevew of the two mages

More information

REFRACTION. a. To study the refraction of light from plane surfaces. b. To determine the index of refraction for Acrylic and Water.

REFRACTION. a. To study the refraction of light from plane surfaces. b. To determine the index of refraction for Acrylic and Water. Purpose Theory REFRACTION a. To study the refracton of lght from plane surfaces. b. To determne the ndex of refracton for Acrylc and Water. When a ray of lght passes from one medum nto another one of dfferent

More information

The Greedy Method. Outline and Reading. Change Money Problem. Greedy Algorithms. Applications of the Greedy Strategy. The Greedy Method Technique

The Greedy Method. Outline and Reading. Change Money Problem. Greedy Algorithms. Applications of the Greedy Strategy. The Greedy Method Technique //00 :0 AM Outlne and Readng The Greedy Method The Greedy Method Technque (secton.) Fractonal Knapsack Problem (secton..) Task Schedulng (secton..) Mnmum Spannng Trees (secton.) Change Money Problem Greedy

More information

Optimal Scheduling of Capture Times in a Multiple Capture Imaging System

Optimal Scheduling of Capture Times in a Multiple Capture Imaging System Optmal Schedulng of Capture Tmes n a Multple Capture Imagng System Tng Chen and Abbas El Gamal Informaton Systems Laboratory Department of Electrcal Engneerng Stanford Unversty Stanford, Calforna 9435,

More information

Module Management Tool in Software Development Organizations

Module Management Tool in Software Development Organizations Journal of Computer Scence (5): 8-, 7 ISSN 59-66 7 Scence Publcatons Management Tool n Software Development Organzatons Ahmad A. Al-Rababah and Mohammad A. Al-Rababah Faculty of IT, Al-Ahlyyah Amman Unversty,

More information

Tsinghua University at TAC 2009: Summarizing Multi-documents by Information Distance

Tsinghua University at TAC 2009: Summarizing Multi-documents by Information Distance Tsnghua Unversty at TAC 2009: Summarzng Mult-documents by Informaton Dstance Chong Long, Mnle Huang, Xaoyan Zhu State Key Laboratory of Intellgent Technology and Systems, Tsnghua Natonal Laboratory for

More information

12. Segmentation. Computer Engineering, i Sejong University. Dongil Han

12. Segmentation. Computer Engineering, i Sejong University. Dongil Han Computer Vson 1. Segmentaton Computer Engneerng, Sejong Unversty Dongl Han Image Segmentaton t Image segmentaton Subdvdes an mage nto ts consttuent regons or objects - After an mage has been segmented,

More information

Machine Learning 9. week

Machine Learning 9. week Machne Learnng 9. week Mappng Concept Radal Bass Functons (RBF) RBF Networks 1 Mappng It s probably the best scenaro for the classfcaton of two dataset s to separate them lnearly. As you see n the below

More information

Lecture #15 Lecture Notes

Lecture #15 Lecture Notes Lecture #15 Lecture Notes The ocean water column s very much a 3-D spatal entt and we need to represent that structure n an economcal way to deal wth t n calculatons. We wll dscuss one way to do so, emprcal

More information

A Binarization Algorithm specialized on Document Images and Photos

A Binarization Algorithm specialized on Document Images and Photos A Bnarzaton Algorthm specalzed on Document mages and Photos Ergna Kavalleratou Dept. of nformaton and Communcaton Systems Engneerng Unversty of the Aegean kavalleratou@aegean.gr Abstract n ths paper, a

More information

Active Contours/Snakes

Active Contours/Snakes Actve Contours/Snakes Erkut Erdem Acknowledgement: The sldes are adapted from the sldes prepared by K. Grauman of Unversty of Texas at Austn Fttng: Edges vs. boundares Edges useful sgnal to ndcate occludng

More information

A Robust Method for Estimating the Fundamental Matrix

A Robust Method for Estimating the Fundamental Matrix Proc. VIIth Dgtal Image Computng: Technques and Applcatons, Sun C., Talbot H., Ourseln S. and Adraansen T. (Eds.), 0- Dec. 003, Sydney A Robust Method for Estmatng the Fundamental Matrx C.L. Feng and Y.S.

More information

Optimal Combination of Stereo Camera Calibration from Arbitrary Stereo Images.

Optimal Combination of Stereo Camera Calibration from Arbitrary Stereo Images. Tna Memo No. 1991-002 Image and Vson Computng, 9(1), 27-32, 1990. Optmal Combnaton of Stereo Camera Calbraton from Arbtrary Stereo Images. N.A.Thacker and J.E.W.Mayhew. Last updated 6 / 9 / 2005 Imagng

More information

MULTISPECTRAL IMAGES CLASSIFICATION BASED ON KLT AND ATR AUTOMATIC TARGET RECOGNITION

MULTISPECTRAL IMAGES CLASSIFICATION BASED ON KLT AND ATR AUTOMATIC TARGET RECOGNITION MULTISPECTRAL IMAGES CLASSIFICATION BASED ON KLT AND ATR AUTOMATIC TARGET RECOGNITION Paulo Quntlano 1 & Antono Santa-Rosa 1 Federal Polce Department, Brasla, Brazl. E-mals: quntlano.pqs@dpf.gov.br and

More information

A Fast Visual Tracking Algorithm Based on Circle Pixels Matching

A Fast Visual Tracking Algorithm Based on Circle Pixels Matching A Fast Vsual Trackng Algorthm Based on Crcle Pxels Matchng Zhqang Hou hou_zhq@sohu.com Chongzhao Han czhan@mal.xjtu.edu.cn Ln Zheng Abstract: A fast vsual trackng algorthm based on crcle pxels matchng

More information

Graph-based Clustering

Graph-based Clustering Graphbased Clusterng Transform the data nto a graph representaton ertces are the data ponts to be clustered Edges are eghted based on smlarty beteen data ponts Graph parttonng Þ Each connected component

More information

A MOVING MESH APPROACH FOR SIMULATION BUDGET ALLOCATION ON CONTINUOUS DOMAINS

A MOVING MESH APPROACH FOR SIMULATION BUDGET ALLOCATION ON CONTINUOUS DOMAINS Proceedngs of the Wnter Smulaton Conference M E Kuhl, N M Steger, F B Armstrong, and J A Jones, eds A MOVING MESH APPROACH FOR SIMULATION BUDGET ALLOCATION ON CONTINUOUS DOMAINS Mark W Brantley Chun-Hung

More information

X- Chart Using ANOM Approach

X- Chart Using ANOM Approach ISSN 1684-8403 Journal of Statstcs Volume 17, 010, pp. 3-3 Abstract X- Chart Usng ANOM Approach Gullapall Chakravarth 1 and Chaluvad Venkateswara Rao Control lmts for ndvdual measurements (X) chart are

More information

3D vector computer graphics

3D vector computer graphics 3D vector computer graphcs Paolo Varagnolo: freelance engneer Padova Aprl 2016 Prvate Practce ----------------------------------- 1. Introducton Vector 3D model representaton n computer graphcs requres

More information

Content Based Image Retrieval Using 2-D Discrete Wavelet with Texture Feature with Different Classifiers

Content Based Image Retrieval Using 2-D Discrete Wavelet with Texture Feature with Different Classifiers IOSR Journal of Electroncs and Communcaton Engneerng (IOSR-JECE) e-issn: 78-834,p- ISSN: 78-8735.Volume 9, Issue, Ver. IV (Mar - Apr. 04), PP 0-07 Content Based Image Retreval Usng -D Dscrete Wavelet wth

More information

Detection of an Object by using Principal Component Analysis

Detection of an Object by using Principal Component Analysis Detecton of an Object by usng Prncpal Component Analyss 1. G. Nagaven, 2. Dr. T. Sreenvasulu Reddy 1. M.Tech, Department of EEE, SVUCE, Trupath, Inda. 2. Assoc. Professor, Department of ECE, SVUCE, Trupath,

More information

Image Alignment CSC 767

Image Alignment CSC 767 Image Algnment CSC 767 Image algnment Image from http://graphcs.cs.cmu.edu/courses/15-463/2010_fall/ Image algnment: Applcatons Panorama sttchng Image algnment: Applcatons Recognton of object nstances

More information

For instance, ; the five basic number-sets are increasingly more n A B & B A A = B (1)

For instance, ; the five basic number-sets are increasingly more n A B & B A A = B (1) Secton 1.2 Subsets and the Boolean operatons on sets If every element of the set A s an element of the set B, we say that A s a subset of B, or that A s contaned n B, or that B contans A, and we wrte A

More information

An Efficient Background Updating Scheme for Real-time Traffic Monitoring

An Efficient Background Updating Scheme for Real-time Traffic Monitoring 2004 IEEE Intellgent Transportaton Systems Conference Washngton, D.C., USA, October 3-6, 2004 WeA1.3 An Effcent Background Updatng Scheme for Real-tme Traffc Montorng Suchendra M. Bhandarkar and Xngzh

More information

Simulation: Solving Dynamic Models ABE 5646 Week 11 Chapter 2, Spring 2010

Simulation: Solving Dynamic Models ABE 5646 Week 11 Chapter 2, Spring 2010 Smulaton: Solvng Dynamc Models ABE 5646 Week Chapter 2, Sprng 200 Week Descrpton Readng Materal Mar 5- Mar 9 Evaluatng [Crop] Models Comparng a model wth data - Graphcal, errors - Measures of agreement

More information

CS 534: Computer Vision Model Fitting

CS 534: Computer Vision Model Fitting CS 534: Computer Vson Model Fttng Sprng 004 Ahmed Elgammal Dept of Computer Scence CS 534 Model Fttng - 1 Outlnes Model fttng s mportant Least-squares fttng Maxmum lkelhood estmaton MAP estmaton Robust

More information

Collaborative Tracking of Objects in EPTZ Cameras

Collaborative Tracking of Objects in EPTZ Cameras Collaboratve Trackng of Objects n EPTZ Cameras Fasal Bashr and Fath Porkl * Mtsubsh Electrc Research Laboratores, Cambrdge, MA, USA ABSTRACT Ths paper addresses the ssue of mult-source collaboratve object

More information

Machine Learning. Support Vector Machines. (contains material adapted from talks by Constantin F. Aliferis & Ioannis Tsamardinos, and Martin Law)

Machine Learning. Support Vector Machines. (contains material adapted from talks by Constantin F. Aliferis & Ioannis Tsamardinos, and Martin Law) Machne Learnng Support Vector Machnes (contans materal adapted from talks by Constantn F. Alfers & Ioanns Tsamardnos, and Martn Law) Bryan Pardo, Machne Learnng: EECS 349 Fall 2014 Support Vector Machnes

More information

Fitting: Deformable contours April 26 th, 2018

Fitting: Deformable contours April 26 th, 2018 4/6/08 Fttng: Deformable contours Aprl 6 th, 08 Yong Jae Lee UC Davs Recap so far: Groupng and Fttng Goal: move from array of pxel values (or flter outputs) to a collecton of regons, objects, and shapes.

More information

Edge Detection in Noisy Images Using the Support Vector Machines

Edge Detection in Noisy Images Using the Support Vector Machines Edge Detecton n Nosy Images Usng the Support Vector Machnes Hlaro Gómez-Moreno, Saturnno Maldonado-Bascón, Francsco López-Ferreras Sgnal Theory and Communcatons Department. Unversty of Alcalá Crta. Madrd-Barcelona

More information

Multi-view 3D Position Estimation of Sports Players

Multi-view 3D Position Estimation of Sports Players Mult-vew 3D Poston Estmaton of Sports Players Robbe Vos and Wlle Brnk Appled Mathematcs Department of Mathematcal Scences Unversty of Stellenbosch, South Afrca Emal: vosrobbe@gmal.com Abstract The problem

More information

Resolving Ambiguity in Depth Extraction for Motion Capture using Genetic Algorithm

Resolving Ambiguity in Depth Extraction for Motion Capture using Genetic Algorithm Resolvng Ambguty n Depth Extracton for Moton Capture usng Genetc Algorthm Yn Yee Wa, Ch Kn Chow, Tong Lee Computer Vson and Image Processng Laboratory Dept. of Electronc Engneerng The Chnese Unversty of

More information

2x x l. Module 3: Element Properties Lecture 4: Lagrange and Serendipity Elements

2x x l. Module 3: Element Properties Lecture 4: Lagrange and Serendipity Elements Module 3: Element Propertes Lecture : Lagrange and Serendpty Elements 5 In last lecture note, the nterpolaton functons are derved on the bass of assumed polynomal from Pascal s trangle for the fled varable.

More information

A Bilinear Model for Sparse Coding

A Bilinear Model for Sparse Coding A Blnear Model for Sparse Codng Davd B. Grmes and Rajesh P. N. Rao Department of Computer Scence and Engneerng Unversty of Washngton Seattle, WA 98195-2350, U.S.A. grmes,rao @cs.washngton.edu Abstract

More information

Learning the Kernel Parameters in Kernel Minimum Distance Classifier

Learning the Kernel Parameters in Kernel Minimum Distance Classifier Learnng the Kernel Parameters n Kernel Mnmum Dstance Classfer Daoqang Zhang 1,, Songcan Chen and Zh-Hua Zhou 1* 1 Natonal Laboratory for Novel Software Technology Nanjng Unversty, Nanjng 193, Chna Department

More information

Exploiting Spatial and Spectral Image Regularities for Color Constancy

Exploiting Spatial and Spectral Image Regularities for Color Constancy Explotng Spatal and Spectral Image Regulartes for Color Constancy Barun Sngh and Wllam T. Freeman MIT Computer Scence and Artfcal Intellgence Laboratory July 2, 23 Abstract We study the problem of color

More information

Adaptive Transfer Learning

Adaptive Transfer Learning Adaptve Transfer Learnng Bn Cao, Snno Jaln Pan, Yu Zhang, Dt-Yan Yeung, Qang Yang Hong Kong Unversty of Scence and Technology Clear Water Bay, Kowloon, Hong Kong {caobn,snnopan,zhangyu,dyyeung,qyang}@cse.ust.hk

More information

The Comparison of Calibration Method of Binocular Stereo Vision System Ke Zhang a *, Zhao Gao b

The Comparison of Calibration Method of Binocular Stereo Vision System Ke Zhang a *, Zhao Gao b 3rd Internatonal Conference on Materal, Mechancal and Manufacturng Engneerng (IC3ME 2015) The Comparson of Calbraton Method of Bnocular Stereo Vson System Ke Zhang a *, Zhao Gao b College of Engneerng,

More information

Lecture 5: Probability Distributions. Random Variables

Lecture 5: Probability Distributions. Random Variables Lecture 5: Probablty Dstrbutons Random Varables Probablty Dstrbutons Dscrete Random Varables Contnuous Random Varables and ther Dstrbutons Dscrete Jont Dstrbutons Contnuous Jont Dstrbutons Independent

More information

Performance Evaluation of Information Retrieval Systems

Performance Evaluation of Information Retrieval Systems Why System Evaluaton? Performance Evaluaton of Informaton Retreval Systems Many sldes n ths secton are adapted from Prof. Joydeep Ghosh (UT ECE) who n turn adapted them from Prof. Dk Lee (Unv. of Scence

More information

Analysis of Continuous Beams in General

Analysis of Continuous Beams in General Analyss of Contnuous Beams n General Contnuous beams consdered here are prsmatc, rgdly connected to each beam segment and supported at varous ponts along the beam. onts are selected at ponts of support,

More information

Chapter 6 Programmng the fnte element method Inow turn to the man subject of ths book: The mplementaton of the fnte element algorthm n computer programs. In order to make my dscusson as straghtforward

More information

Proper Choice of Data Used for the Estimation of Datum Transformation Parameters

Proper Choice of Data Used for the Estimation of Datum Transformation Parameters Proper Choce of Data Used for the Estmaton of Datum Transformaton Parameters Hakan S. KUTOGLU, Turkey Key words: Coordnate systems; transformaton; estmaton, relablty. SUMMARY Advances n technologes and

More information

An Iterative Solution Approach to Process Plant Layout using Mixed Integer Optimisation

An Iterative Solution Approach to Process Plant Layout using Mixed Integer Optimisation 17 th European Symposum on Computer Aded Process Engneerng ESCAPE17 V. Plesu and P.S. Agach (Edtors) 2007 Elsever B.V. All rghts reserved. 1 An Iteratve Soluton Approach to Process Plant Layout usng Mxed

More information

Determining the Optimal Bandwidth Based on Multi-criterion Fusion

Determining the Optimal Bandwidth Based on Multi-criterion Fusion Proceedngs of 01 4th Internatonal Conference on Machne Learnng and Computng IPCSIT vol. 5 (01) (01) IACSIT Press, Sngapore Determnng the Optmal Bandwdth Based on Mult-crteron Fuson Ha-L Lang 1+, Xan-Mn

More information

Some Advanced SPC Tools 1. Cumulative Sum Control (Cusum) Chart For the data shown in Table 9-1, the x chart can be generated.

Some Advanced SPC Tools 1. Cumulative Sum Control (Cusum) Chart For the data shown in Table 9-1, the x chart can be generated. Some Advanced SP Tools 1. umulatve Sum ontrol (usum) hart For the data shown n Table 9-1, the x chart can be generated. However, the shft taken place at sample #21 s not apparent. 92 For ths set samples,

More information

APPLICATION OF PREDICTION-BASED PARTICLE FILTERS FOR TELEOPERATIONS OVER THE INTERNET

APPLICATION OF PREDICTION-BASED PARTICLE FILTERS FOR TELEOPERATIONS OVER THE INTERNET APPLICATION OF PREDICTION-BASED PARTICLE FILTERS FOR TELEOPERATIONS OVER THE INTERNET Jae-young Lee, Shahram Payandeh, and Ljljana Trajovć School of Engneerng Scence Smon Fraser Unversty 8888 Unversty

More information

Alignment of Non-Overlapping Sequences

Alignment of Non-Overlapping Sequences Algnment of Non-Overlappng Sequences Yaron Casp Mchal ran Dept. of Computer Scence and Appled Math The Wezmann nsttute of Scence 76100 Rehovot, srael Ths paper shows how two mage sequences that have no

More information

Suppression for Luminance Difference of Stereo Image-Pair Based on Improved Histogram Equalization

Suppression for Luminance Difference of Stereo Image-Pair Based on Improved Histogram Equalization Suppresson for Lumnance Dfference of Stereo Image-Par Based on Improved Hstogram Equalzaton Zhao Llng,, Zheng Yuhu 3, Sun Quansen, Xa Deshen School of Computer Scence and Technology, NJUST, Nanjng, Chna.School

More information

APPLICATION OF PREDICTION-BASED PARTICLE FILTERS FOR TELEOPERATIONS OVER THE INTERNET

APPLICATION OF PREDICTION-BASED PARTICLE FILTERS FOR TELEOPERATIONS OVER THE INTERNET APPLICATION OF PREDICTION-BASED PARTICLE FILTERS FOR TELEOPERATIONS OVER THE INTERNET Jae-young Lee, Shahram Payandeh, and Ljljana Trajovć School of Engneerng Scence Smon Fraser Unversty 8888 Unversty

More information

Scan Conversion & Shading

Scan Conversion & Shading Scan Converson & Shadng Thomas Funkhouser Prnceton Unversty C0S 426, Fall 1999 3D Renderng Ppelne (for drect llumnaton) 3D Prmtves 3D Modelng Coordnates Modelng Transformaton 3D World Coordnates Lghtng

More information

Classifier Selection Based on Data Complexity Measures *

Classifier Selection Based on Data Complexity Measures * Classfer Selecton Based on Data Complexty Measures * Edth Hernández-Reyes, J.A. Carrasco-Ochoa, and J.Fco. Martínez-Trndad Natonal Insttute for Astrophyscs, Optcs and Electroncs, Lus Enrque Erro No.1 Sta.

More information

A Fast Content-Based Multimedia Retrieval Technique Using Compressed Data

A Fast Content-Based Multimedia Retrieval Technique Using Compressed Data A Fast Content-Based Multmeda Retreval Technque Usng Compressed Data Borko Furht and Pornvt Saksobhavvat NSF Multmeda Laboratory Florda Atlantc Unversty, Boca Raton, Florda 3343 ABSTRACT In ths paper,

More information

An efficient method to build panoramic image mosaics

An efficient method to build panoramic image mosaics An effcent method to buld panoramc mage mosacs Pattern Recognton Letters vol. 4 003 Dae-Hyun Km Yong-In Yoon Jong-Soo Cho School of Electrcal Engneerng and Computer Scence Kyungpook Natonal Unv. Abstract

More information

Computer Vision I. Xbox Kinnect: Rectification. The Fundamental matrix. Stereo III. CSE252A Lecture 16. Example: forward motion

Computer Vision I. Xbox Kinnect: Rectification. The Fundamental matrix. Stereo III. CSE252A Lecture 16. Example: forward motion Xbox Knnect: Stereo III Depth map http://www.youtube.com/watch?v=7qrnwoo-8a CSE5A Lecture 6 Projected pattern http://www.youtube.com/watch?v=ceep7x-z4wy The Fundamental matrx Rectfcaton The eppolar constrant

More information

Scan Conversion & Shading

Scan Conversion & Shading 1 3D Renderng Ppelne (for drect llumnaton) 2 Scan Converson & Shadng Adam Fnkelsten Prnceton Unversty C0S 426, Fall 2001 3DPrmtves 3D Modelng Coordnates Modelng Transformaton 3D World Coordnates Lghtng

More information

Vanishing Hull. Jinhui Hu, Suya You, Ulrich Neumann University of Southern California {jinhuihu,suyay,

Vanishing Hull. Jinhui Hu, Suya You, Ulrich Neumann University of Southern California {jinhuihu,suyay, Vanshng Hull Jnhu Hu Suya You Ulrch Neumann Unversty of Southern Calforna {jnhuhusuyay uneumann}@graphcs.usc.edu Abstract Vanshng ponts are valuable n many vson tasks such as orentaton estmaton pose recovery

More information

Cell Count Method on a Network with SANET

Cell Count Method on a Network with SANET CSIS Dscusson Paper No.59 Cell Count Method on a Network wth SANET Atsuyuk Okabe* and Shno Shode** Center for Spatal Informaton Scence, Unversty of Tokyo 7-3-1, Hongo, Bunkyo-ku, Tokyo 113-8656, Japan

More information