Multi-view Occlusion Reasoning for Probabilistic Silhouette-Based Dynamic Scene Reconstruction


Int J Comput Vis (2010) 90 · DOI /s y

Multi-view Occlusion Reasoning for Probabilistic Silhouette-Based Dynamic Scene Reconstruction

Li Guan · Jean-Sébastien Franco · Marc Pollefeys

Received: 22 July 2008 / Accepted: 8 April 2010 / Published online: 27 April 2010
© Springer Science+Business Media, LLC 2010

Abstract In this paper, we present an algorithm to probabilistically estimate object shapes in a 3D dynamic scene using their silhouette information derived from multiple geometrically calibrated video camcorders. The scene is represented by a 3D volume. Every object in the scene is associated with a distinctive label to represent its existence at every voxel location. The label links together automatically learned view-specific appearance models of the respective object, so as to avoid the photometric calibration of the cameras. Generative probabilistic sensor models can be derived by analyzing the dependencies between the sensor observations and object labels. Bayesian reasoning is then applied to achieve robust reconstruction against real-world environment challenges, such as lighting variations, changing background, etc. Our main contribution is to explicitly model the visual occlusion process and show: (1) static objects (such as trees or lamp posts), as parts of the pre-learned background model, can be automatically recovered as a byproduct of the inference; (2) ambiguities due to inter-occlusion between multiple dynamic objects can be alleviated, and the final reconstruction quality is drastically improved. Several indoor and outdoor real-world datasets are evaluated to verify our framework.

L. Guan (✉) · M. Pollefeys, UNC-Chapel Hill, Chapel Hill, USA. e-mail: lguan@cs.unc.edu, marc@cs.unc.edu
J.-S. Franco, LaBRI INRIA Sud-Ouest, University of Bordeaux, Talence Cedex, France. e-mail: jean-sebastien.franco@labri.fr
M. Pollefeys, ETH-Zürich, Zürich, Switzerland. e-mail: marc.pollefeys@inf.ethz.ch
Keywords Multi-view 3D reconstruction · Bayesian inference · Graphical model · Shape-from-silhouette · Occlusion

1 Introduction

3D shape reconstruction from real-world imagery is an important research area in computer vision. In this paper, we focus on the problem of recovering a time-varying dynamic scene involving moving (and static) objects observed from multiple fixed-position video streams with known geometric camera poses. This setup has been widely used in security surveillance, movies, medical surgery, sports broadcasting, digital 3D archiving, video games, etc. There are mainly two categories of algorithms for such multi-view setups. The first is multi-view stereo/shape from photo-consistency (Kutulakos and Seitz 2000; Broadhurst et al. 2001; Scharstein and Szeliski 2002; Slabaugh et al. 2004; Seitz et al. 2006). These methods recover the surface of an object assuming its appearance is the same across views, so that 3D surface points can be triangulated from multiple views. The output is usually a detailed surface model, because in theory object concavities can be recovered. However, in practice, many challenges exist. On the one hand, the cross-view consistent appearance assumption usually requires tedious radiometric calibration of the cameras. This is hard to realize in outdoor scenes without the constant illumination required by most of the state-of-the-art radiometric calibration approaches (Ilie and Welch 2005; Joshi et al. 2005; Takamatsu et al. 2008). In addition, limited camera field of view, motion blur, specular surfaces, object self-occlusion

and over-compression of the videos may all invalidate the consistency assumption. On the other hand, even if the appearance is the same across views, the 3D point triangulation technique may still fail in homogeneous regions (such as the shirts in Fig. 1(c), which are common in practical datasets), where no 2D feature point can be distinctively located. The method we present in this paper falls in the second category, Shape from Silhouette methods (Matusik et al. 2000, 2001; Lazebnik et al. 2001; Franco and Boyer 2003), which usually depict the scene as foreground moving objects against a known static background, and generally assume the silhouette of a foreground object in a camera view can be subtracted from the background, e.g. Fig. 1(d) and (e). Assuming the camera network is geometrically calibrated beforehand, the back-projected silhouette cones intersect one another to form the visual hull (Baumgart 1974; Laurentini 1994), an approximate shape of the original object. Silhouette-based algorithms are relatively simple, fast, and output a global closed shape of the object. Therefore they are good choices for dynamic scene analysis. They also do not require object appearance to be similar across views, thus bypassing the radiometric calibration of the camera network. And they are not affected by homogeneous regions of the objects either. For the above reasons, many state-of-the-art multi-view stereo approaches such as (Sinha and Pollefeys 2005; Furukawa and Ponce 2006) use a visual hull as initialization, or use silhouette-based constraints. However, Shape from Silhouette methods have their own caveats: most silhouette-based methods are highly dependent on appearance-based background modeling, which is usually sensitive to imaging sensor noise, shadows, illumination variations in the scene, etc. Also, the background subtraction techniques are usually unstable when the modeled object has an appearance similar to the background.
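The silhouette-cone intersection described above can be sketched as a simple per-voxel test: a voxel belongs to the visual hull only if it projects inside the foreground silhouette of every view. This is a minimal illustrative sketch (function and variable names are ours, not the paper's), assuming given 3×4 projection matrices and binary silhouette masks:

```python
import numpy as np

def visual_hull(voxels, cameras, silhouettes):
    """Carve a visual hull: a voxel survives only if it projects inside
    every camera's foreground silhouette (the intersection rule).
    `cameras` holds 3x4 projection matrices; `silhouettes` are binary
    HxW masks. Illustrative sketch, not the paper's implementation."""
    occupied = np.ones(len(voxels), dtype=bool)
    for P, sil in zip(cameras, silhouettes):
        h, w = sil.shape
        X_h = np.hstack([voxels, np.ones((len(voxels), 1))])  # homogeneous coords
        proj = X_h @ P.T
        u = np.round(proj[:, 0] / proj[:, 2]).astype(int)
        v = np.round(proj[:, 1] / proj[:, 2]).astype(int)
        inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
        # a voxel projecting outside the image or outside the silhouette is carved
        fg = np.zeros(len(voxels), dtype=bool)
        fg[inside] = sil[v[inside], u[inside]] > 0
        occupied &= fg
    return occupied
```

Note how a single corrupted silhouette removes a voxel irreversibly, which is exactly the fragility that motivates the probabilistic treatment in this paper.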
Therefore, silhouette-based 3D modeling techniques were usually used in a controlled, man-made environment, such as a turn-table setup or an indoor laboratory. In order to extend these approaches to uncontrolled, natural environments, researchers have explored different possibilities to improve the robustness, such as adaptively updating the background models (Stauffer and Grimson 1999; Elgammal et al. 2002; Kim et al. 2005), using a discrete global optimization framework (Snow et al. 2000), proposing silhouette priors over multi-view sets (Grauman et al. 2003), and introducing a sensor fusion scheme to compute the probability of existence of the 3D shape (Franco and Boyer 2005). There is one more challenge for silhouette-based methods to work in a general environment: occlusions, which can be categorized into three types. (1) Self-occlusion. It happens to every closed-surface object, where a part of the object blocks another part of itself. The lack of information in the occluded region introduces ambiguities, and is one of the main reasons why a visual hull is always larger than the real shape. Given a certain number of camera views, in the absence of further surface information, self-occlusions cannot always be handled because of silhouette ambiguities. In this paper, we mainly address the other two types of occlusion. (2) Static occlusion. It happens when a static object blocks a dynamic object with respect to a certain camera view, such as the sculpture blocking the person in Fig. 1(b). In this paper, we call the static object (the sculpture) a static occluder, or simply an occluder, so as to differentiate it from a dynamic subject, such as a person. As with the sculpture in Fig. 1, occluders cannot always be removed from the scene in advance, causing their appearance to be included as part of the pre-learnt background model. When a dynamic object goes behind a static occluder, a static occlusion event happens.
When subjects of interest move behind occluders with respect to a certain point of view, the colors perceived in that view correspond to the occluder and thus are identical to the colors learned in the background model. This results in apparently corrupted silhouettes, corresponding to the region of static occlusion, as in Fig. 1(d). Consequently, due to the intersection rule, such corrupted silhouettes result in an incomplete visual hull. This type of occlusion is specific to reconstruction approaches based on silhouettes extracted using static background modeling. (3) Inter-occlusion. Occlusions may also occur between two or more dynamic objects of interest, as shown in Fig. 1(c). With the increase of such occlusions, the discriminatory power of the silhouettes decreases, resulting in reconstructed shapes much larger in volume than the real objects. In fact, when more dynamic objects cluster in the scene, the visibility ambiguity generally increases. This is discussed in detail in Sect. 2 (Fig. 3). Both static occlusion and inter-occlusion decrease the quality of the final reconstruction result, yet they are very common and almost unavoidable in everyday environments. If we plan to use silhouette-based methods in uncontrolled real-world scenes, we need to deal with both of them. The difference between static occlusion and inter-occlusion is that the static occluder's appearance is already part of the background model, while the dynamic objects' appearances are not. This requires different considerations in the problem modeling. In this paper, we explicitly model static occlusion and inter-occlusion events in a volume representation of the reconstruction environment by analyzing the visibility relationships. We show that the shape of the static occluders can be recovered incrementally by accumulating occlusion cues from the motion of the dynamic objects. Also, by using a distinct per-view appearance model for each dynamic object, inter-occlusion and multi-object visibility ambiguities can be effectively resolved, while avoiding the photometric calibration of the cameras.
All the reasoning is performed in a probabilistic Bayesian sensor fusion framework, which builds on the probabilistic silhouette-based modeling

(Franco and Boyer 2005) by introducing occlusion-related terms. The major task is to compute the posterior probability for a given voxel to be part of a certain object shape, given multi-view observations. Our algorithm is verified against real datasets to be effective and robust in general indoor and outdoor environments of densely populated dynamic scenes with possible static occluders. We present the formulations of (Guan et al. 2007, 2008) in a more consistent way, analyze the theoretical properties of the recovered static occluder, discuss the drawbacks, and propose some possible extensions of the framework.

Fig. 1 The occlusion problem for a silhouette-based method. (a) A background view with a sculpture as an irremovable static visual occluder. (b) A person in the scene. (c) Two persons, one occluding the other. (d) Background subtraction silhouette for (b). (e) Background subtraction silhouette for (c)

2 Related Work and Overview

2.1 Static Occlusion

As shown in the last section, a static occlusion makes the silhouette incomplete, and thus has a negative impact on silhouette-based modeling. Consequently, the inclusive property of the visual hull is no longer valid for models produced in this situation (Laurentini 1994): the real shape is no longer guaranteed to reside within the visual hull. Detecting and accounting for static occlusion has drawn much attention in areas such as depth layer extraction (Brostow and Essa 1999), occluding T-junction detection (Apostoloff and Fitzgibbon 2005), binary occluder mask extraction (Guan et al. 2006), and single-image object boundary interpretation (Hoiem et al. 2007). All these works are limited to 2D image space. Among papers regarding 3D occlusion, Favaro et al. (2003) use sparse 3D occluding T-junctions as salient features to recover structure and motion. In De Bonet and Viola (1999), occlusions are implicitly modeled in the context of voxel coloring approaches, using an iterative scheme with semi-transparent voxels and multiple views of a scene from the same time instant.
Our initial treatment of silhouette occlusions has led to subsequent work designed to track objects from a small set of views (Keck and Davis 2008), with some differences in assumptions and modeling: they use an iterative EM framework that at each frame first solves for the voxel occupancy, which is then fed back into the system to update the occlusion model. Also, in (Keck and Davis 2008), a hard threshold of silhouette information has to be provided during the initialization, and the occluder information is maintained in a 4D state space (a 3D space volume per camera view). We represent the static occluder explicitly with a random variable at every location in the 3D scene. Theoretically, occluder shapes can be accessed with careful reasoning about the visual hull of the incomplete silhouettes, as depicted in Fig. 2, which would lead to a deterministic algorithm to recover occluders. Let S^t be the set of incomplete silhouettes obtained at time t, and VH^t the incomplete visual hull obtained using these silhouettes. VH^t is a region that is observed by all cameras as being both occupied by an object and unoccluded from every view. Thus we can deduce an entire region R^t of points in space that are free from any static occluder shape, as the shaded cones in Fig. 2(a). Formally, R^t is the set of points X ∈ R^3 for which a view i exists, such that the viewing line of X from view i hits VH^t at a first visible point A, and X ∈ [O_i, A], with O_i the optical center of view i (Fig. 2(a)). This expresses the condition that X appears in front of the visual hull with respect to view i. The region R^t varies with t; thus, assuming static occluders and broad coverage of the scene by dynamic object motion, the free space in the scene can be deduced as the region R = ∩_{t=1}^T R^t. The shape of occluders, including concavities if they were covered by object motion, can be recovered as the complement of R in the common visibility region of all views (Fig. 2(b)).

Fig. 2 Deterministic occlusion reasoning. (a) An occluder-free region R^t can be deduced from the incomplete visual hull at time t. (b) R: occluder-free regions accumulated over time

However, this deterministic approach would yield an impractical and non-robust solution, due to inherent silhouette extraction sensitivities to noise and corruption that contribute irreversibly to the result. It also suffers from the limitation that only portions of objects that are seen by all views can contribute to occlusion reasoning. Moreover, the scheme only accumulates negative information, where occluders are certain not to be. However, positive information also underlies the problem: the discrepancies between the object's projection and the actual recorded silhouette would tell us where an occlusion event is positively happening, as long as we know where the object shape is, which the current silhouette-based method is able to provide (Franco and Boyer 2005). To lift these limitations and provide a robust solution, we propose a probabilistic approach to static occlusion reasoning, in which both the negative and positive cues are fused and compete in a complementary way towards the static occluder shape estimation.

2.2 Multiple Dynamic Objects' Inter-occlusion

Most of the existing silhouette-based reconstruction methods focus on mono-object situations, and fail to address the more general multi-object cases. When multiple dynamic objects are present in the scene, besides the inter-occlusion problem in Fig.
1(c) and (e), binary silhouettes and the resulting visual hull are not able to disambiguate regions actually occupied by dynamic objects from silhouette-consistent "ghost" regions: empty regions that project inside all dynamic objects' silhouettes, as depicted by the polygonal gray region indicated by arrows in Fig. 3(a). Ghost regions become increasingly likely as the number of observed objects rises, because it then becomes more difficult to find views that visually separate any two objects in the scene and carve out unoccupied regions of space. Ghost regions have been analyzed in the context of people counting/tracking to avoid producing ghost tracks (Yang et al. 2003; Otsuka and Mukawa 2004).

Fig. 3 The principle of multi-object silhouette reasoning for shape modeling disambiguation. (a) Ambiguous ghost regions in gray polygons, because binary silhouette back-projection does not have enough discriminability. (b) The ghost region ambiguities are reduced or eliminated by distinguishing between multiple objects' appearances

The method we propose casts the problem of silhouette modeling at the multi-object level, where ghosts can naturally be eliminated based on per-object silhouette consistency. Multi-object silhouette reasoning has been applied in the context of multi-object tracking (Mittal and Davis 2003; Fleuret et al. 2007). The inter-occlusion problem has also been studied for the specific case of transparent objects (De Bonet and Viola 1999). Recent tracking efforts also use 2D probabilistic inter-occlusion reasoning to improve object localization (Gupta et al. 2007). But none of these methods are able to provide multiple probabilistic 3D shapes from silhouette cues as proposed here. To address this problem, in addition to the background model learning, we also initialize a set of appearance models associated with every object in the scene.
Given such extra information, the probability of ghost regions can be reduced, because the set of silhouettes from different views that result in a ghost region are not drawn from consistent appearance models of any single object, as depicted in Fig. 3(b). Multiple silhouette labels have been introduced in a deterministic, purely geometric method (Ziegler et al. 2003), but this requires an arbitrary hard threshold for the number of views that define consistency. Moreover, silhouettes are assumed to be noiseless, which is violated for practical datasets or requires manual intervention. On the contrary, we propose a fully automatic framework. Similar to the static occlusion formulation, using a volumetric representation of the 3D scene, we process multi-object sequences by examining the noisy causal relationship between every

voxel and the corresponding pixels in all camera views using a Bayesian formulation. In particular, every voxel is modeled as a random variable. It can take any one of the m states representing the m possible objects in the scene. Given the knowledge that a voxel is occupied by a certain object, the camera sensor model explains what appearance distributions are supposed to be observed. This framework is able to explicitly model inter-occlusion with other objects, and estimate a window of object locations. The voxel sensor model semantics and simplifications are borrowed from the occupancy grid framework in robotics (Elfes 1989; Margaritis and Thrun 1998). The proposed method is naturally combined with the static occluder incremental recovery, because, as mentioned before, the occluder is nothing but another state of a voxel. This scheme enables us to perform silhouette inference (Sect. 3.2) in a way that reinforces regions of space which are drawn from the same conjunction of color distributions, corresponding to one object, and penalizes appearance-inconsistent regions, while accounting for object visibility. In the rest of this paper, we first introduce the fundamental probabilistic sensor fusion framework and the detailed formulations in Sect. 3. We then describe problems related to building an automatic dynamic scene analysis system in Sect. 4. Specifically, we discuss how to initialize the appearance models and keep track of the motion and status of each dynamic object. Section 5 shows the results of the proposed system and algorithm on real-world datasets. Despite the challenges in the datasets, such as lighting variation, shadows, background motion, reflection, dense population, drastic color inconsistency between views, etc., our system produces high-quality reconstructions. Section 6 analyzes the advantages and limitations of this framework, compares the two types of occlusion in more depth, and draws the future picture.
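As a toy illustration of the per-object appearance reasoning of Sect. 2.2, the following sketch scores a voxel's candidate object labels by the product over views of a per-view Gaussian appearance likelihood: a ghost voxel, whose projections mix the colors of different objects, scores poorly under every single label. The 1-D color model and all names are simplifying assumptions of ours, not the paper's sensor model:

```python
import math

def label_likelihood(obs_colors, appearance_models):
    """Score each object label k by the product over views of a 1-D Gaussian
    appearance likelihood N(obs; mu, sigma). A ghost voxel's projections mix
    colors of different objects, so no single label scores consistently high.
    `appearance_models` maps label -> list of (mu, sigma), one per camera."""
    def gauss(x, mu, sigma):
        return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
    scores = {}
    for k, per_view in appearance_models.items():
        p = 1.0
        for obs, (mu, sigma) in zip(obs_colors, per_view):
            p *= gauss(obs, mu, sigma)
        scores[k] = p
    return scores
```

With a "red" object (mean color 0.9) and a "blue" one (mean 0.1) seen in two views, a true red voxel scores high under label 1, while a ghost observing red in one view and blue in the other scores low under both labels.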
3 Probabilistic Framework

Given the complete set of symbols listed in Table 1, we can define our problem formally: at a specific time instant, given a set of geometrically calibrated and temporally synchronized video frames I from n cameras, we infer for every discretized location X in a 3D occupancy volume grid its probability of being L ∈ {∅, O, 1, ..., m, U}. This means a voxel could be empty (denoted by ∅), occupied by a static occluder (denoted by O), or occupied by one of the m objects currently in the scene, which are of known appearance models. Last but not least, one more label U could be assigned, for unidentified objects. It acts as a default label to capture all objects that are detected as different from the background but not explicitly modeled by other labels. It is useful and effective for automatic detection of new objects coming into the scene, whose appearance has not yet been learned (Sect. 4.3).

Table 1 Notations of the multi-view system
n: number of cameras
m: number of dynamic objects
i: camera index
X: 3D location in the occupancy grid
x_i: pixel at camera view i corresponding to the voxel X
l_i: viewing line of X to view i
X̂: 3D location on the viewing ray of X, in front of X with respect to view i
X̌: 3D location on the viewing ray of X, behind X with respect to view i
L: voxel labels
∅: empty space label
G: dynamic object label
U: label for a newly arriving dynamic object, whose appearance has not been learnt
O: static occluder label
I_i^t: image from camera i at time t
B_i: camera i's background model
C_i^m: dynamic object m's appearance model in view i
S_i: silhouette formation hidden variable

Theoretically, the problem is to compute the posterior probability from the camera observations; but this is not easy in practice, because the estimation of a voxel's state involves modeling dependencies with all other voxels on its viewing lines with respect to all cameras. Given the huge state space, i.e. the solid 3D volume, and multiple state labels of different objects, it is impossible to enumerate all state configurations to find the one with the highest probability.
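To make the intractability concrete: with m dynamic objects, each voxel takes one of m + 3 labels (∅, O, 1, ..., m, U), so the joint label space grows exponentially with the grid size. A hypothetical helper of ours, counting only:

```python
def num_configurations(num_voxels, num_objects):
    """Size of the joint label space: each voxel takes one of
    (empty, occluder, object 1..m, unidentified) = m + 3 labels,
    so a grid of N voxels has (m + 3) ** N joint configurations."""
    return (num_objects + 3) ** num_voxels
```

Even a tiny 4×4×4 grid with two objects already has 5^64 ≈ 10^44 configurations, far beyond exhaustive search.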
Others have encountered similar problems and proposed solutions that compress the 3D volume state space onto the 2D ground plane (Mittal and Davis 2003) and then solve for the global solution as an offline process (Fleuret et al. 2007). However, since we want to recover the full 3D information for dynamic scenes, and to keep the potential of real-time processing, the previous proposals are not satisfactory. Instead, we borrow the iterative idea of an EM framework (Keck and Davis 2008) and break the estimation into two steps: for every time instant, we first estimate the occupancy probabilities for each of the individual dynamic objects from silhouette information using a Bayesian formulation; then we estimate the inter-occlusion as well as the static occlusion in a second pass. Although refinements can be achieved by iterating over this two-step solution, we demonstrate with real datasets that the shape estimation is already good after a single iteration of the two-step process. In our formulation we address the inference of both static occluders and multiple dynamic objects in one scene.
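The two-step per-frame scheme can be summarized as scaffolding pseudocode; the two callbacks stand in for the Bayesian inference passes described in this section, and all names are ours, not the authors' implementation:

```python
def reconstruct_sequence(frames, estimate_object_occupancy, estimate_occlusions):
    """Two-pass per-frame scheme: first estimate each dynamic object's
    occupancy from silhouette cues, then refine the static- and
    inter-occlusion estimates from those occupancies. One iteration of
    the two-step process per frame; the occluder belief is carried
    forward so it accumulates online over time."""
    occluder_belief = None           # accumulated static occluder estimate
    results = []
    for images in frames:            # synchronized multi-view images at time t
        occupancies = estimate_object_occupancy(images, occluder_belief)              # pass 1
        occluder_belief = estimate_occlusions(images, occupancies, occluder_belief)   # pass 2
        results.append((occupancies, occluder_belief))
    return results
```

Plugging in stub callbacks shows the data flow: pass 1 sees the images and the current occluder belief, pass 2 updates that belief from pass 1's occupancies.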

Our treatment of the static occluder is indifferent to the number of dynamic objects in the scene. To keep notations uncluttered, we will present the occluder inference framework in the context of only one undiscriminated dynamic object label G with only two states G ∈ {0, 1} (Sect. 3.1). We will then specifically explain how to model multi-object inter-occlusions in Sect. 3.2. In the latter we leave out occluder inference for clarity, although we later show how to perform both tasks simultaneously.

3.1 Static Occluder

In this section, to introduce the static occluder formulation, for simplicity, we assume only one dynamic object is in the scene. In the results section, Sect. 5, we show that our occlusion modeling technique also applies to multiple dynamic objects. Let a binary variable G denote the single dynamic object in the scene at voxel X; namely, G = 1 means the voxel is occupied by the object, and G = 0 denotes it is not. The occluder occupancy state can also be expressed using the binary label O: O = 1 means occupied, and O = 0 not. Notice that the static occluder state O is assumed to be fixed, while the dynamic object state G varies over time t ∈ {1, ..., T}, where T denotes the time instant of the last available frame so far. The dynamic object occupancy of voxel X at time t is expressed by G^t. As shown in Fig. 4(a), the regions of interest to infer the probabilities of both G and O are on the viewing lines l_i, i ∈ {1, ..., n}, from the camera centers through X. The voxel X projects to n image pixels x_i, i ∈ {1, ..., n}, whose color observed at time t in view i is expressed by the variable I_i^t. We assume that background images, which are generally static, were pre-recorded free of dynamic objects, and that the appearance and variability of background colors for pixel x_i has been modeled using a set of parameters B_i. Such observations can be used to infer the probability of dynamic object occupancy in the absence of static occluders. The problem of recovering occluder occupancy is more complex because it requires modeling interactions between voxels on the same viewing lines.
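The role of the background model B_i can be illustrated with the usual soft background-subtraction step: compare the background likelihood of a pixel's observed color against an uninformative uniform foreground model. This is a generic sketch of ours under a per-pixel 1-D Gaussian B_i with colors normalized to [0, 1], not the paper's exact formulation:

```python
import math

def silhouette_probability(color, bg_mean, bg_std, p_uniform=1.0):
    """Soft silhouette cue from a per-pixel Gaussian background model:
    posterior that the pixel is foreground, comparing the Gaussian
    background likelihood against a uniform foreground color density
    (no object color favored a priori), assuming equal priors.
    Colors are assumed normalized to [0, 1]."""
    bg_like = (math.exp(-0.5 * ((color - bg_mean) / bg_std) ** 2)
               / (bg_std * math.sqrt(2 * math.pi)))
    return p_uniform / (p_uniform + bg_like)
```

A color near the background mean yields a low foreground probability, while a color far from it yields a probability near 1; the point is that the cue stays soft rather than thresholded, which is what the probabilistic fusion below consumes.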
The relevant statistical variables are shown in Fig. 4(b).

3.1.1 Viewing Line Modeling

Because of potential mutual occlusions, to infer O, one must account for the other occupancies along the viewing lines of X. Static occluders or dynamic shapes can be present along the same viewing line, leading to different image formations at the camera view. Accounting for the combinatorial number of possibilities for voxel states along X's viewing line is neither necessary nor meaningful: first, because occupancies of neighboring voxels are fundamentally correlated to the presence or absence of a single common object; second, because the main useful information one needs to make occlusion decisions about X is whether something is in front of it or behind it, regardless of the exact locations along the viewing line. With this in mind, the pixel observation at x_i with respect to a certain X's viewing line l_i can be described with three components: the state of X itself, the state of occlusion of X by anything in front, and the state of what is at the back of X. The front and back components are modeled by extracting the two most influential modes in front of and behind X. Specifically, the locations of the two modes are given by two voxels X̂_i^t and X̌_i^t. We select X̂_i^t as the voxel at time t that most contributes to the belief that X is obstructed by a dynamic object along l_i, and X̌_i^t as the voxel most likely to be occupied by a dynamic object behind X on l_i at time t. With this three-component modeling comes a number of related statistical variables, illustrated in Fig. 4(b). The occupancy of voxels X̂_i^t and X̌_i^t by the visual hull of a dynamic object at time t on l_i is expressed by two binary state variables, Ĝ_i^t and Ǧ_i^t respectively. Two binary state variables Ô_i^t and Ǒ_i^t express the presence or absence of an occluder at voxels X̂_i^t and X̌_i^t respectively. Note the difference in semantics between the two variable groups Ĝ_i^t, Ǧ_i^t and Ô_i^t, Ǒ_i^t. The former designates dynamic shape visual hull occupancies at different time instants and chosen positions, while the latter expresses the static occluder occupancies.
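Selecting the two modes X̂_i^t and X̌_i^t can be sketched as an argmax over the dynamic-occupancy beliefs in front of and behind X along one viewing line. An illustrative sketch of ours; the paper's actual selection criterion may weight contributions differently:

```python
def front_back_modes(voxel_index, occupancy_probs):
    """Pick the two most influential viewing-line modes around voxel X:
    the voxel in front of X (nearer the camera) with the highest
    dynamic-occupancy belief, and the strongest such voxel behind X.
    `occupancy_probs` lists P(G=1) for the voxels along one viewing
    line, ordered from the camera outward. Returns a pair of
    (index, prob) tuples, or None where a side is empty."""
    front = occupancy_probs[:voxel_index]
    back = occupancy_probs[voxel_index + 1:]
    x_hat = max(range(len(front)), key=lambda j: front[j]) if front else None
    x_check = (voxel_index + 1 + max(range(len(back)), key=lambda j: back[j])) if back else None
    return ((x_hat, front[x_hat]) if x_hat is not None else None,
            (x_check, occupancy_probs[x_check]) if x_check is not None else None)
```

This captures the simplification of the text: instead of enumerating all joint states along the line, only the strongest front and back candidates enter the inference for X.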
The locations of Ĝ_i^t and Ǧ_i^t at different times are different, because the dynamic shape may have moved over time; while Ô_i^t and Ǒ_i^t are always associated with the same locations as Ĝ_i^t and Ǧ_i^t, for the purpose of our simplified viewing line state enumeration scheme. All the aforementioned states need to be considered because they jointly influence the occupancy inference at X. From the image formation perspective, varying the states of Ĝ_i^t, Ǧ_i^t, Ô_i^t, Ǒ_i^t, G and O (the latter two being the dynamic object and occluder states at the voxel location X) would form different image pixel values at x_i. For legibility, we occasionally refer to the conjunction of a group of variables by dropping the indices and exponents, e.g. B = {B_1, ..., B_n}.

3.1.2 Joint Distribution

We now explain the dependencies between the problem variables to simplify their joint probability distribution. An intuitive assumption is that different views can be independently predicted without the knowledge of other views, given the knowledge about the scene G and O. The background model for one view can be independently trained. A second assumption is that the space occupancy variables at X depend only on the information along the optic rays that go through X, which may include not just the single pixel that the voxel projects onto, but a 2D neighborhood of pixels around the voxel's projection. We assume that the viewing line variables are sufficient to model the dependencies

Fig. 4 Problem overview. (a) Geometric context of voxel X. (b) Main statistical variables used to infer the occluder occupancy probability of X. G^t, Ĝ_i^t, Ǧ_i^t: dynamic object occupancies at relevant voxels at, in front of, and behind X respectively. O, Ô_i^t, Ǒ_i^t: static occluder occupancies at, in front of, and behind X. I_i^t, B_i: colors and background color models observed where X projects in the images

Fig. 5 The dependency graph for the static occluder inference at voxel X. O and G^t are the occluder occupancy and dynamic object occupancy states at time t at location X. Notice that the background B_i is assumed to be dependent only on the view but constant over time; while Ô_i^t and Ǒ_i^t are at different locations at different times for X, though O itself is not a function of time

between a voxel X and the other voxels on its viewing lines. This assumption allows us to use the common silhouette method simplification, which consists in independently computing the probabilities of each voxel X. This avoids the highly complex problem of updating the full grid state O while simultaneously accounting for viewing line dependencies. Besides, this assumption is reasonable because it is similar to deterministic volumetric visual hull algorithms, where every voxel's status is evaluated individually against its projections onto image pixels. Results show that independent estimation, while not as exhaustive as a global search over all voxel configurations, still provides very robust and usable information, at a much lower cost. We now describe the noisy interactions between the variables considered, through the decomposition of their joint distribution p(O, G, Ô, Ĝ, Ǒ, Ǧ, I, B). Given the variable dependency graph shown in Fig. 5, we propose:

p(O, G, Ô, Ĝ, Ǒ, Ǧ, I, B) = p(O) ∏_{t=1}^{T} p(G^t | O) ∏_{i=1}^{n} p(Ô_i^t) p(Ĝ_i^t | Ô_i^t) p(Ǒ_i^t) p(Ǧ_i^t | Ǒ_i^t) p(I_i^t | Ô_i^t, Ĝ_i^t, O, G^t, Ǒ_i^t, Ǧ_i^t, B_i).  (1)

p(O), p(Ô_i^t) and p(Ǒ_i^t) are priors of occluder occupancy. We set them to a single constant distribution P_o, which reflects the expected ratio between occluder voxels and non-occluder voxels in a scene. No particular region of space is to be favored a priori.
p(g t O), p(ĝ t Ôt ), p(ǧt Ǒt ) are prors of dynamc vsual hull occupancy wth dentcal semantcs. Ths choce of terms reflects the followng modelng decsons. Frst, the dynamc vsual hull occupances nvolved are consdered ndependent of one another as they synthesze the nformaton of three dstnct regons for each vewng lne. However they depend upon the knowledge of occluder occupancy at the correspondng voxel poston, because occluder and dynamc object occupances are mutually exclusve at a gven scene locaton. Importantly however, because we only use slhouette cues, we do not have drect access to dynamc object occupances but to the occupances of ts vsual hull. Fortunately ths ambguty can be adequately modeled n a Bayesan framework, by ntroducng a local hdden varable H expressng the correlaton between dynamc and occluder occupancy: p(g t O) = H p(h)p(g t H, O). (2) We set p(h = 1) = P c usng a constant expressng our pror belef about the correlaton between vsual hull and occluder occupancy. The pror p(g t H, O) explans what we expect to know about G t gven the state of H and O: p(g t = 1 H = 0, O = ω) = P Gt ω, (3) p(g t = 1 H = 1, O = 0) = P Gt, (4) p(g t = 1 H = 1, O = 1) = P go, (5) wth P Gt the pror dynamc object occupancy probablty as computed ndependently of occlusons (Franco and Boyer 2005), and P go set close to 0, expressng that t s unlkely that the voxel s occuped by dynamc object vsual hulls when the voxel s known to be occuped by an occluder and both dynamc and occluder occupancy are known to be strongly correlated (5). The probablty of vsual hull occupancy s gven by the prevously computed occupancy pror,

in the case of non-correlation (3), or when the states are correlated but the occluder occupancy is known to be empty (4).

3.1.3 Image Sensor Model

We choose the sensor model p(I_i^t | Ô_i^t, Ĝ_i^t, O, G^t, Ǒ_i^t, Ǧ_i^t, B_i) in (1) to be governed by a hidden local per-pixel process S. The binary variable S represents the hidden silhouette detection state (0 or 1) at this pixel. It is unobserved information and can be marginalized, given an adequate split into two subterms:

p(I_i^t | Ô_i^t, Ĝ_i^t, O, G^t, Ǒ_i^t, Ǧ_i^t, B_i) = Σ_S p(I_i^t | S, B_i) p(S | Ô_i^t, Ĝ_i^t, O, G^t, Ǒ_i^t, Ǧ_i^t).  (6)

p(I_i^t | S, B_i) indicates what color distribution we expect to observe given the knowledge of the silhouette detection state and the background color model at this pixel. When S = 0, the silhouette is undetected and thus the color distribution is dictated by the pre-observed background model B_i (considered Gaussian in our experiments). When S = 1, a dynamic object's silhouette is detected, in which case our knowledge of color is limited; thus we use a uniform distribution in this case, favoring no dynamic object color a priori. p(S | Ô_i^t, Ĝ_i^t, O, G^t, Ǒ_i^t, Ǧ_i^t) is the second part of our sensor model, which makes explicit what silhouette state is expected to be observed given the three dominant occupancy state variables of the corresponding viewing line. Since these are encountered in the order of visibility X̂_i^t, X, X̌_i^t, the following relations hold:

p(S | {Ô_i^t, Ĝ_i^t, O, G^t, Ǒ_i^t, Ǧ_i^t} = {o, g, k, l, m, n})
= p(S | {Ô_i^t, Ĝ_i^t, O, G^t, Ǒ_i^t, Ǧ_i^t} = {0, 0, o, g, p, q})
= p(S | {Ô_i^t, Ĝ_i^t, O, G^t, Ǒ_i^t, Ǧ_i^t} = {0, 0, 0, 0, o, g})
= P_S(S | o, g),  ∀(o, g) ≠ (0, 0), ∀(k, l, m, n, p, q).  (7)

These expressions convey two characteristics. First, that the form of this distribution is given by the first non-empty occupancy component in the order of visibility, regardless of what is behind this component on the viewing line. Second, that the form of the first non-empty component is given by an identical sensor prior P_S(S | o, g).
We set the four parametric distributions of P_S(S | o, g) as follows:

  P_S(S = 1 | 0, 0) = P_fa,   P_S(S = 1 | 1, 0) = P_fa,   (8)
  P_S(S = 1 | 0, 1) = P_d,    P_S(S = 1 | 1, 1) = 0.5,   (9)

where P_fa ∈ [0, 1] and P_d ∈ [0, 1] are constants expressing the prior probability of false alarm and the probability of detection, respectively. They can be chosen once for all datasets, as the method is not sensitive to the exact value of these priors. Meaningful values for P_fa are close to 0, while P_d is generally close to 1. Equation (8) expresses the cases where no silhouette is expected to be detected in images, i.e. either when there are no objects at all on the viewing line, or when the first encountered object is a static occluder, respectively. Equation (9) expresses two distinct cases. First, the case where a dynamic object's visual hull is encountered first on the viewing line, in which case we expect to detect a silhouette at the matching pixel. Second, the case where both an occluder and a dynamic visual hull are present at the first non-free voxel. This is perfectly possible, because the visual hull is an overestimate of the true dynamic object shape: while the true shapes of objects and occluders are naturally mutually exclusive, the visual hull of dynamic objects can overlap with occluder voxels. In this case we set the distribution to uniform, because the silhouette detection state cannot be predicted: it can be caused by shadows cast by dynamic objects on occluders in the scene, or by noise.

Static Occluder Occupancy Inference

Estimating the occluder occupancy at a voxel translates to estimating p(O | I, B) in Bayesian terms. Applying Bayes' rule to the modeled joint probability (1) leads to the following expression, once hidden variable sums are decomposed to factor out terms not required at each level of the sum:

  p(O | I, B) = (1/z) p(O) Π_{t=1}^{T} Σ_{G^t} p(G^t | O) Π_{i=1}^{n} [ Σ_{Ô_i^t, Ĝ_i^t} p(Ô_i^t) p(Ĝ_i^t | Ô_i^t) Σ_{Ǒ_i^t, Ǧ_i^t} p(Ǒ_i^t) p(Ǧ_i^t | Ǒ_i^t) P_i^t ],   (10)

where

  P_i^t = p(I_i^t | Ô_i^t, Ĝ_i^t, O, G^t, Ǒ_i^t, Ǧ_i^t, B_i).   (11)

P_i^t expresses the contribution of view i at time t.
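The visibility-order reduction of Eq. (7) together with the sensor prior of Eqs. (8)-(9) can be sketched as follows; P_fa and P_d defaults are illustrative values, not the paper's settings:

```python
def first_nonempty(pairs):
    """Visibility-order reduction of Eq. (7): scan the (occluder, hull)
    occupancy pairs from the camera outward and keep the first non-empty
    one; if all three segments are empty, return (0, 0)."""
    for o, g in pairs:
        if (o, g) != (0, 0):
            return (o, g)
    return (0, 0)

def silhouette_detection_prior(pairs, P_fa=0.05, P_d=0.95):
    """P_S(S = 1 | o, g) of Eqs. (8)-(9), applied to the first non-empty
    component on the viewing line."""
    o, g = first_nonempty(pairs)
    if g == 1:
        # Hull seen first: detection expected; hull/occluder overlap: uniform.
        return 0.5 if o == 1 else P_d
    # Empty line, or a static occluder encountered first: false-alarm rate.
    return P_fa
```

For example, a line whose first non-empty component is a dynamic visual hull yields P_d, while a line blocked first by a static occluder yields P_fa regardless of what lies behind it.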
The formulation therefore expresses Bayesian fusion over the various observed time instants and available views, with marginalization over the unknown viewing line states (10). The normalization constant z is easily obtained by ensuring summation to 1 of the distribution.

Online Incremental Computation

With the formulation of the previous sections, the static occluder probability is computed by considering all the occlusion events (between the dynamic shape and the static occluder) that have happened up to the current frame. It is an online process. However, it has a subtle problem: for a given voxel location which is supposed to be free of occupancy,

the occluder probability may be high just because an occlusion event has happened along the viewing line somewhere behind the voxel (a real occluder is behind the voxel) with respect to the camera, as shown in Fig. 9. As the figure shows, this is only likely to happen at the beginning of the videos, when too little information has been collected. The voxel's occluder probability eventually drops to a reasonable near-zero value once more evidence has accumulated over time showing that, since this voxel is not blocking the dynamic shape behind it (as seen, perhaps, from all other views), it is more likely to be an empty voxel.

There is a way, however, to detect this bias in the voxel probability estimation. On a second look at the problem, this early decision is made from evidence that comes from only a few camera views, either because those views' geometric calibration errors are larger than others', or because there happen to be some real static occluders in those views behind the misjudged voxel. Intuitively, a voxel X's probability estimation becomes more reliable as its occlusion information is confirmed from more views, i.e. when a dynamic object has passed behind X in more views. We thus introduce a measure of observability and trustworthiness of a voxel's estimation: the reliability R of a voxel at a certain time instant. Specifically, we model the intuition that voxels whose occlusion cues arise from an abnormally low number of views should not be trusted. Since this involves all cameras and their observations jointly, the inclusion of this constraint in our initial model would break the symmetry in the inference formulated in (10) and defeat the possibility of online updates. Instead, we opt to use a second criterion in the form of a reliability measure R ∈ [0, 1]. Small values indicate poor coverage of dynamic objects, while large values indicate sufficient cue accumulation. We define reliability using the following expression:

  R = (1/n) Σ_{i=1}^{n} max_t (1 − P_{Ĝ_i^t}) P_{Ǧ_i^t},   (12)

with P_{Ĝ_i^t} and P_{Ǧ_i^t} the prior probabilities of dynamic visual hull occupancy.
R examines, for each camera, the maximum occurrence across the examined time sequence of X being both unobstructed and in front of a dynamic object. This determines how well a given view was able to contribute to the estimation across the sequence. R then averages these values across views, to measure the overall quality of observation and the underlying coverage of dynamic object motion for the purpose of occlusion inference. The reliability R is not a probability, but an indicator. It can be used online, in conjunction with the occlusion probability estimation, to evaluate a conservative occluder shape at all times, by only considering voxels for which R exceeds a certain quality threshold. As shown in the results section, it can be used to reduce the sensitivity to noise in regions of space that have only been observed marginally.

Accounting for the Recovered Occluder

As more data becomes available and reliable, the results of occluder estimation can be accounted for when inferring the occupancies of dynamic objects. This translates to the evaluation of p(G^τ | I^τ, B) for a given voxel X and time τ. The occlusion information obtained can be included as a prior in dynamic object inference, by adequately modifying the existing probabilistic framework (Franco and Boyer 2005), leading to the following simplified joint probability distribution:

  p(O) p(G^τ | O) Π_{i=1}^{n} p(Ô_i^τ) p(Ĝ_i^τ | Ô_i^τ) p(I_i^τ | Ô_i^τ, Ĝ_i^τ, O, G^τ, B_i),

where G^τ and O are the dynamic and occluder occupancies at the inferred voxel, and Ô_i^τ, Ĝ_i^τ the variables matching the most influential static occluder component along viewing line l_i in front of X. This component is selected as the voxel whose prior of being occupied is maximal, as computed to date by occlusion inference. In this inference, there is no need to consider voxels behind X, because knowledge about their occluder occupancy has no influence on the dynamic object occupancy state of X. The parametric forms of this distribution have identical semantics as before, but different assignments because of the nature of the inference. Naturally, no prior information about dynamic occupancy is assumed here.
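Selecting the "most influential static occluder component" in front of X amounts to a simple argmax over the occluder posteriors computed so far; a sketch, with illustrative names:

```python
def dominant_occluder_index(occluder_priors_in_front):
    """Pick the voxel on the viewing-line segment between the camera and X
    whose occluder prior p(O = 1), as estimated to date by Eq. (10), is
    maximal; this voxel supplies the (O-hat, G-hat) pair in the simplified
    joint distribution."""
    return max(range(len(occluder_priors_in_front)),
               key=lambda j: occluder_priors_in_front[j])
```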
p(O) and p(Ô_i^τ) are set using the result to date of expression (10) at their respective voxels, as a prior. p(G^τ | O) and p(Ĝ_i^τ | Ô_i^τ) are constant: p(G^τ = 1 | O = 0) = 0.5 expresses a uniform prior for dynamic objects when the voxel is known to be occluder free; p(G^τ = 1 | O = 1) = P_{go} expresses a low prior of dynamic visual hull occupancy given the knowledge of occluder occupancy, as in (5). The term p(I_i^τ | Ô_i^τ, Ĝ_i^τ, O, G^τ, B_i) is set identically to (7), only stripped of the influence of Ǒ_i^τ, Ǧ_i^τ.

3.2 Multiple Dynamic Objects

In this section, we focus on the inference of multiple dynamic objects. Since a dynamic object changes shape and location constantly, our dynamic object reconstruction has to be computed for every frame in time; there is no way to accumulate the information over time as we did for the static occluder. We assume static occlusion is computed in an independent thread and can be used as a prior in this inference. We thus focus here on the multi-object problem occurring at one time instant t. We introduce new notation to account for up to m dynamic objects of interest, in a scene observed by n calibrated cameras. Occupancies G at a given voxel X need

Fig. 7 The dependency graph for the dynamic object inference at voxel X, assuming m dynamic objects in the scene and that the probabilities for X to take the other labels are known. The background model for each view B_i, the color model of each object for each view C_i^l, and the static occluder O are not drawn, for clarity.

Fig. 6 Overview of the main statistical variables and the geometry of the problem. G is the occupancy at voxel X and lives in a state space L of object labels. {I_i} are the color states observed at the n pixels where X projects. {G_i^{v_j}} are the states in L of the most likely obstructing voxels on the viewing line, for each of the m objects, enumerated in their order of visibility {v_j}.

now to be defined over an extended set of m + 2 labels (described in the following section) rather than {0, 1}, to model occupancy distributions over several objects. We now also assume some prior knowledge about the scene state is available for each voxel X in the lattice and can be used in the inference. Various uses of this assumption will be demonstrated in Sect. 4. Let us review the statistical variables used to model the scene state and the image generation process, and to infer G, as depicted in Fig. 6.

Statistical Variables

Scene Voxel State Space  The occupancy state of X is represented by G ∈ L, where L is the set of labels {∅, 1, ..., m, U}. A voxel is either empty (∅), one of the m objects the model is keeping track of (numerical labels), or occupied by an unidentified object (U). U is intended to act as a default label capturing all objects that are detected as different from the background but not explicitly modeled by other labels, which proves useful for the automatic detection of new objects (Sect. 4.3).

Observed Appearance  The voxel X projects to a set of pixels, whose colors I_i, i ∈ {1, ..., n}, we observe in images. We assume these colors are drawn from a set of object- and view-specific color models whose parameters we note C_i^l. More complex appearance models are possible, using gradient or texture information, without loss of generality.
Latent Viewing Line Variables  To account for inter-object occlusion, we need to model the contents of viewing lines and how they contribute to image formation. We assume some a priori knowledge about where objects lie in the scene. The presence of such objects can have an impact on the inference of G because of the visibility of objects and how they affect G. Intuitively, conclusive information about G cannot be obtained from a view i if a voxel in front of G with respect to i is occupied by another object, for example. However, G directly influences the color observed if it is unoccluded and occupied by one of the objects. But if G is known to be empty, then the color observed at pixel I_i reflects the appearance of the objects behind X in image i, if any. These visibility intuitions are modeled below.

It is again not meaningful to account for the combinatorial number of occupancy possibilities along the viewing rays of X. This is because neighboring voxel occupancies on the viewing line usually reflect the presence of the same object and are therefore correlated. In fact, assuming we witness no more than one instance of each of the m objects along the viewing line, the fundamental information that is required to reason about X is the knowledge of the presence and ordering of the objects along this line. To represent this knowledge, as depicted in Fig. 6, assuming prior information about occupancies is already available at each voxel, we extract, for each label l ∈ L and each viewing line i ∈ {1, ..., n}, the voxel whose probability of occupancy is dominant for that label on the viewing line. This corresponds to electing the voxels which best represent the m objects and have the most influence on the inference of G. We then account for this knowledge in the problem of inferring X, by introducing a set of statistical occupancy variables G_i^l ∈ L corresponding to these extracted voxels. This is a generalization of the idea expressed earlier for occluder voxels, to the general case of m-object inter-occlusions.

Dependencies Consideration

Based on the dependency graph in Fig.
7, we propose a set of simplifications to the joint probability distribution of the set of variables, reflecting the prior knowledge we have about the problem. To simplify the writing, we will often note the conjunction of a set of variables as follows: G_{1:n}^{1:m} = {G_i^l}, i ∈ {1, ..., n}, l ∈ {1, ..., m}. We now decompose the joint probability p(G, G_{1:n}^{1:m}, I_{1:n}, C_{1:n}^{1:m}) as:

  p(G) Π_{l ∈ L} p(C_{1:n}^l) Π_{i ∈ {1,...,n}, l ∈ L} p(G_i^l | G) Π_{i=1}^{n} p(I_i | G, G_i^{1:m}, C_i^{1:m}).   (13)

Prior Terms  p(G) carries prior information about the current voxel. This prior can reflect different types of knowledge and constraints already acquired about G, e.g. localization information to guide the inference (Sect. 4). p(C_{1:n}^l) is the prior over the view-specific appearance models of a given object l. The prior, as written over the conjunction of these parameters, could express expected relationships between the appearance models of different views, even if not color-calibrated. Since the focus in this paper is on the inference of voxel X, we do not use this capability here and assume p(C_{1:n}^l) to be uniform.

Viewing Line Dependency Terms  We have summarized the prior information along each viewing line using the m voxels most representative of the m objects, so as to model inter-object occlusion phenomena. However, when examining a particular label G = l, keeping the occupancy information about G_i^l would lead us to account for intra-object occlusion phenomena, which in effect would lead the inference to favor mostly voxels from the front visible surface of object l. Because we wish to model the volume of object l, we discard the influence of G_i^l when G = l:

  p(G_i^k | {G = l}) = P(G_i^k)  when k ≠ l,   (14)
  p(G_i^l | {G = l}) = δ_∅(G_i^l)  ∀l ∈ L,   (15)

where P(G_i^k) is a distribution reflecting the prior knowledge about G_i^k, and δ_∅(G_i^k) is the distribution giving all the weight to label ∅. In (15), p(G_i^l | {G = l}) is thus enforced to be empty when G is known to be representing label l, which ensures that the same object is represented only once on the viewing line.

Image Formation Terms  p(I_i | G, G_i^{1:m}, C_i^{1:m}) is the image formation term. It explains what color we expect to observe given the knowledge of the viewing line states and the per-object color models. We decompose each such term into two subterms, by introducing a local latent variable S ∈ L representing the hidden silhouette state:

  p(I_i | G, G_i^{1:m}, C_i^{1:m}) = Σ_S p(I_i | S, C_i^{1:m}) p(S | G, G_i^{1:m}).
(16)

The term p(I_i | S, C_i^{1:m}) simply describes what color is likely to be observed in the image given the knowledge of the silhouette state and the appearance models corresponding to each object. S acts as a mixture label: if {S = l}, then I_i is drawn from the color model C_i^l. For objects (l ∈ {1, ..., m}) we typically use Gaussian Mixture Models (GMMs) (Stauffer and Grimson 1999) to efficiently describe the appearance information of dynamic object silhouettes. For the background (l = ∅) we use a per-pixel Gaussian learned from pre-observed sequences, although other models are possible. When l = U, the color is drawn from the uniform distribution, as we make no assumption about the color of previously unobserved objects.

The silhouette formation term p(S | G, G_i^{1:m}) requires that the variables be considered in their visibility order to model the occlusion possibilities. Note that this order can be different from 1, ..., m. We note {G_i^{v_j}}, j ∈ {1, ..., m}, the variables G_i^{1:m} as enumerated in the permuted order {v_j} reflecting their visibility ordering on viewing line l_i. If g denotes the particular index after which the voxel X itself appears on viewing line l_i, then we can rewrite the silhouette formation term as p(S | G_i^{v_1} ... G_i^{v_g}, G, G_i^{v_{g+1}} ... G_i^{v_m}). A distribution of the following form can then be assigned to this term:

  p(S | ∅, ..., ∅, l, ...) = d_l(S)  with l ≠ ∅,   (17)
  p(S | ∅, ..., ∅) = d_∅(S),   (18)

where d_k(S), k ∈ L, is a family of distributions giving strong weight to label k and lower, equal weight to the others, determined by a constant probability of detection P_d ∈ [0, 1]: d_k(S = k) = P_d and d_k(S ≠ k) = (1 − P_d) / (|L| − 1), to ensure summation to 1. Equation (17) thus expresses that the silhouette pixel state reflects the state of the first visible non-empty voxel on the viewing line, regardless of the state of the voxels behind it.
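The family d_k of Eqs. (17)-(18) can be sketched in a few lines; the P_d default and the label encoding are illustrative:

```python
def d(k, labels, P_d=0.9):
    """Family d_k of Eqs. (17)-(18): probability P_d on label k and the
    remaining mass spread evenly over the other |L| - 1 labels, so the
    distribution sums to 1 by construction."""
    other = (1.0 - P_d) / (len(labels) - 1)
    return {s: (P_d if s == k else other) for s in labels}
```

For instance, with L = {∅, 1, 2, U}, d_1 puts 0.9 on label 1 and (1 − 0.9)/3 on each of ∅, 2 and U.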
Equation (18) expresses the particular case where no occupied voxel lies on the viewing line, the only case where the state of S should be background: d_∅(S) ensures that I_i is mostly drawn from the background appearance model.

Dynamic Object Inference

Estimating the occupancy at voxel X translates to estimating p(G | I_{1:n}, C_{1:n}^{1:m}) in Bayesian terms. We apply Bayes' rule using the joint probability distribution, marginalizing out the unobserved variables G_{1:n}^{1:m}:

  p(G | I_{1:n}, C_{1:n}^{1:m}) = (1/z) Σ_{G_{1:n}^{1:m}} p(G, G_{1:n}^{1:m}, I_{1:n}, C_{1:n}^{1:m})   (19)
    = (1/z) p(G) Π_{i=1}^{n} f_i^1,   (20)

where

  f_i^k = Σ_{G_i^{v_k}} p(G_i^{v_k} | G) f_i^{k+1}  for k < m,   (21)
  f_i^m = Σ_{G_i^{v_m}} p(G_i^{v_m} | G) p(I_i | G, G_i^{1:m}, C_i^{1:m}).   (22)
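The backward recursion of Eqs. (20)-(22), with the factorization of Eq. (23) below, can be sketched per view as follows. For brevity the sketch ignores X's own rank on the viewing line and treats the fully-empty base case as the background color likelihood; all names are illustrative assumptions:

```python
def view_factor(p_obstructors, color_lik, empty='empty'):
    """Per-view factor f_i^1 of Eq. (20), computed back-to-front.

    p_obstructors: list over visibility rank k of dicts p(G_i^{v_k} = l | G),
                   already conditioned on the hypothesized label of X.
    color_lik:     dict label -> P_l(I_i), the pixel color likelihood under
                   each label's appearance model (background under `empty`).
    """
    # Base case: every obstructor empty -> pixel explained by the background.
    f = color_lik[empty]
    # Eq. (23): f^k = p(empty) * f^{k+1} + sum_{l != empty} p(l) * P_l(I_i).
    # Occupied branches terminate the recursion: everything behind the first
    # occupied voxel marginalizes to 1 and factors out.
    for p_k in reversed(p_obstructors):
        occupied = sum(p_k[l] * color_lik[l] for l in p_k if l != empty)
        f = p_k[empty] * f + occupied
    return f
```

This turns a sum over all label combinations on the viewing line into a linear pass over the m dominant obstructors.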

Similarly to (10), the normalization constant z is obtained by ensuring that the distribution of (19) sums to 1: z = Σ_{G, G_{1:n}^{1:m}} p(G, G_{1:n}^{1:m}, I_{1:n}, C_{1:n}^{1:m}). The sum in this form is intractable, thus we factorize the sum in (20). The sequence of m functions f_i^k specifies how to recursively compute the marginalization, with the sums over the individual G_i^{v_k} variables appropriately subsumed, so as to factor out terms not required at each level of the sum. Because of the particular form of the silhouette terms in (17), this sum can be efficiently computed, given that all terms after a first occupied voxel of the same visibility rank k share a term of identical value p(I_i | ..., {G_i^{v_k} = l}, ...) = P_l(I_i). They can be factored out of the remaining sum, which sums to 1, being a sum of terms of a probability distribution, leading to the following simplification of (21), for k ∈ {1, ..., m − 1}:

  f_i^k = p(G_i^{v_k} = ∅ | G) f_i^{k+1} + Σ_{l ≠ ∅} p(G_i^{v_k} = l | G) P_l(I_i).   (23)

4 Automatic Learning and Tracking

We have presented in Sect. 3 a generic framework to infer the occupancy probability of a voxel X, and thus deduce how likely it is for X to belong to one of m objects. Some additional work is required to use it to model objects in practice. The formulation explains how to compute the occupancy of X if some occupancy information about the viewing lines is already known. Thus the algorithm needs to be initialized with a coarse shape estimate, whose computation is discussed in Sect. 4.1. Intuitively, object shape estimation and tracking are complementary and mutually helpful tasks. We explain in Sect. 4.2 how object localization information is computed and used in the modeling. To be fully automatic, our method uses the inference label U to detect objects not yet assigned to a given label and learn their appearance models (Sect. 4.3). Finally, static occluder computation can easily be integrated in the system and help the inference be robust to static occluders (Sect. 4.4).
The algorithm at every time instant is summarized in Algorithm 1.

Algorithm 1 Dynamic Scene Reconstruction
Input: frames at a new time instant for all views
Output: 3D object shapes in the scene
  Coarse Inference;
  if a new object enters the scene then
    add a label for the new object;
    initialize its foreground appearance model;
    go back to Coarse Inference;
  end if
  Refined Inference;
  static occluder inference;
  update object locations and priors;
  return

4.1 Shape Initialization and Refinement

The proposed formulation relies on some prior knowledge about the scene occupancies and the dynamic object ordering. Thus part of the occupancy problem must be solved to bootstrap the algorithm. Fortunately, using multi-label silhouette inference with no prior knowledge about occupancies or consideration for inter-object occlusions provides a decent initial m-occupancy estimate. This simpler inference case can easily be formulated by simplifying the occlusion-related variables out of (20):

  p(G | I_{1:n}, C_{1:n}^{1:m}) = (1/z) p(G) Π_{i=1}^{n} p(I_i | G, C_i^{1:m}).   (24)

This initial coarse inference can then be used to perform a second, refined inference, this time accounting for viewing line obstructions, given the voxel priors p(G) and P(G_i^j) of (14) computed from the coarse inference. The prior p(G) is then used to introduce soft constraints into the inference. This is possible by using the coarse inference result as the input of a simple localization scheme, and using the localization information in p(G) to enforce a compactness prior over the m objects, as discussed in Sect. 4.2.

4.2 Object Localization

We use a localization prior to enforce the compactness of objects in the inference steps. For the particular case where walking people represent the dynamic objects, we take advantage of the underlying structure of the dataset, by projecting the maximum probability over each vertical voxel column onto the horizontal reference plane. We then localize the most likely position of each object by sliding a fixed-size window over the resulting 2D probability map for that object. The resulting center is subsequently used to initialize p(G), using a cylindrical spatial prior.
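The column-projection and sliding-window localization just described can be sketched as follows; the data layout (a nested [x][y][z] probability volume) and the window size are illustrative assumptions:

```python
def localize(prob_volume, win=5):
    """Sliding-window localization sketch: project the maximum probability
    of each vertical voxel column onto the ground plane, then return the
    (x, y) corner of the fixed-size window with the largest summed
    probability. prob_volume is a nested list indexed as [x][y][z]."""
    # Ground-plane map of per-column maxima.
    gmap = [[max(col) for col in row] for row in prob_volume]
    X, Y = len(gmap), len(gmap[0])
    best, best_xy = -1.0, (0, 0)
    for x in range(X - win + 1):
        for y in range(Y - win + 1):
            s = sum(gmap[x + dx][y + dy]
                    for dx in range(win) for dy in range(win))
            if s > best:
                best, best_xy = s, (x, y)
    return best_xy
```

The window maximizing the summed map would then seed the cylindrical spatial prior for p(G) centered on that location.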
This favors objects localized in one and only one portion of the scene, and is intended as a soft guide for the inference. Although simple, this tracking scheme is shown to outperform state-of-the-art methods in the results section, thanks to the rich shape and occlusion information modeled.

4.3 Automatic Detection of New Objects

The main information about objects used by the proposed method is their set of appearances in the different views.

These sets can be learned offline, by segmenting each observed object alone in a clear, uncluttered scene before processing multi-object scenes. More generally, we can initialize object color models in the scene automatically. To detect new objects, we compute U's object location and volume size during the coarse inference, and track the unknown volume just like the other objects, as described in Sect. 4.2. A new dynamic object inference label is created (and m incremented) if all of the following criteria are satisfied:

- the entrance occurs only at the scene boundaries;
- U's volume size is larger than a threshold;
- U is not too close to the scene boundary;
- subsequent updates of U's track are bounded.

To build the color model of the new object, we project the maximum voxel probability along the viewing rays into the camera view, threshold the image to form a silhouette mask, and choose pixels within the mask as training samples for a GMM appearance model. Samples are only collected from unoccluded silhouette portions of the object, which can be verified from the inference. Because the cameras may be badly color-calibrated, we train an appearance model for each camera view separately. This approach is fully evaluated in Sect. 5.2.

4.4 Occluder Computation

The static occluder computation can easily be integrated with the multiple dynamic object reconstruction described in Sect. 3.2. At every time instant, the dominant occupancy probabilities of the m objects are already extracted; the two dominant occupancies in front of and behind the current voxel X can be used in the occluder occupancy inference formulation of Sect. 3.1. It could be thought that the multi-label dynamic object inference discussed in this section is merely an extension of the single dynamic object case assumed earlier. In fact, the occluder occupancy inference does benefit from the disambiguation inherent to multi-silhouette reasoning, as the real-world experiment shows, in Fig.
16, in Sect. 5.2.

5 Results and Evaluation

5.1 Occlusion Inference Results

To demonstrate the validity of the static occluder shape recovery, we mainly use a single person as the dynamic object in the scene. In the next section, we also show that occluders can be recovered in the presence of multiple dynamic objects. We show three sequences: the PILLARS and SCULPTURE sequences, acquired outdoors, and the CHAIR sequence, acquired indoors with combined artificial and natural light from large bay windows. In all sequences, nine DV cameras surround the scene of interest, and background models are learned in the absence of moving objects. A single person, as our dynamic object, walks around and through the occluder in each scene. The shape of the person is estimated at each considered time step and used as a prior for occlusion inference. The data is used to compute an estimate of the occluder shape using (10). Results are presented in Fig. 8. The nine geometrically calibrated cameras all record at 30 fps. Color calibration is unnecessary because the model uses silhouette information only. The background model is learned per view using a single RGB Gaussian color model per pixel over a set of training images. Although simple, this model proves sufficient, even in outdoor sequences subject to background motion, foreground object shadows, and substantial illumination changes, illustrating the strong robustness of the method to difficult real conditions. The method copes well with background misclassifications that do not lead to large, coherent false-positive dynamic object estimations: pedestrians are routinely seen in the background of the SCULPTURE and PILLARS sequences (e.g. Fig. 8(a1)), without any significant corruption of the inference. Adjacent frames in the input videos contain largely redundant information for occluder modeling, thus the videos can safely be subsampled.
PILLARS was processed using 50% of the frames (1053 frames processed), SCULPTURE and CHAIR with 10% (160 and 168 processed frames, respectively).

Online Computation Results

All experiments can be computed using incremental inference updates. Figure 9 depicts the inference's progression, using the sensor fusion formulation alone or in combination with the reliability criterion. For the purpose of this experiment, we used the PILLARS sequence, manually segmented the occluder in each view for a ground truth comparison, and focused on a subregion of the scene in which the expected behaviors are well isolated. Figure 9 shows that both schemes converge reasonably close to the visual hull of the considered pillar. In scenes with concave parts accessible to dynamic objects, the estimation would carve into concavities and reach a better estimate than the occluder's visual hull. A somewhat larger volume is reached with both schemes in this example. This is attributable to calibration errors, which over-tighten the visual hull with respect to the true silhouettes, and to an accumulation of errors in both schemes toward the end of the sequence. We trace those to the redundant, periodical poses contained in the video, which sustain consistent noise. This suggests the existence of an optimal finite number of frames to be used for processing. Jolts can be observed in both volumes, corresponding to instants where the person walks behind the pillar, thereby adding positive contributions to the inference. The use of the reliability criterion defined above contributes to lower sensitivity

Fig. 8 Occluder shape retrieval results. Sequences: (a) PILLARS, (b) SCULPTURE, (c) CHAIR. (1) Scene overview. Note the harsh light and difficult backgrounds of (a) and (b), and the specularity of the sculpture, causing no significant modeling failure. (2-3) Occluder inference according to (10). Blue: neutral regions (prior P_o); red: high-probability regions. Brighter/clear regions indicate the inferred absence of occluders. Fine levels of detail are modeled, sometimes lost mostly to calibration. In (a) the structure's steps are also detected. (4) Same inference with additional exclusion of zones with reliability under 0.8. Peripheral noise and marginally observed regions are eliminated. The protruding background shape in (c3) accounts for an actual occlusion from a single view: the pillar visible in (c3).

Fig. 9 Online inference analysis and ground-truth visual hull comparison, using the PILLARS dataset, focusing on a slice including the middle pillar. (a) Frames 109, 400 and 1053, inferred using (10). (b) Same frames, this time excluding zones with reliability under 0.8 (reverted here to 0.5). (c) Number of voxels compared to the ground-truth visual hull across time.

to noise, as well as to a permanently conservative estimate of the occluder volume, as the curves show. Raw inference (10) momentarily yields large hypothetical occluder volumes when the data is biased toward contributions from an abnormally low subset of views (frame 109).

Fig. 10 (a) Person shape estimate from the PILLARS sequence, as occluded by the rightmost pillar and computed without accounting for occlusion. (b) Same situation accounting for occlusion, showing better completeness of the estimate. (c) Volume plot in both cases. Accounting for occlusion leads to more stable estimates across time, decreases false positives and overestimates due to shadows cast on occluders (I), and increases estimation probabilities in case of occlusion (II).

Accounting for Occlusion in SfS

Our formulation can be used to account for the accumulated occluder information in dynamic shape inference. We only use occlusion cues from reliable voxels (R > 0.8), to minimize false-positive occluder estimates, whose excessive presence would lead to sustained errors. While in many cases the original dynamic object formulation (Franco and Boyer 2005) performs robustly, a number of situations benefit from the additional occlusion knowledge (Fig. 10). Person volume estimates can be obtained when accounting for occluders. These estimates appear on average to be a stable multiple of the real volume of the person, which depends mainly on the camera configuration. This suggests a possible biometrics application of the method, for the disambiguation of person recognition based on computed volumes.

5.2 Multi-Object Shape Inference Results

We have used four multi-view sequences to validate multi-object shape inference; see Table 2. Eight 30 fps DV cameras surrounding the scene in a semi-circle were used for the CLUSTER and BENCH sequences. The LAB sequence is provided by (Gupta et al. 2007), and SCULPTURE was used to reconstruct the static sculpture (Fig. 8(b)) in the previous section. Here, we show the result of multiple persons walking in the scene together with the reconstructed sculpture.

Table 2  Data sequences used for multi-object shape inference

  Sequence             | No. of Cam. | No. of Dynamic Obj. | Occluder
  CLUSTER (outdoor)    | 8           | 5                   | no
  BENCH (outdoor)      | 8           | 3                   | yes
  LAB (indoor)         | 15          | 4                   | no
  SCULPTURE (outdoor)  | 9           | 2                   | yes

Cameras in each data sequence are geometrically calibrated but not color calibrated. The background model is learned per view using a single Gaussian color model at every pixel over a set of training images. Although simple, this model proves sufficient, even for outdoor sequences subject to background motion, foreground object shadows, window reflections and substantial illumination changes, showing the robustness of the method in difficult real conditions. For the dynamic object appearance models of the CLUSTER, LAB and SCULPTURE datasets, we train offline a per-view RGB GMM model for each person, using manually segmented foreground images. For the BENCH sequence, however, appearance models are initialized online automatically, using the method described in Sect. 4.3.

Appearance Modeling Validation

It is extremely hard to color-calibrate a large number of cameras, not to mention under varying lighting conditions, as in a natural outdoor environment. To show this, we compare different appearance modeling schemes in Fig. 11, for a frame of the outdoor BENCH dataset. Without loss of generality, we use GMMs. The first two rows compare silhouette extraction probabilities using the color models of spatially

neighboring views. These indicate that stereo approaches which heavily depend on color correspondence across views are very likely to fail in natural scenarios, especially when the cameras exhibit dramatic color variations, as in views 4 and 5. The global appearance model in row 3 performs better than rows 1 and 2, but this is mainly due to its compensation for large color variations across camera views, which at the same time decreases the model's discriminability. The last row is clearly the winner: a color appearance model is independently maintained for every camera view. We therefore use the last scheme in our system. Once the model is trained, we do not update it as time goes by, although online updating of the appearance models could be an easy extension for robustness. One more thing to note is that in our approach, even though an object's appearances are learnt for each view separately, they are still linked together in 3D by the same object label. In this sense, our per-view appearances can be seen as an intermediate model between the global models used in Shape from Photo-consistency and multi-view stereo, and the pure 2D image models used in the video surveillance and tracking literature.

Densely Populated Scene

Fig. 11 Appearance model analysis. A person in eight views is displayed in row 4. A GMM model C_i is trained for each view i ∈ [1, 8]. A global GMM model C_0 over all views is also trained. Rows 1, 2, 3 and 5 compute P(S | I, B, C_{i+1}), P(S | I, B, C_{i-1}), P(S | I, B, C_0) and P(S | I, B, C_i) for view i, respectively, with S the foreground label, I the pixel color, and B the uniform background model. The probability is displayed according to the color scheme at the top right corner. The average probability over all pixels in the silhouette region and the mean color modes of the applied GMM model are shown for each figure.

Fig. 12 Result from the 8-view CLUSTER dataset. (a) Two views at frame 0. (b) Respective 2-labeled reconstruction.
(c) More accurate shape estimation using our algorithm.

The CLUSTER sequence is a particularly challenging configuration: five people stand on a circle of less than 3 m in diameter, yielding an extremely ambiguous and occluded situation at the circle center. Despite the fact that none of them is observed in all views, we are still able to recover each person's label and shape. Images and results are shown in Fig. 12. The naive 2-label reconstruction (probabilistic visual hull) yields large volumes with little separation between objects, because the entire scene configuration is too ambiguous. Adding the tracking prior information estimates the most probable compact regions and eliminates large errors, at the expense of dilation and lower precision. Accounting for viewing line occlusions enables the model to recover more detailed information, such as the limbs. The LAB sequence (Gupta et al. 2007), with poor image contrast, is also processed. The reconstruction result from all 15 cameras is shown in Fig. 13. Moreover, in order to evaluate our localization prior estimation, we compare our tracking method (Sect. 4.2) with the ground truth data and the results of (Gupta et al. 2007) and (Mittal and Davis 2003). We use exactly the same eight cameras as in (Mittal and Davis 2003)

for the comparison, as shown in Fig. 13(b). Our method is generally more robust in tracking, and also builds 3D shape information. Most existing tracking methods only focus on a tracking envelope and do not compute precise 3D shapes. It is this shape information that enables our method to achieve comparable or better precision.

Fig. 13 LAB dataset result from (Gupta et al. 2007). (a) 3D reconstruction with 15 views at frame 199. (b) 8-view tracking result comparison with the methods of (Gupta et al. 2007; Mittal and Davis 2003) and the ground-truth data. The mean error of the ground-plane estimate, in mm, is plotted.

Fig. 14 Automatic appearance model initialization with the BENCH sequence. The volume of U increases when a new person enters the scene. When an appearance model is learned, a new label is initialized. During the sequence, the L1 and L2 volumes drop to near-zero values because the corresponding persons walk out of the scene on those occasions.

Fig. 15 BENCH result. Person numbers are assigned according to the order in which their appearance models are initialized. At frame 329, P3 is entering the scene. Since it is P3's first time in the scene, he is captured by label U (gray color). P1 is out of the scene at that moment. At frame 359, P1 has re-entered the scene, and P3 has his GMM model already trained and label L3 assigned. The bench, as a static occluder, is being recovered.

Automatic Appearance Model Initialization

Automatic dynamic-object appearance model initialization has been tested using the BENCH sequence. Three people walk into the empty scene one after another. By examining the unidentified label U, object appearance models are initialized and used for shape estimation in subsequent frames. The volume-size evolution of all labels is shown in Fig. 14, and the reconstructions at two time instants are shown in Fig. 15. During the sequence, U has three major volume peaks due to the three new persons entering the scene. Some smaller perturbations are due to shadows on the bench or the ground.
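The initialization just described — learn a per-view color model for each new object, then link the per-view models under one 3D label — can be sketched as follows. This is a simplified stand-in, not the paper's implementation: a single Gaussian per color channel replaces the per-view GMM, the background model is uniform as in Fig. 11, and all names and constants are illustrative.

```python
import math

class ViewAppearanceModel:
    """Per-view color appearance model, one per camera view.

    The paper trains a GMM per view; as a simplified stand-in, this
    sketch fits a single Gaussian per color channel (a one-mode "GMM").
    """

    def __init__(self, pixels):
        # pixels: list of (r, g, b) foreground samples from one view.
        n = len(pixels)
        self.mean = [sum(p[c] for p in pixels) / n for c in range(3)]
        self.var = [
            max(sum((p[c] - self.mean[c]) ** 2 for p in pixels) / n, 1e-3)
            for c in range(3)
        ]

    def likelihood(self, pixel):
        # Product of per-channel Gaussian densities.
        l = 1.0
        for c in range(3):
            d = pixel[c] - self.mean[c]
            l *= math.exp(-0.5 * d * d / self.var[c]) / math.sqrt(
                2 * math.pi * self.var[c]
            )
        return l

def foreground_probability(pixel, model, background_density=1.0 / 255 ** 3):
    """P(S | I, B, C_i): Bayes with a uniform background model B and
    equal priors on foreground and background (an assumption here)."""
    fg = model.likelihood(pixel)
    return fg / (fg + background_density)

# Toy example: a reddish person model learned in one view.
model = ViewAppearanceModel([(200, 40, 40), (210, 50, 45), (190, 45, 35)])
p_fg = foreground_probability((205, 45, 40), model)  # near the learned mode
p_bg = foreground_probability((10, 200, 10), model)  # far from it
```

In the full system, one such model per camera view would be trained from the pixels that label U accumulates, and the per-view scores would then be fused into voxel occupancy.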
Besides automatic object appearance model initialization, the system robustly re-detects and tracks a person who leaves and re-enters the scene. This is because once a label is initialized, it is evaluated at every time instant, even if the person is out of the scene. The algorithm could easily be improved to handle leaving/re-entering labels transparently.
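The label lifecycle above — newcomers are first captured by the unidentified label U, a new label is trained when U's volume peaks, and every trained label keeps being evaluated so re-entering persons are re-detected — might be sketched roughly as follows. Volumes are abstracted to scalars per person, and the threshold and names (U, P1, P2) are illustrative, not the paper's parameters.

```python
# Minimal sketch of the label lifecycle (illustrative constants only).
U_THRESHOLD = 50.0  # unexplained-voxel volume that triggers a new label

def step(frame_volumes, labels):
    """frame_volumes maps each person present in the scene to the voxel
    volume their appearance explains; persons without a trained model
    contribute to the unidentified label U instead."""
    u_volume = 0.0
    observed = {}
    for person, volume in frame_volumes.items():
        if person in labels:
            observed[person] = volume      # existing label re-evaluated
        else:
            u_volume += volume             # captured by U for now
    if u_volume > U_THRESHOLD:
        # Train an appearance model for the newcomer and assign a label.
        for person in frame_volumes:
            if person not in labels:
                labels[person] = len(labels) + 1
    return observed, u_volume

labels = {}
# Frame A: first person enters -> captured by U, then label 1 trained.
step({"P1": 80.0}, labels)
# Frame B: P1 leaves (volume ~0), P2 enters -> label 2 trained.
step({"P2": 90.0}, labels)
# Frame C: P1 re-enters; its old label is still evaluated every frame,
# so it is re-detected without training a new model.
observed, u = step({"P1": 75.0, "P2": 60.0}, labels)
```

The key design point matches the text: labels, once created, are never discarded, so re-entry costs nothing beyond the per-frame evaluation already being performed.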

Fig. 16 SCULPTURE dataset comparison. The middle column shows the reconstruction with a single foreground label. The right column shows the reconstruction with a label for each person. This figure shows that, by resolving inter-occlusion ambiguities, both the static occluder and the dynamic objects achieve better quality.

Dynamic Object and Occluder Inference

The BENCH sequence demonstrates the power of our automatic appearance model initialization, as well as the integrated occluder inference of the bench, as shown in Fig. 15 between frames 329 and 359. Figure 14 illustrates the status of our scene tracking and modeling across time. We also compute results for the SCULPTURE sequence, with two persons walking in the scene, as shown in Fig. 16. For the dynamic objects, we manage to obtain much cleaner shapes when the two persons are close to each other, and more detailed shapes such as extended arms. For the occluder, thanks to the multiple foreground modes and the consideration of inter-occlusion between the dynamic objects in the scene, we are able to recover its fine shape as well. The occluder inference would otherwise be perturbed by dynamic-shape overestimations.

6 Discussion

6.1 Dynamic Object and Static Occluder Comparison

We have presented the probabilistic models and real datasets for static and dynamic shape inference. Although both types of entities are computed only from silhouette cues from the camera views, and both require the consideration of occlusions, they have fundamentally different characteristics.

First of all, there is no way to learn an appearance model for a static occluder beforehand, because its appearance is initially embedded in the background model of a given view. Only when an occlusion event happens between a dynamic object and the occluder can we detect that a certain appearance belongs to the occluder rather than the background, and that the occluder probability should increase along that viewing direction.
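The evidence accumulation described here could be sketched, under strong simplifying assumptions, as a per-voxel log-odds update along a single viewing ray: when a view that should see a dynamic object misses it in the silhouette, occluder evidence grows between the camera and the object, and because the occluder is static this evidence accumulates over time instants. The update constants below are illustrative, not derived from the paper's sensor model.

```python
import math

OCCLUSION_EVIDENCE = 1.0   # log-odds boost when an occlusion event is seen
FREE_EVIDENCE = -0.5       # log-odds decay when the view sees the object

def update_ray(occluder_logodds, object_voxel, silhouette_detected):
    """If a dynamic object is believed present at object_voxel but the
    view's silhouette misses it, every voxel between the camera (index 0)
    and the object gains occluder evidence; otherwise that space is
    likely free."""
    delta = FREE_EVIDENCE if silhouette_detected else OCCLUSION_EVIDENCE
    for v in range(object_voxel):
        occluder_logodds[v] += delta

def probability(logodds):
    return 1.0 / (1.0 + math.exp(-logodds))

ray = [0.0] * 10           # prior log-odds 0 => P(occluder) = 0.5
# The person stands at voxel 7; this camera repeatedly fails to see them,
# so evidence for an occluder in front accumulates across time instants.
for _ in range(4):
    update_ray(ray, 7, silhouette_detected=False)
p_front = probability(ray[3])   # between camera and person
p_behind = probability(ray[8])  # behind the person: untouched prior
```

Note how the static-occluder property shows up: the same ray segment is updated at every time instant, so its probability only grows as occlusion events recur, whereas voxels behind the person retain their prior.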
For dynamic objects, in contrast, as we have mentioned and will show in more detail in the next section, appearance models for all camera views can be learnt manually or automatically before reconstruction.

Secondly, because occluders are static, regions of the 3D scene that have been recovered as highly probable occluder locations will always maintain high probabilities, noise aside. This enables the accumulation of static-occluder evidence in our algorithm. Inter-occlusion between dynamic objects, on the other hand, is a single-time-instant event. This difference is reflected in the inference formulae of the static occluder and the dynamic objects.

Thirdly, a recovered dynamic object can be thought of as a probabilistic visual hull representation, because it is ultimately a fusion of silhouette information, based on (Franco and Boyer 2005). The static occluder that we recover, however, is not a visual hull representation. In fact, it is an entity that is carved out using the moving visual hulls of the dynamic objects, as shown in Fig. 2. Therefore, our estimated occluder shape can retain some concavities, as long as a dynamic object can move into the concave regions and be witnessed by the camera views.

Finally, the computed static occluder shape is in a probabilistic form. Its counterpart in a deterministic representation is given in Fig. 2. Since it is formed by carving away dynamic shapes, it has some unique properties that differ from those of the traditional visual hull. Consider a dynamic shape D with infinitesimal volume. We define the occluder hull as the approximating volume of a static occluder, recovered with an infinitesimal dynamic shape D moving randomly through all accessible parts of the scene. It can be shown that the occluder hull is the region of space visible to at most one camera, including the inside of the actual occluder shape. In Fig. 17(a) and (b), the thick black lines delineate the occluder hull. In comparison, in Fig. 17(c) and (d), the

Fig. 17 2D theoretical occluder hull and visual hull. (a) 3-camera occluder hull; (b) 4-camera occluder hull; (c) 3-camera visual hull; (d) 4-camera visual hull. Concavities can be recovered by the occluder hull.

thick black lines delineate the visual hull of the occluder, assuming the silhouettes of the objects are known. The visual hull is the intersection of all the silhouettes' visual cones. Figure 17 shows that, contrary to the visual hull, the occluder hull can recover concavities. In fact, when cameras are distributed all over space, the actual shape of an arbitrary static occluder can be recovered, and a finite number of cameras may be sufficient to recover arbitrary occluder shapes in certain cases. However, the occluder hull shape is highly dependent on the camera placement. As (a) shows, the occluder hull may not even be closed. For the occluder hull, there is no lower bound on the number of cameras that guarantees a closed shape: although the occluder hull in (b) is closed, if the fourth camera changes its orientation or position, the occluder hull may be open again. In contrast, just two silhouettes from different views guarantee a closed visual hull, which is why two is the minimum number of cameras required for a visual hull.

Given the above analysis, some empirical requirements for good-quality occluder estimation are summarized as follows.

No number of cameras guarantees a closed occluder shape. However, when the size of the occluder is small relative to the camera focal length, or the occluder is positioned so far from the cameras that the region where only one camera can see the dynamic shape is limited, a closed occluder shape can usually be recovered with the proposed algorithm.

For a region behind the occluder that no camera view has sampled, the algorithm cannot infer any information. For example, the algorithm does not recover a wall if a person hides completely behind it; in that case, the person's occupancy is not recovered in the first place. One solution may be to add more camera views behind the wall.
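The visibility-counting characterization of the occluder hull — the region visible to at most one camera, with line-of-sight computed against the true occluder — can be illustrated on a 2D grid. The grid size, camera placement, and square occluder below are arbitrary choices for the sketch, and visibility is tested by naive segment sampling rather than exact ray casting.

```python
# 2D toy: cells seen by at most one camera form the occluder hull.
GRID = 12
OCCLUDER = {(x, y) for x in range(5, 8) for y in range(5, 8)}  # true shape
CAMERAS = [(0.5, 0.5), (11.5, 0.5), (0.5, 11.5), (11.5, 11.5)]

def visible(cam, cell):
    """Sample the segment from the camera to the cell center and report
    whether it avoids the occluder's interior (excluding the cell itself)."""
    cx, cy = cell[0] + 0.5, cell[1] + 0.5
    steps = 100
    for i in range(1, steps):
        t = i / steps
        x = cam[0] + t * (cx - cam[0])
        y = cam[1] + t * (cy - cam[1])
        hit = (int(x), int(y))
        if hit in OCCLUDER and hit != cell:
            return False
    return True

def occluder_hull():
    hull = set()
    for x in range(GRID):
        for y in range(GRID):
            seen_by = sum(visible(c, (x, y)) for c in CAMERAS)
            if seen_by <= 1:           # visible to at most one camera
                hull.add((x, y))
    return hull

hull = occluder_hull()
```

Consistent with the text, the hull contains the occluder's interior (no camera sees through it), while open cells near the cameras, seen from two or more views, are excluded.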
Since a closed dynamic shape (requiring at least two camera views) is needed by the algorithm, plus an occluded view for the occlusion event, in theory the minimum requirement for occluder inference is three cameras.

6.2 Computation Complexity and Acceleration

The occluder occupancy computation was tested on a 2.8 GHz PC at approximately 1 minute per frame. The strong locality inherent to the algorithm and preliminary benchmarks suggested that real-time performance could be achieved using a GPU implementation. We chose the NVIDIA CUDA pipeline, yielding a 15x speedup for the complete algorithm. The dynamic shape computation alone reaches a speedup of more than 80 times and a speed of 0.2 seconds per frame on the test machine, which is already satisfactory for real-time applications. Although it has gained a reasonable speedup, the static occluder computation does not yet achieve interactive frame rates (0.9 to 3.15 seconds per time instant), due to the high computational cost of finding the dynamic components in

Improvement of Spatial Resolution Using BlockMatching Based Motion Estimation and Frame. Integration

Improvement of Spatial Resolution Using BlockMatching Based Motion Estimation and Frame. Integration Improvement of Spatal Resoluton Usng BlockMatchng Based Moton Estmaton and Frame Integraton Danya Suga and Takayuk Hamamoto Graduate School of Engneerng, Tokyo Unversty of Scence, 6-3-1, Nuku, Katsuska-ku,

More information

TN348: Openlab Module - Colocalization

TN348: Openlab Module - Colocalization TN348: Openlab Module - Colocalzaton Topc The Colocalzaton module provdes the faclty to vsualze and quantfy colocalzaton between pars of mages. The Colocalzaton wndow contans a prevew of the two mages

More information

SLAM Summer School 2006 Practical 2: SLAM using Monocular Vision

SLAM Summer School 2006 Practical 2: SLAM using Monocular Vision SLAM Summer School 2006 Practcal 2: SLAM usng Monocular Vson Javer Cvera, Unversty of Zaragoza Andrew J. Davson, Imperal College London J.M.M Montel, Unversty of Zaragoza. josemar@unzar.es, jcvera@unzar.es,

More information

R s s f. m y s. SPH3UW Unit 7.3 Spherical Concave Mirrors Page 1 of 12. Notes

R s s f. m y s. SPH3UW Unit 7.3 Spherical Concave Mirrors Page 1 of 12. Notes SPH3UW Unt 7.3 Sphercal Concave Mrrors Page 1 of 1 Notes Physcs Tool box Concave Mrror If the reflectng surface takes place on the nner surface of the sphercal shape so that the centre of the mrror bulges

More information

An Entropy-Based Approach to Integrated Information Needs Assessment

An Entropy-Based Approach to Integrated Information Needs Assessment Dstrbuton Statement A: Approved for publc release; dstrbuton s unlmted. An Entropy-Based Approach to ntegrated nformaton Needs Assessment June 8, 2004 Wllam J. Farrell Lockheed Martn Advanced Technology

More information

Reducing Frame Rate for Object Tracking

Reducing Frame Rate for Object Tracking Reducng Frame Rate for Object Trackng Pavel Korshunov 1 and We Tsang Oo 2 1 Natonal Unversty of Sngapore, Sngapore 11977, pavelkor@comp.nus.edu.sg 2 Natonal Unversty of Sngapore, Sngapore 11977, oowt@comp.nus.edu.sg

More information

Compiler Design. Spring Register Allocation. Sample Exercises and Solutions. Prof. Pedro C. Diniz

Compiler Design. Spring Register Allocation. Sample Exercises and Solutions. Prof. Pedro C. Diniz Compler Desgn Sprng 2014 Regster Allocaton Sample Exercses and Solutons Prof. Pedro C. Dnz USC / Informaton Scences Insttute 4676 Admralty Way, Sute 1001 Marna del Rey, Calforna 90292 pedro@s.edu Regster

More information

A Binarization Algorithm specialized on Document Images and Photos

A Binarization Algorithm specialized on Document Images and Photos A Bnarzaton Algorthm specalzed on Document mages and Photos Ergna Kavalleratou Dept. of nformaton and Communcaton Systems Engneerng Unversty of the Aegean kavalleratou@aegean.gr Abstract n ths paper, a

More information

FEATURE EXTRACTION. Dr. K.Vijayarekha. Associate Dean School of Electrical and Electronics Engineering SASTRA University, Thanjavur

FEATURE EXTRACTION. Dr. K.Vijayarekha. Associate Dean School of Electrical and Electronics Engineering SASTRA University, Thanjavur FEATURE EXTRACTION Dr. K.Vjayarekha Assocate Dean School of Electrcal and Electroncs Engneerng SASTRA Unversty, Thanjavur613 41 Jont Intatve of IITs and IISc Funded by MHRD Page 1 of 8 Table of Contents

More information

Subspace clustering. Clustering. Fundamental to all clustering techniques is the choice of distance measure between data points;

Subspace clustering. Clustering. Fundamental to all clustering techniques is the choice of distance measure between data points; Subspace clusterng Clusterng Fundamental to all clusterng technques s the choce of dstance measure between data ponts; D q ( ) ( ) 2 x x = x x, j k = 1 k jk Squared Eucldean dstance Assumpton: All features

More information

Multiple Frame Motion Inference Using Belief Propagation

Multiple Frame Motion Inference Using Belief Propagation Multple Frame Moton Inference Usng Belef Propagaton Jang Gao Janbo Sh The Robotcs Insttute Department of Computer and Informaton Scence Carnege Mellon Unversty Unversty of Pennsylvana Pttsburgh, PA 53

More information

Mathematics 256 a course in differential equations for engineering students

Mathematics 256 a course in differential equations for engineering students Mathematcs 56 a course n dfferental equatons for engneerng students Chapter 5. More effcent methods of numercal soluton Euler s method s qute neffcent. Because the error s essentally proportonal to the

More information

Structure from Motion

Structure from Motion Structure from Moton Structure from Moton For now, statc scene and movng camera Equvalentl, rgdl movng scene and statc camera Lmtng case of stereo wth man cameras Lmtng case of multvew camera calbraton

More information

CS 534: Computer Vision Model Fitting

CS 534: Computer Vision Model Fitting CS 534: Computer Vson Model Fttng Sprng 004 Ahmed Elgammal Dept of Computer Scence CS 534 Model Fttng - 1 Outlnes Model fttng s mportant Least-squares fttng Maxmum lkelhood estmaton MAP estmaton Robust

More information

Lecture 5: Multilayer Perceptrons

Lecture 5: Multilayer Perceptrons Lecture 5: Multlayer Perceptrons Roger Grosse 1 Introducton So far, we ve only talked about lnear models: lnear regresson and lnear bnary classfers. We noted that there are functons that can t be represented

More information

A Background Subtraction for a Vision-based User Interface *

A Background Subtraction for a Vision-based User Interface * A Background Subtracton for a Vson-based User Interface * Dongpyo Hong and Woontack Woo KJIST U-VR Lab. {dhon wwoo}@kjst.ac.kr Abstract In ths paper, we propose a robust and effcent background subtracton

More information

CMPS 10 Introduction to Computer Science Lecture Notes

CMPS 10 Introduction to Computer Science Lecture Notes CPS 0 Introducton to Computer Scence Lecture Notes Chapter : Algorthm Desgn How should we present algorthms? Natural languages lke Englsh, Spansh, or French whch are rch n nterpretaton and meanng are not

More information

Active Contours/Snakes

Active Contours/Snakes Actve Contours/Snakes Erkut Erdem Acknowledgement: The sldes are adapted from the sldes prepared by K. Grauman of Unversty of Texas at Austn Fttng: Edges vs. boundares Edges useful sgnal to ndcate occludng

More information

User Authentication Based On Behavioral Mouse Dynamics Biometrics

User Authentication Based On Behavioral Mouse Dynamics Biometrics User Authentcaton Based On Behavoral Mouse Dynamcs Bometrcs Chee-Hyung Yoon Danel Donghyun Km Department of Computer Scence Department of Computer Scence Stanford Unversty Stanford Unversty Stanford, CA

More information

A Fast Visual Tracking Algorithm Based on Circle Pixels Matching

A Fast Visual Tracking Algorithm Based on Circle Pixels Matching A Fast Vsual Trackng Algorthm Based on Crcle Pxels Matchng Zhqang Hou hou_zhq@sohu.com Chongzhao Han czhan@mal.xjtu.edu.cn Ln Zheng Abstract: A fast vsual trackng algorthm based on crcle pxels matchng

More information

Determining the Optimal Bandwidth Based on Multi-criterion Fusion

Determining the Optimal Bandwidth Based on Multi-criterion Fusion Proceedngs of 01 4th Internatonal Conference on Machne Learnng and Computng IPCSIT vol. 5 (01) (01) IACSIT Press, Sngapore Determnng the Optmal Bandwdth Based on Mult-crteron Fuson Ha-L Lang 1+, Xan-Mn

More information

Problem Set 3 Solutions

Problem Set 3 Solutions Introducton to Algorthms October 4, 2002 Massachusetts Insttute of Technology 6046J/18410J Professors Erk Demane and Shaf Goldwasser Handout 14 Problem Set 3 Solutons (Exercses were not to be turned n,

More information

A Unified Framework for Semantics and Feature Based Relevance Feedback in Image Retrieval Systems

A Unified Framework for Semantics and Feature Based Relevance Feedback in Image Retrieval Systems A Unfed Framework for Semantcs and Feature Based Relevance Feedback n Image Retreval Systems Ye Lu *, Chunhu Hu 2, Xngquan Zhu 3*, HongJang Zhang 2, Qang Yang * School of Computng Scence Smon Fraser Unversty

More information

Analysis of Continuous Beams in General

Analysis of Continuous Beams in General Analyss of Contnuous Beams n General Contnuous beams consdered here are prsmatc, rgdly connected to each beam segment and supported at varous ponts along the beam. onts are selected at ponts of support,

More information

An Efficient Background Updating Scheme for Real-time Traffic Monitoring

An Efficient Background Updating Scheme for Real-time Traffic Monitoring 2004 IEEE Intellgent Transportaton Systems Conference Washngton, D.C., USA, October 3-6, 2004 WeA1.3 An Effcent Background Updatng Scheme for Real-tme Traffc Montorng Suchendra M. Bhandarkar and Xngzh

More information

Wishing you all a Total Quality New Year!

Wishing you all a Total Quality New Year! Total Qualty Management and Sx Sgma Post Graduate Program 214-15 Sesson 4 Vnay Kumar Kalakband Assstant Professor Operatons & Systems Area 1 Wshng you all a Total Qualty New Year! Hope you acheve Sx sgma

More information

For instance, ; the five basic number-sets are increasingly more n A B & B A A = B (1)

For instance, ; the five basic number-sets are increasingly more n A B & B A A = B (1) Secton 1.2 Subsets and the Boolean operatons on sets If every element of the set A s an element of the set B, we say that A s a subset of B, or that A s contaned n B, or that B contans A, and we wrte A

More information

Fitting: Deformable contours April 26 th, 2018

Fitting: Deformable contours April 26 th, 2018 4/6/08 Fttng: Deformable contours Aprl 6 th, 08 Yong Jae Lee UC Davs Recap so far: Groupng and Fttng Goal: move from array of pxel values (or flter outputs) to a collecton of regons, objects, and shapes.

More information

Feature Reduction and Selection

Feature Reduction and Selection Feature Reducton and Selecton Dr. Shuang LIANG School of Software Engneerng TongJ Unversty Fall, 2012 Today s Topcs Introducton Problems of Dmensonalty Feature Reducton Statstc methods Prncpal Components

More information

An Image Fusion Approach Based on Segmentation Region

An Image Fusion Approach Based on Segmentation Region Rong Wang, L-Qun Gao, Shu Yang, Yu-Hua Cha, and Yan-Chun Lu An Image Fuson Approach Based On Segmentaton Regon An Image Fuson Approach Based on Segmentaton Regon Rong Wang, L-Qun Gao, Shu Yang 3, Yu-Hua

More information

An Optimal Algorithm for Prufer Codes *

An Optimal Algorithm for Prufer Codes * J. Software Engneerng & Applcatons, 2009, 2: 111-115 do:10.4236/jsea.2009.22016 Publshed Onlne July 2009 (www.scrp.org/journal/jsea) An Optmal Algorthm for Prufer Codes * Xaodong Wang 1, 2, Le Wang 3,

More information

Corner-Based Image Alignment using Pyramid Structure with Gradient Vector Similarity

Corner-Based Image Alignment using Pyramid Structure with Gradient Vector Similarity Journal of Sgnal and Informaton Processng, 013, 4, 114-119 do:10.436/jsp.013.43b00 Publshed Onlne August 013 (http://www.scrp.org/journal/jsp) Corner-Based Image Algnment usng Pyramd Structure wth Gradent

More information

Face Detection with Deep Learning

Face Detection with Deep Learning Face Detecton wth Deep Learnng Yu Shen Yus122@ucsd.edu A13227146 Kuan-We Chen kuc010@ucsd.edu A99045121 Yzhou Hao y3hao@ucsd.edu A98017773 Mn Hsuan Wu mhwu@ucsd.edu A92424998 Abstract The project here

More information

The Codesign Challenge

The Codesign Challenge ECE 4530 Codesgn Challenge Fall 2007 Hardware/Software Codesgn The Codesgn Challenge Objectves In the codesgn challenge, your task s to accelerate a gven software reference mplementaton as fast as possble.

More information

Dynamic Camera Assignment and Handoff

Dynamic Camera Assignment and Handoff 12 Dynamc Camera Assgnment and Handoff Br Bhanu and Ymng L 12.1 Introducton...338 12.2 Techncal Approach...339 12.2.1 Motvaton and Problem Formulaton...339 12.2.2 Game Theoretc Framework...339 12.2.2.1

More information

EECS 730 Introduction to Bioinformatics Sequence Alignment. Luke Huan Electrical Engineering and Computer Science

EECS 730 Introduction to Bioinformatics Sequence Alignment. Luke Huan Electrical Engineering and Computer Science EECS 730 Introducton to Bonformatcs Sequence Algnment Luke Huan Electrcal Engneerng and Computer Scence http://people.eecs.ku.edu/~huan/ HMM Π s a set of states Transton Probabltes a kl Pr( l 1 k Probablty

More information

Edge Detection in Noisy Images Using the Support Vector Machines

Edge Detection in Noisy Images Using the Support Vector Machines Edge Detecton n Nosy Images Usng the Support Vector Machnes Hlaro Gómez-Moreno, Saturnno Maldonado-Bascón, Francsco López-Ferreras Sgnal Theory and Communcatons Department. Unversty of Alcalá Crta. Madrd-Barcelona

More information

Multi-view 3D Position Estimation of Sports Players

Multi-view 3D Position Estimation of Sports Players Mult-vew 3D Poston Estmaton of Sports Players Robbe Vos and Wlle Brnk Appled Mathematcs Department of Mathematcal Scences Unversty of Stellenbosch, South Afrca Emal: vosrobbe@gmal.com Abstract The problem

More information

6.854 Advanced Algorithms Petar Maymounkov Problem Set 11 (November 23, 2005) With: Benjamin Rossman, Oren Weimann, and Pouya Kheradpour

6.854 Advanced Algorithms Petar Maymounkov Problem Set 11 (November 23, 2005) With: Benjamin Rossman, Oren Weimann, and Pouya Kheradpour 6.854 Advanced Algorthms Petar Maymounkov Problem Set 11 (November 23, 2005) Wth: Benjamn Rossman, Oren Wemann, and Pouya Kheradpour Problem 1. We reduce vertex cover to MAX-SAT wth weghts, such that the

More information

Cluster Analysis of Electrical Behavior

Cluster Analysis of Electrical Behavior Journal of Computer and Communcatons, 205, 3, 88-93 Publshed Onlne May 205 n ScRes. http://www.scrp.org/ournal/cc http://dx.do.org/0.4236/cc.205.350 Cluster Analyss of Electrcal Behavor Ln Lu Ln Lu, School

More information

Backpropagation: In Search of Performance Parameters

Backpropagation: In Search of Performance Parameters Bacpropagaton: In Search of Performance Parameters ANIL KUMAR ENUMULAPALLY, LINGGUO BU, and KHOSROW KAIKHAH, Ph.D. Computer Scence Department Texas State Unversty-San Marcos San Marcos, TX-78666 USA ae049@txstate.edu,

More information

Computer Vision I. Xbox Kinnect: Rectification. The Fundamental matrix. Stereo III. CSE252A Lecture 16. Example: forward motion

Computer Vision I. Xbox Kinnect: Rectification. The Fundamental matrix. Stereo III. CSE252A Lecture 16. Example: forward motion Xbox Knnect: Stereo III Depth map http://www.youtube.com/watch?v=7qrnwoo-8a CSE5A Lecture 6 Projected pattern http://www.youtube.com/watch?v=ceep7x-z4wy The Fundamental matrx Rectfcaton The eppolar constrant

More information

Face Tracking Using Motion-Guided Dynamic Template Matching

Face Tracking Using Motion-Guided Dynamic Template Matching ACCV2002: The 5th Asan Conference on Computer Vson, 23--25 January 2002, Melbourne, Australa. Face Trackng Usng Moton-Guded Dynamc Template Matchng Lang Wang, Tenu Tan, Wemng Hu atonal Laboratory of Pattern

More information

Hermite Splines in Lie Groups as Products of Geodesics

Hermite Splines in Lie Groups as Products of Geodesics Hermte Splnes n Le Groups as Products of Geodescs Ethan Eade Updated May 28, 2017 1 Introducton 1.1 Goal Ths document defnes a curve n the Le group G parametrzed by tme and by structural parameters n the

More information

S1 Note. Basis functions.

S1 Note. Basis functions. S1 Note. Bass functons. Contents Types of bass functons...1 The Fourer bass...2 B-splne bass...3 Power and type I error rates wth dfferent numbers of bass functons...4 Table S1. Smulaton results of type

More information

Parallelism for Nested Loops with Non-uniform and Flow Dependences

Parallelism for Nested Loops with Non-uniform and Flow Dependences Parallelsm for Nested Loops wth Non-unform and Flow Dependences Sam-Jn Jeong Dept. of Informaton & Communcaton Engneerng, Cheonan Unversty, 5, Anseo-dong, Cheonan, Chungnam, 330-80, Korea. seong@cheonan.ac.kr

More information

Smoothing Spline ANOVA for variable screening

Smoothing Spline ANOVA for variable screening Smoothng Splne ANOVA for varable screenng a useful tool for metamodels tranng and mult-objectve optmzaton L. Rcco, E. Rgon, A. Turco Outlne RSM Introducton Possble couplng Test case MOO MOO wth Game Theory

More information

3D vector computer graphics

3D vector computer graphics 3D vector computer graphcs Paolo Varagnolo: freelance engneer Padova Aprl 2016 Prvate Practce ----------------------------------- 1. Introducton Vector 3D model representaton n computer graphcs requres

More information

Support Vector Machines

Support Vector Machines /9/207 MIST.6060 Busness Intellgence and Data Mnng What are Support Vector Machnes? Support Vector Machnes Support Vector Machnes (SVMs) are supervsed learnng technques that analyze data and recognze patterns.

More information

Outline. Discriminative classifiers for image recognition. Where in the World? A nearest neighbor recognition example 4/14/2011. CS 376 Lecture 22 1

Outline. Discriminative classifiers for image recognition. Where in the World? A nearest neighbor recognition example 4/14/2011. CS 376 Lecture 22 1 4/14/011 Outlne Dscrmnatve classfers for mage recognton Wednesday, Aprl 13 Krsten Grauman UT-Austn Last tme: wndow-based generc obect detecton basc ppelne face detecton wth boostng as case study Today:

More information

A Gradient Difference based Technique for Video Text Detection

A Gradient Difference based Technique for Video Text Detection A Gradent Dfference based Technque for Vdeo Text Detecton Palaahnakote Shvakumara, Trung Quy Phan and Chew Lm Tan School of Computng, Natonal Unversty of Sngapore {shva, phanquyt, tancl }@comp.nus.edu.sg

More information

Adaptive Silhouette Extraction and Human Tracking in Dynamic. Environments 1

Adaptive Silhouette Extraction and Human Tracking in Dynamic. Environments 1 Adaptve Slhouette Extracton and Human Trackng n Dynamc Envronments 1 X Chen, Zhha He, Derek Anderson, James Keller, and Marjore Skubc Department of Electrcal and Computer Engneerng Unversty of Mssour,

More information

Steps for Computing the Dissimilarity, Entropy, Herfindahl-Hirschman and. Accessibility (Gravity with Competition) Indices

Steps for Computing the Dissimilarity, Entropy, Herfindahl-Hirschman and. Accessibility (Gravity with Competition) Indices Steps for Computng the Dssmlarty, Entropy, Herfndahl-Hrschman and Accessblty (Gravty wth Competton) Indces I. Dssmlarty Index Measurement: The followng formula can be used to measure the evenness between

More information

PROJECTIVE RECONSTRUCTION OF BUILDING SHAPE FROM SILHOUETTE IMAGES ACQUIRED FROM UNCALIBRATED CAMERAS

PROJECTIVE RECONSTRUCTION OF BUILDING SHAPE FROM SILHOUETTE IMAGES ACQUIRED FROM UNCALIBRATED CAMERAS PROJECTIVE RECONSTRUCTION OF BUILDING SHAPE FROM SILHOUETTE IMAGES ACQUIRED FROM UNCALIBRATED CAMERAS Po-Lun La and Alper Ylmaz Photogrammetrc Computer Vson Lab Oho State Unversty, Columbus, Oho, USA -la.138@osu.edu,

More information

A Gradient Difference based Technique for Video Text Detection

A Gradient Difference based Technique for Video Text Detection 2009 10th Internatonal Conference on Document Analyss and Recognton A Gradent Dfference based Technque for Vdeo Text Detecton Palaahnakote Shvakumara, Trung Quy Phan and Chew Lm Tan School of Computng,

More information

MOTION PANORAMA CONSTRUCTION FROM STREAMING VIDEO FOR POWER- CONSTRAINED MOBILE MULTIMEDIA ENVIRONMENTS XUNYU PAN

MOTION PANORAMA CONSTRUCTION FROM STREAMING VIDEO FOR POWER- CONSTRAINED MOBILE MULTIMEDIA ENVIRONMENTS XUNYU PAN MOTION PANORAMA CONSTRUCTION FROM STREAMING VIDEO FOR POWER- CONSTRAINED MOBILE MULTIMEDIA ENVIRONMENTS by XUNYU PAN (Under the Drecton of Suchendra M. Bhandarkar) ABSTRACT In modern tmes, more and more

More information

Detecting Irregularities in Images and in Video

Detecting Irregularities in Images and in Video Detectng Irregulartes n Images and n Vdeo Oren Boman Mchal Iran Dept. of Computer Scence and Appled Math The Wezmann Insttute of Scence 76100 Rehovot, Israel Abstract We address the problem of detectng

More information

Content Based Image Retrieval Using 2-D Discrete Wavelet with Texture Feature with Different Classifiers

Content Based Image Retrieval Using 2-D Discrete Wavelet with Texture Feature with Different Classifiers IOSR Journal of Electroncs and Communcaton Engneerng (IOSR-JECE) e-issn: 78-834,p- ISSN: 78-8735.Volume 9, Issue, Ver. IV (Mar - Apr. 04), PP 0-07 Content Based Image Retreval Usng -D Dscrete Wavelet wth

More information

Discrete-Continuous Depth Estimation from a Single Image

Discrete-Continuous Depth Estimation from a Single Image Dscrete-Contnuous Depth Estmaton from a Sngle Image Maomao Lu, Matheu Salzmann, Xumng He NICTA and CECS, ANU, Canberra {maomao.lu, matheu.salzmann, xumng.he}@ncta.com.au Abstract In ths paper, we tackle

More information

Outline. Type of Machine Learning. Examples of Application. Unsupervised Learning

Outline. Type of Machine Learning. Examples of Application. Unsupervised Learning Outlne Artfcal Intellgence and ts applcatons Lecture 8 Unsupervsed Learnng Professor Danel Yeung danyeung@eee.org Dr. Patrck Chan patrckchan@eee.org South Chna Unversty of Technology, Chna Introducton

More information

Collaborative Tracking of Objects in EPTZ Cameras

Collaborative Tracking of Objects in EPTZ Cameras Collaboratve Trackng of Objects n EPTZ Cameras Fasal Bashr and Fath Porkl * Mtsubsh Electrc Research Laboratores, Cambrdge, MA, USA ABSTRACT Ths paper addresses the ssue of mult-source collaboratve object

More information

MOTION BLUR ESTIMATION AT CORNERS

MOTION BLUR ESTIMATION AT CORNERS Gacomo Boracch and Vncenzo Caglot Dpartmento d Elettronca e Informazone, Poltecnco d Mlano, Va Ponzo, 34/5-20133 MILANO boracch@elet.polm.t, caglot@elet.polm.t Keywords: Abstract: Pont Spread Functon Parameter

More information

Fuzzy Filtering Algorithms for Image Processing: Performance Evaluation of Various Approaches

Fuzzy Filtering Algorithms for Image Processing: Performance Evaluation of Various Approaches Proceedngs of the Internatonal Conference on Cognton and Recognton Fuzzy Flterng Algorthms for Image Processng: Performance Evaluaton of Varous Approaches Rajoo Pandey and Umesh Ghanekar Department of

More information

Fitting & Matching. Lecture 4 Prof. Bregler. Slides from: S. Lazebnik, S. Seitz, M. Pollefeys, A. Effros.

Fitting & Matching. Lecture 4 Prof. Bregler. Slides from: S. Lazebnik, S. Seitz, M. Pollefeys, A. Effros. Fttng & Matchng Lecture 4 Prof. Bregler Sldes from: S. Lazebnk, S. Setz, M. Pollefeys, A. Effros. How do we buld panorama? We need to match (algn) mages Matchng wth Features Detect feature ponts n both

More information

Overview. Basic Setup [9] Motivation and Tasks. Modularization 2008/2/20 IMPROVED COVERAGE CONTROL USING ONLY LOCAL INFORMATION

IMPROVED COVERAGE CONTROL USING ONLY LOCAL INFORMATION. Jay Wagenpfeil, Adrian Trachte. Overview: Introduction; Multi- Simulator MASIM; Theoretical Work and Simulation Results; Conclusion. Motivation and Tasks; Basic Setup

A Robust Method for Estimating the Fundamental Matrix

Proc. VIIth Digital Image Computing: Techniques and Applications, Sun C., Talbot H., Ourselin S. and Adriaansen T. (Eds.), Dec. 2003, Sydney. C.L. Feng and Y.S.

Learning the Kernel Parameters in Kernel Minimum Distance Classifier

Daoqiang Zhang, Songcan Chen and Zhi-Hua Zhou. National Laboratory for Novel Software Technology, Nanjing University, Nanjing, China; Department

Simulation: Solving Dynamic Models ABE 5646 Week 11 Chapter 2, Spring 2010

Week: Description; Reading Material. Mar 5-Mar 9: Evaluating [Crop] Models. Comparing a model with data: graphical, errors; measures of agreement

Vanishing Hull. Jinhui Hu, Suya You, Ulrich Neumann University of Southern California {jinhuihu,suyay,

Vanishing Hull. Jinhui Hu, Suya You, Ulrich Neumann University of Southern California {jinhuihu,suyay, Vanshng Hull Jnhu Hu Suya You Ulrch Neumann Unversty of Southern Calforna {jnhuhusuyay uneumann}@graphcs.usc.edu Abstract Vanshng ponts are valuable n many vson tasks such as orentaton estmaton pose recovery

More information

Real-time Joint Tracking of a Hand Manipulating an Object from RGB-D Input

Real-time Joint Tracking of a Hand Manipulating an Object from RGB-D Input Real-tme Jont Tracng of a Hand Manpulatng an Object from RGB-D Input Srnath Srdhar 1 Franzsa Mueller 1 Mchael Zollhöfer 1 Dan Casas 1 Antt Oulasvrta 2 Chrstan Theobalt 1 1 Max Planc Insttute for Informatcs

More information

A Bilinear Model for Sparse Coding

A Bilinear Model for Sparse Coding A Blnear Model for Sparse Codng Davd B. Grmes and Rajesh P. N. Rao Department of Computer Scence and Engneerng Unversty of Washngton Seattle, WA 98195-2350, U.S.A. grmes,rao @cs.washngton.edu Abstract

More information

Classifying Acoustic Transient Signals Using Artificial Intelligence

Classifying Acoustic Transient Signals Using Artificial Intelligence Classfyng Acoustc Transent Sgnals Usng Artfcal Intellgence Steve Sutton, Unversty of North Carolna At Wlmngton (suttons@charter.net) Greg Huff, Unversty of North Carolna At Wlmngton (jgh7476@uncwl.edu)

More information

Resolving Ambiguity in Depth Extraction for Motion Capture using Genetic Algorithm

Resolving Ambiguity in Depth Extraction for Motion Capture using Genetic Algorithm Resolvng Ambguty n Depth Extracton for Moton Capture usng Genetc Algorthm Yn Yee Wa, Ch Kn Chow, Tong Lee Computer Vson and Image Processng Laboratory Dept. of Electronc Engneerng The Chnese Unversty of

More information

Range images. Range image registration. Examples of sampling patterns. Range images and range surfaces

Range images. Range image registration. Examples of sampling patterns. Range images and range surfaces Range mages For many structured lght scanners, the range data forms a hghly regular pattern known as a range mage. he samplng pattern s determned by the specfc scanner. Range mage regstraton 1 Examples

More information

Lecture #15 Lecture Notes

Lecture #15 Lecture Notes Lecture #15 Lecture Notes The ocean water column s very much a 3-D spatal entt and we need to represent that structure n an economcal way to deal wth t n calculatons. We wll dscuss one way to do so, emprcal

More information

Helsinki University Of Technology, Systems Analysis Laboratory Mat Independent research projects in applied mathematics (3 cr)

Helsinki University Of Technology, Systems Analysis Laboratory Mat Independent research projects in applied mathematics (3 cr) Helsnk Unversty Of Technology, Systems Analyss Laboratory Mat-2.08 Independent research projects n appled mathematcs (3 cr) "! #$&% Antt Laukkanen 506 R ajlaukka@cc.hut.f 2 Introducton...3 2 Multattrbute

More information

Complex Numbers. Now we also saw that if a and b were both positive then ab = a b. For a second let s forget that restriction and do the following.

Complex Numbers. Now we also saw that if a and b were both positive then ab = a b. For a second let s forget that restriction and do the following. Complex Numbers The last topc n ths secton s not really related to most of what we ve done n ths chapter, although t s somewhat related to the radcals secton as we wll see. We also won t need the materal

More information

SVM-based Learning for Multiple Model Estimation

SVM-based Learning for Multiple Model Estimation SVM-based Learnng for Multple Model Estmaton Vladmr Cherkassky and Yunqan Ma Department of Electrcal and Computer Engneerng Unversty of Mnnesota Mnneapols, MN 55455 {cherkass,myq}@ece.umn.edu Abstract:

More information

Active 3D scene segmentation and detection of unknown objects

Active 3D scene segmentation and detection of unknown objects Actve 3D scene segmentaton and detecton of unknown objects Mårten Björkman and Danca Kragc Abstract We present an actve vson system for segmentaton of vsual scenes based on ntegraton of several cues. The

More information

Tracking by Cluster Analysis of Feature Points and Multiple Particle Filters 1

Tracking by Cluster Analysis of Feature Points and Multiple Particle Filters 1 Tracng by Cluster Analyss of Feature Ponts and Multple Partcle Flters 1 We Du, Justus Pater Unversty of Lège, Department of Electrcal Engneerng and Computer Scence, Insttut Montefore, B28, Sart Tlman Campus,

More information

IMAGE MATCHING WITH SIFT FEATURES A PROBABILISTIC APPROACH

IMAGE MATCHING WITH SIFT FEATURES A PROBABILISTIC APPROACH IMAGE MATCHING WITH SIFT FEATURES A PROBABILISTIC APPROACH Jyot Joglekar a, *, Shrsh S. Gedam b a CSRE, IIT Bombay, Doctoral Student, Mumba, Inda jyotj@tb.ac.n b Centre of Studes n Resources Engneerng,

More information

Detection of an Object by using Principal Component Analysis

Detection of an Object by using Principal Component Analysis Detecton of an Object by usng Prncpal Component Analyss 1. G. Nagaven, 2. Dr. T. Sreenvasulu Reddy 1. M.Tech, Department of EEE, SVUCE, Trupath, Inda. 2. Assoc. Professor, Department of ECE, SVUCE, Trupath,

More information

An Application of the Dulmage-Mendelsohn Decomposition to Sparse Null Space Bases of Full Row Rank Matrices

An Application of the Dulmage-Mendelsohn Decomposition to Sparse Null Space Bases of Full Row Rank Matrices Internatonal Mathematcal Forum, Vol 7, 2012, no 52, 2549-2554 An Applcaton of the Dulmage-Mendelsohn Decomposton to Sparse Null Space Bases of Full Row Rank Matrces Mostafa Khorramzadeh Department of Mathematcal

More information

Trajectory Triangulation over Conic Sections

Amnon Shashua, Shai Avidan, Michael Werman. Institute of Computer Science, The Hebrew University, Jerusalem 91904, Israel ({shashua, avidan, werman}@cs.huji.ac.il). Abstract

Long-Term Moving Object Segmentation and Tracking Using Spatio-Temporal Consistency

Long-Term Moving Object Segmentation and Tracking Using Spatio-Temporal Consistency Long-Term Movng Obect Segmentaton Trackng Usng Spato-Temporal Consstency D Zhong Shh-Fu Chang {dzhong, sfchang}@ee.columba.edu Department of Electrcal Engneerng, Columba Unversty, NY, USA Abstract The

More information

Simulation Based Analysis of FAST TCP using OMNET++

Simulation Based Analysis of FAST TCP using OMNET++ Smulaton Based Analyss of FAST TCP usng OMNET++ Umar ul Hassan 04030038@lums.edu.pk Md Term Report CS678 Topcs n Internet Research Sprng, 2006 Introducton Internet traffc s doublng roughly every 3 months

More information

Associating Groups of People

Wei-Shi Zheng (jason@dcs.qmul.ac.uk), Shaogang Gong (sgg@dcs.qmul.ac.uk), Tao Xiang (txiang@dcs.qmul.ac.uk), School of EECS, Queen Mary University. (a) Ambiguities arise from person re-identification in isolation; (b) associating groups of people may reduce ambiguities in matching.

AP PHYSICS B 2008 SCORING GUIDELINES

AP PHYSICS B 2008 SCORING GUIDELINES AP PHYSICS B 2008 SCORING GUIDELINES General Notes About 2008 AP Physcs Scorng Gudelnes 1. The solutons contan the most common method of solvng the free-response questons and the allocaton of ponts for

More information

A Range Image Refinement Technique for Multi-view 3D Model Reconstruction

A Range Image Refinement Technique for Multi-view 3D Model Reconstruction A Range Image Refnement Technque for Mult-vew 3D Model Reconstructon Soon-Yong Park and Mural Subbarao Electrcal and Computer Engneerng State Unversty of New York at Stony Brook, USA E-mal: parksy@ece.sunysb.edu

More information

Online Detection and Classification of Moving Objects Using Progressively Improving Detectors

Online Detection and Classification of Moving Objects Using Progressively Improving Detectors Onlne Detecton and Classfcaton of Movng Objects Usng Progressvely Improvng Detectors Omar Javed Saad Al Mubarak Shah Computer Vson Lab School of Computer Scence Unversty of Central Florda Orlando, FL 32816

More information

Adaptive Silhouette Extraction In Dynamic Environments Using Fuzzy Logic. Xi Chen, Zhihai He, James M. Keller, Derek Anderson, and Marjorie Skubic

Adaptive Silhouette Extraction In Dynamic Environments Using Fuzzy Logic. Xi Chen, Zhihai He, James M. Keller, Derek Anderson, and Marjorie Skubic 2006 IEEE Internatonal Conference on Fuzzy Systems Sheraton Vancouver Wall Centre Hotel, Vancouver, BC, Canada July 16-21, 2006 Adaptve Slhouette Extracton In Dynamc Envronments Usng Fuzzy Logc X Chen,

More information

Segmentation and Tracking of Multiple Humans in Crowded Environments

Segmentation and Tracking of Multiple Humans in Crowded Environments 1 Segmentaton and Trackng of Multple Humans n Crowded Envronments Tao Zhao, Member, IEEE, Ram Nevata, Fellow, IEEE, and Bo Wu, Student Member, IEEE, Abstract Segmentaton and trackng of multple humans n

More information

Circuit Analysis I (ENGR 2405) Chapter 3 Method of Analysis Nodal(KCL) and Mesh(KVL)

Circuit Analysis I (ENGR 2405) Chapter 3 Method of Analysis Nodal(KCL) and Mesh(KVL) Crcut Analyss I (ENG 405) Chapter Method of Analyss Nodal(KCL) and Mesh(KVL) Nodal Analyss If nstead of focusng on the oltages of the crcut elements, one looks at the oltages at the nodes of the crcut,

More information

Lecture 9 Fitting and Matching

In this lecture, we're going to talk about a number of problems related to fitting and matching. We will formulate these problems formally and our discussion will involve Least Squares methods, RANSAC and Hough

Context-Specific Bayesian Clustering for Gene Expression Data

Context-Specific Bayesian Clustering for Gene Expression Data Context-Specfc Bayesan Clusterng for Gene Expresson Data Yoseph Barash School of Computer Scence & Engneerng Hebrew Unversty, Jerusalem, 91904, Israel hoan@cs.huj.ac.l Nr Fredman School of Computer Scence

More information

Interactive Virtual Relighting of Real Scenes

Interactive Virtual Relighting of Real Scenes Frst submtted: October 1998 (#846). Edtor/revewers please consult accompanyng document wth detaled responses to revewer comments. Interactve Vrtual Relghtng of Real Scenes Célne Loscos, George Drettaks,

More information

An Iterative Solution Approach to Process Plant Layout using Mixed Integer Optimisation

An Iterative Solution Approach to Process Plant Layout using Mixed Integer Optimisation 17 th European Symposum on Computer Aded Process Engneerng ESCAPE17 V. Plesu and P.S. Agach (Edtors) 2007 Elsever B.V. All rghts reserved. 1 An Iteratve Soluton Approach to Process Plant Layout usng Mxed

More information

Editorial Manager(tm) for International Journal of Computer Vision Manuscript Draft

Editorial Manager(tm) for International Journal of Computer Vision Manuscript Draft Editorial Manager(tm) for International Journal of Computer Vision Manuscript Draft Manuscript Number: Title: Probabilistic Multi-view Dynamic Scene Reconstruction and Occlusion Reasoning from Silhouette

More information

Fusion Performance Model for Distributed Tracking and Classification

Fusion Performance Model for Distributed Tracking and Classification Fuson Performance Model for Dstrbuted rackng and Classfcaton K.C. Chang and Yng Song Dept. of SEOR, School of I&E George Mason Unversty FAIRFAX, VA kchang@gmu.edu Martn Lggns Verdan Systems Dvson, Inc.

More information

A Fast Content-Based Multimedia Retrieval Technique Using Compressed Data

A Fast Content-Based Multimedia Retrieval Technique Using Compressed Data A Fast Content-Based Multmeda Retreval Technque Usng Compressed Data Borko Furht and Pornvt Saksobhavvat NSF Multmeda Laboratory Florda Atlantc Unversty, Boca Raton, Florda 3343 ABSTRACT In ths paper,

More information