Multi-Scale Object Candidates for Generic Object Tracking in Street Scenes


Aljoša Ošep, Alexander Hermans, Francis Engelmann, Dirk Klostermann, Markus Mathias and Bastian Leibe

Abstract

Most vision-based systems for object tracking in urban environments focus on a limited number of important object categories such as cars or pedestrians, for which powerful detectors are available. However, practical driving scenarios contain many additional objects of interest, for which suitable detectors either do not yet exist or would be cumbersome to obtain. In this paper we propose a more general tracking approach which does not follow the often used tracking-by-detection principle. Instead, we investigate how far we can get by tracking unknown, generic objects in challenging street scenes. As such, we do not restrict ourselves to only tracking the most common categories, but are able to handle a large variety of static and moving objects. We evaluate our approach on the KITTI dataset and show competitive results for the annotated classes, even though we are not restricted to them.

I. INTRODUCTION

Outdoor visual scene understanding is a key component for autonomous mobile systems. Specifically, detection and tracking of other traffic participants are essential steps towards safe navigation and path planning through populated urban areas. Recent results on standard benchmarks [1] show that some object categories, such as cars or pedestrians, can already be tracked rather reliably by state-of-the-art tracking-by-detection approaches [2], [3], [4], [5]. In practical driving scenarios, however, there are numerous other objects that could pose potential safety hazards, and it quickly becomes infeasible to train specific detectors for all possible classes.

In this paper, we therefore investigate the problem of generic object tracking in street scenes. Rather than starting from the output of a class-specific detector, we try to extract a set of object candidate regions purely from low-level cues and to track them over time. This approach has the advantage that it is not a priori restricted in the types of objects that can be tracked.
However, the tracking task becomes much more challenging, since it requires solving a complex figure-ground segmentation problem in every frame to decide which scene regions contain valid objects and at what spatial extent those objects should be represented. In order to address this segmentation problem, we make use of scene information from stereo depth to generate Generic Object Proposals (GOPs) in 3D and keep only those proposals that can consistently be tracked over a sequence of frames. In our tracking step, we link these object proposals into trajectories and integrate the individual 3D measurements into a 3D shape model for each tracked object. We jointly reason about valid object proposals and their corresponding trajectories via a model selection based multi-object tracking procedure.

All authors are with the Visual Computing Institute, RWTH Aachen University. {lastname}@vision.rwth-aachen.de

Fig. 1. We propose an approach to track generic objects in street scenes that goes beyond the capabilities of pre-trained object detectors. Our approach can handle a wide variety of static and moving objects of different sizes and robustly track them. Blue areas indicate potential object regions.

For such an approach to work, the generation of good object proposals is a key requirement. This is a very challenging problem, since the unknown objects may originate from vastly different scales (see Fig. 1), measurements from nearby objects tend to merge, and objects close to scene structures are difficult to segment due to often noisy stereo data. To reach acceptable recall values, state-of-the-art appearance-based object proposal generation approaches [6] typically need several thousand object proposals per frame, two orders of magnitude more than what would be tractable to use in a tracking framework.

We propose a novel robust multi-scale object proposal extraction procedure that uses a two-stage segmentation approach. First, a coarse supervised segmentation removes non-object regions corresponding to known background categories such as road, building, or vegetation.
Next, we perform a fine unsupervised multi-scale segmentation to extract scale-stable object proposals from the remaining scene regions. As many of these proposals may overlap and the correct object scale often cannot be determined on a single-frame basis, we perform multi-hypothesis tracking at the level of object proposals.

In summary, our main contributions are:
1) We present a novel, scalable approach that successfully tracks a large variety of generic objects in challenging street scenes.
2) As a key component of this approach, we propose a robust multi-scale 3D object proposal extraction procedure based on a two-stage segmentation and scale-stable clustering.
3) We demonstrate the validity of our approach quantitatively and qualitatively on the KITTI dataset [1]. We show that our approach can compete with state-of-the-art detector-based methods in close and medium camera distance.

Object definition. In the remainder of this paper we refer to an object as an entity that appears in urban street scenes, sticks out of the ground plane, has a well-defined closed boundary in space [7], and is surrounded by a certain band of free space. In addition, objects need to appear consistently in a sequence of frames, either moving or not, and maintain a roughly consistent appearance. We also assume a size range for objects of interest between 0.5 m and 5 m. This definition includes other traffic participants, as well as static/parked vehicles and items of street furniture. We explicitly exclude only items that are better explained by stuff categories such as vegetation or building facade.

II. RELATED WORK

Many approaches have been proposed for object tracking in street scenarios [8], [2], [9], [3], [5]. Most of those follow a tracking-by-detection strategy by first applying detectors trained for specific categories on each frame and then linking the detections into trajectories. The KITTI tracking benchmark [1] gives a good overview of such tracking methods. Zhang et al. [5] pose tracking as a maximum-a-posteriori data association problem with non-overlap constraints. Pirsiavash et al. [3] consider tracking as a spatio-temporal grouping problem and propose a greedy global optimization approach. Milan et al. [2] use a continuous energy minimization approach that takes into account physical constraints and track persistence. While these approaches obtain impressive results, they have the drawback that they assume that all interesting object categories are known beforehand and that detectors can be trained for each category.

Recently, the problem of tracking generic objects has received more attention. For automotive scenarios several approaches address this problem using highly precise LIDAR data as input [10], [8], [11], [12]. Petrovskaya et al. [10] use a model-based approach to detect car-sized objects in laser point clouds. Held et al. [12] utilize 3D shape and color information to obtain precise velocity estimates of generic objects in LIDAR data. In contrast, we use depth information obtained from a stereo camera pair, which is far less accurate and requires a more robust processing pipeline.

Other approaches try to find and track generic objects based on motion segmentation. For example, Bewley et al. [13] use a self-supervised framework to detect dynamic object clusters extracted from a monocular camera stream. In contrast to those approaches, we are also interested in tracking static instances of interesting objects.

To the best of our knowledge, only few approaches deal with generic object tracking from stereo depth. Nguyen et al. [14] also target generic objects of several sizes, but they only track moving objects, with the purpose of generating improved occupancy grids of the scene for a driver assistance system. While our pipeline is similar to the approaches of [9], [15], they only track pedestrian-sized objects, whereas we aim to also track larger objects such as cars and vans.

Fig. 2. High-level overview of our pipeline. (Input: stereo pair sequence. Stages: disparity and ground-plane estimation, visual odometry, semantic segmentation (Sec. IV), generic object proposal generation (Sec. V), generic object tracking (Sec. VI).)

A key part of our pipeline is the generation of good Generic Object Proposals (GOPs). Several previous methods have been proposed for this step, often based on LIDAR data. Wang et al. [16] use a minimal-spanning-tree clustering approach to extract 3D object proposals from LIDAR data and then classify them into background, bicyclist, car, or pedestrian. Ioannou et al. [17] propose a Difference-of-Normals operator to extract scale-stable object proposal regions from LIDAR data. We compare against this approach in Sec. VII. Bansal et al. [18] propose a semantic structure labeling approach based on stereo data in order to create proposal regions for a pedestrian detector. The resulting regions are too coarse and would not generalize well to all interesting objects.
There is also a large set of approaches that try to find good object proposals in the form of bounding boxes from color images [7], [6], [19]. While state-of-the-art methods such as EdgeBoxes [6] obtain a very high recall, their precision is too low to be applicable for our approach.

Our approach builds upon a semantic segmentation to reject scene parts that can be well explained by background categories. Several other approaches have already demonstrated semantic segmentation on different subsets of KITTI [1]. Xu et al. [20] fuse information from several sensors to classify superpixels, while we only rely on stereo data. Ladický et al. [21] jointly infer disparity maps and dense semantic segmentations based on a monocular image using a combination of depth and semantic classifiers. Ros et al. [22] pre-compute a high-quality semantic map of the static parts of a scene in order to later label the environment based on the current location within that map. Objects that appeared in the scene can then automatically be labeled. While this is fast and gives good results, our approach also generalizes to unknown scenes.

III. METHOD OVERVIEW

Fig. 2 gives an overview of our approach. Given a sequence of stereo image pairs, we compute disparity maps using ELAS [23]. From a disparity map we generate a point cloud and fit a ground plane using RANSAC. To narrow down the 3D search space for potential objects, we perform supervised coarse semantic segmentation on the point cloud (Sec. IV). Based on the idea of things and stuff [24], we remove all points that belong to stuff regions such as road, sky, or building points. This gives us a coarse idea where potential objects could be located. On the remaining point cloud we perform a multi-scale search for generic object proposals (Sec. V). Each proposal is defined by a set of 3D points. The result is an over-complete set of possible 3D objects. In the last step we perform object tracking. First, we transform the 3D object proposals to the common coordinate frame using the visual odometry of [25] and link them across frames. Next, we identify the best set of objects and their corresponding tracks (Sec. VI). We perform this selection jointly in a model selection based multi-hypothesis tracking framework, which searches for the subset of object trajectory hypotheses that together best explains the observed data.

IV. SEMANTIC SEGMENTATION

We use a supervised semantic segmentation approach to classify parts of the scene that do not resemble objects and can therefore be removed from further processing. In contrast to classical semantic segmentation tasks, we are interested in correctly recognizing the known background categories while generalizing to potentially unseen object categories. To achieve that, we specifically use features that capture the background categories well. We treat cars and pedestrians as one single object class, such that after semantic segmentation we know that something is an object, but not of what kind.

We follow the design of a typical segmentation pipeline: starting with an over-segmentation, features are extracted for each segment and are then used to classify the segments into semantic categories. A Conditional Random Field (CRF) is then applied to enforce spatial coherence. As our further approach operates in 3D, we use the VCCS algorithm [26] to partition the point cloud into segments. For each segment we compute several features which can be grouped into four categories:

Appearance. We compute L*a*b* histograms over the points within a segment. We use three separate histograms for L*, a*, and b*, each containing 10 bins (30 dimensions). Furthermore, we compute the mean and covariance of the L*a*b* gradients within the segment (3+6 dimensions, as the covariance is symmetric). Finally, we add histograms of textons, similar to those used in [27]. We textonize the whole image and create a histogram of textons within the segment (50 dimensions), giving a total of 89 dimensions. Only the appearance features are based on the color image, while all further features are based on the 3D point cloud.

Density. These features are largely inspired by Bansal et al. [18]. Based on the orientation of an estimated ground plane, we slice the 3D space into 3 height bands and project the points of each band onto a density map. This gives us densities for 3 height regions. The density maps are then discretized using 3 resolutions. By projecting a segment's centroid onto each density map we are able to select a cell in each layer and resolution (3 × 3 = 9 grid cells). We also consider the 4-neighborhood of the selected cells in each layer (4 × 9 = 36 grid cells), resulting in a total of 45 dimensions. Furthermore, we count the 3D points within a segment, which represents the density of the segment itself, summing up to a total of 46 feature dimensions.

Geometry. Based on the covariance matrix of the 3D points within one segment, we compute several spectral and directional features [28]. From the eigenvalues we compute the point-ness, linear-ness, surface-ness and curvature of the segment. From the eigenvectors we determine the segment normal and the cosines between both the normal and tangent vectors and the ground plane normal. Finally, we compute a tight bounding box of the segment along the principal axes. This results in a total of 12 feature dimensions.

Location. This feature represents a location prior with 3 dimensions. It consists of: the height of the segment centroid, the depth of the segment centroid and the horizontal angle between the camera's optical axis and the vector from the camera center to the segment centroid.
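As a quick check of the arithmetic, the per-category dimensions quoted above can be tallied in a few lines. The variable names are illustrative only and do not come from the paper.

```python
# Feature dimensions per category, using only the numbers quoted in the text.
# All names here are illustrative.
appearance = 3 * 10 + (3 + 6) + 50   # L*a*b* histograms + gradient mean/covariance + textons
density = 3 * 3 + 4 * (3 * 3) + 1    # selected cells + their 4-neighborhoods + segment point count
geometry = 12                        # spectral/directional features and bounding box
location = 3                         # centroid height, centroid depth, horizontal angle

feature_dims = {
    "appearance": appearance,
    "density": density,
    "geometry": geometry,
    "location": location,
}
total = sum(feature_dims.values())

for name, dims in feature_dims.items():
    print(f"{name:>10}: {dims}")
print(f"{'total':>10}: {total}")
```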
Thus, our resulting feature vector consists of a total of 150 dimensions. We then train a Random Forest classifier [29] with single-attribute tests, yielding class posteriors for every segment. A fully connected CRF [30], defined over the segment centers in 3D, further improves the results. We use a close-range smoothing kernel defined only over the 3D centroid locations and a larger-range appearance kernel defined over the 3D centroid and the average L*a*b* color of a segment. From this semantic segmentation we only consider the segments labeled as object for our further steps.

V. GENERIC OBJECT PROPOSAL GENERATION

The multi-scale object proposal generation method produces a ranked set of object proposals (GOPs) from the remaining object regions within the point cloud. In addition to correct object proposals (targets for tracking), this set may still contain under- and over-segmentations (e.g., car parts, groups of pedestrians, pedestrians merged with other objects). These overlapping and competing proposals are a major difference to previous single-scale approaches [9] and make the data association task more challenging. In order to support efficient data association and tracking, the object proposal generation procedure should achieve a high recall with a very small set of object proposals. Current appearance-based object proposal methods are able to achieve good recall, but at the cost of very large proposal sets (for an overview see [19]).

The multi-scale search for object proposals is necessary for several reasons. Firstly, sizes of potentially interesting objects fundamentally differ (e.g., pedestrians and vans). Secondly, the observed objects might be just partially visible. Noisy stereo point clouds typically contain severe depth artifacts and outliers. This makes our problem even harder and requires a robust approach, which we describe in detail in the next subsections. In a nutshell, we first project the 3D point cloud to the ground plane and compute a density map of the 3D points. Then we perform a multi-scale search for object proposals as follows. We iteratively smooth the density map and identify blobs (clusters) around modes in the density map using Quick-Shift [31] at each scale. Our final proposals are clusters that persist in the scale space of the density map.

Fig. 3. Semantic segmentation allows us to only consider points labeled as object. Since object sizes are unknown, we consider different scales of the ground-plane density map. (Panels: camera image, point cloud, ground-plane scale space $D_{\sigma_1}, D_{\sigma_2}, \dots, D_{\sigma_K}$.)

Scale-Space Representation of the Density Map. First, we discretize the ground plane of the point cloud into a regular grid and compute the point-density map $D$ by projecting the 3D point cloud to the ground plane. Each grid cell stores the scalar value representing the density of points falling into the cell. In addition, cells store a list of associated 3D points. We create a ground-plane scale-space representation of the density map $D_{\sigma_k}$, $k = 1 \dots K$, by convolving $D$ with a Gaussian kernel $\sigma_k$ whose size increases in each iteration $k$ (see Fig. 3).

Multi-Scale Clustering. In the next step, we apply Quick-Shift clustering [31] to obtain the modes of the scale-filtered density $D_{\sigma_k}$: $C_k = \{\mathrm{cluster}_k(m)\}$, $m = 1 \dots \#\mathrm{clusters}$. A cluster $\mathrm{cluster}_k(m) = [\{\mathrm{cell}_c\}_m, BB_m^{2D}, s_m]$ is defined by the set of cells $\{\mathrm{cell}_c\}_m$, $c = 1 \dots \#\mathrm{cells}$, that converged to its mode. $BB_m^{2D}$ represents a 2D bounding box that is computed by projecting the corresponding 3D points to the image plane (see Fig. 3) and $s_m$ is a scale-stability count.

Identification of Scale-Stable Clusters. In order to obtain a compact set of GOPs we identify the clusters that persist over scales. This step is motivated by results from scale-space filtering [32], namely that the most scale-stable proposals also tend to be the most salient ones. We identify scale-stable clusters by iterating through the cluster sets $C_k$, $k = 1 \dots K$, and searching for similar clusters between sets $C_k$ and $C_{k+1}$. If two clusters $\mathrm{cluster}_k(j)$, $\mathrm{cluster}_{k+1}(l)$ are very similar according to our scale-stability criterion, then we merge them.
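To make the ground-plane density map concrete, the following minimal sketch bins 3D points into a regular grid and keeps, per cell, both a point count and the list of associated points. It assumes a coordinate frame in which the ground plane is y = 0 (x lateral, z forward), uses the raw point count as the density value, and picks a hypothetical cell size of 0.25 m; none of these choices is specified in the text. Gaussian smoothing and Quick-Shift clustering are not part of the sketch.

```python
import math
from collections import defaultdict


def ground_plane_density(points, cell_size=0.25):
    """Project 3D points onto the ground plane and bin them into a regular grid.

    points: iterable of (x, y, z) tuples; the ground plane is assumed to be y = 0.
    Returns (density, cells): density maps a grid index (ix, iz) to a point count,
    cells maps the same index to the list of 3D points that fell into that cell.
    """
    cells = defaultdict(list)
    for point in points:
        x, _, z = point  # drop the height coordinate when projecting
        index = (math.floor(x / cell_size), math.floor(z / cell_size))
        cells[index].append(point)
    density = {index: len(members) for index, members in cells.items()}
    return density, dict(cells)


# Hypothetical points: two fall into the same cell, one lies elsewhere.
example_points = [(0.1, 1.0, 5.05), (0.2, 0.5, 5.1), (-0.3, 1.2, 7.9)]
example_density, example_cells = ground_plane_density(example_points)
```

Keeping the point lists alongside the counts mirrors the statement that each cell stores both a density value and its associated 3D points, so that a cluster of cells can later be turned back into a 3D point set.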
Merging is done by removing $\mathrm{cluster}_k(j)$ from $C_k$, merging it with $\mathrm{cluster}_{k+1}(l)$, and incrementing the scale-stability count of $\mathrm{cluster}_{k+1}(l)$ by 1. In our scenario, two clusters should be declared as similar when they (roughly) correspond to the same object. This motivates the following scale-stability criterion: two clusters $\mathrm{cluster}_k(j)$ and $\mathrm{cluster}_{k+1}(l)$ are similar when their bounding boxes $BB_j^{2D}$ and $BB_l^{2D}$ have a very high overlap. To be specific, we compute the Jaccard index $J(\cdot,\cdot)$ of the two bounding boxes and declare them as the same cluster if $J(BB_j^{2D}, BB_l^{2D}) > 0.9$.

Finally we obtain a set of GOPs $\{\Omega_i^t\}$ for frame $t$, where each GOP is defined as:

$$\Omega_i^t = [\,p_i^t,\ C_{i,3D}^t,\ h_i^t,\ S_i^t,\ r_i^t\,], \tag{1}$$

where $p_i^t$ is the 3D position of the $i$-th GOP, projected onto the ground plane. $C_{i,3D}^t$ is a 3 × 3 covariance matrix representing the uncertainty in the 3D position $p_i^t$, computed as [33]

$$C_{i,3D}^t = \left(F_{c_L}^\top C_{2D}^{-1} F_{c_L} + F_{c_R}^\top C_{2D}^{-1} F_{c_R}\right)^{-1}, \tag{2}$$

where $F_{c_L}, F_{c_R}$ are Jacobians of the projection matrices of both cameras and $C_{2D}$ is the covariance of pixel measurements. $h_i^t$ denotes a color histogram, computed by dividing the bounding box of the GOP into 4 × 4 cells and stacking their RGB color histograms. $S_i^t \subset \mathbb{R}^3$ denotes the set of 3D point measurements of the GOP (in the camera space) and the scalar $r_i^t \in [0,1]$ is the object stability score, computed as $r_i^t = s_i / K$, where $s_i$ is the scale-stability of the proposal.

VI. TRACKING

Starting with the previously introduced, possibly overlapping GOPs $\{\Omega_i^{0:t}\}$, we now want to find a set of most likely objects and their trajectories $\{H_n\}$. Our basic assumption is that correct GOPs have a higher chance of producing stable trajectories with consistent appearance than GOPs caused by noise and incorrect segmentations. We approach this problem by performing tracking and object selection jointly in a multi-hypothesis tracking framework. Unlike classic tracking approaches, we are not only looking for physically exclusive inlier detections (i.e., is the track continued by detection A or detection B?), but we also have an inlier hypothesis ambiguity on physically overlapping object proposals (see Fig. 4). We tackle this challenging multi-hypothesis tracking problem on the object proposal level by maintaining a list of physically overlapping object-trajectory hypotheses that compete for the (potentially overlapping) GOPs. At each time step, our algorithm selects a subset of hypotheses that best explains the observations.

We formulate tracking as a model selection procedure and extend our previous work [34], [35], where trajectories with consistent motion and appearance are preferred. Additionally, our method takes temporal consistency of the 3D shape of the tracked object into account. Our method is also capable of keeping track of currently not selected track hypotheses. As a concrete example, this means that we may track a group of pedestrians as a single object over a sequence of frames (recall that we have no pedestrian-specific knowledge, so a group of pedestrians is a valid object), but we also keep hypotheses for the individual pedestrians. If at some point their motion starts diverging, the observed data can be better explained by the individual pedestrian hypotheses.

Tracking is performed on the estimated ground plane, using the camera pose computed for each frame with the visual odometry method of [25]. In order to obtain stable 3D shape representations of the tracked objects, we integrate the noisy 3D measurements of the GOPs over time. In the following, we introduce the quadratic pseudo-boolean optimization (QPBO) tracking method by Leibe et al. [34] and our extension of the approach, which enables us to perform tracking without using a detector and to track regions that likely correspond to valid objects.

Fig. 4. Tracking-by-detection associates detections and rejects the incorrect tracks (top). We associate GOPs and penalize incorrect associations (e.g., car parts) but associate both individual pedestrians and pedestrian groups (bottom). (Panel labels: responses from detectors, trajectory hypotheses; generic object proposals, object-trajectory hypotheses.)

A. QPBO Tracking

The idea of [34] is to use a detector to generate an over-complete (possibly physically implausible) set of trajectory hypotheses. Then a (physically plausible) set of hypotheses is selected by solving a quadratic pseudo-boolean optimization problem (QPBO):

$$\arg\max_{m}\ m^\top Q\, m, \quad m \in \{0,1\}, \tag{3}$$

where $m$ is a binary indicator vector that indicates whether each model (hypothesis) was selected or not. The diagonal terms of the matrix $Q$ represent the hypothesis likelihoods (cost benefits for a specific hypothesis) reduced by a constant penalty $\varepsilon_1$ that enforces sparse solutions:

$$q_{nn} = -\varepsilon_1 + \sum_{D_i^t \in H_n^{0:k}} \Big( (1-\varepsilon_2) + \varepsilon_2\, S(D_i^t \mid H_n^{0:k}) \Big). \tag{4}$$

Here, $D_i^t$ represent the supporting detections of the hypothesis $H_n^{0:k}$ and $S(\cdot)$ is the likelihood of the detection belonging to the hypothesis. With off-diagonal entries we model interactions between hypotheses:

$$q_{mn} = -0.5 \Big[ \underbrace{\varepsilon_3\, O(H_n^{0:k}, H_m^{0:k})}_{\text{physical overlap penalty}} + \underbrace{\sum_{D_i^t \in H_n^{0:k} \cap H_m^{0:k}} \Big( (1-\varepsilon_2) + \varepsilon_2\, S(D_i^t \mid H^{*}) \Big)}_{\text{avoiding double-counting of inlier contributions}} \Big], \tag{5}$$

where $H^{*} \in \{H_m, H_n\}$ is the weaker hypothesis. $O(\cdot,\cdot)$ measures the physical overlap of the hypotheses and the second term corrects for double-counting detections that are consistent with both hypotheses. The model parameter $\varepsilon_2$ is the minimal score of the inlier detections and $\varepsilon_3$ weighs the penalization of the physical overlap.

In our formulation, we use the GOPs $\Omega_i^t$ instead of the detections $D_i^t$ and introduce a shape model of the unknown object to the tracking process. Physical overlap between the competing hypotheses is computed as a Bhattacharyya coefficient of the two 2D occupancy histograms of their shape representations.
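The selection problem of Eq. (3) can be illustrated on a toy instance. The sketch below maximizes $m^\top Q m$ by exhaustive enumeration, which is feasible only for a handful of hypotheses and merely stands in for an actual solver, which is not described here. The matrix entries are made-up numbers: positive diagonal entries play the role of hypothesis scores, negative off-diagonal entries the role of the interaction penalty.

```python
from itertools import product


def select_hypotheses(Q):
    """Maximize m^T Q m over all binary vectors m by brute force (tiny problems only)."""
    n = len(Q)
    best_m, best_score = None, float("-inf")
    for m in product((0, 1), repeat=n):
        score = sum(Q[i][j] * m[i] * m[j] for i in range(n) for j in range(n))
        if score > best_score:
            best_m, best_score = m, score
    return best_m, best_score


# Hypothetical example with three hypotheses:
# hypotheses 0 and 1 overlap strongly, hypothesis 2 has too little support.
Q_example = [
    [2.0, -1.5, 0.0],
    [-1.5, 1.8, 0.0],
    [0.0, 0.0, -0.3],
]
selected, score = select_hypotheses(Q_example)
print(selected, score)
```

In this example only the stronger of the two overlapping hypotheses is kept, and the weakly supported third hypothesis is dropped because its diagonal entry is negative, which is the sparsity effect attributed to the constant penalty.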
The occupancy histograms are computed by sampling 3D points from the hypotheses' shape representations and projecting them to the ground plane.

B. Object-Trajectory Hypothesis Generation

The basic unit of our tracker is the object-trajectory hypothesis $H_n^{0:k}$, which spans the frames $0 \dots k$:

$$H_n^{0:k} = [\,I_n^{0:k},\ M_n^{0:k},\ A_n^{0:k},\ S_n^{0:k}\,], \tag{6}$$

where $I_n$ represents the inlier GOP set of the $n$-th hypothesis, $M_n$ is the motion model, $A_n$ the appearance model and $S_n$ is the 3D shape model. Note that an object-trajectory hypothesis does not only hypothesize the trajectory but also the object's shape. This is a fundamental difference compared to the original QPBO tracking.

Hypothesis Generation. Following the QPBO tracking approach, the first step is to generate an over-complete set of hypotheses. In each frame, we extend the old hypothesis set using the new GOP set by running a forward Kalman filter. We start new hypotheses from the new GOPs that were not used for extending old hypotheses by running the Kalman filter backwards. At each Kalman filter step we perform nearest neighbor data association within the validation volume of $C_{i,3D}^t$, selecting inlier GOPs of past frames by evaluating the GOP association probability.

Data Association. We compute the association probability of GOP $\Omega_i^t$ as (we omit the indices $0:k$ to reduce clutter):

$$p(\Omega_i^t \mid H_n) = p(\Omega_i^t \mid A_n)\; p(\Omega_i^t \mid M_n)\; p(\Omega_i^t \mid S_n). \tag{7}$$

As appearance model we compute the Bhattacharyya distance between the trajectory RGB color histogram $A_n$ and the GOP color histogram $h_i^t$:

$$p(\Omega_i^t \mid A_n) = 1 - \sum_{r,g,b} \sqrt{h_i^t(r,g,b)\, A_n(r,g,b)}. \tag{8}$$

For the motion model we assume a constant-velocity Kalman filter with the following state vector:

$$\mathbf{x}_k = [\,x_k,\ y_k,\ \dot{x}_k,\ \dot{y}_k\,]^\top, \tag{9}$$

where $[x_k, y_k]^\top$ represents the 2D position on the ground plane and $[\dot{x}_k, \dot{y}_k]^\top$ the velocity. Given the predicted state $\mathbf{x}_k$ and GOP $\Omega_i^t$, we get the motion model probability as:

$$p(\Omega_i^t \mid M_n) = \exp\!\Big(-\tfrac{1}{2}\,\big(p_i^t - [x_k, 0, y_k]^\top\big)^\top C^{-1} \big(p_i^t - [x_k, 0, y_k]^\top\big)\Big), \tag{10}$$

where $C = C_{i,3D}^t + C_{sys}$, and $C_{sys}$ is the system uncertainty of the Kalman filter.
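The motion term can be sketched in a few lines. The example below propagates a constant-velocity state by one time step and evaluates a Gaussian likelihood of the form used in Eq. (10). As a simplification that is not in the paper, the residual is taken directly on the 2D ground plane with a 2 × 2 covariance instead of the 3D form; the time step `dt`, the function names and the example numbers are hypothetical, and the covariance propagation of the Kalman filter is omitted.

```python
import math


def predict_constant_velocity(state, dt=1.0):
    """Propagate a state (x, y, vx, vy) by dt under a constant-velocity model."""
    x, y, vx, vy = state
    return (x + dt * vx, y + dt * vy, vx, vy)


def motion_likelihood(gop_xy, predicted_state, cov):
    """Evaluate exp(-0.5 * d^T C^{-1} d) for the 2D residual d.

    gop_xy: (x, y) ground-plane position of the proposal.
    cov: 2x2 covariance C as nested sequences, e.g. measurement plus system uncertainty.
    """
    dx = gop_xy[0] - predicted_state[0]
    dy = gop_xy[1] - predicted_state[1]
    (c00, c01), (c10, c11) = cov
    det = c00 * c11 - c01 * c10
    if det <= 0:
        raise ValueError("covariance must be positive definite")
    # C^{-1} = 1/det * [[c11, -c01], [-c10, c00]]
    mahalanobis_sq = (dx * (c11 * dx - c01 * dy) + dy * (-c10 * dx + c00 * dy)) / det
    return math.exp(-0.5 * mahalanobis_sq)


predicted = predict_constant_velocity((0.0, 0.0, 1.0, 0.5), dt=2.0)
identity = ((1.0, 0.0), (0.0, 1.0))
likelihood_exact = motion_likelihood((2.0, 1.0), predicted, identity)
likelihood_offset = motion_likelihood((3.0, 1.0), predicted, identity)
```

A proposal that lies exactly at the predicted position receives the maximum value of 1, and the value decays with the squared Mahalanobis distance, so a larger combined uncertainty tolerates larger deviations from the prediction.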
The shape model is evaluated by:

$$p(\Omega_i^t \mid S_n) = e^{-\alpha\, d_J^{BB_{2D}} - \beta\, d_J^{BB_{3D}}}, \tag{11}$$

where $d_J^{BB_{2D}}$ is the Jaccard distance (defined as $1 - J(\cdot,\cdot)$) between the 2D bounding boxes of the (integrated) hypothesis shape representation and the GOP. These bounding boxes are computed by projecting the associated 3D points to the camera image plane. $d_J^{BB_{3D}}$ is the Jaccard distance between their 3D bounding boxes and $\alpha, \beta$ are the weighting factors for both terms. Finally, the fit of the GOP $\Omega_i^t$ to the hypothesis $H_n^{0:k}$ is evaluated as:

$$S(\Omega_i^t \mid H_n^{0:k}) = e^{-(k-t)/\tau}\; p(\Omega_i^t \mid H_n^{0:k})\; p(\Omega_i^t). \tag{12}$$

The term $p(\Omega_i^t) = e^{-\gamma (1 - r_i^t)}$ is the GOP prior computed from the GOP stability score $r_i^t$. The final score of the hypothesis $S(H_n^{0:k})$ is a summation over its inlier GOP scores, weighted by temporal decay. The parameter $\tau$ regulates the extent of temporal decay and $\gamma$ regulates the influence of the GOP prior.

TABLE I. Jaccard score and class accuracy for our 7 classes. (Columns: Object, Road, Building, Tree/Bush, Sign/Pole, Dirt/Grass, Sky, Average. Rows: Jaccard, Acc.)

C. Shape Model Measurement Integration

Our tracker relies on raw 3D depth estimates for the computation of GOP associations and selection costs. Because individual stereo-based 3D measurements are very imprecise, we integrate 3D measurements of inlier GOPs $I_n^{0:k}$ over time to create a stable 3D representation of the hypotheses. We continuously build hypothesis shape representations $S_n^{0:k}$ by integrating the GOP measurements in a voxel grid and computing occupancy probabilities of the voxel grid cells. We perform integration in a two-step procedure: first, we reconstruct the point cloud representation of the integrated model; second, we register model points with the associated inlier GOP points $S_i^t$ and update the shape model $S_n^{0:k}$ with new measurements.

Model Initialization. We initialize the model by centering a fixed-size regular voxel grid at the center of mass of the first inlier GOP of the hypothesis $H_n^{0:k}$ and initialize each voxel grid cell $c_j \in S_n^{0:k}$ with $p(c_j)_0$, the probability that a measured point falls into the cell (normalized count of the points falling into the cell).

Model Update. To update the shape model $S_n^{0:k}$ with new GOP measurements $S_i^t$, we center the voxel grid representation of the integrated model at the last position (world coordinates) of the hypothesis $H_n^{0:k}$ and reconstruct points with the highest occupancy probability along the camera ray.
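The model initialization step described above can be sketched as follows. The sketch centers a fixed-size grid at the center of mass of a point set and assigns each occupied voxel the fraction of points that fall into it. The voxel size, the grid dimension and the choice to normalize by the number of points inside the grid are assumptions made for illustration; the text only states that a normalized count is used.

```python
import math
from collections import Counter


def init_voxel_grid(points, voxel_size=0.1, grid_dim=32):
    """Center a grid_dim^3 voxel grid at the center of mass of the points.

    Returns (origin, probs): origin is the minimum corner of the grid and probs
    maps an occupied voxel index (ix, iy, iz) to its normalized point count.
    """
    if not points:
        raise ValueError("at least one point is required")
    n = len(points)
    center = tuple(sum(p[axis] for p in points) / n for axis in range(3))
    half_extent = grid_dim * voxel_size / 2.0
    origin = tuple(c - half_extent for c in center)

    counts = Counter()
    for p in points:
        index = tuple(math.floor((p[axis] - origin[axis]) / voxel_size) for axis in range(3))
        if all(0 <= i < grid_dim for i in index):  # ignore points outside the fixed-size grid
            counts[index] += 1

    total = sum(counts.values())
    probs = {index: count / total for index, count in counts.items()} if total else {}
    return origin, probs


# Hypothetical measurement: two points at the center, one on each side along x.
example_points = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (-1.0, 0.0, 0.0)]
origin, probs = init_voxel_grid(example_points, voxel_size=0.5, grid_dim=8)
```

Only occupied voxels are stored, which keeps the representation sparse; a dense array would work equally well for a small fixed-size grid.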
We align the shape model $S_n^{0:k}$ to the new measurement $S_i^t$ using a weighted Iterative Closest Point (ICP) algorithm. For efficient updates, we consider the cells $c_j$ independent and use a Binary Bayes Filter to update the occupancy probabilities of each cell [36]. The state transition model applies an exponential decay towards the uniform distribution.

VII. EXPERIMENTAL EVALUATION

In this section we conduct a series of experiments to first evaluate the individual stages of our approach and then assess overall performance. As a test bed we use the well known KITTI dataset [1]. All experiments are performed on the KITTI tracking training set. As we perform general object tracking and do not single out specific classes, the standard evaluation pipeline on the KITTI test set is not suitable for our approach. None of the methods evaluated in the remainder of the paper use the training set as input. This enables us to use it as a valid test bed.

A. Semantic Segmentation

To show the validity of our segmentation algorithm itself, we compare our approach to three recent baselines [21], [22], [20], which each provide ground truth annotations for a different set of images and semantic categories within the KITTI [1] dataset. Only an approximate comparison can be provided, as the approaches use different depth maps and thus label slightly different parts of the image. Ladický et al. [21] even estimate a dense semantic map without depth information, whereas our method provides semantic labels only for image pixels with a corresponding depth estimate. However, even with this rough comparison, Table II shows that our semantic segmentation obtains competitive results.

Segmentation Dataset. For our complete pipeline, we trained our semantic segmentation classifier on a total of 203 annotated images extracted across the KITTI odometry dataset (we will publicly release this data upon publication). In our annotations we labeled the following classes: building, car, curb, grass/dirt, person, pole, road, sky, sidewalk, sign, surface marking, tree/bush and wall. For our approach we group person and car into a single object class.
For the remaining pipeline, the semantic segmentation is used as an initial step to filter out regions which are unlikely to belong to an object. Therefore, its main goal is to be able to distinguish between object and non-object regions, rather than separating (non-)object classes. While our annotated dataset contains a total of 13 object categories, we merge them into object and non-object classes for evaluation. We qualitatively and quantitatively found that better results can be obtained by using more than only two classes for training. We believe that this is the result of reducing the intra-class variance. In practice, curb, sidewalk and surface marking were merged into the road class. We also joined wall with building, and pole with sign. Table I shows both the class accuracy and Jaccard scores for these classes.

B. Object Proposal Generation

In Fig. 5 we compare our generic object proposal generation method with two relevant baselines. Difference-of-Normals (DoN) [17] demonstrated excellent results on KITTI 3D laser data [1]; EdgeBoxes [6] is a state-of-the-art appearance-based object proposal generation method (as shown in [19]). The code of both methods is publicly available. We use default parameters for EdgeBoxes [6] and their pre-trained edge detection model. For DoN we used the specified parameters from [17]. Fig. 5 (left) shows that our method requires 2 orders of magnitude fewer proposals than EdgeBoxes [6] to cover roughly 70% of the relevant targets (annotated in KITTI).

TABLE II. Class-accuracy comparison to other approaches. We train our approach on the different semantic annotations. Our results are averaged over 5 runs and gray cells represent classes not represented in a dataset. (Columns: Building, Car, Fence, Grass, Obstacle, Pole, Road, Sidewalk, Sign, Sky, Tree, Global, Average. Row pairs: Ladický [21] and our approach; Ros [22] and our approach; Xu [20] and our approach.)

Fig. 5. GOP recall. Left: Comparison of our proposal generation method and two baselines, Difference-of-Normals [17] and EdgeBoxes [6]. Right: Recall per occlusion level. (Axes: recall over the number of object proposals per frame. Left curves: our method, DoN, EdgeBoxes. Right curves: fully visible, partially occluded, mostly occluded, total.)

With 30 object proposals per frame, DoN has a similar saturation point as our method, but achieves only 40% recall. Fig. 5 (right) shows the recall of our method under varying amounts of occlusion. As can be seen, our approach achieves good recall for the mostly visible objects. For partially occluded objects, our method reports 2D bounding boxes only spanning the visible area, while the KITTI annotations cover the whole object (even if it is not actually visible). As our method is not aware of object categories, no class-specific size heuristics can be applied. EdgeBoxes [6] does not require depth data, but needs too many proposals to be applicable to our problem. We observed that DoN [17] produces very relevant and compact proposals, but only in the close camera range.

C. Tracking

In this section we demonstrate competitive performance on car and pedestrian categories compared to other state-of-the-art detection-based approaches on the KITTI tracking dataset [1]. We will show that our proposed tracks include the categories annotated in KITTI. Evaluation of the tracking performance of our approach is non-trivial as we do not have category knowledge for the tracked objects.
This means that we do not know whether a trajectory represents, e.g., a car or a pedestrian; it is just a generic object. In particular, the category-specific precision metrics become meaningless, as the confidence in a tracked object does not rely on its category. We compare to two state-of-the-art tracking-by-detection methods [2], [5], for which we obtained tracking results from the authors. Fig. 7 (left) shows a frame-level recall evaluation for cars and pedestrians as a function of the distance from the camera. In the short camera range (25m) we outperform the other methods in terms of recall, while they achieve a higher recall in the limit.

[Figure 7 plots: "Tracking Recall Comparison" (Car - Our method, Car - MCF, Car - CEM, Ped. - CEM, Ped. and groups - Our method, Ped. - Our method, Cyclist - Our method; recall over distance from camera in m) and "Precision vs. Recall - All Categories" (our integration method, GCT).]

Fig. 7. Left: Tracking recall compared to two baselines [5], [2] on pedestrian and car categories. Right: Precision vs. recall of our method for all categories in KITTI, using our voxel-grid based and the GCT based integration [9].

In the case of pedestrian tracking, the state-of-the-art method [2] outperforms our method by about 13 percentage points. We observed that this performance difference originates from the fact that we are simply not able to distinguish between individual pedestrians at the tracking level. Already at the proposal level, proposals for pedestrians walking close together are ranked higher, as the free space surrounding a group exceeds the free space around individuals. Again, this is due to the fact that our tracker has no category-related knowledge. In order to validate this effect, we also plot the performance when changing the annotations such that annotated pedestrians walking very close together are merged into a single hypothesis (see Fig. 7, left). To further show generalization to novel classes, we also report recall for the cyclist class in Fig. 7 (left). In Fig. 7 (right) we show a full precision-recall curve for all annotated objects in KITTI, based on the assumption that those annotations can be used as a proxy for all valid objects (in reality, not all objects are annotated).
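A frame-level recall evaluation over camera distance, as plotted in Fig. 7 (left), can be sketched as follows. The sketch assumes one record per annotated object per frame, holding its distance from the camera and a flag saying whether some track covered it; the binning into fixed distance intervals, the record layout and the function name are illustrative assumptions (the paper may instead accumulate recall up to a given distance).

```python
from typing import Iterable, List, Optional, Sequence, Tuple


def recall_by_distance(
    annotations: Iterable[Tuple[float, bool]],
    bin_edges: Sequence[float],
) -> List[Optional[float]]:
    """Recall per distance bin [edge_i, edge_{i+1}).

    `annotations` yields (distance_m, was_tracked) pairs, one per annotated
    object per frame. Bins without any annotation yield None.
    """
    n_bins = len(bin_edges) - 1
    hits = [0] * n_bins
    counts = [0] * n_bins
    for dist, tracked in annotations:
        for i in range(n_bins):
            if bin_edges[i] <= dist < bin_edges[i + 1]:
                counts[i] += 1
                hits[i] += int(tracked)
                break  # each annotation falls into at most one bin
    return [h / c if c else None for h, c in zip(hits, counts)]


if __name__ == "__main__":
    # Hypothetical per-frame annotations: (distance in metres, covered by a track?)
    annotations = [(5.0, True), (12.0, True), (18.0, False), (40.0, False)]
    print(recall_by_distance(annotations, [0, 10, 20, 30, 50]))
```

Because the tracker has no category knowledge, such an evaluation is run once per annotated category (car, pedestrian, cyclist) by filtering the annotations, rather than by filtering the tracks.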
Our approach can track about 50% of all annotated objects in a distance range of up to 30m. Experimentally, the voxel-grid-based integration method turned out to be more robust for tracking than the GCT approach [9]. This experiment also demonstrates the importance of robust shape integration. Qualitative results are shown in Fig. 6.

VIII. CONCLUSIONS

In this paper, we investigated how far we can get with a generic object tracking approach. In particular, we proposed a novel tracking pipeline whose key feature is tracking multiple objects simultaneously without explicitly learning a classifier for each category. This is an important step towards better scene understanding, where it is impossible to learn class-specific knowledge for everything of interest.

[Figure 6 color map: Building, Grass/Dirt, Object, Road, Sky, Sign/Pole, Tree/Bush.]

Fig. 6. Qualitative results on the KITTI tracking training set. Left: Semantic segmentation results. The label colors are shown in the color map at the bottom. Middle: Generic Object Proposals. Right: Tracking Results. The static objects are visualized with the gray bounding boxes.

We do not aim to replace detector-based tracking methods, but believe that an optimal tracking approach should combine the strengths of both paradigms, which we plan to address in future work. Towards our goal of general object tracking, we proposed a competitive semantic segmentation algorithm, a novel multi-scale object proposal generation stage that reaches high recall with few proposals, and a 3D tracker that achieves competitive results for close-range objects.

Acknowledgments: This work was funded by ERC Starting Grant project CV-SUPER (ERC-2012-StG-307432). We would like to thank Dennis Mitzel for helpful discussions.

REFERENCES

[1] A. Geiger, P. Lenz, and R. Urtasun, Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite, in CVPR.
[2] A. Milan, S. Roth, and K. Schindler, Continuous Energy Minimization for Multitarget Tracking, PAMI, vol. 36, no. 1, pp. 58–72, 2014.
[3] H. Pirsiavash, D. Ramanan, and C. C. Fowlkes, Globally-optimal Greedy Algorithms for Tracking a Variable Number of Objects, in CVPR, 2011.
[4] J. H. Yoon, M.-H. Yang, J. Lim, and K.-J. Yoon, Bayesian Multi-object Tracking Using Motion Context from Multiple Objects, in WACV.
[5] L. Zhang, Y. Li, and R. Nevatia, Global Data Association for Multi-Object Tracking Using Network Flows, in CVPR, 2008.
[6] C. L. Zitnick and P. Dollár, Edge Boxes: Locating Object Proposals from Edges, in ECCV, 2014.
[7] B. Alexe, T. Deselaers, and V. Ferrari, Measuring the Objectness of Image Windows, PAMI, vol. 34, no. 11.
[8] R. Kaestner, J. Maye, Y. Pilat, and R. Siegwart, Generative Object Detection and Tracking in 3D Range Data, in ICRA, 2012.
[9] D. Mitzel and B. Leibe, Taking Mobile Multi-Object Tracking to the Next Level: People, Unknown Objects, and Carried Items, in ECCV.
[10] A. Petrovskaya and S. Thrun, Model Based Vehicle Detection and Tracking for Autonomous Urban Driving, Autonomous Robots, vol. 26.
[11] A. Teichman and S. Thrun, Tracking-based semi-supervised learning, IJRR, vol. 31, no. 7.
[12] D. Held, J. Levinson, S. Thrun, and S. Savarese, Combining 3D Shape, Color, and Motion for Robust Anytime Tracking, in RSS.
[13] A. Bewley, V. Guizilini, F. Ramos, and B. Upcroft, Online Self-Supervised Multi-Instance Segmentation of Dynamic Objects, in ICRA.
[14] T.-N. Nguyen, B. Michaelis, A. Al-Hamadi, M. Tornow, and M. Meinecke, Stereo-Camera-Based Urban Environment Perception Using Occupancy Grid and Object Tracking, TITS, vol. 13, no. 1.
[15] D. Beymer and K. Konolige, Real-time tracking of multiple people using continuous detection, in IEEE Frame Rate Workshop.
[16] D. Z. Wang, I. Posner, and P. Newman, What Could Move? Finding Cars, Pedestrians and Bicyclists in 3D Laser Data, in ICRA.
[17] Y. Ioannou, B. Taati, R. Harrap, and M. A. Greenspan, Difference of Normals as a Multi-Scale Operator in Unorganized Point Clouds, in 3DIMPVT.
[18] M. Bansal, B. Matei, H. Sawhney, S.-H. Jung, and J. Eledath, Pedestrian Detection with Depth-guided Structure Labeling, in ICCV Workshops.
[19] J. Hosang, R. Benenson, and B. Schiele, How good are detection proposals, really? in BMVC.
[20] P. Xu, F. Davoine, J.-B. Bordes, H. Zhao, and T. Denoeux, Information Fusion on Oversegmented Images: An Application for Urban Scene Understanding, in MVA.
[21] L. Ladický, J. Shi, and M. Pollefeys, Pulling Things out of Perspective, in CVPR.
[22] G. Ros, A. Bakhtiary, S. Ramos, D. Vazquez, M. Granados, and A. M. Lopez, Vision-based Offline-Online Perception Paradigm for Autonomous Driving, in WACV.
[23] A. Geiger, M. Roser, and R. Urtasun, Efficient Large-Scale Stereo Matching, in ACCV.
[24] G. Heitz and D. Koller, Learning Spatial Context: Using Stuff to Find Things, in ECCV.
[25] A. Geiger, J. Ziegler, and C. Stiller, StereoScan: Dense 3d Reconstruction in Real-time, in Intell. Vehicles Symp. 11.
[26] J. Papon, A. Abramov, M. Schoeler, and F. Wörgötter, Voxel Cloud Connectivity Segmentation - Supervoxels for Point Clouds, in CVPR.
[27] J. Shotton, J. M. Winn, C. Rother, and A. Criminisi, TextonBoost for Image Understanding: Multi-Class Object Recognition and Segmentation by Jointly Modeling Texture, Layout, and Context, IJCV, vol. 81, no. 1, pp. 2–23.
[28] D. Munoz, N. Vandapel, and M. Hebert, Onboard Contextual Classification of 3-D Point Clouds with Learned High-order Markov Random Fields, in ICRA.
[29] L. Breiman, Random Forests, Machine Learning, vol. 45, no. 1, pp. 5–32.
[30] P. Krähenbühl and V. Koltun, Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials, in NIPS.
[31] A. Vedaldi and S. Soatto, Quick Shift and Kernel Methods for Mode Seeking, in ECCV.
[32] A. P. Witkin, Scale-Space Filtering: A New Approach To Multi-Scale Description, in ICASSP.
[33] R. Hartley and A. Zisserman, Multiple view geometry in computer vision. Cambridge University Press.
[34] B. Leibe, K. Schindler, N. Cornelis, and L. V. Gool, Coupled Object Detection and Tracking from Static Cameras and Moving Vehicles, PAMI, vol. 30, no. 10.
[35] D. Mitzel, E. Horbert, A. Ess, and B. Leibe, Multi-person Tracking with Sparse Detection and Continuous Segmentation, in ECCV.
[36] S. Thrun, W. Burgard, and D. Fox, Probabilistic Robotics (Intelligent Robotics and Autonomous Agents). The MIT Press, 2005.
