Extracting Adaptive Contextual Cues from Unlabeled Regions


Congcong Li, Cornell University (cl758@cornell.edu); Devi Parikh, Toyota Technological Institute at Chicago (dparikh@ttic.edu); Tsuhan Chen, Cornell University (tsuhan@ece.cornell.edu)

Abstract

Existing approaches to contextual reasoning for enhanced object detection typically utilize other labeled categories in the images to provide contextual information. As a consequence, they inadvertently commit to the granularity of information implicit in the labels. Moreover, large portions of the images may not belong to any of the manually chosen categories, and these unlabeled regions are typically neglected. In this paper, we overcome both these drawbacks and propose a contextual cue that exploits unlabeled regions in images. Our approach adaptively determines the granularity (scene, inter-object, intra-object, etc.) at which contextual information is captured. In order to extract the proposed contextual cue, we consider a scene to be a structured configuration of objects and regions, just as an object is a composition of parts. We thus learn our proposed contextual meta-objects using any off-the-shelf object detector, which makes our proposed cue widely accessible to the community. Our results show that incorporating our proposed cue provides a relative improvement of 12% over a state-of-the-art object detector on the challenging PASCAL dataset.

1. Introduction

Object recognition is one of the central problems in computer vision. Many recent works leverage contextual information surrounding the object-of-interest for enhanced recognition [5, 11, 12, 14, 19, 22]. These typically leverage labeled data from other object categories to learn contextual relationships. This leads to two undesirable consequences.

Unlabeled regions: First, regions in the images that are unlabeled are often neglected. In most scenarios, such as the popular MSRC [1] or PASCAL [7] datasets, not all regions in the images are accounted for by the manually chosen categories that are labeled. For instance, 28.18% of the pixels in MSRC and 54.74% of the pixels in the PASCAL 2007 dataset are not part of the labeled categories. See Figure 1. Do these unlabeled regions contain useful information that is worth capturing? We conduct human studies on recognizing objects in low-resolution images (see footnote 1) from the PASCAL dataset. We find that subjects perform better in scenarios where they see the entire image including the unlabeled regions (Figure 2). This indicates that the unlabeled regions, which are often discarded, indeed contain useful contextual information.

Footnote 1: Parikh et al. [19] showed that humans need contextual information for recognition only when the appearance information is impoverished, such as in low-resolution images.

Figure 1. (Panels, left to right: context for aeroplane, context for potted-plant, context for dining-table.) Many approaches to contextual reasoning for object detection model the relationship of the object-of-interest (yellow boxes) to other object categories labeled in the dataset. They do not leverage unlabeled regions in the images that do not belong to these manually chosen labeled categories, resulting in a highly myopic view of the scene (1st and 3rd rows). Our approach (blue boxes) intelligently leverages information present in these unlabeled regions to better detect the object of interest. From left to right: unlabeled regions like sky and grass provide context for aeroplanes, and unlabeled objects like windows and paintings are contextually relevant for potted-plants and dining-tables respectively. The granularities of our blue boxes relative to the yellow boxes are learned adaptively.
Adaptive granularity: Second, in focusing only on labeled regions in the image, models inadvertently limit the contextual information captured to the granularity implicit in the labels. For instance, most works exploring contextual models for the PASCAL dataset consider only object-level information, such as co-occurrence, relative location and relative size [5, 9]. We argue that the granularity at which contextual information is most helpful is category specific. While an aeroplane may only need to consider a neighboring sky region as context, a dining-table can benefit from the entire scene layout.

Figure 2. Human subjects were shown images excluding the unlabeled regions (left), as well as entire images (middle). The object to be recognized is shown with and without a yellow outline (to avoid distraction). Our results (right, recognition accuracy in %) indicate that subjects can recognize objects significantly more reliably if information from the unlabeled regions is available. These experiments were conducted on 394 PASCAL 2007 images containing 897 objects from 20 categories.

Moreover, while a sheep may be surrounded by relevant contextual information all around it, the most reliable contextual information for a bicycle is the person on top. Therefore, it is crucial that context is extracted from regions that adapt to different objects. In this work, we overcome both these drawbacks and propose a contextual cue that exploits unlabeled regions in images and automatically adapts to each object category.

Formulation: What kind of information is contextually most useful? Information that is relevant, i.e. has a consistent spatial location with respect to the object-of-interest, and information that is reliable, i.e. has a consistent appearance across images for reliable detection. Interestingly, these are also the aspects that make object models effective: capturing spatially coherent and visually consistent object parts. Instantiating the popular hierarchical view of scenes [18, 20, 25] (parts are to objects as objects are to scenes), we can cast the problem of learning relevant contextual regions in a scene as that of learning detectors for objects, which we call contextual meta-objects (CMOs). These CMOs form our proposed contextual cues. Any existing object detector can be used to learn our CMOs. In fact, the detector that one trains to detect objects-of-interest (OOIs) can be seamlessly (and conveniently so!) used to learn CMOs, essentially boosting the performance of the OOI detector without designing complex algorithms or cumbersome learning procedures.

Summary of approach: How can we extract meaningful contextual information from unlabeled regions? Our approach exploits the following key observation: object bounding boxes do not simply provide us with information about what the object looks like, which can be used to train a detector for the object. The presence of an object at a certain location in a natural image also provides an anchor point that suggests a meaningful alignment among the scenes containing these objects. Given images labeled with bounding-boxes of the OOI, we align and cluster the images such that each cluster contains images with similar contextual information surrounding the OOI. We identify a CMO region around the OOI that best captures this contextual information. Note that the granularity at which the CMO region is defined is not fixed, and adapts to the content of the images. Any off-the-shelf object detector can then be trained to detect our CMOs. During testing, we apply the CMO detector on the test image, and the score of the detection captures our contextual cue. This cue can be combined with the OOI detection score, or with other contextual cues, for enhanced OOI detection.

Contributions: In this work, we effectively exploit the unlabeled regions in images that are often neglected by most existing works, to extract contextually relevant cues for enhanced object detection. Our contributions are threefold. First, we discover contextual regions that automatically adapt to the object category of interest in order to capture contextual interactions at varying granularities (the entire scene, inter-object, even intra-object) for different categories.
Second, we cast the problem of extracting contextual regions in scenes into the problem of learning object models. This allows us to employ any off-the-shelf object detector to learn our contextual cue; a convenient choice being the object-of-interest detector whose performance we hope to enhance via contextual reasoning. This simplicity makes our proposed cue easily accessible to the community. Lastly, our approach achieves higher detection performance when using a single labeled category than the state-of-the-art approach [9] that utilizes labels for all 20 object categories on the PASCAL 2007 dataset. This demonstrates our effective use of unlabeled regions. We use our approach to boost the performance of several object detectors, and show that our contextual cue compares favorably to, and complements, other sources of context. Our adaptive selection of the granularity of contextual information outperforms the fixed-granularity counterpart.

2. Related Work

Many recent works leverage contextual information for enhanced recognition and localization of objects in natural images. Various sources of context have been explored, ranging from the global scene layout to interactions between objects and regions, as well as local features. Divvala et al. [6] survey and study the effectiveness of different contextual cues and combine them to achieve superior performance. We view our novel cue that extracts useful contextual information from unlabeled regions as complementary to existing cues, and it can be easily integrated in most contextual models.

Fixed-granularity models: Many existing works commit to a fixed granularity of contextual information. To incorporate scene-level information, Torralba et al. [26] use the statistics of low-level features across the entire scene to prime object detection. Hoiem et al. [13] use 3D scene information to provide priors on potential object locations. Park et al. [21] use ground plane estimation as a contextual cue for pedestrian detection. Probabilistic models have been proposed to capture the local interactions between neighboring regions [12, 14, 17], objects [5, 9, 19, 22, 30], or both [2, 10].

Sadeghi et al. [24] propose and detect visual phrases, which correspond to chunks of meaning bigger than objects and smaller than scenes. While the visual phrases in [24] are manually labeled in the dataset, our work can be thought of as learning visual phrases in an unsupervised way. Our learned composites, however, may not have a clearly defined semantic meaning. While most works focus on one level of interaction, Galleguillos et al. [10] explore contextual interactions at multiple levels. However, this multi-level aspect is gained by explicitly using different models for each interaction level. In contrast, our approach allows for adaptively picking different granularities of contextual information within the same framework.

Leveraging unlabeled information: A natural way to incorporate information from unlabeled regions in images is to build a global descriptor for the entire image to provide contextual cues. Though global image statistics show great potential in priming object detection [26], only very modest improvements can be achieved when they are applied to datasets like PASCAL (also confirmed in our experiments), where images have poor alignment due to the high variance in object scales, poses, etc. [3]. Instead, local neighboring regions tend to be more effective. Wolf et al. [29] sample contextual information from pre-defined relative locations. Dalal et al. [4] show that simply increasing the size of the person bounding box by a small amount boosts the accuracy of pedestrian detection. This is equivalent to leveraging potentially unlabeled regions in close proximity of the object-of-interest. Our approach instead automatically and adaptively determines the extent of contextual information to be captured around different object categories. Felzenszwalb et al. [9] often detect parts that lie slightly outside the sliding window, and use these detections to refine their bounding box prediction. Lee et al. [15] utilize a few labeled categories to discover other object categories in the background that have consistent appearance and are contextually coherent with the labeled categories. While similar in philosophy to our work, our solution is quite different, as is the problem setting of enhancing object detection.

3. Approach

In order to extract contextual cues that exploit unlabeled regions in images, we work in the most extreme setting where only one object category is labeled (with bounding-boxes) in the training dataset, leaving a large portion of the images unlabeled. We can seamlessly incorporate more labeled categories in our approach, as we describe later. Our goal then is to extract useful contextual regions, given images labeled with ground-truth bounding-boxes for just one object category, i.e. the object-of-interest (OOI). The ground-truth OOI may occupy different proportions of images across the dataset, as seen in Figure 3(a). Moreover, images may exhibit different contextual settings, as seen in Figure 3(b). To better model these, we first group the training images that exhibit similar contextual extent and content (Sections 3.1 and 3.2). As stated earlier, we cast our problem of extracting a contextual region in a scene as that of learning an object model, which we call a contextual meta-object (CMO), for which any off-the-shelf detector can be used. Each cluster of images is used to train a different component of our CMO detector (Section 3.3).

Figure 3. Context around objects (potted-plant and dog shown in yellow boxes) can vary in extent (a) as well as content (b).
During testing, we run both the trained OOI and CMO detectors, and the score of the latter provides contextual information for the former. While any contextual reasoning model can be used to integrate the two, in our implementation we adopt the simple contextual re-scoring scheme from [9] (Section 3.4).

3.1. Contextual-extent-based clustering

We consider the contextual extent of an image to be the portion of the image that lies outside of and surrounds the OOI. Intuitively, we wish to group all images that contain the OOI with similar poses, at similar scales, and in similar locations with respect to the rest of the scene. We use all the training images, as well as their left-right flipped versions. To enhance consistency in the data, we first divide the images into 2 groups: one containing images with the OOI ground-truth bounding boxes on the right, and the other containing images with the OOI on the left. We then employ the following procedure for both groups. Consider an image I consisting of n_B ground-truth OOI bounding-boxes {B_j}, j ∈ {1, ..., n_B}. I is described by a set of n_B five-dimensional descriptors F = {f_1, ..., f_j, ..., f_{n_B}}, where

f_j = [ (x_tl − x_j)/s_j,  (y_tl − y_j)/s_j,  (x_br − x_j)/s_j,  (y_br − y_j)/s_j,  h_j/w_j ]    (1)

where (x_tl, y_tl) and (x_br, y_br) are the coordinates of the top-left and bottom-right corners of the image I respectively, (x_j, y_j) are the coordinates of the center of the j-th ground-truth OOI bounding-box in image I, and s_j, h_j, and w_j are the scale, height and width of the j-th OOI bounding-box respectively. This descriptor captures the extent of the contextual information in the image (in terms of relative location and scale) with respect to the OOI present in the image, as well as the aspect ratio of the OOI. For each of the groups mentioned above, we cluster the descriptors ∪_{i=1}^{N} F_i collected from all N training images containing the OOI into K clusters using k-means clustering.
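To make this step concrete, here is a minimal Python sketch (not the authors' code) that builds the five-dimensional descriptor of Equation 1 for each ground-truth OOI box and runs k-means on the pooled descriptors. The helper names and the choice of sqrt-of-area as the box scale s_j are our own assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def extent_descriptor(img_w, img_h, ooi_box):
    """Five-dimensional descriptor of Eq. (1): the image corners expressed
    relative to the OOI box center and scale, plus the OOI aspect ratio.
    Assumption: the box scale s_j is taken as sqrt(width * height)."""
    x0, y0, x1, y1 = ooi_box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0   # OOI box center (x_j, y_j)
    w, h = x1 - x0, y1 - y0
    s = np.sqrt(w * h)
    return np.array([(0.0 - cx) / s,            # image top-left x:  (x_tl - x_j) / s_j
                     (0.0 - cy) / s,            # image top-left y
                     (img_w - cx) / s,          # image bottom-right x
                     (img_h - cy) / s,          # image bottom-right y
                     h / w])                    # OOI aspect ratio h_j / w_j

def cluster_by_extent(samples, K=3):
    """samples: one (img_w, img_h, ooi_box) triple per ground-truth OOI box,
    pooled over all training images of one left/right group (Section 3.1).
    Returns a cluster index per descriptor."""
    F = np.stack([extent_descriptor(w, h, b) for (w, h, b) in samples])
    return KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(F)
```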

This results in a total of 2K context-extent-based clusters of images. In our implementation, we use K = 3. For all images in the k-th cluster, k ∈ {1, ..., K}, we determine the extent of the CMO to be the largest bounding-box surrounding the OOI such that the CMO is entirely contained within at least 80% of the images in the cluster. This bounding-box indicates the presence and extent of the CMO we will learn via an off-the-shelf object detector. We note that since the extent of the CMO is not tied to the OOI bounding-box itself, and instead depends on the layout of the OOI in the scene, it can freely capture any relevant information in the image, including unlabeled regions, at any granularity. In fact, CMOs corresponding to the different clusters capture contextual information at different granularities for the same object category. We also note that a training image I containing n_B ground-truth OOI instances will have n_B corresponding instances of the CMO for training.
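The paper does not spell out the search for this largest 80%-contained box; as one illustration, the sketch below sets each side of the CMO box independently from per-image margins expressed in the OOI-normalized coordinates of Equation 1. It is an approximation under that assumption, not the authors' procedure.

```python
import numpy as np

def cmo_extent_for_cluster(descriptors, coverage=0.8):
    """descriptors: (N, 5) array of Eq. (1) descriptors for one cluster,
    i.e. image corners in OOI-normalized coordinates.
    Returns (x_left, y_top, x_right, y_bottom) of a CMO box around the OOI,
    in the same normalized coordinates.

    Each side is set to the margin that is available in at least `coverage`
    of the images; note that this per-side quantile rule does not guarantee
    that the same 80% of images contain all four sides simultaneously."""
    x_left  = np.quantile(descriptors[:, 0], coverage)        # negative margins to the left
    y_top   = np.quantile(descriptors[:, 1], coverage)        # negative margins above
    x_right = np.quantile(descriptors[:, 2], 1.0 - coverage)  # positive margins to the right
    y_bot   = np.quantile(descriptors[:, 3], 1.0 - coverage)  # positive margins below
    return (x_left, y_top, x_right, y_bot)
```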
3.2. Contextual-content-based clustering

While we ensure that the extent of context is similar among the images within the 2K clusters, the contextual setting or content of the images could be quite varied (Figure 3(b)). We therefore further cluster each of the 2K groups based on their content. We extract gist [27] features within the CMO box, and perform k-means clustering on these features to divide each of the 2K groups into two clusters. In our implementation, when fewer than 40 images are assigned to a cluster, we drop the cluster and re-assign the corresponding images to the remaining clusters. We therefore end up with M ≤ 4K clusters, where M varies with the category. For the PASCAL VOC 2007 dataset, we find a total of 198 clusters across the 20 categories. These clusters now have CMOs defined that are consistent in appearance as well as in spatial relationship with respect to the OOI, and thus have the potential to provide useful contextual information to enhance the OOI detection.

3.3. Training CMO detectors

We now describe how we use the above clusters to learn our CMO detector. As stated earlier, we can use any off-the-shelf object detector, which we treat as a black box parameterized by a model θ, learnt via a training procedure f_train that takes in training data of the form (P, N). Here P = {(I_1^+, B_1), ..., (I_k^+, B_k)} is a set of positive image and bounding-box pairs, and N = {I_1^-, ..., I_l^-} is a set of negative images. The training procedure can be viewed as

θ̂ = optimize f_train(P, N; θ).    (2)

The detector can then be evaluated on a test image I, via the inference procedure f_infer, to obtain a detection (B, s) consisting of a bounding box B and a score s:

(B, s) = f_infer(I; θ̂).    (3)

We see that the above formulation holds for a variety of detectors, be they sliding-window based [4, 9, 28] or Hough-transform based [16]. So how do we use one of these black-box detectors to train our CMO detector using the M clusters formed in Section 3.2? Instead of committing to the hard clustering, which was blind to the choice of detector that would follow, we train an M-component detector using an EM-style approach, similar to the strategy employed in [9]. We initialize the components using the above clustering, and train M detectors to obtain θ̂_m using f_train(P_m, N_m; θ), where P_m is the set of images in the m-th cluster together with the CMO windows they contain (Section 3.1), and N_m are negative images. We then infer these M components on all positive training images (across all M clusters), and re-assign each ground-truth CMO box B (Section 3.1) in the training set to the component that best explains it:

m* = argmax_{m ∈ {1,...,M}} (s_m − t_m),    (4)

s.t. (B_m, s_m) = f_infer(I; θ̂_m),  overlap(B_m, B) > 0.5    (5)

where t_m is a bias term used to normalize the scores across the components, and is set to the 5th percentile of the scores assigned by component m to all detections in the training images. Each component m is now re-trained, but with the new set of positive examples, to obtain an updated θ̂_m, and the iterations continue. We find that 3-5 iterations suffice. During testing, all M components are evaluated on the test image. Non-maximal suppression is used on the resultant detections to eliminate repeated detections.
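Putting Equations 2-5 together, the component training loop might look like the following sketch. The Detector interface, the iou helper and the make_detector factory are placeholders rather than the paper's implementation; the off-the-shelf detector itself stays opaque.

```python
import numpy as np

class Detector:
    """Minimal stand-in for the black-box detector of Eqs. (2)-(3)."""
    def train(self, positives, negatives):
        """positives: list of (image, box) pairs; negatives: list of images."""
        raise NotImplementedError
    def infer(self, image):
        """Return a list of (box, score) detections for one image."""
        raise NotImplementedError

def iou(a, b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def train_cmo_components(make_detector, clusters, negatives, n_iters=4):
    """EM-style training of the M CMO components (Eqs. 4-5), sketched.
    clusters: M lists of (image, cmo_box) pairs from Sections 3.1-3.2;
    make_detector: factory returning a fresh black-box Detector."""
    assignments = [list(c) for c in clusters]
    for _ in range(n_iters):
        all_pos = [p for cluster in assignments for p in cluster]
        detectors, biases = [], []
        for pos in assignments:
            det = make_detector()
            det.train(pos, negatives)            # retrain each component on its positives
            scores = [s for (img, _) in all_pos for (_, s) in det.infer(img)]
            biases.append(np.percentile(scores, 5) if scores else 0.0)  # bias t_m
            detectors.append(det)
        # Reassign every ground-truth CMO box to the component whose overlapping
        # detection has the highest bias-normalized score (falling back to its
        # current component when nothing overlaps).
        new_assignments = [[] for _ in detectors]
        for cur_m, cluster in enumerate(assignments):
            for (img, gt_box) in cluster:
                best_m, best_val = cur_m, -np.inf
                for m, det in enumerate(detectors):
                    for (box, score) in det.infer(img):
                        if iou(box, gt_box) > 0.5 and score - biases[m] > best_val:
                            best_m, best_val = m, score - biases[m]
                new_assignments[best_m].append((img, gt_box))
        assignments = new_assignments
    return detectors
```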

3.4. Contextual re-scoring

We now describe how we use our CMO to provide context to the OOI detection. We evaluate our CMO as well as the OOI detectors on the test image. The presence as well as the location of the CMO can provide useful contextual information; however, in this work we only consider the presence, i.e. the CMO detection score. Based on the CMO detections, we wish to re-score the detected OOI bounding boxes. While any contextual reasoning mechanism can be used to this end (such as [6, 9, 11, 23]), we train a classifier, similar to Felzenszwalb et al. [9]. Let D be the set of detections obtained from the OOI detector. Each detection (B, s) ∈ D is formed of a bounding-box with coordinates B = (x_tl, y_tl, x_br, y_br) (notation overloaded from previous sections) and a score s. Let s_C be the score of the highest-scoring CMO detection (across all M components) found in the image. To re-score an OOI detection (B, s), we build a 6-dimensional descriptor consisting of the original score of the OOI detection, the top-left and bottom-right bounding box coordinates normalized by the image size, and the contextual information provided by the CMO detection, captured via s_C:

ψ_1c = [σ(s), x_tl, y_tl, x_br, y_br, σ(s_C)]    (6)

where σ(x) = 1 / (1 + exp(−αx)) and α = 1.5, as in [8].

This ψ_1c descriptor is fed to a classifier h trained to separate correct OOI detections from false positives. The OOI bounding-box B is assigned a new score s' = h(ψ_1c). In our implementation, the classifier h is an SVM with a polynomial kernel (parameters set via cross-validation), similar to [8, 9]. The training data are obtained by running the trained OOI detector on labeled training data, and collecting correct detections and false positives. We note that ψ_1c uses labeled data from only one object category, and still captures contextual information.

Our approach is not restricted to using labels from only one object category. If more object categories are available, they can be seamlessly incorporated. A CMO detector would be trained for each labeled object category. At test time, let (D_1, ..., D_n) be the sets of OOI detections obtained for the n different object categories (for PASCAL, n = 20 if all categories are considered). Let (s_C1, ..., s_Cn) be the scores of the highest-scoring CMO detections for each of the corresponding n CMO detectors. The contextual descriptor to re-score an OOI detection (B, s) ∈ D_i is now (n + 5)-dimensional:

ψ_nc = [σ(s), x_tl, y_tl, x_br, y_br, σ(s_C1), ..., σ(s_Cn)].    (7)

Similar to traditional approaches that exploit context, Felzenszwalb et al. [9] only use other OOI object categories to provide context, via the descriptor

ψ_no = [σ(s), x_tl, y_tl, x_br, y_br, σ(s_D1), ..., σ(s_Dn)]    (8)

where s_Di is the score of the highest-scoring OOI detection from the i-th OOI category. We note that this descriptor is identical to the one used in [9], which we compare to in our experiments. Finally, a contextual descriptor capturing the contextual information provided by all n CMO and OOI detectors is given by

ψ_nco = [σ(s), x_tl, y_tl, x_br, y_br, γ(C), γ(D)]    (9)

γ(C) = [σ(s_C1), ..., σ(s_Cn)]    (10)

γ(D) = [σ(s_D1), ..., σ(s_Dn)].    (11)

A special case of Equation 9 for n = 1 differs from Equation 6 by one dimension, corresponding to σ(s_D), the score of the highest-scoring OOI detection. Since the use of σ(s_D) does not require additional training data, and is obtained by using labeled data from a single object category, we replace Equation 6 with the following in our experiments:

ψ_1co = [σ(s), x_tl, y_tl, x_br, y_br, σ(s_C), σ(s_D)].    (12)
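For illustration, the re-scoring features above can be assembled as follows. The sigmoid and descriptor layout follow Equations 6-12; the classifier (an sklearn SVC with a polynomial kernel whose degree is a guess to be set by cross-validation) is only a stand-in for the one used in [8, 9].

```python
import numpy as np
from sklearn.svm import SVC

def sigma(x, alpha=1.5):
    """Score normalization sigma(x) = 1 / (1 + exp(-alpha * x)), as in [8]."""
    return 1.0 / (1.0 + np.exp(-alpha * np.asarray(x, dtype=float)))

def rescoring_descriptor(det_score, box, img_w, img_h, cmo_scores, ooi_scores):
    """psi_nco of Eq. (9): normalized OOI score, normalized box coordinates,
    and the highest CMO / OOI detection scores per category.
    For the single-category case (Eq. 12), pass one CMO and one OOI score."""
    x0, y0, x1, y1 = box
    coords = [x0 / img_w, y0 / img_h, x1 / img_w, y1 / img_h]
    return np.concatenate([[float(sigma(det_score))], coords,
                           sigma(cmo_scores), sigma(ooi_scores)])

def train_rescorer(X, y):
    """X: descriptors collected by running the trained OOI detector on the
    training set; y: 1 for correct detections, 0 for false positives."""
    clf = SVC(kernel="poly", degree=2)   # degree is an assumption; set via cross-validation
    clf.fit(np.asarray(X), np.asarray(y))
    return clf

def rescore(clf, descriptor):
    """New score for an OOI detection, per Section 3.4."""
    return clf.decision_function(descriptor.reshape(1, -1))[0]
```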
Table 1. Average precision (AP) for all 20 categories in PASCAL VOC 2007 (aero, bike, bird, boat, bottle, bus, car, cat, chair, cow, table, dog, horse, mbike, person, plant, sheep, sofa, train, tv), mean AP across the 20 categories, and the average relative improvement over the Base method, for: Base w/o context [9], Scene ([26]), EXO ([4]), and CMO (ours). All methods listed use labels from only one object category. A reference in parentheses means the method is similar in spirit to the reference work.

Figure 4. Left: The proportion of highest-scoring CMO detections on positive testing images occupied by different content: OOI 37.2%, other categories 17.5%, unlabeled regions 45.3%. Right: Histogram, over the number of CMO components, of the proportion of a CMO component detection occupied by the OOI. The truly multi-granular nature of our learnt contextual cues is evident.

4. Experiments and Results

We evaluate our approach using the PASCAL VOC challenge dataset and protocol [7], which contains 9963 images of realistic scenes with ground-truth bounding boxes for 20 object categories. While we provide results with other object detectors in subsequent experiments, we first perform several comparisons and analyses using the publicly available implementation [8] of the state-of-the-art deformable parts-based object detector [9] (see footnote 2).

Footnote 2: We are not proposing a novel object detector. Instead, we will be performing a variety of comparisons to demonstrate the effectiveness of our proposed contextual cue. As recommended by the challenge organizers [7], we work with the 2007 test set, which is the latest PASCAL test set with publicly available annotations.

4.1. Quantitative results

We first provide several quantitative evaluations, followed by some qualitative illustrations.

Useful contextual information from unlabeled regions: We first evaluate whether our proposed cue captures useful contextual information extracted from the unlabeled regions. We compare the performance of the baseline OOI detector to the contextually re-scored detector, using our proposed cue as the source of contextual information as described in Equation 12. The results can be seen in Table 1 (Base w/o context vs. CMO). We see that across all 20 categories, our method outperforms the state-of-the-art OOI detector, indicating that our contextual cue does in fact extract useful contextual information from unlabeled regions. Further, in Figure 4 (left) we show the average proportion of the CMO detections occupied by different contents (OOI, other labeled categories, and unlabeled). We see that almost half of the CMO detections cover unlabeled regions. We also see that about 1/5 of the CMO detections capture other labeled categories in the images, even though our contextual cue does not use these labels to explicitly learn contextual relationships between the OOI and other categories.

Table 2. Average precision (AP) for all 20 categories in PASCAL VOC 2007, mean AP across the 20 categories, and the average relative improvement over the Base method in Table 1, for 20OOI ([9]) and 20OOI+20CMO. All methods listed use labels from all 20 object categories.

Table 3. The mean AP across the 20 categories in PASCAL VOC 2007 for fusing various sources of contextual information. With one category label for training: Scene+EXO, CMO (34.2), Scene+EXO+CMO. With all 20 category labels: 20OOI, 20OOI+20CMO, 20OOI+Scene+EXO, 20OOI+20CMO+Scene+EXO (35.3).

Adaptive scaling helps: We now compare our contextual cue to two cues that also exploit the unlabeled regions, but at a fixed granularity. The first cue (Scene) captures the entire scene, and thus operates at the global granularity, similar to the work of Torralba et al. [26]. We train a binary RBF-kernel SVM classifier on the gist descriptors [27] extracted from the entire images to discriminate between images with and without the object-of-interest. We then use the same re-scoring scheme as in Equation 12, replacing σ(s_C) with σ(s_g), where s_g is the score of the gist-based classifier for the test image. The second cue (EXO) is local in nature. We expand the OOI bounding-box by a fixed amount (similar to [4]) of 20% in all four directions. Instead of training our CMO on the coordinates determined in Section 3.1, we use this expanded OOI bounding-box to learn a contextual object which we call EXO. We use the same re-scoring scheme as in Equation 12, replacing σ(s_D) with σ(s_E), where s_E is the highest score of the EXO detections. Note that all three approaches only require labels from a single object category. Table 1 shows that our approach (CMO) outperforms both fixed-granularity methods. This demonstrates our ability to effectively determine the granularity of useful contextual interactions. Further, Figure 4 (right) shows the distribution of the average proportion of CMO component detections occupied by the OOI. High values (right of the histogram) correspond to contextual cues that capture local context around the OOI, while low values (left of the histogram) correspond to models that capture scene-level context with respect to a relatively small OOI. The large variance in the distribution demonstrates the truly multi-granular nature of the contextual information learnt.
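As a sketch of these two fixed-granularity baselines (not the authors' code): the Scene cue is a whole-image classifier on a gist-like descriptor, and the EXO cue is a fixed 20% expansion of the OOI box; gist_descriptor is assumed to be supplied by an external GIST implementation.

```python
import numpy as np
from sklearn.svm import SVC

def train_scene_cue(images, has_ooi, gist_descriptor):
    """'Scene' baseline: an RBF-kernel SVM on whole-image gist descriptors,
    discriminating images with vs. without the object-of-interest.
    gist_descriptor(image) -> 1-d feature vector is assumed to be supplied."""
    X = np.stack([gist_descriptor(img) for img in images])
    clf = SVC(kernel="rbf")
    clf.fit(X, np.asarray(has_ooi, dtype=int))
    return clf                      # clf.decision_function(.) plays the role of s_g

def exo_box(ooi_box, img_w, img_h, expand=0.2):
    """'EXO' baseline: grow the OOI box by a fixed 20% in all four directions
    (clipped to the image) and train a contextual detector on the result."""
    x0, y0, x1, y1 = ooi_box
    w, h = x1 - x0, y1 - y0
    return (max(0.0, x0 - expand * w), max(0.0, y0 - expand * h),
            min(float(img_w), x1 + expand * w), min(float(img_h), y1 + expand * h))
```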
Complementary cue: We now test the ability of our proposed cue to provide complementary contextual information. We consider several different sources of context popularly explored in the literature: the global scene-level context (Scene) and the local context (EXO) described above, as well as the object-level context provided by other labeled objects in the images, be it a subset of the categories (nOOI) or all 20 (20OOI). In Table 1 we saw that CMO performs better than the global and local context individually. Table 3 shows that our proposed cue learnt from a single labeled category is comparable to (in fact slightly outperforms) the contextual information provided by all 20 labeled object categories (20OOI). We note that 20OOI corresponds exactly to the contextual approach used by Felzenszwalb et al. [9]. Hence, we see that our contextual cue performs better than any of the individual sources of context, including ones that utilize significantly larger amounts of labels.

Similar to [6], we analyze the performance of fusing various sources of contextual information. We use the same re-scoring method as described in Section 3.4 for combining the different cues, since we wish to evaluate the information captured by the cues, and not particular contextual reasoning techniques. Table 3 shows the results of fusing different combinations of the above-mentioned contextual cues. We append the highest score for each contextual cue being fused to re-score the OOI detection (similar to Equation 9). We see that our proposed cue individually performs better than the combination of both global and local context (Scene+EXO). Moreover, fusing our cue with these existing sources of context provides a further boost in performance, demonstrating the truly complementary nature of our cue that extracts contextual information from unlabeled regions at adaptive granularities.

We now compare our approach that leverages unlabeled regions to learn contextual cues (nOOI + nCMO, Equation 9) to the baseline approach of [9] (nOOI, Equation 8) that only utilizes other labeled categories, as the number of labeled categories is varied. For each object category (OOI), we pick the n categories with highest mutual information (based on co-occurrence of categories with the OOI category across the training data) to provide contextual information. Figure 5 shows the trends. The green point at n = 0 gives the mean AP of the OOI detector using no context.

Figure 5. The effect of the number of labeled categories used for learning contextual information (n) on the mean AP over the 20 categories, for nOOI + nCMO (our algorithm) and nOOI (Felzenszwalb et al. [9]).

Table 4. Detection results (mean AP over the 20 PASCAL 2007 categories) with and without the proposed CMO as additional contextual information, using different black-box detectors: ISM [16] (gain 23.8%), HOG-SVM [4] (gain 11.2%), and the Part-based Model [9] (gain 8.4%). Some entries are based on our implementation of the reference work.

We can see that across the board, our approach leads to better AP than [9] while using the same amount of labeled data, demonstrating our effective use of unlabeled regions. A break-down of the AP across categories when using all 20 labeled categories for both methods can be seen in Table 2. We see that incorporating our contextual cue, which leverages unlabeled information in addition to the 20 labeled object categories (20OOI+20CMO), provides improvements over 20OOI ([9]) in 19 out of 20 categories and matches the remaining category. Our contextual cue on average provides a relative improvement of 12.05% over the state-of-the-art detector. We observe significant relative improvements of more than 20% in some categories such as bird, cat, dining-table, and dog.

Other detectors: As mentioned in Section 3, our proposed contextual cue can be extracted via any object detector, and can in turn be used to enhance the performance of the detector for the OOI. We demonstrate that here. In addition to the sliding-window part-based deformable model (Parts-based Model) by Felzenszwalb et al. [9] that we use in the above experiments, we consider two other popular detectors: the Implicit Shape Model (ISM) [16], which is Hough-transform based, and the HOG-SVM detector [4] (also sliding window). These are used as the black-box modules to train our CMO detectors, using the procedure described in Section 3.3. Table 4 shows the mean AP across all 20 categories in the PASCAL 2007 dataset achieved by our implementation of these detectors, with and without the context provided by the CMO. The results show that our proposed contextual cue (CMO) can be learnt via any detector, and consistently improves object detection performance with a significant gain. Note that we use the same detector for both the OOI and the CMO, indicating that no additional techniques are required for achieving these performance gains.

4.2. Qualitative results

Figure 6 shows some example CMO detections using the deformable parts-based model [9] as the detector. As quantitatively demonstrated earlier, we see that the CMO bounding-boxes contain many regions that are not labeled in the dataset, such as the sky region for aeroplanes, the road for bicycles, the coffee table and wall paintings for sofas, windows for potted-plants, etc. Although our approach uses labeled data from only one object category to learn the contextual cues, we learn meaningful relationships among objects as well as object parts. For example, the CMO detections for bicycle in the 2nd column of Figure 6 consistently include a person's body as part of the CMO. In the 5th and 6th columns, we show detections of two CMO components corresponding to the person category. We see that both components seem to capture two people, but while the left column models two people further apart from each other, the right column detects two people close together. We also see an intra-object CMO for cat in the last column. The false positives shown in Figure 6 clearly demonstrate that our learnt CMO models fire at contextually relevant regions in images.

Figure 7. The spatial maps indicate the likelihood of a person being present given the CMO detections (red boxes). We see that the CMO provides strong priming for the locations of OOIs.
Finally, as seen in Figure 4, these examples qualitatively demonstrate that our cues can adaptively learn contextual information at different granularities in the scenes.

Encouraged by the meaningful spatial interactions observed in Figure 6 (especially for the person category), we elaborate on that aspect a little further. As we demonstrated earlier, our learnt CMO models often include objects from other labeled categories (e.g. the bicycle CMO often includes a person, a person CMO often includes another person, etc.). By examining the CMO detections on training images, we can learn a distribution of the location of each labeled category in the dataset relative to our CMO models. These spatial distributions can now be used as a prior to better guide the detection of an OOI. We show a few examples of these spatial maps in Figure 7. We also display the HOG feature visualizations for the corresponding CMOs. While in this work we only leverage CMOs to provide co-occurrence based context, exploring the potential of our models to capture scene configurations and provide explicit spatial priming for localizing objects is part of future work.

5. Conclusion

Regions that are not accounted for in the manually chosen list of categories labeled in a dataset are often neglected by existing works on contextual reasoning. In this work, we exploit these unlabeled regions to extract adaptive contextual cues for enhanced object detection. We utilize the labeled object bounding-box as an anchor to align scenes and learn spatially consistent and visually identifiable contextual regions. The granularity of these regions is adaptively and automatically tuned for different categories, and captures scene-level, inter-object as well as intra-object interactions.

Figure 6. Examples of the detected contextual meta-objects (CMO). Each column shows the CMO detections for a specific category. From left to right: aeroplane, bicycle, sofa, potted-plant, person, person, cat. Each image shows the highest-scoring CMO detection. The red box indicates the CMO bounding-box, while the white boxes represent the region filters within the CMO as learnt by the deformable parts-based model [9]. The first four rows show true-positive detections, and the last row shows false-positive detections. Please refer to the authors' webpages for more results.

We cast the problem of learning our proposed contextual cue into that of learning object models. This allows us to utilize any off-the-shelf object detector to learn our proposed contextual meta-objects. We present convincing quantitative and qualitative results on the challenging PASCAL VOC 2007 dataset, where we improve on the performance of several object detectors and compare favorably to existing sources of context. The benefits of the adaptive granularity at which we extract context, and the potential of our cue to provide complementary information in addition to existing cues, are also demonstrated. These improvements do not rely on advanced modeling techniques or learning algorithms, and intelligently leverage existing technology, making them widely accessible.

Acknowledgements: This research was supported in part by the National Science Foundation under IIS.

References

[1] MSRC 21-class Dataset.
[2] M. Blaschko and C. Lampert. Object localization with global and local context kernels. In BMVC.
[3] M. Choi, J. Lim, A. Torralba, and A. Willsky. Exploiting hierarchical context on a large database of object categories. In CVPR.
[4] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR.
[5] C. Desai, D. Ramanan, and C. Fowlkes. Discriminative models for multi-class object layout. In ICCV.
[6] S. Divvala, D. Hoiem, J. Hays, A. Efros, and M. Hebert. An empirical study of context in object detection. In CVPR.
[7] M. Everingham, L. Van Gool, C. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge.
[8] P. Felzenszwalb, R. Girshick, and D. McAllester. Discriminatively trained deformable part models, release 4. pff/latentrelease4/.
[9] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE Transactions on PAMI, 32(9).
[10] C. Galleguillos, B. McFee, S. Belongie, and G. Lanckriet. Multi-class object localization by combining local contextual interactions. In CVPR.
[11] C. Galleguillos, A. Rabinovich, and S. Belongie. Object categorization using co-occurrence, location and appearance. In CVPR, Anchorage, AK.
[12] G. Heitz and D. Koller. Learning spatial context: Using stuff to find things. In ECCV.
[13] D. Hoiem, A. Efros, and M. Hebert. Putting objects in perspective. In CVPR.
[14] S. Kumar and M. Hebert. A hierarchical field framework for unified context-based classification. In ICCV.
[15] Y. Lee and K. Grauman. Object-graphs for context-aware category discovery. In CVPR.
[16] B. Leibe, A. Leonardis, and B. Schiele. Robust object detection with interleaved categorization and segmentation. Int. J. Comput. Vision, 77.
[17] J. Lim, P. Arbeláez, C. Gu, and J. Malik. Context by region ancestry. In ICCV.
[18] K. Murphy, A. Torralba, and W. Freeman. Using the forest to see the trees: A graphical model relating features, objects, and scenes. In NIPS.
[19] D. Parikh, C. Zitnick, and T. Chen.
From appearance to context-based recognition: Dense labeling in small images. In CVPR.
[20] D. Parikh, C. Zitnick, and T. Chen. Unsupervised learning of hierarchical spatial structures in images. In CVPR.
[21] D. Park, D. Ramanan, and C. Fowlkes. Multiresolution models for object detection. In ECCV.
[22] A. Rabinovich, A. Vedaldi, C. Galleguillos, E. Wiewiora, and S. Belongie. Objects in context. In ICCV.
[23] N. Rasiwasia and N. Vasconcelos. Holistic context modeling using semantic co-occurrences. In CVPR.
[24] M. Sadeghi and A. Farhadi. Recognition using visual phrases. In CVPR.
[25] E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. Learning hierarchical models of scenes, objects, and parts. In ICCV.
[26] A. Torralba. Contextual priming for object detection. Int. J. Comput. Vision, 53(2).
[27] A. Torralba, A. Oliva, M. Castelhano, and J. Henderson. Contextual guidance of eye movements and attention in real-world scenes: the role of global features in object search. Psychol. Rev., 113(4).
[28] A. Vedaldi, V. Gulshan, M. Varma, and A. Zisserman. Multiple kernels for object detection. In ICCV.
[29] L. Wolf and S. Bileschi. A critical view of context. Int. J. Comput. Vision, 69.
[30] B. Yao and L. Fei-Fei. Modeling mutual context of object and human pose in human-object interaction activities. In CVPR, 2010.


More information

Joint Object and Part Segmentation using Deep Learned Potentials

Joint Object and Part Segmentation using Deep Learned Potentials Jont Object and Part Segmentaton usng Deep Learned Potentals Peng Wang 1 Xaohu Shen 2 Zhe Ln 2 Scott Cohen 2 Bran Prce 2 Alan Yulle 1 1 Unversty of Calforna, Los Angeles 2 Adobe Research Abstract Segmentng

More information

An Entropy-Based Approach to Integrated Information Needs Assessment

An Entropy-Based Approach to Integrated Information Needs Assessment Dstrbuton Statement A: Approved for publc release; dstrbuton s unlmted. An Entropy-Based Approach to ntegrated nformaton Needs Assessment June 8, 2004 Wllam J. Farrell Lockheed Martn Advanced Technology

More information

For instance, ; the five basic number-sets are increasingly more n A B & B A A = B (1)

For instance, ; the five basic number-sets are increasingly more n A B & B A A = B (1) Secton 1.2 Subsets and the Boolean operatons on sets If every element of the set A s an element of the set B, we say that A s a subset of B, or that A s contaned n B, or that B contans A, and we wrte A

More information

A Fast Content-Based Multimedia Retrieval Technique Using Compressed Data

A Fast Content-Based Multimedia Retrieval Technique Using Compressed Data A Fast Content-Based Multmeda Retreval Technque Usng Compressed Data Borko Furht and Pornvt Saksobhavvat NSF Multmeda Laboratory Florda Atlantc Unversty, Boca Raton, Florda 3343 ABSTRACT In ths paper,

More information

CMPSCI 670: Computer Vision! Object detection continued. University of Massachusetts, Amherst November 10, 2014 Instructor: Subhransu Maji

CMPSCI 670: Computer Vision! Object detection continued. University of Massachusetts, Amherst November 10, 2014 Instructor: Subhransu Maji CMPSCI 670: Computer Vson! Object detecton contnued Unversty of Massachusetts, Amherst November 10, 2014 Instructor: Subhransu Maj No class on Wednesday Admnstrva Followng Tuesday s schedule ths Wednesday

More information

12/2/2009. Announcements. Parametric / Non-parametric. Case-Based Reasoning. Nearest-Neighbor on Images. Nearest-Neighbor Classification

12/2/2009. Announcements. Parametric / Non-parametric. Case-Based Reasoning. Nearest-Neighbor on Images. Nearest-Neighbor Classification Introducton to Artfcal Intellgence V22.0472-001 Fall 2009 Lecture 24: Nearest-Neghbors & Support Vector Machnes Rob Fergus Dept of Computer Scence, Courant Insttute, NYU Sldes from Danel Yeung, John DeNero

More information

Weakly supervised graph based semantic segmentation by learning communities of image-parts

Weakly supervised graph based semantic segmentation by learning communities of image-parts Weakly supervsed graph based semantc segmentaton by learnng communtes of mage-parts Nloufar Pouran, S. Karthkeyan, and B.S. Manjunath Department of Electrcal and Computer Engneerng, Unversty of Calforna,

More information

Large-scale Web Video Event Classification by use of Fisher Vectors

Large-scale Web Video Event Classification by use of Fisher Vectors Large-scale Web Vdeo Event Classfcaton by use of Fsher Vectors Chen Sun and Ram Nevata Unversty of Southern Calforna, Insttute for Robotcs and Intellgent Systems Los Angeles, CA 90089, USA {chensun nevata}@usc.org

More information

Manifold-Ranking Based Keyword Propagation for Image Retrieval *

Manifold-Ranking Based Keyword Propagation for Image Retrieval * Manfold-Rankng Based Keyword Propagaton for Image Retreval * Hanghang Tong,, Jngru He,, Mngjng L 2, We-Yng Ma 2, Hong-Jang Zhang 2 and Changshu Zhang 3,3 Department of Automaton, Tsnghua Unversty, Bejng

More information

Assignment # 2. Farrukh Jabeen Algorithms 510 Assignment #2 Due Date: June 15, 2009.

Assignment # 2. Farrukh Jabeen Algorithms 510 Assignment #2 Due Date: June 15, 2009. Farrukh Jabeen Algorthms 51 Assgnment #2 Due Date: June 15, 29. Assgnment # 2 Chapter 3 Dscrete Fourer Transforms Implement the FFT for the DFT. Descrbed n sectons 3.1 and 3.2. Delverables: 1. Concse descrpton

More information

Edge Detection in Noisy Images Using the Support Vector Machines

Edge Detection in Noisy Images Using the Support Vector Machines Edge Detecton n Nosy Images Usng the Support Vector Machnes Hlaro Gómez-Moreno, Saturnno Maldonado-Bascón, Francsco López-Ferreras Sgnal Theory and Communcatons Department. Unversty of Alcalá Crta. Madrd-Barcelona

More information

IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 25, NO. 4, APRIL

IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 25, NO. 4, APRIL IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 25, NO. 4, APRIL 2016 1713 Weakly Supervsed Fne-Graned Categorzaton Wth Part-Based Image Representaton Yu Zhang, Xu-Shen We, Janxn Wu, Member, IEEE, Janfe Ca,

More information

Heterogeneous Visual Features Fusion via Sparse Multimodal Machine

Heterogeneous Visual Features Fusion via Sparse Multimodal Machine 013 IEEE Conference on Computer Vson and Pattern Recognton Heterogeneous Vsual Features Fuson va Sparse Multmodal Machne Hua ang, Fepng Ne, Heng Huang, Chrs Dng Department of Electrcal Engneerng and Computer

More information

BOOSTING CLASSIFICATION ACCURACY WITH SAMPLES CHOSEN FROM A VALIDATION SET

BOOSTING CLASSIFICATION ACCURACY WITH SAMPLES CHOSEN FROM A VALIDATION SET 1 BOOSTING CLASSIFICATION ACCURACY WITH SAMPLES CHOSEN FROM A VALIDATION SET TZU-CHENG CHUANG School of Electrcal and Computer Engneerng, Purdue Unversty, West Lafayette, Indana 47907 SAUL B. GELFAND School

More information

Discrete-Continuous Depth Estimation from a Single Image

Discrete-Continuous Depth Estimation from a Single Image Dscrete-Contnuous Depth Estmaton from a Sngle Image Maomao Lu, Matheu Salzmann, Xumng He NICTA and CECS, ANU, Canberra {maomao.lu, matheu.salzmann, xumng.he}@ncta.com.au Abstract In ths paper, we tackle

More information

Classifier Selection Based on Data Complexity Measures *

Classifier Selection Based on Data Complexity Measures * Classfer Selecton Based on Data Complexty Measures * Edth Hernández-Reyes, J.A. Carrasco-Ochoa, and J.Fco. Martínez-Trndad Natonal Insttute for Astrophyscs, Optcs and Electroncs, Lus Enrque Erro No.1 Sta.

More information

SLAM Summer School 2006 Practical 2: SLAM using Monocular Vision

SLAM Summer School 2006 Practical 2: SLAM using Monocular Vision SLAM Summer School 2006 Practcal 2: SLAM usng Monocular Vson Javer Cvera, Unversty of Zaragoza Andrew J. Davson, Imperal College London J.M.M Montel, Unversty of Zaragoza. josemar@unzar.es, jcvera@unzar.es,

More information

arxiv: v2 [cs.cv] 9 Apr 2018

arxiv: v2 [cs.cv] 9 Apr 2018 Boundary-senstve Network for Portrat Segmentaton Xanzh Du 1, Xaolong Wang 2, Dawe L 2, Jngwen Zhu 2, Serafettn Tasc 2, Cameron Uprght 2, Stephen Walsh 2, Larry Davs 1 1 Computer Vson Lab, UMIACS, Unversty

More information

Support Vector Machines

Support Vector Machines /9/207 MIST.6060 Busness Intellgence and Data Mnng What are Support Vector Machnes? Support Vector Machnes Support Vector Machnes (SVMs) are supervsed learnng technques that analyze data and recognze patterns.

More information

Classifier Swarms for Human Detection in Infrared Imagery

Classifier Swarms for Human Detection in Infrared Imagery Classfer Swarms for Human Detecton n Infrared Imagery Yur Owechko, Swarup Medasan, and Narayan Srnvasa HRL Laboratores, LLC 3011 Malbu Canyon Road, Malbu, CA 90265 {owechko, smedasan, nsrnvasa}@hrl.com

More information

Fitting & Matching. Lecture 4 Prof. Bregler. Slides from: S. Lazebnik, S. Seitz, M. Pollefeys, A. Effros.

Fitting & Matching. Lecture 4 Prof. Bregler. Slides from: S. Lazebnik, S. Seitz, M. Pollefeys, A. Effros. Fttng & Matchng Lecture 4 Prof. Bregler Sldes from: S. Lazebnk, S. Setz, M. Pollefeys, A. Effros. How do we buld panorama? We need to match (algn) mages Matchng wth Features Detect feature ponts n both

More information

Machine Learning. Support Vector Machines. (contains material adapted from talks by Constantin F. Aliferis & Ioannis Tsamardinos, and Martin Law)

Machine Learning. Support Vector Machines. (contains material adapted from talks by Constantin F. Aliferis & Ioannis Tsamardinos, and Martin Law) Machne Learnng Support Vector Machnes (contans materal adapted from talks by Constantn F. Alfers & Ioanns Tsamardnos, and Martn Law) Bryan Pardo, Machne Learnng: EECS 349 Fall 2014 Support Vector Machnes

More information

Steps for Computing the Dissimilarity, Entropy, Herfindahl-Hirschman and. Accessibility (Gravity with Competition) Indices

Steps for Computing the Dissimilarity, Entropy, Herfindahl-Hirschman and. Accessibility (Gravity with Competition) Indices Steps for Computng the Dssmlarty, Entropy, Herfndahl-Hrschman and Accessblty (Gravty wth Competton) Indces I. Dssmlarty Index Measurement: The followng formula can be used to measure the evenness between

More information

Deep Spatial-Temporal Joint Feature Representation for Video Object Detection

Deep Spatial-Temporal Joint Feature Representation for Video Object Detection sensors Artcle Deep Spatal-Temporal Jont Feature Representaton for Vdeo Object Detecton Baojun Zhao 1,2, Boya Zhao 1,2 ID, Lnbo Tang 1,2, *, Yuq Han 1,2 and Wenzheng Wang 1,2 1 School of Informaton and

More information

An Evaluation of Divide-and-Combine Strategies for Image Categorization by Multi-Class Support Vector Machines

An Evaluation of Divide-and-Combine Strategies for Image Categorization by Multi-Class Support Vector Machines An Evaluaton of Dvde-and-Combne Strateges for Image Categorzaton by Mult-Class Support Vector Machnes C. Demrkesen¹ and H. Cherf¹, ² 1: Insttue of Scence and Engneerng 2: Faculté des Scences Mrande Galatasaray

More information

Complex Numbers. Now we also saw that if a and b were both positive then ab = a b. For a second let s forget that restriction and do the following.

Complex Numbers. Now we also saw that if a and b were both positive then ab = a b. For a second let s forget that restriction and do the following. Complex Numbers The last topc n ths secton s not really related to most of what we ve done n ths chapter, although t s somewhat related to the radcals secton as we wll see. We also won t need the materal

More information

Improving Web Image Search using Meta Re-rankers

Improving Web Image Search using Meta Re-rankers VOLUME-1, ISSUE-V (Aug-Sep 2013) IS NOW AVAILABLE AT: www.dcst.com Improvng Web Image Search usng Meta Re-rankers B.Kavtha 1, N. Suata 2 1 Department of Computer Scence and Engneerng, Chtanya Bharath Insttute

More information

Smoothing Spline ANOVA for variable screening

Smoothing Spline ANOVA for variable screening Smoothng Splne ANOVA for varable screenng a useful tool for metamodels tranng and mult-objectve optmzaton L. Rcco, E. Rgon, A. Turco Outlne RSM Introducton Possble couplng Test case MOO MOO wth Game Theory

More information

A PATTERN RECOGNITION APPROACH TO IMAGE SEGMENTATION

A PATTERN RECOGNITION APPROACH TO IMAGE SEGMENTATION 1 THE PUBLISHING HOUSE PROCEEDINGS OF THE ROMANIAN ACADEMY, Seres A, OF THE ROMANIAN ACADEMY Volume 4, Number 2/2003, pp.000-000 A PATTERN RECOGNITION APPROACH TO IMAGE SEGMENTATION Tudor BARBU Insttute

More information

R s s f. m y s. SPH3UW Unit 7.3 Spherical Concave Mirrors Page 1 of 12. Notes

R s s f. m y s. SPH3UW Unit 7.3 Spherical Concave Mirrors Page 1 of 12. Notes SPH3UW Unt 7.3 Sphercal Concave Mrrors Page 1 of 1 Notes Physcs Tool box Concave Mrror If the reflectng surface takes place on the nner surface of the sphercal shape so that the centre of the mrror bulges

More information

The Research of Support Vector Machine in Agricultural Data Classification

The Research of Support Vector Machine in Agricultural Data Classification The Research of Support Vector Machne n Agrcultural Data Classfcaton Le Sh, Qguo Duan, Xnmng Ma, Me Weng College of Informaton and Management Scence, HeNan Agrcultural Unversty, Zhengzhou 45000 Chna Zhengzhou

More information

Performance Evaluation of Information Retrieval Systems

Performance Evaluation of Information Retrieval Systems Why System Evaluaton? Performance Evaluaton of Informaton Retreval Systems Many sldes n ths secton are adapted from Prof. Joydeep Ghosh (UT ECE) who n turn adapted them from Prof. Dk Lee (Unv. of Scence

More information

Subspace clustering. Clustering. Fundamental to all clustering techniques is the choice of distance measure between data points;

Subspace clustering. Clustering. Fundamental to all clustering techniques is the choice of distance measure between data points; Subspace clusterng Clusterng Fundamental to all clusterng technques s the choce of dstance measure between data ponts; D q ( ) ( ) 2 x x = x x, j k = 1 k jk Squared Eucldean dstance Assumpton: All features

More information

Color Image Segmentation Using Multispectral Random Field Texture Model & Color Content Features

Color Image Segmentation Using Multispectral Random Field Texture Model & Color Content Features Color Image Segmentaton Usng Multspectral Random Feld Texture Model & Color Content Features Orlando J. Hernandez E-mal: hernande@tcnj.edu Department Electrcal & Computer Engneerng, The College of New

More information

Recognizing Faces. Outline

Recognizing Faces. Outline Recognzng Faces Drk Colbry Outlne Introducton and Motvaton Defnng a feature vector Prncpal Component Analyss Lnear Dscrmnate Analyss !"" #$""% http://www.nfotech.oulu.f/annual/2004 + &'()*) '+)* 2 ! &

More information

An Optimal Algorithm for Prufer Codes *

An Optimal Algorithm for Prufer Codes * J. Software Engneerng & Applcatons, 2009, 2: 111-115 do:10.4236/jsea.2009.22016 Publshed Onlne July 2009 (www.scrp.org/journal/jsea) An Optmal Algorthm for Prufer Codes * Xaodong Wang 1, 2, Le Wang 3,

More information

Outline. Type of Machine Learning. Examples of Application. Unsupervised Learning

Outline. Type of Machine Learning. Examples of Application. Unsupervised Learning Outlne Artfcal Intellgence and ts applcatons Lecture 8 Unsupervsed Learnng Professor Danel Yeung danyeung@eee.org Dr. Patrck Chan patrckchan@eee.org South Chna Unversty of Technology, Chna Introducton

More information

What is Object Detection? Face Detection using AdaBoost. Detection as Classification. Principle of Boosting (Schapire 90)

What is Object Detection? Face Detection using AdaBoost. Detection as Classification. Principle of Boosting (Schapire 90) CIS 5543 Coputer Vson Object Detecton What s Object Detecton? Locate an object n an nput age Habn Lng Extensons Vola & Jones, 2004 Dalal & Trggs, 2005 one or ultple objects Object segentaton Object detecton

More information

Adaptive Transfer Learning

Adaptive Transfer Learning Adaptve Transfer Learnng Bn Cao, Snno Jaln Pan, Yu Zhang, Dt-Yan Yeung, Qang Yang Hong Kong Unversty of Scence and Technology Clear Water Bay, Kowloon, Hong Kong {caobn,snnopan,zhangyu,dyyeung,qyang}@cse.ust.hk

More information

CROWD Analysis is fundamental to solving many realword

CROWD Analysis is fundamental to solving many realword 1986 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 37, NO. 10, OCTOBER 2015 Detectng Humans n Dense Crowds Usng Locally-Consstent Scale Pror and Global Occluson Reasonng Haroon Idrees,

More information

Multi-View Face Alignment Using 3D Shape Model for View Estimation

Multi-View Face Alignment Using 3D Shape Model for View Estimation Mult-Vew Face Algnment Usng 3D Shape Model for Vew Estmaton Yanchao Su 1, Hazhou A 1, Shhong Lao 1 Computer Scence and Technology Department, Tsnghua Unversty Core Technology Center, Omron Corporaton ahz@mal.tsnghua.edu.cn

More information

Feature Kernel Functions: Improving SVMs Using High-level Knowledge

Feature Kernel Functions: Improving SVMs Using High-level Knowledge Feature Kernel Functons: Improvng SVMs Usng Hgh-level Knowledge Qang Sun, Gerald DeJong Department of Computer Scence, Unversty of Illnos at Urbana-Champagn qangsun@uuc.edu, dejong@cs.uuc.edu Abstract

More information