Unsupervised object segmentation in video by efficient selection of highly probable positive features


Emanuela Haller 1,2 and Marius Leordeanu 1,2
1 University Politehnica of Bucharest, Romania
2 Institute of Mathematics of the Romanian Academy, Romania
haller.emanuela@gmail.com, marius.leordeanu@imar.ro

Abstract

We address an essential problem in computer vision, that of unsupervised foreground object segmentation in video, where a main object of interest in a video sequence should be automatically separated from its background. An efficient solution to this task would enable large-scale video interpretation at a high semantic level in the absence of costly manual labeling. We propose an efficient unsupervised method for generating foreground object soft masks based on automatic selection and learning from highly probable positive features. We show that such features can be selected efficiently by taking into consideration the spatio-temporal appearance and motion consistency of the object in the video sequence. We also emphasize the role of the contrasting properties between the foreground object and its background. Our model is created over several stages: we start from pixel level analysis and move to descriptors that consider information over groups of pixels combined with efficient motion analysis. We also prove theoretical properties of our unsupervised learning method, which under some mild constraints is guaranteed to learn the correct classifier even in the unsupervised case. We achieve competitive and even state of the art results on the challenging YouTube-Objects and SegTrack datasets, while being at least one order of magnitude faster than the competition. We believe that the strong performance of our method, along with its theoretical properties, constitutes a solid step towards solving unsupervised discovery in video.

1. Introduction

Unsupervised learning in video is a very challenging task in computer vision. Fully solving this problem would shed new light on our understanding of intelligence from a scientific perspective.
It would also have a strong impact in many real-world applications, as large datasets of unlabeled videos could be collected at a relatively low cost. There are several different published approaches for unsupervised learning and discovery of the salient object in video [20, 12, 17, 16], but most have a high computational cost. In general, algorithms for unsupervised mining and clustering are expected to be computationally expensive due to the inherent combinatorial nature of the problem [7].

In this paper we address the computational cost challenge and propose a method that is both accurate and fast. We achieve our goal based on a key insight: we focus on selecting and learning from features that are highly correlated with the presence of the object of interest and can be rapidly selected and computed. Note: in this paper, when referring to highly probable positive features, we use "feature" to indicate a feature vector sample, not a feature type. While we do not require these features to cover all instances and parts of the object of interest (we could expect low recall), we show that it is possible to find, in the unsupervised case, positive features with high precision (a large number of those selected are indeed true positives). Then we prove theoretically that we can reliably train an object classifier using sets of positive and negative samples, both selected in an unsupervised way, as long as the set of features considered to be positive has high precision, regardless of the recall, if certain conditions are met (and they are often met in practice). We present an algorithm that can effectively and rapidly achieve this task in practice, in an unsupervised way, with state-of-the-art results in difficult experiments, while being at least 10x faster than its competition. The proposed method outputs both the soft-segmentation of the main object of interest and its bounding box. Two examples are shown in Figure 1.
While we do not make any assumption about the type of object present in the video, we do expect the sequence to contain a single salient object, as our method performs foreground soft-segmentation and does not expect videos with no salient object or with multiple objects of interest.

Figure 1. Qualitative results of our method, which provides the soft-segmentation of the main object of interest and its bounding box.

The key insights that led to our formulation and algorithm are the following:

1) First, the foreground and background are complementary and in contrast to each other - they have different sizes, appearance and movements. We observed that the more we can take advantage of these contrasting properties the better the results, in practice. While the background occupies most of the image, the foreground is usually small and has distinct color and movement patterns - it stands out against its background scene.

2) The second main idea of our approach is that we should use this foreground-background complementarity in order to automatically select, with high precision, foreground features, even if the expected recall is low. Then, we could reliably use those samples as positives, and the rest as negatives, to train a classifier for detecting the main object of interest. We present this formally in Sec. 2.2.

These insights lead to our two main contributions in this paper: first, we show theoretically that by selecting features that are positive with high probability, a robust classifier for foreground regions can be learned. Second, we present an efficient method based on this insight, which in practice outperforms its competition on many different object classes, while being 10x faster.

Related work on object discovery in video: The task of object discovery in video has been tackled for many years, with early approaches being based on local features matching [20, 12]. Current literature offers a wide range of solutions, with varying degrees of supervision, going from fully unsupervised methods [17, 16] to partially supervised ones [10, 25, 24, 11, 21] - which start from region, object or segmentation proposals estimated by systems trained in a supervised manner [1, 4, 3]. Some methods also require user input for the first frame of the video [8].
Most object discovery approaches that produce a fine shape segmentation of the object also make use of off-the-shelf shape segmentation methods [19, 5, 14, 2, 15].

2. Approach

Our method receives as input a video sequence, in which there is a main object of interest, and it outputs its soft-segmentation masks and associated bounding boxes. The proposed approach has, as starting point, a processing stage based on principal component analysis of the video frames, which provides an initial soft-segmentation of the object - similar to the recent VideoPCA algorithm introduced as part of the object discovery approach of [21]. This soft-segmentation usually has high precision but may have low recall. Starting from this initial stage that classifies pixels independently based only on their individual color, next we learn a higher level descriptor that considers groups of pixel colors and is able to capture higher order statistics about the object properties, such as different color patterns and textures. During the last stage we combine the soft-segmentation based on appearance with foreground cues computed from the contrasting motion of the main object vs. its scene. The resulting method is accurate and fast (about 3 fps in Matlab, 2.60GHz CPU - see Sec. 3.3). Our code is available online.

Below, we summarize the steps of our approach (also see Figure 2), in relation with Algorithm 1 (the pseudocode of our approach).

Step 1: select highly probable foreground pixels based on the differences between the original frames and the frames projected on their subspace with principal component analysis (Sec. 2.1, Alg. 1 - lines [2, 5]).

Step 2: estimate empirical color distributions for foreground and background from the pixel masks computed at Step 1. Use these distributions to estimate the probability of foreground for each pixel independently based on its color (see "Initial soft-segmentation" in Sec. 2.1, Alg. 1 - line 6).

Step 3: improve the soft-segmentation from Step 2, by projection on the subspace of soft-segmentations (Sec. 2.3, Alg. 1 - lines [7, 9]).

Step 4: re-estimate empirical color distributions for foreground and background from the pixel masks updated at Step 3. Use these distributions to estimate the probability of foreground for each pixel independently based on its color (see "Initial soft-segmentation" in Sec. 2.1, Alg. 1 - line 10).

Step 5: learn a discriminative classifier of foreground regions with regularized least squares regression on the soft segmentation real output in [0, 1]. Use a feature vector that considers groups of colors that co-occur in larger patches. Run the classifier at each pixel location in the video and produce an improved per-frame foreground soft-segmentation (Sec. 2.4, Alg. 1 - lines [11, 15]).

Step 6: combine the soft-segmentation using appearance (Step 5) with foreground motion cues efficiently computed by modeling the background motion. Obtain the final soft-segmentation (Sec. 2.5, Alg. 1 - lines [16, 23]).

Step 7: Optional: refine the segmentation using GrabCut [19], by considering as potential foreground and background samples the pixels given by the soft-segmentation from Step 6 (Sec. 2.5).

Figure 2. Algorithm overview. a) original image b) first pixel-level appearance model, based on initial object cues (Step 1 & Step 2) c) refined pixel-level appearance model, built from the projection of soft-segmentation (Step 3 & Step 4) d) patch-level appearance model (Step 5) e) motion estimation mask (part of Step 6) f) final soft-segmentation mask (Step 6).

We reiterate: our algorithm has at its core two main ideas. The first is that the object and the background have contrasting properties in terms of size, appearance and movement. This insight makes it possible to reliably select a few regions in the video that are highly likely to belong to the object. The second idea, which brings certain formal guarantees, is that if we are able to select, in an unsupervised manner, even a small portion of the foreground object, but with high precision, then, under some reasonable assumptions, we could train a robust foreground-background classifier that can be used for the automatic discovery of the object.
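The color model of Steps 2 and 4 can be sketched as follows. This is a minimal illustration, not the authors' Matlab code: it assumes colors are already quantized to integer indices, that the current foreground selection is a hard boolean mask, and that likelihoods are relative color counts combined with Bayes' rule under equal priors, as the paper specifies in Sec. 2.1. The function name `foreground_probability` and its parameters are hypothetical.

```python
import numpy as np

def foreground_probability(color_idx, fg_mask, n_colors):
    """Per-pixel p(fg | c) from empirical color counts, assuming equal priors.

    color_idx: int array of quantized color indices (any shape, e.g. a whole video).
    fg_mask:   bool array of the same shape, True for pixels currently selected
               as highly probable foreground.
    """
    colors = color_idx.ravel()
    fg = fg_mask.ravel().astype(float)
    n_c = np.bincount(colors, minlength=n_colors).astype(float)   # n(c)
    n_c_fg = np.bincount(colors, weights=fg, minlength=n_colors)  # n(c, fg)
    n_c_bg = n_c - n_c_fg                                         # n(c, bg)
    seen = n_c > 0                                                # colors that occur at all
    p_c_fg = np.zeros(n_colors)
    p_c_bg = np.zeros(n_colors)
    p_c_fg[seen] = n_c_fg[seen] / n_c[seen]                       # p(c | fg) as defined in the text
    p_c_bg[seen] = n_c_bg[seen] / n_c[seen]                       # p(c | bg), computed the same way
    p_fg_c = np.full(n_colors, 0.5)                               # unseen colors: no evidence (assumption)
    p_fg_c[seen] = p_c_fg[seen] / (p_c_fg[seen] + p_c_bg[seen])   # Bayes' rule with equal priors
    return p_fg_c[color_idx]                                      # look up each pixel's color
```

Because the counts are accumulated over every pixel passed in, calling the function on a stacked array of all frames gathers the statistics from the whole video, which is what makes the model robust to a noisy initial mask.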
In Table 1 we present the improvements in precision, recall and F-measure between the different steps of our algorithm. Note that the arrows go from the precision and recall of the samples initially considered to be positive, to the precision and recall of the pixels finally classified as positive. The significant improvement in F-measure is explained by our theoretical result (stated in Proposition 1), which shows that under certain conditions, a reliable classifier will be learned even if the recall of the corrupted positive samples is low, as long as the precision is relatively high. In Table 2 we introduce quantitative results of the different stages of our method, along with the associated execution times.

Table 1. Evolution of precision, recall and F-measure of the feature samples considered as positives (foreground) at different stages of our method (SegTrack dataset). We start with a corrupted set of positive samples with high precision and low recall, and improve both precision and recall through the stages of our method. Thus the soft masks become more and more accurate from one stage to the next.
Columns: Step 1&2, Step 3&4, Step 5
Rows: precision, recall, F-measure

Table 2. Performance analysis and execution time for all stages of our method.
Columns: Step 1&2, Step 3&4, Step 5, Step 6
Rows: F-meas. (SegTrack), F-meas. (YTO), Runtime (sec/frame)
Algorithm 1 Video object segmentation
1: get input frames $F_i$
2: PCA($A_1$) => $V_1$ eigenvectors; $A_1(i,:) = F_i(:)$
3: $R_1 = \bar{A}_1 + (A_1 - \bar{A}_1) V_1 V_1^T$ - reconstruction
4: $P_{1i} = d(A_1(i,:), R_1(i,:))$
5: $P_{1i}$ = $P_{1i}$ with Gaussian $G_{\sigma_1}$ applied
6: $P_1$ => pixel-level appearance model => $S_1$
7: PCA($A_2$) => $V_2$ eigenvectors; $A_2(i,:) = S_{1i}(:)$
8: $R_2 = \bar{A}_2 + (A_2 - \bar{A}_2) V_2 V_2^T$ - reconstruction
9: $P_{2i}$ = $R_2(i,:)$ with Gaussian $G_{\sigma_2}$ applied
10: $P_2$ => pixel-level appearance model => $S_2$
11: $D$ - data matrix containing patch-level descriptors
12: $s$ - patch labels extracted from $S_2$
13: select $k$ features from $D$ => $D_s$
14: $w = (\lambda I + D_s^T D_s)^{-1} D_s^T s$
15: evaluate => patch-level appearance model => $S_3$
16: for each frame $i$ do
17:   compute $I_x$, $I_y$ and $I_t$
18:   build motion matrix $D_m$
19:   $w_m = (D_m^T D_m)^{-1} D_m^T I_t$
20:   compute motion model $M_i$
21:   $M_i$ = $M_i$ smoothed with Gaussian $G_\sigma$
22:   combine $S_{3i}$ and $M_i$ => $S_{4i}$
23: end for
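Lines 2-4 of Algorithm 1 can be sketched in NumPy as below. This is an illustrative sketch under stated assumptions, not the released implementation: the principal components are taken from an SVD of the mean-centered frame matrix, the distance $d$ is assumed to be a per-pixel absolute difference, and the function name `pca_reconstruction_error` is hypothetical. The Gaussian step of line 5 is omitted.

```python
import numpy as np

def pca_reconstruction_error(frames, n_components=3):
    """Per-pixel PCA reconstruction error for a stack of frames.

    frames: array of shape (n_frames, H, W) or (n_frames, H, W, C).
    Returns an array of the same shape; large values mark pixels that the
    low-dimensional (background) subspace fails to explain.
    """
    n = frames.shape[0]
    A = frames.reshape(n, -1).astype(float)       # A(i, :) = F_i(:)
    mean = A.mean(axis=0, keepdims=True)          # average frame
    centered = A - mean
    # Rows of vt are the principal directions of the centered data.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    V = vt[:n_components].T                       # (n_pixels, n_components)
    R = mean + (centered @ V) @ V.T               # reconstruction from the subspace
    err = np.abs(A - R)                           # assumed distance d: absolute difference
    return err.reshape(frames.shape)
```

The product is grouped as `(centered @ V) @ V.T` so that no pixels-by-pixels matrix is ever formed, which matters for real frame sizes.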

2.1. Select highly probable object regions

We estimate the initial foreground regions by Principal Component Analysis, an approach similar to the recent method for soft foreground segmentation, VideoPCA [21]. Other approaches for soft foreground discovery could have been applied here, such as [26, 6, 9], but we have found the direction using PCA to be both fast and reliable and to fit perfectly with the later stages of our method. The principal components will represent a linear subspace of the background, as the object is expected to be an outlier, not obeying the principal variation observed in the video, thus harder to reconstruct. At this step, we project the frames on the resulting subspace and compute reconstruction error images as differences between the original frames and their PCA reconstructed counterparts. If the principal components are $u_i$, $i \in [1 \ldots n_u]$ (we used $n_u = 3$), and frame $f$ projected on the subspace is $f_r \approx f_0 + \sum_{i=1}^{n_u} ((f - f_0)^T u_i) u_i$, where $f_0$ is the average frame, then we compute the error image $f_{diff} = |f - f_r|$. High value pixels in the error image are more likely to belong to the foreground. If we further smooth these regions with a large enough Gaussian and multiply the resulting smoothed difference with another large centered Gaussian (which favors objects in the center of the image), we obtain soft foreground masks that have high precision (most pixels on these masks indeed belong to true foreground), even though they often have low recall (only a small fraction of all object pixels are selected). As discussed, high precision and low recall is all we need at this stage (see Table 1).

Initial soft-segmentation

Considering the small fraction of the object regions obtained at the previous step, the initial whole object soft segmentation is computed by capturing foreground and background color distributions, followed by an independent pixel-wise classification. Let $p(c|fg)$ and $p(c|bg)$ be the true foreground ($fg$) and background ($bg$) probabilities for a given color $c$.
Using Bayes formula with equal priors, we compute the probability of foreground for a given pixel, with an associated color $c$, as $p(fg|c) = \frac{p(c|fg)}{p(c|fg) + p(c|bg)}$. The foreground color likelihood is computed as $p(c|fg) = \frac{n(c,fg)}{n(c)}$, where $n(c,fg)$ is the number of considered foreground pixels having color $c$ and $n(c)$ is the total number of pixels having color $c$. The background color likelihood is computed in a similar manner. Note that when computing the color likelihoods, we take into consideration information gathered from the whole movie, obtaining a robust model. The initial soft segmentation produced here is not optimal but it is computed fast (20 fps) and is of sufficient quality to ensure the good performance of the subsequent stages. The first two steps of the method follow the VideoPCA algorithm first proposed in [21]. In Sec. 2.2 we present and prove our main theoretical result (Proposition 1), which explains in large part why our approach is able to produce accurate object segmentation in an unsupervised way.

2.2. Learning with HPP features

Figure 3. Learning with HPP feature vectors. Essentially, Proposition 1 shows that we could learn a reliable discriminative classifier from a small set of corrupted positive samples, with the rest being considered negatives, if the corrupted positive set contains mostly good features such that the ratio of true positives in the corrupted positive set is greater than the overall ratio of true positives. This assumption can often be met in practice and efficiently used for unsupervised learning.

In Proposition 1 we show that a classifier trained on corrupted sets of positive and negative samples can learn the right thing, as if true positives and negatives were used for training, if the following condition is met: the set of corrupted positives should contain positive samples in a proportion that is greater than the overall proportion of true positives in the whole training set.
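As a toy illustration of this condition, the sketch below checks on a small discrete feature space whether a likelihood-ratio decision made from the corrupted sets (inside and outside the selected set S) matches the decision made from the true class likelihoods. All numbers, the function names, and the assumption that the class-conditional likelihoods do not depend on S are illustrative choices, not values from the paper.

```python
def mixture(p_pos, lik_pos, lik_neg):
    """p(x | set) when a fraction p_pos of the set's samples are true positives."""
    return [p_pos * lp + (1.0 - p_pos) * ln for lp, ln in zip(lik_pos, lik_neg)]

def decisions_agree(lik_pos, lik_neg, prior_pos, p_S, precision_S):
    """True if 'p(x|S) > p(x|not S)' gives the same answer as
    'p(x|E+) > p(x|E-)' for every feature value x.

    prior_pos:   overall proportion of true positives, p(E+).
    p_S:         fraction of all samples placed in the selected set S.
    precision_S: proportion of true positives inside S, p(E+ | S).
    """
    # Sum rule: p(E+) = p(E+|S) p(S) + p(E+|not S) (1 - p(S)).
    p_pos_not_S = (prior_pos - precision_S * p_S) / (1.0 - p_S)
    px_S = mixture(precision_S, lik_pos, lik_neg)
    px_not_S = mixture(p_pos_not_S, lik_pos, lik_neg)
    return all((a > b) == (lp > ln)
               for a, b, lp, ln in zip(px_S, px_not_S, lik_pos, lik_neg))

lik_pos = [0.6, 0.3, 0.1]   # hypothetical p(x | E+) over three feature values
lik_neg = [0.1, 0.2, 0.7]   # hypothetical p(x | E-)

# Precision in S (0.7) above the overall positive rate (0.2): decisions match.
print(decisions_agree(lik_pos, lik_neg, prior_pos=0.2, p_S=0.1, precision_S=0.7))
# Precision in S (0.1) below the overall positive rate: decisions flip.
print(decisions_agree(lik_pos, lik_neg, prior_pos=0.2, p_S=0.1, precision_S=0.1))
```

In the first call only 35% of the true positives fall inside S (recall 0.35), yet the decisions still agree, which is the low-recall, high-precision regime the method relies on.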
This proposition is the basis for both stages of our method, the one that classifies pixels independently based on their colors and the second in which we consider higher order color statistics among groups of pixels.

Let us start with the example in Figure 3, where we have selected a set of samples $S$ (inside the box) as being positive. The set $S$ has high precision (most samples are indeed positive), but low recall (most true positives are wrongly labeled). Next we show that the sets $S$ and $\neg S$ could be used reliably (as defined in Proposition 1, below) to train a binary classifier. Let $p(E_+)$ and $p(E_-)$ be the true distributions of positive and negative elements, and $p(x|S)$ and $p(x|\neg S)$ be the probabilities of observing a sample inside and outside the considered positive set $S$ and negative set $\neg S$, respectively.

Proposition 1 (learning from highly probable positive (HPP) features): Considering the following hypotheses $H_1$: $p(E_+) < q < p(E_-)$, $H_2$: $p(E_+|S) > q > p(E_-|S)$, where $q \in (0,1)$, and $H_3$: $p(x|E_+)$ and $p(x|E_-)$ are independent of $S$, then, for any sample $x$ we have: $p(x|S) > p(x|\neg S) \Leftrightarrow p(x|E_+) > p(x|E_-)$. In other words, a classifier that classifies pixels based on their likelihoods w.r.t. $S$ and $\neg S$ will take the same decision as if it was trained on the true positives and negatives, and we refer to it as a reliable classifier.

Proof: We express $p(E_-|\neg S)$ as $\frac{p(E_-) - p(E_-|S)\,p(S)}{1 - p(S)}$ (Eq 1), using the hypothesis and the sum rule of probabilities. Considering (Eq 1), hypotheses $H_1$, $H_2$, and the fact that $p(S) > 0$, we obtain that $p(E_-|\neg S) > q$ (Eq 2). In a similar fashion, $p(E_+|\neg S) < q$ (Eq 3). The previously inferred relations (Eq 2 and Eq 3) generate $p(E_-|\neg S) > q > p(E_+|\neg S)$ (Eq 4), which along with hypothesis $H_2$ help us conclude that $p(E_+|S) > p(E_+|\neg S)$ (Eq 5). Also, from $H_3$, we infer that $p(x|E_+,S) = p(x|E_+)$ and $p(x|E_-,S) = p(x|E_-)$ (Eq 6). Using the sum rule and hypothesis $H_3$, we obtain that $p(x|S) = p(E_+|S)\,(p(x|E_+) - p(x|E_-)) + p(x|E_-)$ (Eq 7). In a similar way, it results that $p(x|\neg S) = p(E_+|\neg S)\,(p(x|E_+) - p(x|E_-)) + p(x|E_-)$ (Eq 8).

$p(x|S) > p(x|\neg S) \Rightarrow p(x|E_+) > p(x|E_-)$: using the hypothesis and the previously inferred results (Eq 5, 7 and 8) it results that $p(x|E_+) > p(x|E_-)$.

$p(x|E_+) > p(x|E_-) \Rightarrow p(x|S) > p(x|\neg S)$: from the hypothesis we can infer that $p(x|E_+) - p(x|E_-) > 0$, and using (Eq 5) we obtain $p(x|S) > p(x|\neg S)$.

2.3. Object proposals refinement

During this stage, the soft segmentations obtained so far are improved using a projection on their PCA subspace. In contrast to Sec. 2.1, now we select the probable object regions as the PCA projected versions of the soft segmentations computed in previous steps. For the projection we consider the first 8 principal components, with the purpose of reducing the amount of noise that might be left over from the previous steps.
Further, color likelihoods are re-estimated to obtain the soft-segmentation masks.

2.4. Considering color co-occurrences

The foreground masks obtained so far were computed by treating each pixel independently, which results in masks that are not always correct, as first-order statistics, such as colors of individual pixels, cannot capture more global characteristics about object texture and shape. At this step we move to the next level of abstraction by considering groups of colors present in local patches, which are sufficiently large to capture object texture and local shape. We define a patch descriptor based on local color occurrences, as an indicator vector $d_W$ over a given patch window $W$, such that $d_W(c) = 1$ if color $c$ is present in window $W$ and 0 otherwise (Figure 4). Colors are indexed according to their values in HSV space, where channels H, S and V are discretized in ranges [1,15], [1,11] and [1,7], generating a total of 1155 possible colors. The descriptor does not take into consideration the exact spatial location of a given color in the patch, nor its frequency. It only accounts for the presence of $c$ in the patch. This leads to invariance to most rigid or non-rigid transformations, while preserving the local appearance characteristics of the object. Then, we take a classification approach and learn a classifier (using regularized least squares regression, due to its considerable speed and efficiency) to separate between highly probable positive (HPP) descriptors and the rest, collected from the whole video according to the soft masks computed at the previous step. The classifier is generally robust to changes in viewpoint, scale, illumination, and other noise, while remaining discriminative (Figure 2).

Figure 4. Initial patch descriptors encoding color occurrences ($n$ is the number of considered colors).

Unsupervised descriptor learning: Not all 1155 colors are relevant for our classification problem. Most object textures are composed of only a few important colors that distinguish them against the background scene.
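The color-occurrence descriptor and the closed-form regression of Algorithm 1, line 14 can be sketched as follows. This is a minimal sketch: the window layout, the function names, and the default value of the regularization weight are assumptions, and the color selection step is left out (all colors are kept).

```python
import numpy as np

def patch_descriptor(color_idx, top, left, size, n_colors):
    """Indicator vector d_W: entry c is 1 if color c occurs anywhere in the
    size x size window whose top-left corner is (top, left), else 0."""
    window = color_idx[top:top + size, left:left + size]
    d = np.zeros(n_colors)
    d[np.unique(window)] = 1.0        # presence only: no position, no frequency
    return d

def add_bias(D):
    """Prepend a constant column of ones for the bias term."""
    return np.hstack([np.ones((D.shape[0], 1)), D])

def fit_ridge(D, s, lam=1.0):
    """Regularized least squares in closed form: w = (lam*I + D^T D)^{-1} D^T s.

    D: (m, k) descriptors, one training sample per row.
    s: (m,) soft-segmentation values in [0, 1] used as regression targets.
    """
    D1 = add_bias(D)
    k1 = D1.shape[1]
    # Solve the linear system rather than forming the inverse explicitly.
    return np.linalg.solve(lam * np.eye(k1) + D1.T @ D1, D1.T @ s)

def predict(D, w):
    """Evaluate the linear model; outputs are soft foreground scores."""
    return add_bias(D) @ w
```

Since the descriptor only records presence, two patches containing the same set of colors in different arrangements map to the same vector, which is the source of the invariance to deformations mentioned above.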
Effectively reducing the number of colors in the descriptor and selecting only the relevant ones can improve both speed and performance. We use the efficient selection algorithm presented in [13]. The method proceeds as follows. Let $n$ be the total number of colors and $k < n$ the number of relevant colors we want to select. The idea is to identify the group of $k$ colors with the largest amount of covariance - they will be the ones most likely to select well the foreground versus the background (see [13] for details). Now consider $C$ the covariance matrix of the colors forming the rows in the data matrix $D$. The task is to solve the following optimization problem:

$$w^* = \arg\max_{w} w^T C w \quad \text{s.t.} \quad \sum_{i=1}^{n} w_i = 1,\; w_i \in \left[0, \tfrac{1}{k}\right] \qquad (1)$$

The non-zero elements of $w^*$ correspond to the colors we need to select for creating our descriptor used by the classifier (based on the regularized least squares regression model), so we define a binary mask $w_s \in \mathbb{R}^{n \times 1}$ over the colors (that is, the descriptor vector) as follows:

$$w_s(i) = \begin{cases} 1 & \text{if } w^*(i) > 0 \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$

The problem above is NP-hard, but a good approximation can be efficiently found by the method presented in [13], based on a convergent series of integer projections on the space of valid solutions. The optimal number of selected colors is a relatively small fraction of the total number, as expected. Besides the slight increase in performance, the real gain is in the significant decrease in computation time (see Figure 5).

Next we define $D_s \in \mathbb{R}^{m \times (1+k)}$ to be the data matrix, with a training sample per row, after applying the selection mask to the descriptor; $m$ is the number of training samples and $k$ is the number of colors selected to form the descriptor (we add a constant column of 1's for the bias term). Then, the weights $w \in \mathbb{R}^{(1+k) \times 1}$ of the regularized regression model are learned very fast, in closed form:

$$w = (\lambda I + D_s^T D_s)^{-1} D_s^T s \qquad (3)$$

where $I$ is the identity matrix, $\lambda$ is the regularization term and $s$ is the vector of soft-segmentation mask values (estimated at the previous step) corresponding to the samples chosen for training of the descriptor. Then, the final appearance based soft-segmentation masks are generated by evaluating the regression model for each pixel.

Figure 5. Features selection - optimization and sensitivity analysis.

2.5. Combining appearance and motion

The foreground and background have complementary properties at many levels, not just that of appearance. Here we consider that the object of interest must distinguish itself from the rest of the scene in terms of its motion pattern. A foreground object that does not move in the image, relative to its background, cannot be discovered using information from the current video alone. We take advantage of this idea by the following efficient approach. Let $I_t$ be the temporal derivative of the image as a function of time, estimated as the difference between subsequent frames $I_{t+1} - I_t$. Also let $I_x$ and $I_y$ be the partial derivatives in the image w.r.t. $x$ and $y$.
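The motion stage (Algorithm 1, lines 17-20) can be sketched as below, using the row layout $[I_x, I_y, xI_x, xI_y, yI_x, yI_y]$ that the paper gives for the motion data matrix. This is a sketch under assumptions: spatial derivatives come from `np.gradient`, the fit uses `np.linalg.lstsq` instead of the explicit normal equations (the two agree when the matrix has full column rank), and the function name `motion_soft_mask` is hypothetical. The final Gaussian smoothing is omitted.

```python
import numpy as np

def motion_soft_mask(frame_t, frame_t1, background):
    """Soft mask of pixels that deviate from an affine background-motion model.

    frame_t, frame_t1: consecutive grayscale frames, float arrays (H, W).
    background:        bool array (H, W), True where the current segmentation
                       considers the pixel background.
    """
    H, W = frame_t.shape
    I_t = (frame_t1 - frame_t).ravel()               # temporal derivative
    I_y, I_x = np.gradient(frame_t)                  # derivatives along rows (y) and columns (x)
    ys, xs = np.mgrid[0:H, 0:W]
    # One row per pixel: [I_x, I_y, x*I_x, x*I_y, y*I_x, y*I_y].
    D = np.stack([I_x, I_y, xs * I_x, xs * I_y, ys * I_x, ys * I_y],
                 axis=-1).reshape(-1, 6)
    bg = background.ravel()
    # Least squares fit of the six motion parameters on background pixels only.
    w_m, *_ = np.linalg.lstsq(D[bg], I_t[bg], rcond=None)
    err = np.abs(D @ w_m - I_t).reshape(H, W)        # deviation from the model at every pixel
    span = err.max() - err.min()
    if span == 0:
        return np.zeros_like(err)                    # nothing deviates from the model
    return (err - err.min()) / span                  # normalize to [0, 1]
```

Fitting on background pixels only keeps the object's own motion from contaminating the model, so the object shows up as the region with large residuals.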
Consider $D_m$ to be the motion data matrix, with one row per pixel $p$ in the current frame corresponding to $[I_x, I_y, xI_x, xI_y, yI_x, yI_y]$ at locations estimated as background by the foreground segmentation estimated so far. Given such a matrix at time $t$ we linearly regress $I_t$ on $D_m$. The solution is a least squares estimate of an affine motion model for the background using a first order Taylor expansion of the image w.r.t. time: $w_m = (D_m^T D_m)^{-1} D_m^T I_t$. Here $w_m$ contains the six parameters defining the affine motion (including translation) in 2D. Then, we consider deviations from this model as potential good candidates for the presence of the foreground object, which is expected to move differently than the background scene. The idea is based on an approximation, of course, but it is very fast to compute and can be reliably combined with the appearance soft masks. Thus we evaluate the model in each location $p$ and compute errors $|D_m(p) w_m - I_t(p)|$. We normalize the error image and map it to [0,1]. This produces a soft mask (using motion only) of locations that do not obey the motion model - they are usually correlated with object locations. This map is then smoothed with a Gaussian (with $\sigma$ proportional to the distribution on $x$ and $y$ of the estimated object region). At this point we have a soft object segmentation computed from appearance alone, and one computed independently, based on motion cues. The two soft results are multiplied to obtain the final segmentation.

Optional: refinement of video object segmentation. Optionally we can further refine the soft mask by applying an off-the-shelf segmentation algorithm, such as GrabCut [19], and feeding it our soft foreground segmentation. Note: in our experiments we used GrabCut only for evaluation on SegTrack, where we were interested in the fine details of the objects' shape. All other experiments are performed without this step.

3. Experimental analysis

Our experiments were performed on two datasets: the YouTube-Objects dataset and the SegTrack v2 dataset.
We first introduce some qualitative results of our method on the considered datasets (Figure 6). Note that for the final evaluation on the YouTube-Objects dataset, we also extract object bounding boxes, which are computed using the distribution of the pixels with high probability of being part of the foreground. Both position and size of the boxes are computed using a mean shift approach. For the final evaluation on the SegTrack dataset, we have refined the soft-segmentation masks using the GrabCut algorithm [19]. In Table 2 we present evaluation results for different stages of our algorithm, along with the execution time per stage. The F-measure increases with each stage of our algorithm.

Figure 6. Qualitative results on the YouTube-Objects dataset and the SegTrack dataset.

3.1. YouTube-Objects dataset

Dataset: The YouTube-Objects dataset [18] contains a large number of videos filmed in the wild, collected from YouTube. It contains challenging, unconstrained sequences of ten object categories (aeroplane, bird, boat, car, cat, cow, dog, horse, motorbike, train). The sequences are considered to be challenging as they are completely unconstrained, displaying objects performing rapid movements, with difficult dynamic backgrounds, illumination changes, camera motion, scale and viewpoint changes and even editing effects, like flying logos or joining of different shots. The ground truth is provided for a small number of frames, and contains bounding boxes for the object instances. Usually, a frame contains only one primary object of the considered class, but there are some frames containing multiple instances of the same class of objects. Two versions of the dataset were released, the first (YouTube-Objects v1.0) containing 1407 annotated objects, while the second (YouTube-Objects v2.2) contains 6975 annotated objects.

Metric: For the evaluation on the YouTube-Objects dataset we have adopted the CorLoc metric, computing the percentage of correctly localized object bounding boxes. We evaluate the correctness of a box using the PASCAL criterion (intersection over union threshold of 0.5).

Results: We compare our method against [10, 25, 18, 21, 17]. We considered their results as originally reported in the corresponding papers. The comparison is presented in Table 3. To our knowledge, the other methods were evaluated on YouTube-Objects v1.0, on the training samples (the only exception would be [21], where the full v1.0 dataset was considered). Considering this, and the differences between the two versions regarding the number of annotations, we have reported our performance on both versions, in order to provide a fair comparison and also to report the results on the latest version, YouTube-Objects v2.2 (not considered for comparison).
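The CorLoc metric described above can be sketched in a few lines. This is an illustration under assumptions rather than the official evaluation script: boxes are `(x_min, y_min, x_max, y_max)` tuples in continuous coordinates, each annotated frame has exactly one predicted and one ground-truth box, and a box exactly at the 0.5 threshold counts as correct.

```python
def box_iou(a, b):
    """Intersection over union of two boxes (x_min, y_min, x_max, y_max)."""
    inter_w = min(a[2], b[2]) - max(a[0], b[0])
    inter_h = min(a[3], b[3]) - max(a[1], b[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0                                   # boxes do not overlap
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def corloc(predicted, ground_truth, threshold=0.5):
    """Percentage of frames whose predicted box overlaps the ground truth
    with IoU at or above the threshold (inclusive boundary is an assumption)."""
    hits = sum(box_iou(p, g) >= threshold
               for p, g in zip(predicted, ground_truth))
    return 100.0 * hits / len(ground_truth)

# Two frames: the first box matches exactly, the second overlaps by one third.
preds = [(0, 0, 2, 2), (0, 0, 2, 2)]
truth = [(0, 0, 2, 2), (1, 0, 3, 2)]
print(corloc(preds, truth))  # 50.0
```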
We report results of the evaluation on v1.0 by only considering the training samples, for a fair comparison with other methods. Our method, which is unsupervised, is compared against both supervised and unsupervised methods. In the table, we have marked state-of-the-art results for unsupervised methods (bold), and overall state-of-the-art results (underlined). We also mention the execution time for the considered methods, to show that our method is one order of magnitude faster than others (see Sec. 3.3 for details).

Table 3. The CorLoc scores of our method and 5 other state-of-the-art methods, on the YouTube-Objects dataset (note that results for v2.2 of the dataset are not considered for comparison).
Columns: Method ([10], [25], [18], [21], [17], Ours v1.0, Ours v2.2), with a "Supervised?" row
Rows: aeroplane, bird, boat, car, cat, cow, dog, horse, motorbike, train, Avg, time (sec/frame; N/A for three methods)

The performance of our method is competitive, obtaining state-of-the-art results for 3 classes, against both supervised and unsupervised methods. Compared to the unsupervised methods, we obtain state-of-the-art results for 7 classes. On average, our method performs better than all the others, both in accuracy and in execution time (also see Sec. 3.3). The fact that, on average, our algorithm outperforms other methods indicates that it generalizes better across different classes of objects and different types of videos. Our solution performs poorly on the horse class, as many sequences contain multiple horses, and our method is not able to correctly separate the instances. Another class with low performance is the cow class, where we face the same problems as for the horse class, and where objects are usually still, making them hard to segment with our system.

3.2. SegTrack v2 dataset

Dataset. The SegTrack dataset was originally introduced by [22], for evaluating tracking algorithms. Further, it was adapted for the task of video object segmentation [16]. We work with the second version of the dataset (SegTrack v2), which contains 14 videos (about 1000 frames), with pixel level ground truth annotations for the object of interest, in every frame. The dataset is difficult as the included objects can be easily confused with the background, appear in different sizes and display complex deformations. There are 8 videos with one primary object and 6 with multiple objects, from 8 different categories (bird, cheetah, human, worm, monkey, dog, frog, parachute).

Metric. For the evaluation on SegTrack we have adopted the average intersection over union metric. For the purpose of this evaluation, we use GrabCut for refinement of the soft-segmentation masks.

Results. We compare our method against [11, 24, 23, 17, 16]. We considered their results as originally reported by [23]. The comparison is presented in Table 4. Again, we compare our method against both supervised and unsupervised methods, and, in the table, we have marked state-of-the-art results for unsupervised methods (bold), and overall state-of-the-art results (underlined). The execution times are also included, to highlight that our method outperforms other approaches in terms of speed (see Sec. 3.3). The performance of our method is competitive, even though it is unsupervised. Also, we show that our method is one order of magnitude faster than the previous state-of-the-art [17] (see Sec. 3.3).

Table 4. The average IoU scores of our method and 5 other state-of-the-art methods, on the SegTrack v2 dataset. Our reported time also includes the computational time required for GrabCut.
Columns: Method ([11], [24], [23], [17], [16], Ours), with a "Supervised?" row
Rows: bird of paradise, birdfall, frog, girl, monkey, parachute, soldier, worm, Avg, time (sec/frame; surviving entries: >120, >120, N/A)

3.3. Computation time

One of the main advantages of our method is the reduced computational time. Note that all per pixel classifications can be efficiently implemented by linear filtering routines, as all our classifiers are linear.
It takes only 0.35 sec/frame to generate the soft segmentation masks (initial object cues: 0.05 sec/frame, object proposals refinement: 0.03 sec/frame, patch-based regression model: 0.25 sec/frame, motion estimation: 0.02 sec/frame (Table 2)). The method was implemented in Matlab, with no special optimizations. All timing measurements were performed using a computer with an Intel Core GHz CPU. Papazoglou et al. [17] report a time of 3.5 sec/frame for the initial optical flow computation, on top of which they run their method, which requires 0.5 sec/frame, leading to a total time of 4 sec/frame. The method introduced in [21] takes a total of 6.9 sec/frame. For other methods, such as those introduced in [24, 11], it takes up to 120 sec/frame only to generate the initial object proposals using the method of [3]. We have no information regarding the computational time of the other considered methods, but due to their complexity we expect them to be orders of magnitude slower than ours.

4. Conclusions

We have presented an efficient, fully unsupervised method for object discovery in video that is both fast and accurate. It achieves state-of-the-art results on a challenging benchmark for bounding box object discovery and very competitive performance on a video object segmentation dataset. At the same time, our method is fast, being at least an order of magnitude faster than the competition. We achieve an excellent combination of speed and performance by exploiting the contrasting properties between objects and their scenes, in terms of appearance and motion, which makes it possible to select positive feature samples with very high precision. We show, theoretically and practically, that high precision is sufficient for reliable unsupervised learning (since positives are generally less frequent than negatives), which we perform both at the level of single pixels and at the higher level of groups of pixels, which capture higher-order statistics about the objects' appearance, texture and shape.
The top speed and accuracy of our method, combined with theoretical guarantees that hold in practice under mild conditions, make our approach unique and valuable in the quest for solving the unsupervised learning problem in video.

Acknowledgements: The authors thank Otilia Stretcu for helpful feedback. This work was supported by UEFISCDI, under project PN-III-P4-ID-ERC

References

[1] B. Alexe, T. Deselaers, and V. Ferrari. Measuring the objectness of image windows. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(11).
[2] J. Carreira and C. Sminchisescu. CPMC: Automatic object segmentation using constrained parametric min-cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(7).
[3] I. Endres and D. Hoiem. Category independent object proposals. Computer Vision, ECCV 2010.
[4] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9).
[5] B. Fulkerson, A. Vedaldi, and S. Soatto. Class segmentation and object localization with superpixel neighborhoods. In Computer Vision, 2009 IEEE 12th International Conference on. IEEE.
[6] X. Hou and L. Zhang. Saliency detection: A spectral residual approach. In Computer Vision and Pattern Recognition, CVPR '07. IEEE Conference on, pages 1-8. IEEE.
[7] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. ACM Computing Surveys (CSUR), 31(3).
[8] S. D. Jain and K. Grauman. Supervoxel-consistent foreground propagation in video. In European Conference on Computer Vision. Springer.
[9] H. Jiang, J. Wang, Z. Yuan, Y. Wu, N. Zheng, and S. Li. Salient object detection: A discriminative regional feature integration approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[10] Y. Jun Koh, W.-D. Jang, and C.-S. Kim. POD: Discovering primary objects in videos based on evolutionary refinement of object recurrence, background, and primary object models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[11] Y. J. Lee, J. Kim, and K. Grauman. Key-segments for video object segmentation. In Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE.
[12] M. Leordeanu, R. Collins, and M. Hebert. Unsupervised learning of object features from video sequences. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 1. IEEE Computer Society; 1999.
[13] M. Leordeanu, A. Radu, S. Baluja, and R. Sukthankar. Labeling the features not the samples: Efficient video classification with minimal supervision. arXiv preprint.
[14] A. Levinshtein, A. Stere, K. N. Kutulakos, D. J. Fleet, S. J. Dickinson, and K. Siddiqi. TurboPixels: Fast superpixels using geometric flows. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(12).
[15] F. Li, J. Carreira, G. Lebanon, and C. Sminchisescu. Composite statistical inference for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[16] F. Li, T. Kim, A. Humayun, D. Tsai, and J. M. Rehg. Video segmentation by tracking many figure-ground segments. In Proceedings of the IEEE International Conference on Computer Vision.
[17] A. Papazoglou and V. Ferrari. Fast object segmentation in unconstrained video. In Proceedings of the IEEE International Conference on Computer Vision.
[18] A. Prest, C. Leistner, J. Civera, C. Schmid, and V. Ferrari. Learning object class detectors from weakly annotated video. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE.
[19] C. Rother, V. Kolmogorov, and A. Blake. GrabCut: Interactive foreground extraction using iterated graph cuts. In ACM Transactions on Graphics (TOG), volume 23. ACM.
[20] J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman, and W. T. Freeman. Discovering objects and their location in images. In Computer Vision, ICCV. Tenth IEEE International Conference on, volume 1. IEEE.
[21] O. Stretcu and M. Leordeanu. Multiple frames matching for object discovery in video. In BMVC, pages 186 1.
[22] D. Tsai, M. Flagg, A. Nakazawa, and J. M. Rehg. Motion coherent tracking using multi-label MRF optimization. International Journal of Computer Vision, 100(2).
[23] L. Wang, G. Hua, R. Sukthankar, J. Xue, Z. Niu, and N. Zheng. Video object discovery and co-segmentation with extremely weak supervision. IEEE Transactions on Pattern Analysis and Machine Intelligence.
[24] D. Zhang, O. Javed, and M. Shah. Video object segmentation through spatially accurate and temporally dense extraction of primary object regions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[25] Y. Zhang, X. Chen, J. Li, C. Wang, and C. Xia. Semantic object segmentation via detection in weakly labeled video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[26] C. L. Zitnick and P. Dollár. Edge boxes: Locating object proposals from edges. In European Conference on Computer Vision. Springer.
