Detection of salient objects with focused attention based on spatial and temporal coherence



Article
Information Processing Technology
April 2011 Vol.56 No.10: doi: /s
SPECIAL TOPICS:

Detection of salient objects with focused attention based on spatial and temporal coherence

WU Yang 1, ZHENG NanNing 1, YUAN ZeJian 1, JIANG HuaiZu 1 & LIU Tie 2

1 Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University, Xi'an, China;
2 IBM Research-China, Beijing, China

Received October 8, 2010; accepted November 8, 2010

The understanding and analysis of video content are fundamentally important for numerous applications, including video summarization, retrieval, navigation, and editing. An important part of this process is to detect salient (which usually means important and interesting) objects in video segments. Unlike existing approaches, we propose a method that combines the saliency measurement with spatial and temporal coherence. The integration of spatial and temporal coherence is inspired by the focused attention in human vision. In the proposed method, the spatial coherence of low-level visual grouping cues (e.g. appearance and motion) helps per-frame object-background separation, while the temporal coherence of the object properties (e.g. shape and appearance) ensures consistent object localization over time, and thus the method is robust to unexpected environment changes and camera vibrations. Having developed an efficient optimization strategy based on coarse-to-fine multi-scale dynamic programming, we evaluate our method using a challenging dataset that is freely available together with this paper. We show the effectiveness and complementariness of the two types of coherence, and demonstrate that they can significantly improve the performance of salient object detection in videos.

visual attention, focused attention, salient object detection, spatial and temporal coherence

Citation: Wu Y, Zheng N N, Yuan Z J, et al. Detection of salient objects with focused attention based on spatial and temporal coherence. Chinese Sci Bull, 2011, 56: , doi: /s

The rapid development of networks and storage devices has greatly encouraged the capture, spread and sharing of large quantities of video data.
However, because these extensive videos often contain unimportant or uninteresting content, searching for a desired video segment in a large database becomes a very difficult and time-consuming task. To solve this problem, a plausible approach is to model the visual attention of human vision, as it can filter out unwanted information in a scene [1]. In the literature, visual attention has been widely modeled as a visual saliency estimation problem [2–5], which has been extensively investigated using both static images and video segments. Instead of disregarding the content when dealing with the generic visual saliency estimation problem in video [5], we focus on detecting a salient object in video segments. Video segments with a particular focus are found in many kinds of videos, and research on these has very important applications such as video summarization, retrieval, navigation and editing.

As argued by cognitive scientist Stephen Palmer in his book [6], there are two types of visual attention: distributed attention and focused attention. The former relates to visual signal processing that occurs when subjects are prepared for the target to appear in any location, while the latter occurs when they have selected a single perceptual object. For the problem of detecting a salient object in a video clip, focused attention of the human vision system should play a critical role. Therefore, the perceptive properties of focused attention become our inspiration for solving the problem.

Corresponding authors (wuyang0321@gmail.com; nnzheng@mail.xjtu.edu.cn)
The Author(s). This article is published with open access at Springerlink.com. csb.scichina.com

Most of the related works focus on the integration of

various bottom-up cues for estimating saliency [7–9], supported by feature integration theory [10]. Usually such cues are represented by contrast-based features, including both low-level measures of contrast (e.g. pixel-wise color, gradient contrast and regional appearance contrast) and high-level measures (e.g. objectness [11]). When human eye-fixation records or hand-labeled bounding boxes of salient regions are available for model learning [3–5], these top-down priors can be used to train a saliency model, whose properties depend largely on the training data. An extreme case is to train class-specific saliency estimators that can be used to detect interesting object classes [12]. There are also some studies on the combination of bottom-up and top-down models for better detection performance [5,13].

Irrespective of which cues or priors have been used, most of these efforts are aimed at generating a saliency map, without explicitly modeling the coherence of the results, either spatially or temporally. Although spatial and temporal information can to some extent be represented by regional contrast, we argue that coherence is not equivalent to regional contrast, because it emphasizes the absolute consistency of the object itself, and not the relative differences against the changing background. Spatial coherence has been used widely in image and motion segmentation [14,15], while temporal coherence plays a critical role in object tracking [16]. Though both of these have also been used for moving object detection [17,18], they are usually modeled separately. As far as we are aware, there is no prior work that uses both of these factors to detect salient objects in videos. However, when humans focus their attention on an object, it means that the object remains somewhere consistently in the spatial and temporal space. Therefore, spatial and temporal coherence are indispensable properties for detecting the object.

This paper proposes a novel method that combines spatial and temporal coherence systematically with the saliency measurement for salient object detection.
We present encouraging results that demonstrate the importance and complementariness of these two types of coherence.

1 Problem formulation

We formulate salient object detection as a binary labeling problem, where the salient object is represented as a binary mask $A_t = \{a_x^t\}$ for each frame $I_t$, $t \in \{1,\dots,T\}$, in a video segment $I = I_{1:T}$. In the mask, $a_x^t \in \{0,1\}$ is a binary label for each pixel $x$ in frame $I_t$, and $T$ is the number of frames in the video segment. More concretely, $a_x^t = 1$ indicates that pixel $x$ belongs to the salient object, while $a_x^t = 0$ means the opposite. Following the widely used conditional random field model [19], the probability of the prediction $A = A_{1:T}$ can be modeled as a conditional distribution:

$$P(A \mid I) = P(A_{1:T} \mid I_{1:T}) = \frac{1}{Z}\exp\big(-E(A_{1:T} \mid I_{1:T})\big), \tag{1}$$

where $E(\cdot)$ is an energy function and $Z$ is the partition function. Therefore, the optimal solution for the labeling $A_{1:T}$ based on the maximum a posteriori estimation can be obtained by minimizing $E(A_{1:T} \mid I_{1:T})$. If the Markov property holds for the data, the optimization can be written as:

$$A^*_{1:T} = \arg\min_{A_{1:T}} E(A_{1:T} \mid I_{1:T}) = \arg\min_{A_{1:T}} \sum_{t=1}^{T} E(A_t \mid I_{t-1,t}, A_{t-1}). \tag{2}$$

The energy function of the binary mask of each frame, $E(A_t \mid I_{t-1,t}, A_{t-1})$, can further be decomposed into the following three terms:

$$E(A_t \mid I_{t-1,t}, A_{t-1}) = \underbrace{S(A_t, I_t, M_t)}_{\text{Saliency}} + \alpha\,\underbrace{C_S(A_t, I_t, M_t)}_{\text{Spatial coherence}} + \beta\,\underbrace{C_T(A_{t-1,t}, I_{t-1,t})}_{\text{Temporal coherence}}, \tag{3}$$

where $M_t$ is the optical flow field from frame $I_{t-1}$ to frame $I_t$, while $\alpha$ and $\beta$ are the balancing weights for the spatial and temporal coherence, respectively. Note that eq. (3) is the general form of the global energy function based on both the static image and dynamic motion information, where the optical flow field $M_t$ and the saliency map $S(A_t, I_t, M_t)$ can be obtained using any method. In fact, these computations are currently two active research topics [5,20] with new approaches constantly being proposed despite the existence of many solutions.

In our former work [4], we proposed the three different types of features defined below, to compute the saliency maps of static images.

(i) Local multi-scale contrast.
Given an image $I$, a Gaussian image pyramid of $L$ levels is computed, and then the multi-scale contrast feature is defined as 1)

$$f_c(x, I) \propto \sum_{l=1}^{L} \sum_{x' \in N(x)} \big\| I^l(x) - I^l(x') \big\|^2, \tag{4}$$

where $I^l$ is the $l$th level image in the pyramid and $N(x)$ is the neighborhood of pixel $x$. Typically, $L = 6$ and $N(x)$ is a 9×9 window.

(ii) Regional center-surround histogram. Suppose the salient object is bounded by a rectangle $R$; we construct a surrounding contour $R_S$ with the same area as $R$. Given a pixel $x$, we find the most distinct rectangle $R^*(x)$ centered at $x$, from the candidates with different sizes and aspect ratios:

$$R^*(x) = \arg\max_{R(x)} \chi^2\big(R(x), R_S(x)\big). \tag{5}$$

1) The proportion sign is necessary because the definition is not normalized. After normalization, the right hand side is the desired feature. The other two features are designed similarly.

The size of $R$ varies within $[0.1, 0.7] \times \min(w, h)$, where $w$ and $h$ are the image width and height, respectively. The aspect ratio is one of the five values {0.5, 0.75, 1.0, 1.5, 2.0}. The $\chi^2$ distance measures the dissimilarity of the two RGB color histograms of the center and surrounding rectangles. The center-surround histogram feature is defined as a weighted sum of the $\chi^2$ distance at neighboring pixels:

$$f_h(x, I) \propto \sum_{\{x' \mid x \in R^*(x')\}} w_{xx'}\, \chi^2\big(R^*(x'), R^*_S(x')\big), \tag{6}$$

where $w_{xx'} = \exp\big(-0.5\,\sigma_{x'}^{-2}\,\|x - x'\|^2\big)$ is a Gaussian falloff weight with variance $\sigma_{x'}^2$ set to one third of the size of $R^*(x')$, making the weight adaptive to the data.

(iii) Global color spatial-distribution. This feature measures the spatial variance of the color. Suppose the colors in the image can be represented by a Gaussian mixture model:

$$p(c \mid I_x) = \frac{w_c\, N(I_x \mid \mu_c, \Sigma_c)}{\sum_{c'=1}^{C} w_{c'}\, N(I_x \mid \mu_{c'}, \Sigma_{c'})}, \tag{7}$$

where $C$ is the number of color components. Then, after obtaining the position variance $V(c)$ of each color component $c$, the color spatial-distribution feature can be defined as

$$f_s(x, I) \propto \sum_{c} p(c \mid I_x)\,\big(1 - V(c)\big). \tag{8}$$

Further details about the definition and computation of $V(c)$ can be found in [4].

These features have been extensively evaluated, with promising results, using a large-scale image dataset. The integrated saliency map based thereon is

$$F_S(x) = \lambda_c f_c(x, I) + \lambda_h f_h(x, I) + \lambda_s f_s(x, I), \tag{9}$$

where the learned optimal parameters are $\lambda_c = 0.24$, $\lambda_h = 0.54$, and $\lambda_s = 0.22$ [4].

Static saliency may be confused by a cluttered background, especially in videos, so we have also proposed computing the same features on the optical flow fields to represent the dynamic motion saliency [21]. However, because of the instability of the optical flow extraction algorithm, the optical flow field was smoothed before feature extraction. If the optical flow is computed by a satisfactory approach such as the one proposed in [22], the features can be extracted directly from $M_t$ without extra smoothing.
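As an illustration of the two building blocks used above, the following minimal Python sketch computes a χ² distance between two histograms and the weighted combination of eq. (9). It is not the authors' implementation: the function names, the factor 0.5 in the χ² distance, and the rescaling of each feature map to [0, 1] before mixing are assumptions made only for this example. Only the weights 0.24, 0.54 and 0.22 come from the text.

```python
import numpy as np

def chi2_distance(h1, h2, eps=1e-12):
    """Chi-square distance between two histograms, each normalised to sum to 1.

    The 0.5 factor is one common convention (assumed here); it bounds the
    distance by 1 for normalised histograms.
    """
    h1 = np.asarray(h1, dtype=float)
    h2 = np.asarray(h2, dtype=float)
    h1 = h1 / max(h1.sum(), eps)
    h2 = h2 / max(h2.sum(), eps)
    return 0.5 * float(np.sum((h1 - h2) ** 2 / (h1 + h2 + eps)))

def combine_static_saliency(f_c, f_h, f_s, weights=(0.24, 0.54, 0.22)):
    """Linear combination of eq. (9) with the weights reported in the text.

    Each feature map is rescaled to [0, 1] first (an assumption standing in
    for the normalisation mentioned in footnote 1).
    """
    def rescale(f):
        f = np.asarray(f, dtype=float)
        span = f.max() - f.min()
        return (f - f.min()) / span if span > 0 else np.zeros_like(f)

    lam_c, lam_h, lam_s = weights
    return lam_c * rescale(f_c) + lam_h * rescale(f_h) + lam_s * rescale(f_s)
```

With this convention, identical histograms give a distance of 0 and histograms with disjoint support give a distance of 1, so the center-surround response is easy to compare across rectangle sizes.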
In the experiments presented in Section 4, optical flow is computed by the algorithm introduced in [22], and the same parameters ($\lambda_c$, $\lambda_h$ and $\lambda_s$) are adopted to combine the three types of features into the motion saliency map $F_M(x)$. Having obtained both these saliency maps, we can combine them into an overall saliency map $F_t$ using the adaptive strategy proposed in [21]. Then, the saliency term in eq. (3) can be defined as

$$S(A_t, I_t, M_t) = \sum_{x:\, a_x^t = 0} F_t(x, I_t, M_t) + \sum_{x:\, a_x^t = 1} \big(1 - F_t(x, I_t, M_t)\big). \tag{10}$$

Figure 1 illustrates the framework of our method based on eq. (3). Besides the saliency maps, spatial and temporal coherence are two other important components for global optimization, as explained in detail in the following section.

Figure 1 Overall framework of our method.

2 Spatial and temporal coherence

2.1 Spatial coherence

The spatial relationship between neighboring pixels has widely been used to represent the intrinsic property of random fields. The simplest relationship is based on the

assumption that proximal pixels are more likely to have the same category labels. In conditional random field-based applications, however, information from the data is usually taken into account as well, resulting in a more general spatial constraint because neighboring pixels belonging to the same category are more likely to share the same features/properties. Such a spatial coherence constraint is also widely referred to as the smoothing term or smoothing model in the literature because it tends to smoothen the features/properties within the same category. Consequently, we formulate the spatial coherence as

$$C_S(A_t, I_t, M_t) = \sum_{x, x'} \delta\big(a_x^t \neq a_{x'}^t\big)\, D(x, I_t, M_t), \tag{11}$$

where $x$ and $x'$ are neighboring pixels and $D$ is the smoothness measurement of the image and motion fields at pixel $x$. The smoother the image and motion fields are at $x$, the larger the value of $D(x, I_t, M_t)$ is. Therefore, when $C_S$ is minimized, the boundary pixels separating the foreground object and the background regions tend to be located at or close to the edge points of the image and motion field. A simple yet effective choice for $D$ is the distance transform [23], which is a soft smoothness measurement. If chamfer distance is used as the distance metric, it can be computed efficiently.

The formulation above is a generic form that contains both the static appearance and dynamic motion information. Here, we discuss these two typical cues, together with the combination thereof.

(i) Static cue: appearance. Although in general the static cue can be any image features, we focus on the most widely used one: raw color information. If only this cue is considered, eq. (11) becomes

$$C_S(A_t, I_t) = \sum_{x, x'} \delta\big(a_x^t \neq a_{x'}^t\big)\, D_{ch}\big(x, EM(I_t)\big), \tag{12}$$

where $EM(I_t)$ is the edge map of $I_t$ and $D_{ch}$ is the chamfer distance. The static cue uses the image edges to guide the labeling, which should be beneficial for the results if the background is relatively clean.
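The spatial coherence term of eq. (11) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: it assumes a 4-connected neighborhood, counts each unordered neighbor pair once, and charges the smoothness value $D$ at the first pixel of the pair. The function name and these conventions are hypothetical.

```python
import numpy as np

def spatial_coherence(mask, smoothness):
    """Sum D(x) over 4-connected neighbour pairs (x, x') with different labels.

    mask:       2-D array of binary labels a_x.
    smoothness: 2-D array D of the same shape (e.g. a distance transform).
    """
    a = np.asarray(mask, dtype=int)
    d = np.asarray(smoothness, dtype=float)
    cost = 0.0
    # Horizontal pairs: x = (r, c), x' = (r, c + 1); D is read at x.
    differs_h = a[:, :-1] != a[:, 1:]
    cost += d[:, :-1][differs_h].sum()
    # Vertical pairs: x = (r, c), x' = (r + 1, c); D is read at x.
    differs_v = a[:-1, :] != a[1:, :]
    cost += d[:-1, :][differs_v].sum()
    return float(cost)
```

For a single row with $D = (2, 1, 0, 1)$, where the edge lies at the third pixel, a label boundary between the second and third pixels costs 1, whereas a boundary between the first and second pixels costs 2. The term therefore favors object boundaries that sit close to edges.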
Figure 2 Effectiveness of spatial coherence. (a) Sample video frames with ground truth labels; (b) saliency maps; (c) results based on saliency maps only; (d), (e), and (f) distance transform maps of the spatial coherence based on color cue, motion cue, and both color and motion cues, respectively, where the darkest areas (valleys) are potential object boundaries; (g) detection results using both saliency and spatial coherence.

Unfortunately, there are many cases in which the background is cluttered, especially outdoor scenes as shown in Figure 2, where row (d) illustrates the

distance transform results using the static cue only. The valleys (darkest areas) are spread out across the whole images because of the cluttered background. In these cases, optimization using appearance-based spatial coherence can easily get confused because it cannot distinguish the real object boundaries from the background clutter.

(ii) Dynamic cue: motion. If the foreground object moves differently to the cluttered background, the optical flow field $M_t$ is useful for detecting the object. Similar to the appearance-based spatial coherence, we can define spatial coherence of the motion field as

$$C_S(A_t, M_t) = \sum_{x, x'} \delta\big(a_x^t \neq a_{x'}^t\big)\, D_{ch}\big(x, EM(M_t)\big), \tag{13}$$

where $EM(M_t)$ is the edge map of $M_t$ and $D_{ch}$ is once again the chamfer distance. As far as we are aware, this is the first time that the optical flow field has been used as the spatial coherence constraint to detect salient objects in videos. The first 4 frames of row (e) in Figure 2 demonstrate the effectiveness of spatial coherence when the cluttered background moves uniformly yet differently to the salient object.

(iii) Integrating static and dynamic cues. When the salient object moves in a dynamic scene with an uneven background motion field, the power of the dynamic cue is weakened. In these cases, the static cue may be helpful if the moving parts of the background have weaker static boundaries than the salient objects. Therefore, combining the two cues might be a better choice than using only one of them. Inspired by the way in which the spatial and temporal smoothing terms are combined for moving object detection in reference [24], we combined the two cues for spatial coherence by defining the smoothness measurement $D(x, I_t, M_t)$ in eq. (11) as

$$D(x, I_t, M_t) = 1 - \Big(1 - \eta_I\, D_{ch}\big(x, EM(I_t)\big)\Big)\Big(1 - \eta_M\, D_{ch}\big(x, EM(M_t)\big)\Big), \tag{14}$$

where $\eta_I \in [0,1]$ and $\eta_M \in [0,1]$ are two parameters controlling the influence of the static and dynamic cues, respectively (where 0 means ineffective and 1 means fully involved). Such a combination can reduce the risk of being confused by the clutter or movement in the background.
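To make the combined smoothness measurement concrete, the sketch below pairs a classical two-pass 3-4 chamfer distance transform with the combination rule of eq. (14). It is an illustration under stated assumptions, not the authors' implementation: the 3-4 chamfer weights, the rescaling of each distance map to [0, 1] (so that the product form of eq. (14) stays in [0, 1]), and the function names are all choices made for this example. The pure-Python loops are kept for clarity and would be slow on full-size frames.

```python
import numpy as np

def chamfer_distance(edge_map, a=3, b=4):
    """Two-pass 3-4 chamfer distance to the nearest edge pixel, scaled to [0, 1]."""
    edges = np.asarray(edge_map, dtype=bool)
    h, w = edges.shape
    big = a * (h + w) + 1  # exceeds any chamfer distance inside the image
    d = np.where(edges, 0, big).astype(np.int64)
    # Forward pass: propagate distances from the top and left neighbours.
    for y in range(h):
        for x in range(w):
            if y > 0:
                d[y, x] = min(d[y, x], d[y - 1, x] + a)
                if x > 0:
                    d[y, x] = min(d[y, x], d[y - 1, x - 1] + b)
                if x < w - 1:
                    d[y, x] = min(d[y, x], d[y - 1, x + 1] + b)
            if x > 0:
                d[y, x] = min(d[y, x], d[y, x - 1] + a)
    # Backward pass: propagate distances from the bottom and right neighbours.
    for y in range(h - 1, -1, -1):
        for x in range(w - 1, -1, -1):
            if y < h - 1:
                d[y, x] = min(d[y, x], d[y + 1, x] + a)
                if x > 0:
                    d[y, x] = min(d[y, x], d[y + 1, x - 1] + b)
                if x < w - 1:
                    d[y, x] = min(d[y, x], d[y + 1, x + 1] + b)
            if x < w - 1:
                d[y, x] = min(d[y, x], d[y, x + 1] + a)
    dist = d.astype(float) / a  # approximate Euclidean distance in pixels
    return dist / dist.max() if dist.max() > 0 else dist

def combined_smoothness(d_img, d_mot, eta_i=1.0, eta_m=1.0):
    """Eq. (14): D = 1 - (1 - eta_I * D_img) * (1 - eta_M * D_mot)."""
    return 1.0 - (1.0 - eta_i * d_img) * (1.0 - eta_m * d_mot)
```

With both weights set to 1, the combined map is close to zero only where both the color and the motion distance maps are close to zero, so a valley survives only where an image edge and a motion edge coincide. Setting one weight to 0 reduces the map to the other cue alone.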
The last two rows in Figure 2 show the combined distance transform map (i.e. $D(x, I_t, M_t)$ at each pixel) and the detection results using the two-cue based spatial coherence, for which $\eta_I$ and $\eta_M$ are both set to 1. It can be seen that the results are much tighter than those using only saliency maps.

2.2 Temporal coherence

Temporal coherence is designed to represent the temporal similarity between the salient objects from two consecutive frames. Such temporal similarity can be measured by both shape and appearance as

$$\begin{aligned}
C_T(A_{t-1,t}, I_{t-1,t}) &= \beta_S\, C_T^{\text{Shape}}(A_{t-1,t}) + C_T^{\text{Appearance}}(A_{t-1,t}, I_{t-1,t}),\\
C_T^{\text{Shape}}(A_{t-1,t}) &= \sum_{x} \big| a_x^t - a_x^{t-1} \big|,\\
C_T^{\text{Appearance}}(A_{t-1,t}, I_{t-1,t}) &= \chi^2\big(H(A_{t-1}, I_{t-1}),\, H(A_t, I_t)\big),
\end{aligned} \tag{15}$$

where $H(A, I)$ is the appearance histogram of area $A$ in frame $I$ (in our experiments, only color has been used as the appearance representation). In practice, we find that setting the balancing weight to $\beta_S = 100 \cdot \frac{0.5\,wh}{w+h}$ works well, with $w$, $h$ denoting the image width and height, respectively. Figure 3 shows an example of how temporal coherence works. The five consecutive frames are prone to a short and irregular vibration of the camera, causing the saliency maps to be heavily influenced by the distracting motion. However, the temporal coherence can smooth out the randomness within the saliency maps and generate a stable sequence of object bounding boxes.

3 Optimization strategy

Given the problem in eqs. (2) and (3), the optimization involves finding the best labeling for each pixel $x$ in each frame $I_t$ of the video segment $I = I_{1:T}$. This is computationally infeasible as the search space is $2^{whT}$, where $w$, $h$ denote the image width and height, respectively. In object detection, however, pixel-wise segmentation results are usually unnecessary and the bounding boxes suffice. If the binary mask $A_t$ is restricted to a rectangle, it can be represented by four parameters: center $x_t$ and size $s_t$, both of which are two-dimensional. The state space thus decreases dramatically to no more than $(w^2 h^2)^T$.
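The temporal coherence of eq. (15) can be illustrated on pixel-wise masks with the short Python sketch below. It is a sketch under assumptions, not the authors' code: the joint RGB histogram with 8 bins per channel, the 0.5 factor in the χ² distance, the default `beta_s=1.0`, and all function names are hypothetical choices for this example.

```python
import numpy as np

def chi2(h1, h2, eps=1e-12):
    """Chi-square distance between two normalised histograms (0.5 factor assumed)."""
    return 0.5 * float(np.sum((h1 - h2) ** 2 / (h1 + h2 + eps)))

def color_histogram(frame, mask, bins=8):
    """Normalised joint RGB histogram H(A, I) of the pixels where mask == 1."""
    pixels = frame[mask.astype(bool)].reshape(-1, 3).astype(float)
    hist, _ = np.histogramdd(pixels, bins=(bins,) * 3, range=[(0, 256)] * 3)
    total = hist.sum()
    return hist / total if total > 0 else hist

def temporal_coherence(mask_prev, mask_cur, frame_prev, frame_cur, beta_s=1.0):
    """Eq. (15): beta_S * shape term + appearance term for two consecutive frames."""
    # Shape term: number of pixels whose label changes between the two masks.
    shape_cost = float(np.sum(np.abs(mask_cur.astype(int) - mask_prev.astype(int))))
    # Appearance term: chi-square distance between the two object histograms.
    appearance_cost = chi2(color_histogram(frame_prev, mask_prev),
                           color_histogram(frame_cur, mask_cur))
    return beta_s * shape_cost + appearance_cost
```

For a 2×2 object that shifts by one pixel on a uniformly colored frame, the shape term is 4 (two pixels leave the mask and two enter it) and the appearance term is 0. If the mask stays fixed while the color inside it changes completely, the shape term is 0 and the appearance term is close to 1.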
Meanwhile, the temporal coherence based on shape can be simplified as

$$C_T^{\text{Shape}}(A_{t-1,t}) = \| x_t - x_{t-1} \| + \gamma\, \| s_t - s_{t-1} \|, \tag{16}$$

where $\gamma$ is a weight balancing the location and scale differences according to [25].

The problem defined by eqs. (2) and (3) can be decomposed into a set of subproblems, the solutions of which enable the original optimization problem to be solved quickly. Moreover, these subproblems overlap, i.e., their local solutions can be used multiple times in the global optimization. Therefore, the optimization can be solved efficiently by dynamic programming. Let $B_t(A_t)$ be the summarized energy of frames 1 to $t$; then these subproblems can be written recursively as

$$\begin{aligned}
B_1(A_1) &= E(A_1 \mid I_1) = S(A_1, I_1) + \alpha\, C_S(A_1, I_1),\\
B_t(A_t) &= \min_{A_{t-1} \in N_{DP}(A_t)} \big\{ B_{t-1}(A_{t-1}) + E(A_t \mid I_{t-1,t}, A_{t-1}) \big\}\\
&= S(A_t, I_t, M_t) + \alpha\, C_S(A_t, I_t, M_t) + \min_{A_{t-1} \in N_{DP}(A_t)} \big( B_{t-1}(A_{t-1}) + \beta\, C_T(A_{t-1,t}, I_{t-1,t}) \big), \quad t > 1.
\end{aligned} \tag{17}$$

Note that we narrow down the search space of $A_{t-1}$ by enforcing it to be in the neighborhood of $A_t$, which coincides with the temporal coherence requirement. After computing all possible $B_t(A_t)$ for each frame $I_t$, the final solution for the overall problem can be obtained by

$$A_T^* = \arg\min_{A_T} B_T(A_T). \tag{18}$$

Then we trace back the sequence to obtain the optimal solution for each of the remaining frames, $t \in \{1, \dots, T-1\}$:

$$A_t^* = \arg\min_{A_t \in N_{DP}(A_{t+1}^*)} \Big( B_t(A_t) + \beta\, C_T\big((A_t, A_{t+1}^*), I_{t,t+1}\big) \Big). \tag{19}$$

Suppose the number of possible $A_t$ in $I_t$ is $N$ (no more than $w^2 h^2$), and $NT$ is not very large; then the index of $A_t^*$ can be cached in a table for every possible $A_{t+1}$, with size $NT$. Then the trace back operation can be done in $O(T)$ time by merely consulting the lookup table $\mathcal{T}(A_{t+1}) = A_t^*$.

The computational expense of the algorithm is dominated by the computation of $B_t(A_t)$, $t \in \{1, \dots, T\}$. Suppose there are $n$ candidates in the neighborhood space $N_{DP}(A_t)$; then the time cost is $O(nNT)$. Because of the variations in object size, it is not easy to set a proper neighborhood size $n$. If $n$ is too small, it may fail to represent the large movements of big objects, whereas conversely, an overlarge $n$ loses the efficiency of the neighborhood search. To solve this problem and further speed up the computation, a coarse-to-fine strategy can be used [21]. All the saliency maps are down-sampled to generate a pyramid (for example 6 layers) and then, after searching the entire state space at the coarsest scale, the optimization results serve as the initial solution for the finer scale. By doing this, the neighbor space can be limited to a small range, such as a circle with a radius of 2 pixels.

Figure 3 Effectiveness of temporal coherence. (a) Sample video frames with ground truth labels; (b) saliency maps; (c) results using saliency maps only; (d) results using both saliency and temporal coherence.
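The recursion, the final minimization and the trace-back of eqs. (17) to (19) follow the usual dynamic programming pattern, which the Python sketch below illustrates at a single scale. It is not the authors' implementation, and the coarse-to-fine pyramid is omitted. The sketch assumes that the per-frame costs $S + \alpha C_S$ have already been computed for a shared list of candidate rectangles, that the temporal term is the simplified shape cost of eq. (16), and that the neighborhood $N_{DP}$ contains all candidates whose centers lie within `radius` pixels. The function name and array layout are hypothetical; the defaults β = 0.01, γ = 2 and a radius of 2 pixels are the values quoted in the text.

```python
import numpy as np

def dp_track(unary, states, beta=0.01, gamma=2.0, radius=2.0):
    """Minimise the summed energy over a sequence of candidate rectangles.

    unary:  (T, N) array; unary[t, j] stands for S + alpha * C_S of candidate j in frame t.
    states: (N, 4) array of candidates (centre_x, centre_y, width, height).
    Returns the list of T chosen candidate indices.
    """
    unary = np.asarray(unary, dtype=float)
    states = np.asarray(states, dtype=float)
    T, N = unary.shape
    centre_dist = np.linalg.norm(states[:, None, :2] - states[None, :, :2], axis=2)
    size_dist = np.linalg.norm(states[:, None, 2:] - states[None, :, 2:], axis=2)
    pair = beta * (centre_dist + gamma * size_dist)  # beta * eq. (16)
    pair[centre_dist > radius] = np.inf              # candidates outside N_DP
    B = unary[0].copy()                              # B_1
    back = np.zeros((T, N), dtype=int)               # lookup table for the trace-back
    for t in range(1, T):
        total = B[:, None] + pair                    # total[i, j]: previous i, current j
        back[t] = np.argmin(total, axis=0)
        B = unary[t] + total[back[t], np.arange(N)]  # B_t
    path = [int(np.argmin(B))]                       # eq. (18)
    for t in range(T - 1, 0, -1):                    # eq. (19) via the cached table
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```

Because the best predecessor of every candidate is cached while $B_t$ is computed, the trace-back only reads one table entry per frame. A candidate that looks attractive in a single frame but lies outside the neighborhood of every plausible predecessor is never selected, which is how an isolated saliency outlier gets suppressed.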
4 Experimental results

To demonstrate quantitatively the advantages of using spatial and temporal coherence, we carried out our experiments on 32 video segments with a total of 4820 frames collected from the internet 2). Each video segment contains a single salient object, which ranges from humans performing various activities and animals in the wild, to vehicles both on the ground and in the air. All the frames are annotated with object bounding boxes, and the detection performance is evaluated in terms of mean precision (P), recall (R), F-measure ($F_{0.5}$), and boundary displacement errors (BDE) [3].

2) This dataset is freely available from the authors.

Four different settings on eq. (3) have been adopted for performance comparison: saliency only, namely setting α = β = 0; involving spatial coherence (Saliency + SC), i.e. β = 0; involving temporal coherence (Saliency + TC), i.e.

α = 0; and involving both spatial and temporal coherence (Saliency + STC). For spatial coherence, we chose α = 0.125 because other values such as 0.025, 0.05, 0.25, 0.75 gave poorer results, while $\eta_I$ and $\eta_M$ were set to 1 because smaller values resulted in lower performance gain, and we found that using both cues was better than using either of them. For the temporal coherence, γ = 2 and β = 0.01 were found to be effective.

The mean performances of the four different settings, illustrated in Figure 4, provide several valuable results.

(i) Spatial coherence increases the precision significantly, but at the same time diminishes the recall somewhat. Overall, it improves the performance in terms of both the F-measure and BDE. Moreover, adding SC to Saliency yielded a 10.9% increase in precision and a 3.8% decrease in recall, whereas adding SC to Saliency + TC improved precision by 14.3% and reduced recall by 7.1%. The performance gains with respect to the F-measure (BDE) for these two cases are 6.2% (10.3%) and 7.0% (10.1%), respectively.

(ii) Temporal coherence increases both the precision and the recall and therefore significantly improves the F-measure and BDE. Moreover, adding TC to Saliency improved precision by 9.2% and recall by 3.8%, whereas adding TC to Saliency + SC improved precision by 12.6% and recall by 0.3%. The performance improvements in terms of F-measure (BDE) for these two cases are 7.6% (23.6%) and 8.3% (23.4%), respectively.

(iii) Involving both SC and TC is better than using only one of them, and can greatly improve the detection performance. Compared with Saliency, Saliency + STC improved the F-measure by 15.1% and reduced BDE by 31.3%.

Figure 5 shows two representative object detection results using these four different strategies. The benefits of involving spatial and temporal coherence can be clearly seen.

The experimental results show that spatial coherence and temporal coherence are in fact complementary. The existence of either of them has no significant impact on the properties of the other.
Although in our model the spatial coherence includes the between-frame motion, it only requires the motion field within the object to be as consistent as possible, whereas temporal coherence constrains the magnitude of the motion vectors.

Figure 4 Mean performance comparison for evaluating the effectiveness of spatial and temporal coherence. The four different settings on the horizontal axis are: 1, no coherence (saliency only); 2, involving spatial coherence; 3, involving temporal coherence; 4, involving both spatial and temporal coherence.

Figure 5 Qualitative results using sampled video frames. (a) Input video segments with ground truth labels; (b), (c), (d) and (e) detection results using the four different strategies: Saliency, Saliency + SC, Saliency + TC and Saliency + STC, respectively.
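For readers who wish to reproduce the region-based measures on bounding boxes, the following Python sketch computes precision, recall and a weighted F-measure for one detected box against one ground-truth box. The text does not spell out the formulas, so the definitions here are assumptions: precision and recall are taken as the overlap area divided by the detected and ground-truth areas, respectively, and the F-measure is taken in the form $F_\alpha = (1+\alpha)PR/(\alpha P + R)$ with α = 0.5. The function names are hypothetical, and BDE is not covered.

```python
def box_area(box):
    """Area of a box (x0, y0, x1, y1) with exclusive upper bounds; 0 if empty."""
    x0, y0, x1, y1 = box
    return max(0, x1 - x0) * max(0, y1 - y0)

def box_precision_recall(detected, truth):
    """Overlap-based precision and recall of a detected box against the ground truth."""
    overlap = box_area((max(detected[0], truth[0]), max(detected[1], truth[1]),
                        min(detected[2], truth[2]), min(detected[3], truth[3])))
    detected_area = box_area(detected)
    truth_area = box_area(truth)
    precision = overlap / detected_area if detected_area else 0.0
    recall = overlap / truth_area if truth_area else 0.0
    return precision, recall

def f_measure(precision, recall, alpha=0.5):
    """Weighted harmonic mean (1 + alpha) * P * R / (alpha * P + R) (assumed form)."""
    denominator = alpha * precision + recall
    return (1 + alpha) * precision * recall / denominator if denominator else 0.0
```

A box that is too tight keeps precision high at the cost of recall, while a box that is too loose does the opposite. This matches the trade-off reported for spatial coherence in item (i).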

5 Conclusion and discussion

Inspired by the focused attention inherent in human vision, we proposed a novel method for salient object detection that incorporates both spatial and temporal coherence with traditional saliency measurements. Experimental results demonstrate the effectiveness and complementariness of the spatial and temporal coherence of saliency features. The method is simple and flexible in the sense that any saliency computation models can be directly embedded herein, and new cues on modeling the within-frame spatial coherence and between-frame temporal coherence can easily be added. Moreover, dynamic programming and a coarse-to-fine search strategy have been used to solve the problem efficiently.

The limitation of this work is that it assumes that the salient object appears throughout the whole video segment, which implicitly requires the segment to be extracted in advance from a longer video. A possible future work is to detect when the focused attention starts and ends, so that the video segment can be localized automatically.

This work was supported by the National Natural Science Foundation of China ( and ) and the National Basic Research Program of China (2007CB311005).

1 Itti L, Rees G, Tsotsos J. Neurobiology of Attention. San Diego: Elsevier,
2 Treue S. Visual attention: The where, what, how and why of saliency. Curr Opin Neurobiol, 2003, 13:
3 Liu T, Sun J, Zheng N N, et al. Learning to detect a salient object. In: Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Minneapolis, USA,
4 Liu T, Yuan Z J, Sun J, et al. Learning to detect a salient object. IEEE Trans Pattern Anal Mach Intell, 02 Mar. 2010, doi: /TPAMI
5 Li J, Tian Y H, Huang T J, et al. Probabilistic multi-task learning for visual saliency estimation in video. Int J Comput Vis, 2010, 90:
6 Palmer S. Vision Science: Photons to Phenomenology. Cambridge, MA: The MIT Press,
7 Itti L, Koch C, Niebur E. A model of saliency-based visual attention for rapid scene analysis. IEEE Trans Pattern Anal Mach Intell, 1998, 20:
8 Itti L, Baldi P.
A principled approach to detecting surprising events in video. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Diego, CA, USA,
9 Walther D, Koch C. Modeling attention to salient proto-objects. Neural Netw, 2006, 19:
10 Treisman A M, Gelade G. A feature-integration theory of attention. Cogn Psychol, 1980, 12:
11 Alexe B, Deselaers T, Ferrari V. What is an object? In: Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, USA,
12 Moosmann F, Larlus D, Jurie F. Learning saliency maps for object categorization. In: Proceedings of ECCV International Workshop on the Representation and Use of Prior Knowledge in Vision, Graz, Austria,
13 Peters R J, Itti L. Beyond bottom-up: Incorporating task-dependent influences into a computational model of spatial attention. In: Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Minneapolis, USA,
14 Cao L L, Li F F. Spatially coherent latent topic model for concurrent segmentation and classification of objects and scenes. In: Proceedings of IEEE International Conference on Computer Vision, Rio de Janeiro, Brazil,
15 Weiss Y, Adelson E H. A unified mixture framework for motion segmentation: Incorporating spatial coherence and estimating the number of models. In: Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, USA,
16 Moscheni F, Dufaux F, Kunt M. Object tracking based on temporal and spatial information. In: Proceedings of IEEE Conference on Acoustics, Speech, and Signal Processing, Atlanta, USA,
17 Kim M, Choi J G, Kim D, et al. A VOP generation tool: Automatic segmentation of moving objects in image sequences based on spatio-temporal information. IEEE Trans Circuits Syst Video Technol, 1999, 9:
18 Tsaig Y, Averbuch A. Automatic segmentation of moving objects in video sequences: A region labeling approach. IEEE Trans Circuits Syst Video Technol, 2002, 12:
19 Lafferty J, McCallum A, Pereira F. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In: Brodley C E, Danyluk A P, eds.
Proceedings of International Conference on Machine Learning, Williams College, Williamstown, MA, USA, 2001,
20 Baker S, Roth S, Scharstein D, et al. A database and evaluation methodology for optical flow. In: Proceedings of IEEE International Conference on Computer Vision, Rio de Janeiro, Brazil,
21 Liu T, Zheng N N, Ding W, et al. Video attention: Learning to detect a salient object sequence. In: Proceedings of International Conference on Pattern Recognition, Tampa, Florida, USA,
22 Brox T, Bruhn A, Papenberg N, et al. High accuracy optical flow estimation based on a theory for warping. In: Pajdla T, Matas J, eds. Proceedings of European Conference on Computer Vision, Prague, Czech Republic,
23 Rosenfeld A, Pfaltz J. Sequential operations in digital picture processing. J ACM, 1966, 13:
24 Yang H, Tian J, Chu Y, et al. Spatiotemporal smooth models for moving object detection. IEEE Signal Process Lett, 2008, 15:
25 Sun J, Zhang W, Tang X, et al. Bidirectional tracking using trajectory segment analysis. In: Proceedings of IEEE International Conference on Computer Vision, Beijing, China,

Open Access This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
