Long-Term Moving Object Segmentation and Tracking Using Spatio-Temporal Consistency

Di Zhong and Shih-Fu Chang
{dzhong, sfchang}@ee.columbia.edu
Department of Electrical Engineering, Columbia University, NY, USA

Abstract

The success of object-based media representation and description (e.g., MPEG-4 and 7) depends largely on effective object segmentation tools. In this paper, we expand our previous work on automatic video region tracking and develop a robust moving object detection system. In our system, we first utilize innovative methods of combining color and edge information in improving the object motion estimation results. Then we use the long-term spatio-temporal constraints to achieve reliable object tracking over long sequences. Our extensive experiments demonstrate excellent results in handling challenging cases in general domains (e.g., stock footage) including depth-varying multi-layer background and fast camera motion.

1. Introduction

The newly established MPEG-4 standard has proposed an object-based framework for efficient multimedia representation. Similarly, the upcoming MPEG-7 standard, which aims at offering a comprehensive set of audiovisual description tools, also adopts an object-oriented model to capture information about objects, events, scenes and their relationships. In both standards, segmentation of objects is non-normative and is left to technology developers and researchers. Thus, the success of object-based media representation and description depends largely on effective tools for object segmentation. Although much work has been done in decomposing images into regions with uniform features, we are still lacking robust techniques for segmenting semantic video objects in general video sources.

In our previous work, AMOS [7], we developed a general interactive tool for semantic object segmentation. It can be used in offline applications where object-based compression and indexing is needed. In the case when real-time processing is required, user inputs are usually not feasible or very limited.
For example, in broadcast sports or news programs, if we want to parse and summarize video objects and events in real time, automatic object extraction methods are needed. In this paper, we apply and expand our previous work on automatic video region tracking [6] and develop an automatic moving object tracking system by grouping low-level regions using domain models. Specifically, we look at the motion characteristics of objects and extract salient moving objects from complex scenes. Our main objectives are a system that is real-time, fully automatic, and capable of handling practical situations involving complex scenes. These combined features distinguish our system from existing works.

Except for some special cases (e.g., surveillance videos), common TV programs and home videos usually contain camera motions. In these situations, to detect moving objects, we first need to compensate motions caused by camera operations. As pointed out in [1], the camera-induced image motion depends on the ego-motion parameters (i.e., rotation, zoom and translation) of the camera and the depth of each point in the scene. In general, estimating the depth information and these physical parameters is an inherently ambiguous problem. Existing camera motion detection approaches can be generally divided into two classes: 2D algorithms that assume the scene can be approximated by a flat surface, and 3D algorithms that work well only when significant depth variations are present in the scene. It has been noticed [1] that in 2D scenes where the depth variations are not significant, the 3D algorithms are not robust or reliable. On the other hand, 2D algorithms using a 2D global parametric model (e.g., affine model) cannot handle 3D scenes where there are multiple moving layers under camera motions. As depth information is typically not well preserved, 2D algorithms are used more widely than 3D algorithms. When the scene is far from the camera and/or the camera motion only includes rotation and zoom, a single affine motion model can be used to model and compensate the camera-induced motion.
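To make the single-model case concrete, the following is a minimal sketch of fitting one affine model to a set of flow vectors by least squares and measuring what is left after compensation. It assumes the common six-parameter form u = a0 + a1*x + a2*y, v = a3 + a4*x + a5*y; this parameterization and the helper names `solve3`, `fit_affine`, `predict` and `residual` are illustrative assumptions and are not taken from the paper.

```python
import math


def solve3(a, b):
    """Solve the 3x3 system a*x = b by Gaussian elimination with partial pivoting."""
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("points are degenerate (e.g., collinear)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, 3):
            f = m[r][col] / m[col][col]
            for c in range(col, 4):
                m[r][c] -= f * m[col][c]
    x = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):  # back substitution
        s = m[i][3] - sum(m[i][j] * x[j] for j in range(i + 1, 3))
        x[i] = s / m[i][i]
    return x


def fit_affine(points, flows):
    """Least-squares fit of (a0..a5) to flow vectors (u, v) observed at (x, y)."""
    ata = [[0.0] * 3 for _ in range(3)]  # normal matrix, shared by u and v
    atu = [0.0] * 3
    atv = [0.0] * 3
    for (x, y), (u, v) in zip(points, flows):
        basis = (1.0, x, y)
        for i in range(3):
            atu[i] += basis[i] * u
            atv[i] += basis[i] * v
            for j in range(3):
                ata[i][j] += basis[i] * basis[j]
    return solve3(ata, atu) + solve3(ata, atv)


def predict(model, x, y):
    """Flow vector predicted by the affine model at pixel (x, y)."""
    a0, a1, a2, a3, a4, a5 = model
    return a0 + a1 * x + a2 * y, a3 + a4 * x + a5 * y


def residual(model, points, flows):
    """Mean magnitude of the flow left after subtracting the model's prediction."""
    total = 0.0
    for (x, y), (u, v) in zip(points, flows):
        pu, pv = predict(model, x, y)
        total += math.hypot(u - pu, v - pv)
    return total / len(points)
```

In this sketch, pixels whose flow is well explained by the fitted model have a small residual and can be attributed to camera motion, while a large residual suggests either an independently moving object or a second motion layer.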
When the scene is close to the camera and the camera is translating, however, multiple moving planar surfaces may be produced in the image sequence. For example, in Figure 3, the fourth sequence contains many motion layers: the ground, the skater and the wall. In general, the above two scenarios may follow each other in the same video shot with gradual transitions between them.

To manage this problem, many approaches have been proposed that use multiple 2D parametric models to capture multiple motion layers. In [5], affine motion parameters are first estimated from the optical flow by linear regression, and then spatio-temporal segmentation is obtained by clustering in the affine parametric space. In [2], a dominant motion is first estimated by means of a merge procedure. Then motion vectors that can be well represented by this dominant motion model are identified and excluded, and secondary affine parameters are estimated from the remaining blocks. This procedure is repeated until all motion layers are detected. Similar approaches are also reported in [4]. These methods rely only on motion information in grouping image pixels or blocks into motion layers, and thus usually result in inaccurate segmentation on motion boundaries. As there is a strong dependence between motion estimation and layer segmentation, without good segmentation to begin with, motion estimation results will not be accurate.

Another problem in most prior works is that object tracking is not adequately addressed. It is assumed that moving objects detected at individual frames automatically form the temporal object track. In real-world scenes, objects and camera usually do not have uniform temporal motions. Objects may show obvious motion in some frames, but show slight or even no motion in other frames. This introduces inconsistent detection results in long sequences.

To solve these problems, expanding our region segmentation and tracking algorithms proposed in [6], we develop a two-stage moving object detection method. This method uses regions with accurate boundaries to effectively improve motion estimation results, and uses the temporal constraint to achieve more reliable object tracking results over long sequences.

In the rest of the paper, we first give an overview of the system. The two detection stages are discussed in Sections 3 and 4. Experimental results and discussions are given in Section 5.

2. System Overview

The system contains two stages (Figure 1).
In the first stage, we apply an iterative motion layer detection process based on the estimation and merging of affine motion models. Each iteration generates one motion layer. The difference from existing methods is that motion models are estimated from spatially segmented color regions instead of just pixels or blocks.

Figure 1. Two-stage moving object detection based on region segmentation and tracking: (1) iterative motion layer detection; (2) object extraction using spatio-temporal constraints.

In the second stage, temporal constraints are applied to detect moving objects in spatial and temporal space. Layers in individual frames are linked together based on characteristics of their underlying regions. One or more layers will be declared as moving objects according to specific spatio-temporal consistency rules.

3. Iterative Motion Layer Detection

The iterative layer detection is applied to each individual frame as shown in Figure 2. The initial input to the system includes image regions automatically extracted using color and edge information. First, non-background regions (see footnote 1) are merged into motion layers according to their affine motion models, e.g., the 8-parameter ego-motion model. Because different regions that belong to the same motion layer may have different estimated parameters due to inaccuracy in the initial dense motion field, a simple clustering approach in the affine parametric space usually does not work well. To solve this problem, we use the following distance measure to compare two neighboring regions $R_i$ and $R_j$:

$$D(i, j) = \min\big(\mathrm{MCErr}(R_i, M_j),\ \mathrm{MCErr}(R_j, M_i)\big) \qquad (1)$$

where $M_i$ and $M_j$ are the affine motion models of regions $R_i$ and $R_j$ respectively, and $\mathrm{MCErr}(R, M)$ is the motion compensation error of region $R$ under motion model $M$. A region is merged with its closest neighbor if their distance is below a given threshold TH_AFF.

Figure 2. Iterative motion layer detection procedure. Flowchart labels: video regions, motion-based region merge, detect background layer, foreground layers, background layer.

After regions are merged into motion layers, we try to identify one background layer in each iteration.
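The merge criterion of Eq. (1) can be sketched as follows. The paper does not specify how MCErr is computed, so this sketch assumes it is the mean magnitude of the flow residual inside the region; the six-parameter model layout, the dictionary fields `points`, `flows` and `model`, and the function names are likewise hypothetical.

```python
import math


def predict(model, x, y):
    """Flow predicted at (x, y) by an assumed six-parameter affine model."""
    a0, a1, a2, a3, a4, a5 = model
    return a0 + a1 * x + a2 * y, a3 + a4 * x + a5 * y


def mc_err(region, model):
    """Assumed MCErr: mean flow residual of the region's pixels under `model`."""
    total = 0.0
    for (x, y), (u, v) in zip(region["points"], region["flows"]):
        pu, pv = predict(model, x, y)
        total += math.hypot(u - pu, v - pv)
    return total / len(region["points"])


def distance(ri, rj):
    """Eq. (1): compensate each region with the other's model, keep the smaller error."""
    return min(mc_err(ri, rj["model"]), mc_err(rj, ri["model"]))


def closest_mergeable(name, regions, neighbors, th_aff):
    """Return the closest neighbor of `name` if its distance is below th_aff, else None."""
    candidates = neighbors.get(name, [])
    if not candidates:
        return None
    best = min(candidates, key=lambda n: distance(regions[name], regions[n]))
    if distance(regions[name], regions[best]) < th_aff:
        return best
    return None
```

Because the distance compares each region's motion against its neighbor's model instead of comparing parameter vectors directly, two regions with noisy but compatible parameter estimates can still be merged, which is the motivation given above for avoiding clustering in parameter space.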
Background identification is based on the assumption that a foreground layer usually has discontinuous motion fields around most of its outer boundaries, while the background layer usually has continuous outer boundaries with neighboring background layers. The boundaries of a layer consist of pixels that have at least one neighboring pixel not belonging to the layer. The outer boundary is the outermost closed curve that contains the whole layer. Assume $b_1, \ldots, b_n$ are the $n$ points along the outer boundary of a layer $l$ (pixels on the frame boundary are not considered). We define the following energy function to measure its boundary discontinuity:

$$E_l = \frac{1}{n} \sum_{p=b_1}^{b_n} G(p), \qquad G(p) = \max\big(\|p_1 - p_8\|,\ \|p_2 - p_7\|,\ \|p_3 - p_6\|,\ \|p_4 - p_5\|\big) \qquad (2)$$

where $p_1$ to $p_8$ are the motion vectors of the 8 neighbors of $p$ (numbered clockwise, with $p_1$ at the upper-left corner). This energy function is similar to common edge detection operators such as the Roberts operator. A layer $l$ is detected as a potential background layer only when $E_l$ is smaller than a threshold (e.g., 0.4). If no background layer is detected, the algorithm stops and all remaining regions belong to foreground layers. When there is more than one possible background layer, the largest one is chosen as the background, and its affine motion model is used to compensate non-background regions. Those regions with small compensation errors are classified as background and excluded from the next iteration of layer merging and detection. After multiple iterations, multiple background layers may be produced, while multiple foreground layers remain.

Footnote 1: In the first iteration, all regions are non-background regions.

Figure 2 labels: Y: detect and exclude background regions; N: moving layers. Solid: layer boundary; dashed: region boundary.

4. Object Extraction Using Spatio-Temporal Constraints

The foreground layers detected at individual frames may not be reliable, for several reasons. First, the motion field and motion models may be inaccurate. Second, and more importantly, a moving object may have noticeable motion in some frames, where it can be easily detected, but in other frames it may be static and is mistakenly treated as background. A decision over a long-term interval (e.g., a shot) is necessary to remove such errors and achieve reliable results.

To apply temporal constraints, we first link foreground layers (i.e., tracking) in individual frames according to their underlying regions.
A foreground layer $L^m$ in frame $m$ is linked with a layer $L^n$ in frame $n$ if the following condition is satisfied:

$$|L^m \cap L^n| = \max_{k,l} |L^m_k \cap L^n_l| \qquad (3)$$

where $L^m_k$ and $L^n_l$ are the $k$-th and $l$-th foreground layers in frames $m$ and $n$ respectively. The maximum is computed over all foreground layers in frames $m$ and $n$. The intersection of two layers in Eq. (3) is defined as the number of common regions they both contain. Two regions in different frames are said to be common if one is tracked by motion projection from the other. In other words, layer $L^m$ in frame $m$ is linked to the layer in a previous frame $n$ that shares the most common regions. This process is iterated over the foreground layers that remain unlinked. In addition, we define the link as a transitive relationship: if layers A and B are linked and layers B and C are linked, then A and C are also linked. This ensures that each local motion layer belongs to one and only one temporal layer.

The above linking or tracking process results in a number of groups of foreground layers. We refer to these groups as temporal layers below. We use several spatio-temporal constraints to validate these temporal layers. The first one is the duration of a temporal layer. Layers with short duration are likely to be noise or background regions, and thus are dropped. Second, the frame-to-frame changes of the center coordinates and sizes of a temporal layer are examined. If there are large abrupt changes, the temporal layer is not a valid track and will not be detected as a foreground object. Finally, a morphological open and close procedure is applied at individual frames to remove small isolated regions and to fill holes within a moving layer.

There are some issues that are not addressed in our approach. For example, temporal occlusion is not considered here. When a moving object is first moving, then occluded by another object or the background, and later appears as a separate moving object again, it will be treated as a new moving object. However, we can use region-based object matching [3] to detect reoccurrence of the same object.

5. Results and Discussion

In Figure 3, each row includes the image of frame #1, and then shows the moving object tracking results at frames #1, #10, #20 and #30. All sequences have depth variance and camera motion (i.e., following the moving objects) in the scenes, resulting in multiple motion layers. The first sequence contains a skater running towards the camera. The ice field has a gradual depth change from near to far. In the second sequence, a person is walking away from the camera in an office. Cubicle walls exist at different depths. The third sequence is a bird's-eye view of a soccer player running in the field. Sequence 4 contains three background layers, which are the ground, wall and crowd. The last sequence contains the sky, the stage and the jumping skier. Note that regions within segmented objects are shown in random colors to demonstrate region segmentation results. One region being tracked at different frames is shown with the same color.

The gradual depth change in sequence 1 does not cause much of a problem, as the ground is merged into one large region in the first color-based region segmentation stage. In sequence 2, the cubicle walls are tracked as separate regions. Although these regions are classified as foreground motion layers in some frames, their temporal durations are short and thus they are considered as background. In the third sequence, both the player and the grass field have gradual depth variances. Similar to the first sequence, color segmentation proves to be useful in handling such situations. The above three sequences show good tracking results. Some small background regions are falsely included in sequence 4. These regions are mainly from the connecting parts of two background regions, and usually have inaccurate motion fields. Some foreground pixels are missed in sequence 5 because small isolated regions are removed in the final morphological operations.

In summary, our experiments demonstrate that the long-term region-based moving object detection approach is more robust and reliable than existing approaches that only use local motion information (e.g., frame-to-frame motion field). The method is designed to automatically detect and track salient moving objects within scenes with multiple motion layers. By using temporal constraints, we can robustly and accurately segment moving objects over a long period. Our method can also handle objects with discontinuous motions (i.e., moving in some frames and still in other frames).

References

1. G. Adiv, "Inherent ambiguities in recovering 3D motion and structure from a noisy flow field," IEEE Trans. on Pattern Analysis and Machine Intelligence, 11:447-489, May 1989.
2. G. D. Borshukov, G. Bozdagi, Y. Altunbasak and A. M. Tekalp, "Motion segmentation by multistage affine classification," IEEE Transactions on Image Processing, Vol. 6, No. 11, Nov. 1997.
3. S.-F. Chang, W. Chen, H. Meng, H. Sundaram and D. Zhong, "VideoQ: An Automated Content-Based Video Search System Using Visual Cues," ACM 5th Multimedia Conference, Seattle, WA, Nov. 1997.
4. F. Moscheni, F. Dufaux and M. Kunt, "A new two-stage global/local motion estimation based on a background/foreground segmentation," IEEE Proc. ICASSP 95, Detroit, MI, May 1995.
5. J. Y. A. Wang and E. H. Adelson, "Spatio-temporal segmentation of video data," SPIE Proc. Image and Video Processing II, San Jose, CA, Feb. 1994.
6. D. Zhong and S.-F. Chang, "Video Object Model and Segmentation for Content-Based Video Indexing," ISCAS'97, Hong Kong, June 9-12, 1997.
7. D. Zhong and S.-F. Chang, "AMOS - An Active MPEG-4 Video Object Segmentation System," ICIP-98, Chicago, Oct. 1998.

Figure 3. Moving object detection and tracking results of five image sequences, rows (1) to (5) (detected objects are shown at frames #1, #10, #20 and #30). Test videos are kindly provided by Actions, Sports, Adventures Inc. and Hot Shots Cool Cuts Inc. for research.
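As a supplementary illustration of the layer linking in Section 4, the sketch below links foreground layers between consecutive frames by greatest region overlap, in the spirit of Eq. (3), groups them transitively with a union-find structure, and drops short temporal layers. Several simplifications are assumptions and not part of the paper: each layer is represented as a set of region track IDs (a region tracked by motion projection keeps its ID), only consecutive frames are linked, each layer is linked at most once per frame pair, and the names `link_layers` and `drop_short` are hypothetical.

```python
def link_layers(frames):
    """Group per-frame layers into temporal layers.

    frames: list over frames; each frame is a list of layers; each layer is a
            set of region track IDs.
    Returns a sorted list of temporal layers, each a sorted list of
    (frame_index, layer_index) pairs.
    """
    parent = {}
    for f, layers in enumerate(frames):
        for i in range(len(layers)):
            parent[(f, i)] = (f, i)

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    for m in range(1, len(frames)):
        n = m - 1
        overlaps = [
            (len(cur & prev), k, l)
            for k, cur in enumerate(frames[m])
            for l, prev in enumerate(frames[n])
        ]
        overlaps.sort(key=lambda t: -t[0])  # largest overlap first
        used_m, used_n = set(), set()
        for count, k, l in overlaps:
            if count == 0:
                break  # no common regions left to link
            if k in used_m or l in used_n:
                continue  # iterate only over layers that remain unlinked
            parent[find((m, k))] = find((n, l))
            used_m.add(k)
            used_n.add(l)

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), []).append(node)
    return sorted(sorted(g) for g in groups.values())


def drop_short(temporal_layers, min_frames):
    """Duration constraint: keep temporal layers spanning at least min_frames frames."""
    return [g for g in temporal_layers if len({f for f, _ in g}) >= min_frames]
```

The union-find grouping is one straightforward way to realize the transitive link, since every per-frame layer ends up in exactly one group. The checks on abrupt changes of center and size and the morphological clean-up are not covered by this sketch.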