Deformable Parts Correlation Filters for Robust Visual Tracking



PAPER UNDER REVISION

Alan Lukežič, Luka Čehovin, Member, IEEE, and Matej Kristan, Member, IEEE

arXiv:1605.03720v1 [cs.CV], May 2016

Abstract: Deformable parts models show a great potential in tracking by principally addressing non-rigid object deformations and self occlusions, but according to recent benchmarks, they often lag behind the holistic approaches. The reason is that a potentially large number of degrees of freedom have to be estimated for object localization, and simplifications of the constellation topology are often assumed to make the inference tractable. We present a new formulation of the constellation model with correlation filters that treats the geometric and visual constraints within a single convex cost function and derive a highly efficient optimization for MAP inference of a fully-connected constellation. We propose a tracker that models the object at two levels of detail. The coarse level corresponds to a root correlation filter and a novel color model for approximate object localization, while the mid-level representation is composed of the new deformable constellation of correlation filters that refine the object location. The resulting tracker is rigorously analyzed on the highly challenging OTB, VOT2014 and VOT2015 benchmarks, exhibits state-of-the-art performance and runs in real-time.

Index Terms: Computer vision, visual object tracking, correlation filters, spring systems, short-term tracking.

1 INTRODUCTION

Short-term single-object visual tracking has received significant attention from the computer vision community over the last decade, with numerous conceptually diverse tracking algorithms being proposed every year. Recently several papers reporting experimental comparison of trackers on a common testing ground have been published [], [2], [3], [4]. Results show that tracking quality depends highly on the expressiveness of the feature space in the object appearance model and on the inference algorithm that converts the features into a presence score in the observed parameter space. Most of the popular trackers apply holistic appearance models which capture the object appearance by a single patch. In combination with efficient machine-learning and signal processing techniques from online classification and regression, these trackers exhibited top performance across all benchmarks [5], [6], [7], [8]. Most of these approaches apply sliding windows for object localization, and some extend the local search in the scale space [9], [], [], [2] to address the scale changes as well. Nevertheless, a single patch often poorly approximates objects that undergo significant, potentially nonlinear, deformation, self occlusion and partial occlusions, leading to drift, model corruption and eventual failure. Such situations are conceptually better addressed by part-based models that decompose the object into a constellation of parts. This type of tracker shows a great potential in tracking nonrigid objects, but its performance often falls behind the holistic models [4], because of the large number of degrees of freedom that have to be estimated in the deformation model during tracking. Čehovin et al. [3] therefore propose that part-based models should be considered in a layered framework that decomposes the model into a global and local layer to increase the stability of deformation parameters estimation in presence of uncertain visual information. Most part-based trackers use very small parts, apply low-level features for the appearance models, e.g., histograms [3], [4] or keypoints [5], [6], and increase their discrimination power by increasing the number of parts. The object is localized by optimizing a trade-off between the visual and geometric agreement. Most of the recent trackers use star-based topology, e.g. [4], [5], [7], [8], [9], [2], or local connectivity, e.g. [3], instead of a fully-connected constellation [6] to make the inference tractable, but at a cost of a reduced power of the geometric model.

A. Lukežič, L. Čehovin and M. Kristan are with the Faculty of Computer and Information Science, University of Ljubljana, Slovenia. See http://

[Figure 1 panel labels: new frame and result from the previous frame; coarse localization; initialize mid-level parts; form a spring system; minimize the energy; update the constellation; update the visual models; the output bounding box.]

Fig. 1. Illustration of coarse-to-fine tracking by spring system energy minimization in a deformable part model (top). Tracking examples with our tracker DPT (yellow), KCF (red), IVT (blue) and Struck (magenta) are shown in the bottom.

In this paper we present a new class of layered part-based trackers that apply a geometrically constrained constellation of local correlation filters [8], [] for object localization. We introduce a new formulation of the constellation model that allows efficient optimization of a fully-connected constellation and adds only a negligible overhead to the tracking speed. Our part-based correlation filter formulation is cast in a layered part-based tracking framework [3] that decomposes the target model into a coarse layer and a local layer. A novel segmentation-based coarse model is introduced as well. Our tracker explicitly addresses the nonrigid deformations and (self-)occlusions, resulting in increased robustness compared to the recently proposed holistic correlation filters [] as well as state-of-the-art part-based trackers.

1.1 Related work

Popular types of appearance models frequently used for tracking are generative holistic models like color histograms [2] and subspace-based [22], [23] or sparse reconstruction templates [24]. Several papers explored multiple generative model combinations [2], [25] and recently Gaussian process regressors were proposed for efficient updating of these models [26]. The cost function in generative holistic models reflects the quality of global object reconstruction in the chosen feature space, making the trackers prone to drifting in presence of local or partial object appearance changes or whenever the object moves on a visually-similar background. This issue is better addressed by the discriminative trackers which train an online object/background classifier and apply it to object localization. Early work includes support vector machines (SVM) [27], online Adaboost [7], multiple-instance learning [6] and recently excellent performance was demonstrated by structured SVMs [5].
A color-based discriminative model was recently presented in [28] that explicitly searches for potential visual distractors in the object vicinity and updates the model to increase the discriminative power. The recent revival of the matched filters [29] in the context of visual tracking has shown that efficient discriminative trackers can be designed by online learning of a correlation filter that minimizes the signal-to-noise ratio cost function. These filters exhibit excellent performance at high speeds, since learning and matching are carried out by exploiting the efficiency of the fast Fourier transform. Bolme et al. [8] introduced the first successful online matched filter, now commonly known as a correlation filter tracker. Their tracker was based on grayscale templates, but recently the correlation filters have been extended to multidimensional features [9], [], [], and Henriques et al. [] introduced kernelized versions. Scale adaptation of correlation filters was investigated by Danelljan et al. [9] and Zhang et al. [3], who applied correlation filters to the scale space, and by [3], who combined votes of multiple automatically allocated filters. Zhang et al. [32] have shown the connection to spatio-temporal context learning. Hong et al. [33] have recently integrated correlation filters in a multi-store tracking framework and demonstrated excellent performance. In fact, the correlation filter-based trackers have demonstrated excellent performance across all the recent benchmarks. Still, these trackers suffer from the general drawback of holistic models: they do not explicitly account for deformation, self occlusion and partial occlusions, leading to drift, model corruption and eventual failure. This issue is conceptually better addressed by models that decompose the object into parts. The part-based trackers apply constellations of either generative or discriminative local models and vary significantly in the way they model the constellation geometry. Hoey [34] used a flock-of-features tracking in which parts are independently tracked by optical flow.
The flock is kept on the object by identifying parts that deviate too far from the flock and replacing them with new ones. But because of weak geometric constraints, tracking is prone to drifting. Vojir et al. [35] addressed this issue by significantly constraining the extent of each part displacement and introduced tests of estimation quality. Tracking robustness is increased by only considering the part displacements deemed accurately estimated. Martinez et al. [36] proposed connecting triples of parts and tracked them by kernels while enforcing locally-affine deformations. The local connectivity resulted in inefficient optimization and parts required careful manual initialization. Artner et al. [6] proposed a key-point-based tracker with a fully-connected constellation. They use a geometric model that enforces preservation of inter-keypoint distance ratios. Because the ratios are not updated during tracking and due to the ad-hoc combination of geometric and appearance models, the resulting optimization is quite brittle, requiring manual initialization of parts, and the resulting tracker handles only moderate locally-affine deformations. Pernici et al. [37] address nonrigid deformations by oversampling key-points to construct multiple instance-models and use a similarity transform for matching. However, the tracker still fails at significant nonrigid deformations. Several works simplify the geometric model to a star-based topology in the interest of simplified optimization. A number of these works apply part detectors and a generalized Hough transform for localization. Examples of part detectors are key-points [5], random forest classifiers [9], ferns [38] and pixels [39]. Cai et al. [7] apply superpixels as parts combined with segmentation for efficient tracking, but the heavy reliance on color results in significant failures during illumination changes. Kwon et al. [4] apply generative models in a star-based topology with adding and removing parts and Čehovin et al. [3] increase the power of the geometric model by local connectivity. Both approaches require efficient stochastic optimizers for inference. Yao et al.
[8] address the visual and geometric model within a single discriminative framework. They extend the structured SVM [5] to multiple part tracking, but cannot handle scale changes. This model was extended by Zhu et al. [2] to account for context as well, but uses a star-based topology for making the inference tractable. Context was also used by Duan et al. [4], where tracking multiple objects or object parts was used to resolve ambiguities. Part-based trackers often suffer from the potentially large number of parameters of the deformation model to be estimated from uncertain/noisy visual data. This is addressed by the layered paradigm of part-based trackers introduced by Čehovin et al. [3]. This paradigm decomposes the tracker architecture into a global coarse and a local appearance layer. The global layer contains coarse target representations such as holistic templates and global color histograms, while the local layer is the constellation of parts with simple local appearance description. The paradigm applies a top-down localization to gradually estimate the state parameters (i.e., target center and part locations) and bottom-up updates to update the appearance models. Čehovin et al. [3] analyzed various modalities used at the global layer (i.e., color, local motion and shape) and their influence on tracking. They have concluded that color plays the most important role at the scale of the entire object.

1.2 Our approach and contributions

Our main contribution is a new class of fully-connected part-based correlation filter trackers. Most part-based trackers apply star-based topology to simplify the inference or combine geometrical and visual constraints in an ad-hoc fashion, often leading to a nonconvex optimization problem. In contrast, our formulation treats the geometric and visual constraints within a single convex cost function. We show that this cost function has a dual formulation of a spring system and show that MAP inference of the constellation can be achieved by minimizing the energy of the dual spring system. We derive a highly efficient optimizer that in practice results in a very small computational overhead during tracking. The tracker is formulated within the theoretical framework of layered deformable parts [3] that decomposes the tracker into a coarse representation and a mid-level representation. The coarse representation is composed of a holistic correlation filter and a novel global color model. The mid-level representation is composed of local correlation filters fully-connected by the new constellation model. Tracking is performed by top-down localization and bottom-up updates (Figure 1): The coarse model initializes the mid-level representation at approximate object location.
An equivalent spring system is formed and optimized, yielding a MAP constellation estimate. The parts are updated and the estimated constellation is used to update the coarse model. In contrast to the standard holistic correlation filters, the proposed deformable parts tracker naturally addresses the object appearance changes resulting from scale change, nonrigid deformations and (self-)occlusions, increasing the tracking robustness. Our tracker and the proposed constellation optimization are analyzed in depth. The tracker is rigorously compared against a set of state-of-the-art trackers on the highly challenging recent benchmarks OTB [], VOT2014 [4] and VOT2015 [4] and exhibits state-of-the-art performance. Additional tests show that improvements come from the fully-connected constellation and the top-down/bottom-up combination of the coarse representation with the proposed deformable parts model.

2 DEFORMABLE PARTS TRACKER

As is common practice in visual tracking, the tracker output at time-step $t$ is an axis-aligned bounding box. In our case this region is estimated by the deformable parts correlation filter as we describe in this Section. Our tracker is composed of a coarse representation described in Section 2.2, and of a deformable constellation of parts, a mid-level object representation described in Section 2.3. In the following, we will denote the part positions by $(\cdot)^{(i)}$, where the index $i = 0$ denotes the root part in the coarse layer and indexes $i > 0$ denote parts in the constellation. Since both representations apply kernelized correlation filters (KCF) [] for part localization, we start by briefly describing the KCF in Section 2.1.

2.1 Kernelized correlation filters

This section summarizes the main results of the recent advances in correlation filters and their application to tracking [], [42]. Given a single grayscale image patch $z$ of size $M \times N$, a linear regression function $f(z) = w^T z$ is estimated such that its response is maximal at the center of the patch and gradually reduces for the patch circular shifts $z_{m,n}$, $(m, n) \in \{0, \ldots, M-1\} \times \{0, \ldots, N-1\}$, toward the patch edge.
This is formulated by minimizing the following cost function

$$\epsilon = \| w \star z - \phi \|^2 + \lambda \| w \|^2, \tag{1}$$

where $\star$ denotes circular correlation, $\phi$ is a Gaussian function centered at zero shift (see Figure 2) and $\lambda$ is a ridge regression regularization parameter which controls overfitting. The correlation in (1) is kernelized [] by redefining $w$ as a linear combination of the circular shifts, i.e., $w = \sum_{m,n} a_{m,n} \varphi(z_{m,n})$, where $\varphi(\cdot)$ is a mapping to the Hilbert space induced by a kernel $\kappa(\cdot, \cdot)$. The minimum of (1) is obtained at

$$A = \frac{\Phi}{U_z + \lambda}, \tag{2}$$

where the capital letters denote the Fourier transforms of image-domain variables, i.e., $A = \mathcal{F}[a]$, $\Phi = \mathcal{F}[\phi]$, $U_z = \mathcal{F}[u_z]$, with $u_z(m, n) = \kappa(z_{m,n}, z)$, and $a$ is a dual representation of $w$ [].

Fig. 2. The correlation filter formulation. We seek a weight matrix $w$ that results in a Gaussian response function $\phi$ when correlated over the image patch $z$.

At time-step $t$, a patch $y_t$ of size $M \times N$ is extracted from the image and the probability of the object at pixel location $x_t$ is calculated from the current estimate of $A_t$ and the template $z_t$ as

$$p(y_t \mid x_t, z_t) \propto \mathcal{F}^{-1}[A_t \odot U_y], \tag{3}$$

where $U_y = \mathcal{F}[u_y]$, $u_y(m, n) = \kappa(y_{m,n}, z_t)$. In [], [42], the maximum of $p(y_t \mid x_t, z_t)$ is taken as the new object position. The numerator and denominator of $A_t$ in (2) as well as the patch template $z_t$ are updated separately at


the estimated position by an autoregressive model. The extension of the kernelized filter from grayscale patches to multi-channel features is straightforward and we refer the reader to [], [42] for details.

2.2 The coarse representation

The coarse object representation in our appearance model consists of two high-level object models: the object global template $z_t$ (a root correlation filter) and a global color model $C_t = \{p(x_t \mid f), p(x_t \mid b)\}$, specified by the foreground and background color histograms, $p(x_t \mid f)$ and $p(x_t \mid b)$, respectively, where $x_t$ denotes the pixel coordinates. These models are used in each tracking iteration to coarsely estimate the center $x_t$ of the object bounding box within a specified search region (Figure 1, step 1), which is subsequently refined by the mid-level representation (Section 2.3). Given an image patch $y_t$ extracted from a search region (Figure 3a), the center is estimated by maximizing the probability of object location $x_t$,

$$p(x_t \mid z_t, C_t, y_t) \propto p(y_t \mid x_t, z_t)\, p(y_t \mid x_t, C_t). \tag{4}$$

The first term, $p(y_t \mid x_t, z_t)$, is the template probability reflecting the similarity between the patch centered at $x_t$ and the object template $z_t$, calculated as the response from the correlation filter (3) (see Figure 3b). The second term is the color probability defined as

$$p(y_t \mid x_t, C_t) = p(f \mid x_t, y_t)(1 - \alpha_{\mathrm{col}}) + \alpha_{\mathrm{col}}, \tag{5}$$

where $p(f \mid x_t, y_t)$ is the probability of a pixel at location $x_t$ belonging to the foreground and $\alpha_{\mathrm{col}}$ is a weak uniform distribution that addresses sudden changes of the object color, since $p(f \mid x_t, y_t)$ might be uninformative in these situations and would deteriorate localization. The value of $\alpha_{\mathrm{col}}$ varies with the color informativeness as detailed in Section 2.2.1. The probability $p(f \mid x_t, y_t)$ is calculated by histogram backprojection, i.e., by applying the Bayes rule with $p(x_t \mid f)$ and $p(x_t \mid b)$, and regularized by a Markov random field [43], [44] to arrive at a smoothed foreground posterior (Figure 3c). Multiplying the template and color probabilities yields the density $p(x_t \mid z_t, C_t, y_t)$ (Figure 3d). Notice that on their own, the template and color result in ambiguous densities but their combination drastically reduces the ambiguity.

Fig. 3. Example of a search region and the tracked object indicated by a rectangle and an arrow (a). The coarse template probability, the color probability and the full coarse model density are shown in (b), (c) and (d), respectively.

2.2.1 Color informativeness test

Whenever the object color is similar to the background, or during sudden illumination variations, the color segmentation becomes unreliable and can degrade tracking performance. The color informativeness test is performed by comparing the number of pixels, $M_t^{(\mathrm{fg})}$, assigned to the foreground by the color model $p(f \mid x_t, y_t)$, and the object size from the previous time-step, $M_{t-1}^{(\mathrm{siz})}$ (i.e., the area of the object bounding box). If the deviation from the expected object area is within the allowed bounds, the uniform component in (5) is set to a low value, otherwise it is set to 1, effectively ignoring the color information in the object position posterior (4), i.e.,

$$\alpha_{\mathrm{col}} = \begin{cases} \text{(low value)}; & \alpha_{\min} < \dfrac{M_t^{(\mathrm{fg})}}{M_{t-1}^{(\mathrm{siz})}} < \alpha_{\max} \\ 1; & \text{otherwise} \end{cases} \tag{6}$$

The parameters $\alpha_{\min}$ and $\alpha_{\max}$ specify the interval of the expected number of pixels assigned to the target relative to the target bounding box size from the previous time-step. Since the aim of (6) is only to detect drastic segmentation failures, these values can be set to a very low and very large value, respectively. Figure 4 illustrates the color informativeness test. In Figure 4(a), the number of pixels assigned to the foreground is within the expected bounds, while (b,c) show examples that fail the test by assigning too many or too few pixels to the object.

Fig. 4. Three examples of the color backprojection within the image patch denoted with the yellow bounding box. The regularized backprojection is shown on the left and the binarized segmentation on the right under each image. Example (a) passes the color informativeness test, while (b) and (c) fail the test since too many or too few pixels are assigned to the object.
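The coarse localization of Section 2.2 can be sketched in a few lines of NumPy. This is a minimal sketch, not the authors' implementation: it assumes the template response map from (3) and the smoothed foreground posterior are already available as arrays of equal shape. The function names, the binarization threshold `fg_threshold` and all numeric values in the usage example are illustrative assumptions.

```python
import numpy as np

def color_uniform_weight(n_fg, prev_area, alpha_min, alpha_max, alpha_low):
    """Eq. (6): alpha_low if the foreground pixel count is plausible, else 1."""
    ratio = n_fg / float(prev_area)
    return alpha_low if alpha_min < ratio < alpha_max else 1.0

def coarse_posterior(template_prob, fg_prob, prev_area,
                     alpha_min, alpha_max, alpha_low, fg_threshold=0.5):
    """Combine template and color probabilities, Eqs. (4)-(6).

    template_prob : 2D array, correlation filter response, Eq. (3)
    fg_prob       : 2D array, smoothed foreground posterior p(f | x, y)
    prev_area     : bounding box area from the previous time-step
    fg_threshold  : assumed binarization threshold for counting foreground pixels
    """
    n_fg = int(np.count_nonzero(fg_prob > fg_threshold))
    a_col = color_uniform_weight(n_fg, prev_area, alpha_min, alpha_max, alpha_low)
    color_prob = fg_prob * (1.0 - a_col) + a_col       # Eq. (5)
    posterior = template_prob * color_prob             # Eq. (4), unnormalized
    peak = np.unravel_index(np.argmax(posterior), posterior.shape)
    center = tuple(int(c) for c in peak)               # (row, col) of the maximum
    return posterior, center, a_col

# Usage with made-up values: the template alone prefers (0, 0),
# the color model moves the maximum to (2, 3).
template = np.full((5, 5), 0.1)
template[0, 0], template[2, 3] = 0.9, 0.8
fg = np.zeros((5, 5))
fg[1:4, 2:5] = 0.9
_, center, a_col = coarse_posterior(template, fg, prev_area=9,
                                    alpha_min=0.2, alpha_max=2.0, alpha_low=0.1)
```

When the segmentation fails the test, `a_col` becomes 1, the color term turns into a constant and the posterior reduces to the template response alone.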
2.3 The mid-level representation

The mid-level representation in our tracker is a geometrically constrained constellation of $N_p$ parts $X_t = \{x_t^{(i)}\}_{i=1:N_p}$, where $x_t^{(i)}$ is the position of the $i$-th part (see Figure 5, left). Note that the part sizes do not change during tracking and therefore do not enter the state variable $x_t^{(i)}$. Each part centered at $x_t^{(i)}$ is a local mid-level representation of the object, a kernelized correlation filter, specified by a fixed-size part template $z_t^{(i)}$ and $A_t^{(i)}$ (Section 2.1). The probability of the constellation being at state $X_t$ conditioned on the parts measurements $Y_t = \{y_t^{(i)}\}_{i=1:N_p}$ and parameters of the deformation model $\Theta$ is decomposed into

$$p(X_t \mid Y_t, \Theta) \propto p(Y_t \mid X_t, \Theta)\, p(X_t \mid \Theta). \tag{7}$$


The density $p(Y_t \mid X_t, \Theta)$ is the measurement constraint term, reflecting the agreement of measurements with the current state $X_t$ of the constellation, whereas the second term, $p(X_t \mid \Theta)$, reflects the agreement of the constellation with the geometric constraints.

2.3.1 Geometric constraints

The constellation is specified by a set of links $(i, j) \in \mathcal{L}$ indexing the connected pairs of parts (Figure 5). The parts and links form an undirected graph and the joint pdf over the part states can be factored over the links as

$$p(X_t \mid \Theta) = \prod_{(i,j) \in \mathcal{L}} \phi(\| d^{(i,j)} \|; \mu^{(i,j)}, k^{(i,j)}), \tag{8}$$

where $d^{(i,j)} = x_t^{(i)} - x_t^{(j)}$ is the difference in positions of the linked parts, $\mu^{(i,j)}$ is the preferred distance between the pair of parts and $k^{(i,j)}$ is the intensity of this constraint. The factors in (8) are defined as Gaussians $\phi(\cdot; \mu, k)$ with mean $\mu$ and a precision (inverse variance) governed by $k$, meaning that deviations from the preferred distances decrease the probability (8).

2.3.2 Measurement constraints

Given a fixed part state, $x_t^{(i)}$, the measurement $y_t^{(i)}$ at that part is independent from the states of other parts. The measurement probability decomposes into a product of per-part visual likelihoods

$$p(Y_t \mid X_t, \Theta) = \prod_{i=1:N_p} p(y_t^{(i)} \mid x_t^{(i)}, \Theta). \tag{9}$$

To simplify the combination of the geometric and the visual constraints (Section 2.3.3) it is beneficial to choose the visual likelihoods from the same class of functions as (8). We make use of the fact that the parts appearance models are correlation filters trained on Gaussian outputs, thus the visual likelihoods in (9) can be defined as Gaussians as well. Let $x_A^{(i)}$ be the position in the vicinity of $x_t^{(i)}$ that maximizes the similarity of the appearance model $z_t^{(i)}$ and the measurement $y_t^{(i)}$ (see Figure 5, left).
The visual likelihood can then be defined as a Gaussian $p(y_t^{(i)} \mid x_t^{(i)}, \Theta) = \phi(\| d^{(i)} \|; 0, k^{(i)})$, where $d^{(i)} = x_t^{(i)} - x_A^{(i)}$ is the difference between the part current state and its visually-ideal position, and $k^{(i)}$ is the intensity of this constraint.

2.3.3 The dual spring-system formulation

Substituting equations (8,9) back into (7) leads to an exponential posterior $p(X_t \mid Y_t, \Theta) \propto \exp(-E)$, with

$$E = \frac{1}{2} \sum_{i=1:N_p} k^{(i)} \| d^{(i)} \|^2 + \sum_{(i,j) \in \mathcal{L}} k^{(i,j)} \left( \mu^{(i,j)} - \| d^{(i,j)} \| \right)^2. \tag{10}$$

Note that $E$ corresponds to the energy of a spring system in which pairs of parts are connected by springs and each part is connected by another spring to the image position most similar to the part appearance model (Figure 5, right). The terms $\mu^{(i,j)}$ and $k^{(i,j)}$ are the nominal lengths and stiffness of the springs interconnecting parts (dynamic springs), while $k^{(i)}$ is the stiffness of the spring connecting a part to the image location (static spring). In the following we will refer to the nodes in the spring system that correspond to parts that move during optimization as dynamic nodes and we will refer to the nodes that are anchored to image positions as static nodes, since they do not move during the optimization.

Fig. 5. Example of a constellation model with rectangular parts and arrows pointing to the most visually similar positions (left) and the dual form corresponding to a spring system (right). A constellation with only three nodes is shown for clarity.

The stiffness $k^{(i)}$ of a spring connecting a part to the image (in Figure 5 denoted as static spring) should reflect the uncertainty of the visually best-matching location $x_A^{(i)}$ in the search region of the $i$-th part and is set by the output of the correlation filter. The best matching position $x_A^{(i)}$ is estimated as the location at which the output of the corresponding correlation filter (3) reaches a maximum value (denoted as $w^{(i)}$) and the spatial uncertainty in the search region is estimated as the weighted variance $\sigma^{2(i)}$, i.e., the average of squared distances from $x_A^{(i)}$ weighted by the correlation filter response map.
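For illustration, the spring-system energy can be evaluated directly from the part positions. The sketch below assumes the reading of (10) in which the factor 1/2 applies to the static-spring term only; the function name and argument layout are hypothetical.

```python
import numpy as np

def spring_energy(x, x_a, k_static, links, mu, k_dyn):
    """Energy of the spring system, following the assumed form of Eq. (10).

    x        : (N, 2) current part positions (dynamic nodes)
    x_a      : (N, 2) visually best-matching positions (static nodes)
    k_static : (N,)   stiffness of the static springs
    links    : list of (i, j) index pairs of connected parts
    mu       : nominal length of each link, same order as links
    k_dyn    : stiffness of each link, same order as links
    """
    x = np.asarray(x, dtype=float)
    x_a = np.asarray(x_a, dtype=float)
    # Static springs pull each part toward its best-matching image position.
    d_static = np.linalg.norm(x - x_a, axis=1)
    energy = 0.5 * np.sum(np.asarray(k_static, dtype=float) * d_static ** 2)
    # Dynamic springs penalize deviation from the preferred part distances.
    for (i, j), m, k in zip(links, mu, k_dyn):
        d = np.linalg.norm(x[i] - x[j])
        energy += k * (m - d) ** 2
    return float(energy)

# Two parts: the second is 5 px away from its anchor and the link is
# stretched from its nominal length 5 to 10.
e = spring_energy(x=[[0, 0], [6, 8]], x_a=[[0, 0], [3, 4]],
                  k_static=[1.0, 2.0], links=[(0, 1)], mu=[5.0], k_dyn=[3.0])
```

A configuration in which every part sits on its anchor and every link has its nominal length has zero energy, which is the unstressed state of the spring system.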
The spring stiffness is thus defined by the response strength $w^{(i)}$ and spatial uncertainty, i.e.,

$$k^{(i)} = w^{(i)} / \sigma^{2(i)}. \tag{11}$$

The stiffness of the springs interconnecting the parts (in Figure 5 denoted as dynamic spring) should counter significant deviations from the spring nominal length. Let $d_A^{(i,j)} = x_A^{(i)} - x_A^{(j)}$ be the position difference between the visually most similar positions of the nodes indexed by $i$ and $j$. The stiffness of the spring connecting the nodes is set to

$$k^{(i,j)} = \left( \frac{\mu^{(i,j)} - \| d_A^{(i,j)} \|}{\mu^{(i,j)}} \right)^2. \tag{12}$$

2.4 Efficient MAP inference

The spring system from Section 2.3.3 is a dual representation of the deformable parts model and minimization of its (convex) energy function (10) corresponds to the maximum a posteriori state estimation (7) of the deformable parts model. This means that general-purpose convex energy minimizers can be used to infer the MAP state. But due to the dual spring system formulation, even more efficient optimizers can be derived. In particular, we propose an algorithm that splits a 2D spring system into two 1D systems, solves each in a closed form and then re-assembles them back into a 2D system (see Figure 6). This partial minimization is iterated until convergence. In the following we derive an efficient closed-form solver for a 1D system.
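Before turning to the solver, the two stiffness definitions can be sketched as follows. This is a hedged illustration under stated assumptions: negative filter responses are clipped to zero before they are used as weights, and a small `eps` guards against a zero variance. Both choices, and the function names, are assumptions that do not come from the text.

```python
import numpy as np

def static_stiffness(response, eps=1e-12):
    """Peak location and static-spring stiffness k = w / sigma^2, Eq. (11)."""
    r = np.asarray(response, dtype=float)
    peak = np.unravel_index(np.argmax(r), r.shape)
    w = r[peak]                                   # response strength at the peak
    rows, cols = np.indices(r.shape)
    sq_dist = (rows - peak[0]) ** 2 + (cols - peak[1]) ** 2
    weights = np.clip(r, 0.0, None)               # assumption: ignore negative responses
    sigma2 = np.sum(weights * sq_dist) / max(np.sum(weights), eps)
    peak = (int(peak[0]), int(peak[1]))
    return peak, float(w / max(sigma2, eps))

def dynamic_stiffness(x_a_i, x_a_j, mu):
    """Stiffness of the spring between two parts, as in Eq. (12)."""
    d_a = np.linalg.norm(np.subtract(x_a_i, x_a_j))
    return float(((mu - d_a) / mu) ** 2)

# A sharp, strong peak gives a stiff static spring.
resp = [[0.0, 0.25, 0.0],
        [0.25, 1.0, 0.25],
        [0.0, 0.25, 0.0]]
peak, k_i = static_stiffness(resp)
k_ij = dynamic_stiffness((0, 0), (3, 4), mu=10.0)
```

A flat, ambiguous response map yields a large weighted variance and therefore a soft static spring, so such a part is positioned mainly by its neighbors.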

Fig. 6. Example of decomposition of a 2D spring system with 4 dynamic nodes (circles) and 4 static nodes (diamonds) into two 1D spring systems. Each 1D spring system has a closed-form solution.

Using standard results from Newtonian mechanics, the forces at springs $F$ of a 1D spring system can be written as

$$F = -K(Bx - L), \tag{13}$$

where $K = \mathrm{diag}([k_1, \ldots, k_N])$ is a diagonal matrix of spring stiffness coefficients, $x$ is a vector of 1D node positions, $L = [l_1, \ldots, l_N]$ is a vector of spring nominal lengths and $B$ is an $N_{\mathrm{springs}} \times N_{\mathrm{nodes}}$ connectivity matrix that represents directed connections between the nodes. Let $\{n_{i1}, n_{i2}\}$ be the indexes of the two nodes connected by the $i$-th spring. The entries of $B$ are then defined as

$$b_{ij} = \begin{cases} 1; & j = n_{i1} \\ -1; & j = n_{i2} \\ 0; & \text{otherwise} \end{cases} \tag{14}$$

The forces at nodes $F_{\mathrm{nodes}}$ are given by left-multiplication of (13) by $B^T$, yielding

$$F_{\mathrm{nodes}} = -B^T K B x + B^T K L. \tag{15}$$

The equilibrium is reached when the forces at nodes vanish (i.e., become zero), resulting in the following linear system

$$\hat{K} x = C L, \tag{16}$$

where $\hat{K} = B^T K B$ and $C = B^T K$. We will assume the following ordering in the node positions vector, $x = [x_{\mathrm{dyn}}, x_{\mathrm{stat}}]^T$, where $x_{\mathrm{dyn}}$ and $x_{\mathrm{stat}}$ are 1D positions of the dynamic and static nodes, respectively. The matrix $\hat{K}$ can be written as

$$\hat{K} = \begin{bmatrix} \hat{K}_{\mathrm{dyn}} & \hat{K}_{\mathrm{stat}} \\ \multicolumn{2}{c}{\hat{K}_{\mathrm{rem}}} \end{bmatrix}, \tag{17}$$

where $\hat{K}_{\mathrm{dyn}}$ and $\hat{K}_{\mathrm{stat}}$ are $N_{\mathrm{dyn}} \times N_{\mathrm{dyn}}$ and $N_{\mathrm{dyn}} \times N_{\mathrm{stat}}$ submatrices, respectively, relating the dynamic nodes to each other and to the static nodes. A similar decomposition can be performed on $C$,

$$C = \begin{bmatrix} C_{\mathrm{dyn}} \\ C_{\mathrm{stat}} \end{bmatrix}. \tag{18}$$

Substituting the definitions (17) and (18) into (16) yields the following closed form for the dynamic node positions $x_{\mathrm{dyn}}$,

$$x_{\mathrm{dyn}} = \hat{K}_{\mathrm{dyn}}^{-1} (C_{\mathrm{dyn}} L - \hat{K}_{\mathrm{stat}} x_{\mathrm{stat}}). \tag{19}$$

The optimization of a 2D spring system, which we call the iterative direct approach (IDA), is summarized in Algorithm 1. At each iteration, a 2D system is decomposed into separate 1D systems, each system is solved by (19) and the 2D system is re-assembled. The process is iterated until convergence. Note that $\hat{K}_{\mathrm{stat}} x_{\mathrm{stat}}$ and $\hat{K}_{\mathrm{dyn}}^{-1}$ can be calculated only once and remain unchanged during the optimization.

Algorithm 1: Optimization of a 2D spring system.
Require: Positions of dynamic and static nodes, $x_{\mathrm{dyn}}$ and $x_{\mathrm{stat}}$, stiffness vector $k$ and adjacency matrix $B$.
Ensure: Equilibrium positions of dynamic nodes $x_{\mathrm{dyn}}$.
Procedure:
1: For each dimension separately construct $\hat{K}_{\mathrm{dyn}}$, $\hat{K}_{\mathrm{stat}}$ and $C_{\mathrm{dyn}}$ according to (17) and (18).
2: while the stop condition is not met do
3:   For each dimension do:
4:     * Extract 1D positions of dynamic nodes from $x_{\mathrm{dyn}}$.
5:     * Calculate the current 1D spring lengths vector $L$.
6:     * Estimate new values of $x_{\mathrm{dyn}}$ by solving (19).
7:   Reassemble the 2D system.
8: end while

2.5 Deformable parts tracker (DPT)

The coarse representation and the mid-level constellation of parts from Section 2.2 and Section 2.3 are integrated into a tracker that localizes the object at each time-step within a search region by a top-down localization and bottom-up updates. In the following we will call this tracker a deformable parts correlation filter tracker and denote it by DPT for short. The tracker steps are visualized in Figure 1 and detailed in the following subsections.

2.5.1 Top-down localization

The object is coarsely localized within a search region corresponding to the root correlation filter centered at the object position from the previous time-step. The object center at time-step $t$ is approximated by the position that maximizes the conditional probability $p(x_t \mid z_t, C_t, y_t)$ from Section 2.2 and a coarse center translation from $t-1$ to $t$ is estimated (Figure 1, step 1). The mid-level representation, i.e., the constellation of parts, is initialized by this translation. For each translated part $x_t^{(i)}$, the part correlation filter is applied to determine the position of the maximum similarity response, $x_A^{(i)}$, along with the stiffness coefficients $k^{(i)}$ and $k^{(i,j)}$ as detailed in Section 2.3.3. A MAP constellation estimate $\hat{X}_t$ is obtained by minimizing the energy (10) of the equivalent spring system with the optimization from Section 2.4 (Figure 1, steps 2-4).

2.5.2 Bottom-up update

The mid-level and coarse representations are updated as follows (Figure 1, steps 5,6). The part correlation filters and their appearance models $z_t^{(i)}$ are updated at the MAP estimates of part positions $\hat{x}_t^{(i)}$. Updating all appearance models at a constant rate might lead to drifting and failure whenever the object is partially occluded or self-occluded. An effective mechanism is applied to address this issue. A part is updated only if its response at the MAP position $\hat{x}_t^{(i)}$ is at least half of the strongest response among all parts and if at least twenty percent of all pixels within the part region correspond to the object according to the segmentation mask estimated at the root part (Section 2.2). The nominal spring lengths (the preferred distances between parts) are updated by an autoregressive scheme

$$\mu_t^{(i,j)} = \mu_{t-1}^{(i,j)} (1 - \alpha_{\mathrm{spr}}) + \| \hat{d}_t^{(i,j)} \| \alpha_{\mathrm{spr}}, \tag{20}$$

where $\| \hat{d}_t^{(i,j)} \|$ is the distance between the parts $(i, j)$ in the MAP estimate $\hat{X}_t$ and $\alpha_{\mathrm{spr}}$ is the update factor. The coarse representation is updated next. The MAP object bounding box is estimated by $\hat{x}_t = T_t \hat{x}_{t-1}$, where $T_t$ is a Euclidean transform estimated by least squares from the constellation MAP estimates $\hat{X}_{t-1}$ and $\hat{X}_t$. The root correlation filter $z_t$ and the histograms in the global color model $C_t$ are updated at $\hat{x}_t$. A histogram $h_t^{(f)}$ is extracted from $\hat{x}_t$ and another histogram $h_t^{(b)}$ is extracted from the search region surrounding $\hat{x}_t$ increased by a factor $\alpha_{\mathrm{sur}}$. The foreground and background histograms are updated by an autoregressive model, i.e.,

$$p(x_t \mid \cdot) = p(x_{t-1} \mid \cdot)(1 - \alpha_{\mathrm{hist}}) + h_t^{(\cdot)} \alpha_{\mathrm{hist}}, \tag{21}$$

where $\alpha_{\mathrm{hist}}$ is the forgetting factor. To increase adaptation robustness, the histograms are not updated if the color segmentation fails the color informativeness test from Section 2.2.1. The top-down localization and bottom-up update steps are summarized in Algorithm 2.

2.5.3 Tracker initialization

The coarse representation at the first time-step is initialized from the initial bounding box. The mid-level representation, i.e., the constellation of parts, is initialized by splitting the initial object bounding box into four equal non-overlapping parts.
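The update rules of Section 2.5.2 and the four-part split used at initialization are simple enough to sketch in plain Python. The helper names and the `(x, y, w, h)` box layout with a top-left corner are assumptions made for illustration. The thresholds "half of the strongest response" and "twenty percent" are taken from the text, while the update rate in the usage example is a made-up value.

```python
import math

def should_update_part(response, best_response, fg_fraction):
    """Update gate: response at least half of the strongest part response
    and at least twenty percent of the part pixels on the object."""
    return response >= 0.5 * best_response and fg_fraction >= 0.2

def autoregressive(old, new, rate):
    """Shared form of Eqs. (20) and (21): old * (1 - rate) + new * rate."""
    return old * (1.0 - rate) + new * rate

def split_into_four(x, y, w, h):
    """Centers of four equal, non-overlapping parts of the box (x, y, w, h),
    where (x, y) is assumed to be the top-left corner."""
    return [(x + (0.25 + 0.5 * c) * w, y + (0.25 + 0.5 * r) * h)
            for r in range(2) for c in range(2)]

def preferred_distances(centers):
    """Initial nominal spring lengths of a fully-connected constellation."""
    return {(i, j): math.hypot(centers[i][0] - centers[j][0],
                               centers[i][1] - centers[j][1])
            for i in range(len(centers)) for j in range(i + 1, len(centers))}

# Usage: initialize four parts, then adapt one nominal length toward a
# hypothetical MAP distance with an illustrative rate of 0.25.
centers = split_into_four(0.0, 0.0, 4.0, 2.0)
mu = preferred_distances(centers)
mu[(0, 1)] = autoregressive(mu[(0, 1)], 3.0, rate=0.25)
```

Four parts give six links in the fully-connected case, so the dictionary returned by `preferred_distances` holds six nominal lengths.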
The part appearance models are initialized at these locations and the preferred distances between parts are calculated from the initialized positions.

3 EXPERIMENTAL ANALYSIS

This section reports experimental analysis of the proposed DPT. The implementation details are given in Section 3.1, Section 3.2 details the analysis of the design choices, Section 3.3 reports comparison to the related state-of-the-art, Section 3.4 reports performance on recent benchmarks and Section 3.5 provides qualitative analysis.

Algorithm 2: A tracking iteration of a deformable parts correlation filter tracker.
Require: Coarse model $\{x_{t-1}, z_{t-1}, C_{t-1}\}$ and mid-level model $\{X_{t-1}, Z_{t-1}\}$ at time-step $t-1$.
Ensure: Coarse model $\{x_t, z_t, C_t\}$ and mid-level model $\{X_t, Z_t\}$ at time-step $t$.
Procedure:
1: Coarsely estimate the object position by the root node (Section 2.2) and displace the mid-level parts.
2: Calculate the part correlation filter responses and form a spring system according to Section.
3: Estimate the MAP mid-level parts constellation by optimizing the energy of a dual spring system (Section 2.4).
4: Update the root node position and size by the Euclidean transform fitted to the parts positions before and after MAP inference (Section 2.5.1).
5: Update the spring system parameters and the constellation appearance models (Section 2.5.2).
6: Update the coarse color model $C$ and correlation filter $z$.

3.1 Implementation details and parameters

Our implementation uses kernelized correlation filters (KCF) [11] with HOG [45] features and a grayscale template in the part appearance models. All filter parameters and the learning rate are the same as in [11]. The parts have to be large enough to capture locally visually-distinctive regions on the object and have to cover the object without significantly overlapping with each other. The size of the tracked targets therefore places a constraint on the maximal number of parts, since their size reduces with this number. For small parts, the HoG features become unreliable.
But even more pressing is the issue that the capture range of correlation filters is constrained by the template size and is even reduced in practice due to the effects of circular correlation used for learning and matching. Therefore, small parts increasingly lose the ability to detect large displacements. The parts have to be large enough to capture the object partial appearance at a sufficient level of detail, therefore we set the number of parts to $N_p = 4$. The DPT allows any type of connectivity among the parts and our implementation applies a fully-connected constellation for maximally constrained geometry. The foreground/background models $C$ are HSV color histograms with bins. The remaining parameters are as follows: the rate of the spring system update is α_spr =.95, the background histogram extraction area parameter is set to α_sur =.6 and the histogram update rate is set to α_his =.5. These parameters have a straightforward interpretation and were set to the values commonly used in published related trackers. Recall that the color informativeness test from Section 2.2.1 detects drastic segmentation failures. In our implementation the failure is detected if the number of pixels assigned to the object relative to the target bounding box size either falls below 2 percent or exceeds the initial size by percent, i.e., α_min =.2 and α_max = 2. Note that these are very weak constraints meant to detect obvious segmentation failures and did not require special tuning. The parameters have been fixed throughout all experiments.

The DPT was implemented in Matlab with backprojection and HoG extraction implemented in C and performed at 9 FPS on an Intel Core i7 machine. Since our tracker uses a KCF [11] for the root and part appearance models, the complexity of our tracker is in the order of the KCF complexity, which is O(n log n), where n is the number of pixels in the search region. The DPT has a complexity five times that of the KCF, because of the four mid-level parts plus a root part. The localization and update of five KCFs takes approximately 4ms. Our tracker also includes the spring system and the object segmentation. The optimization of the spring system takes on average less than 3ms and the color segmentation with the histogram extraction requires approximately 9ms.

3.2 The DPT design analysis

3.2.1 Analysis of the spring system optimization

This section analyzes the iterated direct approach (IDA) from Section 2.4, which is the core of our part-based optimization. The following random spring system was used in the experiments. Dynamic nodes were initialized at uniformly distributed positions in a 2D region [, ] × [, ]. Each node was displaced by a randomly sampled vector d = [d_x, d_y] ~ U([.5;.5]) and the anchor nodes were set by displacing the corresponding dynamic nodes by the vector b = [b_x, b_y] ~ U([.25;.25]). The stiffness of the i-th dynamic spring was set to k_i = (σ d_i) 2, where d_i is the length of the spring and σ =. is the size change. The stiffness of the j-th static spring was set to k_j = 2 + u_j k_dyn, where k_dyn is the average stiffness of the dynamic springs and u_j ~ U([; ]). The IDA was compared with the widely used conjugate gradient descent optimization (CGD), which guarantees that a global minimum will be reached on a convex cost function and has shown excellent performance in practice on non-convex functions as well [46]. All results here are obtained by averaging the performance on , randomly generated spring systems.
The first experiment evaluated the convergence properties of IDA. Figure 7 shows the energy reduction in the spring system during optimization for different numbers of nodes in the spring system. The difference in the remaining energy after many iterations is negligible between CGD and IDA, which means that both converged to equivalent solutions. But the difference in energy reduction in consecutive steps and the difference in the number of steps required to reach convergence are significant. The IDA reduces the energy at a much faster rate than CGD and this result is consistent over various spring system sizes. Notice that IDA significantly reduced the energy already within the first few iterations.

The numeric behavior of IDA is much more robust than that of the CGD. Figure 8 shows an example of a spring system where CGD did not reach the optimal state, but the IDA converged to a stable state with much lower energy than the CGD. The poor convergence of CGD is caused by the very small distance between a pair of nodes compared to the other distances, resulting in poor gradient estimation, while the IDA avoids this by the closed-form solutions for the marginal 1D spring systems. The IDA converged in 5 iterations, while the CGD stopped after 47 iterations. Spring systems like the one described here were automatically detected and removed in the simulated experiment to prevent skewing the results for the CGD. The results conclusively show that the IDA converges to a global minimum faster than CGD and is more robust.

Fig. 7. The spring system remaining energy w.r.t. the iterations, shown for spring systems of several sizes (axes: number of iterations vs. energy). The experiment is averaged over , random spring systems. The red and the green curves represent the IDA and CGD methods, respectively.

Fig. 8. The dynamic part of the spring system before and after optimization is shown in blue and red, respectively (panels: IDA, E=.245; CGD, E=9.359). Dynamic nodes and anchor nodes are depicted by green circles and black crosses, respectively, and the black dotted lines depict the static springs. The remaining energy E of the optimized spring system is shown as well.

The second experiment evaluated the IDA scalability. Figure 9 shows the optimization speed w.r.t. the spring system size. The number of iterations significantly increases for the CGD with an increasing number of parts. On the other hand, the IDA exhibits remarkable scalability by keeping the number of steps approximately constant over a range of system sizes. Furthermore, the variance in the number of iterations is kept low and is consistently much lower than for the CGD. The iteration step complexity is expected to increase with the number of parts, since larger systems are solved. Figure 9 also shows that the computation times indeed increase exponentially for CGD, but the IDA hardly exhibits an increase for a range of spring system sizes. These results conclusively show that IDA scales remarkably well.
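To make the idea of iterating closed-form solutions of marginal 1D spring systems more concrete, the sketch below alternates between projecting each dynamic spring's nominal length onto its current direction and solving one linear system per dimension. It is only an illustration of this general idea and should not be read as the paper's equations (7)-(9), which are not reproduced in this section. All names (`spring_energy`, `iterated_direct`, `k_static`, `k_dyn`, `mu`) are hypothetical, static springs are assumed to have zero nominal length, and the stopping rule is an assumption.

```python
import numpy as np


def spring_energy(x, anchors, k_static, edges, k_dyn, mu):
    """Energy of a 2D spring system: zero-length static springs tying each
    dynamic node to its anchor, plus dynamic springs with nominal lengths mu."""
    e = 0.5 * np.sum(k_static * np.sum((x - anchors) ** 2, axis=1))
    for (i, j), k, m in zip(edges, k_dyn, mu):
        e += 0.5 * k * (np.linalg.norm(x[i] - x[j]) - m) ** 2
    return float(e)


def iterated_direct(x0, anchors, k_static, edges, k_dyn, mu,
                    max_iter=200, tol=1e-9):
    """Alternate 1D projections of the nominal lengths with direct 1D solves."""
    x = np.array(x0, dtype=float)
    anchors = np.asarray(anchors, dtype=float)
    k_static = np.asarray(k_static, dtype=float)
    # Stiffness matrix shared by both 1D systems (graph Laplacian + anchors).
    K = np.diag(k_static)
    for (i, j), k in zip(edges, k_dyn):
        K[i, i] += k
        K[j, j] += k
        K[i, j] -= k
        K[j, i] -= k
    for _ in range(max_iter):
        x_new = np.empty_like(x)
        for d in range(2):
            rhs = k_static * anchors[:, d]
            for (i, j), k, m in zip(edges, k_dyn, mu):
                v = x[i] - x[j]
                length = np.linalg.norm(v)
                # Signed 1D rest length: nominal length projected on the
                # current spring direction.
                l_d = m * v[d] / length if length > 1e-12 else 0.0
                rhs[i] += k * l_d
                rhs[j] -= k * l_d
            x_new[:, d] = np.linalg.solve(K, rhs)
        step = np.max(np.abs(x_new - x))
        x = x_new  # reassemble the 2D system
        if step < tol:
            break
    return x
```

Each 1D subproblem is a quadratic in the node coordinates, so it is solved exactly by `np.linalg.solve`; only the spring directions are refreshed between iterations.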

Fig. 9. The number of iterations (left) and time (right) spent by IDA and CGD on optimization with respect to the spring system size.

3.2.2 The DPT parameters analysis

The DPT design choices were evaluated on a state-of-the-art short-term tracking benchmark VOT2014 [4], [47]. In contrast to related benchmarks that aim at large datasets, the datasets in the VOT initiative [47] are constructed by focusing on challenging, well annotated sequences while keeping the dataset small. The objects are annotated by rotated bounding boxes and all sequences are per-frame annotated by visual attributes. The VOT evaluation protocol initializes the tracker from a ground truth bounding box. Once the overlap between the ground truth and the tracker output bounding box falls to zero, a failure is detected and the tracker is re-initialized. The VOT toolkit measures two basic tracking performance aspects: reset-based accuracy and robustness. The reset-based accuracy is measured as the average overlap during successful tracking, while the robustness measures the number of failures (i.e., the number of tracker re-sets). Apart from reporting raw accuracy/robustness values, the benchmark can rank trackers with respect to these measures separately by taking into account the statistical as well as practical difference. Since 2015 the VOT primary overall accuracy measure is the expected average overlap (EAO). This measure calculates the expected overlap on fixed-length sequences that a tracker would attain without reset. In addition we also report the primary OTB [1] measure. The OTB performance evaluation primarily differs from the VOT [4] in that trackers are not reset at failure. The overall performance is reported by an average overlap (AO) over all sequences.

The first experiment analyzed the contributions of the proposed segmentation in the coarse layer and the lower-layer constellation model. The baseline tracker was a DPT variant that uses neither the constellation nor the segmentation (DPT_crs^nos), which is in fact the original KCF [11] correlation filter.
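The overlap-based measures described above can be illustrated with a small sketch. It is a simplification under stated assumptions and not the VOT or OTB toolkit: boxes are axis-aligned `(x, y, w, h)` tuples rather than rotated boxes, a failure is simply a frame whose overlap is zero, and toolkit details such as re-initialization are not modelled. All function names are hypothetical.

```python
import numpy as np


def overlap(a, b):
    """Intersection-over-union of two axis-aligned boxes (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def reset_based_summary(overlaps):
    """Return (accuracy, failures) from a per-frame overlap trace.

    Accuracy is the average overlap over frames with non-zero overlap;
    failures counts the frames where the overlap dropped to zero.
    """
    o = np.asarray(overlaps, dtype=float)
    tracked = o > 0
    accuracy = float(o[tracked].mean()) if tracked.any() else 0.0
    return accuracy, int((~tracked).sum())


def average_overlap(overlaps):
    """No-reset average overlap (AO) over all frames, failures included."""
    return float(np.mean(overlaps))
```

For the trace `[0.8, 0.6, 0.0, 0.7, 0.5]` this gives a reset-based accuracy of 0.65 with one failure, while the no-reset average overlap is 0.52, which illustrates why the two kinds of measures can rank trackers differently.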
Adding a segmentation model to the baseline tracker results in the coarse layer of our part-based tracker, which we denote by DPT_crs. Table 1 clearly shows that the number of failures is reduced by our segmentation and that the overall accuracy (EAO and AO) increases relative to DPT_crs^nos. By adding the lower layer to the DPT_crs, we arrive at the proposed DPT, which further boosts the performance by all measures. In particular, the number of failures is reduced by over 4%, the reset-based accuracy increases by over %, the expected average overlap (EAO) increases by 8% and the OTB average overlap (AO) increases by %. The VOT ranking methodology was applied to these three trackers. The DPT was ranked as the top-performing tracker, which conclusively shows that the improvements are statistically as well as practically significant.

TABLE 1. Performance of DPT variants in terms of raw reset-based accuracy (res. acc.) and robustness (rob.), the VOT rank, the VOT no-reset accuracy (expected average overlap, EAO) and the OTB no-reset average overlap (AO). The arrows ↑ and ↓ indicate that higher is better and lower is better, respectively.
[Columns: DPT variant; VOT EAO; raw values (res. acc., rob.); VOT rank; OTB AO. Rows: DPT, DPT_crs, DPT_crs^nos, DPT_str, DPT_loc, DPT_3×3^ov, DPT_3×3^nov.]

The DPT variants with fully connected, locally connected and star-based topology, DPT, DPT_loc and DPT_str, respectively, were compared to evaluate the influence of the lower-layer topology. The top performance in terms of the VOT EAO as well as the OTB AO is achieved by the fully-connected topology, followed by the locally-connected and the star-based topology. This order remains the same under the VOT ranking methodology, which confirms that the improvements of the fully-connected topology over the alternatives are statistically as well as practically significant.

For completeness, we have further tested the DPT performance with an increased number of parts at the lower layer.
Given the constraints imposed on the parts size (as discussed in Section 3.1), we tested two variants with 3 × 3 = 9 parts: one with overlapping parts of the same size as in the original DPT (DPT_3×3^ov) and one with smaller, non-overlapping parts (DPT_3×3^nov). Table 1 shows that these versions of DPT perform similarly in terms of overall performance (EAO and AO), with DPT_3×3^nov obtaining a slightly better rank, which is due to slightly better robustness than DPT_3×3^ov. Both variants are outperformed by the original 2 × 2 DPT. The improvement of DPT over the best DPT_3×3 tracker is over 2% in terms of the expected average overlap and approximately 2% in terms of the OTB average overlap. The smaller difference in OTB AO arises because DPT_3×3 has a similar accuracy as DPT, but fails more often. The OTB AO effectively measures the accuracy only up to the first failure. But the raw values clearly show superior robustness of DPT, which is reflected in EAO.

3.3 Comparison to the state-of-the-art baselines

The DPT tracker is a layered deformable parts correlation filter, therefore we compared it to the state-of-the-art part-based as well as holistic discriminative trackers. The set of baselines included: (i) the recent state-of-the-art part-based baselines, PT [18], DGT [17], CMT [48] and LGT [13], (ii) the state-of-the-art discriminative baselines TGPR [26], Struck [5], DSST [9], KCF [11], SAMF [10], STC [32], MEEM [49], MUSTER [33] and HRP [50], and (iii) the standard baselines CT [23], IVT [22] and MIL [6]. This is a highly challenging set of recent state-of-the-art trackers, containing all published top-performing trackers on VOT2014, including the winner of the challenge DSST [9] and trackers recently published at major computer vision conferences and journals.

The AR-raw, AR-rank and expected average overlap plots of the VOT2014 reset-based experiment are shown in Figure 10 (a,b,c). In terms of the AR-raw and AR-rank plots, the DPT outperforms all trackers by being closest to the top-right part of the plots. The tracker exhibits an excellent trade-off between robustness and accuracy, attaining high accuracy during successful tracks and rarely failing. This is reflected in the average expected overlap measure, which ranks this tracker as a top performing tracker (Figure 10c and the last row in Table 2). The DPT outperforms the best part-based tracker LGT [13], which applies a locally-connected constellation model and color segmentation, by over 8% and the winner of the VOT2014 challenge, the scale adaptive correlation filter DSST [9], by 3%.

The VOT reset-based methodology resets the tracker after failure, but some trackers, like MUSTER [33], MEEM [49] and CMT [48], explicitly address target loss and implement mechanisms for target re-detection upon drifting. Although these are long-term capabilities and DPT is a short-term tracker that does not perform re-detection, we performed the no-reset OTB [1] experiment to gain further insights. The OTB [1] methodology reports the tracker overlap precision with respect to the intersection thresholds in the form of a success plot (Figure 10d). The trackers are then ranked by the area under the curve (AUC) measure, which is equivalent to a no-reset average overlap [51]. The DPT outperforms the best baseline color-based superpixel short-term tracker DGT [17] and the long-term tracker MUSTER [33], which combines robust keypoint matching, a correlation filter (DSST [9]), HoG and color features.
The DPT also outperformed the recent state-of-the-art discriminative correlation filter-based trackers like DSST [9], the color-based SAMF [10], the recently proposed multi-snapshot online SVM-based MEEM [49] and the recent logistic regression tracker HRP [50]. The results conclusively show top global performance over the related state-of-the-art with respect to several performance measures and experimental setups.

3.3.1 Per-attribute analysis

Next we analyzed tracking performance with respect to the visual attributes. The VOT2014 benchmark provides a highly detailed per-frame annotation with the following attributes: camera motion, illumination change, occlusion, size change and motion change. In addition to these, we manually annotated sequences that contained deformable targets by the deformation attribute. If a frame did not contain any attribute or deforming target, it was annotated by an empty attribute. The tracking performance with respect to each attribute is shown in Figure 11 and Table 2. The DPT outperforms all trackers on occlusion, camera motion, motion change and deformation and is among the top-performing trackers on illumination change, size change and empty. Note that the DPT outperformed all trackers that explicitly address target drift and partial occlusion, i.e., MUSTER [33], MEEM [49], CMT [48] and Struck [5]. The DPT also outperforms top part-based trackers that address non-rigid deformations, i.e., LGT [13], DGT [17], PT [18] and CMT [48]. These results indicate a balanced performance in that the DPT does not only excel at a given attribute but performs well over all visual attributes.

Fig. 10. The VOT2014 AR raw (a), AR rank (b), expected average overlap (c) and the OTB success plot (d).

Fig. 11. The expected average overlap with respect to the visual attributes on the VOT2014 dataset.

3.4 Performance on benchmarks

For completeness of the analysis we have benchmarked the proposed tracker on the recent benchmarks. The DPT performance on the VOT2014 benchmark [4] compared to the 38 trackers available in that benchmark is shown in Figure 12. The DPT excels in the reset-based accuracy, robustness as well as the expected average overlap accuracy measure and is ranked third, outperforming 92% of the trackers on the benchmark. The two trackers that outperform the DPT are variants of the unpublished PLT tracker [4]. The DPT performance on the most recent and challenging VOT2015 benchmark [41] compared to the 6 trackers included in that benchmark is shown in Figure 13. The tracker is ranked among the top % of all trackers, outperforming 54 trackers (i.e., 9% of the benchmark). The DPT outperforms all fifteen part-based trackers and fourteen correlation filter trackers, including the nSAMF, which is an improved version of [10] that applies color as well as fusion with various models, and the recently published improved Struck [52] that applies additional features and performs remarkably well compared to the original version [5]. The VOT2015 provides a VOT2015 published state-of-the-art bound computed by averaging the performance of trackers published in 2014/2015 in top computer vision conferences and journals. Any tracker with performance over this boundary is considered a state-of-the-art tracker according to VOT. The DPT is positioned well above this boundary and is considered state-of-the-art according to the strict VOT2015 standards.

The DPT performance against 29 trackers available on the standard OTB [1] benchmark is shown in Figure 14. The DPT outperforms all trackers and is ranked top, exceeding the performance of the second-best tracker by over 8%.

TABLE 2. The per-attribute expected average overlap, i.e., EAO measure (Ω), reset-based overlap (O) and number of failures (F) for the top ranked trackers over 7 visual attributes: camera motion (CM), deformation (DE), empty (EM), illumination change (IC), motion change (MC), occlusion (OC), size change (SC). The arrows ↑ and ↓ indicate that higher is better and lower is better, respectively.
[Columns: CM, DE, EM, IC, MC, OC, SC, Average. Rows: DPT, LGT [13], DSST [9], DGT [17], SAMF [10], MEEM [49], KCF [11], HRP [50], TGPR [26], MUSTER [33].]

Fig. 12. The AR raw plots and the expected average overlap accuracy measures for the VOT2014 benchmark [4]. Please see [4] for the tracker references.

Fig. 13. The AR raw plots and the expected average overlap accuracy measures for the VOT2015 benchmark [41]. Please see [41] for the tracker references.

Fig. 14. The OPE performance plot for the top trackers on the OTB benchmark [1]. Please see [1] for the tracker references.

3.5 Qualitative analysis

Qualitative analysis is provided for further insights. An experiment was performed to demonstrate the effectiveness of part adaptations during significant partial occlusions. The DPT was applied to a well-known sequence, in which the object (face) undergoes repetitive partial occlusions by a book (see Figure 15). The DPT tracked the face without failures. Figure 15 shows images of the face taken from the sequence along with the graph of color-coded part weights $w^{(i)}$. The automatically computed adaptation threshold is shown in gray. Recall that a part is updated if its weight exceeds this threshold (Section 2.5). Observe that partial occlusions are clearly identified by the weight graphs, resulting in drift prevention and successful tracking through partial occlusions.

Additional qualitative examples are provided in Figure 16. The first row in Figure 16 shows performance on a non-deformable target with fast-varying local appearance.


Fig. 15. Qualitative tracking results of a partially occluded object. A sketch of the parts is shown on the right-hand side. Part weights are color-coded, with the update threshold shown in gray.

The DPT tracks the target throughout the sequence, while holistic correlation- and SVM-based trackers [5], [9], [33] fail. The second, third and fourth row show tracking of deformable targets with various degrees of deformation. The fourth row shows tracking of a gymnast that drastically and rapidly changes the appearance. Note that the DPT comfortably tracks the target, while the related trackers fail. The first and second row in Figure 17 visualize successful tracking performance on targets undergoing significant illumination changes. The third row shows tracking through several long-term partial occlusions. Again, the DPT successfully tracks the target even though the bottom part remains occluded for a large number of frames. The constellation model overcomes the occlusion and continues tracking during and after the occlusions.

4 CONCLUSION

A new class of deformable parts trackers based on correlation filters is presented. The developed deformable parts model jointly treats the visual and geometric properties within a single formulation, resulting in a convex optimization problem. The parts appearance models are updated by online regression to result in Gaussian-like likelihood functions and the geometric constraints are modeled as a fully-connected spring system. We have shown that the dual representation of such a deformable parts model is an extended spring system and that minimization of the corresponding energy function leads to a MAP inference on the deformable parts model. A highly efficient optimization called iterated direct approach (IDA) is derived for this dual formulation. A deformable parts correlation filter tracker (DPT) is proposed that combines a coarse object representation with a mid-level constellation of deformable parts model in top-down localization and bottom-up updates.
The extensive analysis of the new spring-system optimization method IDA showed remarkable convergence and robustness properties. In particular, the IDA converges much faster than the conjugate gradient descent, is numerically more robust and scales very well with an increasing number of parts in the spring system. Our tracker was rigorously compared against sixteen state-of-the-art baselines with respect to several performance measures and experimental setups. The DPT tracker outperforms the related state-of-the-art part-based trackers as well as state-of-the-art trackers that use a single appearance model, including the winner of the VOT2014 challenge, and runs in real-time. Additional tests show that the improvements come from the fully-connected constellation and the top-down/bottom-up combination of the coarse representation with the proposed deformable parts model. The DPT tracker was benchmarked on three recent highly challenging benchmarks: against 38 trackers on the VOT2014 [4] benchmark, 6 trackers on the VOT2015 [41] benchmark and 29 trackers on the OTB [1] benchmark. The DPT attained state-of-the-art performance on all benchmarks. Note that, since five KCFs [11] are used in DPT, the speed reduction is approximately five times compared to the baseline KCF. But the boost in performance is significant. The DPT reduces the failures compared to the baseline KCF by nearly 6%, the expected average overlap is increased by over 8% and the OTB average overlap is increased by approximately 3% while still attaining real-time performance.

The proposed deformable parts model is highly extendable. The dual formulation of the deformable constellation and the proposed optimizer are generally applicable as stand-alone solvers for deformable parts models. The appearance models on parts can potentially be replaced with other discriminative or generative models or augmented to obtain a constellation of parts based on various features like key-points and parts of different shapes.
Part-based models like flocks of features [35], key-point-based [37], [48] and superpixel-based [17] models typically use more parts than the tracker presented in this paper. Our analysis shows that the proposed optimization of the deformation model scales well with the number of parts and could potentially be used in these trackers as a deformation model. Parts could also be replaced with scale-adaptive parts, which could further improve the scale adaptation of the whole tracker. Alternatively, saliency regions could be used to improve localization. One way to introduce the saliency is at the coarse layer and another is to apply it at the parts localization. Since the model is fully probabilistic, it can be readily integrated with probabilistic dynamic models. These will be the topics of our future work.

REFERENCES

[1] Y. Wu, J. Lim, and M.-H. Yang, "Online object tracking: A benchmark," in Comp. Vis. Patt. Recognition, 2013.
[2] A. Smeulders, D. Chu, R. Cucchiara, S. Calderara, A. Dehghan, and M. Shah, "Visual tracking: An experimental survey," IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 7, July 2014.
[3] M. Kristan, R. Pflugfelder, A. Leonardis, J. Matas, F. Porikli, L. Čehovin, G. Nebehay, G. Fernandez, T. Vojir, et al., "The visual object tracking VOT2013 challenge results," in Vis. Obj. Track. Challenge VOT2013, In conjunction with ICCV2013, Dec 2013.
[4] M. Kristan, R. Pflugfelder, A. Leonardis, J. Matas, L. Čehovin, G. Nebehay, T. Vojir, G. Fernandez, et al., "The visual object tracking VOT2014 challenge results," in Proc. European Conf. Computer Vision, 2014.
[5] S. Hare, A. Saffari, and P. H. S. Torr, "Struck: Structured output tracking with kernels," in Int. Conf. Computer Vision. Washington, DC, USA: IEEE Computer Society, 2011.

Fig. 16. Qualitative comparative examples of tracking for DPT, DSST, MUSTER and Struck shown in green, red, magenta and cyan, respectively.

Fig. 17. Qualitative examples of the DPT tracker on three sequences. The tracking bounding box is visualized in yellow and the four parts of the mid-level representation are shown in blue.

[6] B. Babenko, M.-H. Yang, and S. Belongie, "Robust object tracking with online multiple instance learning," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 8, Aug. 2011.
[7] H. Grabner, M. Grabner, and H. Bischof, "Real-time tracking via on-line boosting," in Proc. British Machine Vision Conference, vol. 1, 2006.
[8] D. S. Bolme, J. R. Beveridge, B. A. Draper, and Y. M. Lui, "Visual object tracking using adaptive correlation filters," in Comp. Vis. Patt. Recognition. IEEE, 2010.
[9] M. Danelljan, G. Häger, F. S. Khan, and M. Felsberg, "Accurate scale estimation for robust visual tracking," in Proc. British Machine Vision Conference, 2014.
[10] Y. Li and J. Zhu, "A scale adaptive kernel correlation filter tracker with feature integration," in Proc. European Conf. Computer Vision, 2014.
[11] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, "High-speed tracking with kernelized correlation filters," IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 3, 2014.
[12] D. Comaniciu, V. Ramesh, and P. Meer, "Kernel-based object tracking," IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 5, May 2003.
[13] L. Čehovin, M. Kristan, and A. Leonardis, "Robust visual tracking using an adaptive coupled-layer visual model," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 4, Apr. 2013.
[14] J. Kwon and K. M. Lee, "Tracking by sampling and integrating multiple trackers," IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 7, July 2014.
[15] M. E. Maresca and A. Petrosino, "Matrioska: A multi-level approach to fast tracking by learning," in Proc. Int. Conf. Image Analysis and Processing, 2013.


[16] N. M. Artner, A. Ion, and W. G. Kropatsch, "Multi-scale 2D tracking of articulated objects using hierarchical spring systems," Patt. Recogn., vol. 44, no. 4, 2011.
[17] Z. Cai, L. Wen, Z. Lei, N. Vasconcelos, and S. Li, "Robust deformable and occluded object tracking with dynamic graph," IEEE Trans. Image Proc., vol. 23, no. 12, 2014.
[18] R. Yao, Q. Shi, C. Shen, Y. Zhang, and A. van den Hengel, "Part-based visual tracking with online latent structural learning," in Comp. Vis. Patt. Recognition, June 2013, pp. 2363-2370.
[19] M. Godec, P. M. Roth, and H. Bischof, "Hough-based tracking of non-rigid objects," Comp. Vis. Image Understanding, vol. 117, no. 10, 2013.
[20] G. Zhu, J. Wang, C. Zhao, and H. Lu, "Part context learning for visual tracking," in Proc. British Machine Vision Conference, 2014.
[21] R. T. Collins, X. Liu, and M. Leordeanu, "Online selection of discriminative tracking features," IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 10, pp. 1631-1643, 2005.
[22] D. A. Ross, J. Lim, R.-S. Lin, and M.-H. Yang, "Incremental learning for robust visual tracking," Int. J. Comput. Vision, vol. 77, no. 1-3, pp. 125-141, May 2008.
[23] K. Zhang, L. Zhang, and M.-H. Yang, "Real-time compressive tracking," in Proc. European Conf. Computer Vision, 2012.
[24] X. Mei and H. Ling, "Robust visual tracking and vehicle classification via sparse representation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 11, Nov 2011.
[25] Z. Hong, X. Mei, D. Prokhorov, and D. Tao, "Tracking via robust multi-task multi-view joint sparse representation," in Int. Conf. Computer Vision, Dec 2013.
[26] J. Gao, H. Ling, W. Hu, and J. Xing, "Transfer learning based visual tracking with Gaussian processes regression," in Proc. European Conf. Computer Vision, vol. 8691, 2014.
[27] S. Avidan, "Support vector tracking," IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 8, Aug 2004.
[28] H. Possegger, T. Mauthner, and H. Bischof, "In defense of color-based model-free tracking," in Comp. Vis. Patt. Recognition, 2015.
[29] P. Naidu, "Improved optical character recognition by matched filtering," Optics Communications, vol. 12, no. 3, 1974.
[30] M. Zhang, J. Xing, J. Gao, and W. Hu, "Robust visual tracking using joint scale-spatial correlation filters," in Proc. Int. Conf. Image Processing. IEEE, 2015.
[31] Y. Li, J. Zhu, and S. C. Hoi, "Reliable patch trackers: Robust visual tracking by exploiting reliable patches," in Comp. Vis. Patt. Recognition, 2015.
[32] K. Zhang, L. Zhang, Q. Liu, D. Zhang, and M.-H. Yang, "Fast visual tracking via dense spatio-temporal context learning," in Proc. European Conf. Computer Vision. Springer International Publishing, 2014.
[33] Z. Hong, Z. Chen, C. Wang, X. Mei, D. Prokhorov, and D. Tao, "Multi-store tracker (MUSTer): A cognitive psychology inspired approach to object tracking," in Comp. Vis. Patt. Recognition, 2015.
[34] J. Hoey, "Tracking using flocks of features, with application to assisted handwashing," in Proc. British Machine Vision Conference, vol. 1, 2006.
[35] T. Vojir and J. Matas, "The enhanced flock of trackers," in Registration and Recognition in Images and Videos, ser. Studies in Computational Intelligence. Springer Berlin Heidelberg, 2014, vol. 532.
[36] B. Martinez and X. Binefa, "Piecewise affine kernel tracking for non-planar targets," Patt. Recogn., vol. 41, no. 12, 2008.
[37] F. Pernici and A. Del Bimbo, "Object tracking by oversampling local features," IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 12, 2013.
[38] X. Yang, Q. Xiao, S. Wang, and P. Liu, "Real-time tracking via deformable structure regression learning," in Proc. Int. Conf. Pattern Recognition, 2014.
[39] S. Duffner and C. Garcia, "PixelTrack: a fast adaptive algorithm for tracking non-rigid objects," in Int. Conf. Computer Vision, 2013.
[40] G. Duan, H. Ai, S. Cao, and S. Lao, "Group tracking: exploring mutual relations for multiple object tracking," in Proc. European Conf. Computer Vision. Springer, 2012.
[41] M. Kristan, J. Matas, A. Leonardis, M. Felsberg, L. Čehovin, G. Fernandez, T. Vojir, G. Häger, G. Nebehay, R. Pflugfelder, et al., "The visual object tracking VOT2015 challenge results," in Int. Conf. Computer Vision, 2015.
[42] M. Danelljan, F. Shahbaz Khan, M. Felsberg, and J. van de Weijer, "Adaptive color attributes for real-time visual tracking," in Comp. Vis. Patt. Recognition, 2014.
[43] M. Kristan, J. Perš, V. Sulič, and S. Kovačič, "A graphical model for rapid obstacle image-map estimation from unmanned surface vehicles," in Proc. Asian Conf. Computer Vision, 2014.
[44] A. Diplaros, N. Vlassis, and T. Gevers, "A spatially constrained generative model and an EM algorithm for image segmentation," IEEE Trans. Neural Networks, vol. 18, no. 3, 2007.
[45] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Comp. Vis. Patt. Recognition, vol. 1, June 2005.
[46] I. Griva, S. G. Nash, and A. Sofer, Linear and Nonlinear Optimization, Second Edition. SIAM, 2009.
[47] M. Kristan, J. Matas, A. Leonardis, T. Vojir, R. Pflugfelder, G. Fernandez, G. Nebehay, F. Porikli, and L. Čehovin, "A novel performance evaluation methodology for single-target trackers," IEEE Trans. Pattern Anal. Mach. Intell., 2016.
[48] G. Nebehay and R. Pflugfelder, "Clustering of static-adaptive correspondences for deformable object tracking," in Comp. Vis. Patt. Recognition, 2015.
[49] J. Zhang, S. Ma, and S. Sclaroff, "MEEM: robust tracking via multiple experts using entropy minimization," in Proc. European Conf. Computer Vision, 2014.
[50] N. Wang, J. Shi, D.-Y. Yeung, and J. Jia, "Understanding and diagnosing visual tracking systems," in Int. Conf. Computer Vision, 2015.
[51] L. Čehovin, A. Leonardis, and M. Kristan, "Visual object tracking performance measures revisited," IEEE Trans. Image Proc., vol. 25, no. 3, 2016.
[52] S. Hare, S. Golodetz, A. Saffari, V. Vineet, M. Cheng, S. Hicks, and P. Torr, "Struck: Structured output tracking with kernels," IEEE Trans. Pattern Anal. Mach. Intell., 2016.

Alan Lukežič received the Dipl.ing. and M.Sc. degrees at the Faculty of Computer and Information Science, University of Ljubljana, Slovenia in 2012 and 2015, respectively.
He is currenly a researcher a he Visual Cogniive Sysems Laboraory, Faculy of Compuer and Informaion Science, Universiy of Ljubljana, Slovenia as a researcher. His research ineress include compuer vision, daa mining and machine learning. Luka Čehovin received his Ph.D from he Faculy of Compuer and Informaion Science, Universiy of Ljubljana, Slovenia in 25. Currenly he is working a he Visual Cogniive Sysems Laboraory, Faculy of Compuer and Informaion Science, Universiy of Ljubljana, Slovenia as a eaching assisan and a researcher. His research ineress include compuer vision, HCI, disribued inelligence and web-mobile echnologies. Maej Krisan received a Ph.D from he Faculy of Elecrical Engineering, Universiy of Ljubljana in 28. He is an Assisan Professor a he Vi- CoS Laboraory a he Faculy of Compuer and Informaion Science and a he Faculy of Elecrical Engineering, Universiy of Ljubljana. His research ineress include probabilisic mehods for compuer vision wih focus on visual racking, dynamic models, online learning, objec deecion and vision for mobile roboics.
