Robust Visual Tracking via Structured Multi-Task Sparse Learning

Int J Comput Vis (2013) 101:367–383
DOI 10.1007/s…z

Tianzhu Zhang · Bernard Ghanem · Si Liu · Narendra Ahuja

Received: 2 April 2012 / Accepted: 30 September 2012 / Published online: 9 November 2012
Springer Science+Business Media New York 2012

Electronic supplementary material: The online version of this article (doi:10.1007/s…z) contains supplementary material, which is available to authorized users.

T. Zhang
Advanced Digital Sciences Center (ADSC), 1 Fusionopolis Way, #08-10 Connexis North Tower, Singapore 138632, Singapore
e-mail: tzzhang10@gmail.com

B. Ghanem
King Abdullah University of Science and Technology (KAUST), Al Khwarizmi Building #2224, Thuwal, Kingdom of Saudi Arabia
e-mail: bernard.ghanem@kaust.edu.sa

S. Liu (✉)
Department of Electrical and Computer Engineering, National University of Singapore, 4 Engineering Drive 3, Singapore 117576, Singapore
e-mail: dcslius@nus.edu.sg

N. Ahuja
Department of Electrical and Computer Engineering, Beckman Institute, and Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, 2041 Beckman Institute, 405 N. Mathews Ave., Urbana, IL 61801, USA
e-mail: ahuja@vision.ai.uiuc.edu

Abstract In this paper, we formulate object tracking in a particle filter framework as a structured multi-task sparse learning problem, which we denote as Structured Multi-Task Tracking (S-MTT). Since we model particles as linear combinations of dictionary templates that are updated dynamically, learning the representation of each particle is considered a single task in Multi-Task Tracking (MTT). By employing popular sparsity-inducing $\ell_{p,q}$ mixed norms (specifically $p \in \{2, \infty\}$ and $q = 1$), we regularize the representation problem to enforce joint sparsity and learn the particle representations together. As compared to previous methods that handle particles independently, our results demonstrate that mining the interdependencies between particles improves tracking performance and overall computational complexity.
Interestingly, we show that the popular L1 tracker (Mei and Ling, IEEE Trans Pattern Anal Mach Intell 33(11), 2011) is a special case of our MTT formulation (denoted as the $L_{11}$ tracker) when $p = q = 1$. Under the MTT framework, some of the tasks (particle representations) are often more closely related and more likely to share common relevant covariates than other tasks. Therefore, we extend the MTT framework to take into account pairwise structural correlations between particles (e.g. spatial smoothness of representation) and denote the novel framework as S-MTT. The problem of learning the regularized sparse representation in MTT and S-MTT can be solved efficiently using an Accelerated Proximal Gradient (APG) method that yields a sequence of closed form updates. As such, S-MTT and MTT are computationally attractive. We test our proposed approach on challenging sequences involving heavy occlusion, drastic illumination changes, and large pose variations. Experimental results show that S-MTT is much better than MTT, and both methods consistently outperform state-of-the-art trackers.

Keywords Visual tracking · Particle filter · Graph · Structure · Sparse representation · Multi-task learning

1 Introduction

The problem of tracking a target in video arises in many important applications such as automatic surveillance, robotics, human-computer interaction, etc. For a visual tracking algorithm to be useful in real-world scenarios, it should be designed to handle and overcome cases where the target's appearance changes from frame to frame. Significant and rapid appearance variation due to noise, occlusion, varying

viewpoints, background clutter, and illumination and scale changes pose major challenges to any tracker, as shown in Fig. 1. Over the years, a plethora of tracking algorithms have been proposed to overcome these challenges. For a survey of many of these algorithms, we refer the reader to Yilmaz et al. (2006).

Fig. 1 (Color online) Frames from a shaking sequence. The ground truth track of the head is designated in green. Due to fast motion, occlusion, cluttered background, and changes in illumination, scale, and pose, visual object tracking is a difficult problem

Recently, sparse representation (Candès et al. 2006) has been successfully applied to visual tracking (Mei and Ling 2011; Mei et al. 2011; Liu et al. 2010, 2011). In this case, the tracker represents each target candidate as a sparse linear combination of dictionary templates that can be dynamically updated to maintain an up-to-date target appearance model. This representation has been shown to be robust against partial occlusions, which leads to improved tracking performance. However, sparse coding based trackers perform computationally expensive $\ell_1$ minimization at each frame. In a particle filter framework, computational cost grows linearly with the number of sampled particles. It is this computational bottleneck that precludes the use of these trackers in real-time scenarios. Consequently, very recent efforts have been made to speed up this tracking paradigm (Mei et al. 2011; Li et al. 2011). More importantly, these methods learn sparse representations of particles separately. Ignoring the relationships that ultimately constrain particle representations tends to make the tracker more prone to drifting away from the target, especially in cases of significant changes in appearance.

In this paper, we propose a computationally efficient multi-task sparse learning approach for visual tracking in a particle filter framework. Here, learning the representation of each particle is viewed as an individual task.
Inspired by the above work, the next target state is selected to be the particle that has the highest similarity with a dictionary of target templates. Unlike previous methods, we exploit similarities among particles and, therefore, seek an accurate, joint representation of these particles w.r.t. the dictionary. In our multi-task approach, particle representations are jointly sparse: only a few (but the same) dictionary templates should be used to represent all the particles at each frame. As opposed to sparse coding based trackers (Mei and Ling 2011; Mei et al. 2011; Liu et al. 2010, 2011) that handle particles separately, our use of joint sparsity incorporates the benefits of a sparse particle representation (e.g. partial occlusion handling), while respecting the underlying relationship between particles, which inherently yields a tracker that is more robust against various sources of appearance change.

Therefore, we propose a multi-task formulation (denoted as Multi-Task Tracking or MTT) for the robust object tracking problem. We exploit interdependencies among the appearances of different particles to obtain their representations jointly. Joint sparsity is imposed on particle representations through an $\ell_{p,q}$ mixed-norm regularizer, which is optimized using an Accelerated Proximal Gradient (APG) method that guarantees fast convergence. In fact, joint sparsity can be viewed as a global form of structural regularization that influences all particle representations together. Furthermore, to extend the MTT framework to enforce local structure, we observe that some tasks (particle representations) are often more closely related and more likely to share common relevant covariates than other tasks. Therefore, we expand the MTT framework to consider pairwise structural correlations between particles (e.g. spatial smoothness of representation) and denote the novel framework as Structured Multi-Task Tracking, abbreviated as S-MTT. A preliminary conference version of this work appeared in Zhang et al. (2012b).

Contributions: The contributions of this work are three-fold. 1.
We propose a multi-task sparse learning method for object tracking, which is a robust sparse coding method that mines relationships between different tasks to obtain better tracking results than learning each task individually. This is done by exploiting both global and local structure among tasks. To the best of our knowledge, this is the first work to exploit multi-task learning in object tracking.
2. We show that the popular L1 tracker (Mei and Ling 2011) is a special case of the proposed MTT framework.
3. Since we learn particle representations jointly, we can solve the S-MTT and MTT problems efficiently using an APG method. This makes our tracking method computationally attractive in general and significantly faster than the traditional L1 tracker in particular.

The rest of the paper is organized as follows. In Sect. 2, we summarize the works most related to ours. The particle filter algorithm is reviewed in Sect. 3. Section 4 gives a detailed description of the proposed tracking approach, with

the optimization details presented in Sect. 4.3. Experimental results are reported and analyzed in Sect. 5. We conclude the paper in Sect. 6.

2 Related Work

Visual tracking is an important topic in computer vision and it has been studied for several decades. There is extensive literature on visual object tracking. In what follows, we only briefly review nominal tracking methods and those that are the most related to our own. We focus specifically on tracking methods that use particle filters and sparse representation, as well as general multi-task learning methods. For a more thorough survey of tracking methods, we refer the readers to Yilmaz et al. (2006).

2.1 Object Tracking

In general, object tracking methods can be categorized as either generative or discriminative.

2.1.1 Generative Trackers

These methods adopt an appearance model to describe the target observations. Here, the aim of tracking is to search for the target location that has the most similar appearance to the generative model. Examples of generative methods are the eigentracker (Black and Jepson 1998), mean shift tracker (Comaniciu et al. 2003), appearance model based tracker (Jepson et al. 2003), context-aware tracker (Yang et al. 2009), fragment-based tracker (Frag) (Adam et al. 2006), incremental tracker (IVT) (Ross et al. 2008), and VTD tracker (Kwon and Lee 2010). In Black and Jepson (1998), a view-based representation is used for tracking rigid and articulated objects. This approach builds on and extends work on eigenspace representations, robust estimation techniques, and parameterized optical flow estimation. The mean shift tracker (Comaniciu et al. 2003) is a popular mode-finding method, which successfully copes with camera motion, partial occlusions, clutter, and target scale variations. In Jepson et al. (2003), a robust and adaptive appearance model is learned for motion-based tracking of natural objects. The model adapts to slowly changing object appearance, and it maintains an acceptable measure of stability in the observed image structure during tracking.
Moreover, the context-aware tracker (Yang et al. 2009) focuses on an object's context for robust visual tracking. Specifically, this method integrates into the tracking process a set of auxiliary objects that are automatically discovered in the video via data mining techniques. Furthermore, the tracking method proposed in Ross et al. (2008) incrementally learns a low-dimensional subspace representation, and efficiently adapts to online changes in target appearance. To adapt to variations in appearance (e.g. due to changes in illumination and pose), the appearance model can be dynamically updated. The Frag tracker (Adam et al. 2006) aims to handle partial occlusion with a representation based on histograms of local patches. The tracking task is carried out by accumulating votes from matching local patches using a template. However, this template is not updated and, thus, it is not expected to handle changes in object appearance that can be due to scale and shape variations. In the IVT tracker (Ross et al. 2008), an adaptive appearance model is constructed to account for appearance variation due to rigid or limited deformable motion. Although it has been shown to perform well when target objects undergo lighting and pose variation, IVT is less effective in handling heavy occlusion or non-rigid distortion as a result of the adopted holistic appearance model. Finally, the VTD tracker (Kwon and Lee 2010) effectively extends the conventional particle filter framework with multiple motion and observation models to account for appearance variation caused by changes in pose, lighting, and scale as well as partial occlusion. Nevertheless, as a result of the adopted generative representation scheme, this tracker is not equipped to distinguish between the target and its context (background).

2.1.2 Discriminative Trackers

These methods formulate visual object tracking as a binary classification problem, which seeks the target location that can best separate the target from its background. Examples of discriminative methods are on-line boosting (OAB) (Grabner et al. 2006), semi-online boosting (Grabner et al.
2008), ensemble tracking (Avidan 2005), co-training tracking (Liu et al. 2009), online multi-view forests for tracking (Leistner et al. 2010), adaptive metric differential tracking (Jiang et al. 2011) and online multiple instance learning tracking (Babenko et al. 2009). In the OAB tracker (Grabner et al. 2006), online AdaBoost is adopted to select useful features for object tracking. Its performance is affected by background clutter, and the tracker can easily drift. The ensemble tracker (Avidan 2005) formulates the tracking task as a pixel-based binary classification problem. Although this method is able to differentiate between target and background, the pixel-based representation is rather limited and thereby constrains its ability to handle heavy occlusion and clutter. In the MIL tracker (Babenko et al. 2009), the multiple instance learning method is extended to an online setting for object tracking. While it is capable of reducing tracker drift, this method is unable to handle large non-rigid shape deformation. In ensemble tracking (Avidan 2005), a feature vector is constructed for every pixel in the reference image and an adaptive ensemble of classifiers is trained to separate pixels that belong to the object from pixels that belong to the background. In Collins and Liu (2003), a target confidence map

is built by finding the most discriminative RGB color combination in each frame. Moreover, a hybrid approach that combines a generative model and a discriminative classifier is proposed in Yu et al. (2008) to capture appearance changes and allow reacquisition of an object after total occlusion. Global mode seeking can be used to detect and reinitialize the tracked object after total occlusion (Yin and Collins 2008). Yet another approach uses image fusion to determine the most discriminative appearance model and then a generative approach for dynamic target updates (Blasch and Kahler 2005).

2.2 Particle Filters for Object Tracking

Particle filters (also known as condensation or sequential Monte Carlo models) were introduced to visual tracking in Isard and Blake (1998). Over the last decade, they have become a popular tracking framework due primarily to their excellent performance in the presence of nonlinear target motion and to their flexibility with respect to different object representations (Wu and Huang 2004). In general, when more particles are sampled and a better target representation is constructed, particle filter based tracking algorithms are more likely to perform reliably in cluttered and noisy environments. However, the computational cost of particle filter trackers tends to increase linearly with the number of particles. Consequently, researchers have proposed various means of speeding up the particle filter framework. In Yang et al. (2005), tracked objects are described using color and edge orientation histogram features, and the observation likelihood is computed in a coarse-to-fine manner, which allows the computation to quickly focus on the more promising regions. In Khan et al. (2004), subspace representations are used in a particle filter for tracking. This tracker is made efficient by applying Rao-Blackwellization to the subspace coefficients in the state vector. In Zhou et al. (2004), the number of particle samples is adjusted according to an adaptive noise component.
2.3 Sparse Representation for Object Tracking

Recently, sparse representation has been introduced to particle filter based object tracking and has yielded noteworthy performance (Mei and Ling 2011; Mei et al. 2011; Liu et al. 2010; Li et al. 2011; Bao et al. 2012; Zhang et al. 2012a). In Mei and Ling (2011), a tracking candidate is represented as a sparse linear combination of object templates and trivial templates. For each particle, the sparse representation is computed by solving a constrained $\ell_1$ minimization problem with nonnegativity constraints, thus solving the inverse intensity pattern problem during tracking. Although this method yields good tracking performance, it comes at the computational expense of multiple $\ell_1$ minimization problems that are independently solved. In fact, the computational cost grows linearly with the number of particle samples. In Mei et al. (2011), an efficient L1 tracker with minimum error bound and occlusion detection is proposed. The minimum error bound is quickly calculated from a linear least squares equation, and serves as a guide for particle resampling in a particle filter framework. Without loss of precision during resampling, most of the irrelevant samples are removed before solving the computationally expensive $\ell_1$ minimization problem. In Liu et al. (2010), dynamic group sparsity is integrated into the tracking problem and high-dimensional image features are used to improve tracking robustness. In Li et al. (2011), dimensionality reduction and a customized orthogonal matching pursuit algorithm are adopted to accelerate the L1 tracker (Mei and Ling 2011). In Bao et al. (2012), an APG based solution is used to improve the L1 tracker (Mei and Ling 2011). In Zhang et al. (2012a), low-rank sparse learning is adopted to consider the correlations among particles for robust tracking. Inspired by these works, we address two problems: how to account for the correlations among particles and how to make the tracker fast. To this end, we propose the S-MTT tracking method.

2.4 Multi-Task Learning

Multi-task learning (MTL, Chen et al.
2009) has recently received much attention in machine learning and computer vision. It capitalizes on shared information between related tasks to improve the performance of each individual task, and it has been successfully applied to popular vision problems such as image classification (Yuan and Yan 2010) and image annotation (Quattoni et al. 2009). The underlying assumption behind many MTL algorithms is that the tasks are related. Thus, a key issue lies in how relationships between tasks are incorporated in the learning framework. Inspired by the above works, we want to improve computational efficiency and capitalize on the interdependence among particle appearances (for additional robustness in tracking). To this end, we propose a multi-task sparse representation method for robust object tracking.

3 Particle Filter

The particle filter (Doucet et al. 2001) is a Bayesian sequential importance sampling technique, which recursively approximates the posterior distribution of the state variables characterizing a dynamic system using a finite set of weighted samples. It provides a convenient framework for estimating and propagating the posterior probability density function of state variables, regardless of the underlying distribution, through a sequence of prediction and update steps. Let $s_t$ and $y_t$ denote the state variable describing the parameters of an

object at time $t$ (e.g. location or motion parameters) and its observation, respectively. The prediction stage uses the probabilistic system transition model $p(s_t \mid s_{t-1})$ to predict the posterior distribution of $s_t$ given all available observations $y_{1:t-1} = \{y_1, y_2, \ldots, y_{t-1}\}$ up to time $t-1$, as computed in Eq. (1).

$$p(s_t \mid y_{1:t-1}) = \int p(s_t \mid s_{t-1})\, p(s_{t-1} \mid y_{1:t-1})\, ds_{t-1} \tag{1}$$

At time $t$, the observation $y_t$ is available and the state vector is updated using Bayes rule, as in Eq. (2), where $p(y_t \mid s_t)$ denotes the observation likelihood.

$$p(s_t \mid y_{1:t}) = \frac{p(y_t \mid s_t)\, p(s_t \mid y_{1:t-1})}{p(y_t \mid y_{1:t-1})} \tag{2}$$

In the particle filter framework, the posterior $p(s_t \mid y_{1:t})$ is approximated by a finite set of $n$ samples $\{s_t^i\}_{i=1}^{n}$ (called particles) with importance weights $w_t^i$. The particle samples $s_t^i$ are drawn from an importance distribution $q(s_t \mid s_{1:t-1}, y_{1:t})$ and the importance weights are updated according to Eq. (3).

$$w_t^i = w_{t-1}^i \, \frac{p(y_t \mid s_t^i)\, p(s_t^i \mid s_{t-1}^i)}{q(s_t \mid s_{1:t-1}, y_{1:t})} \tag{3}$$

To avoid degeneracy, particles are resampled according to the importance weights so as to generate a set of equally weighted particles. For simplicity, in the case of the bootstrap filter (Doucet et al. 2001), we set $q(s_t \mid s_{1:t-1}, y_{1:t}) = p(s_t \mid s_{t-1})$, so that the weights are updated by the observation likelihood $p(y_t \mid s_t)$.

Particle filters have been used extensively in object tracking (Yilmaz et al. 2006). In this paper, we also employ particle filters to track the target object. Similar to Mei and Ling (2011), we assume an affine motion model between consecutive frames. Therefore, the state variable $s_t$ consists of the six parameters of the affine transformation (a 2D linear transformation and a 2D translation). By applying an affine transformation using $s_t$ as parameters, we crop the region of interest $y_t$ from the image and normalize it to the size of the target templates in our dictionary. The state transition distribution $p(s_t \mid s_{t-1})$ is modeled to be Gaussian with the dimensions of $s_t$ assumed independent. The observation model $p(y_t \mid s_t)$ reflects the similarity between a target candidate (particle) and dictionary templates.
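The prediction, update, and resampling steps described above can be sketched in a few lines. This is only a minimal illustration of a bootstrap filter under stated assumptions, not the authors' implementation: the function name `bootstrap_step`, the `likelihood` callable, and the per-dimension standard deviations `sigma` are hypothetical placeholders introduced here.

```python
import numpy as np

def bootstrap_step(particles, likelihood, sigma, rng):
    """One predict/update/resample cycle of a bootstrap particle filter.

    particles : (n, 6) array of affine states s_{t-1}^i (assumed equally weighted)
    likelihood: hypothetical callable mapping an (n, 6) array of states to n
                positive values standing in for p(y_t | s_t^i)
    sigma     : (6,) standard deviations of the independent Gaussian transition
    """
    n = particles.shape[0]
    # Prediction: sample s_t^i ~ p(s_t | s_{t-1}^i), one Gaussian per state dimension.
    proposed = particles + rng.normal(size=particles.shape) * sigma
    # Update: with q = p(s_t | s_{t-1}), Eq. (3) reduces to w_t^i proportional to p(y_t | s_t^i).
    w = np.asarray(likelihood(proposed), dtype=float)
    w = w / w.sum()
    # Resampling: draw n indices with probability w to obtain equally weighted particles.
    idx = rng.choice(n, size=n, p=w)
    best = proposed[np.argmax(w)]  # highest-weight particle, used as the state estimate
    return proposed[idx], best, w
```

In a tracker of the kind described here, `likelihood` would crop one image patch per state and score it against the dictionary; any positive scoring function can be plugged in to exercise the sketch.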
In this paper, $p(y_t \mid s_t)$ is inversely proportional to the reconstruction error obtained by linearly representing $y_t$ using the dictionary of templates.

4 Structured Multi-Task Tracking (S-MTT)

In this section, we give a detailed description of our particle filter based tracking method that makes use of structured multi-task learning to represent particle samples.

4.1 Structured Multi-Task Representation of Particles

In the MTL framework, tasks that share dependencies in features or learning parameters are jointly solved in order to capitalize on their inherent relationships. Many works in this domain have shown that MTL can be applied to classical problems (e.g. image annotation (Quattoni et al. 2009) and image classification (Yuan and Yan 2010)) and outperform state-of-the-art methods that resort to independent learning. In this paper, we formulate object tracking as an MTL problem, where learning the representation of a particle is viewed as a single task. Usually, particle representations in tracking are computed separately (e.g. the L1 tracker, Mei and Ling 2011). In this paper, we show that by representing particles jointly in an MTL setting, tracking performance and tracking speed can be significantly improved.

In our particle filter based tracking method, particles are randomly sampled around the current state of the tracked object according to a zero-mean Gaussian distribution. At time instance $t$, we consider $n$ particle samples, whose observations (pixel color values) in the $t$th frame are denoted in matrix form as $X = [x_1, x_2, \ldots, x_n]$, where each column is a particle in $\mathbb{R}^d$. In the noiseless case, each particle $x_i$ is represented as a linear combination $z_i$ of templates that form a dictionary $D_t = [d_1, d_2, \ldots, d_m]$, such that $X = D_t Z$. The dictionary columns comprise the templates that will be used to represent each particle. These templates include visual observations of the tracked object (called target templates), possibly under a variety of appearance changes.
Since our representation is constructed at the pixel level, misalignment between dictionary templates and particles might lead to degraded performance. To alleviate this problem, one of two strategies can be employed. (i) $D_t$ can be constructed from a dense sampling of the target object, which can also include transformed versions of these samples. (ii) Columns of $X$ can be aligned to columns of $D_t$ as in Peng et al. (2012) to solve the geometric transformation. In this paper, we employ the first strategy, which leads to a larger $m$ but a lower overall computational cost. We denote $D_t$ with a subscript because the dictionary templates will be progressively updated to incorporate variations in object appearance due to changes in illumination, viewpoint, etc. Our dictionary update scheme is adopted from the work in Mei and Ling (2011), but for completeness, we present its details in a later section.

In many visual tracking scenarios, target objects are often corrupted by noise or partially occluded. As in Mei and Ling (2011), this noise can be modeled as sparse additive noise that can take on large values anywhere in its support. Therefore, in the presence of noise, we can still represent the particle observations $X$ as a linear combination of templates, where the dictionary is augmented with trivial (or occlusion) templates $I_d$ (identity matrix of $\mathbb{R}^{d \times d}$), as shown in Eq. (4). The representation error $e_i$ of particle $i$ using dictionary $D_t$ is the

$i$th column in $E$. The nonzero entries of $e_i$ indicate the pixels in $x_i$ that are corrupted or occluded. The nonzero support of $e_i$ can be different among particles and is unknown a priori.

$$X = [D_t \;\; I_d] \begin{bmatrix} Z \\ E \end{bmatrix} \;\Longrightarrow\; X = BC \tag{4}$$

4.1.1 Imposing Joint Sparsity via $\ell_{p,q}$ Mixed-Norm

Because most particles are densely sampled around the current target state, their representations with respect to $D_t$ will be sparse (few templates are required to represent them) and similar to each other (the support of particle representations is similar) in general. These two properties culminate in individual particle representations (single tasks) being jointly sparse. In other words, joint sparsity will encourage all particle representations to be individually sparse and share the same (few) dictionary templates that reliably represent them. This yields a more robust representation for the ensemble of particles. In fact, joint sparsity has been recently employed to address MTL problems (Quattoni et al. 2009; Yuan and Yan 2010). A common technique to explicitly enforce joint sparsity in MTL is the use of sparsity-inducing norms to regularize the parameters shared among the individual tasks. In this paper, we investigate the use of convex $\ell_{p,q}$ mixed norms (i.e. $p \geq 1$ and $q \geq 1$) to address the problem of MTL in particle filter based tracking (denoted as MTT). Therefore, we need to solve the convex optimization problem in Eq. (5), where $\lambda$ is a tradeoff parameter between reliable reconstruction and joint sparsity regularization.

$$\min_{C} \; \frac{1}{2} \|X - BC\|_F^2 + \lambda \|C\|_{p,q} \tag{5}$$

Note that we define $\|C\|_{p,q}$ as in Eq. (6), where $\|C_i\|_p$ is the $\ell_p$ norm of $C_i$ and $C_i$ is the $i$th row of matrix $C$.

$$\|C\|_{p,q} = \left( \sum_{i=1}^{m+d} \left( \|C_i\|_p \right)^q \right)^{1/q} \tag{6}$$

As in Eq. (5), given a dictionary $B$, for the $n$ tasks $X = [x_1, x_2, \ldots, x_n]$ (each column is a particle), we aim to discover, across these $n$ tasks, a few common templates that are the most useful for particle representation.
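As a small worked illustration of Eqs. (5) and (6), the mixed norm can be evaluated by taking the $\ell_p$ norm of every row of $C$ and then the $\ell_q$ norm of the resulting vector. The sketch below assumes NumPy; the function names `mixed_norm` and `mtt_objective` are illustrative and do not come from the paper.

```python
import numpy as np

def mixed_norm(C, p, q=1.0):
    """l_{p,q} mixed norm of Eq. (6): l_p norm of each row, then l_q norm of those values."""
    row_norms = np.linalg.norm(C, ord=p, axis=1)  # p may be 1, 2 or np.inf
    return float(np.linalg.norm(row_norms, ord=q))

def mtt_objective(X, B, C, lam, p):
    """Value of the MTT objective in Eq. (5) for q = 1."""
    residual = X - B @ C
    return 0.5 * np.sum(residual ** 2) + lam * mixed_norm(C, p, 1.0)

# A 3 x 2 coefficient matrix with one all-zero row (an unused template).
C = np.array([[3.0, 4.0],
              [0.0, 0.0],
              [1.0, -1.0]])
print(mixed_norm(C, 2))       # 5 + 0 + sqrt(2)
print(mixed_norm(C, 1))       # 7 + 0 + 2 = 9
print(mixed_norm(C, np.inf))  # 4 + 0 + 1 = 5
```

With $q = 1$, a row contributes nothing to the penalty only when it is entirely zero, which is why minimizing this norm favors a few templates shared by all particles.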
In this setting, the constraint of joint sparsity across different tasks is valuable since different tasks may favor different sparse reconstruction coefficients, yet the joint sparsity enforces the robustness in coefficient estimation. Moreover, joint sparsity exploits correlations among different tasks to obtain better generalization performance as compared to learning each task individually. The goal of joint sparsity is achieved by imposing an $\ell_{p,q}$ mixed-norm penalty on the reconstruction coefficients. In fact, joint sparsity can be viewed as a global form of structural regularization that influences all particle representations together. In the next section, we extend the MTT framework to enforce local structure as well.

4.1.2 Imposing Structure via Graph Regularization

Enforcing joint sparsity using the $\ell_{p,q}$ mixed-norm exploits the global structure inherent to particle representations in any given frame. However, in particle based MTT, some of the tasks are often more closely related and more likely to share common relevant covariates than other tasks. This induces another layer of structure, which affects particle representations locally. Therefore, we expand the MTT framework to consider pairwise structural correlations between particles (e.g. spatial smoothness of representation) and denote the novel framework as Structured Multi-Task Tracking, abbreviated as S-MTT. The S-MTT formulation can be viewed as a generalization of MTT, since local structural information endows MTT with another layer of robustness in tracking. In fact, our experiments show that incorporating such local structural information significantly improves the performance of MTT.

We assume that the learned representations $C$ can be related through pairwise interactions, which are considered local structural priors to $C$. In this paper, we use these structural priors to enforce spatial smoothness among particle representations. In other words, particles that are spatially located near each other in the same frame should have similar representations.
In general, higher order relationships can be added to the S-MTT framework; however, such relationships significantly increase the complexity of learning the optimal $C$. Therefore, we restrict ourselves to pairwise relationships that are defined as edges in a graph, whose nodes constitute the particle representations (i.e. columns of $C$). As such, we incorporate these local pairwise relationships into Eq. (5) by adding a suitable graph regularization term. To do this, we investigate the use of the well-known and widely used normalized graph smoothness regularizer, which is a weighted sum of pairwise distances between the normalized representations in $C$. The weight of each distance term reflects how strongly the corresponding pairwise relationship should be enforced. Since the normalized version of this regularizer has been shown to produce better and more stable results in many learning problems, we prefer it over its unnormalized counterpart (Zhu 2008).

In Eq. (7), we formalize the graph regularizer, denoted as $G(C)$. Here, we define a symmetric weight matrix $W \in \mathbb{R}_{+}^{n \times n}$ that describes the similarity measure between every pair of particle representations. In fact, $W$ represents the weights of all edges in the graph. Therefore, $W_{ij}$ is the similarity measure between the $i$th particle $c_i$ and the $j$th particle $c_j$. Here, we denote $\hat{d}_i = \sum_{j=1}^{n} W_{ij}$, the sum of the elements of the $i$th row of $W$, and $\hat{D} = \mathrm{diag}(\hat{d}_1, \hat{d}_2, \ldots, \hat{d}_n)$. In graph

theory, $\hat{D}$ is called the degree of the graph. In the last step, we denote $L = \hat{D} - W$ as the Laplacian of the graph and $\mathrm{Tr}(A)$ as the trace of matrix $A$.

$$G(C) = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} W_{ij} \left\| \frac{c_i}{\sqrt{\hat{d}_i}} - \frac{c_j}{\sqrt{\hat{d}_j}} \right\|_2^2 = \mathrm{Tr}\left( C \hat{L} C^T \right), \quad \text{where } \hat{L} = \hat{D}^{-\frac{1}{2}} L \hat{D}^{-\frac{1}{2}} \tag{7}$$

We define $W_{ij}$ to decrease exponentially with the distance between the spatial locations of the $i$th and $j$th particles, as in Eq. (8). Here, we denote $l_i$ as the 2D location of the center of the $i$th particle in the current frame and $\delta$ as a smoothing factor. The factor $\delta$ is the average value of all distances between $l_i$ and $l_j$. Note that other similarity measures can be used to describe $W_{ij}$, including the PASCAL overlap score.¹

$$W_{ij} = \exp\left( -\frac{\| l_i - l_j \|_2^2}{2\delta^2} \right) \tag{8}$$

Therefore, the particle representations $C$ can be computed by solving Eq. (9), which simply adds the graph regularizer $G(C)$ to Eq. (5). Here, $\hat{L}$ is the normalized Laplacian matrix, and $\lambda_1$ and $\lambda_2$ are two parameters that quantify the tradeoff between local and global structural regularization.

$$\min_{C} \; \frac{1}{2} \|X - BC\|_F^2 + \frac{\lambda_1}{2} \mathrm{Tr}\left( C \hat{L} C^T \right) + \lambda_2 \|C\|_{p,q} \tag{9}$$

Fig. 2 (Color online) Schematic example of the $L_{21}$ tracker. The representation $C$ of all particles $X$ w.r.t. dictionary $B$ (set of target and occlusion templates) is learned by solving Eq. (9) with $p = 2$ and $q = 1$. Notice that the columns of $C$ are jointly sparse, i.e. a few (but the same) dictionary templates are used to represent all the particles together. The particle $x_i$ is selected among all other particles as the tracking result, since $x_i$ is represented the best by object templates only

4.2 Discussion

To encourage a sparse number of dictionary templates to be selected for all particles, we restrict our choice of $\ell_{p,q}$ mixed norms to the case of $q = 1$, thus $\|C\|_{p,1} = \sum_{i=1}^{m+d} \|C_i\|_p$. Among its convex options, we select three popular and widely studied $\ell_{p,1}$ norms: $p \in \{1, 2, \infty\}$. The S-MTT objective in Eq. (9) is composed of a convex² quadratic term and a non-smooth regularizer, and thus we conventionally adopt the APG method (Tseng 2008) for optimization. The solution to Eq.
(9) for these choices of $p$ and $q$ is described in Sect. 4.3. Note that each choice of $p$ yields a different S-MTT tracker, which we will denote as the $L_{p1}$ tracker. To discriminate between S-MTT and MTT trackers, we refer to the MTT tracker (i.e. when $\lambda_1 = 0$ in Eq. (9)) using the $\ell_{p,1}$ mixed norm as the MTT $L_{p1}$ tracker. In Sect. 5.6, we show that S-MTT $L_{p1}$ trackers can lead to significant improvement in tracking performance over MTT $L_{p1}$ trackers, in general.

¹ The score is the ratio of the intersection to the union of two bounding boxes. In our case, it would be the ratio of the intersection of the ground truth and the predicted tracks to their union in each frame.
² Since the degree matrix $\hat{D}$ is diagonal and non-negative and since the Laplacian $L$ of any graph is positive semi-definite, the normalized Laplacian $\hat{L}$ is positive semi-definite. Thus, $G(C)$ is convex in $C$.

In Fig. 2, we present an example of how the $L_{21}$ tracker works. Given all particles $X$ (sampled around the tracked car) and based on a dictionary $B$, we learn the representation matrix $C$ by solving Eq. (9). Note that smaller values are darker in color. Clearly, columns of $C$ are jointly sparse, i.e. a few (but the same) dictionary templates are used to represent all the particles together. Particle $x_i$ is chosen as the current tracking result $y_t$ because it has the smallest reconstruction error w.r.t. the target templates $D_t$. Since particles $x_j$ and $x_k$ are misaligned

versions of the car, they are not represented well by $D_t$ (i.e. $z_j$ and $z_k$ have small values). This precludes the tracker from drifting into the background.

As for Eq. (6), when $p = q = 1$, we note that

\[
\| C \|_{1,1} = \sum_{i=1}^{m+d} \| C^i \|_1 = \sum_{j=1}^{n} \| c_j \|_1, \tag{10}
\]

where $c_j$ and $C^i$ represent the $j$th column and $i$th row in $C$ respectively. This equivalence property between rows and columns (i.e. the sum of the $\ell_p$ norms of rows and that of columns are the same) only occurs when $p = 1$. In this case, MTT (Eq. (5)) and S-MTT (Eq. (9) when $\lambda_1 = 0$) become equivalent to Eq. (11). This optimization problem is no longer an MTL problem, since the $n$ representation tasks are unrelated and are solved separately. Interestingly, Eq. (11) is the same formulation used in the popular $L_1$ tracker (Mei and Ling 2011), which can be viewed as a special case of our proposed family of S-MTT algorithms, namely the $L_{11}$ tracker. In fact, using the optimization technique in Sect. 4.3, we observe that our $L_{11}$ implementation is two orders of magnitude faster than the traditional $L_1$ tracker.

\[
\min_{c_1, \ldots, c_n} \; \sum_{j=1}^{n} \left( \frac{1}{2} \| x_j - B c_j \|_2^2 + \lambda_2 \| c_j \|_1 \right) \tag{11}
\]

A detailed overview of the proposed S-MTT tracking method is shown in Algorithm 1. Based on the previous state $s_{t-1}$, we can use the importance sampling approach (Isard and Blake 1998) to obtain new particles and crop the corresponding image patches to obtain their observations $X$. Then, we learn their representations $C$ by solving Eq. (9), to be shown in Sect. 4.3. The particle that has the smallest reconstruction error is selected to be the current tracking result. Finally, the dictionary templates in $D_t$ are updated adaptively, to be shown in Sect. 4.4.

4.3 Solving Eq. (9)

In S-MTT, we need to solve Eq. (9) when $q = 1$ and $p \in \{1, 2, \infty\}$. Clearly, the overall objective is non-smooth due to the non-smoothness of the $\ell_{p,1}$ mixed norm. If straightforward first-order subgradient methods were used to solve the S-MTT problem, only sublinear convergence (i.e. convergence to an $\epsilon$-accurate solution in $O(1/\epsilon^2)$ iterations) is guaranteed.
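Before turning to the solver, it may help to see how the graph term of Eq. (9) is assembled. The following pure-Python sketch builds the affinities of Eq. (8) and the normalized Laplacian, then evaluates $G(C)$ both as the pairwise sum and as the trace in Eq. (7). It is only an illustration: the function names, the toy particle locations, and the choice to average $\delta$ over distinct pairs ($i \neq j$) are assumptions of this sketch, not details taken from the paper.

```python
import math

def affinity(locs):
    """Eq. (8): W_ij = exp(-||l_i - l_j||^2 / (2 * delta^2))."""
    n = len(locs)
    dist = [[math.dist(a, b) for b in locs] for a in locs]
    # Assumption: delta is the mean distance over distinct pairs (i != j).
    delta = sum(dist[i][j] for i in range(n) for j in range(n) if i != j) / (n * (n - 1))
    return [[math.exp(-dist[i][j] ** 2 / (2 * delta ** 2)) for j in range(n)]
            for i in range(n)]

def degrees(W):
    """Diagonal of the degree matrix: d_i = sum_j W_ij."""
    return [sum(row) for row in W]

def normalized_laplacian(W):
    """L_hat = D^{-1/2} (D - W) D^{-1/2}."""
    d, n = degrees(W), len(W)
    return [[((d[i] if i == j else 0.0) - W[i][j]) / math.sqrt(d[i] * d[j])
             for j in range(n)] for i in range(n)]

def graph_reg_pairwise(C, W):
    """Left-hand form of Eq. (7); C has one column per particle."""
    d, n = degrees(W), len(W)
    total = 0.0
    for i in range(n):
        for j in range(n):
            diff = sum((row[i] / math.sqrt(d[i]) - row[j] / math.sqrt(d[j])) ** 2
                       for row in C)
            total += W[i][j] * diff
    return 0.5 * total

def graph_reg_trace(C, L_hat):
    """Right-hand form of Eq. (7): Tr(C L_hat C^T)."""
    n = len(L_hat)
    return sum(row[i] * L_hat[i][j] * row[j]
               for row in C for i in range(n) for j in range(n))

# Toy example: 4 particle centers and a 2 x 4 representation matrix.
locs = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
C = [[1.0, 0.5, 0.0, 2.0], [0.0, -1.0, 3.0, 0.5]]
W = affinity(locs)
L_hat = normalized_laplacian(W)
g_pair = graph_reg_pairwise(C, W)
g_trace = graph_reg_trace(C, L_hat)
```

The two values `g_pair` and `g_trace` agree up to rounding, which is the identity stated in Eq. (7).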
To obtain a better convergence rate, we exploit recent developments in non-smooth convex optimization. The unpublished manuscript by Nesterov (2007) considers the problem of minimizing a convex objective composed of a smooth convex term and a simple non-smooth convex term. Here, simple means that the proximal mapping³ of the non-smooth term can be computed efficiently. In this case, an APG method can be devised to solve the non-smooth convex program with guaranteed quadratic convergence, i.e. the objective error decays as $O(1/k^2)$ after $k$ iterations. Because of its attractive convergence property, this APG method has been extensively used to efficiently solve smooth convex optimization problems with non-smooth norm regularizers (e.g. MTL problems, Chen et al. 2009). It should be noted that Beck and Teboulle (2009) independently proposed the FISTA algorithm for solving linear inverse problems with the same convergence rate. This work was further extended to convex-concave optimization in Tseng (2008).

In general, APG iterates between updating the current representation matrix $C^{(k)}$ and an aggregation matrix $V^{(k)}$. Each APG iteration consists of two steps: (1) a generalized gradient mapping step that updates $C^{(k)}$ keeping $V^{(k)}$ fixed, and (2) an aggregation step that updates $V^{(k)}$ by linearly combining $C^{(k+1)}$ and $C^{(k)}$. To initialize $V^{(0)}$ and $C^{(0)}$, we set them to $\mathbf{0}$.

4.3.1 Gradient Mapping Step

Given the current estimate $V^{(k)}$, we obtain $C^{(k+1)}$ by solving Eq. (12), which is nothing but the proximal mapping of $\tilde{\lambda} \| Y \|_{p,1}$. Here, $\eta$ is a small step parameter, and $\tilde{\lambda} = \eta \lambda_2$.

\[
C^{(k+1)} = \arg\min_{Y} \; \frac{1}{2} \| Y - H \|_F^2 + \tilde{\lambda} \| Y \|_{p,q}, \tag{12}
\]

The temporary variable $H$ is an $\eta$-step from the current estimate $V^{(k)}$ along the negative gradient of the smooth term in Eq. (9) and is calculated in Eq. (13).

³ The proximal mapping of a non-smooth convex function $h(\cdot)$ is defined as: $\mathrm{prox}_h(x) = \arg\min_u \left( h(u) + \frac{1}{2} \| u - x \|_2^2 \right)$.
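As a brief aside on the proximal mapping defined in footnote 3: for the three vector norms considered in this section it can be evaluated row by row in a few lines. The sketch below is an illustration in pure Python, with hypothetical function names. The $p = 1$ and $p = 2$ cases are the usual soft-thresholding and group-shrinkage rules. For $p = \infty$ the sketch uses the Moreau decomposition (the input minus its Euclidean projection onto the $\ell_1$ ball of radius $\lambda$), which is a standard route chosen here for clarity and is an assumption of this sketch rather than the closed-form expression given later in this section.

```python
import math

def prox_l1(h, lam):
    """Soft-thresholding: elementwise sign(x) * max(0, |x| - lam)."""
    return [math.copysign(max(0.0, abs(x) - lam), x) for x in h]

def prox_l2(h, lam):
    """Group shrinkage: scale the whole row by max(0, 1 - lam / ||h||_2)."""
    norm = math.sqrt(sum(x * x for x in h))
    if norm <= lam:
        return [0.0] * len(h)
    scale = 1.0 - lam / norm
    return [scale * x for x in h]

def project_l1_ball(h, radius):
    """Euclidean projection onto {u : ||u||_1 <= radius} by sorting."""
    if radius <= 0:
        return [0.0] * len(h)
    if sum(abs(x) for x in h) <= radius:
        return list(h)
    u = sorted((abs(x) for x in h), reverse=True)
    cumsum, theta = 0.0, 0.0
    for k, u_k in enumerate(u, start=1):
        cumsum += u_k
        t = (cumsum - radius) / k
        if u_k > t:          # condition holds for k = 1..rho, fails afterwards
            theta = t
    return [math.copysign(max(0.0, abs(x) - theta), x) for x in h]

def prox_linf(h, lam):
    """Moreau decomposition: prox of lam*||.||_inf = h - proj onto l1 ball."""
    p = project_l1_ball(h, lam)
    return [x - y for x, y in zip(h, p)]

row = [3.0, -2.0, 0.5]
shrunk_1 = prox_l1(row, 1.5)
shrunk_2 = prox_l2(row, 1.5)
shrunk_inf = prox_linf(row, 1.5)
```

In the $p = \infty$ case the largest magnitudes of the row are clipped to a common level, so `shrunk_inf` keeps small entries untouched while `shrunk_1` shrinks every entry.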

\[
H = V^{(k)} - \eta \nabla_s^{(k)} = V^{(k)} - \eta \left[ B^T B V^{(k)} + \lambda_1 V^{(k)} \hat{L} - B^T X \right]. \tag{13}
\]

Taking joint sparsity into consideration and since $q = 1$, Eq. (12) decouples into $(m + d)$ disjoint subproblems (one for each row vector $C^i$), as shown in Eq. (14), where $Y^i$ and $H^i$ denote the $i$th row of the matrices $Y$ and $H$, respectively.

\[
C^{i}_{(k+1)} = \arg\min_{Y^i} \; \frac{1}{2} \| Y^i - H^i \|_2^2 + \tilde{\lambda} \| Y^i \|_p \tag{14}
\]

Each subproblem is the proximal mapping of the $\ell_p$ vector norm, which is a variant of the vector projection problem onto the $\ell_p$ ball. The solution to each subproblem and its time complexity depend on $p$. In Sect. 4.3.3, we provide the solution of this subproblem for popular $\ell_p$ norms, namely $p \in \{1, 2, \infty\}$.

4.3.2 Aggregation Step

In this step, we construct a linear combination of $C^{(k)}$ and $C^{(k+1)}$ to update $V^{(k+1)}$ as follows:

\[
V^{(k+1)} = C^{(k+1)} + \alpha_{k+1} \left( \frac{1 - \alpha_k}{\alpha_k} \right) \left( C^{(k+1)} - C^{(k)} \right), \tag{15}
\]

where $\alpha_k$ is conventionally set to $\frac{2}{k+3}$. Our overall APG algorithm is summarized in Algorithm 2. Note that convergence is achieved when the relative change in solution or objective function falls below a predefined tolerance.

4.3.3 Solving Eq. (14) for $p \in \{1, 2, \infty\}$

The solution to Eq. (14) depends on the value of $p$. For $p \in \{1, 2, \infty\}$, we show that this solution has a closed form. Note that these solutions can be extended to $\ell_p$ norms beyond the three that we consider here.

For $p = 1$: The solution of Eq. (14) is equivalent to the $L_1$ tracker solution, as shown in Eq. (11). Here, the update $C^{i}_{(k+1)}$ is computed in closed form in Eq. (16), where $S_{\tilde{\lambda}}$ is the soft-thresholding operator defined in scalar form as $S_{\lambda}(a) = \mathrm{sign}(a) \max(0, |a| - \lambda)$. This operator is applied elementwise on $H^i$.

\[
C^{i}_{(k+1)} = \arg\min_{Y^i} \; \frac{1}{2} \| Y^i - H^i \|_2^2 + \tilde{\lambda} \| Y^i \|_1 = S_{\tilde{\lambda}}(H^i) \tag{16}
\]

For $p = 2$: Following Chen et al. (2009), the update $C^{i}_{(k+1)}$ is computed in closed form in Eq. (17).

\[
C^{i}_{(k+1)} = \arg\min_{Y^i} \; \frac{1}{2} \| Y^i - H^i \|_2^2 + \tilde{\lambda} \| Y^i \|_2 = \max\left( 0, 1 - \frac{\tilde{\lambda}}{\| H^i \|_2} \right) H^i \tag{17}
\]

For $p = \infty$: The update $C^{i}_{(k+1)}$ is computed via a projection onto the $\ell_1$ ball that can be done by a simple sorting procedure (Chen et al. 2009). In this case, the solution is given in Eq. (18),

\[
C^{i}_{(k+1)} = \max\left( 0, 1 - \frac{\tilde{\lambda}}{\| H^i \|_1} \right) a, \tag{18}
\]

where the $j$th element of vector $a$ is defined as

\[
a_j = \mathrm{sgn}(H_{ij}) \min\left( |H_{ij}|, \; \frac{1}{\hat{j}} \left( \sum_{r=1}^{\hat{j}} u_r - \tilde{\lambda} \right) \right), \quad j = 1, \ldots, n.
\]

The temporary parameters $u_r$ and $\hat{j}$ are obtained as follows. We set $u_j = |H_{ij}|$ and sort these values in decreasing order: $u_1 \geq u_2 \geq \cdots \geq u_n$. Then, we set $\hat{j} = \max\{ j : \sum_{r=1}^{j} (u_r - u_j) < \tilde{\lambda} \}$. Here, $H_{ij}$ denotes the element in the $i$th row and $j$th column of matrix $H$.

4.3.4 Computational Complexity of Algorithm 2

The computational complexity of each iteration in Algorithm 2 is dominated by the gradient computation in Step 3 and the update of $C^{(k+1)}$ in Step 4. Exploiting the structure of $B$, the complexity of Step 3 is $O(mnd)$, while that of Step 4 depends on $p$. The latter complexity is $O(n(m + d))$ for $p \in \{1, 2\}$ and $O(n(m + d)(1 + \log n))$ for $p = \infty$. Since $d \gg m$, the per-frame complexity of the proposed S-MTT and MTT trackers is $O(mnd\,\epsilon^{-\frac{1}{2}})$, where the number of iterations is $O(\epsilon^{-\frac{1}{2}})$ for an $\epsilon$-accurate solution. In comparison, the time complexity of the $L_1$ tracker (that is equivalent to our $L_{11}$ tracker) is at least $O(nd^2)$. In our experiments, we observe that the $L_{11}$ tracker (that uses APG) is two orders of magnitude faster than the $L_1$ tracker (that solves $n$ Lasso problems independently) in general. For example, when $m = 11$, $n = 200$, and $d = 32 \times 32$, the average per-frame run-times for $L_{11}$ and $L_1$ are […] and

[…] s, respectively. This is on par with the accelerated real-time implementation of the $L_1$ tracker in Li et al. (2011).

4.4 Dictionary Update

Target appearance remains the same only for a certain period of time, but eventually the object templates in $D_t$ are no longer an accurate representation of the target's appearance. A fixed appearance template is prone to the tracking drift problem, since it is incapable of handling appearance changes over time. In this paper, our dictionary update scheme is adopted from the work in Mei and Ling (2011). Each target template in $D_t$ is assigned a weight that is indicative of how representative the template is. In fact, the more a template is used to represent tracking results, the higher its weight is. When $D_t$ cannot represent some particles well (up to a predefined threshold), the target template with the smallest weight is replaced by the current tracking result. To initialize the $m$ target templates, we sample equal-sized patches at and around the initial position of the target. In our experiments, we shift the initial bounding box by 3 pixels in each direction, thus resulting in $m = 11$ object templates as in Mei and Ling (2011). All dictionary templates are normalized.

5 Experimental Results

In this section, we present extensive experimental results that validate the effectiveness and efficiency of our proposed S-MTT and MTT methods. We also conduct a thorough comparison between our proposed trackers and state-of-the-art tracking methods where applicable.

The experimental results are organized as follows. In Sect. 5.1, we give an overview of the video dataset on which we test our S-MTT/MTT trackers. Section 5.2 enumerates the six state-of-the-art trackers that we compare against. The implementation details of our proposed trackers are highlighted in Sect. 5.3. In Sect. 5.4, we report the average runtime of the S-MTT trackers with varying parameter settings, and compare it to the runtime of the $L_1$ tracker. Qualitative and quantitative comparisons between S-MTT/MTT and the state-of-the-art trackers are made in Sects.
5.5 and 5.6 respectively. The comparative results demonstrate that our method provides more robust and accurate tracking results than the state-of-the-art. Several videos of the tracking results can be found in the supplementary material. The videos and codes will be made available on our project website.

5.1 Datasets

To evaluate our proposed trackers, we compile a set of 15 challenging tracking sequences (denoted as car4, car11, david indoor, sylv, trellis70, girl, coke, faceocc2, shaking, football, singer1, singer1(low frame rate), skating1, skating1(low frame rate), soccer). The video sequences car4, car11, david indoor, sylv and trellis70 can be downloaded from an online source.⁵ The video sequences girl, coke and faceocc2 can be downloaded from an online source.⁶ The other video sequences shaking, football, singer1, singer1(low frame rate), skating1, skating1(low frame rate) and soccer can be downloaded from an online source.⁷ These videos are recorded in indoor and outdoor environments and include most challenging factors in visual tracking: complex background, moving camera, fast movement, large variation in pose and scale, occlusion, as well as shape deformation and distortion (see Figs. 3, 4).

5.2 Baselines

We compared the proposed algorithms (MTT and S-MTT) with six state-of-the-art visual trackers: VTD tracker (Kwon and Lee 2010), $L_1$ tracker (Mei and Ling 2011), IVT tracker (Ross et al. 2008), MIL tracker (Babenko et al. 2009), Fragments-based tracker (Frag) (Adam et al. 2006), and OAB tracker (Grabner et al. 2006). We use the publicly available source codes or binaries provided by the authors themselves with the same initialization and parameter settings to generate the comparative results. In our experiments, our proposed tracking methods use the same parameters for all the test sequences.

5.3 Implementation Details

All our experiments are done using MATLAB R2008b on a 2.66 GHz Intel Core2 Duo PC with 6 GB RAM. The template size $d$ is set to half the size of the target initialization in the first frame. Usually, $d$ is in the order of several hundreds of pixels.
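To make the implementation concrete, the APG iteration of Sect. 4.3 (gradient mapping step followed by aggregation step) can be sketched as below for the $p = 2$ case. This is a minimal pure-Python illustration, not the authors' MATLAB code. The helper names (`matmul`, `transpose`, `row_shrink`, `apg`), the fixed iteration count, and the zero initialization are assumptions of this sketch, and the step size `eta` is assumed to be no larger than the reciprocal of the Lipschitz constant of the smooth term's gradient.

```python
import math

def matmul(A, B):
    """Plain list-of-rows matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def row_shrink(H, lam):
    """Row-wise prox of lam * ||.||_2, i.e. the p = 2 update."""
    out = []
    for h in H:
        norm = math.sqrt(sum(x * x for x in h))
        scale = max(0.0, 1.0 - lam / norm) if norm > 0 else 0.0
        out.append([scale * x for x in h])
    return out

def apg(X, B, L_hat, lam1, lam2, eta, iters=100):
    """Sketch of APG for 0.5||X-BC||_F^2 + 0.5*lam1*Tr(C L_hat C^T) + lam2*||C||_{2,1}."""
    rows, n = len(B[0]), len(X[0])
    Bt = transpose(B)
    BtB, BtX = matmul(Bt, B), matmul(Bt, X)
    C = [[0.0] * n for _ in range(rows)]      # assumption: zero initialization
    V = [row[:] for row in C]
    for k in range(iters):
        # Gradient of the smooth part at V: B^T B V + lam1 * V L_hat - B^T X.
        G1, G2 = matmul(BtB, V), matmul(V, L_hat)
        H = [[V[i][j] - eta * (G1[i][j] + lam1 * G2[i][j] - BtX[i][j])
              for j in range(n)] for i in range(rows)]
        C_new = row_shrink(H, eta * lam2)     # gradient mapping step
        a_k, a_next = 2.0 / (k + 3), 2.0 / (k + 4)
        w = a_next * (1.0 - a_k) / a_k        # aggregation weight
        V = [[C_new[i][j] + w * (C_new[i][j] - C[i][j]) for j in range(n)]
             for i in range(rows)]
        C = C_new
    return C

# Toy check: identity dictionary, no graph term, unit step.
B = [[1.0, 0.0], [0.0, 1.0]]
X = [[3.0, 4.0, 0.0], [0.1, 0.0, 0.0]]
L_zero = [[0.0] * 3 for _ in range(3)]
C_hat = apg(X, B, L_zero, lam1=0.0, lam2=1.0, eta=1.0, iters=5)
```

With an identity dictionary and no graph term, the problem reduces to a single row-wise shrinkage of `X`, so the strong first row is kept (scaled) and the weak second row is set to zero for all particles, which is the joint-sparsity behaviour the mixed norm is meant to induce.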
For all experiments, we model $p(s_t \mid s_{t-1}) \sim \mathcal{N}(0, \mathrm{diag}(\sigma))$, where $\sigma = [0.005, […], […], 0.005, 4, 4]^T$. We set the number of particles $n = 400$, the total number of target templates $m = 11$ (the templates are obtained with a 3-pixel shift around the target position, as in the $L_1$ tracker (Mei and Ling 2011)), and the number of occlusion templates to $d$. In Algorithm 2, we set $\eta$ = 0.0[…], $\lambda_1$ = […], $\lambda_2$ (by cross-validation) to {0.0[…], 0.005, 0.2} for $L_{21}$, $L_{\infty 1}$ and $L_{11}$ respectively, and $\lambda_2$ to {0.005, 0.00[…], 0.2} for $L_{21}$, $L_{\infty 1}$ and

$L_{11}$ respectively. Each tracker uses the same parameters for all video sequences. In all cases, the initial position of the target is selected manually.

5.4 Computational Cost

The popular $L_1$ tracker, which uses a similar sparse representation model for particle appearance, has been shown to achieve better tracking performance than state-of-the-art trackers (Mei and Ling 2011). However, its computational cost grows proportionally as the number of particle samples $n$ and template size $d$ increase. Due to the inherent similarity between the $L_1$ tracker and the proposed trackers (MTT and S-MTT), we compare their average runtimes in Table 1. S-MTT and MTT have very similar computational costs, so for simplicity, we just report the runtime results of S-MTT ($L_{21}$, $L_{\infty 1}$, $L_{11}$) and $L_1$ in Table 1. Based on the results, it is clear that our trackers are much more efficient than the $L_1$ tracker. For example, when $m = 11$, $n = 400$, and $d = 32 \times 32$, the average per-frame run-times for the $L_{21}$, $L_{\infty 1}$, $L_{11}$, and $L_1$ trackers are 4.69, 5.39, 2.74, and […] s, respectively. Interestingly, our $L_{11}$ tracker, which is similar to $L_1$ tracking but solved using the APG method, is about 20 times faster than the $L_1$ tracker. As we know, increasing $n$ and $d$ will improve tracking performance. For $L_1$ tracking, the runtime cost increases dramatically with both $n$ and $d$; however, this increase is much more reasonable with our trackers. Note that the computational complexity of S-MTT is derived in Sect. 4.3.

Table 1 Average per-frame runtime (in seconds) of 4 trackers ($L_{21}$, $L_{\infty 1}$, $L_{11}$, and $L_1$) with varying template sizes $d$ and numbers of particles $n$

5.5 Qualitative Comparison

The car4 sequence is captured in an open road scenario. Tracking results at frames {20, 86, 235, 305, 466, 64} for all 12 methods are shown in Fig. 3a. The different tracking methods are color-coded. OAB, Frag, and VTD start to drift from the target at frame 86, while MIL starts to show some target drifting at frame 200 and finally loses the target at frame 300. IVT and $L_1$ track the target quite well.
The target is successfully tracked throughout the entire sequence by all six of our proposed trackers (the $L_{21}$, $L_{\infty 1}$ and $L_{11}$ variants of MTT and S-MTT).

In the car11 sequence, a car is driven into a very dark environment, while being videotaped from another moving car. Tracking results for frames {0, 0, 200, 250, 309, 393} are presented in Fig. 3b. Frag starts to drift around frame 60. Due to changes in lighting, MIL starts to undergo target drift from frame 20. The OAB and $L_1$ methods start to fail in frame 284. IVT and VTD can track the target through the whole video sequence; however, these tracks are not as robust or accurate as those of the proposed $L_{21}$ and $L_{\infty 1}$ trackers (MTT and S-MTT).

The coke sequence contains frequent occlusions and fast motion, which cause motion blur. The S-MTT, MTT, $L_1$, OAB, and MIL trackers can track the target accurately almost throughout the entire sequence. The other trackers fail due to pose change and occlusion as shown in Fig. 3c.

In the david sequence, a moving face is tracked. The tracking results at frames {354, 423, 465, 502, 588, 760} are shown in Fig. 3d. Frag and VTD fail around frames 423 and 465 respectively. OAB starts to drift at frame 550. MIL and $L_1$ adequately track the face, but experience target drift, especially at frames 690 and 500, respectively. The IVT, S-MTT and MTT methods track the moving face accurately.

In the faceocc sequence, a moving face is tracked, which allows the robustness of different methods to occlusions to be evaluated. The tracking results at frames {00, 23, 34, 474, 57, 865} are shown in Fig. 3e. Because there is only occlusion by a book and no changes in illumination and motion, most of the methods can track the face accurately except OAB and MIL, which encounter minor drift.

Results on the faceocc2 sequence are shown in Fig. 3f. Most trackers start drifting from the man's face when it is almost fully occluded by the book. Because the $L_1$, MTT and S-MTT methods explicitly handle partial occlusions, and update the object dictionary progressively, they handle the


appearance changes in this sequence very well and continue tracking the target during and after the occlusion.

Fig. 3 (Color online) Tracking results of 12 trackers on 7 video sequences, denoted with different colors: (a) car4, (b) car11, (c) coke, (d) david indoor, (e) faceocc, (f) faceocc2, (g) football. Frame numbers are overlaid in red. See text for details

The football sequence includes severe background clutter, which is similar in appearance to the tracked target. For the other methods, tracking drifts from the intended object (helmet) to other similar looking objects in the vicinity. This is especially the case when the two football players collide at frame 362 (refer to Fig. 3g). The proposed trackers (such as the $L_{21}$ and $L_{\infty 1}$ variants) overcome this problem and successfully track the target because they exploit structural relationships between particle representations.

Figure 4a shows tracking results for the girl sequence. Performance on this sequence exemplifies the robustness of MTT to occlusion (complete occlusion of the girl's face as she swivels in the chair) and large pose change (the face undergoes significant 3D rotation). S-MTT, MTT and $L_1$ are capable of tracking the target during the entire sequence. Other trackers experience drift at different instances: Frag at frame 248, OAB and IVT at frame 436, and VTD at frame 477.

In the onelsr1 sequence, the background color is similar to the color of the woman's trousers, and the man's shirt and pants have a similar color to the woman's coat. In addition, the woman undergoes partial occlusion when the man in the scene walks behind her. Some results are shown in Fig. 4b. While tracking the woman, IVT, MIL, Frag, OAB, and VTD start tracking the man when the woman is partially occluded


around frame 200, and are unable to recover from this failure after that. The $L_1$ tracker tracks the woman quite well. Compared with other trackers, our $L_{21}$ and $L_{\infty 1}$ trackers (MTT and S-MTT) are more robust to the occlusion. In addition, they are much better than the $L_{11}$ variants and $L_1$, which demonstrates that imposing joint sparsity between particle representations is helpful for robust object tracking.

Fig. 4 (Color online) Tracking results of 12 trackers on 8 video sequences, delineated by different colors: (a) girl, (b) onelsr1, (c) shaking, (d) singer1, (e) skating1, (f) soccer, (g) sylv, (h) trellis70. Frame numbers are overlaid in red. See text for details

In the shaking sequence, the tracked object is subject to changes in illumination and pose. Although the stage lighting changes drastically and the pose of the object varies severely due to head shaking, our method successfully tracks the object (refer to Fig. 4c). Compared with the $L_{11}$ variants and $L_1$, the $L_{21}$ and $L_{\infty 1}$ trackers (MTT and S-MTT) perform better because their joint sparse particle representation is more robust to rapid changes. Other methods (OAB, IVT, $L_1$, and Frag) fail

to track the object when these changes occur. The VTD and MIL methods can track the object quite well except for some errors around frame 60.

The singer1(l) sequence contains abrupt object motion with significant illumination and scale changes, which cause most of the trackers to drift as shown in Fig. 4d. S-MTT, MTT and VTD handle these changes well. Compared with MTT ($L_{21}$, $L_{\infty 1}$ and $L_{11}$), S-MTT ($L_{21}$, $L_{\infty 1}$ and $L_{11}$) obtains much better performance, which shows that harnessing local structure between particle representations is useful for object tracking.

In the skating1(l) sequence, there are abrupt object motion, severe illumination and scale changes, viewpoint changes and occlusions, which lead most of the trackers to fail. Our proposed trackers (MTT and S-MTT) and VTD handle these changes well as shown in Fig. 4e. Note that, in the 353rd frame in Fig. 4e, our proposed trackers are slightly better than the VTD method, which is the most recent state-of-the-art tracking method that can cope with abrupt motion and appearance changes.

Results on the soccer sequence are shown in Fig. 4f. They demonstrate how our proposed method outperforms most of the state-of-the-art trackers when the target is severely occluded by other objects. Five of our six trackers, including both $L_{21}$ variants, accurately track the player's face despite scale and pose changes as well as occlusion/noise from the confetti raining around him. Other methods (IVT, $L_1$, an $L_{11}$ variant, OAB, MIL, and Frag) fail to track the object reliably. The VTD tracker can track the target in this sequence quite well.

Results on the sylv sequence are shown in Fig. 4g. In this sequence, a stuffed animal is being moved around, thus leading to challenging pose, lighting, and scale changes. IVT fails around frame 623 as a result of a combination of pose and illumination change. The rest of the trackers are able to track the target throughout the sequence, although Frag, MIL, VTD, OAB, $L_{11}$ and $L_1$ encounter minor drift from the target.
The trellis70 sequence is captured in an outdoor environment where lighting conditions change drastically. The video is acquired when a person walks underneath a trellis covered in vines. As shown in Fig. 4h, the cast shadow changes the appearance of the target face significantly. Furthermore, the combined effects of pose and lighting variations along with a low frame rate make visual tracking extremely difficult. Nevertheless, the $L_{21}$ and $L_{\infty 1}$ trackers (MTT and S-MTT) can follow the target accurately and robustly, while the other tracking methods perform below par in this case. VTD and Frag fail around frame 85. $L_1$ starts drifting at frame 287, while MIL and OAB fail at frame 323. IVT starts drifting at frame […].

5.6 Quantitative Comparison

To give a quantitative comparison between the 12 trackers, we obtain ground truth for all 15 sequences. Here, we note that ground truth for some of the video sequences is readily available. We manually label the other sequences. Tracker performance is evaluated according to the average per-frame distance (in pixels) between the center of the tracking result and that of ground truth as used in Babenko et al. (2009) and Mei and Ling (2011). Clearly, this distance⁸ should be small. In Fig. 5, we plot the distance of each tracker over time on a subset of the sequences for simplicity. From this figure, we see that the MTT and S-MTT trackers produce a smaller distance than other trackers in general. This implies that MTT and S-MTT can accurately track the target despite severe occlusions, pose variations, illumination changes, and abrupt motions.

In Table 2, we show the average center distance for each tracker over the 15 sequences. It is clear that the S-MTT and MTT methods are better than the other trackers in most sequences. Among the MTT methods, $L_{21}$ outperforms $L_{\infty 1}$ and $L_{11}$ in general. For the S-MTT methods, $L_{21}$ is much better than $L_{\infty 1}$ and $L_{11}$. In fact, except for the faceocc, football, shaking and skating1(l) sequences, in which we obtain similar results as IVT and VTD, the MTT and S-MTT $L_{21}$ trackers do outperform the other methods.
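The evaluation measure described above, together with the PASCAL overlap score mentioned as an alternative, can be sketched as follows. The box format `(x, y, width, height)` with `(x, y)` the top-left corner, and the function names, are assumptions of this sketch and are not prescribed by the paper.

```python
import math

def center_distance(box_a, box_b):
    """Euclidean distance (in pixels) between the centers of two boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    return math.hypot((ax + aw / 2) - (bx + bw / 2), (ay + ah / 2) - (by + bh / 2))

def pascal_overlap(box_a, box_b):
    """Ratio of intersection area to union area of two boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def mean_center_error(tracked, truth):
    """Average per-frame center distance over a sequence."""
    return sum(center_distance(a, b) for a, b in zip(tracked, truth)) / len(truth)

# Hypothetical two-frame sequence.
truth = [(0, 0, 10, 10), (5, 5, 10, 10)]
tracked = [(3, 4, 10, 10), (5, 5, 10, 10)]
err = mean_center_error(tracked, truth)
```

A smaller center error is better, whereas a larger overlap score is better, so the two measures should not be mixed in one table without stating which is reported.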
⁸ This dissimilarity measure is often used to compare tracking performance. Other measures can be used, including the PASCAL overlap score.

Frag and $L_1$ perform well under partial occlusion but tend to fail under severe illumination and pose changes. The IVT tracker is hardly affected by changes in appearance except those due to illumination. OAB is affected by background clutter and easily drifts from the target. MIL performs well except when severe illumination changes force the tracker to drift into the background. VTD tends to be robust against illumination change, but it cannot handle severe occlusions and viewpoint changes adequately.

Now, we compare the performance of the S-MTT and MTT methods. Based on the results in Table 2, $L_{21}$ and $L_{\infty 1}$ outperform $L_{11}$. This is due to the fact that the $L_{11}$ tracker learns particle representations separately, while $L_{21}$ and $L_{\infty 1}$ capitalize on dependencies between different particles to obtain more robust jointly sparse representations. These results demonstrate that it is useful for visual tracking to impose joint sparsity among particle representations. For S-MTT, the performance of the corresponding three trackers ($L_{21}$, $L_{\infty 1}$ and $L_{11}$) shows similar tendencies to that of MTT. However, S-MTT is reasonably better than MTT. This validates the impact of using local graph structure to regularize particle representations and yield more robust object tracking.

In addition, we compare MTT and S-MTT with the $L_1$ tracker, which is the most related tracker to ours and has shown state-of-the-art performance (Mei and Ling 2011). Based on the results in Table 2, MTT ($L_{21}$ and $L_{\infty 1}$) and S-MTT ($L_{21}$ and $L_{\infty 1}$) outperform the $L_1$ tracker. That is because $L_1$ tracking represents particles separately, while the

proposed trackers capitalize on dependencies between different particle representations to obtain a more robust jointly sparse representation. Our results demonstrate that it is useful for visual tracking to mine particle relationships.

Fig. 5 (Color online) Center distance (in pixels) between tracking result and ground truth over time for 12 trackers applied to video sequences

The S-MTT $L_{11}$ tracker outperforms the MTT $L_{11}$ and $L_1$ trackers, since the S-MTT $L_{11}$ tracker makes use of the local graph structure. Moreover, in theory, the L
