We are IntechOpen, the world's leading publisher of Open Access books. Built by scientists, for scientists. International authors and editors


Online Learning and Robust Visual Tracking using Local Features and Global Appearances of Video Objects

Irene Y.H. Gu and Zulfiqar H. Khan
Dept. of Signals and Systems, Chalmers Univ. of Technology, Gothenburg, Sweden

1. Introduction

This chapter describes a novel hybrid visual object tracking scheme that jointly exploits local point features, global appearance and shape of target objects. The hybrid tracker contains two baseline candidate trackers and is formulated under an optimal criterion. One baseline tracker, a spatiotemporal SIFT-RANSAC, extracts local feature points separately for the foreground and background regions. Another baseline tracker, an enhanced anisotropic mean shift, tracks a dynamic object whose global appearance is most similar to the online learned distribution of the reference object. An essential building block in the hybrid tracker is the online learning of dynamic objects, where we employ a new approach for learning the appearance distribution, and another new approach for updating the two feature point sets. To demonstrate the application of such online learning approaches to other trackers, we show an example in which online learning is added to an existing JMSPF (joint mean shift and particle filter tracking) tracking scheme, resulting in improved tracking robustness. The proposed hybrid tracker has been tested on numerous videos with a range of complex scenarios where target objects may experience long-term partial occlusions/intersections from other objects, large deformations, abrupt motion changes, and dynamic cluttered background/occluding objects having similar color distributions to the target object. Tracking results are shown to be very robust in terms of tracking drift, accuracy and tightness of tracked bounding boxes. The performance of the hybrid tracker is evaluated qualitatively and quantitatively, with comparisons to four existing state-of-the-art tracking schemes. Limitations of the tracker are also discussed.

2. Related work

2.1 Visual tracking

Visual object tracking has drawn increasing interest in recent years, partly due to its wide variety of applications, e.g., video surveillance in airports, schools, banks, hospitals, traffic, freight, and e-health care. Tracking is often the first step towards a further analysis of the activities, behaviors, interactions and relationships between objects of interest. Many object tracking methods have been proposed and developed, e.g., state-space based tracking using Kalman filters and particle filters (Welch & Bishop, 97; Rosales & Sclaroff, 99; Gordon et al., 01; Gordon, 00; Wang et al., 08; Vermaak et al., 03; Okuma et al., 04), joint state-space representation and association (Bar-Shalom & Fortmann, 98), multiple hypothesis tracking (MHT) (Reid, 79), anisotropic mean shift tracking (Comaniciu et al., 03; Khan & Gu, 10), optical flow-based tracking (Shi & Tomasi, 94), and point feature-based tracking (Strandmark & Gu, 09; Haner & Gu, 10), among many others. An overview of visual tracking methods can be found in (Yilmaz et al., 06; Sankaranarayanan et al., 08).

In the state space-based tracking approach using Kalman Filters (KFs), Gaussian noise and linear models of the state vector are assumed. A state vector typically includes different attributes of the object, e.g. object appearance and shape, and/or other object features. G. Welch (Welch & Bishop, 97) applies KFs to track a user's poses in man-computer interactive graphics. R. Rosales (Rosales & Sclaroff, 99) uses Extended Kalman Filters (EKFs) to estimate the 3D object trajectory from 2D motions. (Gordon et al., 01) uses Unscented Kalman Filters (UKFs) that enforce Gaussian distributions while keeping nonlinearity by using discrete samples to estimate the mean and covariance in posterior densities. Under Multiple Hypothesis Tests (MHTs), (Reid, 79) uses an iterative process to track multiple objects and finds the best matching between the real object descriptors. Gordon et al. (Gordon, 00) use particle filters (PFs) to track 1D (one-dimensional) signals. Tracking is achieved by estimating the probability density of the state vector from synthetic nonlinear and non-Gaussian distributed 1D signals, and is formulated under the Bayesian framework by estimating the posterior probability using the rule of propagation of state density over time. Extending PFs to visual tracking is not straightforward, since the size of the state vector for tracking a visual object is significantly larger than that of a 1D target. This requires a large number of particles and consequently heavy computation, which often hampers the practical use of PFs.
To overcome this, (Wang et al., 08) proposes to use Rao-Blackwellized PFs that marginalize out the linear part of the state vector (the appearance), while the nonlinear shape and pose parts are then estimated by PFs; (Khan et al., 09; Deguchi et al., 04) propose to embed the object appearance in the likelihood of PFs so that the size of the state vector can be kept small.

Visual tracking from mean shift has drawn much interest lately, partly due to its computational efficiency and relatively robust performance. Different from the conventional mean shift for nonlinear image smoothing or segmentation that seeks the local modes in the kernel estimate of the pdf, mean shift tracking is an efficient and fast implementation of the similarity metric, the Bhattacharyya coefficient, that maximizes the similarity between the reference and a candidate object region. It is worth mentioning that other similarity metrics, e.g., Kullback-Leibler divergence (Khalid et al., 05), or SSD measure (Hager et al., 04), can also be used. The main drawback of mean shift is that tracking may drift away or fail, especially when the background clutter and the object of interest have similar color distributions, or when long-term partial occlusions of objects, pose changes of large objects, and fast changes of object motion occur. Following the pioneering work on mean shift tracking by (Comaniciu et al., 03), various attempts have been made to address these issues. (Collins, 03) extends the mean shift by introducing a normalizing factor to the bandwidth matrix to capture target variations in scales (Bretzner & Lindeberg, 98). It performs an extensive search within a range of ellipses and is computationally expensive. (Yilmaz, 07) proposes tracking by using a level-set asymmetric kernel. It is performed in image coordinates by including the scale and orientation as additional dimensions and simultaneously estimating all unknowns. (Sumin & Xianwu, 08) proposes to simultaneously track the position, scale and orientation of the bounding box by using anisotropic mean shift, where the bandwidth matrix is used to compute the scale and orientation. (Zivkovic & Krose, 04) proposes an EM-like algorithm that tracks a deformable object whose bounding box contains five degrees of freedom. It simultaneously estimates the center and the bandwidth matrix of the kernel. (Maggio & Cavallaro, 05; Xu et al., 05; Parameswaran et al., 07; Khan et al., 09) include the spatial information in the color histogram by dividing an ellipse-shaped bounding box into multiple parts to make the tracker more robust. (Maggio & Cavallaro, 05; Khan et al., 09) further integrate the multi-part mean shift into the particle filter framework using overlapped or non-overlapped regions, where improved results are reported. The tracking performance is rather robust; however, tracking drift or tracking failure may still occur on some occasions, especially when a cluttered background or an intersecting object has similar color distributions to the target object.

While global appearance distributions are widely used in visual object tracking, local point features of objects are often used as an alternative. One of the main advantages of using point features is their resilience to partial object occlusions. When one part of an object is occluded, point features from the non-occluded part can still be used for the tracking. Local appearance-based tracking usually involves detecting and characterizing the appearance of the object by local features from points, lines or curves, establishing correspondences between detected feature points (lines, or curves) across frames, and estimating the parameters of the associated transformation between two feature point (line, or curve) sets. Several strategies are used to select feature points that are invariant to affine or projective transformations. (Harris & Stephens, 88) proposes to extract rotational and translational-invariant features by combining corner and edge detectors based on local autocorrelation functions. (Shi & Tomasi, 94) proposes to threshold the minimum eigenvalues of image gradient matrices at candidate feature points and use them as the appropriate feature points for tracking.
These methods generate rotational and translational invariant point features, but the features are variant to affine or projective transformations. (Lowe, 04) proposes a Scale-Invariant Feature Transform (SIFT) that is invariant to rotations, translations, scalings, affine transformations, and partially invariant to illuminations. Each point feature is described by a feature descriptor or a vector, formed from the gradient directions and the magnitudes around the point. (Bay et al., 06) proposes to use Speeded Up Robust Features (SURF), having similar performance to SIFT (Bauer et al., 07) but faster. Due to the robustness of SIFT features, various attempts (Skrypnyk & Lowe, 04; Mondragon et al., 07; Li et al., 06; Xu et al., 08; Battiato et al., 07) have been made to integrate SIFT in the tracking. (Skrypnyk & Lowe, 04) proposes to use SIFT features to track camera poses and to register virtual objects in online videos. Since feature points are often sensitive to noise, the consensus of a group of points can be exploited. (Mondragon et al., 07) uses SIFT and RANSAC to detect points of interest and reject outliers when estimating projective transformations, where videos are captured by an online UAV camera. (Li et al., 06) handles object occlusions by matching local invariant features learned online rather than predicted from motions, since the local point features from a non-occluded object part can still be used for the tracking. (Xu et al., 08) proposes vehicle tracking by using SIFT features extracted from detected moving object bounding boxes, followed by frame-by-frame matching. (Battiato et al., 07) proposes a video stabilizer by inferring the inter-frame motion through SIFT feature tracking in consecutive frames. These methods are efficient and invariant to scaling, rotation and moderate lighting changes; however, they require that the appearance of the object contains sufficient textures. Several attempts have been made to solve this problem by combining SIFT features with other tracking methods.

To extend visual tracking from 2 images to a video sequence, efficient methods for establishing spatiotemporal local feature point correspondences through video frames are required. (Strandmark & Gu, 09) proposes multiple motion models and feature point maintenance by employing an online updating process that may add new feature points, prune the existing points, or temporally freeze the updating, and (Haner & Gu, 10) further improves the method by introducing two local feature point sets, one for the foreground and another for the background, where the background point set is used to provide priors on possible occlusions.

While both the global and local object models offer some attractive properties for visual tracking, hybrid models that combine these two types of models may offer better results as they can complement each other. (Zhou et al., 08) proposes an expectation-maximization algorithm that integrates SIFT features along with the color-based appearance in the mean shift, resulting in a better tracking performance even if one of the two methods becomes unreliable. (Wu et al., 08) enhances the performance of particle filters in cluttered background by taking into account the SIFT features in particle weights along with the color similarity measure. (Zhao et al., 08) uses feature point analysis to recover affine parameters, from which relative scales between two frames are estimated. It reconstructs target positions and relative scales using the affine parameters estimated from the SIFT feature correspondences. (Chen et al., 08) uses a similar method to handle the occlusion and scaling under the mean shift framework. While local feature points are promising for tracking, several problems remain, e.g. a lack of SIFT point features in smooth objects, and a lack of sufficient feature point correspondences through video frames, especially when the object undergoes pose changes, intersections and large deformations. (Khan & Gu, 10) proposes to combine an enhanced anisotropic mean shift and a spatiotemporal SIFT-RANSAC procedure into a unified framework by using an optimal criterion, where the mean shift seeks global object appearance similarity and the spatiotemporal SIFT-RANSAC finds local feature points in the foreground/background.
To further enhance the robustness against tracking drift, the scheme also includes online learning of global appearances and local features.

2.2 Online learning

Online learning is another key issue that significantly impacts the performance of visual tracking. Most tracking methods require some kind of reference object model. Offline training videos of the same target object are usually not available, since video scenes of specific objects (e.g. suspicious actions of a particular person) captured by a surveillance camera are rarely repeatable. For tracking dynamic target objects, online learning of reference object appearances and/or shape is thus crucial. This is not trivial, as the change of the object can be caused by the object itself (e.g. change in colors, poses or shape), but also by partial/full occlusions from an occluding object or cluttered background, in addition to other changes such as lighting and illumination. Online learning of dynamic objects requires that only the change associated with the target object itself is learned/updated into the model, while the remaining change from the background or other objects does not trigger the learning process. This is challenging since we have neither priors on the background/occluding objects, nor information on when and where an occlusion may occur.

Techniques for online learning vary depending on the attributes of the object (e.g. global/local appearance, shape, motion) to be learned. Further, they depend on the technique used in the visual tracking, as tracking and online learning are usually incorporated under the same framework. Many online learning techniques have been proposed. For example, (Lim et al., 04; Yang et al., 04; Wang et al., 07) perform incremental learning of 1D/2D PCA that describes the object appearance in a visual tracker. (Wang et al., 08) proposes an online Grassmann manifold learning scheme where dynamic object appearances are constrained on a smooth curved surface rather than a linear subspace. For online learning of the pdf associated with individual pixels, (Li et al., 08) suggests a color co-occurrence based method to learn the time-varying principal pdf of individual pixels that contain motions. For online learning of the object appearance pdf used in the mean shift, (Khan & Gu, 10) propose a robust online learning method in regular frame intervals, based on a criterion that determines whether the change is likely caused by the target object. For online learning of local features, (Strandmark & Gu, 09) proposes multiple motion models and dynamic maintenance of the feature point set by using SIFT and RANSAC, allowing online adding and pruning of the feature points, or freezing of the updating. (Haner & Gu, 10; Khan & Gu, 10) further improve the method by applying dynamic maintenance of separate foreground and background feature point sets under different criteria, where the background set is used to provide priors on occlusions of the foreground object.

2.3 Chapter outline

This chapter is focused on describing a hybrid visual tracking scheme, where both the local features and the global dynamic object appearance and shape are exploited. A key component of the tracker, the online learning, is applied to two baseline trackers: one is used to maintain the dynamic local feature point sets, and the other is used to learn the global appearance of the dynamic object. The hybrid tracking scheme combines local point features and global appearances. It includes:
(a) A local point feature-based candidate tracking method that employs consensus point feature correspondences separately in the foreground set and in the surrounding background set by using a spatiotemporal SIFT-RANSAC procedure.
They are accompanied by an online maintenance process that can add, prune, freeze and re-initialize feature points in the sets;
(b) A global appearance similarity-based candidate tracking method using an enhanced anisotropic mean shift whose initial kernel is partially guided by the local point features, and which is equipped with online learning of the reference object distribution;
(c) The final hybrid tracker, obtained by combining the candidate trackers in (a) and (b) through an optimal criterion.

We then show that the online learning strategy for the candidate tracker in (b) can be directly applied to the online learning of another state-of-the-art visual tracking method, the joint anisotropic mean shift and particle filter (JMSPF) tracker, which may result in further tracking robustness in terms of tracking drift, tightness of tracked bounding boxes, and tracking failures in complex scenarios.

Experimental results on visual tracking of video objects with a range of difficult scenarios are included. Several distance metrics are used to quantitatively evaluate the robustness of the tracker, and to evaluate the performance of the tracker with and without the online learning. Further performance evaluations are made qualitatively with three existing trackers, and quantitatively with two existing trackers. The computations are also compared for the hybrid tracker and three existing trackers.

The remainder of the chapter is organized as follows. The general structure and overall description of the hybrid tracker are given in Section 3. In Section 4, we describe two baseline trackers: one is based on local feature point correspondences extracted from the spatiotemporal SIFT-RANSAC, and the other is based on global object appearance similarities. Section 5 emphasizes the two online learning techniques applied to these two baseline trackers. In Section 6, the hybrid tracker is formulated from the two baseline trackers under an optimal criterion. Section 7 describes a direct application of the online learning method to an existing joint anisotropic mean shift and particle filter (JMSPF) tracker. Section 8 is devoted to the experimental results and performance evaluations aimed at demonstrating the feasibility and robustness of the hybrid tracker. The advantages and limitations are also discussed. Finally, conclusions are given in Section 9.

3. A robust hybrid visual tracking scheme: The big picture

This section describes the general structure and gives the big picture of the hybrid tracking scheme, where multiple issues are treated in a unified tracking framework. The tracking scheme, as shown in the block diagram of Fig.1, can be split into several basic modules, which are briefly summarized below.

Fig. 1. Block diagram of the hybrid tracking scheme, where $V_t^{(i)}$, $i = 1, 2$, is the parameter vector for the tracked candidate region $R_t^{(i)}$ from the baseline tracker-A and tracker-B at time $t$, $I_t$ is the $t$-th video frame, $R_{t-1}^{(obj)}$ is the finally tracked object region from the hybrid tracker at $(t-1)$, and $q_t$ is the estimated appearance pdf for the reference object at $t$.

(a) Two online learning methods. Two novel methods for online learning of the dynamic object are described: one is used for online maintenance of local point feature sets, and the other is for dynamically updating the object appearance distribution. This is aimed at keeping a time-varying reference object description, whereby the tracking drift can be reduced. The method is based on seeking the best frame, indicated by reliable tracking without occlusions, in each fixed-size frame interval. This is achieved by using a criterion function in the interval.

(b) Baseline tracker-A: exploit local point features for object tracking. The baseline tracker-A in the hybrid tracker is realized by exploiting the local point features. Local point features are useful when partial occlusions occur: point features in the non-occluded part remain unchanged, even though the global object appearance may experience significant changes.
To make the point feature-based tracking robust, two sets of point features are utilized: one for the foreground region and the other for the background region (see Fig.1). The idea of using the background point features is to provide priors to the foreground on possible occlusions. The local point features are collected by introducing a novel spatiotemporal SIFT-RANSAC procedure followed by an online maintenance process that may add, prune and update feature points in both sets. To prevent drift and error propagation, a re-initialization process is also used.

(c) Baseline tracker-B: exploit global object appearance for object tracking. The baseline tracker-B in the hybrid tracker is realized by exploiting global object appearance distributions. To find the object whose appearance is most similar to the reference (e.g., the previously tracked object, or a reference object), anisotropic mean shift with a 5-degree parametric bounding box is used. An enhancement is added to the conventional mean shift by allowing its kernel to be partially guided by the local point features. This may reduce the tracking drift, as the mean shift is sensitive to cluttered background/occluding objects having a similar color distribution to the foreground object. Online learning of object appearance distributions and re-initialization are introduced for achieving more tracking robustness and preventing propagation of tracking drift across frames.

(d) The hybrid tracking scheme: formulated from an optimal criterion. The local point feature-based tracker in (b) and the global object appearance-based tracker in (c) are exploited jointly to form the hybrid tracker. It is formulated by using an optimal criterion that employs, in parallel, the two baseline trackers and one from their weighted combination.

4. Baseline tracking methods using local features and global appearances

This section describes two baseline tracking methods: one is based on local point features of the object (Tracker-A in Fig.1), and the other is based on global object appearance (Tracker-B in Fig.1).

4.1 Local feature point-based visual tracking

This section describes one baseline tracking method, tracker-A, in the hybrid tracker shown in Fig.1. It is a local feature point-based tracking method realized by a spatiotemporal SIFT-RANSAC procedure. It generates separate local feature point sets in the foreground and the surrounding background, respectively.
The use of local point features is motivated by problems encountered in tracking partially occluded objects, or objects having similar color distributions to the cluttered background. In these scenarios, local salient point features from non-occluded object parts, or salient point features of the object, may be exploited for tracking. For matching local point features, two well-known computer vision techniques are employed: SIFT (scale-invariant feature transform) (Lowe, 04) and RANSAC (random sample consensus) (Fischler & Bolles, 81). The former is used to match scale-invariant feature points, while the latter is used to remove outliers. A brief review of SIFT and RANSAC is given in Section 4.1.1. To introduce more robustness to partial occlusions/intersections, two sets of feature points, one for the foreground area and another for the background area surrounding the candidate object, are employed. The background set is used to provide priors on occlusions. This is described in Section 4.1.2. It is worth emphasizing that one of the key steps in realizing the spatiotemporal SIFT-RANSAC is the online maintenance of point feature sets (in Section 5.1), since the conventional SIFT and RANSAC usually cannot be applied successfully to a long video sequence (e.g. of a few hundred frames).

4.1.1 Review: SIFT and RANSAC for feature point correspondences

Two standard computer vision techniques, SIFT (Lowe, 04) and RANSAC (Fischler & Bolles, 81), are briefly reviewed in this subsection. In SIFT, each point feature is described by a feature vector

\[ f_i = \{p_i, \Phi_i\} = \{p_i, \sigma_i, \varphi_i, g_{h_i}\} \tag{1} \]

where $p_i = (x_i, y_i)$ is the 2D position of the SIFT keypoint, and $\Phi_i = \{\sigma_i, \varphi_i, g_{h_i}\}$ is the parameter vector associated with the point $p_i$, including the scale $\sigma_i$, the main gradient orientation within the region $\varphi_i$, and the gradient orientation histogram $g_{h_i}$ (128 bins). First, the original image $I$ is convolved with a bandpass filter $h$ to obtain a filtered image $I_o$, $I_o(x, y, \sigma) = I(x, y) * h(x, y, k\sigma)$, where $h(\cdot)$ is formulated from the difference of two scaled Gaussian-shaped kernels, $h(x, y, \sigma) = g(x, y, k\sigma) - g(x, y, \sigma)$. The location of each feature point (SIFT keypoint) $p_i$, $i = 1, 2, \ldots$, at scale $\sigma_i$ corresponds to the thresholded local extrema of $I_o(x, y, \sigma)$. For each SIFT keypoint, one or more principal orientations $\theta_i$ are assigned by computing the gradient magnitudes and orientations in a region surrounding the point and finding the peaks that reach 80% of the highest peak in the orientation histogram. The orientation histogram is computed from a region centered at $p_i$, partitioned into $4 \times 4$ blocks each consisting of 8 bins. This results in a total of $8 \times 16 = 128$ bins. The value of the orientation histogram $h_i$ is obtained by summing up the gradient magnitudes in each bin.

SIFT keypoints in two image frames are matched by searching for the nearest neighboring keypoints, i.e. those with the minimum Euclidean distance. Under a pre-defined motion model, corresponding SIFT keypoints across two image frames are related. For example, if two images are related by an affine transform $T(\beta, \theta, (d_x, d_y))$, then each pair of SIFT keypoints is related by

\[ \begin{bmatrix} \tilde{x} \\ \tilde{y} \end{bmatrix} = \beta \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} + \begin{bmatrix} d_x \\ d_y \end{bmatrix}. \]

Equivalently, given two sets of keypoints $\{(x_1, y_1), \ldots, (x_n, y_n)\}$ and $\{(\tilde{x}_1, \tilde{y}_1), \ldots, (\tilde{x}_n, \tilde{y}_n)\}$ from two images (e.g. at $(t-1)$ and $t$), these pairs of keypoints are related by

\[ \begin{bmatrix} \tilde{x}_1 \\ \tilde{y}_1 \\ \vdots \\ \tilde{x}_n \\ \tilde{y}_n \end{bmatrix} = \begin{bmatrix} x_1 & -y_1 & 1 & 0 \\ y_1 & x_1 & 0 & 1 \\ \vdots & \vdots & \vdots & \vdots \\ x_n & -y_n & 1 & 0 \\ y_n & x_n & 0 & 1 \end{bmatrix} \begin{bmatrix} \beta\cos\theta \\ \beta\sin\theta \\ d_x \\ d_y \end{bmatrix} \tag{2} \]

The LS (Least Squares) estimation, $\arg\min_{(\beta, \theta, d_x, d_y,\ \text{index order of } p_j)} \| p_i - T(p_j) \|^2$, can be applied to find the best matching and to estimate the transform parameters $\beta$, $\theta$, $(d_x, d_y)$ at $t$.
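As an illustration of the linear system in (2), the following minimal sketch estimates $\beta$, $\theta$ and $(d_x, d_y)$ by least squares with NumPy. It assumes the keypoint correspondences are already known, so the search over the index order of $p_j$ is not covered; the function name `estimate_similarity_ls` is hypothetical and not taken from the chapter.

```python
import numpy as np

def estimate_similarity_ls(src, dst):
    """Least-squares fit of dst ~ beta * R(theta) @ src + d, using the system in (2).

    src, dst: (n, 2) arrays of matched keypoints (assumed already in correspondence).
    Returns (beta, theta, d) with d = (d_x, d_y).
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    n = src.shape[0]
    A = np.zeros((2 * n, 4))
    # Rows producing x-tilde: [x, -y, 1, 0]
    A[0::2] = np.column_stack([src[:, 0], -src[:, 1], np.ones(n), np.zeros(n)])
    # Rows producing y-tilde: [y, x, 0, 1]
    A[1::2] = np.column_stack([src[:, 1], src[:, 0], np.zeros(n), np.ones(n)])
    b = dst.reshape(-1)  # interleaved x~1, y~1, x~2, y~2, ...
    a, c, dx, dy = np.linalg.lstsq(A, b, rcond=None)[0]
    # The unknowns are a = beta*cos(theta) and c = beta*sin(theta)
    beta = np.hypot(a, c)
    theta = np.arctan2(c, a)
    return beta, theta, np.array([dx, dy])
```

Solving for $\beta\cos\theta$ and $\beta\sin\theta$ rather than for $\beta$ and $\theta$ directly keeps the problem linear; the scale and angle are recovered afterwards from the two fitted values.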
To make the matching robust against false positives, the best matching is selected only if the ratio of distances between the first and the second nearest neighbors is less than some empirically determined threshold (Lowe, 04). Since individual SIFT keypoints are prone to noise, RANSAC is often applied afterwards to remove outliers by finding the maximum number of consensus correspondences and estimating the associated motion parameters. RANSAC contains a two-step iterative process: estimate the parameters of the transform $T_i^{(t)}$, and find a subset of inlier points from the SIFT keypoints that yields the maximum consensus under $T_i^{(t)}$. In the first step, it differentiates outliers from inliers by selecting, at random, the minimum number of points needed to estimate the transform from the SIFT keypoint set. Then, the parameters of the transform are estimated. Using the estimated parameters, more points are picked up if they fit this specific transform. This is done by calculating the error for each pair of keypoints and comparing it with a small error threshold $T_e$. The iteration is repeated until the error is smaller than $T_e$, or the total number of iterations exceeds a pre-specified maximum iteration number $T_{iter}$. In the second step, it fits the transform to the inliers while ignoring the outliers. The transform parameters are updated using all collected inlier points. A tracked candidate object region is then obtained by drawing a tight rectangle surrounding the consensus points.
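The ratio test and the two-step RANSAC process can be sketched as follows. This is a simplified illustration, not the chapter's implementation: it assumes the four-parameter transform of (2), runs a fixed number of iterations instead of the stopping rule based on $T_e$ and $T_{iter}$, and uses an axis-aligned rectangle. The function names and the default values `ratio=0.8`, `err_thresh=2.0` and `max_iter=200` are assumptions chosen for the example.

```python
import numpy as np

def ratio_test_matches(desc_a, desc_b, ratio=0.8):
    """Keep a nearest-neighbour match only if it clearly beats the second-nearest one."""
    desc_a, desc_b = np.asarray(desc_a, float), np.asarray(desc_b, float)
    dist = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=2)
    order = np.argsort(dist, axis=1)
    rows = np.arange(len(desc_a))
    best, second = dist[rows, order[:, 0]], dist[rows, order[:, 1]]
    keep = best < ratio * second
    return [(int(i), int(order[i, 0])) for i in np.flatnonzero(keep)]

def fit_similarity(src, dst):
    """Least-squares solution of (2); returns (beta*cos, beta*sin, d_x, d_y)."""
    n = len(src)
    A = np.zeros((2 * n, 4))
    A[0::2] = np.column_stack([src[:, 0], -src[:, 1], np.ones(n), np.zeros(n)])
    A[1::2] = np.column_stack([src[:, 1], src[:, 0], np.zeros(n), np.ones(n)])
    return np.linalg.lstsq(A, dst.reshape(-1), rcond=None)[0]

def apply_similarity(params, pts):
    a, c, dx, dy = params
    x, y = pts[:, 0], pts[:, 1]
    return np.column_stack([a * x - c * y + dx, c * x + a * y + dy])

def ransac_similarity(src, dst, err_thresh=2.0, max_iter=200, seed=0):
    """Step 1: sample minimal sets and keep the largest consensus set.
    Step 2: refit the transform on all inliers and box the consensus points."""
    rng = np.random.default_rng(seed)
    src, dst = np.asarray(src, float), np.asarray(dst, float)
    best_inliers = np.zeros(len(src), dtype=bool)
    for _ in range(max_iter):
        sample = rng.choice(len(src), size=2, replace=False)  # 2 pairs fix 4 unknowns
        params = fit_similarity(src[sample], dst[sample])
        err = np.linalg.norm(apply_similarity(params, src) - dst, axis=1)
        inliers = err < err_thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    if best_inliers.sum() < 2:
        raise ValueError("no consensus set found")
    params = fit_similarity(src[best_inliers], dst[best_inliers])
    pts = dst[best_inliers]
    box = (*pts.min(axis=0), *pts.max(axis=0))  # (xmin, ymin, xmax, ymax)
    return params, best_inliers, box
```

In practice the descriptors passed to `ratio_test_matches` would be the 128-bin SIFT histograms, and the matched positions would then be passed to `ransac_similarity`.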

4.1.2 Using separate feature point sets for the foreground and the background

In the local point feature-based baseline tracker, two separate feature point sets, $P^F$ and $P^B$, are formed. $P^F = \{p_i^F \mid p_i^F : \Phi_i^F\}$ is for the foreground, and $P^B = \{p_i^B \mid p_i^B : \Phi_i^B\}$ is for the background surrounding the candidate object. In each set, the parameter vector $\Phi_i^F$ (or, $\Phi_i^B$) is defined according to (1). The basic idea of using background points is to extract priors on possible object occlusions or intersections. As shown in Fig.2, a searching area (the black rectangle) is defined to be larger than that of an object region (the red rectangle). The region between the searching area and the candidate object region (between the black and red rectangles) is defined as the background region.

Fig. 2. Foreground and background regions. Red rectangle: a foreground object region; Black rectangle: the searching area. The area between the black and red rectangles is the background region. Red ellipse: may be used for some objects (e.g. humans) to minimize the possible inclusion of background points around the four corner areas.

Both sets of feature points are extracted by combining SIFT and RANSAC as described above, but in different regions. Finally, a tight outer rectangular boundary surrounding the selected foreground feature points in $P^F$ is drawn as the tracked candidate object region $R_t^{(1)}$ for the baseline tracker-A. The region is specified by the parameter vector $V_t^{(1)} = [p_c = (y_{1,c}^{(1)}, y_{2,c}^{(1)}), w^{(1)}, h^{(1)}, \theta^{(1)}, P^F]^T$, including the 2D center position, the width, height and orientation of the region, and the foreground set. To reduce the possibility of including background points in the foreground set, the shape of the object type may be considered. For example, for human objects, foreground feature points may be constrained within an ellipse that is tightly fitted within the rectangular region $R_t^{(1)}$.
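The partition of keypoints into $P^F$ and $P^B$ described above can be sketched as follows. The sketch makes several simplifying assumptions that are not from the chapter: boxes are axis-aligned (the orientation $\theta^{(1)}$ is ignored) and given as (center x, center y, width, height), and with the ellipse option, points in the four corner areas are simply dropped from the foreground without being added to the background. The function names are hypothetical.

```python
import numpy as np

def split_keypoints(points, obj_box, search_box, use_ellipse=False):
    """Split keypoints into a foreground set and a background set.

    obj_box, search_box: (cx, cy, w, h), axis-aligned (an assumption of this sketch).
    Foreground: inside the object rectangle (or its inscribed ellipse).
    Background: inside the searching area but outside the object rectangle.
    """
    pts = np.asarray(points, dtype=float)

    def inside_rect(box):
        cx, cy, w, h = box
        return (np.abs(pts[:, 0] - cx) <= w / 2) & (np.abs(pts[:, 1] - cy) <= h / 2)

    in_search = inside_rect(search_box)
    in_obj_rect = inside_rect(obj_box)
    if use_ellipse:
        cx, cy, w, h = obj_box
        # Ellipse inscribed in the object rectangle, excluding the corner areas
        in_obj = ((pts[:, 0] - cx) / (w / 2)) ** 2 + ((pts[:, 1] - cy) / (h / 2)) ** 2 <= 1.0
    else:
        in_obj = in_obj_rect
    foreground = pts[in_obj & in_search]
    background = pts[in_search & ~in_obj_rect]
    return foreground, background

def tight_box(points):
    """Tight axis-aligned rectangle (cx, cy, w, h) around a set of points."""
    lo, hi = points.min(axis=0), points.max(axis=0)
    cx, cy = (lo + hi) / 2
    w, h = hi - lo
    return cx, cy, w, h
```

Calling `tight_box` on the foreground set gives the axis-aligned counterpart of the candidate region $R_t^{(1)}$.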
It is worth mentioning that the resulting region $R_t^{(1)}$ is also provided to the baseline tracker-B (in Section 4.2) to enhance the conventional mean shift, which is prone to drifting towards the background or occluding objects with similar color distributions.

4.1.3 Re-initialization

A re-initialization process may be applied to some frames to prevent tracking drift or tracking error propagation across frames. The idea is analogous to using I-frames (intra-coded frames) in video compression. A re-initialization process for the baseline tracker-A is used to avoid severe errors caused, e.g., by a very small number of corresponding points or by accumulated tracking drift. In the former case, a small number of feature points may lead to an ill-posed RANSAC estimation, or an unreliably tracked region. In the latter case, accumulated drift may eventually lead to tracking failure. A tracker thus needs to be re-initialized to avoid the propagation of errors across frames. The frames for re-initialization can be selected based on two observations: a bounding box does not change significantly in consecutive frames, and a very low similarity value between the tracked region and the reference object indicates a possible tracking drift or unstable tracking. Two conditions, the distance between consecutive box shapes and the Bhattacharyya coefficient between the tracked and the reference object regions, are used to determine whether a re-initialization is applied. That is, if one of the following two conditions is satisfied,

\[ dist_t^{(1)} = \sum_{i=1}^{4} \left\| x_{t,i}^{(1)} - x_{t-1,i}^{(1)} \right\|^2 > T_1^{(1)}, \quad \text{or} \quad \rho_t^{(1)} < T_2^{(1)} \tag{3} \]

then the baseline tracker-A is re-initialized at $t$ by:

\[ R_t^{(1)} \leftarrow R_{t-1}^{(obj)}, \quad V_t^{(1)} \leftarrow V_{t-1}^{(obj)}, \quad \text{and} \quad \rho_t^{(1)} \leftarrow 0 \tag{4} \]

where $R_{t-1}^{(obj)}$ is the tracked bounding box from the final hybrid tracker at $(t-1)$, $V_{t-1}^{(obj)}$ is the corresponding parameter vector, $\rho_t^{(1)}$ is the normalized Bhattacharyya coefficient for the baseline tracker-A, defined in (21), $x_{t,i}^{(1)}$ and $x_{t-1,i}^{(1)}$ are the four corners of the object bounding box at $t$ and $(t-1)$ from the baseline tracker-A, and $T_1^{(1)}$ and $T_2^{(1)}$ are empirically determined thresholds.

4.2 Global appearance-based visual tracking

This section describes another baseline tracking method, tracker-B, in the hybrid tracker shown in Fig.1. It is a global object appearance-based tracking method realized by an enhanced anisotropic mean shift. Mean shift tracking yields reasonably good tracking results; however, tracking drift may occur when nearby objects or background clutter have similar color distributions. To tackle the problem, two strategies are introduced here to the mean shift. The first one is to employ an enhanced anisotropic mean shift, where the result from the point feature-based tracking (the baseline tracker-A) is used to partially guide the location of the mean shift kernel. The second strategy is to add online learning of the reference object appearance (in Section 5.2), as well as a re-initialization process to prevent the propagation of tracking drift.
This baseline tracker-b results in a tracked candidate object region $R_t^{(2)}$ specified by a parameter vector $V_t^{(2)} = [y^{(2)} = (y_1^{(2)}, y_2^{(2)}), w^{(2)}, h^{(2)}, \theta^{(2)}, h_{rgb}^{(2)}]^T$, where $y^{(2)} = (y_1^{(2)}, y_2^{(2)})$, $w^{(2)}$, $h^{(2)}$ and $\theta^{(2)}$ are the 2D center, width, height and orientation of $R_t^{(2)}$, and $h_{rgb}^{(2)}$ is the color histogram.

Anisotropic mean shift

This subsection briefly reviews the anisotropic mean shift for visual tracking. More details of mean shift can be found in (Sumin & Xianwu, 08; Comaniciu et al., 03). Let the pdf estimate $p(y, \Sigma) = \{p_u(y, \Sigma), u = 1, \dots, m\}$ for a candidate object be the spatial kernel-weighted color histogram within the bounding box in the image $I(y)$, and let $q(x_c, \Sigma_c) = \{q_u(x_c, \Sigma_c), u = 1, \dots, m\}$ be the spatial kernel-weighted color histogram for the reference object within the bounding box in $I_0(x)$. The corresponding histogram bins for the candidate and reference objects are described respectively as follows:

$$p_u(y, \Sigma) = \frac{c}{|\Sigma|^{1/2}} \sum_{j=1}^{n} k\!\left(\tilde{y}_j^T \Sigma^{-1} \tilde{y}_j\right) \delta[b_u(I(y_j)) - u]$$
$$q_u(x_c, \Sigma_c) = \frac{c_0}{|\Sigma_c|^{1/2}} \sum_{j=1}^{n} k\!\left(\tilde{x}_j^T \Sigma_c^{-1} \tilde{x}_j\right) \delta[b_u(I_0(x_j)) - u] \tag{5}$$

where $\tilde{y}_j = (y_j - y)$, $\tilde{x}_j = (x_j - x_c)$, $\Sigma$ (or, $\Sigma_c$) is a kernel bandwidth matrix, $b_u(I(y_j))$ (or, $b_u(I_0(x_j))$) is the index of the color histogram bin at the location $y_j$ (or, $x_j$) associated with the

candidate (or, reference) object region, $y_j$ (or, $x_j$) is summed over all pixels within the bounding box, $c$ (or, $c_0$) is a constant used for the normalization, $m$ is the total number of bins, $k(\cdot)$ is the spatial kernel profile, and $y$ (or, $x_c$) is the center of the kernel (or, bounding box). To measure the similarity between a candidate and the reference object region, the Bhattacharyya coefficient $\rho$, defined below, is used:

$$\rho(p, q) = \sum_{u=1}^{m} \sqrt{p_u(y, \Sigma)\, q_u} \tag{6}$$

Applying the first-order Taylor series expansion to (6) around $(y_0, \Sigma_0)$ (where $y_0$ and $\Sigma_0$ are the kernel center and bandwidth in the previous frame) yields

$$\rho \approx \sum_{u} \frac{1}{2} \sqrt{q_u\, p_u(y_0, \Sigma_0)} + \frac{c}{2 |\Sigma|^{1/2}} \sum_{j} \omega_j\, k\!\left(\tilde{y}_j^T \Sigma^{-1} \tilde{y}_j\right), \quad \text{where} \quad \omega_j = \sum_{u} \sqrt{\frac{q_u}{p_u(y_0, \Sigma_0)}}\, \delta[b_u(I(y_j)) - u].$$

The kernel center (or, bounding box center) can be estimated by setting $\nabla_y \rho(p, q) = 0$. This leads to:

$$\hat{y} = \frac{\sum_{j=1}^{n} g\!\left(\tilde{y}_j^T \Sigma^{-1} \tilde{y}_j\right) \omega_j\, y_j}{\sum_{j=1}^{n} g\!\left(\tilde{y}_j^T \Sigma^{-1} \tilde{y}_j\right) \omega_j} \tag{7}$$

where $g(\cdot) = -k'(\cdot)$ is the shadow of the kernel. To estimate $\hat{\Sigma}$, a $\gamma$-normalized kernel bandwidth $\Sigma$ (in (Bretzner & Lindeberg, 98)) is applied to $\rho$, where $y$ in $\tilde{y}_j$ is substituted by the $\hat{y}$ obtained from (7). The kernel bandwidth matrix is estimated by setting $\nabla_\Sigma \left( |\Sigma|^{\gamma/2} \rho(p, q) \right) = 0$, yielding

$$\hat{\Sigma} = \frac{2}{1 - \gamma} \cdot \frac{\sum_{j=1}^{n} \omega_j\, g\!\left(\tilde{y}_j^T \Sigma^{-1} \tilde{y}_j\right) \tilde{y}_j \tilde{y}_j^T}{\sum_{j=1}^{n} \omega_j\, k\!\left(\tilde{y}_j^T \Sigma^{-1} \tilde{y}_j\right)} \tag{8}$$

where $\gamma$ is empirically determined, and $\tilde{y}_j = (y_j - \hat{y})$. The estimations in (7) and (8) are performed alternately in each iteration. The iteration repeats until the estimated parameters converge, or a pre-specified maximum number of iterations is reached.

Estimating shape parameters of bounding box

For estimating the parameters of the bounding box in the baseline tracker-b, a simple approach different from (Sumin & Xianwu, 08) is employed. The anisotropic mean shift used in the baseline tracker-b contains a fully tunable affine box with five degrees of freedom, i.e., the 2D central position, width, height and orientation of the box.
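To make the center update in (7) concrete, the sketch below computes the Bhattacharyya coefficient (6), the pixel weights $\omega_j$, and one update of the kernel center. It assumes an Epanechnikov profile, for which $g(\cdot)$ is constant inside the unit ellipse and zero outside; the chapter does not state which profile is used, and the function names and argument layout are hypothetical.

```python
import numpy as np

def bhattacharyya(p, q):
    """Bhattacharyya coefficient (6) between two normalized histograms."""
    return float(np.sum(np.sqrt(np.asarray(p, float) * np.asarray(q, float))))

def mean_shift_center(pixels, bins, p, q, center, sigma):
    """One update of the kernel center following (7).

    pixels: n x 2 array of pixel coordinates y_j.
    bins:   length-n integer array, histogram bin index of each pixel.
    p, q:   candidate and reference histograms (length m).
    center: current kernel center y; sigma: 2x2 bandwidth matrix.
    Assumes an Epanechnikov profile, so g = 1 inside the unit ellipse.
    """
    pixels = np.asarray(pixels, dtype=float)
    bins = np.asarray(bins, dtype=int)
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    diff = pixels - np.asarray(center, dtype=float)
    # Squared Mahalanobis distance of each pixel to the kernel center.
    maha = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(sigma), diff)
    g = (maha <= 1.0).astype(float)
    # omega_j = sqrt(q_u / p_u) for the bin u that pixel j falls into.
    omega = np.sqrt(q[bins] / np.maximum(p[bins], 1e-12))
    w = g * omega
    if w.sum() == 0.0:
        return np.asarray(center, dtype=float)
    return (w[:, None] * pixels).sum(axis=0) / w.sum()
```

Pixels whose color is under-represented in the candidate histogram relative to the reference receive larger weights, so the center moves toward them; the bandwidth update (8) would alternate with this step.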
Let the orientation of the box be defined as the angle between the long axis of the bandwidth matrix and the horizontal axis of the coordinate system. Let the height $h$ and width $w$ of the bounding box be defined as the radii along the long and short axes of an ellipse, as depicted in Fig.3. Since $h$, $w$, and $\theta$ are related to the kernel bandwidth matrix $\Sigma$ by

$$\Sigma = R^T(\theta) \begin{bmatrix} (h/2)^2 & 0 \\ 0 & (w/2)^2 \end{bmatrix} R(\theta) \tag{9}$$

where $R = \begin{bmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{bmatrix}$, these parameters can be computed efficiently by applying eigenvalue decomposition to $\Sigma$,

$$\Sigma = V \Lambda V^{-1} \tag{10}$$

where $V = \begin{bmatrix} v_{11} & v_{12} \\ v_{21} & v_{22} \end{bmatrix}$, $\Lambda = \begin{bmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{bmatrix}$, and by relating these parameters to the eigenvectors and eigenvalues using (9) and (10) as follows,

$$\hat{\theta} = \tan^{-1}(v_{21} / v_{11}), \quad \hat{h} = 2\sqrt{\lambda_1}, \quad \hat{w} = 2\sqrt{\lambda_2} \tag{11}$$

where $v_{11}$ and $v_{21}$ are the two components of the eigenvector associated with the largest eigenvalue. The first five parameters in $V_t^{(2)}$ are obtained from the estimates in (7), (8) and (11).

Fig. 3. Definition of the width, height and orientation of a bounding box.

Enhancing the anisotropic mean shift

The basic idea of the enhanced mean shift is to partially guide the location of the mean shift kernel using the result of the local feature point-based tracker (the baseline tracker-a). This is designed to correct possible tracking drift due to, e.g., objects or background clutter with similar color distributions, or partial object occlusions/intersections. Enhancement is done by assigning the mean shift tracker to an area that also agrees with the area given by the local feature points of the target.

Areas used for the mean shift and for the candidate object: To limit the number of background pixels entering the foreground region, one may use a slightly smaller ellipse area inside the rectangular box (e.g., scaled by $K$, $K = 0.9$ in our tests) of the candidate object region $R_t^{(2)}$. This is based on the observation that an ellipse can be tighter for some object types and would therefore exclude the background pixels around the four corners of the rectangular box.

The result from the local feature-based tracker (i.e., the baseline tracker-a) is employed to guide the kernel position of the mean shift if the result from the baseline tracker-a is shown to be reliable. This is done by examining the Bhattacharyya coefficient and the number of consensus feature points in the tracked object region $R_t^{(1)}$. If they are both high (indicating a reliable result), then the initial parameter vector in $R_t^{(2)}$ for the baseline tracker-b is assigned by that in the tracker-a, otherwise by the parameter vector from the tracker-b at $t-1$, i.e.

$$V_t^{(2)} = \begin{cases} V_t^{(1)} & \text{if } |P^F| > T_1^{(2)} \text{ and } \rho_t^{(1)} > T_2^{(2)} \\ V_{t-1}^{(2)} & \text{otherwise} \end{cases} \tag{12}$$

where $T_1^{(2)}$ and $T_2^{(2)}$ are thresholds determined empirically.

Re-initializing the region

A re-initialization process is added to tackle the issue of tracking drift or tracking error propagation across frames. The idea is analogous to applying intra video coding at each fixed frame interval. The following criterion, in a similar spirit to that in the baseline tracker-a,

is used to determine whether or not the re-initialization is applied to the tracked region $\hat{R}_t^{(2)}$ for the baseline tracker-b. That is, if one of the following two conditions is satisfied:

$$\mathrm{dist}_t^{(2)} = \sum_{i=1}^{4} \left\| x_{t,i}^{(2)} - x_{t-1,i}^{(2)} \right\|^2 > T_3^{(2)}, \quad \text{or} \quad \rho_t^{(2)} < T_4^{(2)} \tag{13}$$

then the baseline tracker-b is re-initialized at $t$ by:

$$R_t^{(2)} \leftarrow R_{t-1}^{(obj)}, \quad V_t^{(2)} \leftarrow V_{t-1}^{(obj)}, \quad \text{and} \quad \tilde{\rho}_t^{(2)} \leftarrow 0 \tag{14}$$

where $R_{t-1}^{(obj)}$ is the previous tracked bounding box from the final hybrid tracker at $(t-1)$, $V_{t-1}^{(obj)}$ is the corresponding parameter vector, $\tilde{\rho}_t^{(2)}$ is the normalized Bhattacharyya coefficient defined in (21), $x_{t,i}^{(2)}$ and $x_{t-1,i}^{(2)}$ are the four corners of the tracked object regions $R_t^{(2)}$ and $R_{t-1}^{(2)}$ from the baseline tracker-b, and $T_3^{(2)}$ and $T_4^{(2)}$ are two empirically determined thresholds.

5. Online learning in the spatiotemporal SIFT-RANSAC and the enhanced anisotropic mean shift

Online learning is an important step that may significantly impact the tracking performance. In this section, we describe two online learning techniques: one is used to maintain two feature point sets from the spatiotemporal SIFT-RANSAC in the baseline tracker-a, and the other is applied to update the reference object distribution for the enhanced anisotropic mean shift in the baseline tracker-b.

5.1 Online learning of local point feature sets

This subsection describes a key step in the spatiotemporal SIFT-RANSAC: online learning of two feature point sets across video frames. The use of spatiotemporal SIFT-RANSAC is motivated by the problems encountered in the conventional SIFT-RANSAC when tracking objects through long video sequences. Often, the number of corresponding feature points decreases significantly through video frames due to object pose changes, partial occlusions or intersections. In these cases, some feature points on the object may disappear, so that only a subset of feature points find their correspondences across two image frames.
This phenomenon may propagate through video frames, and may eventually lead to a dramatically reduced number of corresponding points. When the number of points is very small, the region surrounding these points may become very small and unreliable. Further, motion parameter estimation can become ill-posed if the number of equations is less than the number of unknown parameters.

The online learning procedure in the spatiotemporal SIFT-RANSAC is designed to dynamically maintain the corresponding point sets through video frames. This includes the possibility of adding new points, pruning weak points with low weights, and freezing the adaptation when partial occlusions are suspected. This online learning is applied separately to the foreground candidate object region and the surrounding background region. The method for online maintenance of spatiotemporal point correspondences is similar to the work in (Haner & Gu, 10; Khan & Gu, 10), which contains the following main steps: assign weights to all corresponding feature points; add new candidate feature points;

prune feature points with low weights; freeze the updating when a partial occlusion of the object is likely to occur; maintain the foreground feature point set $P^F$ and the background feature point set $P^B$ separately.

Maintenance of foreground feature point set $P^F$

For this feature point set, online learning of the dynamic feature points contains the following main steps:

Assigning weights: First, a set of feature points $P^F = \{p_i = (x_i, y_i), i = 1, 2, \dots\}$ at the frame $t$ is selected from the SIFT, within the transformed bounding box area obtained by applying the affine transform parameters estimated by the RANSAC to the tracked object box at $(t-1)$. Feature points from RANSAC that are outside the transformed bounding box (i.e., belong to the background) are then removed. After that, a candidate consensus point set $P^F$ is created. $P^F$ consists of three subsets of feature points according to how consensus correspondences in the foreground set are established,

$$P^F = \{P_a^F \cup P_b^F \cup P_c^F\} \tag{15}$$

In (15), $P_a^F$ contains matched consensus points, i.e., the feature points selected by the RANSAC; $P_b^F$ contains the outliers that fail to agree with the best estimated transform parameters in the RANSAC (which could be the result of noise influence or object dynamics); and $P_c^F$ is the set of newly added feature points from the SIFT that are initiated within the candidate object region at $t$ and do not correspond to any background feature points. Each feature point $p_i$ in the candidate set $P^F$ is assigned a weight according to:

$$W_t^i = \begin{cases} W_{t-1}^i + 2 & p_i \in P_a^F \\ W_{t-1}^i - 1 & p_i \in P_b^F \\ W_0^i & p_i \in P_c^F \end{cases} \tag{16}$$

where the initial weight for a new feature point $W_0^i$ is set to the median weight value of all points in the subsets $P_a^F$ and $P_b^F$, i.e., $W_0^j = \mathrm{median}(W_t^j \mid p_j \in P_a^F \cup P_b^F)$, and $W_{t-1}^i$ for $P_a^F$ and $P_b^F$ is initialized to zero in the first frame. Once the maximum consensus points are selected at $t$, their weights in (16) are increased. For those corresponding points that do not fit the estimated transform parameters in the RANSAC (i.e., matched outliers), their weights in (16) are reduced.
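The weight rule in (16) is simple enough to sketch directly. The snippet below is an illustrative sketch only: point identifiers, the dictionary layout and the function name are assumptions, and the median for new points is taken over the already updated weights of the inlier and outlier subsets, which is one reading of the definition of $W_0$.

```python
from statistics import median

def update_weights(prev_weights, inliers, outliers, new_points):
    """Assign weights following (16).

    prev_weights: dict mapping point id -> weight at t-1 (missing ids count as 0).
    inliers:    ids in P_a (RANSAC consensus points), weight + 2.
    outliers:   ids in P_b (matched but rejected by RANSAC), weight - 1.
    new_points: ids in P_c (newly detected points), set to the median weight.
    """
    weights = {}
    for pid in inliers:
        weights[pid] = prev_weights.get(pid, 0.0) + 2.0
    for pid in outliers:
        weights[pid] = prev_weights.get(pid, 0.0) - 1.0
    # New points start at the median of the inlier and outlier weights.
    w0 = median(weights.values()) if weights else 0.0
    for pid in new_points:
        weights[pid] = w0
    return weights
```

Starting new points at the median rather than at zero keeps them from being removed immediately when low-weight points are discarded.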
Adding or pruning consensus points: After updating the weights in (16), feature points in $P^F$ are then updated. This is done by first sorting the feature points in $P^F$ according to their weights. New feature points in $P_c^F$ are added with the median weight value, so that they may remain in the set after the subsequent pruning process. The pruning process is then applied to keep a reasonable size of $P^F$ by removing low-weight feature points from the set. This is done by keeping the $L_F$ highest-weight points in the set ($L_F$ is empirically determined, $L_F = 1000$ in our tests) and removing the remaining ones.

Freezing the updating when a partial occlusion is highly probable: If an object is occluded by cluttered background or intersected by other objects, the object appearance within the bounding box may change significantly. The Bhattacharyya coefficient value may indicate the existence of such scenarios, as the images of the tracked object and the reference object become less similar. When such a scenario occurs, the online maintenance process should be temporarily frozen in order not to include the feature points from the background clutter or

the occluding object. The Bhattacharyya coefficient is computed by using the histograms from the reference object and from the tracked object region $R_t^{(1)}$ obtained in RANSAC as $\rho_t^{(1)} = \sum_{u=1}^{m} \sqrt{p_u^{t,(1)}(y, \Sigma)\, q_u^t}$, where $p_u^{t,(1)}$ and $q_u^t$ are the $u$th bin of the spatial kernel-weighted color histogram from $R_t^{(1)}$ of tracker-a, and of the reference object region, respectively. The kernel center $y = (y_1, y_2)$ and the anisotropic kernel bandwidth matrix $\Sigma$ can be computed using the method described in Section 4.2. The Bhattacharyya coefficient $\rho_t^{(1)}$ is used to indicate whether or not the region $R_t^{(1)}$ is likely to contain occluding objects or background area, e.g., from partial occlusions or object intersections. If $\rho_t^{(1)}$ is small, indicating that the global appearance of the object is significantly different from that of the reference object, then the dynamic maintenance is temporarily frozen, so that no new feature points would be wrongly added. This is done as follows: if $\rho_t^{(1)} \le T_F$ ($T_F$ is a threshold determined empirically), then the maintenance process freezes; otherwise the maintenance process proceeds.

Maintenance of background feature point set $P^B$

Online learning is also performed on the background set to maintain the dynamic background feature points. The background set contains feature points between the large searching box and the candidate object region (see the area between the black and red rectangles in Fig.2). Compared with the foreground case, the following differences exist for maintaining the background set:

The searching area and its re-initialization: The search area at $t$ (see the black rectangle in Fig.2) is a rectangular area that is larger than the tracked object region $R_{t-1}^{(1)}$. This is done by extending both the left and right sides of $R_{t-1}^{(1)}$ by $k_x w^{(1)}$ ($w^{(1)}$ is the width of $R_{t-1}^{(1)}$, and $k_x = 0.5$ in our tests), and extending both the top and bottom sides of $R_{t-1}^{(1)}$ by $k_y h^{(1)}$ ($h^{(1)}$ is the height of $R_{t-1}^{(1)}$, and $k_y \in [0.1, 0.5]$ in our tests).
This results in a searching area of $(2k_x + 1)w^{(1)}$ in width and $(2k_y + 1)h^{(1)}$ in height. Correspondingly, the search area is re-initialized immediately after the re-initialization of the foreground object region $R_t^{(1)}$.

Shift foreground points to the background set: Those feature points in the foreground set that find their correspondences in the background set are re-assigned to the background set.

Adding and pruning new background points: New feature points that are found in the background region at the current frame $t$ are added to this set. A maximum number $L_B$ is then assigned to the background point set ($L_B = 1500$ in our tests, empirically determined). Feature points in this set are sorted according to their age: if the total number of feature points exceeds $L_B$, then only the newest $L_B$ feature points are kept while the remaining older points are removed.

5.2 Online learning of dynamic reference object appearance

Since the appearance of a target object may change over time, using a time-independent appearance distribution $q^t = q$ for the reference object may lead to tracking drift or tracking failures in the mean shift, especially when the pdf of a visual object changes (e.g., significant changes in the object color distribution). Despite much research work in mean shift-based object tracking, online learning of the reference object pdf $q$ remains an open issue. This is mainly due to the ambiguity about the cause of a change, which could be the object itself or some background interference (e.g., occlusions/intersections or background clutter), and the lack of

occlusion priors. An efficient method for online learning of the reference object pdf for the mean shift has so far been reported in (Khan & Gu, 10). The basic idea behind this online learning method is to update the reference appearance distribution only at those frames where reliable tracking without occlusion is indicated. Further, the updating frequency does not need to be very high, since object changes are usually gradual due to the mechanical movement. The online learning is done by using a criterion function and seeking the local maximum point that corresponds to good tracking performance in each individual frame interval of fixed length. Let $\rho_t = \sum_u \sqrt{q_u^{j-1} p_u^t}$ be the Bhattacharyya coefficient between the current tracked object from the final tracker and the reference object in the previous $(j-1)$th interval, and let $x_{t,i}$ be the four corners of the tracked region $R_t^{(obj)}$ from the final tracker. Note that $q_u^{j-1}$ implies that $q_u$ is in the $(j-1)$th interval, $t \in [(j-2)S + 1, (j-1)S]$, where $S$ is the total number of frames in the interval ($S$ is empirically determined depending on the motion speed and video frame rate, $S = 25$ frames in our tests). If the following conditions are both satisfied:

$$\mathrm{dist}_t = \sum_{i=1}^{4} \left\| x_{t,i} - x_{t-1,i} \right\|^2 < T_1, \quad \text{and} \quad \rho_t > T_2 \tag{17}$$

then the reference object appearance distribution $q^j$ in the $j$th interval is updated by

$$q^j = \kappa\, p^{t^*} + (1 - \kappa)\, q^{j-1} \tag{18}$$

where $j = 1, 2, \dots$, and $t^*$ is the highest performance frame chosen from

$$t^* = \arg\max_{t \in [(j-1)S + 1,\; jS]} \rho_t \tag{19}$$

and $q^j$ is the updated reference object pdf in the $j$th interval that is related to the time interval $t \in [(j-1)S + 1, jS]$, $\kappa$ is the constant controlling the learning rate ($\kappa = 0.1$ in our tests), and $p^{t^*}$ is the appearance distribution of the candidate object where $t^*$ is chosen from (19). If (17) is not satisfied, then the reference object distribution remains unchanged, i.e., $q^j \leftarrow q^{j-1}$.

Key steps for updating $q^j$ in the $j$th interval can be summarized as:
1. Check whether the conditions in (17) are satisfied in $t \in [(j-1)S + 1, jS]$;
2. If satisfied, update $q^j$ using (18), where the frame $t^*$ is selected from (19);
3. Otherwise, freeze the update by assigning $q^j \leftarrow q^{j-1}$.
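The three steps above can be sketched as a single function called once per interval. This is a minimal sketch under stated assumptions: the per-frame histograms, coefficients and corner distances of the interval are collected by the caller, the conditions in (17) are checked at the selected frame $t^*$ (the text leaves open whether they are checked at $t^*$ or across the whole interval), and all names are hypothetical.

```python
import numpy as np

def update_reference(q_prev, candidates, rhos, dists, t_dist, t_rho, kappa=0.1):
    """Update the reference histogram for one interval of S frames.

    q_prev:     reference histogram q^{j-1}.
    candidates: list of candidate histograms p^t, one per frame in the interval.
    rhos:       Bhattacharyya coefficient rho_t for each frame.
    dists:      corner distance dist_t for each frame.
    t_dist, t_rho: thresholds playing the role of T1 and T2 in (17).
    kappa:      learning rate in (18).
    """
    q_prev = np.asarray(q_prev, dtype=float)
    rhos = np.asarray(rhos, dtype=float)
    dists = np.asarray(dists, dtype=float)
    best = int(np.argmax(rhos))  # frame t* from (19)
    if dists[best] < t_dist and rhos[best] > t_rho:  # conditions (17) at t*
        p_best = np.asarray(candidates[best], dtype=float)
        return kappa * p_best + (1.0 - kappa) * q_prev  # update (18)
    return q_prev.copy()  # freeze: q^j <- q^{j-1}
```

With `kappa = 0.1` the reference moves only one tenth of the way toward the best candidate per interval, which matches the remark that appearance changes are gradual.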
As an example, Fig.4 shows the two performance curves in (17) for the video "stair walking", where the thresholds $T_1$ and $T_2$ (blue dashed lines) and the updated frames (red dots) are marked. To demonstrate the effect of online learning, Fig.5 shows the results from the hybrid tracker with and without adding online learning to the reference object appearance. It shows that the improvement in tracking performance becomes more visible as the video frame number increases. Since object appearance changes gradually in time, online learning of the dynamic reference object distribution has indeed yielded a visible improvement in tracking.

6. Hybrid tracker formulated from a criterion function

This section describes the formulation of the hybrid tracker through combining the two baseline trackers under a given criterion function.

For the baseline tracker-a in Section 4.1, feature point correspondences are estimated by using spatiotemporal SIFT-RANSAC in the foreground and background regions. A tight rectangle


surrounding the foreground points is then drawn as the candidate object region $R_t^{(1)}$ that is described by a parameter vector, $R_t^{(1)}: V_t^{(1)} = [y_c = (y_{1,c}^{(1)}, y_{2,c}^{(1)}), w^{(1)}, h^{(1)}, \theta^{(1)}, P^F]^T$, containing the 2D center position, width, height and orientation of the bounding box, and the foreground point set.

Fig. 4. Online learning of $q^t$ for the video "stair walking". Left: the curve of the Bhattacharyya coefficient $\rho_t$ in (17) vs. frame number, where the blue dashed line is the threshold $T_2$, and the red dots are the frames updated. Right: the curve of the distance of four corners $\mathrm{dist}_t$ in (17) vs. frame number, where the blue dashed line is $T_1$ and the corresponding updated frames are shown as red dots.

Fig. 5. Left: Tracking errors for the hybrid tracker as a function of video frame number. The distance $d_1$ in the curve is defined between the tracked and the ground-truth regions, according to (25). Red: with online learning; Blue: without online learning. Both curves are for the "stair walking" video. Right: an example frame of "stair walking".

For the baseline tracker-b in Section 4.2, an image region $R_t^{(2)}$ whose image content is most similar to the reference object appearance is sought by using the enhanced anisotropic mean shift with its kernel partially guided by the local feature points. This enhanced mean shift tracker generates a parameterized candidate region $R_t^{(2)}: V_t^{(2)} = [y = (y_1^{(2)}, y_2^{(2)}), w^{(2)}, h^{(2)}, \theta^{(2)}, h_{rgb}^{(2)}]^T$.

A third candidate object region $R_t^{(3)}$ is then formed whose parameter vector is a weighted combination of the parameter vectors of the above two baseline trackers, i.e.,

$$R_t^{(3)}: \quad V_t^{(3)} = \sum_{i=1}^{2} \tilde{\rho}_t^{(i)} V_t^{(i)} \tag{20}$$

where $\rho_t^{(i)}$ is the Bhattacharyya coefficient (defined in (23)), and $\tilde{\rho}_t^{(i)}$ is the normalized Bhattacharyya coefficient for the two baseline trackers,

$$\tilde{\rho}_t^{(i)} = \frac{\rho_t^{(i)}}{\rho_t^{(1)} + \rho_t^{(2)}} \tag{21}$$

For the final hybrid tracker, the parameter vector associated with the optimal target object region $R_t^{(obj)}$ is selected by maximizing the following criterion,

$$V_t^{(obj)} = \arg\max_{i:\, V_t^{(i)}} \left\{ \rho_t^{(i)},\; i = 1, \dots, 3 \right\} \tag{22}$$

where $\rho_t^{(i)}$, $i = 1, 2, 3$, is the Bhattacharyya coefficient measuring the similarity between the reference object and the candidate object from the tracked candidate region $R_t^{(i)}$ at time $t$,

$$\rho_t^{(i)} = \sum_{u=1}^{m} \sqrt{q_u^t\, p_u^{t,(i)}} \tag{23}$$

$p_u^{t,(i)}$ is the $u$th bin of the candidate object pdf estimate, either from the baseline tracker-a ($i = 1$) or from the baseline tracker-b ($i = 2$), and $q_u^t$ is the $u$th bin of the reference object pdf estimate. Note that the superscript $t$ in $q_u^t$ indicates that the reference object pdf is dynamic. Table 1 summarizes the algorithm of the entire hybrid tracking scheme.

Initialization: Frame $t = 0$: mark a bounding box for the object and compute $q^0$;
For frame $t = 1, 2, \dots$, do:
  1. Baseline tracker-a: local feature correspondences by the spatiotemporal SIFT-RANSAC.
     1.1 Compute correspondence points by SIFT in the searching area;
     1.2 Find consensus points, estimate the transform, compute scores by RANSAC;
     1.3 Perform dynamic point maintenance on $P^F$ and $P^B$;
     1.4 Compute $V_t^{(1)}$, $R_t^{(1)}$, and $\rho_t^{(1)}$;
     1.5 If (3) is satisfied, re-initialize $V_t^{(1)}$, $R_t^{(1)}$, and $\tilde{\rho}_t^{(1)}$;
  2. Baseline tracker-b: enhanced anisotropic mean shift.
     2.1 Use (12) to determine the initial $V_t^{(2)}$;
     2.2 Compute $\hat{y}$ using (7), and $\hat{\Sigma}$ using (8);
     2.3 Repeat Step 2.2 until convergence;
     2.4 Compute $\hat{w}^{(2)}$, $\hat{h}^{(2)}$, $\hat{\theta}^{(2)}$ from (11), and form $V_t^{(2)}$;
     2.5 Compute $\rho_t^{(2)}$ and form $R_t^{(2)}$;
     2.6 If (13) is satisfied, re-initialize $V_t^{(2)}$, $R_t^{(2)}$ and $\tilde{\rho}_t^{(2)}$;
  3. Compute the combined region parameters $V_t^{(3)}$ using (20);
  4. Determine $R_t^{(obj)}$ for the hybrid tracker according to (22);
  5. Online learning of the object appearance pdf: if $\mathrm{mod}(t, S) = 0$ (i.e., at the boundary of an interval), then update $q^j$ using (18) if the conditions in (17) are satisfied;
END (For)

Table 1. The algorithm for the hybrid tracking scheme.
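Steps 3 and 4 of Table 1, i.e. the combination in (20)-(21) and the selection in (22), can be sketched as follows. The sketch restricts the parameter vectors to the five numeric box parameters (center, width, height, orientation), since the point set and the histogram cannot be averaged in this way; that restriction, the function name, and the callback `rho_of` (which would evaluate (23) for a given box) are assumptions made for illustration.

```python
import numpy as np

def hybrid_select(v1, v2, rho1, rho2, rho_of):
    """Form the combined candidate and pick the best of three regions.

    v1, v2:     box parameters [y1, y2, w, h, theta] from tracker-a and tracker-b.
    rho1, rho2: Bhattacharyya coefficients of the two candidates.
    rho_of:     hypothetical callable returning the coefficient (23) for a box.
    Returns (index in {1, 2, 3}, selected parameter vector).
    """
    v1 = np.asarray(v1, dtype=float)
    v2 = np.asarray(v2, dtype=float)
    # Normalized coefficients (21) used as combination weights in (20).
    w1 = rho1 / (rho1 + rho2)
    w2 = rho2 / (rho1 + rho2)
    v3 = w1 * v1 + w2 * v2
    rho3 = rho_of(v3)
    candidates = [(rho1, v1), (rho2, v2), (rho3, v3)]
    # Criterion (22): keep the candidate with the largest coefficient.
    best = max(range(3), key=lambda i: candidates[i][0])
    return best + 1, candidates[best][1]
```

Because the combined box is only accepted when its own coefficient is the largest, a poor average of two disagreeing trackers is never selected over the better baseline result.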

7. Application: Employ online learning in the joint mean shift and particle filter-based tracker

In this section, we show a new application example where the online learning approach in Section 5.2 is directly added to an existing joint anisotropic mean shift and particle filter-based (JMSPF) tracking scheme (Khan et al., 09), in order to further improve the tracking performance.

In the JMSPF tracking scheme, a particle filter is used to track the parameters of the object shape (or, the bounding box of the object), while the multi-mode anisotropic mean shift is embedded in the particle filter by shifting its kernel location to the most similar object area and subsequently forming the conditional probability based on the appearance distance metric. In this way, PF weights are updated by using the likelihood obtained from the mean shift, which re-distributes particles according to the distance metric by exploiting the most similar object appearance from the mean shift, rather than using random sampling. This leads to a more efficient use of particles, and hence a significant reduction in the required number of particles: from $N_P = 800$ particles when the state vector contains both the shape and the appearance of the object (Wang et al., 08), to $N_P = 15$ particles in this JMSPF tracking scheme. For details of the JMSPF tracking scheme, readers are referred to (Khan et al., 09).

Due to the lack of effective online learning methods, the JMSPF tracker in (Khan et al., 09) uses a time-invariant reference object appearance. Adding online learning of the object appearance to the JMSPF tracker can be done by applying (18) in fixed-length frame intervals, with a small modification to include the superscript $i$ for particles,

$$q^{j,i} = \kappa\, p^{t^*,i} + (1 - \kappa)\, q^{j-1,i}, \quad i = 1, \dots, N_P \tag{24}$$

if both conditions in (17) are satisfied. Further, the re-initialization process described for the baseline trackers can also be applied. The JMSPF tracking scheme after adding the online learning and re-initialization process is shown schematically in Fig.6.

Fig. 6.
Block diagram of the joint multi-mode anisotropic mean shift and particle filter-based tracking scheme (JMSPF) with online learning. Notations used in the block diagram: $I_t$: image frame at $t$; $\hat{s}_{t-1}^{(obj)}$ and $\hat{s}_t^{(obj)}$: tracked box parameters at $(t-1)$ and $t$; $s_t^{(j)}$: $j$th particle at $t$; $q^{t-1}$ and $q^t$: the estimated reference object pdf at $(t-1)$ and $t$.

Fig.7 shows the tracking errors on the two videos "ThreePastShop2Cor" (CAVIAR Dataset) and "Pets2006_S07_C4" (PETS2006) with and without online learning. One can see that adding online learning to the scheme further improves the tracking robustness and reduces the tracking drift in these cases.

Fig. 7. Tracking errors for the JMSPF tracker: the distance $d_1$ in the curve is defined between the tracked and ground-truth regions according to (25). Red: with online learning; Blue: without online learning. Left: from the video "ThreePastShop2Cor"; Right: from the video "Pets2006_S07_C4". The ground truth boxes are manually marked. For "ThreePastShop2Cor", only the first 250 frames of ground truth boxes are marked and compared.

8. Experimental results and performance evaluation

The hybrid tracking scheme (summarized in Section 6) has been tested on numerous videos containing a range of complex scenarios. Our test results have shown that the hybrid tracking scheme is very robust and has yielded a marked improvement in terms of tracking drift, tightness and accuracy of tracked bounding boxes. This is especially evident when complex scenarios in videos contain long-term partial occlusions, object intersections, severe object deformation, or cluttered background / background objects with similar color distributions to the foreground object.

8.1 Experimental setup

For testing the effectiveness of the hybrid tracking scheme, test videos that contain difficult scenarios in a range of complexities (e.g., long-term partial occlusion, object intersection, deformation, or pose changes) are selected. These videos are captured from either a dynamic or a static camera. In the tests, the initial bounding box is manually marked. Methods for automatically obtaining the initial bounding box are beyond the scope of this chapter; readers can exploit other techniques, e.g., multiple hypothesis tests (Reid, 79), active shape models or polygon vertices (Cootes et al., 01). For the mean shift tracking, a bin histogram is used for the RGB color images. The maximum number of iterations is 10 for the enhanced mean shift for all videos and is determined empirically.
Table 2 summarizes the thresholds used for re-initialization as well as the $\gamma$ values for normalizing the kernel bandwidth matrix of the mean shift in the hybrid tracker. $(T_1^{(2)}, T_2^{(2)})$ in (12) are set to (10, 0.95) in all cases. Further, Table 3 summarizes the online learning thresholds used for the hybrid tracker and the improved JMSPF tracker.

8.2 Qualitative evaluation and comparison of tracking results

The hybrid tracking scheme has been tested on numerous videos that contain a variety of difficult tracking scenarios. Fig.8 shows the tracking results (key video frames) from the hybrid scheme (marked by red solid line rectangles) on 5 videos. In all cases, online learning is included in the hybrid tracker.

Table 2. Parameters in the hybrid tracker: re-initializing thresholds in (3) and (13), and $\gamma$-normalization in (8). Columns: Video; $T_1^{(1)}$ and $T_2^{(1)}$ (in baseline tracker-a); $T_3^{(2)}$, $T_4^{(2)}$ and $\gamma$ (in baseline tracker-b). Rows: walking lady; OneShopOneWait2Cor; ThreePastShop2Cor; Pets2006_S7_C4; Pets2007_DS5_C1.

Table 3. Online learning thresholds in (17) for the hybrid tracker and the JMSPF tracker. Columns: Video; $T_1$ and $T_2$ (hybrid tracker); $T_1$ and $T_2$ (JMSPF tracker). Rows: walking lady; OneShopOneWait2Cor; ThreePastShop2Cor; Pets2006_S7_C4; Pets2007_DS5_C1.

The video "walking lady", captured from a moving camera, contains several long-term partial occlusions when the lady walks behind cars. Further, colors from a part of the object sometimes appear to be similar to the occluding car.

The video "OneShopOneWait2Cor" is downloaded from the CAVIAR dataset (CAVIAR Dataset). The selected target object is a walking man with dark colored clothes. During the course of walking, there is an intersection where another man partially occludes the target man; there are also pose changes while the man is waiting, and scale changes during the course of walking.

The video "ThreePastShop2Cor" is also from the CAVIAR dataset (CAVIAR Dataset). In the video, the selected target (a man wearing a red coat with a backpack) walks in parallel with two other persons before intersecting with one of them by suddenly changing his walking direction. The man continues walking and passes another nearby person with a red coat coming from the opposite direction. After a while, several other intersections appear as the man walks continuously away from the camera (depth changes).

The video "Pets2006_S7_C4" is from the Pets 2006 dataset (PETS2006), named "Dataset S7 (Take 6-B)" by the camera 4. The selected target object is a walking man with dark clothes. During the course of walking, there are several intersections with partial occlusions and pose changes. The man also passes other walking persons wearing similarly colored clothes.
The video "Pets2007_S05_C1" is from the Pets 2007 dataset (PETS2007), named "Dataset S5" from the camera one. The video is probably captured around a check-in desk in an airport, where there are many walking persons. The selected target object is a man with a white shirt carrying a backpack. Tracking this single target through the crowds (containing around 20 persons, some moving and some standing still) is a rather challenging task, as there are many frequent partial occlusions, intersections and pose changes.

The aim of these tests is to qualitatively evaluate the robustness of the hybrid tracking scheme, especially in video scenes containing long-term partial occlusions, object intersections, deformations and fast motion, cluttered background or background objects.

Comparisons: Comparisons are made with three state-of-the-art methods that are closely related to the hybrid tracker described in this chapter. They are:

Tracker-1: an anisotropic mean shift tracker in (Sumin & Xianwu, 08) that is based entirely on global object modeling.
Tracker-2: a spatiotemporal SIFT-RANSAC tracker that is based entirely on local feature points.
Tracker-3: a fragments-based tracker that uses integral histograms (Adam et al., 06).

Since we do not have the program code of Tracker-3, the comparison is made by using the same videos shown for Tracker-3 in (Adam et al., 06). Three videos (rows 1-3 in Fig.8) in (Adam et al., 06) were found by Internet search, and are used for comparisons with Tracker-3.

Fig. 8. Comparing tracking results from 3 trackers: the hybrid tracker (red solid line box), Tracker-1 (green short dash line); Tracker-2 (blue long dash line). From rows 1-6: results for videos (key frames) "walking lady", "OneShopOneWait2Cor", "ThreePastShop2Cor", "Pets2006_S5_C4" and "Pets2007_DS5_C1".

Fig.8 shows the tracking results from the three tracking methods: the hybrid tracker (marked in red solid line boxes), Tracker-1 (marked in green short dash line boxes), and Tracker-2 (marked in blue long dash line boxes). In the tracked results in Fig.8, the hybrid tracker is shown to be very robust, with tightly tracked bounding boxes and without tracking drift. Compared with the two existing trackers, Tracker-1 and Tracker-2, the hybrid tracking scheme is shown to be much more robust, with marked improvement especially in difficult video scenarios that contain long partial occlusions, object intersections, fast object motions, and nearby persons or cluttered background with similar color distributions and shape. For the video "Pets2007_S05_C1", the hybrid tracker eventually failed after 490 frames, as the scenes contain too many persons (around 15) with a high frequency of partial occlusions. By comparison, Tracker-1 and Tracker-2 failed in

about 400 and 240 frames, respectively, while the hybrid tracker managed to track the object somewhat longer.

Fig.9 shows the tracking results in (Adam et al., 06) (referred to as Tracker-3). Comparing the tracking results (marked in red box) shown in Fig.9 and the tracking results in Fig.8 (rows 1-3, marked in red box), one can see that the two trackers have somewhat similar tracking performance in these 3 videos; both have tracked the target object rather well. Comparisons using more complex videos, e.g., "Pets2007_DS5_C1", would probably reveal the performance differences between these 2 trackers; however, no such tests were made, as this would require running the program of (Adam et al., 06).

Fig. 9. Results from Tracker-3 (courtesy of (Adam et al., 06)): results from Tracker-3 (Red); manually selected target (Pink). Top to bottom: frames from the videos "walking lady", "OneShopOneWait2Cor" and "ThreePastShop2Cor".

8.3 Quantitative evaluation and comparisons of performance

To quantitatively evaluate and compare the performance of the hybrid tracker and the two existing trackers (Tracker-1 and Tracker-2), three distance metrics are used.

Distance metrics

The Euclidean distance: is defined between the four corners of the tracked object bounding box and the manually marked ground truth box as follows,

$$d_1 = \sum_{i=1}^{4} \sqrt{\left(x_{i,1} - x_{i,1}^{GT}\right)^2 + \left(x_{i,2} - x_{i,2}^{GT}\right)^2} \tag{25}$$

where $(x_{i,1}, x_{i,2})$ and $(x_{i,1}^{GT}, x_{i,2}^{GT})$, $i = 1, \dots, 4$, are the corners of the rectangular box from the final hybrid tracker and the manually marked Ground Truth (GT), respectively.

The MSE (Mean Square Error): is defined between the 5 parameters (2D center, width, height and orientation) of the tracked object box and the manually marked Ground Truth (GT) object

bounding box over all frames in each video,

$$\mathrm{MSE} = \frac{1}{N} \sum_{t=1}^{N} \left(v_i^{t} - v_i^{t,GT}\right)^2 \qquad (26)$$

where $v_i^{t}$ is the $i$th parameter of a tracked box at $t$, $v_i^{t,GT}$ is the $i$th ground truth parameter at $t$, and $N$ is the total number of frames in the video.

The Bhattacharyya distance is defined between the tracked object box and the reference object box as follows:

$$d_2 = \sqrt{1 - \sum_{u} \rho(p_u, q_u)} \qquad (27)$$

where $u$ is the index of the histogram bin. Under this criterion, good performance is indicated by small $d_2$ values. The average Bhattacharyya distance $\bar{d}_2$ is computed by averaging the Bhattacharyya distances over all frames in each video.

In the first row of Fig.10, we compare the tracking errors for the 3 trackers (the hybrid tracker, Tracker-1 and Tracker-2), in terms of the Euclidean distance $d_1$ (in (25)) between the tracked box and the ground truth box as a function of image frames, on the videos "face" and "walking lady". Comparing the results from the two videos, the hybrid tracker clearly performs better than Tracker-1 and Tracker-2 in these cases. In the 2nd row of Fig.10, we compare the hybrid tracker and the JMSPF tracker with online learning. Comparing the results from the two videos, the JMSPF tracker seems better on "ThreePastShop2Cor" and slightly worse on "walking lady" than the hybrid tracker. The performance of these two methods varies depending on the test videos.

Table 4 shows the tracking errors (the MSEs defined in (26)) for the four trackers: the hybrid tracker, the JMSPF tracker, Tracker-1, and Tracker-2. Comparing the results in the table, the hybrid tracker and the JMSPF tracker clearly perform better than the two existing trackers, Tracker-1 and Tracker-2. Further, the JMSPF tracker is much better than the hybrid tracker on the video "ThreePastShop2Cor" and slightly worse on the video "walking lady".

Video               Box parameters    Hybrid tracker   JMSPF tracker   Tracker-1   Tracker-2
walking lady        x-position
                    y-position
                    width w
                    height h
                    θ (in radians)
ThreePastShop2Cor   x-position
                    y-position
                    width w
                    height h
                    θ (in radians)

Table 4. Tracking errors, the MSE defined in (26), for 4 different trackers.

Table 5 shows the tracking errors, the average Bhattacharyya distances $\bar{d}_2$ in (27), for the four trackers: the hybrid tracker, the JMSPF tracker, Tracker-1 and Tracker-2.
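The three distance metrics in (25)-(27) can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' Matlab code: the function names are hypothetical, the box is assumed to be given by its 5 parameters (center, width, height, orientation in radians) with corners listed in the same order for the tracked and ground truth boxes, and the per-bin term is assumed to be $\rho(p_u, q_u) = \sqrt{p_u q_u}$ with both histograms normalized to sum to 1.

```python
import math


def box_corners(cx, cy, w, h, theta):
    """Four corners of a rotated box from its 5 parameters (theta in radians)."""
    c, s = math.cos(theta), math.sin(theta)
    offsets = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    return [(cx + c * dx - s * dy, cy + s * dx + c * dy) for dx, dy in offsets]


def corner_distance(corners, corners_gt):
    """d1 as in (25): sum of Euclidean distances between matching corners."""
    return sum(math.hypot(x - xg, y - yg)
               for (x, y), (xg, yg) in zip(corners, corners_gt))


def parameter_mse(values, values_gt):
    """MSE as in (26): one box parameter compared over all N frames."""
    n = len(values)
    return sum((v - g) ** 2 for v, g in zip(values, values_gt)) / n


def bhattacharyya_distance(p, q):
    """d2 as in (27), assuming rho(p_u, q_u) = sqrt(p_u * q_u) per bin u."""
    rho = sum(math.sqrt(pu * qu) for pu, qu in zip(p, q))
    # Clamp at zero to guard against tiny negative values from rounding.
    return math.sqrt(max(0.0, 1.0 - rho))


# Hypothetical example: a tracked box shifted by (3, 4) pixels from the truth.
tracked = box_corners(50, 40, 20, 10, 0.0)
truth = box_corners(53, 44, 20, 10, 0.0)
print(corner_distance(tracked, truth))          # each corner is 5 pixels off
print(parameter_mse([10, 12, 14], [11, 12, 16]))
print(bhattacharyya_distance([0.5, 0.5], [0.5, 0.5]))
```

In this sketch, identical histograms give $d_2 = 0$ and histograms with no overlapping bins give $d_2 = 1$, which matches the statement that smaller $d_2$ values indicate better performance. Averaging `bhattacharyya_distance` over all frames of a video would give the per-video value reported in Table 5.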

Video                Hybrid tracker   JMSPF tracker   Tracker-1   Tracker-2
walking lady         0.2076           0.2652          0.3123      0.
OneShopOneWait2Cor
ThreePastShop2Cor
Pets2006_S7_C
Pets2007_DS5_C

Table 5. Tracking errors, the average Bhattacharyya distances $\bar{d}_2$ in (27), for 4 different trackers. The smaller the $\bar{d}_2$, the better the performance.

Fig. 10. Comparison of tracking errors (the Euclidean distance $d_1$ in (25)) on the videos "ThreePastShop2Cor" (column 1) and "walking lady" (column 2). 1st row: comparison among 3 trackers: hybrid tracker (red), Tracker-1 (blue) and Tracker-2 (green); 2nd row: comparison between hybrid tracker (red solid) and JMSPF tracker (blue dash). Note the scale difference in the vertical axis for "walking lady" in the 2nd column.

8.4 Computational cost

To give an indication of the computational cost, the execution times are recorded for four tracking methods: the hybrid tracker (summarized in Section 6), Tracker-1, Tracker-2, and the JMSPF tracker (in Section 7). Table 6 shows the average time (in seconds) required for tracking one object in one video frame, where the average is taken over all frames in each video. Note that the tracking time varies depending on the complexity of the video scenes. All these tracking schemes are implemented as Matlab programs and run on a PC with an Intel Pentium Dual 2.00 GHz processor.

Video                Hybrid tracker   Tracker-1   Tracker-2   JMSPF tracker
                     (in sec)         (in sec)    (in sec)    (in sec)
OneShopOneWait2Cor
ThreePastShop2Cor
Pets2006_S7_C
Pets2007_DS5_C

Table 6. Average required time to track a visual object in one video frame, for 4 different visual trackers. All programs are implemented in Matlab without optimizing the program codes.

Observing Table 6, one can see that the hybrid tracker requires

more computations than Tracker-1 or Tracker-2, as a result of combining the baseline trackers, adding online learning and computing the criterion in order to make the final tracking decision. Despite this, the hybrid tracker achieves an average tracking speed of 10 frames/second using the Matlab program. One may also observe that the JMSPF tracker requires rather heavy computations, approximately 10 times those required by the hybrid tracker.

8.5 Comparison between the hybrid tracker and the JMSPF tracker

Both tracking schemes, the hybrid tracker (summarized in Section 6) and the JMSPF tracker (in Section 7), are shown to be very robust in tracking visual objects. Fig.11 shows some results of tracked video frames (key frames are selected) on 5 videos from these two methods.

Qualitative evaluation of these tracking results through visual comparison shows that both the hybrid tracker and the JMSPF tracker are very robust. From the general visual impression, the JMSPF tracker has slightly better performance in terms of the tightness and the accuracy of the bounding box in some frames. For the video "Pets2007_S05_C1", similar to the hybrid tracker, the JMSPF tracker eventually failed after about 490 frames, as the scenes contain too many persons with a high frequency of partial occlusions.

Quantitative evaluation of the performance, by comparing the $d_2$ values (defined in (27)) in Table 5 (columns 1 and 2) and the $d_1$ values (defined in (25)) in Fig.10 (the right sub-figure), and by comparing the computational speed in Table 6 (columns 1 and 4), shows that the hybrid tracker has slightly smaller (average) $d_2$ values and a much faster computational speed (about 10 times faster) on the tested videos, while the $d_1$ values of the two trackers vary depending on the videos. Overall, the hybrid tracker seems a more attractive choice, given the tradeoff between average performance, tracking robustness and computational speed.

8.6 Limitations

Despite the very robust tracking performance of the hybrid tracking scheme, several weak points are observed from the experiments.
(a) If a target object in the video experiences long-duration partial occlusion over a large percentage of its area (e.g., more than 60%), then the tracking performance can be degraded, especially if the visible part is rather smooth and lacks local feature points.

(b) For images that contain a relatively large object, e.g., a face, large pose changes could potentially cause tracking degradation. This is probably due to the complexity of face movement (many possible local motions) and the use of a pdf as the face appearance model (which may not be the best choice). Improvement through using object-type-specific appearance models and feature point correspondences under multiple local motion models could be considered.

(c) Full object occlusion. Our tests occasionally contain a few frames of full occlusion, which causes the tracker to freeze temporarily or to fail; however, the tracker is able to recover immediately or to resume tracking soon after the object partially reappears. In principle, full occlusion of long duration is beyond the limit of this scheme. The problem may be better tackled by trackers using videos from multiple cameras.

9. Conclusion

A novel hybrid visual tracking scheme is presented in this chapter, which jointly exploits local features and the global appearance and shape of dynamic objects. The hybrid tracker is formulated using a criterion function that optimally combines the results from two baseline trackers: the spatiotemporal SIFT-RANSAC and the enhanced anisotropic mean shift.

Fig. 11. Comparing tracking results from 2 trackers: the hybrid tracker (red solid line box) and the JMSPF tracker with online learning (green dash line box). From rows 1-5: result frames (key frames) from videos "walking lady", "OneShopOneWait2Cor", "ThreePastShop2Cor", "Pets2006_S5_C4" and "Pets2007_DS5_C1".

Online learning of dynamic objects is introduced to the global object appearance and the local object feature points separately. For object appearances, online learning of the appearance distribution of the reference object is performed in each fixed-length frame interval, where the ambiguity between object change and change due to partial occlusions is addressed. For object feature points, online maintenance of two feature point sets (foreground and background) is performed in each video frame, where the background set is used as a prior on the occlusion. It is worth noting that the online maintenance of the feature point sets is a key step in the realization of the spatiotemporal SIFT-RANSAC. It is also worth mentioning that the enhanced mean shift, by allowing the kernel position to be partially guided by local feature points, significantly reduces the sensitivity of mean shift to background or other objects with similar color distributions.

Experimental results on numerous videos with a range of complexities have shown that the hybrid tracker yields very robust tracking performance. This is especially evident when tracking objects through complex scenarios, for example, video scenes where the target object experiences long-term partial occlusions or intersections from other objects, large object deformations, abrupt motion changes, and dynamic cluttered background or occluding objects having similar color distributions to the target object.
Results of quantitative and qualitative evaluations and comparisons of the hybrid tracker and the two existing tracking methods (Tracker-1 and Tracker-2) have shown a marked tracking improvement from the hybrid tracker, in terms of reduced tracking drift and improved tightness of the tracked object bounding box. Comparisons by visually inspecting the tracking results on 3 videos from the hybrid tracker and from (Adam et al.,06) have shown that both trackers perform rather well

in these cases. Further comparisons on complex video scenarios would require running the program from (Adam et al.,06) and hence were not performed. Comparisons of the hybrid tracker and the JMSPF tracker with online learning (in Section 7) have shown that the latter has rather similar, occasionally slightly better, performance, but at the cost of a significant increase in computations (approximately 10 times). Comparisons of the hybrid tracker with and without online learning have shown that adding online learning significantly reduces the tracking drift, especially for long video sequences. Overall, the hybrid tracking scheme is shown to be very robust and yields marked improvements over the existing trackers (Tracker-1 and Tracker-2). Compared with the JMSPF tracker, the hybrid tracker provides a better tradeoff between tracking robustness and tracking speed (10 frames/second in our Matlab program).

10. References

[Adam et al.,06] Adam A., Rivlin E. & Shimshoni I. (2006), "Robust Fragments-based Tracking using the Integral Histogram", vol. 1, in Proc. IEEE Int'l Conf. CVPR.
[Bay et al.,06] Bay H., Tuytelaars T. & Gool L.V. (2006), "SURF: Speeded Up Robust Features", in Proc. European Conf. ECCV.
[Bauer et al.,07] Bauer J., Sunderhauf N. & Protzel P. (2007), "Comparing Several Implementations of Two Recently Published Feature Detectors", in Proc. Int'l Conf. Intelligent and Autonomous Systems, Toulouse, France.
[Bar-Shalom & Fortmann,98] Bar-Shalom Y. & Fortmann T. (1998), Tracking and Data Association. New York: Academic.
[Battiato et al.,07] Battiato S., Gallo G., Puglisi G. & Scellato S. (2007), "SIFT Features Tracking for Video Stabilization", in Proc. Int'l Conf. Image Analysis and Processing.
[Bretzner & Lindeberg,98] Bretzner L. & Lindeberg T. (1998), "Feature Tracking with Automatic Selection of Spatial Scales", in Proc. Comp. Vision and Image Understanding, vol. 71.
[CAVIAR Dataset] http://groups.inf.ed.ac.uk/vision/caviar/caviardata1/
[Chen et al.,08] Chen A., Zhu M., Wang Y. & Xue C. (2008), "Mean shift tracking combining SIFT", in Proc. Int'l Conf. Audio, Language and Image Proc.
[Cootes et al.,01] Cootes T.F., Edwards G.J. & Taylor C.J. (2001), "Active appearance models", IEEE Trans. TPAMI, vol. 23, no. 6, pp. 681-685.
[Collins,03] Collins R.T. (2003), "Mean-shift blob tracking through scale space", in Proc. IEEE Int'l Conf. CVPR 03, vol. 2.
[Comaniciu et al.,03] Comaniciu D., Ramesh V. & Meer P. (2003), "Kernel-based object tracking", IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 5.
[Deguchi et al.,04] Deguchi K., Kawanaka O. & Okatani T. (2004), "Object tracking by the mean-shift of regional color distribution combined with the particle-filter algorithm", in Proc. Int'l Conf. ICPR, vol. 3.
[Fischler & Bolles,81] Fischler M.A. & Bolles R.C. (1981), "Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography", in Communications of the ACM, vol. 24.
[Gordon et al.,01] Gordon N.J., Doucet A. & Freitas N.D. (2001), "Sequential Monte Carlo Methods in Practice", New York: Springer.
[Gordon,00] Gordon N.J., Doucet A. & de Freitas N. (2000), "On sequential Monte Carlo sampling methods for Bayesian filtering", Statistics and Computing, vol. 10.
[Haner & Gu,10] Haner S. & Gu I.Y.H. (2010), "Combining Foreground / Background Feature Points and Anisotropic Mean Shift For Enhanced Visual Object Tracking", in Proc. IEEE Int'l Conf. ICPR, August, Istanbul, Turkey.
[Harris & Stephens,88] Harris C. & Stephens M. (1988), "A Combined Corner and Edge Detector", in Proc. 4th Alvey Vision Conf., Manchester.
[Hager et al.,04] Hager G.D., Dewan M. & Stewart C.V. (2004), "Multiple Kernel Tracking with SSD", vol. 1, in Proc. IEEE Int'l Conf. CVPR 04.
[Khan & Gu,10] Khan Z.H. & Gu I.Y.H. (2010), "Joint Feature Correspondences and Appearance Similarity for Robust Visual Object Tracking", IEEE Trans. Information Forensics and Security, vol. 5, no. 3.
[Khan et al.,09] Khan Z.H., Gu I.Y.H. & Backhouse A.G. (2010), "Robust Visual Object Tracking using Multi-Mode Anisotropic Mean Shift and Particle Filters", to appear, IEEE Trans. Circuits and Systems for Video Technology.
[Khalid et al.,05] Khalid M.S., Ilyas M.U., Mahmoo K., Sarfaraz M.S. & Malik M.B. (2005), "Kullback-Leiber divergence measure in correlation of gray-scale objects", in Proc. 2nd Int'l Conf. on Innovations in Information Technology (IIT05).
[Li et al.,06] Li Y., Yang J., Wu R. & Gong F. (2006), "Efficient Object Tracking Based on Local Invariant Features", in Proc. Int. Symposium on Comm. and Information Technologies (ISCIT).
[Li et al.,08] Li L., Huang W., Gu I.Y.H., Luo R. & Tian Q. (2008), "An efficient sequential approach to tracking multiple objects through crowds for real-time intelligent CCTV systems", IEEE Trans. Systems, Man and Cybernetics, part B, vol. 38, no. 5.
[Lim et al.,04] Lim J., Ross D., Lin R.-S. & Yang M.-H. (2004), "Incremental learning for visual tracking", in Proc. Int'l Conf. NIPS.
[Lowe,04] Lowe D.G. (2004), "Distinctive Image Features from Scale-Invariant Keypoints", Int. Journal of Computer Vision, vol. 60.
[Maggio & Cavallaro,05] Maggio E. & Cavallaro A. (2005), "Multi-part target representation for color tracking", in Proc. IEEE Int'l Conf. ICIP.
[Mondragon et al.,07] Mondragon I.F., Campoy P., Correa J.F. & Mejias L. (2007), "Visual Model Feature Tracking For UAV Control", in Proc. Int'l Symposium Intelligent Signal Processing, pp. 1-6.
[Okuma et al.,04] Okuma K., Taleghani A., Freitas N., Little J.J. & Lowe D.G. (2004), "A boosted particle filter: multitarget detection and tracking", in Proc. Int'l Conf. ECCV.
[PETS2006] http://
[PETS2007] http://pets2007.net/
[Parameswaran et al.,07] Parameswaran V., Ramesh V. & Zoghlami I. (2006), "Tunable kernels for tracking", in Proc. IEEE Int'l Conf. CVPR.
[Reid,79] Reid D.B. (1979), "An algorithm for tracking multiple targets", IEEE Trans. Autom. Control, vol. 24, no. 2.
[Rosales & Sclaroff,99] Rosales R. & Sclaroff S. (1999), "3D trajectory recovery for tracking multiple objects and trajectory guided recognition of actions", in Proc. IEEE Int'l Conf. CVPR.
[Sankaranarayanan et al.,08] Sankaranarayanan A.C., Veeraraghavan A. & Chellappa R. (2008), "Object Detection, Tracking and Recognition for Multiple Smart Cameras", Proc. of the IEEE, vol. 96, no. 10.
[Shi & Tomasi,94] Shi J. & Tomasi C. (1994), "Good features to track", in Proc. IEEE Int'l Conf. CVPR.
[Strandmark & Gu,09] Strandmark P. & Gu I.Y.H. (2009), "Joint Random Sample Consensus and Multiple Motion Models for Robust Video Tracking", in Springer LNCS, vol. 5575.
[Skrypnyk & Lowe,04] Skrypnyk I. & Lowe D.G. (2004), "Scene modelling, recognition and tracking with invariant image features", in Proc. Int. Symposium Mixed and Augmented Reality (ISMAR).
[Sumin & Xianwu,08] Sumin Q. & Xianwu H. (2008), "Hand tracking and gesture recognition by anisotropic kernel mean shift", in Proc. IEEE Int'l Conf. NNSP, vol. 25.
[Vermaak et al.,03] Vermaak J., Doucet A. & Perez P. (2003), "Maintaining multimodality through mixture tracking", in Proc. IEEE Int'l Conf. ICCV.
[Wang et al.,07] Wang T., Gu I.Y.H. & Shi P. (2007), "Object tracking using incremental 2D-PCA learning and ML estimation", in Proc. IEEE Int'l Conf. ICASSP.
[Wang et al.,08] Wang T., Backhouse A.G. & Gu I.Y.H. (2008), "Online subspace learning on Grassmann manifold for moving object tracking in video", in Proc. IEEE Int'l Conf. ICASSP.
[Wang et al.,08] Wang T., Gu I.Y.H., Backhouse A.G. & Shi P. (2008), "Face Tracking Using Rao-Blackwellized Particle Filter and Pose-Dependent Probabilistic PCA", in Proc. IEEE Int'l Conf. ICIP, San Diego, USA, Oct.
[Welch & Bishop,97] Welch G. & Bishop G. (1997), "SCAAT: incremental tracking with incomplete information", in Proc. 24th Annual Conf. Comp. Graphics & Interactive Techniques.
[Wu et al.,08] Wu P., Kong L., Zhao F. & Li X. (2008), "Particle filter tracking based on color and SIFT features", in Proc. IEEE Int'l Conf. Audio, Language and Image Processing.
[Xu et al.,08] Tu Q., Xu Y. & Zhou M. (2008), "Robust vehicle tracking based on Scale Invariant Feature Transform", in Proc. Int. Conf. Information and Automation (ICIA).
[Xu et al.,05] Xu D., Wang Y. & An J. (2005), "Applying a new spatial color histogram in mean-shift based tracking algorithm", in Proc. Int'l Conf. Image and Vision Comp. New Zealand.
[Yang et al.,04] Yang J., Zhang D., Frangi A.F. & Yang J.-Y. (2004), "Two-dimensional PCA: a new approach to appearance-based face representation and recognition", IEEE Trans. PAMI, vol. 26, no. 1.
[Yilmaz,07] Yilmaz A., "Object tracking by asymmetric kernel mean shift with automatic scale and orientation selection", in Proc. IEEE Conf. CVPR 07.
[Yilmaz et al.,06] Yilmaz A., Javed O. & Shah M. (2006), "Object tracking: A survey", ACM Computing Surveys, vol. 38, no. 4.
[Zhao et al.,08] Zhao C., Knight A. & Reid I. (2008), "Target tracking using mean-shift and affine structure", in Proc. IEEE Int'l Conf. ICPR, pp. 1-5.
[Zhou et al.,08] Zhou H., Yuan Y. & Shi C. (2008), "Kernel-Based method for tracking objects with rotation and translation", Int. Journal Computer Vision.
[Zivkovic & Krose,04] Zivkovic Z. & Krose B. (2004), "An EM-like algorithm for color-histogram-based object tracking", in Proc. IEEE Int'l Conf. CVPR, vol. 1, pp. I

Object Tracking
Edited by Dr. Hanna Goszczynska

ISBN
Hard cover, 284 pages
Publisher InTech
Published online 28, February, 2011
Published in print edition February, 2011

Object tracking consists in estimating the trajectory of moving objects in a sequence of images. Automation of computer object tracking is a difficult task. The dynamics of multiple parameter changes representing features and motion of the objects, and temporary partial or full occlusion of the tracked objects, have to be considered. This monograph presents the development of object tracking algorithms, methods and systems. Both the state of the art of object tracking methods and the new trends in research are described in this book. Fourteen chapters are split into two sections. Section 1 presents new theoretical ideas whereas Section 2 presents real-life applications. Despite the variety of topics contained in this monograph, it constitutes a consistent body of knowledge in the field of computer object tracking. The intention of the editor was to follow up the very quick progress in the development of methods as well as the extension of their applications.

How to reference
In order to correctly reference this scholarly work, feel free to copy and paste the following:
Irene Y.H. Gu and Zulfiqar H. Khan (2011). Online Learning and Robust Visual Tracking using Local Features and Global Appearances of Video Objects, Object Tracking, Dr. Hanna Goszczynska (Ed.), ISBN: , InTech, Available from: http:///books/object-tracking/online-learning-and-robust-visual-tracking-using-local-features-and-global-appearances-of-video-obje

InTech Europe
University Campus STeP Ri, Slavka Krautzeka 83/A, Rijeka, Croatia
Phone: +385 (51) Fax: +385 (51)

InTech China
Unit 405, Office Block, Hotel Equatorial Shanghai, No.65, Yan An Road (West), Shanghai, , China
