IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS PART A: SYSTEMS AND HUMANS 1


Adaptive Appearance Model and Condensation Algorithm for Robust Face Tracking

Yui Man Lui, Student Member, IEEE, J. Ross Beveridge, Member, IEEE, and L. Darrell Whitley

Abstract: We present an adaptive framework for condensation algorithms in the context of human-face tracking. We attack the face tracking problem by making factored sampling more efficient and appearance update more effective. An adaptive affine cascade factored sampling strategy is introduced to sample the parameter space such that coarse face locations are located first, followed by a fine factored sampling with a small number of particles. In addition, the local linearity of an appearance manifold is used in conjunction with a new criterion to select a tangent plane for updating an appearance in face tracking. Our proposed method seeks the best linear variety from the selected tangent plane to form a reference image. We demonstrate the effectiveness and efficiency of the proposed method on a number of challenging videos. These test video sequences show that our method is robust to illumination, appearance, and pose changes, as well as temporary occlusions. Quantitatively, our method achieves an average root-mean-square error of 4.98 on the well-known dudek video sequence while maintaining a proficient speed of 8.74 frames/s. Finally, while our algorithm is adaptive during execution, no training is required.

Index Terms: Adaptive appearance model, adaptive condensation algorithm, face tracking, tangent-plane selection.

Manuscript received November 29, 2008; revised May 27. This work was supported by the National Science Foundation under a grant. This paper was recommended by Guest Editor K. W. Bowyer. The authors are with the Department of Computer Science, Colorado State University, Fort Collins, CO USA (e-mail: lui@cs.colostate.edu; ross@cs.colostate.edu; whitley@cs.colostate.edu). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier /TSMCA

I. INTRODUCTION

Object tracking is an active computer vision research topic [1]. Among many visual objects, human faces are often the subject of interest. Tracking human faces is particularly important in the tracking community, and it has led to many applications, including video surveillance, human-computer interfaces, and biometrics. However, face tracking continues to be a challenging problem due, in part, to nonrigid motion, appearance variations, illumination changes, and occlusions. Some examples of these challenges are shown in Fig. 1.

Fig. 1. Challenges of face tracking. (a) Different expressions. (b) Different poses. (c) Different illuminations. (d) Occlusion.

For visual tracking, observations take place sequentially, and predictions are determined as soon as image frames arrive. Moreover, real-world data are often complex. To succeed, a tracking algorithm needs to be both efficient and effective. Recently, sequential Monte Carlo (SMC) techniques [2]-[4] have received attention because of their ability to escape from local minima and their applicability to non-Gaussian data. For example, these techniques have been demonstrated to be robust to partial occlusion.

A condensation algorithm [5], also known as a bootstrap filter [3], is an example of an SMC method. This technique adopts factored sampling to form a set of weighted samples. As such, the most probable object movements can be located and tracked accordingly. However, most existing condensation algorithms apply sequential importance sampling with a fixed Gaussian envelope [5], [6] and ignore the temporal information and match qualities. This canonical sampling strategy may spread samples widely rather than concentrating them on the most probable regions. Hence, it requires more particles to sample the space. In this paper, we introduce an adaptive condensation algorithm which uses cascade factored sampling to sample the search space.

Facial appearance can change dramatically during a tracking process because of various expressions, poses, and illuminations. The role of an appearance model in face tracking is to capture these variations and model them for a novel image (the incoming frame).

For online tracking, which does not have any prior training, the appearance model must learn the facial variations using previous frames. It is therefore a harder problem but could potentially have more applications. Appearance models can be represented in a variety of ways. The common appearance models are template-based [7], [8], view-based [9], [10], and feature-based models [11], [12]. Our appearance model may be regarded as an online template-based model.

Incremental subspace learning [6], [13], [14] is a popular method for online visual tracking. This technique updates a subspace sequentially, and the tracked object is declared when the minimum distance is found between the observation and the subspace. While these methods update a subspace efficiently, they can only express an image in a linear fashion. It has been argued that images actually reside in a nonlinear space [15], [16]. In this paper, we account for the curved nature of an image space, which we call the appearance manifold, and characterize local linearity to form a tangent plane in which the template is built adaptively.

A condensation algorithm utilizes a factored sampling strategy to form multiple hypotheses and computes the likelihood for each prediction in accordance with an observation. For human-face tracking using a condensation algorithm, two factors play a vital role in robustness. The first is to have effective sampling. This factor determines the geometry of the face, including where it appears in the image. The second is to update the appearance effectively. Even a good sampling strategy will fail if the appearance model does not associate a good predicted appearance with an essentially correct location. Therefore, the appearance model should adapt over time, be able to generate high-fidelity matches where a true face appears, and avoid generating false matches for hypothesized facial geometries that are partially or totally in error.

In this paper, we propose a new adaptive framework for condensation algorithms.
We introduce an adaptive affine cascade factored sampling and adaptive likelihoods to make factored sampling more efficient. Moreover, we model the appearance changes between video frames by exploiting the local linearity of an appearance manifold. A new tangent-plane-selection criterion is proposed, and a reference image is built using the best linear variety on the selected tangent plane. This adaptive appearance model makes image matching robust to various appearances and poses.

The proposed method can be characterized as an online learning paradigm which does not require prior training data. Unlike many tracking algorithms [5], [17], [18] which employ complex dynamical models, we model the state dynamics using Brownian motion. This makes our method generic. Other characteristics of the proposed method include the following: 1) adaptive factored sampling; 2) adaptive noise; 3) adaptive likelihoods; and 4) an adaptive appearance model. For our appearance model, we consider appearances on a manifold and utilize its geometrical properties accordingly. This adaptive framework is the core contribution of this paper. Four video sequences, namely, dudek [19], davidin300 [19], Rams, and Smiley, are employed to assess the performance of the proposed method.

The rest of this paper is organized as follows. Sections II and III review related works and the elements of condensation algorithms. The frameworks of adaptive sampling and adaptive appearance are given in Sections IV and V, respectively. Experimental results are described in Section VI. Discussion and conclusions are provided in Sections VII and VIII, respectively.

II. RELATED WORK

Appearance changes can be caused by pose and/or illumination variations. Isard and Blake [5] propose a condensation algorithm for visual tracking using active contours parameterized by low-dimensional vectors. The key idea of condensation algorithms is the iterative use of factored sampling to generate multiple hypotheses on video sequences.
MacCormick and Isard [20] propose a partitioned sampling for a condensation algorithm which performs the sampling in a hierarchical fashion using a survival rate. Unlike partitioned sampling, our adaptive sampling is a two-stage cascade factored sampling where a coarse sampling is applied to locate the approximate face locations, followed by a fine sampling to fine-tune the face positions.

The task of visual tracking can be considered as an optimization problem. Hager and Belhumeur [7] propose a tracking algorithm utilizing a parametric model. This algorithm employs a low-dimensional linear subspace representation of appearance based upon a set of images acquired prior to the start of tracking. A gradient-based optimization procedure is used during tracking to adjust appearance. Cootes and Edwards [21] introduce active appearance models (AAMs) for deformable objects. This model applies principal component analysis to encode shape and texture information. Typically, a Gauss-Newton method is used to find model parameters that adjust the model to closely match the current video frame. The major drawback of this model is its limited generalization to a generic person; in addition, prior training is needed. Jepson et al. [22] propose a WSL tracker employing a mixture model to account for appearance changes. The image features are extracted using wavelet filters and are modeled using an online expectation-maximization algorithm during the tracking process. However, the appearance update strategy for these mixture models depends upon a Gaussian assumption which may not be valid for real-world data. Instead, we model observations on an appearance manifold.

Visual tracking can be viewed as a classification problem. Avidan [23] formulates the visual tracking problem as a classification task using support vector machines (SVMs), where the target provides the positive examples and the background provides the negative examples. Song et al. [24] combine ensemble classifiers with particle filters for multitarget visual tracking.
These classification-based tracking methods usually require either offline training or a heavy computational load due to a complicated online learning scheme. Yu et al. [25] integrate a generative tracker, which employs multiple subspaces to represent an object, and a discriminative tracker, which is an online SVM. Subspaces are learned and merged online for all appearance variations; in addition, an online SVM classifier is used to focus on recent appearance changes. This method yields very good tracking results. However, it requires heavy computation for subspace merging and updating SVMs and only obtains two frames per second in a C++ implementation.

Adapting to environmental changes is key for visual tracking. Zhou et al. [17] employ a particle filter for tracking and recognition. Moment images are used to build a mixture appearance model for various appearances. The adaptive velocity is computed based on the difference between two successive frames. The adaptive noise is determined based on the quality of prediction. The number of particles is also varied based upon the estimated noise variance. Furthermore, the authors treat occluded pixels as outliers, for which the properties of robust statistics are assumed. In contrast, we do not assume any statistical properties in order to handle occlusion. Li et al. [26] propose a cascade particle filter using three discriminative observers. These discriminative observers are trained using different intervals between video frames. A cascade model is applied for importance sampling. There are three stages of sampling corresponding to these three observers, and they employ 3000, 600, and 200 particles, respectively. In contrast, our cascade factored sampling only needs 400 to 600 particles.

Incrementally updating a subspace is a popular technique for online tracking. Ho et al. [14] introduce an online subspace learning method for visual tracking. The bases of a subspace are constructed from a set of local means. The local means are computed from a set of consecutive frames, and a new observation is constrained to remain within a preset distance of its local mean. The appearance subspace is incrementally updated. This method maintains the temporal neighborhood (recent frames) as the bases for tracking. The local constraint may eventually cause the tracker to lose valuable information and consequently fail to keep tracking the face. In contrast, our method keeps track of the spatial neighborhood, and consequently, key frames can be used to cope with drifting and various appearance changes. Lee and Kriegman [27] utilize a pretrained generic appearance representation in conjunction with an online person-specific appearance model. The authors approximate an appearance manifold using a low-dimensional linear subspace and a probabilistic state-transition Bayesian framework.
The pose subspaces are incrementally updated during tracking. Recently, Ross et al. [6] applied a particle filter and extended the sequential Karhunen-Loeve algorithm. This sequential algorithm incrementally updates a mean image and an associated eigenspace that characterizes the face being tracked. A forgetting factor is suggested to enhance the robustness of tracking. The authors demonstrate that their method outperforms the WSL [22] and the mean shift [28] trackers on the dudek video sequence. Nevertheless, this method needs to employ many particles to achieve good results.

Generally speaking, drifting is the challenge to be overcome by online visual trackers. Matthews et al. [8] discuss the template update problem for drifting. The authors proposed a simple mechanism for the drifting problem, which aligns the updated template with the first frame. Then, they formulated the tracking problem as a search problem for AAMs. In this paper, we attack the drifting problem by adapting the tangent plane and not allowing small perturbations to contaminate our appearance model.

III. CONDENSATION ALGORITHM

To review, the objective of condensation algorithms [5] is to estimate the posterior distribution $p(x_t \mid Y_{1:t})$, where $x_t$ is a state at time $t$ and $Y_{1:t}$ is the observation from time 1 to time $t$. By Bayes' rule, this posterior distribution can be estimated recursively [4]

\[ p(x_t \mid Y_{1:t}) = \kappa \, p(y_t \mid x_t) \, p(x_t \mid Y_{1:t-1}) \tag{1} \]

where $\kappa$ is a normalizing constant, $p(y_t \mid x_t)$ is an observation conditional density, and $p(x_t \mid Y_{1:t-1})$ is a prior density. In practice, $p(y_t \mid x_t)$ is generally multimodal, and $p(x_t \mid Y_{1:t})$ cannot be computed in closed form. However, it is assumed that $p(y_t \mid x_t)$ can be evaluated at points, and therefore, the posterior distribution can be approximated using factored sampling.

The key to factored sampling is to generate $n$ samples $\{s_1, s_2, \ldots, s_n\}$ from $p(x_t)$, where sample $i \in \{1, 2, \ldots, n\}$ is chosen in accordance with probability $\pi_i$ described as follows:

\[ \pi_i = \frac{p(y_t \mid x_t = s_i)}{\sum_{j=1}^{n} p(y_t \mid x_t = s_j)}. \tag{2} \]

Once the posterior distribution is approximated, the maximum a posteriori estimate can be used to draw probabilistic inferences.
Note that increasing $n$ improves the estimate and that, using factored sampling, the approximated posterior distribution is weakly convergent to the true posterior distribution [5].

A. Elements of the Condensation Algorithms

Condensation algorithms iteratively exploit propagation and factored sampling. The posterior distribution is sampled from a set of particles $\{s_{t-1}^{(p)}, \pi_{t-1}^{(p)}\}_{p=1}^{n}$ with $p(x_{t-1} \mid Y_{1:t-1})$. Samples with high probabilities will be sampled multiple times and processed by propagation and observation steps. The four elements of condensation algorithms may be summarized as follows.

1) Initialization: The first step for a condensation algorithm is to form a set of potential states. Each potential state $x_0$ has an associated likelihood $\pi_0$. Let $n$ be the initial number of particles; then, this set of initial particles can be expressed as $\{s_0^{(p)}, \pi_0^{(p)}\}_{p=1}^{n}$, where the superscript denotes the index of a particle and the subscript denotes the time index.
2) Propagation: Propagate samples using a state-transition equation $s_t^{(p)} = s_{t-1}^{(p)} + N(0, \Sigma)$, where $\Sigma$ is the observation noise.
3) Observation: Compute likelihoods from observations $\pi_t^{(p)} = p(y_t^{(p)} \mid s_t^{(p)})$, and normalize such that $\sum_{p=1}^{n} \pi_t^{(p)} = 1$.
4) Resampling: Resample $s_t^{(p)}$ with the probability $\pi_t^{(p)}$ using factored sampling.

The propagation, observation, and resampling process is iterated in each time step.

IV. ADAPTIVE SAMPLING

The canonical condensation algorithm employs a uniform approach to perform sampling. In other words, the same number of new particles is created upon each iteration. In our algorithm, both the number of new particles and the amount of noise vary in response to temporal differences and match qualities. This process is described as follows, beginning with a formal definition of the state space.
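Before the adaptive variant is developed, the canonical loop of Section III-A (propagation, observation, resampling) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `condensation_step`, the `observe` callback, and the per-dimension standard deviations in `sigma` are assumptions introduced only for illustration, and the state is an arbitrary list of floats rather than the six affine parameters.

```python
import math
import random


def condensation_step(particles, observe, sigma, rng):
    """One propagation / observation / resampling iteration (illustrative only).

    particles: list of state vectors (lists of floats) from time t-1.
    observe:   hypothetical callable returning an unnormalized likelihood for a state.
    sigma:     assumed per-dimension standard deviation of the propagation noise.
    rng:       a random.Random instance, so runs are reproducible.
    """
    # Propagation: s_t = s_{t-1} + N(0, Sigma), i.e. Brownian-style dynamics.
    moved = [[x + rng.gauss(0.0, sd) for x, sd in zip(s, sigma)] for s in particles]

    # Observation: evaluate each hypothesis and normalize the weights to sum to 1.
    raw = [observe(s) for s in moved]
    total = sum(raw)
    if total > 0.0:
        weights = [w / total for w in raw]
    else:
        # Degenerate case: fall back to uniform weights.
        weights = [1.0 / len(moved)] * len(moved)

    # Point estimate: the particle with the largest normalized weight.
    best = moved[max(range(len(moved)), key=weights.__getitem__)]

    # Resampling: draw n particles with probability proportional to their weights.
    resampled = rng.choices(moved, weights=weights, k=len(moved))
    return resampled, weights, best
```

A toy usage, assuming a one-dimensional state and a Gaussian-shaped likelihood centered on a made-up target value of 5.0, would call `condensation_step(particles, lambda s: math.exp(-(s[0] - 5.0) ** 2), [0.5], random.Random(0))` repeatedly; the particle cloud then drifts toward the target because high-weight samples are duplicated at every resampling step.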


A. State Space Model

Face tracking is accomplished by fitting a rectangular bounding box to a region of the image centered upon the face. The position, size, and shape of the box relative to the current frame of video are determined by an affine transformation defined by the following:

\[ \begin{bmatrix} u \\ v \end{bmatrix} = \begin{bmatrix} \cos(\theta) & -\sin(\theta) & dx \\ \sin(\theta) & \cos(\theta) & dy \end{bmatrix} \begin{bmatrix} 1 & q & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} sx & 0 & 0 \\ 0 & sy & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}. \tag{3} \]

In this equation, $x$ and $y$ are the 2-D homogeneous input coordinates, and $u$ and $v$ are the affine transformed coordinates. The space of affine transformations is the state space searched by the face tracking algorithm. As (3) shows, affine transformations can be characterized by six control parameters $[sx, sy, \theta, q, dx, dy]$, where $sx$ is the horizontal scaling, $sy$ is the vertical scaling, $\theta$ is a rotation angle, $q$ is a skew parameter, $dx$ is the horizontal translation, and $dy$ is the vertical translation. Hence, our state vector, denoted as $s_t$ at time $t$, consists of the six parameters of an affine transformation, and the objective of face tracking is to estimate these hidden variables.

B. Adaptive Affine Cascade Factored Sampling

Traditional condensation algorithms [5], [6], [26] employ many particles to sample a state space, making condensation inefficient and impractical for large spaces. To make sampling effective and efficient, we propose a cascade factored sampling with coarse and fine factored sampling strategies, shown in Fig. 2. As Fig. 2 shows, our cascade factored sampling consists of two levels of propagation and observation in each time step. The purpose of coarse sampling (see footnote 1) is to approximately locate the face position; then, the face position is fine-tuned during the fine sampling process. Specifically, the $[sx\ sy\ dx\ dy]$ parameters are coarsely sampled while $[\theta\ q]$ remains unchanged in order to approximately locate the face. The number of particles used in coarse sampling is between 400 and 600. Particles with high probabilities are selected for fine sampling with all six affine parameters. In our experiments, we select the top 100 particles to perform fine sampling. In the fine sampling step, the noise of $[sx\ sy\ dx\ dy]$ is set to small constants such that only a small adjustment is made, and the $[\theta\ q]$ noise depends upon a match quality estimate, as described in the next section.

Fig. 2. Illustration of our cascade factored sampling. Every time step consists of a coarse and a fine sampling.

Footnote 1: For the sake of brevity, here and in the future, we will drop the adjective "factored" in front of "sampling."

C. Adaptive Noise

Samples are propagated by adding observation noise to states. Unlike traditional condensation algorithms, which add a fixed amount of noise to every image frame, we determine the amount of observation noise based on the temporal difference of each state parameter. Let $u_t^{(i)}$ be the element of one of the state parameters in $[sx\ sy\ dx\ dy]$, where the superscript indicates the index position. In the coarse sampling step, the amount of observation noise is determined based on the following criterion:

\[ \Sigma_u = \max\left( \begin{bmatrix} a_1 \\ a_2 \\ a_3 \\ a_4 \end{bmatrix}, \begin{bmatrix} b_1 & & & 0 \\ & b_2 & & \\ & & b_3 & \\ 0 & & & b_4 \end{bmatrix} \begin{bmatrix} |du^{(1)}| \\ |du^{(2)}| \\ |du^{(3)}| \\ |du^{(4)}| \end{bmatrix} \right), \quad \text{where } \begin{bmatrix} du^{(1)} \\ du^{(2)} \\ du^{(3)} \\ du^{(4)} \end{bmatrix} = \begin{bmatrix} u_t^{(1)} - u_{t-1}^{(1)} \\ u_t^{(2)} - u_{t-1}^{(2)} \\ u_t^{(3)} - u_{t-1}^{(3)} \\ u_t^{(4)} - u_{t-1}^{(4)} \end{bmatrix}. \tag{4} \]

In this context, max is an elementwise maximum operation. The term $a_i$ ensures that a minimum amount of noise will always be added to an observation. The term $b_i$ weighs the $L_1$ norm of the temporal difference. In our experiments, we set $a$ to $[0.03, 0.03, 2, 2]^T$ and $\mathrm{diag}(b)$ to $[0.5, 0.5, 1, 1]^T$.

While coarse sampling is concerned with scale and translation and uses temporal differences to adjust the amount of noise, fine sampling adds a noise term for rotation and skew based upon the quality of the match. Formally, let $v$ be the element of one of the state parameters in $[\theta\ q]$. In the fine sampling step, the amount of observation noise is determined as follows:

\[ \Sigma_v = \tau \begin{bmatrix} c_1 \\ c_2 \end{bmatrix} \tag{5} \]

where $c_i$ is determined empirically, and $\tau$ is a match quality measure defined in the next section. In our experiments, we set $c$ as $[0.003, 0.003]^T$. In general, the parameters $a$, $b$, and $c$ control the range of sampling.
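The two noise rules in (4) and (5) reduce to a few lines of arithmetic, sketched below under stated assumptions. The function names `coarse_noise` and `fine_noise` are hypothetical, the defaults simply repeat the values of a, b, and c quoted in the text, and the sketch does not decide whether the returned values are used as variances or standard deviations, since the text only calls them "the amount of observation noise."

```python
def coarse_noise(u_now, u_prev, a=(0.03, 0.03, 2.0, 2.0), b=(0.5, 0.5, 1.0, 1.0)):
    """Eq. (4): elementwise max of the floor a_i and b_i * |u_t - u_{t-1}|.

    u_now, u_prev: [sx, sy, dx, dy] at times t and t-1.
    """
    return [max(ai, bi * abs(un - up))
            for ai, bi, un, up in zip(a, b, u_now, u_prev)]


def fine_noise(tau, c=(0.003, 0.003)):
    """Eq. (5): noise for [theta, q] scales linearly with the match quality tau."""
    return [tau * ci for ci in c]
```

For instance, with a previous state `[1.0, 1.0, 100.0, 80.0]` and a current state `[1.02, 1.0, 106.0, 80.5]`, only the horizontal translation moved by more than its floor, so `coarse_noise` returns `[0.03, 0.03, 6.0, 2.0]`: the floor a applies to the three nearly static parameters, and the six-pixel jump in dx widens the search along that axis.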
The specific settings for $a$, $b$, and $c$ just described were chosen based upon pilot experiments conducted using the dudek video sequence [19]. These settings are then held constant in all the tracking experiments presented in Section VI. Also in a pilot experiment, the importance of adaptive sampling and adaptive noise was tested by running a side-by-side comparison of our algorithm using both and a variant of our algorithm with both disabled. The test was also run on the dudek video sequence. From this comparison, we note that the number of particles had to be increased to 1000 to obtain reasonable tracking performance, whereas the adaptive version of our algorithm uses between 400 and 600 particles. Even with the additional particles, the average root-mean-square error (rmse) increased by 10% relative to the result for our adaptive algorithm. This outcome indicates that the adaptive components of our algorithm not only result in a more efficient tracking algorithm but also improve tracking accuracy.

V. ADAPTIVE APPEARANCE

Every particle has an associated affine transformation, and this is used to warp a video frame to a canonical geometry. The canonical geometry is defined as the face chip from the first frame, which is selected manually in our experiments. Formally, let $y_t^{(p)} \in \mathbb{R}^m$ be an observation given by a state $s_t^{(p)}$ at time $t$, defined as

\[ y_t^{(p)} = W\left(I_t, s_t^{(p)}\right) \tag{6} \]

where $W$ is a warp operation (see footnote 2), $I_t$ is an image frame at time $t$, and $s_t^{(p)}$ is the $p$th state vector at time $t$. In general, the likelihood of an observation depends upon the quality of the match between the warped and reference images. In the context of our face tracking, a reference image is a template built from an appearance model, and it is used to match all possible face candidates.

Before we discuss how to adaptively build this reference image, we first define our likelihood function and demonstrate our strategy for narrowing the envelope of the likelihood function when a state is found to be a good match. The rationale is to focus the search more finely in the regions of state space where matches are good. Then, we will describe the appearance model and our procedure for adapting the model. The model is based on a local linear approximation to the appearance manifold. The adaptation process is based on a new tangent-plane-selection criterion.

A. Adaptive Likelihood Model

A likelihood function measures the quality of an observation.
As is common, the likelihood depends upon the $L_2$ distance between the observation $y_t^{(p)}$ at time $t$ and the reference image $z_{t-1}$ created from the appearance model at time $t-1$. Specifically, the likelihood is defined as

\[ p\left(y_t^{(p)} \mid s_t^{(p)}\right) = \exp\left( -\frac{\left\| z_{t-1} - y_t^{(p)} \right\|^2}{\sigma^2} \right). \tag{7} \]

What makes our likelihood adaptive is our introduction of a dependence between $\sigma^2$ and the best match among the current observations through an intermediate variable $\tau$. Specifically, $\tau$ represents the smallest $L_2$ distance between the reference image and the observations at time $t$

\[ \tau = \min_{p} \left\| z_{t-1} - y_t^{(p)} \right\|^2. \tag{8} \]

Then, the adaptive variance is defined as

\[ \sigma^2 = \max(\tau, T)\, \sigma_0^2. \tag{9} \]

The initial variance $\sigma_0^2$ and the minimum noise threshold $T$ are user-defined parameters. As (9) shows, the adaptive variance is proportional to the match quality $\tau$, and the match threshold $T$ ensures that at least some minimum amount of variation is always applied to the sampling process.

Footnote 2: In this paper, the warp operation is defined as an affine transformation, (3), followed by a cropping operation.

Fig. 3. Essence of adaptive likelihood sampling is adjusting the variance of the noise model. (a) Intensify sampling when confidence in match is high. (b) Diversify sampling when confidence in match is low.

In condensation algorithms, the purpose of computing likelihood probabilities $p(y_t^{(p)} \mid s_t^{(p)})$ is to govern the factored sampling process, where particles are resampled in order to concentrate them in the regions of the search space with the highest likelihood. The adaptive variance $\sigma^2$ intensifies search over a narrow range of affine transformations when the match quality is high and broadens the range of transformations when the match quality is low. This relationship between intensification and diversification is shown in Fig. 3. We also increase the number of particles by a factor of 1.5 when $\tau$ is larger than or equal to $T$, indicating a low match quality.
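The interplay of (7), (8), and (9) can be made concrete with a short sketch. It rests on assumptions that go beyond the text: the norm in (7) and (8) is read here as a squared Euclidean distance divided directly by the adaptive variance, images are flat lists of floats, and the names `adaptive_likelihoods` and `particle_count` are hypothetical.

```python
import math


def sq_dist(z, y):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((zi - yi) ** 2 for zi, yi in zip(z, y))


def adaptive_likelihoods(reference, observations, sigma0_sq, threshold):
    """Normalized likelihoods with the adaptive variance of eqs. (7)-(9).

    reference:    reference image z_{t-1} as a flat list of floats.
    observations: warped candidates y_t^(p), one flat list per particle.
    sigma0_sq:    initial variance sigma_0^2 (user-defined).
    threshold:    minimum noise threshold T (user-defined, must be positive).
    """
    dists = [sq_dist(reference, y) for y in observations]
    tau = min(dists)                                # eq. (8): best match quality
    sigma_sq = max(tau, threshold) * sigma0_sq      # eq. (9): adaptive variance
    raw = [math.exp(-d / sigma_sq) for d in dists]  # eq. (7), assumed form
    total = sum(raw)
    return [r / total for r in raw], tau, sigma_sq


def particle_count(base_n, tau, threshold, factor=1.5):
    """Use 1.5x the base particle count when tau >= T (low match quality)."""
    return int(round(base_n * factor)) if tau >= threshold else base_n
```

With a good match (small tau), the variance collapses to `threshold * sigma0_sq`, so the weights concentrate on the few best candidates; with a poor match, the variance grows with tau and the weights flatten, which is the intensify/diversify behavior the text describes.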
The rationale for increasing the number of particles is that the match quality will degrade if the tracker is starting to fail to keep up with a moving face, and increasing the number of particles increases the chances of finding a better affine transformation and reacquiring the face.

B. Adaptive Appearance Model

The proposed adaptive appearance model is motivated by the topology of a space where images are viewed as residing on an abstract image space [16]. We call this image space the appearance manifold. One advantage of making the topology of the appearance manifold explicit is that it allows us to take advantage of adjacent appearances in the sense of spatial neighborhood rather than temporal neighborhood. The spatial coherence admits local linearity even though the appearance manifold is globally curved. Since images on an appearance manifold exhibit strong spatial coherence, the appearance model can reconstruct a better reference image. Before showing the benefits of spatial neighborhood, we describe how the appearance manifold is modeled.

Key operations for our tracking algorithm include forming an appearance manifold, selecting a tangent plane, and building a reference image based upon the tangent plane. These operations are iteratively applied to every frame during tracking such that the appearance is adapted over time. The next three sections describe these three operations.

1) Appearance Manifold Formation: Although frame-to-frame changes in face appearance are usually small, over many frames the appearance changes, and trackers reliant upon fixed templates will perform poorly. Matthews et al. [8] discuss the template update problem by which a tracker may adapt over time. We apply a similar drift correction technique in our appearance manifold update scheme. Formally, let us define a location-independent state as $\hat{s}_t^{(p)} = [sx, sy, \theta, q]^{(p)}$ and the best observation $Y$ to be the $y_t^{(p)}$ associated with the particle $p$ that yields the highest likelihood $p(y_t^{(p)} \mid s_t^{(p)})$ at time $t$. We declare an observation $Y$ as an appearance image (key frame) and place it on the appearance manifold when the following conditions are satisfied:

\[ \tau < T, \qquad \epsilon_1 < \min_{j} \left\| \hat{s}^{(p)} - \hat{s}^{(j)} \right\|_2 < \epsilon_2 \tag{10} \]

where $\hat{s}^{(p)}$ is associated with $Y$, and the index $j$ ranges over the appearance images previously appended to the appearance manifold. The first condition examines the match quality, while the second condition ensures that the pose variability is within a range defined by $\epsilon_1$ and $\epsilon_2$. Since the first observation $y_0$ is manually selected, it is automatically appended to the appearance manifold. In addition, we replace the $j$th observation on the appearance manifold when the number of observations exceeds a predefined threshold (40 in our experiments), where $j$ is defined as

\[ j = \arg\min_{j} \left\| \hat{s}^{(p)} - \hat{s}^{(j)} \right\|_2. \tag{11} \]

This update scheme essentially replaces the observation on an appearance manifold that is nearest to the new observation.

2) Tangent-Plane Selection: The distance between an image and a manifold can be characterized using a tangent plane [15]. However, tangent planes are not unique on an appearance manifold. Most existing algorithms employ all available images on an appearance manifold to form a tangent plane. In this paper, we introduce a new criterion for tangent-plane selection.

First, we formally define a tangent plane. Let $F$ be some unknown transformation acting on appearances. The first-order approximation of an appearance manifold is represented as

\[ F_\alpha = F_0 + \nabla F \alpha + \text{H.O.T.} \tag{12} \]

where $F_0$ is an appearance image on an appearance manifold, $\nabla F$ denotes the tangent vectors, and $F_\alpha$ is the reconstructed appearance image. The purpose of tangent distance is to seek a reconstructed image using tangent vectors such that the distance between the reconstructed image and an observed image $Y$ is minimized. In this context, $Y$ is the observation $y_t^{(p)}$ having the highest likelihood in (7). Mathematically, this may be expressed as

\[ \min_{\alpha} \left\| F_0 + \nabla F \alpha - Y \right\|^2. \tag{13} \]

In the context of face tracking, there are many alternative ways to select $F_0$ and $\nabla F$ given an observation $Y$, and these alternatives play an important role in building a reference image. To explore three possible alternatives, a simplified illustration is helpful. Fig. 4 shows an observation $Y$ in relation to nine appearance images on the appearance manifold. These appearance images are denoted as $g, g_1, \ldots, g_8$. Furthermore, in this example, we assume that three tangent vectors are sufficient to approximate a tangent plane.

Fig. 4. Example of tangent-plane selection, where $Y$ is an observation and $g_k$ is the appearance on a manifold.

1) Set $Y$ to be equal to $F_0$, and select the three nearest neighbors ($g$, $g_3$, and $g_4$) of $Y$ to form the tangent vectors. In this case, there is a trivial solution ($\alpha = 0$) that makes $Y$ the reference image. This is known as naive update [8] and has been shown to drift easily.
2) Set $g$ to be equal to $F_0$ and three of its nearest neighbors ($g_1$, $g_3$, and $g_4$) to form the tangent vectors. As Fig. 4 shows, these appearance images may not be the best images to reconstruct $Y$. We call this direct appearance update.
3) Set $g$ to be equal to $F_0$ and the three nearest neighbors ($g_3$, $g_4$, and $g_8$) of $Y$ to form the tangent vectors around $g$. As Fig. 4 shows, the nearest images of $Y$ are enclosed in a blue box. These appearance images typically provide a much better reconstruction. We call this adaptive appearance update.

Based on these observations, we choose the adaptive appearance update as our tangent-plane-selection criterion. This criterion provides better spatial coherence of images on an appearance manifold.
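The difference between the direct and adaptive updates, together with the minimization in (13), can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: images are flat lists of floats, the base image $F_0$ is taken as the manifold image nearest to $Y$, the tangent vectors are formed as differences between the base image and the chosen neighbors, and the least-squares problem is solved through the normal equations with a tiny ridge term added purely for numerical safety. All function names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(images, target, k, skip=None):
    """Indices of the k images closest to target, optionally skipping one index."""
    order = sorted(range(len(images)), key=lambda i: sq_dist(images[i], target))
    return [i for i in order if i != skip][:k]


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def tangent_reconstruction(manifold, Y, k, adaptive=True, ridge=1e-9):
    """Reconstruct Y on a tangent plane anchored at the manifold image nearest to Y.

    adaptive=True picks the k neighbors nearest to Y (adaptive update);
    adaptive=False picks the k neighbors nearest to the base image (direct update).
    """
    base = nearest(manifold, Y, 1)[0]
    g = manifold[base]
    nbrs = nearest(manifold, Y if adaptive else g, k, skip=base)
    # Tangent vectors g - g_i around the base image.
    V = [[gj - nj for gj, nj in zip(g, manifold[i])] for i in nbrs]
    r = [yj - gj for yj, gj in zip(Y, g)]
    # Normal equations (V V^T + ridge I) alpha = V (Y - g); the ridge is an assumption.
    gram = [[sum(p * q for p, q in zip(vi, vj)) + (ridge if i == j else 0.0)
             for j, vj in enumerate(V)] for i, vi in enumerate(V)]
    rhs = [sum(p * q for p, q in zip(vi, r)) for vi in V]
    alpha = solve(gram, rhs)
    z = [gj + sum(a * v[d] for a, v in zip(alpha, V)) for d, gj in enumerate(g)]
    return z, alpha, nbrs
```

The only difference between the two modes is the point around which neighbors are ranked. When the manifold is curved, neighbors ranked around $Y$ tend to lie on the side of the base image that faces the observation, which is the spatial-coherence argument made above.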
Our experimental results indicate that the proposed adaptive appearance framework is resilient to drifting. Further analysis is given in Section VII-B.

3) Reference Image Update: To update the reference image, we first select the base image $g$ on the appearance manifold, which is the nearest observation of $Y$, as follows:

\[ g = \arg\min_{j} \left\| Y - g_j \right\|_2 \tag{14} \]

where $g_j$ is an appearance image on the appearance manifold. Because local linearity is a reasonable presumption for the appearance manifold, we can form a tangent plane using a set of $k$ nearest neighbors expressed as

\[ T_g M : g + \mathrm{span}\{g - g_1, g - g_2, \ldots, g - g_k\} \tag{15} \]

where $\{g_1, g_2, \ldots, g_k\}$ are the nearest $k$ neighbors of $Y$, and $\{g - g_1, g - g_2, \ldots, g - g_k\}$ are the tangent vectors around $g$. In our experiment, we set $k$ to be equal to 15. In the case where the number of images on the appearance manifold is less than $k$, we use all the images that are currently available on the appearance manifold.

As discussed in the previous section, tangent-plane selection plays an important role in building a reference image. We should note that the reference image resides on a tangent plane $T_g M$. More specifically, the reference image is formed by seeking the best linear combination of the tangent vectors at $g$ relative to the observation $Y$. Thus, the $k$ nearest neighbors (a dashed ellipse in Fig. 5) govern the quality of the reference image.

Fig. 5. Direct appearance update, the adaptive appearance update, and an appearance manifold. The blue dot represents a reference image residing on a tangent plane. Different appearance updates select different sets of tangent bases; different tangent bases form a different tangent plane and, therefore, a different reference image. (a) Direct update. (b) Adaptive update.

Fig. 5 shows how the reference image changes when we apply alternative tangent-plane-selection criteria. Fig. 5(a) shows the direct appearance update and further shows how the set of basis images may not be spatially closest to the observation $Y$. In this case, the reference image may not be coherent relative to the topology of the appearance manifold. The adaptive appearance update shown in Fig. 5(b) considers the spatial arrangement on the appearance manifold and builds the reference image using a set of nearest bases. This approach, which respects the topology of the appearance manifold, is much more likely to create a reference image from a coherent choice of basis images.

Recall that a linear variety allows a hyperplane to shift from its origin. In this paper, we further extend the linear combination to the best linear variety such that the tangent plane would be shifted closer to the observation. In other words, changes in the overall image brightness are trivially accommodated. This extension can be written in a matrix form as

\[ \begin{bmatrix} z_1 \\ z_2 \\ \vdots \\ z_m \end{bmatrix} = \begin{bmatrix} g_1 \\ g_2 \\ \vdots \\ g_m \end{bmatrix} + \begin{bmatrix} 1 & V_1^{(1)} & \cdots & V_1^{(k)} \\ 1 & V_2^{(1)} & \cdots & V_2^{(k)} \\ \vdots & \vdots & & \vdots \\ 1 & V_m^{(1)} & \cdots & V_m^{(k)} \end{bmatrix} \begin{bmatrix} \alpha_0 \\ \alpha_1 \\ \vdots \\ \alpha_k \end{bmatrix} \tag{16} \]

where $V^{(i)}$ denotes $\{g - g_i \mid i = 1, \ldots, k\}$, and the additional column of ones is augmented for brightness compensation; thus, any linear variety can be reconstructed with a suitable $\alpha$. Here, the subscripts on $z$, $g$, and $V^{(i)}$ index the $m$ pixels of each image. Now, the goal is to seek the best $\alpha$ such that the difference between the linear variety and an observation $Y$ is minimized as follows:

\[ \min_{\alpha} \left\| g + V\alpha - Y \right\|^2. \tag{17} \]

Taking a derivative with respect to $\alpha$, the solution becomes

\[ \alpha = V^{-1}(Y - g) \tag{18} \]

where, because $V$ is an $m \times (k+1)$ matrix and generally not square, $V^{-1}$ is understood as the pseudoinverse of $V$. Substituting $\alpha$ back into (16), the reconstructed appearance image $z_t$ at time $t$ is the best linear variety against $Y$ and will be used as a reference image for computing likelihoods at time $t+1$.

To summarize our adaptive condensation algorithm, we sketch our method in Algorithm 1. As Algorithm 1 reveals, our method exhibits four major elements. They are propagation and observation, tangent-plane selection, reference image update, and appearance manifold update. As such, the propagation and observation step can be considered as an adaptive sampling process, while the tangent-plane selection, reference image update, and appearance manifold update can be regarded as an adaptive appearance framework. Both elements are interdependent and key ingredients for our face tracker.

Algorithm 1 Our Adaptive Condensation Algorithm
1: Initialize particles
2: while frame ≠ empty do
3:   Apply propagation and observation
4:     Coarse factored sampling
5:       Compute likelihoods (7)
6:       Normalize
7:       Select top particles
8:     Fine factored sampling
9:       Compute likelihoods (7)
10:      Normalize
11:  Select a tangent plane
12:  Update a reference image
13:    Solve α (18)
14:    Build a reference image z_t (16)
15:  Update an appearance manifold
16: end while

VI. EXPERIMENTAL RESULTS

We demonstrate the effectiveness and robustness of our adaptive condensation algorithm using four challenging video sequences. These video sequences are acquired indoors using hand-held video cameras so that both camera motion and human motion are present concurrently. The dudek video sequence [19] is a well-known benchmark video for face tracking and comes with hand-labeled ground-truth positions. The second video sequence, davidin300 [19], has mixed shadowing


and pose variations. The third and fourth video sequences were collected at Colorado State University,³ namely, the Rams and Smiley video sequences. These video sequences also exhibit various appearances, poses, and occlusions. Example frames from these videos are shown in Fig. 6. In our experiments, the initial tracking position is manually selected. The green bounding boxes shown in Fig. 6 indicate where our tracker has found the face in these frames.

In the next two sections, we will discuss specific aspects of performance as it relates to illumination, appearance, pose, and occlusion. This highlights how aspects of our algorithm relate to these specific challenges. Then, in Section VI-C, we present our quantitative evaluation.

Fig. 6. Demonstration of robustness to illumination, appearance and pose variations, and occlusions. (a) Dudek video sequence. (b) Davidin300 video sequence. (c) Rams video sequence. (d) Smiley face video sequence.

Fig. 7. Examples of occlusion recovery.

³ Available at http:// vision/smc-facetracking/.

A. Illumination, Appearances, and Poses
Lighting, appearance, and pose variations are the primary challenges in face tracking. It is imperative for a robust


algorithm to handle such variations. The sample tracking results shown in Fig. 6(a)–(d) show that our method is successfully tracking the human face under variations in all three of these factors. In addition, the video sequences include additional confounding factors such as pronounced facial shadowing, occlusion of the face, and changes in facial expression and pose.

Our appearance model is robust to lighting, appearance, and pose changes because it is adaptive. This is accomplished by rebuilding the reference image in every frame. To be specific, our tangent-plane selection is driven by the observation (see Section V-B2), so that the lighting, appearance, and pose of the observation are considered when we select the tangent vectors. To put this another way, our reference image is always created from a set of prior images that are most similar to the current image, even when that means it is best to step over the recently acquired images in favor of images seen less recently but exhibiting characteristics that are more useful for explaining the appearance of the current frame. Finally, from a practical standpoint, it is also important that our method compensates for the overall changes in brightness by using a linear variety that enables the tangent plane to shift away from the origin.

B. Occlusion
Focusing in more precisely on occlusion, Fig. 7 shows successive frames of a video sequence in which our algorithm is successfully handling occlusion. In Fig. 7, the white bounding boxes are sampling positions, and the green bounding box is the face position with the highest posterior probability. In the first few frames, where the face is fully visible, the number of sampling regions is small. As the hand moves in front of the chin in the middle frame of the top row, the expansion of the sampling pattern is immediately evident. Then, in successive frames where the face is partially occluded, the sampling pattern continues to expand.
This adaptive sampling allows the algorithm to react and avoid losing a face even when the face is moving and/or temporarily occluded.

There are two primary reasons why our method is robust to temporary occlusion. First, our adaptive sampling strategy is able to follow the movement of the face since we are using the temporal differences to adjust the amount of noise. Second, because only observations with a quality score above a threshold described in (10) are added to the model, occluded objects will not typically participate in the process of building a reference image, and consequently, they do not corrupt the appearance model.

Of course, these strategies have their limits. The longer a face is occluded, the higher the risk that the tracker will expand its search region to such an extent that it will lose the object. It is also conceivable, although we have not observed such a case, that very slow introduction of ever more occluded images of a face might allow partially covered instances of the face to be added to the reference model.

C. Quantitative Evaluation
Ground-truth face location information is available for three video sequences: dudek, Rams, and Smiley. For these, we are able to carry out a quantitative evaluation. Each video frame in these sequences is annotated with seven fiducial positions. We employ the rmse between the ground-truth and estimated positions to assess tracking performance. An example of ground-truth and estimated positions is shown in Fig. 8, where the red crosses are ground-truth positions and the green crosses are the estimated positions.

Fig. 8. Ground-truth and estimated positions, where ground-truth positions are red and estimated positions are green.

We directly compare our tracking results with the state-of-the-art tracking algorithms that exploit the advantages of a learned subspace method [14] and incremental visual tracking (IVT) [6]. Both these methods are online subspace learning techniques where the learned subspace method constructs a set of local means as the bases of a subspace and the IVT learns the subspace incrementally with a forgetting factor.
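As an illustration of the rmse measure used in this evaluation, the following minimal sketch computes a per-frame rmse over a set of fiducial points and an average over a sequence. The exact formula is not spelled out in this section, so the definition here (root of the mean squared Euclidean distance over the seven points of a frame) is an assumption, and the function names `frame_rmse` and `average_rmse` are hypothetical.

```python
import numpy as np

def frame_rmse(estimated, ground_truth):
    """Per-frame rmse over fiducial points (assumed definition, see lead-in).

    Both inputs are arrays of shape (n_points, 2) holding (x, y) positions.
    """
    est = np.asarray(estimated, dtype=float)
    gt = np.asarray(ground_truth, dtype=float)
    sq_dist = np.sum((est - gt) ** 2, axis=1)  # squared distance per point
    return float(np.sqrt(np.mean(sq_dist)))

def average_rmse(est_seq, gt_seq, first_n=None):
    """Mean of the per-frame rmse, optionally over the first `first_n` frames."""
    per_frame = [frame_rmse(e, g) for e, g in zip(est_seq, gt_seq)]
    if first_n is not None:
        per_frame = per_frame[:first_n]
    return float(np.mean(per_frame))
```

The optional `first_n` argument is included only so that an average can be restricted to an initial run of frames, for instance the portion of a video before any tracker has lost the face.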
The quantitative results for the dudek, Rams, and Smiley video sequences are shown in Fig. 9 and summarized in Table I. As can be seen in Fig. 9(a), all three algorithms successfully track the face over the full duration of the dudek video. This is not true for the Rams video [Fig. 9(b)] or the Smiley video [Fig. 9(c)], where distinct points in the video can be observed where an algorithm experiences a substantial jump in rmse and then never recovers. For example, in the Rams video [Fig. 9(b)], a distinct jump in the rmse for the IVT algorithm can be seen at frame 99. A similar jump can be seen for the learned subspace algorithm at frame 532. Notably, our method maintains a relatively stable, and comparatively low, rmse over the entire duration of the video for both the Rams and Smiley video sequences.

Average rmse values over the entire dudek video for all three algorithms are presented in the first three rows of Table I. The average rmse values for the learned subspace method and IVT are 6.3 and 5.32,⁴ respectively, whereas our method achieves 4.98 average rmse. More importantly, our method only employs 400 to 600 particles and runs at 8.74 frames per second, whereas the IVT uses 4000 particles and runs at 1.1 frames per second. Note that all experiments are run on the same machine, and so, this difference in time required per frame is a reliable indication of just how much more efficient the adaptive framework is.

Since both the IVT and learned subspace algorithms lose track of the face at some point in both the Rams and Smiley video sequences, Table I presents the average rmse values in two ways. First, the average rmse values are presented over the full video sequences, and as a result of the algorithms losing track of the face, the average rmse is very high for both algorithms on both videos. To address the question of how well the algorithms are doing relative to each other over the portion of the video where all three algorithms remain intact in tracking, Table I also shows the average rmse over just the first 99 and 273 frames of the Rams and Smiley videos, respectively. Even on these truncated video sequences, our method still outperforms the learned subspace method and IVT in both Rams and Smiley video sequences, where our method achieves 3.51 and 6.43 average rmse values for the Rams and Smiley video sequences, respectively.

⁴ The MATLAB source code was downloaded from http:// edu/~dross/ivt/, and we used the best parameter settings suggested by the authors.

Fig. 9. Quantitative comparisons to learned subspace [14] and IVT [6] for face tracking. (a) Dudek video sequence. (b) Rams video sequence. (c) Smiley video sequence.

The key differences between our method and the subspace methods tested in this paper are the data representation of the appearance model and the update strategy. While both subspace methods build a subspace incrementally, we use key frames and represent them on an appearance manifold. As we consider key frames on an appearance manifold, the geometric interpretation is taken into account. We apply the property of local linearity and make a local update for a tangent plane instead of an incremental update for a subspace like [6] and [14]. As such, our bases are a set of local⁵ (similar) key frames rather than global bases. This spatial coherency allows us to select a tangent plane from adjacent appearances and build a better reference image. In contrast, the subspace methods such as [6] and [14] adopt all the bases to form a subspace and compute the subspace distance for each observation. Therefore, these methods ignore the spatial coherency, which plays an important role in building a good reference image.

Finally, we note that all the tracking results are obtained using MATLAB on an HP-xw4600-Core2Duo-SATA 2x2.83G 64-b machine. The tracking results, tracking videos, and the landmark points can be downloaded from http://

VII. DISCUSSION
A. Drift Analysis
Template update is an essential step to account for various expression, pose, and illumination changes. However, it can easily result in a tracker drifting off the object of interest, in this case, a particular person's face. To combat drifting, we need to first understand that drift is caused by accumulated error in a template position. Whenever the template is updated, small errors accumulate in the position of the template. Once drifting starts, it can rapidly cause a tracker to lose the face. The best way to minimize drift is to avoid the small errors in the first place, and this, in turn, requires that the adaptive model strikes the correct balance between generality and specificity. Controlling which observations (images) are added to the appearance manifold, together with adaptation to the current frame when constructing the reference image, helps our approach strike this balance.

Our approach is selective about what observations are added to the appearance manifold. As defined in (10), there are three aspects to the screening of images that are important. First, there must be a good match between the observation and the reference image. Second, if the change in position, which is the magnitude of the transformation between frames, is very large, the observation is not added to the appearance manifold. These steps ensure the integrity of the appearance manifold. Third, only observations with a sufficient change in position are added to the appearance manifold. This protects against the appearance manifold becoming overwhelmed with a set of nearly identical images.

The steps just described do a lot to balance generality and specificity in the construction of the appearance manifold. However, that alone is not enough. Only observations related to the current frame should contribute to the construction of the reference image, and our algorithm accomplishes this through the tangent-plane construction process described in Section V-B2.
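To make the tangent-plane construction concrete, here is a minimal NumPy sketch of choosing nearby key frames and fitting the linear variety of (16)–(18) to an observation. It is an illustration under stated assumptions, not the authors' implementation: images are flattened to vectors, nearness is measured by Euclidean distance, the base point $g$ is taken to be the key frame closest to the observation, and the function names are hypothetical.

```python
import numpy as np

def nearest_key_frames(key_frames, observation, k):
    """Pick a base point g and its k neighbors by distance to the observation.

    key_frames: array (n_frames, m) of flattened images.
    Euclidean distance and "closest frame as g" are assumptions of this sketch.
    """
    dists = np.linalg.norm(key_frames - observation, axis=1)
    order = np.argsort(dists)
    return key_frames[order[0]], key_frames[order[1:k + 1]]

def build_reference_image(g, neighbors, observation):
    """Fit z = g + V @ alpha to the observation in the least-squares sense."""
    # Columns of V: a column of ones (brightness), then tangent vectors g - g_i.
    tangents = (g[None, :] - neighbors).T            # shape (m, k)
    V = np.hstack([np.ones((g.size, 1)), tangents])  # shape (m, k + 1)
    # Solves min ||g + V alpha - Y||^2, i.e. (17), with the pseudoinverse of V.
    alpha, *_ = np.linalg.lstsq(V, observation - g, rcond=None)
    z = g + V @ alpha                                # reference image, (16)
    return z, alpha
```

Because the bases are chosen by proximity to the observation rather than by recency, key frames from a very different pose are simply not selected when the reference image for the current frame is built.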
Through this mechanism, our algorithm can maintain history about different aspects of appearance, for example, partial profile versus head on, without making the mistake of combining these two cases when matching to a specific frame.

⁵ Local implies a spatial relationship rather than a temporal relationship.

TABLE I. QUANTITATIVE COMPARISONS IN TERMS OF RMSE AND SPEED

Fig. 10. Quantitative comparison for various appearance update methods using the dudek video sequence.

B. Update Strategy Analysis
The selection of the tangent plane plays a vital role in building a reference image. To further analyze the effectiveness of our adaptive appearance model, we compare the adaptive appearance update with the direct and naive appearance updates discussed in Section V-B2. Fig. 10 shows the quantitative results for various update strategies using the dudek video. The naive appearance update loses track after the first occlusion. The direct and adaptive appearance updates perform identically in the beginning since, initially, the same images are added to the appearance manifold for both algorithms. It is only after the first 174 frames that the performance diverges and our adaptive appearance model can be seen to perform somewhat better. Over the entire video sequence, the average rmse is 5.8 for the direct appearance update and 4.98 for the adaptive appearance update. The difference between these two updates arises from how each accounts for the spatial relationship between the observation and the appearances on the manifold. The experimental results shown in Fig. 10 demonstrate the importance of spatial coherency. Our adaptive appearance update considers the spatial arrangement properly, and therefore, it is more robust to variability and drifting.

C. Limitation
While our algorithm performs well in face tracking, the state model is still an affine transformation realized as a global alignment. Adaptively building a reference image to best model the current frame does go some way toward handling nonlinear aspects of facial movement; however, it still cannot account for all nonrigid facial expressions. Affine transformations do not have the expressive power to model the details of various facial expressions, for instance, smiling or neutral expression.
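For reference, a planar affine map acts on every pixel coordinate with the same six parameters. The form below is the standard textbook definition; the symbols $a_{ij}$ and $b_i$ are illustrative and are not necessarily the parameterization used for the tracker's state.

\[
\begin{bmatrix} x' \\ y' \end{bmatrix}
=
\begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}
\begin{bmatrix} x \\ y \end{bmatrix}
+
\begin{bmatrix} b_1 \\ b_2 \end{bmatrix}
\]

Since the same six numbers apply to the whole face region, such a map can translate, rotate, scale, and shear the region as a whole, but it cannot move one part of the face independently of the rest, which is what a change of expression requires.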
One approach to handling nonlinear factors, such as expression variations, would be to adopt more sophisticated appearance models. For example, AAMs [21] might be used. However, the dimensions of the state space would be tremendously increased since it encodes both shape and texture information. The increased expressive power of AAMs clearly comes at considerable extra computational expense.

Another factor to consider in reviewing the quantitative results presented is that the ground-truth positions are manually annotated; thus, there may be some noise in these positions. In fact, Jepson et al. [22] report that the lower bound of the average rmse for the dudek video is 3.1 pixel error. This baseline error has both ground-truth and fiducial-position errors that cannot be recovered by similarity transformations. Here too, AAMs could break out of the limits imposed by the assumption of a global affine transformation but, again, at a considerable additional computational expense.

Finally, it should be understood that, using our method or, indeed, any of the algorithms against which we compared our method, face tracking is limited to a single face and reacquisition is needed when the object is lost. Our experimental results show that our method works reasonably well on short-term occlusion where the lengths of occlusion are 6, 20, and 22 frames for the dudek, Rams, and Smiley video sequences, respectively. However, tracking failure would definitely occur when long-time occlusion is present or the facial image appears significantly different after occlusion. This is because the assumption of smooth transitions between frames is invalid.

VIII. CONCLUSION AND FUTURE WORK
This paper has presented an adaptive framework for human-face tracking. Our adaptive affine cascade strategy performs two-stage factored sampling that effectively reduces the search space. In our test videos, we employ 400 to 600 particles for coarse factored sampling and 100 particles for fine factored sampling. This factored sampling strategy uses fewer particles and makes sampling more efficient. Equally important, we have adaptively built a reference image in every frame. This is achieved by properly selecting a tangent plane and using a linear variety from an appearance manifold. With these two key ingredients, our method is robust to variation of illumination, appearance, pose, and temporary occlusions. Furthermore, three video sequences with ground-truth positions have been adopted to assess the performance of our method, and encouraging results have been obtained. Finally, since our method is an online learning paradigm, no prior training is required.

Human-face tracking is still a difficult task due, in part, to large freedom of face movement and appearance changes. Our future work will focus on temporary disappearance and continue to explore appearance manifolds.

ACKNOWLEDGMENT
The authors thank David A. Ross from University of Toronto for the permission to use his videos and source code.

REFERENCES
[1] A. Yilmaz, O. Javed, and M. Shah, "Object tracking: A survey," ACM J. Comput. Surv., vol. 38, no. 4, pp. 1–45.
[2] N. Gordon, "Bayesian methods for tracking," Ph.D. dissertation, Univ. London, London, U.K.
[3] N. Gordon, D. Salmond, and A. Smith, "A novel approach to non-linear and non-Gaussian Bayesian state estimation," Proc. Inst. Elect. Eng. F, vol. 140, no. 2, Apr.
[4] A. Doucet, N. de Freitas, and N. Gordon, Sequential Monte Carlo Methods in Practice. New York: Springer-Verlag.
[5] M. Isard and A. Blake, "Condensation: Conditional density propagation for visual tracking," Int. J. Comput. Vis., vol. 29, no. 1, pp. 5–28, Aug.
[6] D. Ross, J. Lim, R.-S. Lin, and M.-H. Yang, "Incremental learning for robust visual tracking," Int. J. Comput. Vis., vol. 77, no. 1–3, May.
[7] G. Hager and P. Belhumeur, "Efficient region tracking with parametric models of geometry and illumination," IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 10, Oct.
[8] I. Matthews, R. Ishikawa, and S. Baker, "The template update problem," IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 6, Jun.
[9] M. J. Black and A. D. Jepson, "Eigentracking: Robust matching and tracking of articulated objects using a view-based representation," in Proc. ECCV, 1996.
[10] S. Von Duhn, L. Yin, M. J. Ko, and T. Hung, "Multiple-view face tracking for modeling and analysis based on non-cooperative video imagery," in Proc. CVPR, 2007.
[11] K.-C. Lee and D. Kriegman, "Face tracking in video with hybrid of Lucas–Kanade and condensation algorithm," in Proc. Int. Conf. Multimedia Expo, 2003.
[12] W. Zhang, Q. Wang, and X. Tang, "Real time feature based 3-D deformable face tracking," in Proc. ECCV, 2008.
[13] A. Levy and M. Lindenbaum, "Sequential Karhunen–Loeve basis extraction and its application to images," IEEE Trans. Image Process., vol. 9, no. 8, Aug.
[14] J. Ho, K.-C. Lee, M.-H. Yang, and D. Kriegman, "Visual tracking using learned linear subspace," in Proc. CVPR, 2004.
[15] P. Simard, Y. L. Cun, and J. Denker, "Efficient pattern recognition using a new transformation distance," in Proc. NIPS, 1992.
[16] H. Seung and D. Lee, "The manifold ways of perception," Science, vol. 290, no. 5500, Dec.
[17] S. Zhou, R. Chellappa, and B. Moghaddam, "Visual tracking and recognition using appearance-adaptive models in particle filters," IEEE Trans. Image Process., vol. 13, no. 11, Nov.
[18] B. North, A. B. M. Isard, and J. Rittscher, "Learning and classification of complex dynamics," IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 9, Sep.
[19] D. Ross. [Online]. Available: http://
[20] J. MacCormick and M. Isard, "Partitioned sampling, articulated objects, and interface-quality hand tracking," in Proc. ECCV, 2000.
[21] T. Cootes and G. Edwards, "Active appearance models," IEEE Trans. Pattern Anal. Mach. Intell., vol. 23, no. 6, Jun.
[22] A. D. Jepson, D. J. Fleet, and T. El-Maraghi, "Robust online appearance models for visual tracking," IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 10, Oct.
[23] S. Avidan, "Support vector tracking," IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 8, Aug.
[24] X. Song, J. Cui, H. Zha, and H. Zhao, "Vision-based multiple interacting targets tracking via on-line supervised learning," in Proc. ECCV, 2008.
[25] Q. Yu, T. Dinh, and G. Medioni, "Online tracking and reacquisition using co-trained generative and discriminative trackers," in Proc. ECCV, 2008.
[26] Y. Li, H. Ai, T. Yamashita, S. Lao, and M. Kawade, "Tracking in low frame rate video: A cascade particle filter with discriminative observers of different lifespans," in Proc. CVPR, 2007.
[27] K.-C. Lee and D. Kriegman, "Online learning of probabilistic appearance manifolds for video-based recognition and tracking," in Proc. CVPR, 2005.
[28] D. Comaniciu, V. Ramesh, and P. Meer, "Kernel-based object tracking," IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 5, May.

Yui Man Lui (S'07) is currently working toward the Ph.D. degree in the Department of Computer Science, Colorado State University, Fort Collins. His current research interests include special manifolds, face recognition, face tracking, and action classification.
Mr. Lui is the recipient of the Honeywell Best Student Paper Award from the 2009 International Conference on Biometrics: Theory, Applications and Systems (BTAS); a coauthor of the paper that received the Best Paper Award from the 2008 International Conference on Automatic Face and Gesture Recognition; and the recipient of an honorable mention for the Honeywell Best Student Paper Award from BTAS.

J. Ross Beveridge (M'83) received the Ph.D. degree from the University of Massachusetts, Amherst. He is currently an Associate Professor with the Department of Computer Science, Colorado State University, Fort Collins. He has served as an Associate Editor for Pattern Recognition and is currently an Associate Editor for Image and Vision Computing.
Dr. Beveridge has served as an Associate Editor for the IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE. He was the Program Cochair for the 1999 Conference on Computer Vision and Pattern Recognition. He was the recipient of the Outstanding Reviewer Awards in 2007 and 2008 and the Best Paper Award from the International Conference on Automatic Face and Gesture Recognition.

L. Darrell Whitley received the Ph.D. degree from Southern Illinois University. He is currently the Chair of the Department of Computer Science, Colorado State University, Fort Collins. He serves on the editorial board of the journals Artificial Intelligence and Evolutionary Computation. He also serves as the Chair of the Association for Computing Machinery Special Interest Group on Genetic and Evolutionary Computation, SIGEVO.
