Robust parameterized component analysis: theory and applications to 2D facial appearance models

Computer Vision and Image Understanding 91 (2003) 53-71

Fernando De la Torre a,* and Michael J. Black b

a Department of Communications and Signal Theory, La Salle School of Engineering, Universitat Ramon Llull, Barcelona 08022, Spain
b Department of Computer Science, Brown University, Box 1910, Providence, RI 02912, USA

Received 15 February 2002; accepted 11 February 2003

Abstract

Principal component analysis (PCA) has been successfully applied to construct linear models of shape, graylevel, and motion in images. In particular, PCA has been widely used to model the variation in the appearance of people's faces. We extend previous work on facial modeling for tracking faces in video sequences as they undergo significant changes due to facial expressions. Here we consider person-specific facial appearance models (PSFAM), which use modular PCA to model complex intra-person appearance changes. Such models require aligned visual training data; in previous work, this has involved a time-consuming and error-prone hand alignment and cropping process. Instead, the main contribution of this paper is to introduce parameterized component analysis to learn a subspace that is invariant to affine (or higher order) geometric transformations. The automatic learning of a PSFAM given a training image sequence is posed as a continuous optimization problem and is solved with a mixture of stochastic and deterministic techniques achieving sub-pixel accuracy. We illustrate the use of the 2D PSFAM model with preliminary experiments relevant to applications including video-conferencing and avatar animation.
© 2003 Elsevier Inc. All rights reserved.

Keywords: Facial appearance models; Principal component analysis; Robust statistics; Eigen-registration; Facial analysis

* Corresponding author. E-mail addresses: ftorre@salleurl.edu (F. De la Torre), black@cs.brown.edu (M.J. Black).


1. Introduction

This paper addresses the problem of learning a linear subspace representation of a training set in which the data (e.g., images) may have undergone some unknown parametric transformation (e.g., affine). The key idea is to simultaneously solve for the optimal linear subspace representing the data while aligning the training data with that subspace. To illustrate the method we develop it in the context of face modeling. In particular, we adopt the idea of modular eigenspaces (ME) [31,40,44] and apply our parameterized component analysis technique to the problem of developing person-specific facial appearance models (PSFAM).

Consider the problem of learning a linear subspace representing the variation of the subject's right eye in Fig. 1. The images were captured by asking the user to change the configuration of the eyes (open, close, look right, etc.) while holding the head still. However, it is not reasonable to assume that the person is absolutely still during the training time, and in practical situations there are always small motions between frames. Observe that in this kind of sequence it is difficult to gather aligned data due to the person's motion and the lack of labeled points for solving the correspondence problem between frames.

Although many computer vision researchers have used principal component analysis (PCA) to model the face [11,17-19,26,40,41,52], the major drawback of this traditional technique is that it requires normalized (aligned) samples in the training data. While, in the recognition process, alignment of the data with respect to the face model is a common step as noted by Martínez [37], little work has addressed problems posed by misalignment at the learning stage. Previous methods for constructing appearance models [11,18,19,26,40,41] have cropped the region of interest by hand, or have used hand-labeled, pre-defined feature points to compute the translation, scaling and rotation that brought each image into alignment with a prototype. However, this way of collecting data is likely to introduce errors due to inaccuracies which arise from labeling the points by hand, even with the use of landmarks, since it is difficult to achieve sub-pixel accuracy. In addition, manual cropping is a tedious, unpleasant, and time-consuming task.

The aim of the paper is illustrated in Fig. 2, where Fig. 2a shows some original images used for training. From these (non-aligned) images we compute a set of linear bases using PCA in the standard way. Fig. 2b shows the original images reconstructed using the non-aligned bases. Fig. 2c shows the reconstructed images obtained using the parameterized component analysis technique presented here.

Fig. 1. Some frames from a training sequence of images.


Fig. 2. Reconstruction of image data using an eigenspace representation: (a) example frames from the training data; (b) reconstruction of the right eye without any alignment; (c) reconstruction of the right eye with the proposed method (eigen-registration).

This eigen-registration technique iteratively computes the subspace while aligning the training images w.r.t. this subspace. That is, the algorithm that we propose in this paper will simultaneously learn the local appearance basis, creating an eigenspace while computing the motion to align the images w.r.t. the eigenspace. In the case of modular eigenspaces (ME) [31,40,44] considered here, masks which define the spatial domain of the ME are defined by hand in the first frame (no appearance model is previously learned) and after that the method is fully automatic. Preliminary results were presented in [13].

2. Previous work

It is beyond the scope of this paper to review all possible applications of PCA and subspace methods; therefore we just briefly describe the theory and point to related work for further information.

2.1. Subspace learning

Let $\mathbf{D} = [\mathbf{d}_1\ \mathbf{d}_2 \cdots \mathbf{d}_T] = [\mathbf{d}^1\ \mathbf{d}^2 \cdots \mathbf{d}^d]^T$ be a matrix $\mathbf{D} \in \mathbb{R}^{d \times T}$, where each column $\mathbf{d}_i$ is a data sample, $T$ is the number of training images, and $d$ is the number of pixels in each image. If the effective rank of $\mathbf{D}$ is much less than $d$, we can approximate the column space of $\mathbf{D}$ with $k \ll d$ principal components. Let the first $k$ principal components of $\mathbf{D}$ be $\mathbf{B} = [\mathbf{b}_1, \ldots, \mathbf{b}_k] \in \mathbb{R}^{d \times k}$. The columns of $\mathbf{B}$ span the subspace of maximum variation of $\mathbf{D}$.¹

Although a closed form solution for computing the principal components ($\mathbf{B}$) can be achieved by finding the eigenvectors of the covariance matrix $\mathbf{D}\mathbf{D}^T$ associated with the $k$ largest eigenvalues [20], here it is useful to exploit work that formulates PCA/subspace learning as the minimization of an energy function [16,20,21]

\[
E_{pca}(\mathbf{B}, \mathbf{C}) = \|\mathbf{D} - \mathbf{B}\mathbf{C}\|_F^2 = \sum_{i=1}^{T} \|\mathbf{d}_i - \mathbf{B}\mathbf{c}_i\|_2^2 = \sum_{t=1}^{T} \sum_{p=1}^{d} \Big( d_{pt} - \sum_{j=1}^{k} b_{pj} c_{jt} \Big)^2,
\]

where $\mathbf{C} = [\mathbf{c}_1\ \mathbf{c}_2 \cdots \mathbf{c}_T]$ and each $\mathbf{c}_i$ is a vector of coefficients used to reconstruct the data vector $\mathbf{d}_i$. Observe that subspace learning involves approximately factoring the data, $\mathbf{D}$, into the product of the bases, $\mathbf{B}$, and the coefficients, $\mathbf{C}$; therefore it can be posed as a bilinear estimation problem. There exist many methods for minimizing this equation, including alternated least squares (ALS), criss-cross regression, variants of expectation-maximization (EM), etc., but in the case of PCA they share the same basic philosophy. These algorithms alternate between solving for the coefficients $\mathbf{C}$ with the appearance bases $\mathbf{B}$ fixed and then solving for the bases $\mathbf{B}$ with $\mathbf{C}$ fixed. Typically, both updates are computed by solving a linear system of equations.

¹ Bold capital letters denote a matrix $\mathbf{D}$, bold lower-case letters a column vector $\mathbf{d}$. $\mathbf{d}_j$ represents the $j$th column of $\mathbf{D}$ and $\mathbf{d}^j$ is a column vector representing the $j$th row of $\mathbf{D}$. $d_{ij}$ denotes the scalar in row $i$ and column $j$ of $\mathbf{D}$ and the scalar $i$th element of a column vector $\mathbf{d}_j$. All non-bold letters represent scalar variables. $d_{ji}$ is the $i$th scalar element of the vector $\mathbf{d}^j$. diag is an operator that transforms a vector to a diagonal matrix, or a matrix into a column vector by taking each of its diagonal components. $tr(\mathbf{D})$ is the trace operator. $\|\mathbf{d}\|_2^2 = \mathbf{d}^T\mathbf{d}$ denotes the L2 norm and $\|\mathbf{d}\|_W^2 = \mathbf{d}^T\mathbf{W}\mathbf{d}$ is the weighted L2 norm. $\|\mathbf{D}\|_F^2 = tr(\mathbf{D}^T\mathbf{D}) = tr(\mathbf{D}\mathbf{D}^T)$ is the Frobenius norm of $\mathbf{D}$. $\mathbf{D}_1 \circ \mathbf{D}_2$ denotes the Hadamard (point wise) product.

2.2. Adding motion into the subspace formulation

Since the preliminary work of Sirovich and Kirby [48] and the successful eigenface application of Turk and Pentland [49], PCA has been widely applied to the construction of a face subspace. Since then, there has been a lot of work and interest in trying to construct more accurate models of the high dimensional manifold of faces. During the last few years there has been a growing trend to apply new machine learning or multivariate statistical techniques to construct more accurate face models. Many 2D/3D linear/non-linear models [26,39,46,52] have been proposed based on support vector machines, mixture of factor analyzers, Independent Component Analysis, Kernel PCA, etc. See [12,26,52] for an extended review in the context of recognition and modeling.

Mis-registration or variations in scale introduce significant non-linearities in the manifold of faces and can reduce the accuracy of tracking and recognition algorithms. While previous approaches have dealt with these issues as a separate, off-line registration process (often manual), here it is integrated into the learning procedure.
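The alternation described in Section 2.1 (solve for $\mathbf{C}$ with $\mathbf{B}$ fixed, then for $\mathbf{B}$ with $\mathbf{C}$ fixed) can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `als_subspace`, the random initialization, the fixed iteration count and the synthetic data are all assumptions made for illustration, and NumPy's least-squares solver stands in for "solving a linear system of equations".

```python
import numpy as np

def als_subspace(D, k, n_iters=50, seed=0):
    """Alternating least squares for min ||D - B C||_F^2.

    D is d x T (one sample per column); returns B (d x k) and C (k x T).
    """
    rng = np.random.default_rng(seed)
    B = rng.standard_normal((D.shape[0], k))   # random start (an assumption)
    for _ in range(n_iters):
        # Bases fixed: least-squares solve for the coefficients C.
        C = np.linalg.lstsq(B, D, rcond=None)[0]
        # Coefficients fixed: least-squares solve for the bases B,
        # using the transposed problem C^T B^T ~ D^T.
        B = np.linalg.lstsq(C.T, D.T, rcond=None)[0].T
    return B, C

# Synthetic data of exact rank 2 (d = 20 pixels, T = 30 samples).
rng = np.random.default_rng(1)
D = rng.standard_normal((20, 2)) @ rng.standard_normal((2, 30))
B, C = als_subspace(D, k=2)
rel_err = np.linalg.norm(D - B @ C) / np.linalg.norm(D)
print(f"relative reconstruction error: {rel_err:.2e}")
```

Note that this sketch recovers a basis of the subspace, not the orthonormal principal components themselves; an additional orthogonalization step would be needed to obtain those.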

Recently, there has been an interest in the simultaneous computation of appearance bases and the motion that aligns the training images. This is a classic chicken-and-egg problem. Once the correspondence of interesting points through an image sequence is known, learning the appearance model is straightforward, and if the appearance is known solving for the correspondence is easy. De la Torre et al. [17] proposed a method for face tracking which recovers affine parameters using subspace methods. This method dynamically updates the eigenspace by utilizing the most recent history. The updating algorithm estimates the parametric transformation, which aligns the actual image w.r.t. the eigenspace and recalculates a local eigenspace. Because the new images usually contain information not available in the eigenspace, the motion parameters are calculated in a robust manner. However, the method assumes that an initial eigenspace is learned from a training set aligned by hand.

Schweitzer [47] proposed a deterministic method which registers a set of images with respect to their eigenfeatures, applying it to the flower garden sequence for indexing purposes. However, the assumption of affine or quadratic motion models [47] is only valid when the scene is planar. The extension to the general case of arbitrary 3D scenes and camera motions remains unclear. As Schweitzer notes [47], the algorithm is likely to get stuck in local minima, since it comes from a linearization and uses gradient descent methods. Alternatively, Rao [45] proposed a neural network which can learn a translation-invariant code for natural images. Although he suggests updating the appearance basis, the experiments show only translation-invariant recognition, as proposed by Black and Jepson [4].

Frey and Jojic [24] took a different approach and introduced an expectation maximization (EM) algorithm for factor analysis (similar to PCA) that is invariant to geometric transformations. The proposed method is problematic because the computational cost grows exponentially with the number of possible spatial transformations, and can be too computationally intensive when working with realistic high dimensional (greater than two) motion models. Using a different approach, Mandel and Penev [36] report the interesting observation that non-properly aligned data lie on curved manifolds. This observation forms the basis of an algorithm to align visual data. Results were reported on image sequences of faces to compensate for translational motion. However, it is not clear how to extend the method to more complex high dimensional motion models without considerably increasing the computational cost.

In this paper, unlike previous methods, we use stochastic and multi-resolution techniques to avoid local minima in the minimization process. Also, we extend previous approaches to multiple regions within a robust (to outliers) and continuous optimization framework.

In a different direction, there has been intensive research on automatically or semi-automatically building facial shape models using extracted landmarks. Most of the previous work in this area assumes that the object has already been segmented from the image sequence and in some cases the features or curves are placed by hand. If this is the case, the problem is how to put the features in correspondence using rigid or non-rigid transformations [9,29]. In the other direction, Walker et al. [32] have proposed a method for automatically placing landmarks to define correspondence between images and hence automatically constructing appearance models.

See the report of Cootes and Taylor [12] for a good review of automatic 2D/3D landmark placement. In contrast to previous automatic landmark methods, we use parameterized matching with a low dimensional model (e.g. affine) and generalize the matching by incorporating a subspace for the appearance variation.

2.3. Person specific models

While most work on face tracking focuses on generic trackers which are independent of the identity of the person being tracked [5,8,10,11,26,27,34], here we focus on PSFAM [17,26,22,50] for tracking a single individual and use PCA to model the variations due to changes in expression. Although PSFAM are only valid for one person, they remain useful in many vision related applications such as vision-based human computer interaction [5,8,10,11,17-19,26,27,31], driver fatigue detection, facial animation, face detection/recognition, video-conferencing, text to speech, etc., which usually involve tracking or modeling a particular user. We build these PSFAMs using modular eigenspaces (ME) [31,40] which have benefits over global eigenspace methods (e.g., more accurate reconstruction of the regions of interest, lower computational cost, robustness to occlusions [37], etc.). However, it is worth pointing out that representations other than ME have been explored successfully for face recognition and tracking; for instance, Local Feature Analysis [42,43] or Gabor jets with elastic graph matching [51]. Although these techniques have shown good performance in recognition and tracking domains, they do not address the issue of learning a model invariant to geometric transformations.

3. Generative model for 2D faces

The generative model we propose for image formation takes into account the motion and appearance of the face. Adopting the ME approach we use predefined masks for the various image features and learn the appearance bases within these regions. Fig. 3 shows some frames of a training set for learning a 2D PSFAM. Given this training data as input, the algorithm that we propose in this paper is able to factor the training data into appearance and motion of the predefined face regions. In principle the regions of support (masks) could be computed as an eigenspace-based segmentation problem (finding independent regions). However, in the case of the face, these regions are quite clear, and a rough approximation is sufficient. Therefore, we define the masks in the first image and they will remain the same for the entire training image sequence.

Let $\mathbf{d} \in \mathbb{R}^{d \times 1}$ be the region of $d$ pixels belonging to the face, defined by hand in the first image. $\boldsymbol{\pi}^l = [\pi^l_1\ \pi^l_2 \cdots \pi^l_d]^T \in \mathbb{R}^{d \times 1}$ denotes the binary mask for the region $l$ and it has the same size as the face region ($d$ pixels). Each of the mask's pixels takes a binary value, $\pi^l_p \in \{0, 1\}$, and there is no overlap between masks, that is, $\sum_{l=1}^{L} \pi^l_p = 1\ \forall p$. $\boldsymbol{\pi}^l$ will contain $d_l$ pixels with value 1, which define the spatial domain of the mask $l$ (see Fig. 3), and $\sum_{l=1}^{L} d_l = d$.
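The two mask constraints above (binary values with no overlap, and region sizes summing to $d$) can be checked on a small example. This is only an illustrative sketch: the 6 x 8 label image, the region shapes and the variable names are hypothetical and are not taken from the paper.

```python
import numpy as np

# Hypothetical 6 x 8 face region with L = 3 hand-drawn regions,
# stored as a label image (one region index per pixel).
labels = np.zeros((6, 8), dtype=int)      # region 0: remainder of the face
labels[1:3, 1:4] = 1                      # region 1: e.g. an eye
labels[4:6, 2:6] = 2                      # region 2: e.g. the mouth

L = 3
# pi[l] is the binary mask of region l, flattened to a vector of d pixels.
pi = np.stack([(labels.ravel() == l).astype(int) for l in range(L)])
d = labels.size
d_l = pi.sum(axis=1)                      # number of pixels in each region

# Every pixel belongs to exactly one mask: sum_l pi[l, p] == 1 for all p.
no_overlap = bool(np.all(pi.sum(axis=0) == 1))
print(no_overlap, d_l.tolist(), d)
```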


Fig. 3. The generative model for an image sequence. Face images are decomposed using appearance models within regions corresponding to the eyes, mouth, and remainder of the face. The appearance within regions varies independently. In the current implementation the regions move together according to a single affine (or other parameterized) model.

Each of these masks will have an associated eigenspace. Let $\mathbf{d}^l_t \in \mathbb{R}^{d_l \times 1}$ be the image patch of the region $l$ and let $\mathbf{c}^l_t$ be the appearance coefficients of the region $l$ at time $t$. $\mathbf{B}^l = [\mathbf{b}^l_1\ \mathbf{b}^l_2 \cdots \mathbf{b}^l_{k_l}] \in \mathbb{R}^{d_l \times k_l}$ are the $k_l$ appearance bases for the $l$th region. $\tilde{\mathbf{B}}^l \in \mathbb{R}^{d \times k_l}$ (which is introduced for notational convenience) will be equal to $\mathbf{B}^l$ for all pixels where $\pi^l_p = 1$ (i.e., pixels that belong to the $l$th mask) and otherwise can take an arbitrary value. The graylevel of the patch, or region $l$, will be reconstructed by a linear combination of an appearance basis $\tilde{\mathbf{B}}^l$, as

\[
\mathbf{d}_t = \begin{bmatrix} \mathbf{d}^1_t \\ \vdots \\ \mathbf{d}^L_t \end{bmatrix} = \begin{bmatrix} \mathbf{B}^1 \mathbf{c}^1_t \\ \vdots \\ \mathbf{B}^L \mathbf{c}^L_t \end{bmatrix} = \sum_{l=1}^{L} (\boldsymbol{\pi}^l \circ \tilde{\mathbf{B}}^l \mathbf{c}^l_t). \qquad (1)
\]

3.1. Motion

If the face to be modeled can be considered to be far away from the camera, it can be approximated by a plane. The motion of planar surfaces, under orthographic or perspective projection, can be recovered with a parametric model of 6 or 8 parameters [5]. For simplicity, the rigid motion of the face will be parameterized by an affine model: $f_1(\mathbf{x}_p, \mathbf{a}^l_t) = [a^l_{1t}\ a^l_{4t}]^T + \mathbf{A}^l_{ft} [x_p - x^l_c\ \ y_p - y^l_c]^T$, where $\mathbf{A}^l_{ft}$ is a $2 \times 2$ matrix containing the affine parameters ($a^l_{2t}, a^l_{3t}, a^l_{5t}, a^l_{6t}$). Let $\mathbf{a}^l_t = [a^l_{1t}\ a^l_{2t} \cdots a^l_{6t}]^T$ denote the vector of affine motion parameters of the mask $l$ at time $t$, let $\mathbf{x}_p = [x_p\ y_p]^T$ denote the Cartesian coordinates of the image at the pixel $p$, and let $\mathbf{x}^l_c = [x^l_c\ y^l_c]^T$ denote the center of the $l$th region. Throughout the paper, we will assume that the rigid motion of all the modular eigenspaces (w.r.t. the center) is the same (i.e., $\mathbf{a}^1_t = \mathbf{a}^2_t = \cdots = \mathbf{a}^L_t$).

Once the appearance and motion models have been defined, the graylevel of each pixel of the image $\mathbf{d}_t$ is explained as a superposition of a region-subspace plus a warping, see Fig. 3; that is, $\mathbf{d}_t = \sum_{l=1}^{L} (\boldsymbol{\pi}^l \circ \tilde{\mathbf{B}}^l \mathbf{c}^l_t)(f_1(\mathbf{x}, \mathbf{a}^l_t))$, where $\mathbf{x} = [\mathbf{x}_1\ \mathbf{x}_2 \cdots \mathbf{x}_d]^T$ and the notation $(\boldsymbol{\pi}^l \circ \tilde{\mathbf{B}}^l \mathbf{c}^l_t)(f_1(\mathbf{x}, \mathbf{a}^l_t))$ means that the reconstructed image region $(\boldsymbol{\pi}^l \circ \tilde{\mathbf{B}}^l \mathbf{c}^l_t)$ is warped by the motion $f_1(\mathbf{x}, \mathbf{a}^l_t)$. Observe that this image model is essentially the same as previous appearance representations [4,11,18] but with the addition of modular eigenspaces, and we now treat the basis as parameters to be estimated.

4. Learning the model parameters

Once the model has been defined, in order to automatically learn the PSFAM, it is necessary to learn the model parameters. In this section, we describe the learning procedure; that is, given an observed image sequence ($\mathbf{D} \in \mathbb{R}^{d \times T}$) and $L$ masks in the first image ($\boldsymbol{\pi} = \{\boldsymbol{\pi}^1, \ldots, \boldsymbol{\pi}^L\}$), we find the parameters $\mathcal{B}$, $\mathcal{C}$, $\mathcal{A}$, and $\boldsymbol{\sigma}$ that best reconstruct the sequence (in a robust statistical sense). Here $\mathcal{A} = \{\mathbf{A}^1, \mathbf{A}^2, \ldots, \mathbf{A}^L\}$ is the set of motion parameters of all the face regions in all the image frames, and $\mathbf{A}^i = [\mathbf{a}^i_1\ \mathbf{a}^i_2 \cdots \mathbf{a}^i_T]$ is the matrix which contains the motion parameters for each image in the $i$th region. Analogously, $\mathcal{C} = \{\mathbf{C}^1, \mathbf{C}^2, \ldots, \mathbf{C}^L\}$, where $\mathbf{C}^i = [\mathbf{c}^i_1\ \mathbf{c}^i_2 \cdots \mathbf{c}^i_T]$, and $\mathcal{B} = \{\mathbf{B}^1, \mathbf{B}^2, \ldots, \mathbf{B}^L\}$.

At this point, learning the model parameters can be posed as a minimization problem. In this case the residual will be the difference between the image at time $t$ and the reconstruction using the model. In order to take into account outlying data, we introduce a robust objective function, minimizing $E_{rereg}$:

\[
E_{rereg}(\mathcal{B}, \mathcal{C}, \mathcal{A}, \boldsymbol{\sigma}) = \sum_{t=1}^{T} \sum_{p=1}^{d} \rho\Big( d_{pt} - \sum_{l=1}^{L} \Big( \pi^l_p \sum_{j=1}^{k_l} b^l_{pj} c^l_{jt} \Big)(f_1(\mathbf{x}_p, \mathbf{a}^l_t)),\ \sigma_p \Big), \qquad (2)
\]

where $b^l_{pj}$ is the $p$th pixel of the $j$th basis of $\mathbf{B}^l$ for the region $l$. Observe that the pixel residual is filtered by the Geman-McClure robust error function [25] given by $\rho(x, \sigma_p) = x^2/(x^2 + \sigma_p^2)$, in order to reduce the influence of outlying data. $\sigma_p$ is a parameter that controls the convexity of the robust function and is used for deterministic annealing [4,7]. Benefits of the robust formulation for subspace related problems are explained elsewhere [15,16]. Observe that the previous equation is a patched version of Eigentracking [4], and similar to AAM [11] or Flexible Eigentracking [18] without shape constraints. However, in contrast to these approaches [4,11,18], in $E_{rereg}$ the appearance bases $\mathcal{B}$ are now treated as a set of parameters to be estimated.
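To make the effect of the robust function concrete, the following minimal sketch evaluates the Geman-McClure function $\rho(x, \sigma) = x^2/(x^2 + \sigma^2)$ on a handful of residuals and compares the total with the least-squares cost. The residual values and $\sigma = 1$ are hypothetical, chosen only to show that a single outlier dominates the quadratic cost while contributing less than 1 to the robust one.

```python
def geman_mcclure(x, sigma):
    """Geman-McClure robust error: rho(x, sigma) = x^2 / (x^2 + sigma^2)."""
    return x * x / (x * x + sigma * sigma)

# Hypothetical pixel residuals; the last one plays the role of an outlier.
residuals = [0.1, -0.2, 0.05, 50.0]
sigma = 1.0

robust_cost = sum(geman_mcclure(e, sigma) for e in residuals)
ls_cost = sum(e * e for e in residuals)
print(f"robust cost: {robust_cost:.4f}, least-squares cost: {ls_cost:.4f}")
```

Because $\rho$ saturates at 1, each pixel can contribute at most 1 to the energy regardless of how large its residual is, which is what limits the influence of outlying data.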

4.1. Stochastic state initialization

The error function $E_{rereg}$, Eq. (2), is a non-convex function; thus, without a good starting point, any gradient descent method may get trapped in local minima. When computing the motion parameters, as in the case of optical flow, a coarse-to-fine strategy [4,12], in which the input images are represented by a Gaussian pyramid, can help avoid local minima. Although a coarse-to-fine strategy is helpful, this technique is insufficient in our case, since in real image sequences the size of the face can be small in comparison to the number of pixels in the background, and large motions can be performed (e.g., in the sequences that we tried, the face can move more than 20 pixels from frame to frame). In order to cope with such real conditions, we explore the use of stochastic methods such as simulated annealing (SA), genetic algorithms (GA) [38] or Condensation (particle filtering) [6] for motion estimation. Betke and Makris [3] have used a fast version of SA to match traffic signals over rigid parameters, and Lanitis et al. [33] made use of GA to fit an active shape model (ASM). De la Torre et al. [19] applied particle filtering [6] for appearance based tracking of rigid and non-rigid motion. The use of particle filtering allows switching between models (e.g., models with different spatial support [19]), coping with large motion changes and avoiding local minima in the parameter estimation process. Although the techniques are very similar computationally speaking, here we make use of GA [38] within a coarse-to-fine strategy.

Given the first image of the sequence we manually initialize the masks at the highest resolution level and assign the graylevel image values to the first basis for each region, $\mathcal{B} = \{\mathbf{b}^1_1, \ldots, \mathbf{b}^L_1\}$. Afterwards, we take the subset of the $m$ frames closest in time (typically $m = 15$), and use a GA for a first estimation of the motion parameters which minimize Eq. (2). For the initial estimation of the motion parameters with the GA, we use a least squares version of Eq. (2); that is, $\rho(x) = x^2$. Given the genetic estimation of these parameters, we recompute the bases $\mathcal{B}$ which preserve 60% of the energy. This initialization procedure is repeated until all the frames in the image sequence are initialized. The procedure is summarized as

  - Manual initialization in the first frame.
      - Initialize the mask in the image $\mathbf{d}_1$.
      - Initialize the bases $\mathcal{B} = \{\mathbf{b}^1_1, \ldots, \mathbf{b}^L_1\}$ with the graylevel values of $\mathbf{d}_1$.
  - Stochastic initialization of the motion and appearance parameters for $\mathbf{D}$.
      for i = 2:m:T (Matlab notation)
          - Run the GA for computing the motion and appearance parameters in $\{\mathbf{d}_i, \ldots, \mathbf{d}_{i+m}\}$.
          - Perform SVD on the registered set of images from 1 to m and keep the number of bases which preserve 60% of the energy.
          - Update the set of bases $\mathcal{B}$.
      end

The GA uses 300 individuals over 13 generations for each frame. The selection function we use is the normalized geometric ranking, which defines the probability of one individual as $P_i = \big(q/(1 - (1 - q)^P)\big)(1 - q)^{(r-1)}$, where $q$ is the probability of selecting the best individual, $r$ is the rank of the individual, and $P$ the population size. See [38] for a more detailed explanation of the GA. At the beginning, $q$ has a low value, and it is successively increased over generations, acting as a temperature parameter in the deterministic annealing [4,7] for improving the local search. The crossover process is a convex combination between two samples, i.e., $\alpha\,\mathrm{chromosome}_1 + (1 - \alpha)\,\mathrm{chromosome}_2$ where $1 \geq \alpha \geq 0$. The genetic operator is a simple Gaussian random perturbation, which also depends on the temperature parameter. In our experiments we take $q = 0.04$ and $\alpha = 0.$…

4.2. Robust deterministic learning

The previous section describes a method for computing an initial estimate of the parameters $\mathcal{B}$, $\mathcal{C}$, $\mathcal{A}$. In order to improve the solution and achieve sub-pixel accuracy, a normalized gradient descent algorithm for minimizing Eq. (2) has been employed in [13]. Alternatively (and conveniently) we can reformulate the minimization problem as one of iteratively reweighted least-squares (IRLS), which provides an approximate, iterative solution to the robust M-estimation problem [30,35]. For a given $\boldsymbol{\sigma}$, a matrix $\mathbf{W} \in \mathbb{R}^{d \times T}$, which contains the positive weights for each pixel and each image, is calculated for each iteration as a function of the previous residuals $e_{pi} = d_{pi} - \sum_{l=1}^{L} \big( \pi^l_p \sum_{j=1}^{k_l} b^l_{pj} c^l_{ji} \big)(f_1(\mathbf{x}_p, \mathbf{a}^l_i))$. Each element $w_{pi}$ ($p$th pixel of the $i$th image) of $\mathbf{W}$ will be equal to $w_{pi} = \psi(e_{pi}, \sigma_p)/e_{pi}$, where $\psi(e_{pi}, \sigma_p) = \partial\rho(e_{pi}, \sigma_p)/\partial e_{pi} = 2 e_{pi} \sigma_p^2/(e_{pi}^2 + \sigma_p^2)^2$ [28]. Given an initial error, the weight matrix $\mathbf{W}$ is computed and Eq. (2) becomes

\[
E_{wereg}(\mathcal{B}, \mathcal{C}, \mathcal{A}, \boldsymbol{\sigma}) = \sum_{t=1}^{T} \Big\| \mathbf{d}_t - \sum_{l=1}^{L} (\boldsymbol{\pi}^l \circ \tilde{\mathbf{B}}^l \mathbf{c}^l_t)(f_1(\mathbf{x}, \mathbf{a}^l_t)) \Big\|^2_{\mathbf{W}_t} \qquad (3)
\]
\[
= \sum_{t=1}^{T} \sum_{l=1}^{L} \big\| \mathbf{d}^l_t(f(\mathbf{x}, \mathbf{a}^l_t)) - \mathbf{B}^l \mathbf{c}^l_t \big\|^2_{\mathbf{W}^l_t}, \qquad (4)
\]

where $f$ will warp the images towards the eigenspace, whereas $f_1$ warps the bases towards the images. Observe that $f$ will be approximately the inverse of $f_1$. Recall that $\|\mathbf{d}\|^2_W = \mathbf{d}^T\mathbf{W}\mathbf{d}$ is a weighted norm. $\mathbf{W}_t \in \mathbb{R}^{d \times d}$ is a diagonal matrix, such that the diagonal elements are the $t$th column of $\mathbf{W}$.
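The IRLS weights defined above, $w = \psi(e, \sigma)/e = 2\sigma^2/(e^2 + \sigma^2)^2$, can be sketched in a few lines. The example below is a toy problem chosen for illustration, not the paper's setting: it robustly estimates a single constant from samples containing one outlier, with a fixed $\sigma$ (no annealing). The function names and the data are assumptions.

```python
def psi(e, sigma):
    """Influence function: derivative of the Geman-McClure rho w.r.t. e."""
    return 2.0 * e * sigma**2 / (e**2 + sigma**2) ** 2

def irls_weight(e, sigma):
    """w = psi(e, sigma) / e, simplified so that e = 0 needs no special case."""
    return 2.0 * sigma**2 / (e**2 + sigma**2) ** 2

def robust_location(samples, sigma, n_iters=20):
    """Toy IRLS: robust estimate of a single constant from noisy samples."""
    mu = sum(samples) / len(samples)            # least-squares starting point
    for _ in range(n_iters):
        # Recompute the weights from the current residuals ...
        w = [irls_weight(x - mu, sigma) for x in samples]
        # ... then solve the weighted least-squares problem in closed form.
        mu = sum(wi * xi for wi, xi in zip(w, samples)) / sum(w)
    return mu

samples = [1.0, 1.1, 0.9, 1.05, 10.0]           # hypothetical data, one outlier
mean = sum(samples) / len(samples)
mu = robust_location(samples, sigma=1.0)
print(f"least-squares mean: {mean:.3f}, IRLS estimate: {mu:.3f}")
```

The same pattern (recompute weights from residuals, then solve a weighted least-squares problem) is what Eq. (4) expresses for the full model, where the unknowns are the bases, coefficients and motion parameters rather than a single constant.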
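$\mathbf{W}^l_t \in \mathbb{R}^{d_l \times d_l}$ is a diagonal matrix, where the diagonal is created by the elements of the $t$th column of $\mathbf{W}$ which belong to the $l$th region. Observe that if $\mathbf{W}$ is a matrix with all ones we have the least-squares solution.

Eq. (4) provides the formulation for robust parameterized component analysis. Minimizing (4) with respect to the parameters gives a subspace that is invariant to the allowed geometric transformations and robust to outliers on a pixel level. Clearly, finding the minimum is a challenge and the process for doing so is described below.

Notice that, if the motion parameters are known, computing the basis and the coefficients translates into a weighted bilinear problem (computing basis $\mathbf{B}$ and coefficients $\mathbf{C}$). In order to compute the updates of the bases and coefficients in closed form in the simplest way, we use the following observation:

\[
E_{wereg} = \sum_{t=1}^{T} \sum_{l=1}^{L} \big\| (\mathbf{d}^l_w)_t - \mathbf{B}^l \mathbf{c}^l_t \big\|^2_{\mathbf{W}^l_t} = \sum_{l=1}^{L} \sum_{p=1}^{d_l} \big\| (\mathbf{d}^l_w)^p - (\mathbf{C}^l)^T (\mathbf{b}^l)^p \big\|^2_{(\mathbf{W}^l)^p}, \qquad (5)
\]

where $(\mathbf{d}^l_w)_t$ is the warped image $\mathbf{d}^l_t(f(\mathbf{x}, \mathbf{a}^l_t))$ and it is the $t$th column of the matrix $\mathbf{D}_w$ (just the $d_l$ pixels of the $l$th region). Recall that $(\mathbf{d}^l_w)^p$ is a column vector which corresponds to the $p$th row of the matrix $\mathbf{D}_w$ and that $(\mathbf{W}^l)^p$ is a diagonal matrix which contains the $p$th row of the matrix $\mathbf{W}$ of the region $l$.

Minimizing Eq. (4) is a non-linear optimization problem w.r.t. the motion parameters. Following previous work on motion estimation [2,4,27], we linearize the variation of the function, using a first-order Taylor series approximation. Without loss of generality, rather than linearizing the transformation which warps the eigenspace towards the image, $f_1(\mathbf{x}, \mathbf{a}_t)$, we linearize the transformation which aligns the incoming image w.r.t. the eigenspace, $f(\mathbf{x}, \mathbf{a}_t)$ (see Eq. (4)). Expanding $\mathbf{d}^l_t(f(\mathbf{x}, \mathbf{a}^{l0}_t + \Delta\mathbf{a}^l_t))$ in the Taylor series about the initial estimation of the motion parameters $\mathbf{a}^{l0}_t$ (which are given by the GA):

\[
\mathbf{d}^l_t(f(\mathbf{x}, \mathbf{a}^{l0}_t + \Delta\mathbf{a}^l_t)) = \mathbf{d}^l_t(f(\mathbf{x}, \mathbf{a}^{l0}_t)) + \mathbf{J}^l_t \Delta\mathbf{a}^l_t + \text{h.o.t.}, \qquad (6)
\]

where $\mathbf{J}^l_t$ is the Jacobian at time $t$ of the $l$th region and h.o.t. denotes the higher order terms. The Jacobian

\[
\mathbf{J}^l_t = \left[ \frac{\partial \mathbf{d}^l_t(f(\mathbf{x}, \mathbf{a}^{l0}_t))}{\partial a^l_{1t}}\ \ \frac{\partial \mathbf{d}^l_t(f(\mathbf{x}, \mathbf{a}^{l0}_t))}{\partial a^l_{2t}}\ \cdots\ \frac{\partial \mathbf{d}^l_t(f(\mathbf{x}, \mathbf{a}^{l0}_t))}{\partial a^l_{mt}} \right]
\]

is computed as

\[
\mathbf{J}^l_t = \begin{bmatrix} \nabla d_1^T(f(\mathbf{x}_1, \mathbf{a}^{l0}_t))\ \dfrac{\partial f(\mathbf{x}_1, \mathbf{a}^{l0}_t)}{\partial \mathbf{a}^l_t} \\ \vdots \\ \nabla d_{d_l}^T(f(\mathbf{x}_{d_l}, \mathbf{a}^{l0}_t))\ \dfrac{\partial f(\mathbf{x}_{d_l}, \mathbf{a}^{l0}_t)}{\partial \mathbf{a}^l_t} \end{bmatrix},
\]

where

\[
\nabla d_i(f(\mathbf{x}_i, \mathbf{a}^{l0}_t)) = \left[ \frac{\partial d_i(f(\mathbf{x}_i, \mathbf{a}^{l0}_t))}{\partial x}\ \ \frac{\partial d_i(f(\mathbf{x}_i, \mathbf{a}^{l0}_t))}{\partial y} \right]^T \in \mathbb{R}^{2 \times 1}
\]

is the spatial gradient of the image $\mathbf{d}_t$ warped with $\mathbf{a}^{l0}_t$ at the position $\mathbf{x}_i$. $(\partial f(\mathbf{x}_i, \mathbf{a}^{l0}_t)/\partial \mathbf{a}^l_t) \in \mathbb{R}^{2 \times 6}$ is the derivative of the parametric motion w.r.t. the motion parameters evaluated at the pixel $\mathbf{x}_i$ and warped with the initial motion parameters $\mathbf{a}^{l0}_t$. In the case that $f(\mathbf{x}_p, \mathbf{a}^l_t)$ is an affine model, $\partial f(\mathbf{x}_p, \mathbf{a}^{l0}_t)/\partial \mathbf{a}^l_t$ would be equal to

\[
\frac{\partial f(\mathbf{x}_p, \mathbf{a}^{l0}_t)}{\partial \mathbf{a}^l_t} = \begin{bmatrix} 1 & x_p - x_c & y_p - y_c & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & x_p - x_c & y_p - y_c \end{bmatrix}. \qquad (7)
\]

Observe that after the linearization the objective function $E_{wereg}$, Eq. (4), is convex in each of the parameters.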
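As an illustration of Eq. (7) and of how one row of the Jacobian is assembled, the following sketch builds the $2 \times 6$ derivative of an affine map and multiplies it by a spatial image gradient. The helper names (`affine_map`, `dfda`, `jacobian_rows`) and the sample values are hypothetical; in practice the gradients would come from the warped image rather than being supplied by hand.

```python
import numpy as np

def affine_map(x, y, a, xc, yc):
    """Affine model: [a1, a4]^T + [[a2, a3], [a5, a6]] [x - xc, y - yc]^T."""
    u = a[0] + a[1] * (x - xc) + a[2] * (y - yc)
    v = a[3] + a[4] * (x - xc) + a[5] * (y - yc)
    return np.array([u, v])

def dfda(x, y, xc, yc):
    """2 x 6 derivative of the affine map w.r.t. (a1, ..., a6), as in Eq. (7)."""
    return np.array([[1.0, x - xc, y - yc, 0.0, 0.0, 0.0],
                     [0.0, 0.0, 0.0, 1.0, x - xc, y - yc]])

def jacobian_rows(grad_x, grad_y, xs, ys, xc, yc):
    """Stack grad(d_i)^T df/da for every pixel; the result is n_pixels x 6."""
    return np.stack([np.array([gx, gy]) @ dfda(x, y, xc, yc)
                     for gx, gy, x, y in zip(grad_x, grad_y, xs, ys)])

# Hypothetical region with three pixels, centered at (xc, yc) = (1, 1).
xs, ys = [1.0, 3.0, 0.0], [1.0, 2.0, 4.0]
grad_x, grad_y = [1.0, 0.5, -0.2], [2.0, 0.0, 0.3]
J = jacobian_rows(grad_x, grad_y, xs, ys, xc=1.0, yc=1.0)
print(J.shape)
```

Since the affine map is linear in its parameters, `dfda(...) @ delta` reproduces the change in the mapped position exactly for any parameter increment `delta`; the first-order approximation in Eq. (6) comes from the image, not from the motion model.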
For instance, $\Delta\mathbf{a}_t$ can be computed in closed form by solving a linear system of equations:

\[
\begin{bmatrix} (\mathbf{J}^1_t)^T \mathbf{W}^1_t \mathbf{J}^1_t \\ \vdots \\ (\mathbf{J}^L_t)^T \mathbf{W}^L_t \mathbf{J}^L_t \end{bmatrix} \Delta\mathbf{a}_t = - \begin{bmatrix} (\mathbf{J}^1_t)^T \mathbf{W}^1_t \big( \mathbf{d}^1_t(f(\mathbf{x}, \mathbf{a}^0_t)) - \mathbf{B}^1 \mathbf{c}^1_t \big) \\ \vdots \\ (\mathbf{J}^L_t)^T \mathbf{W}^L_t \big( \mathbf{d}^L_t(f(\mathbf{x}, \mathbf{a}^0_t)) - \mathbf{B}^L \mathbf{c}^L_t \big) \end{bmatrix},
\]

where recall that $\mathbf{W}^l_t$ is a matrix containing the weights for the region $l$ at time $t$. In this case, we have assumed that $\Delta\mathbf{a}^l_t = \Delta\mathbf{a}_t\ \forall l$ and drop the superscript $l$, since all regions in the ME are assumed to have the same motion. However, $E_{wereg}$ is no longer convex as a joint function of these variables. In order to learn the parameters, we break the estimation problem into two sub-problems. We alternate between estimating $\mathcal{C}$ and $\mathcal{A}$ with a Gauss-Newton scheme [2,4] and learning the basis $\mathcal{B}$ and scale parameters $\boldsymbol{\sigma}$ until convergence (see [16,15] for more detailed information). Each of the updates for $\mathcal{C}$, $\mathcal{A}$ and $\mathcal{B}$ is computed in closed form. This multi-linear fitting algorithm monotonically reduces the cost function, although it is not guaranteed to converge to the global minimum.

We also use a coarse-to-fine strategy [2,4,12] to cope with large motions and to improve the efficiency of the algorithm. Towards that end, a Gaussian image pyramid is constructed. Each level of the pyramid is constructed by taking the image at the previous resolution level, convolving it with a Gaussian filter and subsampling. Details are given below.

  For each resolution level (coarse to fine), until convergence of $\mathcal{C}$, $\mathcal{A}$, and $\mathcal{B}$:
    - Until convergence of $\mathcal{C}$, $\mathcal{A}$:
        - Until convergence of $\mathcal{A}$, rewarp $\mathbf{D}$ to $\mathbf{D}_w$ and update the motion parameters for each region by computing $\mathbf{a}^l_t = \mathbf{a}^l_t + \Delta\mathbf{a}_t\ \ \forall l = 1 \ldots L$.
        - Update the appearance coefficients for each region and each image: $\big((\mathbf{B}^l)^T \mathbf{W}^l_t \mathbf{B}^l\big)\mathbf{c}^l_t = (\mathbf{B}^l)^T \mathbf{W}^l_t \mathbf{d}^l_t(f(\mathbf{x}, \mathbf{a}^l_t))\ \ \forall l = 1 \ldots L;\ \forall t = 1 \ldots T$.
    - Update $\mathcal{B}$ preserving 85% of the energy, solving: $\big(\mathbf{C}^l (\mathbf{W}^l)^p (\mathbf{C}^l)^T\big)(\mathbf{b}^l)^p = \mathbf{C}^l (\mathbf{W}^l)^p (\mathbf{d}^l_w)^p\ \ \forall l = 1 \ldots L;\ \forall p = 1 \ldots d_l$.
    - Recompute the error, weights ($\mathbf{W}$) and the scale statistics $\boldsymbol{\sigma}$ [16].
  Propagate the motion parameters to the next finer resolution level [2,4,12] (the translation parameters are multiplied by a factor of 2). Once the motion parameters are propagated the bases are recomputed.

Since the face usually performs smooth changes in motion and appearance over time, the previous model can be improved by incorporating dynamical information as additional regularization terms into the energy function framework, minimizing:

\[
E_{dwereg} = E_{wereg} + \sum_{t=2}^{T} \sum_{l=1}^{L} \Big( \lambda_1 \big\| \mathbf{c}^l_t - \boldsymbol{\Gamma}^l_c \mathbf{c}^l_{t-1} \big\|_{\mathbf{W}} + \lambda_2 \big\| \mathbf{a}^l_t - \boldsymbol{\Gamma}^l_a \mathbf{a}^l_{t-1} \big\|_{\dot{\mathbf{W}}} \Big).
\]

Here we have introduced the linear dynamics $\boldsymbol{\Gamma}^l_c$ of the appearance coefficients, and the linear dynamics $\boldsymbol{\Gamma}^l_a$ of the motion parameters. The first term $E_{wereg}$ expresses a data conservation term, while the second term introduces a temporal smoothness constraint into the model. The addition of this dynamical information will act as a regularization term to prefer smooth solutions of the appearance and motion parameters. However, due to the coupling, a more efficient technique than IRLS will be a normalized gradient descent [13].
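The two closed-form appearance updates listed in the algorithm of Section 4.2 can be sketched as follows for a single region, with the motion held fixed (the patches are assumed to be already warped) and the weights held fixed. This is a simplified illustration under those assumptions, not the full procedure: it omits the motion update, the re-estimation of the weights and scale, the 85% energy truncation and the pyramid. Function names and the random stand-in data are hypothetical.

```python
import numpy as np

def update_coefficients(B, Dw, W):
    """Solve (B^T W_t B) c_t = B^T W_t d_t for every frame t."""
    k, T = B.shape[1], Dw.shape[1]
    C = np.empty((k, T))
    for t in range(T):
        Bw = B * W[:, t][:, None]            # W_t B, with W_t = diag(W[:, t])
        C[:, t] = np.linalg.solve(B.T @ Bw, Bw.T @ Dw[:, t])
    return C

def update_bases(C, Dw, W):
    """Solve (C W^p C^T) b^p = C W^p d^p for every pixel p (row of B)."""
    d, k = Dw.shape[0], C.shape[0]
    B = np.empty((d, k))
    for p in range(d):
        Cw = C * W[p, :][None, :]            # C W^p, with W^p = diag(W[p, :])
        B[p, :] = np.linalg.solve(Cw @ C.T, Cw @ Dw[p, :])
    return B

rng = np.random.default_rng(0)
d, T, k = 15, 12, 2
Dw = rng.standard_normal((d, T))             # stand-in for already-warped patches
W = rng.uniform(0.1, 1.0, size=(d, T))       # stand-in for fixed IRLS weights
B = rng.standard_normal((d, k))              # random initial bases (assumption)

costs = []
for _ in range(10):
    C = update_coefficients(B, Dw, W)
    B = update_bases(C, Dw, W)
    costs.append(float(np.sum(W * (Dw - B @ C) ** 2)))
print([round(c, 4) for c in costs])
```

Because each step minimizes the weighted cost exactly over one block of variables, the recorded cost cannot increase from one sweep to the next, which mirrors the monotonic reduction noted in the text; with all weights equal to one, the coefficient update reduces to the ordinary least-squares projection.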

5. Experiments and applications

5.1. Automatic learning of eigeneyes

Eyes are one of the key elements in vision-based human-computer interaction. Tracking the eye is a difficult task because the image changes are due not solely to motion but also to appearance change [4,18,19,26,40]. In this experiment, we automatically learn a person-specific eigeneye model without any manual cropping, except in the first image. We assume that during the training process the person does not move far from the position in the first frame (around 5–8 pixels). However, it is not reasonable to assume that the person is absolutely still during the training session.

Recall that Fig. 1 illustrates the eigen-registration method and shows a few images from a training set. In the first frame, we manually select the mask for the eyes, face, and background (the regions are shown in Fig. 3). In this case, because we are assuming a small motion, the GA has not been applied for initializing the algorithm, and we minimize Eq. (4) with the proposed robust deterministic learning method, using a coarse-to-fine strategy (2 levels) over the entire training set (around 300 frames). We have presupposed that the data had few outliers, so we give $\sigma$ a high value. The results are shown in Fig. 2; see Section 1 for more details.

Fig. 4 shows the normalized reconstruction error of the eye for the original training set $D$ and the aligned training set $D_w$ with the same number of bases. The normalized reconstruction error for image $i$ is $r_i = \|d_i - B c_i\| / \|d_i\|$. The reconstruction error resulting from our method (solid line) is compared with that of standard PCA (dotted line). Once the eigeneyes have been learned, tracking can be performed with deterministic techniques [4,11] or stochastic ones [19]. Applications to driver fatigue detection are being explored [19].

Fig. 4. Normalized reconstruction error of the right eye for the experiment 5.2 versus the number of frames. See text for details about the normalized error.

5.2. Automatic face learning

In this experiment, we explore the possibility of learning the entire face model, including modeling mouth changes. The modular face model is composed of four regions (see Fig. 3). Some frames of the sequence ( pixels and 320 frames) are shown in Fig. 5a. In this sequence, the person can suddenly move more than 20 pixels from frame to frame, along with large scale and rotation changes. In this case we make use of the stochastic initialization with the GA for an initial estimation of the parameters. Fig. 5b shows the normalized face (w.r.t. the first frame) reconstructed with the learned bases after the convergence of the algorithm. Recall that we have only initialized the regions in the first image and no previous appearance model was given. Notice that the reconstructed images in the (b) rows are stabilized, indicating that the affine transformation from the input images to the learned eigenspaces has been accurately recovered. The faces in Fig. 5b display variations due simply to appearance (expression) and not to motion. In this case we preserve 85% of the energy in each modular eigenspace. At this point, it is interesting to observe that ME achieves better compression factors than the regular eigenspace for the same number of parameters. Each face image (Fig. 5b) can be reconstructed with 23 parameters, and further work needs to be done to determine the viability of this model for applications such as videoconferencing.

Note also that these figures show the results for automatic registration and learning with respect to the training data. For videoconferencing (or similar) applications, where one needs to track and reconstruct the appearance and motion of the face, one needs to solve for the transformation between the model and the data to be reconstructed.
This is the eigentracking problem addressed in [4].

5.3. Virtual avatars

In this experiment we animate one face given another using PSFAMs. In general it is hard to model and animate faces, and often complex models encoding the underlying physical musculature of the face are used (e.g., the Candide model [23]). Here we learn the PSFAM of two people with the parameterized component analysis introduced in this paper. Then, we manually select all pairs of corresponding images which share a common emotional state, i.e., we associate the face regions with equal expression content, and collect two training sets $D$ and $\hat{D}$ (for more information see [14]), one training set for each person. Once we have $D$ and $\hat{D}$, we use the recently proposed asymmetric coupled component analysis (ACCA) [14] to learn the relationship between these two sets, and predict one from the other. Fig. 6 shows frames of a virtual female face animated by the appearance of the input male face. The first column shows the original input stream ($\hat{D}$); the second one ($D$) is the result of animating the face with ACCA plus the affine motion of the head. As we can observe, this approach allows us to model the rich texture present on the face, providing fairly realistic animations.

Fig. 5. (a) Original image sequence; (b) reconstructed normalized face.
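The normalized reconstruction error of Section 5.1 and the 85% energy criterion of Section 5.2 can be made concrete with a minimal sketch. It uses a plain SVD-based subspace without mean subtraction, not the robust parameterized method of this paper, and the function names are hypothetical. Images are assumed to be stacked as the columns of a matrix, and no column is assumed to be all zeros.

```python
import numpy as np

def basis_for_energy(D, energy=0.85):
    """Leading left singular vectors of D (pixels x frames) that retain
    at least `energy` of the total squared singular values."""
    U, s, _ = np.linalg.svd(np.asarray(D, dtype=float), full_matrices=False)
    cumulative = np.cumsum(s ** 2) / np.sum(s ** 2)
    # First index whose cumulative energy reaches the threshold, as a count
    k = min(int(np.searchsorted(cumulative, energy)) + 1, len(s))
    return U[:, :k]

def normalized_reconstruction_error(D, B):
    """r_i = ||d_i - B c_i|| / ||d_i|| for every column d_i of D, with
    c_i = B^T d_i (the least-squares coefficients when B is orthonormal)."""
    D = np.asarray(D, dtype=float)
    C = B.T @ D
    residual = D - B @ C
    return np.linalg.norm(residual, axis=0) / np.linalg.norm(D, axis=0)
```

Computing this error with the same number of bases for the unaligned and the aligned training sets gives the kind of comparison plotted in Fig. 4.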


Fig. 6. (a) Original face; (b) animated virtual face.

6. Discussion and future work

This paper has introduced robust parameterized component analysis to learn modular subspaces that are invariant to various geometric transformations. The robust formulation of the problem extends previous work and has proven effective for learning low dimensional models of human faces. In particular, we have shown how the method can simultaneously construct an eigenspace while aligning unregistered training images. The learned eigenspace captures the motion-invariant appearance variation in the training data, and the method can be applied to arbitrary parameterized deformations.

Due to the complexity of the objective function, a stochastic initialization of the algorithm has proven to be essential for avoiding local minima. Since the final solution is sensitive to the initialization from the genetic algorithm, one extension of the work here would be to take multiple initial estimates from the stochastic initialization, solve for the bases, and then perform model selection. We are exploring another extension to the optimization technique that incrementally aligns the training images with an increasing number of bases (e.g., beginning with the bases corresponding to 40% of the energy and successively increasing it until 85%). Intuitively, this would first align the data w.r.t. the most common features and later w.r.t. the more detailed ones.

While our parameterized component analysis method is a general technique for learning linear subspaces, here we have illustrated it with examples from face modeling. In particular, we have illustrated the method in the context of 2D PSFAMs and have presented several applications of these models. Observe that parameterized component analysis always improves the quality of the appearance basis if some misalignment exists in the training set (due to manual cropping, motion of the person, etc.). Although we have presented a method for learning PSFAMs, the method can also be useful for improving the basis of a training set containing faces from different people.

As described here, the method is appropriate for learning appearance models in an off-line process. The method could be extended to on-line learning by simply replacing the closed form solution with a gradient descent algorithm or any adaptive method. Based on the recent extension of EigenTracking [4] to deal with Support Vector Machines [1], it would also be interesting and quite straightforward to consider extending our method to other statistical learning techniques like SVM, independent component analysis, etc.

Modeling the face with modular eigenspaces coupled by the motion can result in the loss of correlations between the parts (e.g., when smiling, some wrinkles appear in the eye region). We are now working on modeling the face with symmetric coupled component analysis [14] and are experimenting with hierarchical component analysis, in which one set of coefficients models the coupling between regions while each individual region has its own local variation.

Finally, the work presented in this paper on automatic learning of 2D PSFAMs has the limitation of being applicable to some particular view of the face, in this case the frontal view. However, it is likely that in many real applications the head will undergo 3D motions resulting in changes to the spatial domain of the facial eigenspaces. An extension to model 3D changes is needed. We are working on extending the PSFAM to model 3D changes by incorporating shape information. This can be done using the same continuous optimization techniques described here [11,18].
Videos with the results for all the experiments performed in this paper can be downloaded from http://

Acknowledgments

The first author has been partially supported by the 2001BEAI Grant of the Direcció General de Recerca of the Generalitat of Catalunya. The second author was partially supported by the DARPA HumanID project (ONR Contract N ) and a gift from the Xerox Foundation. We thank Allan Jepson for discussions on robust PCA and eigen-registration.

References

[1] S. Avidan, Support vector tracking, in: Conf. on Computer Vision and Pattern Recognition.
[2] J.R. Bergen, P. Anandan, K.J. Hanna, R. Hingorani, Hierarchical model-based motion estimation, in: European Conf. on Computer Vision, 1992.
[3] M. Betke, N. Makris, Fast object recognition in noisy images using simulated annealing, in: Internat. Conf. on Computer Vision, 1994.

[4] M.J. Black, A.D. Jepson, Eigentracking: Robust matching and tracking of objects using view-based representation, Internat. J. Comput. Vision 26 (1) (1998).
[5] M.J. Black, Y. Yacoob, Recognizing facial expressions in image sequences using local parameterized models of image motion, Internat. J. Comput. Vision 25 (1) (1997).
[6] A. Blake, M. Isard, Active Contours, Springer, Berlin.
[7] A. Blake, A. Zisserman, Visual Reconstruction, MIT Press, Cambridge, MA.
[8] M. La Cascia, S. Sclaroff, Fast, reliable tracking under varying illumination, in: Conf. on Computer Vision and Pattern Recognition, 1999.
[9] H. Chui, A. Rangarajan, A new algorithm for non-rigid point matching, in: IEEE Conf. on Computer Vision and Pattern Recognition, 2000.
[10] R. Cipolla, A. Pentland, Computer Vision for Human Machine Interaction, Cambridge University Press, Cambridge.
[11] T.F. Cootes, G.J. Edwards, C.J. Taylor, Active appearance models, in: European Conf. on Computer Vision, 1998.
[12] T.F. Cootes, C.J. Taylor, Statistical models of appearance for computer vision, in: World Wide Web Publication, February, Available from <http://
[13] F. de la Torre, Automatic learning of appearance face models, in: Second Internat. Workshop on Recognition, Analysis and Tracking of Faces and Gestures in Real-time Systems, 2001.
[14] F. de la Torre, M.J. Black, Dynamic coupled component analysis, Comput. Vision Pattern Recognition (2001).
[15] F. de la Torre, M.J. Black, A framework for robust subspace learning, Int. J. Comput. Vision 54 (2003).
[16] F. de la Torre, M.J. Black, Robust principal component analysis for computer vision, in: Internat. Conf. on Computer Vision, 2001.
[17] F. de la Torre, S. Gong, S. McKenna, View alignment with dynamically updated affine tracking, in: Internat. Conf. on Automatic Face and Gesture Recognition, 1998.
[18] F. de la Torre, J. Vitria, P. Radeva, J. Melenchon, Eigenfiltering for flexible eigentracking, in: Internat. Conf. on Pattern Recognition, Barcelona, 2000.
[19] F. de la Torre, Y. Yacoob, L. Davis, A probabilistic framework for rigid and non-rigid appearance based tracking and recognition, in: Internat. Conf. on Automatic Face and Gesture Recognition, 2000.
[20] K.I. Diamantaras, Principal Component Neural Networks (Theory and Applications), Wiley, New York.
[21] C. Eckart, G. Young, The approximation of one matrix by another of lower rank, Psychometrika 1 (1936).
[22] G.J. Edwards, C.J. Taylor, T.F. Cootes, Improving identification performance by integrating evidence from sequences, Comput. Vision Pattern Recognition (1999).
[23] P. Eisert, B. Girod, Model-based estimation of facial expression parameters from image sequences, in: Internat. Conf. on Image Processing, 1997.
[24] B.J. Frey, N. Jojic, Transformation-invariant clustering using the EM algorithm, IEEE Trans. Pattern Anal. Machine Intell. 25 (1) (2003).
[25] S. Geman, D. McClure, Statistical methods for tomographic image reconstruction, Bull. Int. Statist. Inst. LII (1987) 4–5.
[26] S. Gong, S. McKenna, A. Psarrou, Dynamic Vision: From Images to Face Recognition, Imperial College Press.
[27] G. Hager, P. Belhumeur, Efficient region tracking with parametric models of geometry and illumination, IEEE Trans. Pattern Anal. Machine Intell. 20 (10) (1998).
[28] F. Hampel, E. Ronchetti, P. Rousseeuw, W. Stahel, Robust Statistics: The Approach Based on Influence Functions, Wiley, New York.
[29] A. Hill, C.J. Taylor, A.D. Brett, A framework for automatic landmark identification using a new method of nonrigid correspondence, Pattern Anal. Machine Intell. 3 (22) (2000).
[30] P.W. Holland, R.E. Welsch, Robust regression using iteratively reweighted least-squares, Commun. Statist. A (6) (1977).

[31] T. Jebara, K. Russell, A. Pentland, Mixtures of eigenfeatures for real-time structure from texture, in: Internat. Conf. on Computer Vision.
[32] T.F. Cootes, K.N. Walker, C.J. Taylor, Determining correspondences for statistical models of appearance, in: European Conf. on Computer Vision, 2000.
[33] A. Lanitis, A. Hill, T.F. Cootes, C.J. Taylor, Locating facial feature using genetic algorithms, in: Internat. Conf. on Digital Signal Processing, 1995.
[34] A. Lanitis, C. Taylor, T. Cootes, Automatic interpretation and coding of face images using flexible models, IEEE Trans. Pattern Anal. Machine Intell. 19 (7) (1997).
[35] G. Li, Robust regression, in: D.C. Hoaglin, F. Mosteller, J.W. Tukey (Eds.), Exploring Data, Tables, Trends and Shapes, Wiley, New York.
[36] E.D. Mandel, P.S. Penev, Facial feature tracking and pose estimation in video sequences by factorial coding of the low-dimensional entropy manifolds due to the partial symmetries of faces, in: IEEE ICASSP, vol. IV, 2000.
[37] A.M. Martínez, Recognizing imprecisely localized, partially occluded and expression variant faces from a single sample per class, IEEE Trans. Pattern Anal. Machine Intell. 24 (6) (2002).
[38] M. Mitchell, An Introduction to Genetic Algorithms, MIT Press, Cambridge, MA.
[39] B. Moghaddam, Principal manifolds and bayesian subspaces for visual recognition, in: Seventh Internat. Conf. on Computer Vision, 1999.
[40] B. Moghaddam, A. Pentland, Probabilistic visual learning for object representation, Pattern Anal. Machine Intell. 19 (7) (1997).
[41] S.K. Nayar, T. Poggio, Early Visual Learning, Oxford University Press, Oxford.
[42] P.S. Penev, Local feature analysis: a statistical theory for information representation and transmission, Ph.D. Thesis, The Rockefeller University.
[43] P.S. Penev, J.J. Atick, Local feature analysis: a general statistical theory for object representation, Network: Comput. Neural Syst. 7 (3) (1996).
[44] A. Pentland, B. Moghaddam, T. Starner, View-based and modular eigenspaces for face recognition, in: IEEE Conf. on Computer Vision and Pattern Recognition, 1994.
[45] R.P.N. Rao, Development of localized oriented receptive fields by learning a translation-invariant code for natural images, Network: Comput. Neural Syst. 9 (1998).
[46] S. Romdhani, S. Gong, A. Psarrou, Multi-view nonlinear active shape model using kernel PCA, in: British Machine Vision Conf., 1999.
[47] H. Schweitzer, Optimal eigenfeature selection by optimal image registration, in: Conf. on Computer Vision and Pattern Recognition, 1999.
[48] L. Sirovich, M. Kirby, Low-dimensional procedure for the characterization of human faces, J. Opt. Soc. Am. A 4 (3) (1987).
[49] M. Turk, A. Pentland, Eigenfaces for recognition, J. Cognitive Neurosci. 3 (1) (1991).
[50] T. Vetter, N.F. Troje, Separation of texture and shape in images of faces for image coding and synthesis, J. Opt. Soc. Am. A (14) (1997).
[51] L. Wiskott, J.M. Fellous, N. Krüger, C. von der Malsburg, Face recognition by elastic bunch graph matching, PAMI 19 (7) (1997).
[52] M. Yang, D. Kriegman, N. Ahuja, Detecting faces in images: a survey, IEEE Trans. Pattern Anal. Machine Intell. (2001).
