Video-Based Face Recognition Using Probabilistic Appearance Manifolds

Kuang-Chih Lee, Jeffrey Ho, Ming-Hsuan Yang, David Kriegman
klee10@uiuc.edu, jho@cs.ucsd.edu, myang@honda-ri.com, kriegman@cs.ucsd.edu

Computer Science, University of Illinois, Urbana-Champaign, Urbana, IL 61801
Computer Science & Engineering, University of California, San Diego, La Jolla, CA 92093
Honda Research Institute, 800 California Street, Mountain View, CA 94041

Abstract

This paper presents a novel method to model and recognize human faces in video sequences. Each registered person is represented by a low-dimensional appearance manifold in the ambient image space. The complex nonlinear appearance manifold is expressed as a collection of subsets (named pose manifolds) and the connectivity among them. Each pose manifold is approximated by an affine plane. To construct this representation, exemplars are sampled from videos, and these exemplars are clustered with a K-means algorithm; each cluster is represented as a plane computed through principal component analysis (PCA). The connectivity between the pose manifolds encodes the transition probability between images in each of the pose manifolds and is learned from training video sequences. A maximum a posteriori formulation is presented for face recognition in test video sequences by integrating the likelihood that the input image comes from a particular pose manifold and the transition probability to this pose manifold from the previous frame. To recognize faces with partial occlusion, we introduce a weight mask into the process. Extensive experiments demonstrate that the proposed algorithm outperforms existing frame-based face recognition methods with temporal voting schemes.

1 Introduction

Face recognition has long been an active area of research, and numerous algorithms have been proposed over the years. However, most research has focused on recognizing faces from a single image. Face recognition using video presents various challenges and opportunities.
Typically, recognition using image sequences is done using a two-stage system: a tracking module and a recognition module. Given a video frame, the tracking module takes an estimate of the object's location in the previous frame and returns a subimage in the current frame that contains the object. The recognition module then operates on the subimage, perhaps integrating information or decisions from earlier frames.

In a video, head pose may vary significantly. Therefore, successful video-based face recognition must be able to classify faces with a range of image plane and 3-D orientations. In addition, a good recognition method should be robust to misalignment errors introduced by inaccuracies in the tracking module. Meanwhile, partial occlusion poses another serious challenge, and it is likely to occur at some instants in unconstrained applications such as vision-based human-computer interaction.

On the other hand, recognition in video offers the opportunity to integrate information temporally across the video sequence, which may help to increase recognition rates. Our framework exploits temporal coherence in the following ways. First, our proposed appearance model is composed of a collection of pose manifolds and a matrix of transition probabilities that connects them. The transition probabilities among the pose manifolds are learned from training videos; each one characterizes the probability of moving from one pose to another between any two consecutive frames. We use the transition probabilities to implicitly infer the appropriate pose for each incoming video frame, and then integrate this information by Bayes' rule to perform face recognition. Therefore, our method effectively captures the dynamics of pose changes and thereby exploits the temporal information in a video sequence for recognition. Second, we use consecutive frames to define a mask whose elements represent the probability that a pixel corresponds to an occlusion. The mask is iteratively updated by analyzing the difference between the observed image at each time instance and the reconstructed image predicted from the previous frame.
We have implemented the proposed method and evaluated it with numerous experiments. The experimental results show that our method is effective in recognizing faces in videos containing large variation of head motion as well as partial occlusions.

This paper is organized as follows. We briefly summarize the related literature which motivates this work in Section 2. In Section 3, we detail our algorithms and contrast them with other existing work. Numerous experiments on a large and rather difficult data set are presented in Section 4. We conclude with remarks and future work in Section 5.

2 Related Work

Most research in the literature concentrates on representation and classification methods for recognizing faces in still, and oftentimes single, images [4, 24, 30]. Although numerous face recognition algorithms operate on image sequences, they typically use temporal voting to improve identification rates [12, 26, 28]. We also note that several algorithms aim to extract 2-D or 3-D face structure from video sequences for recognition and animation [5, 14, 6, 7, 8, 9, 29, 11, 23]. However, these methods require meticulous procedures to build 2-D or 3-D models, and do not fully exploit temporal information for recognition.

Among the few attempts to truly utilize temporal information for face recognition in image sequences rather than simple voting, Li et al. presented a method to construct identity surfaces using shape and texture models as well as kernel feature extraction algorithms [16]. This approach estimates pose angle first in order to select an appropriate shape model for tracking and recognition. However, it does not fully take advantage of coherence information between consecutive frames except for a weighted temporal voting scheme to fit model parameters.

Zhou and Chellappa [31] proposed a generic framework to track and recognize human faces simultaneously by adding an identity variable to the state vector in the sequential importance sampling method. They then marginalize over all state vectors to yield an estimate of the posterior probability of the identity variable. Though this probabilistic approach aims to integrate motion and identity information over time, it nevertheless considers only identity consistency in the temporal domain and thus may not work well when the target is partially occluded. Furthermore, it is not clear how one can extend this work to deal with large 3-D pose variation.

Krueger and Zhou [15] applied an on-line version of radial basis functions to select representative face images as exemplars from training videos, which in turn facilitates tracking and recognition tasks.
The state vector in this method consists of affine parameters as well as an identity variable, and the state transition probability is learned from affine transformations of exemplars from training videos in a way similar to [27]. Since only 2-D affine transformations are considered, this model is effective in capturing small 2-D motion but may not deal well with large 3-D pose variation or occlusion.

Recently, Li et al. [17] applied piecewise linear models to capture local motion and a transition matrix among these models to describe nonlinear global dynamics. They applied the learned local linear models and their dynamic transitions to synthesize new motion video such as choreography. Our work bears some resemblance to their method in the sense that both methods utilize local linear models, something advocated in several prior works [3, 1, 19], and both learn the relationships among these models [13, 20, 21, 25]. However, in this paper we consider propagating the probabilistic likelihood of the linear models through the transition matrix (i.e., utilizing temporal information) to recognize human identity. Furthermore, we exploit the information learned in the local models and transition matrix to infer missing data in recognizing partially occluded faces.

3 Probabilistic Appearance Manifold

Consider a recognition problem with $N$ objects where the images of an object are acquired by varying the viewpoint. It is well understood that the set of images of an object under varying viewing conditions can be treated as a low-dimensional manifold in the image space, as demonstrated in the parametric appearance manifold work [19] or the view-based eigenspace approach [22]. The recognition task is straightforward if the appearance manifold $M_k$ for each individual $k$ is known: for a test image $I$, the identity $k^*$ can be determined by finding the manifold $M_k$ with minimal distance to $I$, i.e.,

\[ k^* = \arg\min_k d_H(I, M_k). \tag{1} \]

Here, $d_H$ denotes the $L_2$ Hausdorff distance between the image $I$ and $M_k$. Let $x \in M_k$ denote a point on a manifold $M_k$ where $\dim(M_k) \ll \dim(I)$.
Given a point $x \in M_k$, let the corresponding reconstructed face image be denoted $\hat{I}_x$, where $\dim(I) = \dim(\hat{I}_x)$. If $x^*$ is the point on $M_k$ at minimal $L_2$ distance to $I$, then $d_H(I, M_k) = d(I, x^*)$ where $d(\cdot, \cdot)$ denotes the $L_2$ distance. Alternatively, $x^*$ can be regarded as the result of some nonlinear projection of $I$ onto $M_k$.

Figure 1: Appearance manifold. A complex and nonlinear manifold can be approximated as the union of several simpler pose manifolds; here, each pose manifold is represented by a PCA plane. (The diagram shows an image $I$, its closest point $x$ on $M_k$ at distance $d_H(M_k, I)$, and the pose manifolds $C^{k1}, \ldots, C^{k6}$.)

Probabilistically, Equation 1 is the result of defining the conditional probability $p(k \mid I)$ as

\[ p(k \mid I) = \frac{1}{\Lambda} \exp\!\left(-\frac{1}{\sigma^2}\, d_H^2(I, M_k)\right), \tag{2} \]

where $\Lambda$ is a normalization term, and for a given image $I$

\[ k^* = \arg\max_k p(k \mid I). \tag{3} \]

In order to implement this recognition scheme, one must be able to estimate the projected point $x^* \in M_k$, and then the image-to-model distance $d_H(I, M_k)$ can be computed for a given $I$ and for each $M_k$. However, such distances can be computed accurately only if $M_k$ is known exactly. In our case, $M_k$ is usually not known and can only be approximated with samples. The main part of our algorithm is to provide a probabilistic framework for estimating $x^*$ and $d_H(x^*, I)$. Note that if we define the conditional probability $p_{M_k}(x \mid I)$ to be the probability that, among points on $M_k$, $\hat{I}_x$ has the smallest $L_2$ distance to $I$, then

\[ d_H(I, M_k) = \int_{M_k} d(x, I)\, p_{M_k}(x \mid I)\, dx, \tag{4} \]

and Equation 1 is equivalent to

\[ k^* = \arg\min_k \int_{M_k} d(x, I)\, p_{M_k}(x \mid I)\, dx. \tag{5} \]

This formulation shows that $d_H(I, M_k)$ can be viewed as the expected distance between a single image frame $I$ and a complex appearance manifold $M_k$. Clearly, if $M_k$ were fully known or well approximated (e.g., described by some algebraic equations), then $p_{M_k}(x \mid I)$ could be treated as a $\delta$ function at the set of points with minimal distance to $I$. When sufficiently many samples are drawn from $M_k$, the expected distance $d_H(I, M_k)$ will be a good approximation of the true distance. The reason is that $p_{M_k}(x \mid I)$ in the integrand of Equation 4 will approach a delta function with its energy concentrated on the set of points with minimal distance to $I$. In our case, $M_k$ is, at best, approximated through a sparse set of samples, and so we model $p_{M_k}(x \mid I)$ with a Gaussian distribution.

Since the appearance manifold $M_k$ is complex and nonlinear, it is reasonable to decompose $M_k$ into a collection of $m$ simpler disjoint manifolds, $M_k = C^{k1} \cup \cdots \cup C^{km}$, where $C^{ki}$ is called a pose manifold. Each pose manifold is further approximated by an affine plane computed through principal component analysis (called a PCA plane). We define the conditional probability $p(C^{ki} \mid I)$ for $1 \le i \le m$ as the probability that $C^{ki}$ contains a point $x$ with minimal distance to $I$.
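As a concrete illustration of the PCA-plane approximation, the following minimal sketch fits an affine plane to samples of one pose manifold and measures the $L_2$ distance from an image to that plane. The function names (`fit_pca_plane`, `distance_to_plane`), the synthetic data, and the plane dimension of 2 are assumptions made for illustration only; they are not taken from the paper.

```python
import numpy as np

def fit_pca_plane(samples, dim):
    """Fit an affine PCA plane to row-vector samples.

    Returns the sample mean (the plane's origin) and an orthonormal
    basis whose rows are the top `dim` principal directions.
    """
    mean = samples.mean(axis=0)
    # Rows of vt are the principal directions of the centered data.
    _, _, vt = np.linalg.svd(samples - mean, full_matrices=False)
    return mean, vt[:dim]

def distance_to_plane(image, mean, basis):
    """L2 distance from a vectorized image to the affine plane,
    together with the reconstructed (projected) image."""
    coeffs = basis @ (image - mean)
    reconstruction = mean + basis.T @ coeffs
    return float(np.linalg.norm(image - reconstruction)), reconstruction

# Hypothetical data: 50 samples on a 2-D plane in a 361-D image space.
rng = np.random.default_rng(0)
origin = rng.normal(size=361)
directions = rng.normal(size=(2, 361))
samples = origin + rng.normal(size=(50, 2)) @ directions
mean, basis = fit_pca_plane(samples, dim=2)

# A point on the plane, and the same point pushed off the plane.
on_plane = origin + np.array([0.3, -1.2]) @ directions
offset = rng.normal(size=361)
offset -= basis.T @ (basis @ offset)  # keep only the part orthogonal to the plane
dist_on, _ = distance_to_plane(on_plane, mean, basis)
dist_off, _ = distance_to_plane(on_plane + offset, mean, basis)
```

In this sketch the distance of an on-plane point is numerically zero, and the distance of the displaced point equals the norm of the orthogonal displacement, which is the behavior expected of $d_H(I, L^{ki})$ for a plane.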
Since $p_{M_k}(x \mid I) = \sum_{i=1}^{m} p(C^{ki} \mid I)\, p_{C^{ki}}(x \mid I)$, we have

\[
\begin{aligned}
d_H(I, M_k) &= \int_{M_k} d(x, I)\, p_{M_k}(x \mid I)\, dx \\
&= \sum_{i=1}^{m} p(C^{ki} \mid I) \int_{C^{ki}} d(x, I)\, p_{C^{ki}}(x \mid I)\, dx \\
&= \sum_{i=1}^{m} p(C^{ki} \mid I)\, d_H(I, C^{ki}).
\end{aligned} \tag{6}
\]

Figure 2: Difficulty of frame-based recognition. The two solid curves denote two different appearance manifolds, $M_A$ and $M_B$. It is difficult to reach a decision on the identity from frame $I_{t-3}$ to frame $I_t$ because these frames have smaller $L_2$ distance to appearance manifold $M_A$ than to $M_B$. However, by looking at the sequence of images $I_{t-6}, \ldots, I_{t+3}$, it is apparent that the sequence has most likely originated from appearance manifold $M_B$.

The above equation shows that the expected distance $d_H(I, M_k)$ can also be treated as the average expected distance between $I$ and each pose manifold $C^{ki}$. In addition, this equation transforms the integral into a finite summation which is feasible to compute numerically.

For face recognition from video sequences, we can exploit temporal coherence between consecutive image frames. As shown in Figure 2, the $L_2$ norm may occasionally be misleading during recognition. But if we consider previous frames in an image sequence rather than just one, then the set of closest points $x^*$ will trace a curve on a pose manifold. In our framework, this is embodied by the term $p(C^{ki} \mid I)$ in Equation 6. In Section 3.1, we apply Bayesian inference to incorporate temporal information, which provides a better estimate of $p(C^{ki} \mid I)$, and thus of $d_H(I, M_k)$, to achieve better recognition performance.

3.1 Computing $p(C^{ki}_t \mid I_t)$

For recognition from a video sequence, we need to estimate $p(C^{ki}_t \mid I_t)$ for each $i$ at time $t$. To incorporate temporal information, $p(C^{ki}_t \mid I_t)$ should be taken as the joint conditional probability $p(C^{ki}_t \mid I_t, I_{0:t-1})$ where $I_{0:t-1}$ denotes the frames from the beginning up to time $t-1$. We further assume that $I_t$ and $I_{0:t-1}$ are independent given $C^{ki}_t$, and that $C^{ki}_t$ and $I_{0:t-1}$ are independent given $C^{kj}_{t-1}$. Using Bayes' rule we have the following recursive formulation:

\[
\begin{aligned}
p(C^{ki}_t \mid I_t, I_{0:t-1}) &= \alpha\, p(I_t \mid C^{ki}_t, I_{0:t-1})\, p(C^{ki}_t \mid I_{0:t-1}) \\
&= \alpha\, p(I_t \mid C^{ki}_t) \sum_{j=1}^{m} p(C^{ki}_t \mid C^{kj}_{t-1}, I_{0:t-1})\, p(C^{kj}_{t-1} \mid I_{0:t-1}) \\
&= \alpha\, p(I_t \mid C^{ki}_t) \sum_{j=1}^{m} p(C^{ki}_t \mid C^{kj}_{t-1})\, p(C^{kj}_{t-1} \mid I_{t-1}, I_{0:t-2}),
\end{aligned} \tag{7}
\]

where $\alpha$ is a normalization term to ensure a proper probability distribution.

The temporal dynamics of the video sequence are captured by the transition probability between the manifolds, $p(C^{ki}_t \mid C^{kj}_{t-1})$. Note that $p(C^{ki}_t \mid C^{kj}_{t-1})$ is the probability of $x^*_t \in C^{ki}$ given $x^*_{t-1} \in C^{kj}$. For two consecutive frames $I_{t-1}$ and $I_t$, because of temporal coherency, we expect that their projected points $x^*_{t-1}$ and $x^*_t$ should have small geodesic distance on $M_k$ (see Figure 2). That is, the transition probability $p(C^{ki}_t \mid C^{kj}_{t-1})$ is related implicitly to the geodesic distance between $C^{ki}$ and $C^{kj}$.

Figure 3: Dynamics among pose manifolds. The dynamics among the pose manifolds are learned from training videos and describe the probability of moving from one manifold to another at any time instance. (The diagram shows pose manifolds $C^{k1}$, $C^{k2}$, $C^{k3}$ on $M_k$ with transitions $P(C^{k1} \mid C^{k2})$ and $P(C^{k2} \mid C^{k3})$.)

3.2 Learning Manifolds and Dynamics

For each person $k$, we collect at least one video sequence containing $l$ consecutive images $S_k = \{I_1, \ldots, I_l\}$. We further assume that each training image is a fair sample drawn from the appearance manifold $M_k$. There are three steps in the algorithm. We first partition these samples into $m$ disjoint subsets $\{S_{k1}, \ldots, S_{km}\}$. Each collection $S_{ki}$ can be considered as containing points drawn from some pose manifold $C^{ki}$ of $M_k$, and from the images in $S_{ki}$ we construct a linear approximation to the pose manifold $C^{ki}$ of the true manifold $M_k$. After all the $C^{ki}$ have been computed, we estimate the transition probabilities $p(C^{ki} \mid C^{kj})$ for $i \neq j$.

In the first step, we apply a K-means clustering algorithm to the set of images in the video sequences. We initialize $m$ seeds by finding $m$ frames from the training videos with the largest $L_2$ distance to each other. Then the general K-means algorithm is used to assign images to the $m$ clusters. As our goal in performing clustering is to approximate the data set rather than to derive semantically meaningful cluster centers, it is worth noting that the resulting clusters are no worse than twice what the optimal centers would give, if the optimal centers could be easily found [10].
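The clustering step can be sketched as follows. The text does not say how the $m$ mutually distant seed frames are found, so this sketch assumes a greedy farthest-first traversal as the seeding heuristic; the function names `farthest_first_seeds` and `kmeans`, the iteration cap, and the synthetic data are likewise illustrative assumptions.

```python
import numpy as np

def farthest_first_seeds(points, m):
    """Greedily pick m seed indices that are far apart in L2 distance.

    Assumption: greedy farthest-first traversal stands in for
    "m frames with the largest L2 distance to each other".
    """
    # Arbitrary starting choice: the point farthest from the data mean.
    first = int(np.argmax(np.linalg.norm(points - points.mean(axis=0), axis=1)))
    seeds = [first]
    min_dist = np.linalg.norm(points - points[first], axis=1)
    while len(seeds) < m:
        nxt = int(np.argmax(min_dist))  # farthest from all chosen seeds
        seeds.append(nxt)
        min_dist = np.minimum(min_dist, np.linalg.norm(points - points[nxt], axis=1))
    return seeds

def kmeans(points, m, n_iter=50):
    """Plain K-means started from the farthest-first seeds."""
    centers = points[farthest_first_seeds(points, m)].copy()
    labels = np.zeros(len(points), dtype=int)
    for it in range(n_iter):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if it > 0 and np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
        for i in range(m):
            members = points[labels == i]
            if len(members) > 0:
                centers[i] = members.mean(axis=0)
    return labels, centers

# Hypothetical data: three well-separated groups of 30 "frames" each.
rng = np.random.default_rng(0)
frames = np.vstack([rng.normal(loc=c, scale=0.5, size=(30, 5)) for c in (0.0, 10.0, 20.0)])
labels, centers = kmeans(frames, m=3)
```

Each resulting label set plays the role of one subset $S_{ki}$; seeding from distant frames tends to place one seed in each distinct pose cluster before the K-means iterations begin.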
Second, for each $S_{ki}$ we obtain a linear approximation of the underlying subset $C^{ki} \subset M_k$ by computing a PCA plane $L^{ki}$ of fixed dimension for the images in $S_{ki}$. Since the PCA planes approximate the appearance manifold $M_k$, their dimension is the intrinsic dimension of $M_k$, and therefore all PCA planes $L^{ki}$ have the same dimension.

Finally, the transition probability $p(C^{ki} \mid C^{kj})$ is defined by counting the actual transitions between different $S_{ki}$ observed in the image sequence:

\[ p(C^{ki} \mid C^{kj}) = \frac{1}{\Lambda_{ki}} \sum_{q=2}^{l} \delta(I_{q-1} \in S_{ki})\, \delta(I_q \in S_{kj}), \tag{8} \]

where $\delta(I_q \in S_{kj}) = 1$ if $I_q \in S_{kj}$ and otherwise it is 0. The normalizing constant $\Lambda_{ki}$ ensures that

\[ \sum_{j=1}^{m} p(C^{ki} \mid C^{kj}) = 1, \tag{9} \]

where we set $p(C^{ki} \mid C^{ki})$ to a constant $\kappa$. A graphic representation of a transition matrix with $m = 5$ learned from a training video is depicted in Figure 4.

With $C^{ki}$ and its linear approximation $L^{ki}$ defined, we can define how $p(I \mid C^{ki})$ is calculated. We compute the $L_2$ distances $\hat{d}_{ki} = d_H(I, L^{ki})$ from $I$ to each $L^{ki}$. We treat $\hat{d}_{ki}$ as an estimate of the true distance from $I$ to $C^{ki}$, i.e., $d_H(I, C^{ki}) = d_H(I, L^{ki})$. Then $p(I \mid C^{ki})$ is defined as

\[ p(I \mid C^{ki}) = \frac{1}{\Lambda_{ki}} \exp\!\left(-\frac{1}{2\sigma^2}\, \hat{d}_{ki}^2\right) \tag{10} \]

with $\Lambda_{ki} = \sum_{i=1}^{m} \exp\!\left(-\frac{1}{2\sigma^2}\, \hat{d}_{ki}^2\right)$.

Notice that we use a non-compact subspace $L^{ki}$ to approximate a compact pose manifold $C^{ki}$. The infinite extent of $L^{ki}$ might be better captured by the underlying Gaussian, and similar work has been done by Moghaddam et al. [18]. However, our experiments show that the recognition result using this more elaborate algorithm is no better than the one proposed in this paper. This can be explained by the fact that although the linear subspaces are non-compact, the test images will almost always be drawn from a compact subset of the image space. This effect makes the subspaces functionally compact in our algorithm. In other words, the subspaces behave as if they had only finite extent.

3.3 Face Recognition from Video

Given an image $I_t$ from a video sequence, we compute for each person $k$ the distance $d_H(I_t, M_k)$ using Equation 6.
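Evaluating Equation 6 relies on two learned ingredients from Section 3.2: the transition matrix obtained by counting (Equations 8 and 9) and the pose likelihood (Equation 10). The sketch below is one possible reading, under stated assumptions: `T[i, j]` holds the probability of moving from pose `i` at frame `q-1` to pose `j` at frame `q`; the diagonal is fixed to `kappa` and the off-diagonal counts are rescaled to sum to `1 - kappa`; a small `eps` is added so that rows with no observed transitions remain valid. None of these details, nor the function names, are specified in the text.

```python
import numpy as np

def learn_transitions(labels, m, kappa=0.5, eps=1e-6):
    """Transition matrix from a sequence of per-frame pose labels.

    T[i, j]: probability of pose j at frame q given pose i at frame q-1.
    Assumptions: diagonal fixed to kappa, off-diagonal counts rescaled
    to sum to 1 - kappa, eps smoothing for unseen transitions (m >= 2).
    """
    counts = np.full((m, m), eps)
    for prev, cur in zip(labels[:-1], labels[1:]):
        counts[prev, cur] += 1.0
    np.fill_diagonal(counts, 0.0)
    T = (1.0 - kappa) * counts / counts.sum(axis=1, keepdims=True)
    np.fill_diagonal(T, kappa)
    return T

def pose_likelihoods(distances, sigma):
    """Normalized Gaussian likelihoods from distances to the m PCA planes."""
    d2 = np.asarray(distances, dtype=float) ** 2
    # Subtracting the minimum leaves the normalized result unchanged
    # and avoids underflow for large distances.
    w = np.exp(-(d2 - d2.min()) / (2.0 * sigma ** 2))
    return w / w.sum()

# Hypothetical label sequence over m = 3 poses.
pose_labels = [0, 0, 1, 1, 2, 1, 0, 0]
T = learn_transitions(pose_labels, m=3, kappa=0.5)
lik = pose_likelihoods([1.0, 2.0, 3.0], sigma=1.0)
```

In this toy sequence pose 1 is followed once by pose 0 and once by pose 2, so both transitions receive equal probability, while the never-observed jump from pose 0 to pose 2 stays close to zero.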
Note that $p(C^{ki}_t \mid I_t)$ has a temporal dependency, and it is computed recursively using Equation 7. Once all the $d_H(I_t, M_k)$ have been computed, the posterior $p(k \mid I_t)$ is computed by Equation 2 with an appropriate $\sigma$, and the human identity is decided by Equation 5.
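The per-frame recognition step can be sketched as below, assuming the distances from the current frame to every PCA plane are already available. The function name `recognize_frame`, the array layout, and the use of one shared $\sigma$ for both Equation 10 and Equation 2 are assumptions made for this illustration; `Ts[k, j, i]` is taken to mean the probability of pose `i` at time $t$ given pose `j` at time $t-1$ for person `k`.

```python
import numpy as np

def recognize_frame(dists, prev_posts, Ts, sigma):
    """One recognition step for N persons with m poses each.

    dists:      (N, m) distances from the frame to each PCA plane
    prev_posts: (N, m) pose posteriors from the previous frame
    Ts:         (N, m, m) with Ts[k, j, i] = P(pose i at t | pose j at t-1)
    Returns the identity posterior (N,) and the new pose posteriors (N, m).
    """
    d2 = dists ** 2
    # Pose likelihoods (Equation 10), normalized over poses.
    lik = np.exp(-(d2 - d2.min(axis=1, keepdims=True)) / (2.0 * sigma ** 2))
    lik /= lik.sum(axis=1, keepdims=True)
    # Recursive update (Equation 7): predict through the transitions, then weight.
    predicted = np.einsum('kj,kji->ki', prev_posts, Ts)
    posts = lik * predicted
    posts /= posts.sum(axis=1, keepdims=True)
    # Expected distance to each person's manifold (Equation 6).
    d_h = (posts * dists).sum(axis=1)
    # Identity posterior (Equation 2), normalized over persons.
    score = np.exp(-(d_h ** 2 - (d_h ** 2).min()) / sigma ** 2)
    return score / score.sum(), posts

# Hypothetical setup: 2 persons, 2 poses, "sticky" transitions.
sticky = np.array([[0.9, 0.1], [0.1, 0.9]])
Ts = np.stack([sticky, sticky])
uniform = np.full((2, 2), 0.5)

# Frame 1: person 0 is close to its first pose plane.
identity1, posts1 = recognize_frame(np.array([[1.0, 5.0], [3.0, 3.0]]), uniform, Ts, sigma=2.0)
# Frame 2: all distances are ambiguous, so only the pose history is informative.
identity2, posts2 = recognize_frame(np.array([[3.0, 3.0], [3.0, 3.0]]), posts1, Ts, sigma=2.0)
```

In the second frame the distances alone cannot separate the poses, yet the pose posterior for person 0 stays concentrated on the first pose because it is propagated through the transition matrix, which is the temporal effect the recursion is meant to capture.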

Figure 4: Graphic representation of a transition matrix learned from a training video. In this example, the appearance manifold is approximated by 5 pose subspaces. The reconstructed center image of each pose subspace is shown at the top row and column. The transition probability matrix is drawn as the $5 \times 5$ block diagram. A brighter block means a higher transition probability. It is easy to see that the frontal pose (pose 1) has a higher probability of changing to other poses; the right pose (pose 2) has almost zero probability of changing directly to the left pose (pose 3).

It is also worth mentioning that the proposed framework exploits the temporal coherence in the appearance of consecutive face images by integrating the manifold transition at the previous and current time instances. For face recognition with varying pose, our method ensures that the transitions between pose manifolds do not occur arbitrarily but rather in a constrained order. For example, the appearance of one person's face cannot change immediately from left profile to right profile in two consecutive frames; rather, it must pass through some intermediate pose or orientation (see Figure 6). This process can also be considered as putting a first-order Markov process or finite state machine over a piecewise linear structure. In contrast, a simple temporal voting scheme has been commonly adopted in most video-based face recognition methods [16] [26].

3.4 Recognizing Partially Occluded Faces

The formulation that exploits temporal information for recognition can be easily extended to deal with partial occlusion of a face by considering the previous frame as prior information. The original formulation for $d_H(C^{ki}, I_t)$ treats every pixel in image $I_t$ with equal weight, assuming that there is no occlusion anywhere in the image sequence. If we knew which pixels corresponded to occlusions, we would put lower weights on those pixels when computing $d_H(M_k, I_t)$. We introduce an image mask $W_t$, which encodes the probability that each pixel is occluded; $W_t$ has the same dimension as the image $I_t$, and its elements are initialized to 1, i.e., we assume that there is no occlusion at the first frame and no pixel is down-weighted. The distance $d_H(M_k, I_t)$ is then replaced by the weighted distance $d_H(M_k, W_t \odot I_t)$ where $\odot$ denotes element-by-element multiplication. Let $x^*$ be the weighted projection of $W_t \odot I_t$ on $M_k$. Starting from the estimate at the previous frame, $W_{t-1}$, the mask $W_t$ is updated in each frame $I_t$ by

\[ W_t^{(1)} = \exp\!\left(-\frac{1}{2\sigma^2}\, (\hat{I}_{x^*} - I_t) \odot (\hat{I}_{x^*} - I_t)\right) \tag{11} \]

in the first iteration. Alternatively, $W_t$ can be iteratively updated based on $W_t^{(1)}$ and $\hat{I}^{(1)}_{x^*}$ (i.e., the reconstructed image based on $W_t^{(1)}$ and $d_H(M_k, W_t^{(1)} \odot I_t)$):

\[ W_t^{(i+1)} = \exp\!\left(-\frac{1}{2\sigma^2}\, (\hat{I}^{(i)}_{x^*} - I_t) \odot (\hat{I}^{(i)}_{x^*} - I_t)\right) \tag{12} \]

until the difference between $W_t^{(i)}$ and $W_t^{(i-1)}$ is below a threshold value at the $i$-th iteration.

Figure 5: Top row: (left) an unoccluded face image, (center) a reconstructed image using the corresponding pose manifold, and (right) the corresponding mask. Bottom row: (left) a face image partially occluded by one hand, (center) a reconstructed image using the corresponding pose manifold, and (right) an updated mask.

Both the appearance manifold and the mask information at previous frames are utilized to estimate the current occlusion mask in the equations above. We first perform the weighted projection to find a reconstructed image using the corresponding pose manifold, and iteratively estimate the occluded areas in the current frame. Once we get an updated mask $W_t$ in frame $I_t$ by Equation 11, we evaluate Equation 6 for face recognition by replacing $d_H(C^{ki}, I_t)$ with $d_H(C^{ki}, W_t \odot I_t)$. Figure 5 shows an example where a face is partially occluded by an object (lower left). The reconstructed image using the corresponding pose manifold is shown in the lower center. The updated mask is shown in the lower right, where the values have been thresholded; a dark pixel denotes a high probability of occlusion. Note that the updated mask matches the occluded region reasonably well.
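The iterative mask update of Equations 11 and 12 can be sketched for a single PCA plane as follows. The text does not spell out how the weighted projection is computed, so this sketch assumes a weighted least-squares fit of the plane coefficients; the function names, the tolerance, the iteration cap, and the synthetic "image" are all illustrative assumptions.

```python
import numpy as np

def weighted_reconstruction(image, mean, basis, weights):
    """Weighted least-squares projection onto the plane mean + span(rows of basis).

    Assumption: "weighted projection" is read as minimizing the
    weight-scaled squared pixel residuals.
    """
    sw = np.sqrt(weights)
    A = basis.T * sw[:, None]          # (pixels, dim), rows scaled by sqrt(weight)
    b = (image - mean) * sw
    coeffs, *_ = np.linalg.lstsq(A, b, rcond=None)
    return mean + basis.T @ coeffs

def update_mask(image, mean, basis, prev_mask, sigma, tol=1e-3, max_iter=20):
    """Iterate reconstruction and mask update until the mask stops changing."""
    mask = prev_mask.copy()
    for _ in range(max_iter):
        recon = weighted_reconstruction(image, mean, basis, mask)
        residual = recon - image
        new_mask = np.exp(-(residual * residual) / (2.0 * sigma ** 2))
        converged = np.max(np.abs(new_mask - mask)) < tol
        mask = new_mask
        if converged:
            break
    return mask, recon

# Hypothetical 100-pixel image lying on a 3-D plane, with 20 pixels occluded.
rng = np.random.default_rng(0)
mean = rng.normal(size=100)
basis = np.linalg.qr(rng.normal(size=(100, 3)))[0].T   # orthonormal rows
clean = mean + basis.T @ np.array([1.0, -2.0, 0.5])
occluded = clean.copy()
occluded[:20] += 5.0                                    # simulated occlusion
mask, recon = update_mask(occluded, mean, basis, np.ones(100), sigma=1.0)
```

In this toy case the mask values drop close to 0 on the occluded pixels and stay close to 1 elsewhere, and the reconstruction approaches the unoccluded image, so the weighted distance is driven mainly by the visible pixels.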

The mask also predicts that several pixels are occluded though in fact they are not. This is caused by the disagreement between the input image and the reconstructed image. Nevertheless, the regions that matter most for recognition (i.e., the central face region and the occluded region) are weighted appropriately. Our experimental results, presented in the next section, also demonstrate that the mask scheme is effective in recognizing partially occluded faces.

4 Experiments and Results

We evaluated the proposed algorithm on two sets of videos: one without any occlusion and one with partial occlusion. The overall recognition rate in the experiments is defined as the number of frames where the identity is correctly recognized divided by the number of frames in all the test videos.

4.1 Number of Linear PCA Planes

[Plot: Recognition Rate (%) versus Number of PCA Planes; see Figure 7.]

Figure 6: Sample gallery videos used in the experiments. Note that the pose variation is rather large in this data set.

We performed numerous experiments and compared the proposed algorithm with other methods in the context of video-based recognition. Since there is no standard database that contains large 2-D and 3-D head rotation for video-based face recognition, we collected a set of 45 videos of 20 different people for experiments. (This data set will be made available to the vision community in the near future.) Each individual in our database has at least two videos, where each person moves in a different combination of 2-D and 3-D rotation, expression, and speed. Each video was recorded in an indoor environment and lasted for at least 20 seconds (with 30 color frames of $640 \times 480$ pixels per second). Some cropped frames from the videos are shown in Figure 6. A variant of the eigen-subspace tracker [2] was used to locate the face, and the results were inspected by humans. Each image was then downsampled to $19 \times 19$ pixels for computational efficiency.
To reduce the effect of misalignment caused by the tracker, we added small 2-D perturbations, including translation (within 2 pixels in all directions) and scaling (within a scale from 0.9 to 1.1), to enlarge the training sets before applying the proposed probabilistic algorithm.

Figure 7: Recognition rate vs. number of piecewise linear PCA planes of our method. It shows that the proposed method is rather robust to parameter selection (i.e., the number of pose manifolds used in approximating the appearance manifold).

We first evaluate the proposed algorithm on the test set without occlusion, and analyze the number of PCA planes required to construct appearance manifolds yielding good recognition results. Figure 7 demonstrates that the average recognition rate does not change much when the number of PCA planes is varied from 5 to 30. The results suggest that the appearance manifold can be effectively approximated with a small number of PCA planes. The proposed algorithm performs well over a reasonably large range, which shows that one can easily pick an appropriate number of PCA planes. Obviously, a smaller number of PCA planes is preferable for reasons of computational efficiency. However, the recognition rate drops significantly and quickly when the number of manifolds is rather small (fewer than five for this data set). This is consistent with the claim that the appearance manifold is nonlinear and complex.

4.2 Transition Matrix $P(C^{ki} \mid C^{kj})$

In this set of experiments, we demonstrate that the transition matrix, $P(C^{ki} \mid C^{kj})$, in the proposed method captures the image dynamics sufficiently to improve recognition rates. Using the set of videos without occlusion, we compared our method with two different strategies: temporal voting and a uniform transition probability scheme. All three methods used the same number of manifolds for each person, $m = 5$; they differ in their way of utilizing temporal information. The temporal voting scheme, commonly used in recognition methods based on multiple frames, makes an identity decision by taking votes of the results of the previous $f$ frames. In this case, 20 frames were used. The uniform transition scheme simply sets all the entries of the transition matrix to 1, which means that no temporal dynamics are learned or utilized in the recognition process. The experimental results, shown in Table 1, demonstrate that our method outperforms the others by a significant margin. In other words, learning transition probabilities among the pose manifolds does facilitate recognition, which cannot be achieved by a method using no dynamics information or by a simple temporal voting scheme with a large window size.

COMPARISON OF TEMPORAL STRATEGIES

Temporal Strategy    Accuracy (%)
Proposed Method      92.1
Temporal Voting      84.2
Uniform Trans.       85.0

Table 1: Recognition results using various temporal strategies on a test set of videos without occlusion.

4.3 Comparison with Single Frame Algorithms and the Effect of Occlusion

COMPARISON OF RECOGNITION METHODS

Method               Accuracy (%),            Accuracy (%),
                     videos w/o occlusion     videos with occlusion
Proposed Method      93.2                     93.0
Ensemble of LPCA     82.2                     20.9
Eigenface            75.5                     28.4
Fisherface           75.4                     20.5

Table 2: Recognition results using different methods. The results are based on the average recognition rates achieved by each method.

For completeness, we compared our method with several frame-based face recognition algorithms in the literature, and the results are shown in Table 2. All methods were trained with exactly the same cropped images. For the proposed algorithm, we constructed 30 PCA planes and learned their dynamics from the training videos. For the Ensemble of LPCA method, we used the same 30 PCA planes constructed in the proposed method but did not use the learned transition matrix.
This method is, in spirit, similar to the view-based Eigenface method [22]. The dimensionality of the Fisherface method is set to 19 (i.e., the number of classes minus 1) and the dimensionality for the other methods is empirically set to 30. Though it may not seem fair to compare video-based and frame-based recognition algorithms, these baseline experiments suggest that frame-based methods may not work well in an unconstrained environment where there are large pose changes. For the test videos without occlusion, the Ensemble of LPCA method performs better than the classic linear models (Eigenface and Fisherface methods) because an image sequence usually contains 2-D and 3-D rotations, which cannot be effectively approximated by a global linear model. These results also show that the use of image dynamics by our method greatly helps face recognition in video. Except for the proposed method, all other methods performed poorly on the test videos where some faces were partially occluded. This result shows that appearance coherence between consecutive frames helps in predicting occlusions and in turn facilitates the recognition process.

5 Conclusion and Future Work

We have presented a novel framework for video-based face recognition. The proposed method builds an appearance manifold which is approximated by piecewise linear subspaces, with the dynamics among them embodied in a transition matrix learned from an image sequence. It is worth noting that the image sequences considered in this paper contain large 2-D and 3-D rotations as well as partial occlusions. These situations might occur in many vision-based human-computer interaction or surveillance applications. As experimentally demonstrated, our method approximates the nonlinear appearance manifold well and achieves good recognition rates in video-based face recognition. Though the proposed model handles large motions well, it is nevertheless sensitive to large illumination changes, and our future work will address this.
Acknowledgments

Support of this work was provided by Honda Research Institute and the National Science Foundation (CCR 00-86094 and IIS 00-85980). This work was carried out at Honda Research Institute. We would like to thank the anonymous reviewers for their comments and suggestions, and all the people who helped to record their faces in our video database.

References

[1] C. M. Bishop and J. M. Winn. Non-linear Bayesian image modelling. In Proc. European Conf. on Computer Vision, volume 1, pages 3–17, 2000.
[2] M. J. Black and A. D. Jepson. EigenTracking: Robust matching and tracking of articulated objects using a view-based representation. Int'l. J. Computer Vision, 26(1):63–84, 1998.
[3] C. Bregler and S. Omohundro. Surface learning with applications to lipreading. In Advances in Neural Information Processing Systems, pages 43–50, 1994.
[4] R. Chellappa, C. L. Wilson, and S. Sirohey. Human and machine recognition of faces: A survey. Proceedings of the IEEE, 83(5):705–740, 1995.
[5] T. Cootes, C. J. Taylor, D. Cooper, and J. Graham. Active shape models - Their training and application. Computer Vision and Image Understanding, 61:38–59, 1995.
[6] D. DeCarlo, D. Metaxas, and M. Stone. An anthropometric face model using variational techniques. In Proc. SIGGRAPH, pages 67–74, 1998.
[7] G. J. Edwards, C. J. Taylor, and T. F. Cootes. Interpreting face images using active appearance models. In Proc. IEEE Int'l. Conf. on Automatic Face and Gesture Recognition, pages 300–305, 1998.
[8] G. J. Edwards, C. J. Taylor, and T. F. Cootes. Improving identification performance by integrating evidence from sequence. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pages 486–491, 1999.
[9] A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman. From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE Trans. Pattern Analysis and Machine Intelligence, 23(6):643–660, 2001.
[10] D. Hochbaum and D. Shmoys. A best possible heuristic for the k-center problem. Mathematics of Operations Research, 10:180–184, 1985.
[11] X. Hou, S. Li, H. Zhang, and Q. Cheng. Direct appearance models. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, volume 1, pages 828–833, 2001.
[12] A. J. Howell and H. Buxton. Towards unconstrained face recognition from image sequences. In Proc. IEEE Int'l. Conf. on Automatic Face and Gesture Recognition, pages 224–229, 1996.
[13] M. Isard and A. Blake. A mixed-state Condensation tracker with automatic model-switching. pages 107–112, 1998.
[14] T. Jebara, K. Russell, and A. Pentland. Mixtures of eigen features for real-time structure from texture. In Proc. Int'l. Conf. on Computer Vision, pages 128–135, 1998.
[15] V. Krüger and S. Zhou. Exemplar-based face recognition from video. In Proc. European Conf. on Computer Vision, volume 4, pages 732–746.
[16] Y. Li, S. Gong, and H. Liddell. Constructing facial identity surface in a nonlinear discriminating space. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, volume 2, pages 258–263, 2001.
[17] Y. Li, T. Wang, and H.-Y. Shum. Motion textures: A two-level statistical model for character motion synthesis. In Proc. SIGGRAPH, pages 465–472, 2002.
[18] B. Moghaddam and A. Pentland. Probabilistic visual learning for object recognition. IEEE Trans. Pattern Analysis and Machine Intelligence, 19(7):696–710, 1997.
[19] H. Murase and S. K. Nayar. Visual learning and recognition of 3-D objects from appearance. Int'l. J. Computer Vision, 14:5–24, 1995.
[20] B. North, A. Blake, M. Isard, and J. Rittscher. Learning and classification of complex dynamics. IEEE Trans. Pattern Analysis and Machine Intelligence, 22(9):1016–1034, 2000.
[21] V. Pavlović, J. M. Rehg, T. J. Cham, and K. P. Murphy. A dynamic Bayesian network approach to figure tracking using learned dynamic models. In Proc. Int'l. Conf. on Computer Vision, pages 94–101, 1999.
[22] A. Pentland, B. Moghaddam, and T. Starner. View-based and modular eigenspaces for face recognition. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, 1994.
[23] S. Romdhani, V. Blanz, and T. Vetter. Face identification by fitting 3D morphable model using linear shape and texture error functions. pages 3–19, 2002.
[24] A. Samal and P. A. Iyengar. Automatic recognition and analysis of human faces and facial expressions: A survey. Pattern Recognition, 25(1):65–77, 1992.
[25] A. Schödl, R. Szeliski, D. H. Salesin, and I. Essa. Video textures. In Proc. SIGGRAPH, pages 489–498, 2000.
[26] G. Shakhnarovich, J. W. Fisher, and T. Darrell. Face recognition from long-term observations. In Proc. European Conf. on Computer Vision, volume 3, pages 851–865, 2002.
[27] K. Toyama and A. Blake. Probabilistic tracking in a metric space. In Proc. Int'l. Conf. on Computer Vision, volume 2, pages 50–59, 2001.
[28] H. Wechsler, V. Kakkad, J. Huang, S. Gutta, and V. Chen. Automatic video-based person authentication using the RBF network. In Proc. Int'l. Conf. on Audio and Video-Based Biometric Person Authentication, pages 177–183, 1997.
[29] W. Y. Zhao and R. Chellappa. Symmetric shape-from-shading using self-ratio image. Int'l. J. Computer Vision, 45(1):55–75, 2001.
[30] W. Y. Zhao, R. Chellappa, A. Rosenfeld, and J. P. Phillips. Face recognition: A literature survey. Technical Report CAR-TR-948, Center for Automation Research, University of Maryland, 2000.
[31] S. Zhou and R. Chellappa. Probabilistic human recognition from video. In Proc. European Conf. on Computer Vision, volume 3, pages 681–697, 2002.