Learning Non-Linear Reconstruction Models for Image Set Classification


Munawar Hayat, Mohammed Bennamoun, Senjian An
School of Computer Science and Software Engineering
The University of Western Australia
{mohammed.bennamoun,

Abstract

We propose a deep learning framework for image set classification with application to face recognition. An Adaptive Deep Network Template (ADNT) is defined whose parameters are initialized by performing unsupervised pre-training in a layer-wise fashion using Gaussian Restricted Boltzmann Machines (GRBMs). The pre-initialized ADNT is then separately trained for images of each class and class-specific models are learnt. Based on the minimum reconstruction error from the learnt class-specific models, a majority voting strategy is used for classification. The proposed framework is extensively evaluated for the task of image set classification based face recognition on Honda/UCSD, CMU Mobo, YouTube Celebrities and a Kinect dataset. Our experimental results and comparisons with existing state-of-the-art methods show that the proposed method consistently achieves the best performance on all these datasets.

1. Introduction

Face recognition has traditionally been considered as a single image classification problem. With the recent advances in imaging technology, multiple images of a person are becoming readily available in numerous scenarios such as video based surveillance, multi-view camera networks, personal albums and images of a person acquired over a long period of time. Face recognition from these multiple images is formulated as an image set classification problem and has gained significant attention from the research community in recent years [17, 27, 25, 4, 8, 14, 26]. Compared with single image based classification methods, face recognition from image sets is more promising, as it can effectively handle a wide range of variations that are commonly present in the facial images of a person.
These variations include changing illumination conditions, view point variations, expression deformations, occlusions and disguise. Facial images of a person under different variations are commonly modeled on a non-linear manifold geometry such as the Grassmannian manifold [27, 25, 8] or the Lie Group of Riemannian manifold [26]. This modeling of images on manifolds requires prior assumptions related to the specific category of the manifold on which face images are believed to lie. In contrast, this paper introduces a deep learning based framework which makes no prior assumption regarding the underlying geometry of face images and can automatically learn and discover the structure of the complex non-linear surface on which face images of a person (under different variations) are present.

Figure 1. The block diagram of the proposed method. During training, class-specific models are learned from images of each person. These models are then used by a reconstruction error based voting strategy to decide about the class of a test image set.

The proposed framework first defines an Adaptive Deep Network Template (ADNT) whose weights are initialized by unsupervised layer-wise pre-training using Gaussian Restricted Boltzmann Machines (GRBMs). The pre-initialized ADNT is then separately trained for images of each class to learn class-specific models. The training is performed in a way that the ADNT learns to reconstruct images of that class. A class-specific model is therefore made to learn the structure and the geometry of the complex non-linear surface on which face images of that class are present. For classification, a reconstruction error and majority voting based strategy is devised. The proposed framework is evaluated for video based face recognition on the Honda/UCSD [18], CMU Mobo [6] and YouTube Celebrities [16] datasets as well as a Kinect dataset [19, 5], and achieves state of the art performance.

2. Related Work

Image set classification generally involves two major steps: 1. to find a representation of the images in the set, and 2. to define suitable distance metrics for the computation of the similarity between these representations. Based on the type of representation used, existing image set classification methods can be categorized into parametric-model and non-parametric-model methods. The parametric-model methods [1] approximate an image set in terms of the parameters of a certain statistical distribution model and then measure the similarity between two image sets (two distribution parameters) using e.g. KL-divergence. These methods fail to produce a desirable performance if there is no strong statistical relationship between the test and the training image sets.

The other type of image set representation methods, i.e. non-parametric methods, do not make any assumption about the statistical distribution of the data. These methods have shown promising results and are being actively developed. The non-parametric model based methods represent an image set either by its representative exemplars or on a geometric surface. Based upon the type of representation, different distance metrics have been developed to determine the between-set distance. For example, for image sets represented in terms of representative exemplars, the set-set distance can be defined as the Euclidean distance between the set representatives. These can simply be the set mean [7] or adaptively learnt set samples [4]. Cevikalp et al. [4] learn the set samples from the affine hull or convex hull models of the set images. The set to set distance is then termed the Affine Hull Image Set Distance (AHISD) or the Convex Hull Image Set Distance (CHISD). Hu et al. [14] define the set-set distance as the distance between their Sparse Approximated Nearest Points (SANPs). The SANPs of two sets are determined from the mean image and the affine hull model of the corresponding set and are sparsely approximated from the set images while simultaneously searching for the closest points in the respective sets.
As set representative based methods require the computation of a one-to-one set distance, these methods are capable of handling intra-set variations very effectively. However, their performance is highly prone to outliers. They are also computationally very expensive, as a one-to-one match of the query set with all sets in the gallery is required. These methods could therefore be very slow in the case of a large gallery size.

Unlike set representative based methods, the second category of non-parametric methods models a complete image set as a point on a geometric surface [7, 5, 8, 6, 0, 3]. The image set can be represented either by a subspace, a mixture of subspaces or on a complex non-linear manifold. Principal angles have been very commonly used to determine the distance between image sets represented by a linear subspace. The principal angles $0 \le \theta_1 \le \dots \le \theta_m \le \pi/2$ between two subspaces are defined as the smallest angles between any vector in one subspace and any other vector in the second subspace. The similarity between sets is then defined as the sum of the cosines of the principal angles. For image set representations on manifolds, appropriate distance metrics have been adopted, such as the geodesic distance [22] and the projection kernel metric [7] on the Grassmann manifold, and the log-map distance metric [9] on the Lie group of Riemannian manifold.

In order to discriminate image sets on the manifold surface, different learning strategies have been developed. Mostly, Linear Discriminant Analysis (LDA) is contrived for different set representations. Examples include Discriminative Canonical Correlations (DCC) [17], Manifold Discriminant Analysis (MDA) [25], Graph Embedding Discriminant Analysis (GEDA) [8] and Covariance Discriminative Learning (CDL) [26]. The methods which model an image set on a geometric surface make prior assumptions about the underlying surface on which the face data lies. For example, [17] assumes that face images lie on a linear surface and represents the image set as a linear subspace.
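As an aside, the principal-angle similarity described above can be sketched in a few lines. This is an illustrative NumPy sketch, not code from any of the cited methods: the function names are hypothetical, and it assumes each subspace is given by a matrix with linearly independent columns. The cosines of the principal angles are obtained as the singular values of the product of two orthonormal bases.

```python
import numpy as np

def principal_angle_cosines(A, B):
    """Cosines of the principal angles between span(A) and span(B).

    A, B: arrays of shape (d, p) and (d, q) whose columns span the subspaces
    (assumed to have full column rank).
    """
    Qa, _ = np.linalg.qr(A)  # orthonormal basis of span(A)
    Qb, _ = np.linalg.qr(B)  # orthonormal basis of span(B)
    # The singular values of Qa^T Qb are the cosines, in decreasing order.
    s = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    return np.clip(s, 0.0, 1.0)  # guard against round-off slightly above 1

def subspace_similarity(A, B):
    """Sum of the cosines of the principal angles."""
    return float(principal_angle_cosines(A, B).sum())
```

Two identical p-dimensional subspaces give a similarity of p, and mutually orthogonal subspaces give 0.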
Methods including MMD, MDA and GEDA represent an image set on a non-linear Grassmannian manifold, whereas CDL [26] represents an image set in terms of the covariance matrix of pixel values on the Lie Group of Riemannian manifold. For our proposed method, we do not make any prior assumptions about the structure of the surface on which the facial images of a person lie. We instead define a deep learning based framework which incorporates non-linear activation functions to automatically learn the underlying manifold structure. Deep learning has recently gained significant research attention in a number of areas [, 3, 5]. Ours is the first method which incorporates deep learning for image set classification. The detailed description of our method is presented next.

3. Proposed Technique

We first define an Adaptive Deep Network Template (ADNT) which will be used to learn the underlying structure of the data. The architecture of our ADNT is summarized in Fig. 2 and the details are presented in Sec. 3.1. For such a deep network to perform well, an appropriate initialization of the weights is required. We initialize the weights of the ADNT by performing pre-training in a greedy layer-wise fashion using Gaussian Restricted Boltzmann Machines (details in Sec. 3.2). The ADNT with pre-initialized weights is then separately fine-tuned for each of the k classes of the training image sets. We therefore end up with a total of k fine-tuned deep network models, each corresponding to one of the k classes. The fine-tuned models are then used for image set classification (details in Sec. 3.3).


Figure 2. Structure of the Adaptive Deep Network Template (ADNT), with encoder weights $W_e^{(1)}, W_e^{(2)}, W_e^{(3)}$ and decoder weights $W_d^{(1)}, W_d^{(2)}, W_d^{(3)}$. The parameters of the template are initialized by unsupervised pre-training. The initialized template is then used to learn class specific models.

3.1. The Adaptive Deep Network Template (ADNT)

As depicted in Fig. 2, our ADNT is an Auto-Encoder (AE) consisting of two parts: an encoder and a decoder. Both the encoder and the decoder have three hidden layers each, with a shared third layer (the central hidden layer). The encoder part of the AE finds a compact, low dimensional, meaningful representation of the input data. We can formulate the encoder as a combination of non-linear functions $s(\cdot)$ used to map the input data $x$ to a representation $h$ given by,

$$h = s\big(W_e^{(3)} h_2 + b_e^{(3)}\big), \qquad h_2 = s\big(W_e^{(2)} h_1 + b_e^{(2)}\big), \qquad h_1 = s\big(W_e^{(1)} x + b_e^{(1)}\big) \tag{1}$$

where $W_e^{(i)} \in \mathbb{R}^{d_i \times d_{i-1}}$ is the encoder weight matrix for layer $i$ having $d_i$ nodes ($d_0$ being the dimension of the input), $b_e^{(i)} \in \mathbb{R}^{d_i}$ is the bias vector and $s(\cdot)$ is the non-linear activation function (typically a sigmoid or hyperbolic tangent). The encoder parameters are learnt by combining the encoder with the decoder and jointly training the encoder-decoder structure to reconstruct the input data by minimization of a cost function. The decoder can therefore be defined as a combination of non-linear functions which reconstruct the input $x$ from the encoder output $h$. The reconstructed output $\tilde{x}$ of the decoder is given by,

$$\tilde{x} = s\big(W_d^{(3)} x_2 + b_d^{(3)}\big), \qquad x_2 = s\big(W_d^{(2)} x_1 + b_d^{(2)}\big), \qquad x_1 = s\big(W_d^{(1)} h + b_d^{(1)}\big) \tag{2}$$

We can represent the complete encoder-decoder structure (the ADNT) by its parameters $\theta_{ADNT} = \{\theta_W, \theta_b\}$, where $\theta_W = \{W_e^{(i)}, W_d^{(i)}\}_{i=1}^{3}$ and $\theta_b = \{b_e^{(i)}, b_d^{(i)}\}_{i=1}^{3}$. Later (in Sec. 3.3) we will use this template and separately train it for all classes of the training image sets to learn class specific models.

3.2. ADNT's Parameter Initialization

The ADNT defined above is used to learn class specific models. This is accomplished by separate training of the ADNT for images of each class of the training image sets. The training is performed with stochastic gradient descent through back propagation.
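The encoder and decoder mappings of Eqs. (1) and (2) can be sketched as a plain forward pass. The following NumPy sketch is illustrative only: the layer widths, the random initial values and the function names are assumptions made for the example (the paper initializes the weights by GRBM pre-training, not randomly), and the sigmoid is used as the activation $s(\cdot)$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(sizes, rng):
    """Placeholder parameters for widths sizes = [d0, d1, d2, d3].

    Returns (encoder, decoder), each a list of three (W, b) pairs.
    Random values are an assumption for illustration only.
    """
    enc = [(rng.normal(0.0, 0.01, (sizes[i + 1], sizes[i])), np.zeros(sizes[i + 1]))
           for i in range(3)]
    # The decoder mirrors the encoder: d3 -> d2 -> d1 -> d0.
    dec = [(rng.normal(0.0, 0.01, (sizes[2 - i], sizes[3 - i])), np.zeros(sizes[2 - i]))
           for i in range(3)]
    return enc, dec

def encode(x, enc):
    """Eq. (1): h1 = s(W1 x + b1), h2 = s(W2 h1 + b2), h = s(W3 h2 + b3)."""
    h = x
    for W, b in enc:
        h = sigmoid(W @ h + b)
    return h

def decode(h, dec):
    """Eq. (2): x1 = s(W1 h + b1), x2 = s(W2 x1 + b2), x~ = s(W3 x2 + b3)."""
    x = h
    for W, b in dec:
        x = sigmoid(W @ x + b)
    return x

# Hypothetical widths: an over-complete first hidden layer, then a bottleneck.
rng = np.random.default_rng(0)
enc, dec = init_params([16, 32, 8, 4], rng)
x = rng.random(16)                 # one input vector with entries in [0, 1)
x_tilde = decode(encode(x, enc), dec)
```

Because the output layer is a sigmoid, this sketch assumes inputs scaled to [0, 1]; a different output activation would be needed for unbounded features.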
The training fails if the ADNT is initialized with inappropriate weights. More specifically, if the initialized weights are too large, the network gets stuck in local minima. On the other hand, if the initialized weights are too small, the vanishing gradient problem is encountered during back propagation in the initial layers and the network becomes infeasible to train. The weights of the template are therefore initialized by performing unsupervised pre-training []. For that, a greedy layer-wise approach is adopted and Gaussian RBMs are used. Below, we first present a brief overview of binary and Gaussian RBMs and then explain their use for our ADNT's parameter initialization.

An RBM is a generative undirected graphical model with a bipartite structure of two sets of binary stochastic nodes, termed the visible layer nodes ($\{v_i\}_{i=1}^{N_v}$, $v_i \in \{0, 1\}$) and the hidden layer nodes ($\{h_j\}_{j=1}^{N_h}$, $h_j \in \{0, 1\}$). The nodes of the visible layer are symmetrically connected with the nodes of the hidden layer through a weight matrix $W \in \mathbb{R}^{N_v \times N_h}$, but there are no intra-layer node connections. The joint probability $p(v, h)$ of the RBM structure is given by,

$$p(v, h) = \frac{1}{Z} \exp\big(-E(v, h)\big) \tag{3}$$

$Z$ is the partition function (used as a normalization constant) and $E(v, h)$ is the energy function of the model, defined as:

$$E(v, h) = -\sum_i b_i v_i - \sum_j c_j h_j - \sum_{ij} w_{ij} v_i h_j \tag{4}$$

where $b$ and $c$ are the biases of the visible and hidden layer nodes respectively. The training of an RBM for learning its model parameters $\{W, b, c\}$ is performed by Contrastive Divergence (CD), a numerical method proposed by Hinton et al. [, 3] for efficient approximation to gradient computation and RBM parameter learning.

The standard RBM developed for binary stochastic data can be generalized to real valued data by appropriate modifications in its energy function. The Gaussian RBM (GRBM) is one such popular extension, whose energy function is defined by modifying the bias term of the visible units as:

$$E_{GRBM}(v, h) = \sum_i \frac{(v_i - b_i)^2}{2\sigma_i^2} - \sum_j c_j h_j - \sum_{ij} \frac{v_i}{\sigma_i} w_{ij} h_j \tag{5}$$

$\sigma_i$ is the standard deviation of the real valued Gaussian distributed inputs to the visible node $v_i$. It is possible to learn $\sigma_i$ for each visible unit, but this becomes difficult when using CD for GRBM parameter learning. We instead adopt an alternative approach and fix $\sigma_i$ to a unit value in the data pre-processing stage. Due to the restriction that there are no intra-layer node connections, inference becomes readily tractable for the RBM, as opposed to most directed graphical models. The probability distributions for the GRBM are given by,

$$p(h_j = 1 \mid v) = \mathrm{sigmoid}\Big(\sum_i w_{ij} \frac{v_i}{\sigma_i} + c_j\Big), \qquad p(v_i \mid h) = \frac{1}{\sigma_i \sqrt{2\pi}} \exp\Big(-\frac{(v_i - u_i)^2}{2\sigma_i^2}\Big) \tag{6}$$

where

$$u_i = b_i + \sigma_i \sum_j w_{ij} h_j \tag{7}$$

Since our data is real valued, we use GRBMs to initialize the weights of our ADNT. Two layers are considered at a time and the GRBM parameters are learnt. Initially, the nodes of the input layer are considered to be the visible units $v$ and the nodes of the first hidden layer the hidden units $h$ of the first GRBM, and its parameters are learnt. The activations of the first GRBM's hidden units are then used as an input to train the second GRBM. The process is repeated for all three hidden layers of the encoder part of the ADNT structure. The weights learnt for the encoder layers are then tied to the corresponding decoder layers, i.e. $W_d^{(3)} = W_e^{(1)T}$, $W_d^{(2)} = W_e^{(2)T}$, $W_d^{(1)} = W_e^{(3)T}$ (see Fig. 2 for notations).

3.3. Image Set Classification Algorithm

We are now ready to describe our reconstruction error based image set classification algorithm. The complete algorithm is summarized in Alg. 1. The details are presented below.

Problem Formulation: Given k training image sets $\{X_c\}_{c=1}^{k}$ and their corresponding class labels $y_c \in [1, 2, \dots, k]$, where the image set $X_c = \{x^{(t)}\}_{t=1}^{N_c}$ has $N_c$ images $x^{(t)} \in \mathbb{R}^{x \times y}$ belonging to class $c$, the problem of image set classification can be formulated as follows: given a test image set $X_{test} = \{x^{(t)}\}_{t=1}^{N_{test}}$, find the class $y_{test}$ to which $X_{test}$ belongs.
Unsupervised Pre-Training: We first define our ADNT and initialize its weights by performing unsupervised pre-training. Our ADNT is a multi-layer neural network. In order to initialize the weights of the ADNT by GRBMs, we generate an unsupervised training data set. Face images from all training image sets are gathered into a data set $X_u = \{x^{(t)} \in X_c \,;\, c \in [1, 2, \dots, k]\}$. The images in the resulting data set $X_u$ are randomly shuffled and used for layer-wise GRBM training of all layers of the encoder part of the template. The weights of the decoder layers are then initialized with their corresponding tied weights of the encoder layers. Using pre-training for weight initialization has several advantages over random initialization. Since the ADNT is pre-trained for face images, the initialized weights are very close to the actual weights []. Therefore, it is highly unlikely that the network gets stuck in a local optimum. Moreover, with properly initialized weights, the gradient computation becomes feasible, resulting in the convergence of the weights to optimal values.

Learning Class Specific Models: Now that we have the ADNT structure with pre-initialized weights, we separately fine-tune its parameters $\theta_{ADNT} = \{\theta_W, \theta_b\}$ for each of the k training image sets. We therefore learn k class-specific models. The learning of a class-specific model $\theta_{(c)}$ is carried out by performing stochastic gradient descent through back propagation for the minimization of the reconstruction error over all examples $x^{(t)}$ of a training image set $X_c$,

$$J\big(\theta_{ADNT};\, x^{(t)} \in X_c\big) = \sum_{x^{(t)}} \big\lVert x^{(t)} - \tilde{x}^{(t)} \big\rVert^2 \tag{8}$$

Since the model is being trained to reconstruct the input data, it might end up learning an identity function and merely reproduce the input data. Appropriate settings in the configuration of the ADNT are therefore required to ensure that a class specific model learns the underlying structure of the data and produces useful representations.
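The greedy layer-wise pre-training described above can be sketched as follows. This is a minimal NumPy sketch under stated assumptions, not the authors' implementation: it uses one step of Contrastive Divergence (CD-1) with unit visible variance ($\sigma_i = 1$), full-batch updates instead of mini-batches, and the Gaussian mean instead of a sample in the negative phase. The function names, learning rate and epoch count are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_grbm(V, n_hidden, epochs=10, lr=0.01, seed=0):
    """CD-1 for a Gaussian RBM with sigma_i = 1 (assumed, as in the text).

    V: (n_samples, n_visible) real-valued data.
    Returns W (n_visible, n_hidden), visible bias b, hidden bias c.
    """
    rng = np.random.default_rng(seed)
    n, n_vis = V.shape
    W = rng.normal(0.0, 0.01, (n_vis, n_hidden))
    b = np.zeros(n_vis)
    c = np.zeros(n_hidden)
    for _ in range(epochs):
        ph = sigmoid(V @ W + c)                        # p(h_j = 1 | v), Eq. (6)
        h = (rng.random(ph.shape) < ph).astype(float)  # sample hidden states
        v_neg = b + h @ W.T                            # Gaussian mean u_i, Eq. (7)
        ph_neg = sigmoid(v_neg @ W + c)
        # Data statistics minus one-step reconstruction statistics.
        W += lr * (V.T @ ph - v_neg.T @ ph_neg) / n
        b += lr * (V - v_neg).mean(axis=0)
        c += lr * (ph - ph_neg).mean(axis=0)
    return W, b, c

def pretrain_encoder(X_u, hidden_sizes, **kwargs):
    """Train one GRBM per encoder layer, feeding activations upward."""
    params, data = [], X_u
    for n_hidden in hidden_sizes:
        W, b, c = train_grbm(data, n_hidden, **kwargs)
        params.append((W, b, c))
        data = sigmoid(data @ W + c)   # input to the next GRBM
    return params

def tied_decoder(params):
    """Decoder layers 1..3 reuse encoder layers 3..1.

    With W stored as (n_visible, n_hidden), the encoder matrix in the
    h = s(W_e x + b) convention is W.T, so its transpose is W itself.
    """
    return [(W, b) for (W, b, _c) in reversed(params)]
```

In this layout each decoder layer takes the visible bias of the matching GRBM as its bias, which is one common choice and is an assumption here.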
For our ADNT, since the number of nodes in the first hidden layer is larger than the dimension of the input data, we first learn an over-complete representation of the data by mapping it to a high dimensional space. This high dimensional representation is then followed by a bottleneck, i.e. the data is mapped back to a compact, abstract and low dimensional representation in the subsequent layers of the encoder. With such a mapping, the redundant information in the data is discarded and only the required useful content of the data is retained.

In order to avoid over-fitting and improve generalization of the learnt model to unknown test data, we introduce regularization terms into the cost function of the ADNT. A weight decay penalty term $J_w$ and a sparsity constraint $J_{sp}$ are added, and the modified cost function becomes,

$$J_{reg}\big(\theta_{ADNT};\, x^{(t)} \in X_c\big) = \sum_{x^{(t)}} \big\lVert x^{(t)} - \tilde{x}^{(t)} \big\rVert^2 + \lambda_w J_w + \lambda_{sp} J_{sp} \tag{9}$$

$\lambda_w$ and $\lambda_{sp}$ are regularization parameters. $J_w$ ensures small values of weights for all hidden units. It is defined as the summation of the Frobenius norms of all weight matrices,

$$J_w = \sum_{i=1}^{3} \big\lVert W_e^{(i)} \big\rVert_F + \sum_{i=1}^{3} \big\lVert W_d^{(i)} \big\rVert_F \tag{10}$$

$J_{sp}$ enforces that the mean activation $\bar{\rho}_j^{(i)}$ (over all training examples) of the $j$-th unit of the $i$-th hidden layer is as close as possible to a sparsity target $\rho$ (typically a small value, set to $10^{-3}$ in our experiments in Sec. 4). It is defined in terms of the KL divergence as,

$$J_{sp} = \sum_{i=1}^{5} \sum_j \mathrm{KL}\big(\rho \,\Vert\, \bar{\rho}_j^{(i)}\big) = \sum_{i=1}^{5} \sum_j \Big[ \rho \log \frac{\rho}{\bar{\rho}_j^{(i)}} + (1 - \rho) \log \frac{1 - \rho}{1 - \bar{\rho}_j^{(i)}} \Big] \tag{11}$$

A class-specific model $\theta_{(c)}$ is obtained by training the regularized ADNT over all images of the set $X_c$,

$$\theta_{(c)} = \arg\min_{\theta_{ADNT}} J_{reg}\big(\theta_{ADNT};\, x^{(t)} \in X_c\big) \tag{12}$$

A class-specific model $\theta_{(c)}$ is therefore made to learn the underlying structure of the manifold on which face images of that class lie. Since the activation functions used are non-linear and a number of layers are stacked together, the AE structure is capable of learning very complex non-linear manifold structures.

Classification: Given a test image set $X_{test} = \{x^{(1)}, x^{(2)}, \dots, x^{(N_{test})}\}$, we separately reconstruct (using Eqs. 1 and 2) each of its images $x^{(t)} \in X_{test}$ from all class specific models $\theta_{(c)}$, $c = 1, \dots, k$. If $\tilde{x}^{(t)}(c)$ is the reconstruction of the image $x^{(t)}$ from the model $\theta_{(c)}$ (the model fine-tuned with images of $X_c$), then the reconstruction error is given by,

$$r^{(t)}(c) = \big\lVert x^{(t)} - \tilde{x}^{(t)}(c) \big\rVert_2 \tag{13}$$

After computing the reconstruction errors for all k models, the decision about the class $y^{(t)}$ of the image $x^{(t)}$ is made based upon the criterion of minimum reconstruction error,

$$y^{(t)} = \arg\min_c r^{(t)}(c) \tag{14}$$

Here the idea is that the unseen image $x^{(t)}$ will be reconstructed with the least error only from the model trained on images with the same label. Following this procedure, the class labels of all $N_{test}$ images of the test set are computed. The label $y_{test}$ of the test image set $X_{test}$ is then defined as the most recurring label amongst all images of $X_{test}$.
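The per-image decision of Eqs. (13) and (14), followed by the majority vote over the test set, can be sketched as below. This is an illustrative NumPy sketch: `classify_image_set` and `make_projector` are hypothetical names, and the class-specific models are replaced by simple linear projections onto one direction per class, purely as stand-ins for the fine-tuned ADNTs.

```python
import numpy as np

def classify_image_set(X_test, reconstructors):
    """Label a test set by minimum reconstruction error and majority voting.

    X_test: (N_test, d) array, one image (feature vector) per row.
    reconstructors: list of k callables; reconstructors[c](X) returns the
        reconstruction of every row of X by the model of class c.
    Returns (set_label, per_image_labels).
    """
    # errors[t, c] = || x_t - x~_t(c) ||_2   (Eq. 13)
    errors = np.stack(
        [np.linalg.norm(X_test - f(X_test), axis=1) for f in reconstructors],
        axis=1,
    )
    image_labels = errors.argmin(axis=1)   # Eq. 14
    votes = np.bincount(image_labels, minlength=len(reconstructors))
    return int(votes.argmax()), image_labels  # ties go to the lowest class index

def make_projector(u):
    """Stand-in 'model': reconstruct by projecting onto the direction u."""
    u = np.asarray(u, dtype=float)
    u = u / np.linalg.norm(u)
    return lambda X: np.outer(X @ u, u)

# Two toy classes whose images lie near the first and second axis respectively.
models = [make_projector([1, 0]), make_projector([0, 1])]
X_test = np.array([[1.0, 0.1], [2.0, -0.1], [0.1, 3.0]])
set_label, per_image = classify_image_set(X_test, models)
```

In this toy example two of the three images are reconstructed best by the first model, so the vote assigns the set to class index 0 even though one image votes for class index 1.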
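The set label is given by,

$$y_{test} = \arg\max_c \sum_t \delta_c\big(y^{(t)}\big), \quad \text{where} \quad \delta_c\big(y^{(t)}\big) = \begin{cases} 1, & y^{(t)} = c \\ 0, & \text{otherwise} \end{cases} \tag{15}$$

Algorithm 1 Proposed Image Set Classification Method
Input:
  Training data: k image sets $\{X_c\}_{c=1}^{k}$ s.t. $X_c = \{x^{(t)}\}_{t=1}^{N_c}$; labels: $y_c \in [1, 2, \dots, k]$
  Testing data: image set $X_{test} = \{x^{(t)}\}_{t=1}^{N_{test}}$
Output: Label $y_{test}$ of $X_{test}$
Training
  Define ADNT structure
  Unsupervised data: $X_u \leftarrow \{x^{(t)} \in X_c \,;\, c \in [1, 2, \dots, k]\}$
  Train GRBMs using $X_u$ to initialize $\theta_{ADNT} = \{\theta_W, \theta_b\}$
  for $c = 1, \dots, k$ do
    $\theta_{(c)} \leftarrow \arg\min_{\theta_{ADNT}} J_{reg}(\theta_{ADNT};\, x^{(t)} \in X_c)$
  end for
Testing
  for each image $x^{(t)} \in X_{test}$ do
    for $\theta_{(c)} = \theta_{(1)}, \dots, \theta_{(k)}$ do
      $h^{(t)} \leftarrow s(W_e^{(3)} s(W_e^{(2)} s(W_e^{(1)} x^{(t)} + b_e^{(1)}) + b_e^{(2)}) + b_e^{(3)})$
      $\tilde{x}^{(t)}(c) \leftarrow s(W_d^{(3)} s(W_d^{(2)} s(W_d^{(1)} h^{(t)} + b_d^{(1)}) + b_d^{(2)}) + b_d^{(3)})$
      $r^{(t)}(c) \leftarrow \lVert x^{(t)} - \tilde{x}^{(t)}(c) \rVert_2$
    end for
    Assign label to image $x^{(t)}$: $y^{(t)} \leftarrow \arg\min_c r^{(t)}(c)$
  end for
  Label of $X_{test}$: $y_{test} \leftarrow \arg\max_c \sum_t \delta_c(y^{(t)})$ (see Eq. 15)

4. Experiments

The performance of our proposed method is evaluated on four data sets for the task of image set classification for face recognition. These datasets include three gray scale face video datasets: the Honda/UCSD dataset [18], the CMU Mobo dataset [6] and the YouTube Celebrities dataset [16]; and an RGB-D Kinect dataset obtained by combining three Kinect datasets. The detailed description of each of these datasets and their performance evaluation using our method and state-of-the-art methods is presented in Sec. 4.2. Here, we first describe the pre-processing steps and the common experimental configurations followed for all datasets.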

4.1. Experimental Settings

The face from each frame in the videos of the Honda/UCSD and Mobo datasets is automatically detected using the Viola and Jones face detection algorithm [24]. It was observed that face detection by [24] failed in a significant number of frames in the case of the YouTube Celebrities dataset, due to its poor image resolution and large head rotations. We therefore use [21] to track the face region across every video sequence given the location of the face window in the first frame (provided with the dataset). In the case of the Kinect face datasets, the random regression forest based classifier proposed in [5] is used to automatically detect faces from depth images. As depth data is pre-aligned with RGB, the same location of the detected face in the depth image is used for the corresponding RGB image. After a successful detection, the face region is cropped and all colored images are converted to gray scale levels. The cropped gray scale images are then resized to a fixed resolution for each of the Honda/UCSD, Mobo and YouTube Celebrities datasets. The depth and the gray scale images of the Kinect datasets are also resized to a fixed resolution. Histogram equalization is applied on all images to minimize illumination variations. No other pre-processing such as background removal or alignment is applied.

Each cropped and histogram equalized face image is then divided into 4 × 4 (5 × 5 in the case of the CMU Mobo dataset, as in [4, 14]) distinct, non-overlapping, uniformly spaced rectangular blocks, and 59-dimensional histograms of $LBP_{8,1}^{u2}$ [20] are computed for every block. Histograms from all blocks are concatenated into a single vector which is used as a face feature vector in all of our experiments. In the case of the Kinect dataset, the LBP feature vectors for gray scale and depth images are concatenated and the resulting feature vector is used.

Compared Methods: We compare our proposed method with a number of recently proposed state of the art image set classification methods.
These include Discriminant Canonical Correlation Analysis (DCC) [17], Manifold-to-Manifold Distance (MMD) [27], Manifold Discriminant Analysis (MDA) [25], the linear version of the Affine Hull-based Image Set Distance (AHISD) [4], the Convex Hull-based Image Set Distance (CHISD) [4], Sparse Approximated Nearest Points (SANP) [14], Covariance Discriminative Learning (CDL) [26] and Set to Set Distance Metric Learning (SSDML) [29]. The implementations provided by the respective authors are used for all methods except CDL, which was carefully implemented by us. The parameters for all methods are optimized for best performance. Specifically, for DCC, we set the dimension of the embedding space to 100. The number of retained dimensions for a subspace is set to 10 (90% of the energy is preserved) and the corresponding 10 maximum canonical correlations are used to compute the set-set similarity. The parameters for MMD and MDA are adopted from [27] and [25] respectively. No parameter settings are required for AHISD. For CHISD, the same error penalty term (C = 100) as in [4] is adopted. For SANP, the same weight parameters as in [14] are adopted for convex optimization. No parameter settings are required for CDL and SSDML.

4.2. Results and Analysis

Honda/UCSD Dataset: The Honda/UCSD dataset [18] contains 59 video sequences of 20 different subjects. The number of frames for each video sequence varies from 12 to 645. For our experiments, we consider each video as an image set. Similar to [8, 4, 7, 5], we use 20 video sequences for training and the remaining 39 for testing. In order to achieve consistency in the results, we repeat our experiments ten times with different random selections of the training and testing sets. The achieved performance, in terms of average identification rates and standard deviations of our method and the compared methods, is presented in Table 1. The results show that the proposed method achieves perfect classification on the Honda/UCSD data set.

CMU Mobo Dataset: The Mobo (Motion of Body) dataset [6] was originally created for human body pose identification.
The dataset contains a total of 96 sequences of 24 subjects walking on a treadmill. Similar to [4, 7, 4], we randomly select one sequence of a subject for training and the remaining three sequences are used for testing. We repeat our experiments 10 times for different random selections of the training and the testing sets. The average identification rates of our proposed method, along with a comparison with other methods, are provided in Table 1. The results suggest that the proposed method achieves a very high performance of 97.96% and outperforms the other methods.

YouTube Celebrities Dataset: YouTube Celebrities [16] is the largest and the most challenging dataset used for image set classification based face recognition. The dataset contains 1910 videos of 47 celebrities collected from YouTube. The face images of the dataset exhibit a large diversity and variations in the form of pose, illumination and expressions. Moreover, the quality and resolution of the images are very low due to the high compression rate. Since the face regions in the videos are cropped by tracking [21], the low image quality introduces many tracking errors and the region of the cropped face is not uniform across frames of even the same video. We directly use the face region automatically extracted from tracking and do not refine its cropping by enforcing constraints as in [16]. For performance evaluation, we use the five fold cross validation experimental settings as proposed in [4, 5, 5].


Table 1. Experimental results on the Honda, CMU, YouTube and Kinect datasets for different methods. Rows: DCC [17], MMD [27], MDA [25], AHISD [4], CHISD [4], SANP [14], CDL [26], SSDML [29] and our method. Columns: Honda/UCSD, CMU Mobo, YouTube, Kinect (average identification rate ± standard deviation).

The whole dataset is equally divided (with minimum overlap) into five folds with 9 image sets per subject in each fold. Three of these image sets are randomly selected for training, whereas the remaining six sets are used for testing. Table 1 summarizes the average identification rates and the standard deviations of the different methods. It can be observed that the achieved identification rates for all methods are low for this dataset compared with the Honda/UCSD and Mobo datasets. This is owing to the challenging nature of the dataset. The videos have been captured in real life scenarios and they exhibit a wide range of appearance variations. The results suggest that our proposed method significantly outperforms the existing methods and achieves a relative performance improvement of 9.0% over the second best method.

Figure 3. Example images from the gray scale datasets: Honda/UCSD (top), CMU/Mobo (center) and YouTube (bottom). Each row corresponds to images of one identity.

Figure 4. Sample images from the Kinect datasets: CurtinFaces (top), Biwi (center) and our dataset (bottom).

Kinect Dataset: We also evaluate the performance of our proposed method for RGB-D based face recognition from Kinect data. Face recognition from Kinect data is still in its infancy and only a few works have addressed this problem [19]. The method by Li et al. [19] first pre-processes Kinect depth images to achieve a canonical frontal view for faces with profile and non-frontal views. The sparse representation based classification method of [28] is used for recognition. Once evaluated on CurtinFaces, the method achieves a classification rate of 9.% for RGB, 88.7% for D and 96.7% for the fusion of RGB-D data.
Their method is single-frame based and does not make use of the plenitude of data which can be instantly acquired from a Kinect sensor (30 frames per second). Here we formulate face recognition from Kinect data as an RGB-D based image set classification problem. Our formulation avoids expensive pre-processing steps (such as hole filling, spike removal and canonical view estimation, otherwise required for single image based classification) and effectively makes use of the abundant and readily available Kinect data.

The method in [19] is evaluated on CurtinFaces (a Kinect RGB-D database of 52 subjects). For our image set classification experiments, we combine three Kinect datasets: CurtinFaces [19], Biwi Kinect [5] and an in-house dataset acquired at our lab. The number of subjects in each of these datasets is 52 (5000 RGB-D images), 20 (15,000 RGB-D images) and 48 (5000 RGB-D images) respectively. Sample RGB images from these datasets are shown in Figure 4. Each row corresponds to images of a person taken from CurtinFaces (top row), Biwi (middle row) and our Kinect dataset (last row). These datasets are combined into a single dataset of 120 subjects. The images in the joint dataset have a large range of variations in the form of changing illumination conditions, head pose rotations, expression deformations, sunglass disguise, and occlusions by hand.

For performance evaluation, the RGB-D images of each subject are randomly divided into five uniform folds. Considering each fold as an image set, we select one set for training and the remaining sets for testing. All experiments are repeated five times for different selections of training and testing sets. The results averaged over five iterations are summarized in Table 1. The results show that the proposed method achieves a very high performance. The results also suggest that image set classification is a better choice for Kinect based face recognition.
It avoids computationally expensive pre-processing steps, and the identification rates achieved with all image set classification techniques in Table 1 are comparable to or better than that of the single image based technique (96.7%) of [19].

5. Conclusion

We proposed a novel deep learning framework for image set classification. An adaptive multi-layer auto-encoder structure has been introduced, which is first pre-trained for appropriate parameter initialization and then used for learning class specific models. A class specific model automatically learns the underlying non-linear complex geometric surface of the images of that class. These learnt models are then used for a minimum reconstruction error based classification strategy during testing. The proposed framework was extensively evaluated on three benchmark gray scale datasets as well as an RGB-D Kinect dataset, and state of the art performance has been achieved.

Acknowledgements

This work is supported by a SIRF scholarship from The University of Western Australia (UWA) and Australian Research Council (ARC) grant DP0066. Thanks to Salman H. Khan for useful discussions.

References

[1] O. Arandjelovic, G. Shakhnarovich, J. Fisher, R. Cipolla, and T. Darrell. Face recognition with image sets using manifold density divergence. In CVPR, 2005.
[2] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. TPAMI, 35(8):1798-1828, 2013.
[3] M. A. Carreira-Perpinan and G. E. Hinton. On contrastive divergence learning. In Artificial Intelligence and Statistics, 2005.
[4] H. Cevikalp and B. Triggs. Face recognition based on image sets. In CVPR, 2010.
[5] G. Fanelli, J. Gall, and L. Van Gool. Real time head pose estimation with random regression forests. In CVPR, 2011.
[6] R. Gross and J. Shi. The CMU motion of body (MoBo) database. Technical report, 2001.
[7] J. Hamm and D. Lee. Grassmann discriminant analysis: a unifying view on subspace-based learning. In ICML, 2008.
[8] M. Harandi, C. Sanderson, S. Shirazi, and B. Lovell. Graph embedding discriminant analysis on Grassmannian manifolds for improved image set matching. In CVPR, 2011.
[9] M. T. Harandi, C. Sanderson, A. Wiliem, and B. C. Lovell. Kernel analysis over Riemannian manifolds for visual recognition of actions, pedestrians and textures. In WACV, 2012.
[10] M. Hayat, M. Bennamoun, and A. A. El-Sallam. Clustering of video-patches on Grassmannian manifold for facial expression recognition from 3D videos. In WACV, 2013.
[11] G. Hinton, S. Osindero, M. Welling, and Y.-W. Teh. Unsupervised discovery of nonlinear structure using contrastive backpropagation. Cognitive Science, 30(4):725-731, 2006.
[12] G. E. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527-1554, 2006.
[13] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504-507, 2006.
[14] Y. Hu, A. S. Mian, and R. Owens. Sparse approximated nearest points for image set classification. In CVPR, 2011.
[15] S. H. Khan, M. Bennamoun, F. Sohel, and R. Togneri. Automatic feature learning for robust shadow detection. In CVPR, 2014.
[16] M. Kim, S. Kumar, V. Pavlovic, and H. Rowley. Face tracking and recognition with visual constraints in real-world videos. In CVPR, 2008.
[17] T. Kim, J. Kittler, and R. Cipolla. Discriminative learning and recognition of image set classes using canonical correlations. IEEE TPAMI, 29(6):1005-1018, 2007.
[18] K.-C. Lee, J. Ho, M.-H. Yang, and D. Kriegman. Video-based face recognition using probabilistic appearance manifolds. In CVPR, 2003.
[19] B. Y. Li, A. S. Mian, W. Liu, and A. Krishna. Using Kinect for face recognition under varying poses, expressions, illumination and disguise. In WACV, 2013.
[20] T. Ojala, M. Pietikainen, and T. Maenpaa. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. TPAMI, 24(7):971-987, 2002.
[21] D. A. Ross, J. Lim, R.-S. Lin, and M.-H. Yang. Incremental learning for robust visual tracking. IJCV, 77(1-3):125-141, 2008.
[22] P. Turaga, A. Veeraraghavan, A. Srivastava, and R. Chellappa. Statistical computations on Grassmann and Stiefel manifolds for image and video-based recognition. IEEE TPAMI, 33(11):2273-2286, Nov. 2011.
[23] M. Uzair, A. Mahmood, A. Mian, and C. McDonald. A compact discriminative representation for efficient image-set classification with application to biometric recognition. In ICB, 2013.
[24] P. Viola and M. J. Jones. Robust real-time face detection. IJCV, 57:137-154, 2004.
[25] R. Wang and X. Chen. Manifold discriminant analysis. In CVPR, 2009.
[26] R. Wang, H. Guo, L. S. Davis, and Q. Dai. Covariance discriminative learning: A natural and efficient approach to image set classification. In CVPR, 2012.
[27] R. Wang, S. Shan, X. Chen, and W. Gao. Manifold-manifold distance with application to face recognition based on image set. In CVPR, 2008.
[28] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma. Robust face recognition via sparse representation. TPAMI, 31:210-227, 2009.
[29] P. Zhu, L. Zhang, W. Zuo, and D. Zhang. From point to set: Extend the learning of distance metrics. In ICCV, 2013.
