Single Sample Face Recognition via Learning Deep Supervised Auto-Encoders Shenghua Gao, Yuting Zhang, Kui Jia, Jiwen Lu, Yingying Zhang


Abstract—This paper targets learning robust image representation for single training sample per person face recognition. Motivated by the success of deep learning in image representation, we propose a supervised auto-encoder, which is a new type of building block for deep architectures. Two features distinguish our supervised auto-encoder from the standard auto-encoder. First, we enforce the faces with variants to be mapped to the canonical face of the person, for example, the frontal face with neutral expression and normal illumination; second, we enforce features corresponding to the same person to be similar. As a result, our supervised auto-encoder extracts features which are robust to variances in illumination, expression, occlusion, and pose, and facilitates face recognition. We stack such supervised auto-encoders to get the deep architecture and use it for extracting features in image representation. Experimental results on the AR, Extended Yale B, CMU-PIE, and Multi-PIE datasets demonstrate that, by coupling with the commonly used sparse representation based classification, our stacked supervised auto-encoders based face representation significantly outperforms the commonly used image representations in single sample per person face recognition, and it achieves higher recognition accuracy compared with other deep learning models, including the deep Lambertian network, in spite of much less training data and without any domain information. Moreover, the supervised auto-encoder can also be used for face verification, which further demonstrates its effectiveness for face representation.

Index Terms—Single training sample per person; Face recognition; Supervised Auto-encoder; Deep architecture

I. INTRODUCTION

Single sample per person (SSPP) face recognition [1], [2]^1 is a very important research topic in computer vision because of its potential applications in many realistic scenarios like passport identification, gate ID identification, video surveillance, etc. However, as shown in Fig. 1, there is only one training image (gallery image) per person in SSPP, and the faces to be recognized may contain lots of variances in, for example, illumination, expression, occlusion, pose, etc. Therefore, SSPP face recognition is an extremely challenging task. Because of the presence of such challenging intra-class variances, a robust face representation which can overcome the effect of these variances is extremely desirable, and will greatly facilitate SSPP face recognition. However, restricted by having only one training sample for each person, the commonly used subspace analysis based face representation methods, like Eigenfaces [4] and Fisherfaces [5], are no longer suitable nor applicable to SSPP.

Shenghua Gao and Yingying Zhang are with ShanghaiTech University, Shanghai, China. Yuting Zhang is with Zhejiang University, Zhejiang province, China. Kui Jia is with the University of Macau, Macau SAR, China. Jiwen Lu is with the Advanced Digital Sciences Center, Singapore. E-mail: gaoshh@shanghaitech.edu.cn. Copyright (c) 2013 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to pubs-permissions@ieee.org.

^1 SSPP is one specific task of one-shot learning [3].

Fig. 1. Samples of gallery images (the first column) and probe images (the rest of the faces) on (a) the AR dataset, (b) the Extended Yale B dataset, (c) the CMU-PIE dataset (poses C09, C29, C27, C07, C05), and (d) the Multi-PIE dataset (0°, -15°, 15°).
To seek a good SSPP image representation, lots of endeavors have been made, which achieve good performance under certain settings. For example, with the help of manually generated virtual faces [6] or an external dataset [1], traditional face representation methods can be extended to the SSPP scenario. However, the performance of these methods is still not satisfactory for challenging real data. After representing each face with a feature vector, a classification technique, like Nearest Neighbor or sparse representation based classification (SRC) [7], can be used to predict the labels of the probe images. Deep neural networks have demonstrated their great successes in image representation [8], [9], and their fundamental ingredient is the training of a nonlinear feature extractor at each layer [10], [11], [12]. After the layer-wise training of each building block and the building of a deep architecture, the output of the network is used as the image representation in the subsequent task. As a typical building block in deep neural networks, the Denoising Auto-Encoder [11] extracts features through a deterministic nonlinear mapping, and it is robust to the noises of the data. Image representations based on the denoising auto-encoder have shown good performance in many tasks, like object recognition, digit recognition, etc. Motivated by the success of denoising auto-encoder based deep neural networks, and driven by the SSPP face recognition task, we propose a supervised auto-encoder to build the deep neural network.

We first treat the faces with all types of variants (for example, illumination, expression, occlusion, or poses) as images contaminated by noises. With a supervised auto-encoder, we can recover the face without the variant; meanwhile, such a supervised auto-encoder also extracts robust features for image representation in the SSPP scenario, i.e., the features corresponding to the same person should be the same (ideally) or similar, and such features should be stable to the intra-class variances which commonly exist in the SSPP scenario. The contributions of this paper are two-fold. Firstly, we propose a new type of building block, the supervised auto-encoder, for building the deep neural network. Different from the standard auto-encoder, on the one hand, we enforce all the faces with variances to be mapped to the canonical face of the person, for example, the frontal face with neutral expression and normal illumination. Such a strategy helps remove the variances in face recognition. On the other hand, by imposing similarity preservation constraints on the extracted features, the supervised auto-encoder makes the features corresponding to the same person similar, and therefore it extracts more robust features for face representation. Secondly, by leveraging the supervised auto-encoder, robust features can be extracted for image representation in SSPP face recognition, which therefore improves the recognition accuracy of SSPP. The rest of the paper is organized as follows: we review the related work, including the commonly used image representation methods for the SSPP scenario as well as the auto-encoder and its variants, in Section II. In Section III, we introduce our supervised auto-encoder and discuss its application in SSPP. We experimentally evaluate the proposed technique and its parameters in Section IV, and conclude our work in Section V.

II. RELATED WORK

I: Work Related to SSPP Face Representation. Though subspace analysis based methods are usually adopted for effective and efficient face representation in general face recognition, they are no longer suitable nor applicable to the SSPP scenario. On the one hand, because of the limited number of gallery images and the uncertainty of the variances between the probe images and the gallery image, it is not easy to estimate the data distribution and get the proper projection matrix for unsupervised methods, like Eigenfaces [4], 2DPCA [13], etc. To make these unsupervised methods more suitable for SSPP, by taking advantage of virtual faces generated by different methods, Projection-Combined Principal Component Analysis ((PC)^2A) [14], Enhanced (PC)^2A (E(PC)^2A) [15], etc., have been proposed. On the other hand, to make FLDA [5] based image representation usable in the SSPP case, where the intra-class variance cannot be estimated directly because only one training sample is provided for each person, virtual samples are usually generated by using small perturbations [16], transformations [17], SVD decomposition [6], or subimages obtained by dividing each image into small patches [18]. Moreover, the intra-class variances can also be estimated from a generic dataset [1], [19], where each subject contains images with variances in pose, expression, illumination, occlusion, etc. Recently, with the emergence of deep learning techniques, Tang et al. proposed a Deep Lambertian Network (DLN) [20] which combines Deep Belief Nets [10] with the Lambertian reflectance assumption. Therefore DLN extracts illumination invariant features for face representation and it also shows good performance under the SSPP setting, but it is not able to handle other variances, like expressions, poses, occlusions, etc., which restricts its application in real world problems.
II: Work Related to the Auto-Encoder. The auto-encoder, also termed the autoassociator or Diabolo network, is one of the commonly used building blocks in deep neural networks. It contains two modules. (i) An encoder maps the input x to the hidden nodes through some deterministic mapping function f: h = f(x). (ii) A decoder maps the hidden nodes back to the original input space through another deterministic mapping function g: x̂ = g(h). For real-valued input, by minimizing the reconstruction error ||x - g(f(x))||_2^2, the parameters of the encoder and decoder can be learnt. Then the output of the hidden layer is used as the feature for image representation. It has been shown that such a nonlinear auto-encoder is different from PCA [21], and it has been proven that training an auto-encoder to minimize reconstruction error amounts to maximizing a lower bound on the mutual information between the input and the learnt representation [11]. To further boost the ability of the auto-encoder for image representation in building deep networks, Vincent et al. [11] proposed the denoising auto-encoder, which enhances its generalization by training with locally corrupted inputs. Rifai et al. enhance the robustness of the auto-encoder to noises by adding the Jacobian [22], or the Jacobian and Hessian at the same time [23], into the objective of the basic auto-encoder. Zou et al. [24] use a pooling operation after the Reconstruction Independent Component Analysis [25] encoding process, and enforce the pooled features to be similar for instances with the same class label. These extensions improve the performance of auto-encoder based neural networks for image representation in object and digit recognition.

III: Deep Learning based Face Verification Systems. In [26], a Siamese Network (SN) is proposed for face verification, and it is based on a Convolutional Neural Network. In our experiments, we tried different settings and use the one with the best performance. As the SN is proposed for face verification, to do face recognition we run face verification over all probe-gallery face pairs and choose the pair that is most similar. In this way, we can predict the label of the probe. Recently, Taigman et al. also proposed a Convolutional Neural Network based face verification system [27], which significantly outperforms existing hand-crafted features based systems on the Labeled Faces in the Wild (LFW) database. Similarly, Fan et al. use another variant of convolutional neural network, termed Pyramid CNN [41], for face verification, and also achieve similar performance on the LFW database. But these works are proposed for face verification, where a pair of faces is given and the algorithm identifies whether the two faces belong to the same person or not (a random guess is correct 50% of the time). Our task solves face recognition in which, given a test image,

it tries to identify precisely the correct person among many; a random guess gives one over the number of subjects. Besides the difference in the tasks to be solved, the network architecture of our paper is also different from all these existing works. Moreover, our work is also closely related to the work of Zhu et al. [28], [29]. In [28], supervised training is deployed to train a network consisting of an encoding layer and a pooling layer. In [29], multiple CNNs trained on different regions of the face and a regression layer are trained to solve the face verification task. Similar to these works, our work also makes faces of the same person be represented similarly, but we are based on different architectures. Moreover, different from [28], [29], the features learnt by our method are not designed for one specific task. As shown later, our model can be used for face recognition or face verification.

III. SUPERVISED AUTO-ENCODER FOR SSPP FACE REPRESENTATION

The motivation of our supervised auto-encoder for SSPP face representation comes from the denoising auto-encoder [11]. In this section, we will first revisit the denoising auto-encoder. Then we will propose our supervised auto-encoder, its formulation, its optimization, its differences with the denoising auto-encoder, and its application in SSPP face representation.

A. A Revisit of the Denoising Auto-Encoder

A denoising auto-encoder tries to reconstruct the clean input data from a manually corrupted version of it. Mathematically, denote the input as x. In the denoising auto-encoder, the input x is first corrupted by some predefined noise, for example, additive Gaussian noise (x̃ | x ~ N(x, σ²I)), masking noise (a fraction of x is forced to 0), or salt-and-pepper noise (a fraction of x is forced to be 0 or 1). Such a corrupted x̃ is used as the input of the encoder: h = f(x̃) = s_f(W x̃ + b_f). Then the output of the encoder h is fed into the decoder: x̂ = g(h) = s_g(W' h + b_g). Here s_f and s_g are predefined activation functions of the encoder and decoder respectively, which can be sigmoid functions, hyperbolic tangent functions, or rectifier functions [30], etc. W ∈ R^{d_h × d_x} and b_f ∈ R^{d_h} are the parameters of the encoder, and W' ∈ R^{d_x × d_h} and b_g ∈ R^{d_x} are the parameters of the decoder. d_x and d_h are the dimensionality of the input data and the number of hidden nodes respectively. Based on the above definitions, the objective of the denoising auto-encoder is given as follows:

\min_{W, W', b_f, b_g} \sum_{x \in X} L(x, \hat{x})    (1)

Here L is the reconstruction error, typically the squared error L(x, x̂) = ||x - x̂||² for real-valued inputs. After learning f and g, the output of the encoder on the clean input (f(x)) is used as the input of the next layer. By training such denoising auto-encoders layer by layer, stacked denoising auto-encoders are built. Experimental results show that stacked denoising auto-encoders greatly improve the generalization performance of the neural network. Even when the fraction of corrupted pixels (corrupted by zero masking noise) reaches up to 55%, the recognition accuracy is still better than or comparable with that of a network trained without corruption.

B. Supervised Auto-Encoder

In real applications, like passport or gate ID identification, the only training sample (gallery image) for each person is usually a frontal face with frontal/uniform lighting, neutral expression, and no occlusion. However, the test faces (probe images) are usually accompanied by variances in illumination, occlusion, expression, pose, etc. Compared to the denoising auto-encoder, these gallery images can be seen as clean data and these probe images can be seen as corrupted data. For robust face recognition, we desire to learn features which are robust to these variances.
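As a concrete reference point for the denoising auto-encoder just revisited, the following is a minimal NumPy sketch of one DAE forward pass and its reconstruction loss (equation (1)). The masking rate, layer sizes, tanh activations, and tied weights are illustrative assumptions, not values prescribed by [11] or by this paper.

```python
# Sketch of one denoising auto-encoder forward pass (Section III-A), under assumed settings.
import numpy as np

rng = np.random.default_rng(0)
d_x, d_h, n = 1024, 1024, 8            # input dim (e.g., 32x32 faces), hidden dim, batch size

W   = rng.uniform(-np.sqrt(6.0 / (d_h + d_x)), np.sqrt(6.0 / (d_h + d_x)), (d_h, d_x))
b_f = np.zeros(d_h)                    # encoder bias
b_g = np.zeros(d_x)                    # decoder bias (tied weights: decoder uses W.T)

def corrupt(x, mask_rate=0.10):
    """Zero-masking noise: force a random fraction of the pixels to 0."""
    mask = rng.random(x.shape) >= mask_rate
    return x * mask

def encode(x):
    return np.tanh(x @ W.T + b_f)      # h = f(x)

def decode(h):
    return np.tanh(h @ W + b_g)        # x_hat = g(h), tied weights W' = W^T

X       = rng.standard_normal((n, d_x))   # stand-in for vectorized face images
X_tilde = corrupt(X)                      # corrupted input fed to the encoder
X_hat   = decode(encode(X_tilde))         # reconstruction of the clean input

recon_loss = np.mean(np.sum((X - X_hat) ** 2, axis=1))   # squared error of equation (1)
print("reconstruction loss:", recon_loss)
```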
The success of the denoising auto-encoder convinces us of the possibility of learning such features. Then the problem becomes: how do we learn a mapping function which captures the discriminative structures of the faces of different persons, while staying robust to the possible variances of these faces? Once such a function is learnt, robust features can be extracted for image representation, with an expected improvement to the performance of SSPP face recognition. Given a set of data which contains the gallery images (clean data), the probe images (corrupted data), as well as their labels, we use them to train a deep neural network for feature extraction. We denote each probe image in this dataset as x̃_i, and its corresponding gallery image as x_i (i = 1, ..., N). It is desirable that x_i and x̃_i should be represented similarly. Therefore the following formulation is proposed (following the work [11], [22], only the tied weights case is explored in this paper, i.e., W' = W^T):

\min_{W, b_f, b_g} \frac{1}{N} \sum_i \left( \| x_i - g(f(\tilde{x}_i)) \|_2^2 + \lambda \| f(x_i) - f(\tilde{x}_i) \|_2^2 \right) + \alpha \left( \mathrm{KL}(\rho_x \| \rho_0) + \mathrm{KL}(\rho_{\tilde{x}} \| \rho_0) \right)    (2)

where

\rho_x = \frac{1}{N} \sum_i \frac{1}{2}\left( f(x_i) + 1 \right), \quad \rho_{\tilde{x}} = \frac{1}{N} \sum_i \frac{1}{2}\left( f(\tilde{x}_i) + 1 \right),
\mathrm{KL}(\rho \| \rho_0) = \sum_j \left( \rho_0 \log\frac{\rho_0}{\rho_j} + (1 - \rho_0) \log\frac{1 - \rho_0}{1 - \rho_j} \right).    (3)

In this paper, the activation functions used are the hyperbolic tangent, i.e., h = f(x) = tanh(W x + b_f) and g(h) = tanh(W^T h + b_g). We list the properties of the supervised auto-encoder as follows: 1) The first term in equation (2) is the reconstruction error. It means that though the probe images contain some variances, after passing through the encoder and the decoder, they will be repaired. In this way, our learnt model is robust to the variances of expression, occlusion, pose, etc., which are quite different from the noises in the denoising auto-encoder.

Fig. 2. Architecture of the Stacked Supervised Auto-Encoders. The left figure: the basic supervised auto-encoder, which is comprised of the clean/corrupted faces, their features (hidden layer), as well as the clean face reconstructed from the corrupted face. The middle figure: the output of the previous hidden layer is used as the input to train the next supervised auto-encoder. We repeat such training several times until the desired number of hidden layers is reached. In this paper, only two hidden layers are used. The right figure: once the network is trained, given any input face, the output of the last hidden layer is used as the feature for image representation.

2) The second term in equation (2) is the similarity preservation term. Since the output of the hidden layer is used as the feature, f(x_i) and f(x̃_i) correspond to the features of the same person. It is desirable that they should be the same (ideally) or similar. Such a constraint enforces the learning of a nonlinear mapping robust to the variances that commonly appear in SSPP. 3) The third and the fourth terms, the Kullback-Leibler divergence (KL divergence) terms in equation (2), introduce sparsity in the hidden layer. Work from biological studies shows that the percentage of neurons of the human brain activated at the same time is around 1% to 4% [31]. Therefore the sparsity constraint on the activation of the hidden layer is commonly used in auto-encoder based neural networks, and results show that a sparse auto-encoder often achieves better performance [32] than one trained without the sparsity constraint. Since the activation function we use is the hyperbolic tangent, its output is between -1 and 1, where the value of -1 is regarded as non-activated [33]. Therefore we first map the output of the encoder to the range (0, 1) in equation (3). Here ρ_x and ρ_x̃ are the mapped average activations of the clean data and the corrupted data respectively. By choosing a small ρ_0, the KL divergence regularizer enforces that only a small fraction of the neurons are activated [34], [33]. Following the work [34], [33], we set ρ_0 to the same small value used there. 4) Weighting the first and the second terms with 1/N (N is the total number of training samples) helps balance the contributions of the first two terms and the last terms in the optimization. Otherwise we might need to tune α on different datasets because the number of training samples for different datasets may differ. 5) We aim at learning a feature extractor that represents faces corresponding to the same person similarly, whereas [35] and [26] focus on learning a distance metric in the last layer of their respective DNN architectures. Our work is different from [35] and [26] in terms of network architecture and application. 6) Since the labels of the faces are used while training this building block, we term our proposed formulation the Supervised Auto-Encoder (SAE). We illustrate the idea of such a supervised auto-encoder in Fig. 2 (the left figure).

C. Optimization of the Supervised Auto-Encoder

The optimization method is very important for the good performance of deep neural networks. Following the work [36], [37], the entries of W are randomly sampled from the uniform distribution on [-\sqrt{6}/\sqrt{d_h + d_x}, \sqrt{6}/\sqrt{d_h + d_x}], and b_f and b_g are initialized with zero vectors. It is worth noting that such normalized initialization is very important for the good performance of auto-encoder based methods, presumably because the layer-to-layer transformations maintain the magnitudes of activations (flowing upward) and gradients (flowing backward) [36].
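To make the objective of equations (2)-(3) concrete, here is a short NumPy sketch of the SAE loss with the normalized initialization just described. The batch size, the values of λ, α, and ρ_0, and the synthetic inputs are illustrative assumptions; the paper's actual settings are given in Section IV-A.

```python
# Sketch of the supervised auto-encoder loss, equations (2)-(3), under assumed parameter values.
import numpy as np

rng = np.random.default_rng(0)
d_x, d_h, N = 1024, 1024, 16

# Normalized initialization of Section III-C: W ~ U[-sqrt(6/(d_h+d_x)), +sqrt(6/(d_h+d_x))]
bound = np.sqrt(6.0 / (d_h + d_x))
W   = rng.uniform(-bound, bound, (d_h, d_x))
b_f = np.zeros(d_h)
b_g = np.zeros(d_x)

def f(x):                                # encoder with tanh activation
    return np.tanh(x @ W.T + b_f)

def g(h):                                # decoder with tied weights W' = W^T
    return np.tanh(h @ W + b_g)

def kl(rho, rho0):
    """KL divergence regularizer of equation (3), summed over hidden units."""
    rho = np.clip(rho, 1e-6, 1 - 1e-6)   # numerical safety only, not part of eq. (3)
    return np.sum(rho0 * np.log(rho0 / rho) +
                  (1 - rho0) * np.log((1 - rho0) / (1 - rho)))

def sae_loss(X, X_tilde, lam=5.0, alpha=1e-3, rho0=0.05):
    """Objective of equation (2).  X holds gallery (clean) faces, X_tilde the
    row-aligned probe (corrupted) faces of the same persons; lam, alpha, rho0
    are illustrative values."""
    H_clean, H_corr = f(X), f(X_tilde)
    recon = np.sum((X - g(H_corr)) ** 2, axis=1)        # ||x_i - g(f(x~_i))||^2
    sim   = np.sum((H_clean - H_corr) ** 2, axis=1)     # ||f(x_i) - f(x~_i)||^2
    # tanh outputs lie in (-1, 1); map the average activations into (0, 1) as in eq. (3)
    rho_clean = 0.5 * (H_clean.mean(axis=0) + 1)
    rho_corr  = 0.5 * (H_corr.mean(axis=0) + 1)
    return (recon + lam * sim).mean() + alpha * (kl(rho_clean, rho0) + kl(rho_corr, rho0))

X       = rng.standard_normal((N, d_x))                 # stand-in for gallery features
X_tilde = X + 0.1 * rng.standard_normal((N, d_x))       # stand-in for probe images
print("SAE loss:", sae_loss(X, X_tilde))
```

In practice this scalar loss (and its gradient with respect to W, b_f, b_g) would be handed to an L-BFGS routine, as described next.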
Then the Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) algorithm is used for learning the parameters because of its faster convergence rate and better performance compared to stochastic gradient descent and conjugate gradient [38] methods. As the computational cost of the similarity preservation term (the 2nd term in equation (2)) and of the calculation of its gradient with respect to the unknown parameters is almost the same as that of the reconstruction error term, the overall computational complexity is still O(d_x d_h) [22] in our supervised auto-encoder.

D. Stacking Supervised Auto-Encoders to Build a Deep Architecture

Deep neural networks demonstrate better performance for image representation than shallow neural networks [32]. Therefore we also train the supervised auto-encoder in a layer-wise manner to get the deep architecture. We term such an architecture the Stacked Supervised Auto-Encoders (SSAE). After learning the activation function of the encoder in the previous layer, it is applied to the clean data and the corrupted data respectively, and the outputs serve as the clean input data and corrupted input data to train the supervised auto-encoder in the next layer. Once the whole network is trained, the output

of the highest layer is used as the feature for image representation. We illustrate this learning strategy in Fig. 2.

E. The Differences Between Stacked Supervised Auto-Encoders and Stacked Denoising Auto-Encoders

Our stacked supervised auto-encoders differ from the stacked denoising auto-encoders in the following three aspects: 1. Input. The denoising auto-encoder is an unsupervised feature learning method, and the corrupted data are generated by manually corrupting the clean data with predefined noises. Therefore the trained model is robust to these predefined noises. However, in our supervised auto-encoder, the corrupted data are photos taken under different settings, and they are physically meaningful data. By enforcing the reconstructed probe image to be similar to its gallery image, we can overcome the possible variances which appear in SSPP face recognition, and which are quite different from the noises in the denoising auto-encoder. This is why our supervised auto-encoder needs the labels of the data to train the network. 2. Formulation. The similarity preservation term is used to further emphasize the desired property of SSPP image representation. Though we enforce the reconstructed probe image to be similar to the gallery image, it cannot be guaranteed that their extracted features are similar, because the variances in pose, expression, illumination, etc. can make the probe and the gallery quite different. But in SSPP face recognition, it is desirable that the images of the same person have similar representations. To this end, the similarity preservation term is added to the objective of SAE in equation (2). Such a term further makes the extracted features robust to the variances in face recognition. As shown in Section IV-D, this term greatly improves the recognition accuracy, especially for the cases with large intra-class variances. 3. Building a deep architecture. In stacked denoising auto-encoders, once the activation function of the encoder in the previous layer is trained, it maps the clean data to the hidden layer, and the outputs serve as the clean input data of the next-layer DAE. Then noises generated in the same way as in the previous layer are added to the clean data to serve as the corrupted data. As claimed in [34], since the features in the hidden layer are no longer in the same feature space as the input data, the meaning of applying noises generated in the same way as in the previous layer to the output of the hidden layer is no longer clear. Similar to [34], we apply the activation function of the encoder to both the clean data and the corrupted data, and their outputs serve as the clean and corrupted input data of the next layer. Hence this way of stacking the basic supervised auto-encoders is more natural for building a deep architecture.

IV. EXPERIMENTS

In this section, we experimentally evaluate the stacked supervised auto-encoder for extracting features in the SSPP face recognition task on the Extended Yale B, CMU-PIE, AR, and Multi-PIE datasets. Important parameters are also experimentally evaluated. Besides face recognition, we also use the stacked supervised auto-encoder for face verification on the LFW dataset.

A. Experimental Setup

We use the Extended Yale B, AR, CMU-PIE, and Multi-PIE datasets to evaluate the effectiveness of the proposed SSAE model. All the images are gray-scale images and are manually aligned.^2 The image size is 32 x 32, thus the dimensionality of the input vector is 1024. Each input feature is normalized with the l_2 normalization. We set λ = λ_0 d_x / d_h. Further, we set λ_0 = 10 on the CMU-PIE and Multi-PIE datasets, and set λ_0 = 5 on the AR and the Extended Yale B datasets. The reason for this is that the variances are more significant on the CMU-PIE and Multi-PIE datasets than on the AR or the Extended Yale B datasets. The weight corresponding to the KL divergence term (α) is fixed across all datasets. Following the work [11], we fix the number of hidden nodes to be the same in all the hidden layers. In our experiments, we found that two hidden layers already give a sufficiently good performance.

^2 In all our experiments, faces are cropped based on manually labeled landmark points, and aligned based on landmark points. We also tested our work with faces cropped by the OpenCV face detector on the AR dataset, but the performance of our work on such data is less than 50% on AR. One possible reason for such poor performance is that we do not have enough data to train a deep network that is robust to the misalignment. The combination of our work with an automatic face alignment method is our future work along this direction.
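The following sketch illustrates the layer-wise stacking rule of Section III-D together with the feature-extraction step configured above (l_2-normalized features from the last hidden layer). The routine train_sae below is a placeholder for the actual SAE training of equation (2); its random tanh encoder, the layer sizes, and the synthetic data are assumptions for illustration only.

```python
# Sketch of stacking supervised auto-encoders (Section III-D) and extracting SSAE features.
import numpy as np

def train_sae(X_clean, X_corrupt, d_h):
    """Placeholder for training one supervised auto-encoder with the loss of equation (2)
    (e.g., via L-BFGS); here it returns a randomly initialized tanh encoder for illustration."""
    rng = np.random.default_rng(0)
    bound = np.sqrt(6.0 / (d_h + X_clean.shape[1]))
    W = rng.uniform(-bound, bound, (d_h, X_clean.shape[1]))
    b = np.zeros(d_h)
    return lambda X: np.tanh(X @ W.T + b)

def train_ssae(X_clean, X_corrupt, layer_sizes):
    """Stack SAEs layer by layer: the trained encoder is applied to BOTH the clean and the
    corrupted data, and its outputs become the clean/corrupted inputs of the next layer."""
    encoders = []
    for d_h in layer_sizes:
        f = train_sae(X_clean, X_corrupt, d_h)
        encoders.append(f)
        X_clean, X_corrupt = f(X_clean), f(X_corrupt)
    return encoders

def extract_features(X, encoders):
    """At test time, the output of the last hidden layer is the face representation,
    l2-normalized before sparse-representation-based classification."""
    for f in encoders:
        X = f(X)
    return X / np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)

rng = np.random.default_rng(1)
gallery = rng.standard_normal((20, 1024))                 # stand-in for clean training faces
probes  = gallery + 0.1 * rng.standard_normal(gallery.shape)
encoders = train_ssae(gallery, probes, layer_sizes=[1024, 1024])   # two hidden layers
features = extract_features(probes, encoders)
print(features.shape)
```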
After extracting features with the stacked supervised auto-encoders for face representation, following the work [7], sparse representation based classification is used for face recognition. The features are normalized to make their l_2 norms equal to 1 before sparse coding.

Baselines. We compare our Stacked Supervised Auto-Encoders (SSAE) with the following works because of their close relationships. It is also worth noting that all the comparisons are based on the same training/test split, and on the same generic data if they are used. 1) Sparse Representation based Classification (SRC) [7] with raw pixels; 2) the LBP feature followed by SRC; 3) Collaborative Representation based Classification (CRC) [39]; 4) AGL [1]; 5) the One-Shot Similarity Kernel (OSS) [40], where the generic data are used as the negative data, i.e., we use the same generic/training/testing split for both OSS and our SAE; 6) the Denoising Auto-Encoder (DAE) [11] with 10% masking noise followed by SRC; 7) the Modified Denoising Auto-Encoder (MDAE), in which we use the reconstruction error term only in equation (2) (λ = α = 0);

8) the Siamese Network (SN) [26].^3 Among these baseline methods, 1)-3) are the very popular sparse coding related methods, 4)-5) are specially designed for SSPP, and 6)-8) are the most closely related deep learning methods.^4

B. Dataset Description

The CMU-PIE dataset [42] contains 41,368 images of 68 subjects. For each subject, the images are taken under 13 different poses, 4 different illumination conditions, and 4 different expressions. For each subject, we use the face images taken with the frontal pose, neutral expression, and normal lighting condition as the galleries, and use the rest of the images taken with the poses C27, C29, C07, C05, and C09 as probes. We use the images of 20 subjects to learn the SSAE, and use the remaining 48 subjects for evaluation.

The AR dataset [43] contains over 4,000 frontal faces taken from 126 subjects (70 men and 56 women) in two different sessions, and the images contain variances in occlusion (sunglasses or scarves), expression (neutral expression, smile, anger, scream), and illumination. Some images contain both occlusion and illumination variances. In our experiments, 20 subjects from session 1 are used as the generic set for training the SSAE, and the other subjects, also from session 1, are used for evaluation.

The Extended Yale B dataset [44] contains 38 categories. For each subject, we use the frontal faces whose light source direction with respect to the camera axis is 0 degrees azimuth ("A+000") and 0 degrees elevation ("E+00") as gallery images, and use the rest of the images with different lighting conditions as the probe images. Following the work on deep Lambertian networks (DLN) [20], 28 categories are used to train the SSAE and the remaining 10 categories, which are from the original Yale B dataset, are used for evaluation.

The Multi-PIE dataset [45] contains images of 337 persons taken in four sessions over a span of 5 months. For each person, images are taken under 15 different view angles and 19 different illuminations while displaying different facial expressions. In our experiments, only the images with the neutral expression, near frontal poses (0°, 15°, -15°), and different illuminations are used. For each person, the frontal face with neutral expression and frontal illumination is used as the gallery, and all the other images are used as the probes.

^3 We tried several different network architectures in order to get the best experimental result for the Siamese network. Surprisingly, one of the best performers is the two-layer network which takes a fully connected layer (with 500 hidden units and the tanh non-linearity) as the first layer and the Siamese layer as the second one. The original deep architecture proposed in [26] (i.e., the basic architecture C1-S2-C3-S4-C5-F6) does not work as well as this simpler one (with the architecture listed in [26], the accuracy of the Siamese Network on AR is below 60%). Probably, the limited size of our generic training set makes it difficult to train a good convolutional neural network with a deep architecture. Also, as the images we use are well aligned, which is different from [26], the convolutional layers in the original architecture might become redundant.

^4 Because the codes of DeepFace [27] and Pyramid CNN [41] are not available, and their implementation and preprocessing involve lots of tricks, say a very sophisticated alignment method and lots of outside data required for network training, we do not compare our method with these works here. Another reason for not comparing with [27], [41] is that the task solved by these methods is face verification, whereas our work solves the face recognition task.
The 249 persons in session 1 are used for evaluation, and the first-appearance images of the other 88 persons, which only appear in sessions 2-4, are used to train the network.

C. Performance Evaluation

The performance of the different methods on the AR, Extended Yale B, CMU-PIE, and Multi-PIE datasets is listed in Table I, Table II, and Table III. We can see that our SSAE outperforms all the rest of the methods, including those specially designed for SSPP image representation, and it achieves the best performance, which proves the effectiveness of SSAE for extracting robust features for face representation. Needless to say, SSAE based face representation can also be combined with ESRC if another set of labeled data is provided.

I. SAE vs. hand-crafted features. The improvement of the features learnt by our method over the popular LBP features is between about 10% (C27) and 31% (C29) on the CMU-PIE dataset, with larger improvements for subsets with larger poses. On the AR dataset and the Extended Yale B dataset, where there is little pose variation, the improvement of our method over LBP is still around 5%. Our method also outperforms ESRC, which introduces a pre-trained intra-class variance dictionary to extend the SRC method to the SSPP case, and the improvement is very evident for cases of large pose variance (e.g., CMU-PIE C07, C09, C29, and Multi-PIE 15°, -15°). Compared with ESRC, another advantage of our method is speed. ESRC relies on the intra-class variance dictionary to achieve good performance, and the size of such a dictionary is usually big. Such a big dictionary increases the computational cost, but in our method the dictionary size in the sparse coding is the same as the number of gallery faces. Thus our method is significantly faster than ESRC in testing. For example, ESRC costs about 4.63 seconds to solve the sparse coding for each probe face on Extended Yale B, while our method solves the sparse coding for each probe face in a small fraction of that time. Here the number of atoms in the intra-class variance dictionary is 1746 on the Extended Yale B, therefore the total number of atoms in the dictionary is 1756 for ESRC; in our method, the number of atoms in the dictionary of SRC is only 10. All the methods are based on Matlab implementations, and run on a Windows Server (64-bit) with a 2.13GHz CPU and 16GB RAM. It is worth noting that, as the intra-class variance dictionary is very large on Multi-PIE, the optimization is very slow; therefore we do not include its performance on the Multi-PIE dataset.

II. SAE vs. other DNNs. Compared with DAE, the improvement of our method is over 30% on all the datasets. The reason for the poor performance of DAE for SSPP face representation is that the training of DAE is unsupervised, and zero masking noise is used to train the DAE. Therefore it is natural that the trained DAE cannot handle the variances in poses, expressions, etc. Different from DAE, MDAE enforces the reconstructed faces of the probe images, which contain the variances in expression, pose, etc., to be well aligned with their gallery images (frontal faces); therefore its performance is better than that of DAE, but it is still inferior to that of the stacked

SAE, especially for the cases with variances in pose and expression on AR and CMU-PIE. In contrast, besides using a more appropriate reconstruction error term, our SAE also enforces the extracted features of the same person to be similar. Therefore, SAE can overcome these variances in SSPP and is more suitable for extracting features for face recognition. Moreover, on the Extended Yale B dataset, the performance of our method (83.97% when the number of hidden nodes is 1024, and 82.22% when the number of hidden nodes is 2048) is better than the 81% obtained by the Deep Lambertian Network (DLN) [20], which is specially designed for removing the illumination effect in SSPP face recognition [20]. In addition to the 28 categories from the Extended Yale B, DLN also uses the Toronto Face Database, which is a very large dataset, to train the network. Although we use much less data to train the SAE and it does not use any domain information about the illumination model as in DLN, the performance of SAE is still better.^5 In addition, our SAE also outperforms SN on all the datasets. These experiments clearly demonstrate the superiority of SAE for recognition tasks over other DNN building blocks such as DAE and DLN.

III. More observations. We notice that as the poses of the probe images change, the performance drops notably on CMU-PIE. For example, the recognition accuracy drops by at least 10.83% (C09) compared with that of the frontal pose (C27). Interestingly, we also notice that the performance for C07 (look up) and C09 (look down) is slightly better than that for C05 (look right) and C29 (look left). A possible reason is that the missing parts of the face are larger for C05 and C29 than for C07 and C09 (please refer to Fig. 1). Similar observations can also be made on the Multi-PIE dataset in Table III, where the recognition accuracy also drops by around 30% when the poses are non-frontal. Moreover, some probe images and their reconstructed images on the AR dataset are shown in Fig. 3. We can see that our method can remove the illumination effect and recover the neutral face from faces with different expressions. For the faces with occlusion, our method can also simulate the faces without occlusion, but compared with illumination and expression, it is more difficult to recover the face from the occluded face because too much information is lost. Such results are natural because a human can infer the faces with normal illumination and neutral expression from experience (for the deep neural networks, the experience is learnt from the generic set). But it is also almost impossible for a human to infer the occluded face parts because too much information is missing.

^5 Because the results on the AR and CMU-PIE datasets and the codes for DLN are not available for comparison, we do not report its performance on the AR and CMU-PIE datasets. But it is unlikely that DLN works well on the AR and CMU-PIE as it is not designed to handle other variations in the data such as poses, occlusions, expressions, etc. Moreover, the state-of-the-art performance on Extended Yale B has reached 93.6% in terms of accuracy under the SSPP setting in [46], but the experimental setup in [46] is a little different from that in our paper.
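For concreteness, the test-time pipeline discussed above can be summarized as follows: the SSAE gallery features form the SRC dictionary (one atom per subject in SSPP), and each probe is assigned to the class with the smallest reconstruction residual [7]. The sketch below is illustrative only; the Lasso solver and its regularization weight are our own stand-ins, not the solver used in the paper, and the feature vectors are synthetic.

```python
# Sketch of SRC classification [7] on SSAE features, with an assumed l1 solver and alpha.
import numpy as np
from sklearn.linear_model import Lasso

def src_classify(gallery_feats, gallery_labels, probe_feat, alpha=0.01):
    """gallery_feats: (n_gallery, d) l2-normalized features, one per subject (SSPP);
    probe_feat: (d,) l2-normalized feature of the probe face."""
    D = gallery_feats.T                              # dictionary: d x n_gallery
    lasso = Lasso(alpha=alpha, fit_intercept=False, max_iter=5000)
    lasso.fit(D, probe_feat)                         # sparse coding of the probe
    codes = lasso.coef_
    residuals = {}
    for label in np.unique(gallery_labels):
        mask = (gallery_labels == label)
        delta = np.zeros_like(codes)
        delta[mask] = codes[mask]                    # keep only this class's coefficients
        residuals[label] = np.linalg.norm(probe_feat - D @ delta)
    return min(residuals, key=residuals.get)         # class with the smallest residual

rng = np.random.default_rng(0)
gallery = rng.standard_normal((10, 128))             # stand-in for SSAE gallery features
gallery /= np.linalg.norm(gallery, axis=1, keepdims=True)
labels = np.arange(10)                               # one gallery face per person
probe = gallery[3] + 0.05 * rng.standard_normal(128)
probe /= np.linalg.norm(probe)
print("predicted subject:", src_classify(gallery, labels, probe))
```

With only one gallery atom per subject, the dictionary stays small, which is the source of the speed advantage over ESRC noted above.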
[Table I. Performance comparison between different methods on the Extended Yale B and the AR datasets (%). Methods compared: DAE, MDAE, DLN (NA on AR, 81 on Extended Yale B), SN, AGL, CRC, SRC, OSS, LBP, ESRC, and SSAE; the remaining numeric entries are not recoverable here.]

[Table II. Performance comparison between different methods (DAE, MDAE, SN, AGL, CRC, SRC, OSS, LBP, ESRC, and SSAE) on the CMU-PIE dataset (%), for poses C27, C05, C07, C09, and C29; numeric entries not recoverable here.]

[Table III. Performance comparison between different methods (DAE, MDAE, AGL, CRC, SRC, LBP, and SSAE) on the Multi-PIE dataset (%); numeric entries not recoverable here.]

D. Evaluation of the Similarity Preservation Term

The similarity preservation term, the second term in equation (2), is very important in the formulation of SAE. Here we report the performance of the networks trained with and without this term in Fig. 4.

Fig. 3. A comparison between the original images and the reconstructed ones on the AR dataset: (a) corrupted faces used for training the network; (b) corrupted faces used for evaluation; (c) reconstructed faces of (a); (d) reconstructed faces of (b).

Fig. 4. The effect of the similarity preservation term on the CMU-PIE dataset (%): recognition accuracy of SSAE with and without the similarity preservation term for the poses C27, C05, C07, C09, and C29.

Fig. 5. The effect of SSAE with different depths (1 layer vs. 2 layers) on the Multi-PIE dataset.

It can be seen that this term improves the recognition accuracy by 3% for the benign frontal face case (C27), by about 10% for C09 and C07, and by about 29% for C05 and C29. We can see that the improvement in performance increases as the pose angle increases (the misalignment of faces is larger for C05 and C29 than for C07 and C09). This validates that the similarity preservation term indeed enhances the robustness of the face representation, especially for pose variations.

E. The Effect of the Deep Architecture

As an example, we show the performance of the SSAE with different numbers of layers on the Multi-PIE dataset in Fig. 5. We can see that the 2-layer SSAE network outperforms the single-layer network, which demonstrates the effectiveness of the deep architecture.

F. Parameter Evaluation

a. Number of Hidden Nodes in the Hidden Layers. We change the number of hidden nodes from 512 to 4096 on the Extended Yale B, and show the performance in Fig. 6 (top left). The results show that the recognition accuracy is higher when the number of hidden nodes is 1024 or 2048, which is the same as or larger than the dimensionality of the data. The reason for this is that more hidden nodes usually increase the expressiveness of the auto-encoder [32]. We also note that too many hidden nodes actually decrease the accuracy. A possible reason is that the data used for learning the network in our setting are not enough (only images of 28 subjects are used to learn the parameters), which may limit the stability of the learnt network.

b. Weight of the KL Terms (α). We plot the performance with different α in Fig. 6 (top right). The poor performance when α is too small (10^-6) proves the importance of the sparsity term. But if we impose too large a weight α (10^-2), more hidden nodes hibernate for a given input, which affects the sensitivity of the model to the input and may make faces of different persons be represented similarly. This may be the reason that, for a value of 10^-2, our model has the lowest performance. So properly setting α is a requirement for the good performance of our model. A similar phenomenon also

happens in sparse representation based face recognition [47]: too small or too large a sparsity will both reduce the recognition accuracy.

c. Weight of the Similarity Preservation Term (λ). We plot the performance of SSAE with different λ in Fig. 6 (bottom). The poor performance with a smaller λ_0 (0.1 or 1) also demonstrates the importance of the similarity preservation term. But if λ is too big, the learnt network might become less discriminative for different subjects, as it enforces too strongly the similarity of the learnt features for diverse samples. That leads to a drop in performance for very large λ. Moreover, from the plot, we see that good performance can be obtained for a fairly large range of λ.

Fig. 6. The effect of different parameters in the supervised auto-encoder on the Extended Yale B dataset: the number of hidden nodes (top left), α (top right), and λ_0 (bottom), each plotted against the recognition accuracy (%).

G. The Comparison of Different Activation Functions

Besides the hyperbolic tangent, we also evaluate the performance of SSAE with other activation functions, including the sigmoid and Rectified Linear Units (ReLU) [48]. To achieve sparsity in the hidden layer, different strategies are used for different activation functions. Specifically, if the activation function is the sigmoid, the objective function is rewritten as follows:

\min_{W, b_f, b_g} \frac{1}{N} \sum_i \left( \| x_i - g(f(\tilde{x}_i)) \|_2^2 + \lambda \| f(x_i) - f(\tilde{x}_i) \|_2^2 \right) + \alpha \left( \mathrm{KL}(\rho_x \| \rho_0) + \mathrm{KL}(\rho_{\tilde{x}} \| \rho_0) \right)    (4)

where

\rho_x = \frac{1}{N} \sum_i f(x_i), \quad \rho_{\tilde{x}} = \frac{1}{N} \sum_i f(\tilde{x}_i),
\mathrm{KL}(\rho \| \rho_0) = \sum_j \left( \rho_0 \log\frac{\rho_0}{\rho_j} + (1 - \rho_0) \log\frac{1 - \rho_0}{1 - \rho_j} \right).    (5)

If the activation function is ReLU, the objective function is rewritten as follows:

\min_{W, b_f, b_g} \frac{1}{N} \sum_i \left( \| x_i - g(f(\tilde{x}_i)) \|_2^2 + \lambda \| f(x_i) - f(\tilde{x}_i) \|_2^2 \right) + \alpha \sum_i \left( \| f(x_i) \|_1 + \| f(\tilde{x}_i) \|_1 \right)    (6)

We list the performance of our SAE based on the different activations on the AR dataset in Table IV. We can see that the hyperbolic tangent usually achieves the best performance. It is worth noting that ReLU is usually used in Convolutional Neural Networks (CNN) and demonstrates good performance for image classification. But many tricks, including momentum, weight decay, and early stopping, are used to optimize the objective function in CNNs. In our objective function optimization, we simply use L-BFGS. Many existing works [49], [38] have shown that different optimization methods can greatly affect the performance of deep neural networks. Perhaps a more advanced optimization method and more tricks would help improve the performance of ReLU.

[Table IV. Performance comparison with different activation functions (sigmoid, ReLU, tanh) on the AR and Extended Yale B datasets (%); numeric entries not recoverable here.]

H. Extension: SSAE for Face Verification

Besides face recognition, our model can also be used for face verification. Specifically, we tested our work on the LFW dataset under the constrained, without-outside-data protocol. The performance of the different methods under such a protocol is listed in Table V. Interestingly, though our work is designed for learning features that make faces of the same person be represented similarly, the performance of our model for face verification is not bad. By learning a more sophisticated distance metric [50] together with our face representation simultaneously, the performance of our method on LFW could probably be further boosted.

[Table V. Performance comparison (AUC, %) between different methods on the LFW dataset: V1-like/MKL, funneled [51]; APEM (fusion), funneled [52]; MRF-MLBP [53]; Fisher vector faces [54]; Eigen-PEP [55]; MRF-Fusion-CSKDA [56]; SSAE. Numeric entries not recoverable here.]
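Before concluding, the following sketch illustrates how the sparsity regularizer of the SAE objective changes with the activation function, as discussed in Section IV-G: a KL penalty on the shifted mean activation for tanh (equations (2)-(3)), a KL penalty on the raw mean activation for sigmoid (equations (4)-(5)), and an l_1 penalty for ReLU (equation (6)). The target ρ_0 and the synthetic activations are illustrative assumptions.

```python
# Sketch of the activation-dependent sparsity penalties of equations (2)-(6), under assumed values.
import numpy as np

def kl_to_target(rho, rho0=0.05):
    """KL(rho0 || rho_j) summed over hidden units (equations (3)/(5)); rho0 is illustrative."""
    rho = np.clip(rho, 1e-6, 1 - 1e-6)
    return np.sum(rho0 * np.log(rho0 / rho) + (1 - rho0) * np.log((1 - rho0) / (1 - rho)))

def sparsity_penalty(H_clean, H_corrupt, activation):
    """Sparsity term of the SAE objective for each activation choice (Section IV-G)."""
    if activation == "tanh":       # outputs in (-1, 1): map the means into (0, 1), eq. (2)-(3)
        return (kl_to_target(0.5 * (H_clean.mean(axis=0) + 1)) +
                kl_to_target(0.5 * (H_corrupt.mean(axis=0) + 1)))
    if activation == "sigmoid":    # outputs already in (0, 1): use the means directly, eq. (4)-(5)
        return kl_to_target(H_clean.mean(axis=0)) + kl_to_target(H_corrupt.mean(axis=0))
    if activation == "relu":       # unbounded non-negative outputs: l1 penalty, eq. (6)
        return np.sum(np.abs(H_clean)) + np.sum(np.abs(H_corrupt))
    raise ValueError(activation)

rng = np.random.default_rng(0)
H1, H2 = rng.random((16, 64)), rng.random((16, 64))       # stand-in hidden activations in (0, 1)
print("sigmoid:", sparsity_penalty(H1, H2, "sigmoid"))
print("relu:   ", sparsity_penalty(H1, H2, "relu"))
print("tanh:   ", sparsity_penalty(2 * H1 - 1, 2 * H2 - 1, "tanh"))   # rescale into (-1, 1)
```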

V. CONCLUSION AND FUTURE WORK

In this paper, we propose a supervised auto-encoder, and use it to build a deep neural network architecture for extracting robust features for SSPP face representation. By introducing a similarity preservation term, our supervised auto-encoder enforces faces corresponding to the same person to be represented similarly. Experimental results on the AR, Extended Yale B, and CMU-PIE datasets demonstrate the clear superiority of this module over other conventional modules such as DAE or DLN. In view of the size of the images and training sets, we restrict the image size to be 32 x 32, and only images of a handful of subjects are used to train the network. For example, only 20 subjects are used on the CMU-PIE and AR datasets, and only 28 subjects on the Extended Yale B dataset. Obviously more training samples will improve the stability of the learnt network, and larger images will improve the face recognition accuracy [57].

REFERENCES

[1] Y. Su, S. Shan, X. Chen, and W. Gao, "Adaptive generic learning for face recognition from a single sample per person," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[2] X. Tan, S. Chen, Z.-H. Zhou, and F. Zhang, "Face recognition from a single image per person: A survey," Pattern Recognition, vol. 39, no. 9.
[3] L. Fei-Fei, R. Fergus, and P. Perona, "One-shot learning of object categories," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 4.
[4] M. Turk and A. Pentland, "Eigenfaces for recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[5] P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman, "Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 7.
[6] Q. Gao, L. Zhang, and D. Zhang, "Face recognition using FLDA with single training image per person," Applied Mathematics and Computation.
[7] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma, "Robust face recognition via sparse representation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 2.
[8] Q. V. Le, M. Ranzato, R. Monga, M. Devin, K. Chen, G. S. Corrado, J. Dean, and A. Y. Ng, "Building high-level features using large scale unsupervised learning," in Proceedings of the International Conference on Machine Learning.
[9] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proceedings of the Conference on Neural Information Processing Systems.
[10] G. E. Hinton, S. Osindero, and Y.-W. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, vol. 18, no. 7.
[11] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, "Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion," Journal of Machine Learning Research, vol. 11, no. 5.
[12] D. Erhan, Y. Bengio, A. Courville, P.-A. Manzagol, P. Vincent, and S. Bengio, "Why does unsupervised pre-training help deep learning?" Journal of Machine Learning Research, vol. 11.
[13] J. Yang, D. Zhang, A. F. Frangi, and J.-Y. Yang, "Two-dimensional PCA: A new approach to appearance-based face representation and recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 21, no. 1.
[14] J. Wu and Z.-H. Zhou, "Face recognition with one training image per person," Pattern Recognition Letters, vol. 23.
[15] S. Chen, D. Zhang, and Z.-H. Zhou, "Enhanced (PC)^2A for face recognition with one training image per person," Pattern Recognition Letters, vol. 25.
[16] A. M.
Martinez, "Recognizing imprecisely localized, partially occluded, and expression variant faces from a single sample per class," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 6.
[17] S. Shan, B. Cao, W. Gao, and D. Zhao, "Extended Fisherface for face recognition from a single example image per person," in IEEE International Symposium on Circuits and Systems.
[18] S. Chen, J. Liu, and Z.-H. Zhou, "Making FLDA applicable to face recognition with one sample per person," Pattern Recognition.
[19] T.-K. Kim and J. Kittler, "Locally linear discriminant analysis for multimodally distributed classes for face recognition with a single model image," IEEE Transactions on Pattern Analysis and Machine Intelligence.
[20] Y. Tang, R. Salakhutdinov, and G. Hinton, "Deep Lambertian networks," in Proceedings of the International Conference on Machine Learning.
[21] N. Japkowicz, S. J. Hanson, and M. A. Gluck, "Nonlinear autoassociation is not equivalent to PCA," Neural Computation, vol. 12, no. 3.
[22] S. Rifai, P. Vincent, X. Muller, X. Glorot, and Y. Bengio, "Contractive auto-encoders: Explicit invariance during feature extraction," in Proceedings of the International Conference on Machine Learning.
[23] S. Rifai, G. Mesnil, P. Vincent, X. Muller, Y. Bengio, Y. Dauphin, and X. Glorot, "Higher order contractive auto-encoder," in European Conference on Machine Learning.
[24] W. Y. Zou, A. Y. Ng, S. Zhu, and K. Yu, "Deep learning of invariant features via tracked video sequences," in Proceedings of the Conference on Neural Information Processing Systems.
[25] Q. V. Le, A. Karpenko, J. Ngiam, and A. Y. Ng, "ICA with reconstruction cost for efficient overcomplete feature learning," in Proceedings of the Conference on Neural Information Processing Systems.
[26] S. Chopra, R. Hadsell, and Y. LeCun, "Learning a similarity metric discriminatively, with application to face verification," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2005.
[27] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, "DeepFace: Closing the gap to human-level performance in face verification," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[28] Z. Zhu, P. Luo, X. Wang, and X. Tang, "Deep learning identity-preserving face space," in IEEE International Conference on Computer Vision (ICCV), 2013.
[29] Z. Zhu, P. Luo, X. Wang, and X. Tang, "Recover canonical-view faces in the wild with deep neural networks," arXiv preprint.
[30] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proceedings of the International Conference on Machine Learning, 2010.
[31] P. Lennie, "The cost of cortical computation," Current Biology, vol. 13.
[32] Y. Bengio, "Learning deep architectures for AI," Foundations and Trends in Machine Learning, vol. 2, no. 1.
[33] A. Ng, "Sparse autoencoder," CS294A Lecture Notes, Stanford University.
[34] J. Xie, L. Xu, and E. Chen, "Image denoising and inpainting with deep neural networks," in Proceedings of the Conference on Neural Information Processing Systems.
[35] R. Salakhutdinov and G. E. Hinton, "Learning a nonlinear embedding by preserving class neighbourhood structure," in Proceedings of the International Conference on Artificial Intelligence and Statistics, 2007.
[36] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in Proceedings of the International Conference on Artificial Intelligence and Statistics.
[37] Y. Bengio, "Practical recommendations for gradient-based training of deep architectures," in Neural Networks: Tricks of the Trade. Springer, 2012.
[38] Q. V. Le, J. Ngiam, A. Coates, A. Lahiri, B. Prochnow, and A. Y.
Ng, "On optimization methods for deep learning," in Proceedings of the International Conference on Machine Learning.
[39] L. Zhang and X. Feng, "Sparse representation or collaborative representation: Which helps face recognition?" in Proceedings of the International Conference on Computer Vision.
[40] L. Wolf, T. Hassner, and Y. Taigman, "The one-shot similarity kernel," in IEEE International Conference on Computer Vision, 2009.
[41] H. Fan, Z. Cao, Y. Jiang, Q. Yin, and C. Doudou, "Learning deep face representation," arXiv preprint, 2014.

[42] T. Sim, S. Baker, and M. Bsat, "The CMU Pose, Illumination, and Expression (PIE) database," in FG.
[43] A. Martinez and R. Benavente, "The AR face database," CVC Technical Report no. 24, June 1998.
[44] A. Georghiades, P. Belhumeur, and D. Kriegman, "From few to many: Illumination cone models for face recognition under variable lighting and pose," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 6.
[45] R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker, "Multi-PIE," Image and Vision Computing, vol. 28, no. 5.
[46] Y. Tang, R. Salakhutdinov, and G. Hinton, "Tensor analyzers," in Proceedings of the International Conference on Machine Learning.
[47] S. Gao, I. W.-H. Tsang, and L.-T. Chia, "Sparse representation with kernels," IEEE Transactions on Image Processing, vol. 22, no. 2.
[48] X. Glorot, A. Bordes, and Y. Bengio, "Deep sparse rectifier neural networks," in Proceedings of the International Conference on Artificial Intelligence and Statistics.
[49] G. E. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786.
[50] J. Hu, J. Lu, and Y.-P. Tan, "Discriminative deep metric learning for face verification in the wild," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014.
[51] N. Pinto, J. J. DiCarlo, and D. D. Cox, "How far can you get with a modern face recognition test set using only simple features?" in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2009.
[52] H. Li, G. Hua, Z. Lin, J. Brandt, and J. Yang, "Probabilistic elastic matching for pose variant face verification," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013.
[53] S. R. Arashloo and J. Kittler, "Efficient processing of MRFs for unconstrained-pose face recognition," in IEEE International Conference on Biometrics: Theory, Applications and Systems (BTAS), 2013.
[54] K. Simonyan, O. M. Parkhi, A. Vedaldi, and A. Zisserman, "Fisher vector faces in the wild," in British Machine Vision Conference, 2013.
[55] H. Li, G. Hua, X. Shen, Z. Lin, and J. Brandt, "Eigen-PEP for video face recognition," submitted to ECCV and under review, 2014.
[56] S. Rahimizadeh Arashloo and J. Kittler, "Class-specific kernel fusion of multiple descriptors for face verification using multiscale binarised statistical image features."
[57] M. D. Zeiler and R. Fergus, "Visualizing and understanding convolutional networks," in European Conference on Computer Vision. Springer, 2014.

Shenghua Gao is an assistant professor at ShanghaiTech University, China. He received the B.E. degree from the University of Science and Technology of China in 2008 (outstanding graduate), and received the Ph.D. degree from Nanyang Technological University. From June 2012 to August 2014, he worked as a research scientist at the Advanced Digital Sciences Center, Singapore. His research interests include computer vision and machine learning. He has published more than 20 papers on object and face recognition related topics in international conferences and journals, including IEEE T-PAMI, IJCV, IEEE TIP, IEEE TNNLS, IEEE TMM, IEEE TCSVT, CVPR, ECCV, etc. He was awarded the Microsoft Research Fellowship in 2010, and was named an ACM Shanghai Young Scientist.

Yuting Zhang is a Ph.D. candidate in the Department of Computer Science at Zhejiang University, P.R. China, advised by Gang Pan. He is currently a visiting student in the Department of Electrical Engineering and Computer Science at the University of Michigan, USA. He was also a junior research assistant in the Advanced Digital Sciences Center (Singapore), University of Illinois at Urbana-Champaign. He received his B.E.
degree in Computer Science from Zhejiang University. His research interests include machine learning and computer vision, and particularly the application of deep graphical models.

Kui Jia received the B.Eng. degree in marine engineering from Northwestern Polytechnical University, China, in 2001, the M.Eng. degree in electrical and computer engineering from the National University of Singapore in 2003, and the Ph.D. degree in computer science from Queen Mary, University of London, London, U.K. He is currently a Visiting Assistant Professor at the University of Macau, Macau SAR, China. He also holds a Research Scientist position at the Advanced Digital Sciences Center, Singapore. His research interests are in computer vision, machine learning, and image processing.

Jiwen Lu (S'10-M'11) received the B.Eng. degree in mechanical engineering and the M.Eng. degree in electrical engineering from the Xi'an University of Technology, Xi'an, China, and the Ph.D. degree in electrical engineering from Nanyang Technological University, Singapore. He is currently a Research Scientist at the Advanced Digital Sciences Center (ADSC), Singapore. His research interests include computer vision, pattern recognition, and machine learning. He has authored/co-authored over 100 scientific papers in these areas, of which 23 are published in IEEE Transactions journals (TPAMI/TIP/TIFS/TCSVT/TMM) and 12 in the top-tier computer vision conferences (ICCV/CVPR/ECCV). He serves as an Area Chair for ICME 2015 and ICB 2015, and a Special Session Chair for VCIP. Dr. Lu was a recipient of the First-Prize National Scholarship and the National Outstanding Student Award from the Ministry of Education of China in 2002 and 2003, the Best Student Paper Award from PREMIA of Singapore in 2012, and the Top 10% Best Paper Award from MMSP 2014, respectively. Recently, he has given tutorials at conferences such as CVPR 2015, FG 2015, ACCV 2014, ICME 2014, and IJCB.

Yingying Zhang received the B.E. degree from Xi'an Jiaotong University, China. He is currently a master's candidate at ShanghaiTech University, China. His research interests are computer vision and machine learning.


More information

Deep learning is a good steganalysis tool when embedding key is reused for different images, even if there is a cover source-mismatch

Deep learning is a good steganalysis tool when embedding key is reused for different images, even if there is a cover source-mismatch Deep learnng s a good steganalyss tool when embeddng key s reused for dfferent mages, even f there s a cover source-msmatch Lonel PIBRE 2,3, Jérôme PASQUET 2,3, Dno IENCO 2,3, Marc CHAUMONT 1,2,3 (1) Unversty

More information

Lecture 5: Multilayer Perceptrons

Lecture 5: Multilayer Perceptrons Lecture 5: Multlayer Perceptrons Roger Grosse 1 Introducton So far, we ve only talked about lnear models: lnear regresson and lnear bnary classfers. We noted that there are functons that can t be represented

More information

Subspace clustering. Clustering. Fundamental to all clustering techniques is the choice of distance measure between data points;

Subspace clustering. Clustering. Fundamental to all clustering techniques is the choice of distance measure between data points; Subspace clusterng Clusterng Fundamental to all clusterng technques s the choce of dstance measure between data ponts; D q ( ) ( ) 2 x x = x x, j k = 1 k jk Squared Eucldean dstance Assumpton: All features

More information

Learning a Class-Specific Dictionary for Facial Expression Recognition

Learning a Class-Specific Dictionary for Facial Expression Recognition BULGARIAN ACADEMY OF SCIENCES CYBERNETICS AND INFORMATION TECHNOLOGIES Volume 16, No 4 Sofa 016 Prnt ISSN: 1311-970; Onlne ISSN: 1314-4081 DOI: 10.1515/cat-016-0067 Learnng a Class-Specfc Dctonary for

More information

Competitive Sparse Representation Classification for Face Recognition

Competitive Sparse Representation Classification for Face Recognition Vol. 6, No. 8, 05 Compettve Sparse Representaton Classfcaton for Face Recognton Yng Lu Chongqng Key Laboratory of Computatonal Intellgence Chongqng Unversty of Posts and elecommuncatons Chongqng, Chna

More information

12/2/2009. Announcements. Parametric / Non-parametric. Case-Based Reasoning. Nearest-Neighbor on Images. Nearest-Neighbor Classification

12/2/2009. Announcements. Parametric / Non-parametric. Case-Based Reasoning. Nearest-Neighbor on Images. Nearest-Neighbor Classification Introducton to Artfcal Intellgence V22.0472-001 Fall 2009 Lecture 24: Nearest-Neghbors & Support Vector Machnes Rob Fergus Dept of Computer Scence, Courant Insttute, NYU Sldes from Danel Yeung, John DeNero

More information

Support Vector Machines

Support Vector Machines /9/207 MIST.6060 Busness Intellgence and Data Mnng What are Support Vector Machnes? Support Vector Machnes Support Vector Machnes (SVMs) are supervsed learnng technques that analyze data and recognze patterns.

More information

Smoothing Spline ANOVA for variable screening

Smoothing Spline ANOVA for variable screening Smoothng Splne ANOVA for varable screenng a useful tool for metamodels tranng and mult-objectve optmzaton L. Rcco, E. Rgon, A. Turco Outlne RSM Introducton Possble couplng Test case MOO MOO wth Game Theory

More information

MULTISPECTRAL IMAGES CLASSIFICATION BASED ON KLT AND ATR AUTOMATIC TARGET RECOGNITION

MULTISPECTRAL IMAGES CLASSIFICATION BASED ON KLT AND ATR AUTOMATIC TARGET RECOGNITION MULTISPECTRAL IMAGES CLASSIFICATION BASED ON KLT AND ATR AUTOMATIC TARGET RECOGNITION Paulo Quntlano 1 & Antono Santa-Rosa 1 Federal Polce Department, Brasla, Brazl. E-mals: quntlano.pqs@dpf.gov.br and

More information

Classifying Acoustic Transient Signals Using Artificial Intelligence

Classifying Acoustic Transient Signals Using Artificial Intelligence Classfyng Acoustc Transent Sgnals Usng Artfcal Intellgence Steve Sutton, Unversty of North Carolna At Wlmngton (suttons@charter.net) Greg Huff, Unversty of North Carolna At Wlmngton (jgh7476@uncwl.edu)

More information

Tsinghua University at TAC 2009: Summarizing Multi-documents by Information Distance

Tsinghua University at TAC 2009: Summarizing Multi-documents by Information Distance Tsnghua Unversty at TAC 2009: Summarzng Mult-documents by Informaton Dstance Chong Long, Mnle Huang, Xaoyan Zhu State Key Laboratory of Intellgent Technology and Systems, Tsnghua Natonal Laboratory for

More information

Face Recognition by Fusing Binary Edge Feature and Second-order Mutual Information

Face Recognition by Fusing Binary Edge Feature and Second-order Mutual Information Face Recognton by Fusng Bnary Edge Feature and Second-order Mutual Informaton Jatao Song, Bejng Chen, We Wang, Xaobo Ren School of Electronc and Informaton Engneerng, Nngbo Unversty of Technology Nngbo,

More information

Robust Dictionary Learning with Capped l 1 -Norm

Robust Dictionary Learning with Capped l 1 -Norm Proceedngs of the Twenty-Fourth Internatonal Jont Conference on Artfcal Intellgence (IJCAI 205) Robust Dctonary Learnng wth Capped l -Norm Wenhao Jang, Fepng Ne, Heng Huang Unversty of Texas at Arlngton

More information

A Fast Content-Based Multimedia Retrieval Technique Using Compressed Data

A Fast Content-Based Multimedia Retrieval Technique Using Compressed Data A Fast Content-Based Multmeda Retreval Technque Usng Compressed Data Borko Furht and Pornvt Saksobhavvat NSF Multmeda Laboratory Florda Atlantc Unversty, Boca Raton, Florda 3343 ABSTRACT In ths paper,

More information

Modular PCA Face Recognition Based on Weighted Average

Modular PCA Face Recognition Based on Weighted Average odern Appled Scence odular PCA Face Recognton Based on Weghted Average Chengmao Han (Correspondng author) Department of athematcs, Lny Normal Unversty Lny 76005, Chna E-mal: hanchengmao@163.com Abstract

More information

A Bilinear Model for Sparse Coding

A Bilinear Model for Sparse Coding A Blnear Model for Sparse Codng Davd B. Grmes and Rajesh P. N. Rao Department of Computer Scence and Engneerng Unversty of Washngton Seattle, WA 98195-2350, U.S.A. grmes,rao @cs.washngton.edu Abstract

More information

Tone-Aware Sparse Representation for Face Recognition

Tone-Aware Sparse Representation for Face Recognition Tone-Aware Sparse Representaton for Face Recognton Lngfeng Wang, Huayu Wu and Chunhong Pan Abstract It s stll a very challengng task to recognze a face n a real world scenaro, snce the face may be corrupted

More information

Content Based Image Retrieval Using 2-D Discrete Wavelet with Texture Feature with Different Classifiers

Content Based Image Retrieval Using 2-D Discrete Wavelet with Texture Feature with Different Classifiers IOSR Journal of Electroncs and Communcaton Engneerng (IOSR-JECE) e-issn: 78-834,p- ISSN: 78-8735.Volume 9, Issue, Ver. IV (Mar - Apr. 04), PP 0-07 Content Based Image Retreval Usng -D Dscrete Wavelet wth

More information

Detection of an Object by using Principal Component Analysis

Detection of an Object by using Principal Component Analysis Detecton of an Object by usng Prncpal Component Analyss 1. G. Nagaven, 2. Dr. T. Sreenvasulu Reddy 1. M.Tech, Department of EEE, SVUCE, Trupath, Inda. 2. Assoc. Professor, Department of ECE, SVUCE, Trupath,

More information

Parallelism for Nested Loops with Non-uniform and Flow Dependences

Parallelism for Nested Loops with Non-uniform and Flow Dependences Parallelsm for Nested Loops wth Non-unform and Flow Dependences Sam-Jn Jeong Dept. of Informaton & Communcaton Engneerng, Cheonan Unversty, 5, Anseo-dong, Cheonan, Chungnam, 330-80, Korea. seong@cheonan.ac.kr

More information

General Regression and Representation Model for Face Recognition

General Regression and Representation Model for Face Recognition 013 IEEE Conference on Computer Vson and Pattern Recognton Workshops General Regresson and Representaton Model for Face Recognton Janjun Qan, Jan Yang School of Computer Scence and Engneerng Nanjng Unversty

More information

CS 534: Computer Vision Model Fitting

CS 534: Computer Vision Model Fitting CS 534: Computer Vson Model Fttng Sprng 004 Ahmed Elgammal Dept of Computer Scence CS 534 Model Fttng - 1 Outlnes Model fttng s mportant Least-squares fttng Maxmum lkelhood estmaton MAP estmaton Robust

More information

Computer Aided Drafting, Design and Manufacturing Volume 25, Number 2, June 2015, Page 14

Computer Aided Drafting, Design and Manufacturing Volume 25, Number 2, June 2015, Page 14 Computer Aded Draftng, Desgn and Manufacturng Volume 5, Number, June 015, Page 14 CADDM Face Recognton Algorthm Fusng Monogenc Bnary Codng and Collaboratve Representaton FU Yu-xan, PENG Lang-yu College

More information

Cluster Analysis of Electrical Behavior

Cluster Analysis of Electrical Behavior Journal of Computer and Communcatons, 205, 3, 88-93 Publshed Onlne May 205 n ScRes. http://www.scrp.org/ournal/cc http://dx.do.org/0.4236/cc.205.350 Cluster Analyss of Electrcal Behavor Ln Lu Ln Lu, School

More information

Robust Kernel Representation with Statistical Local Features. for Face Recognition

Robust Kernel Representation with Statistical Local Features. for Face Recognition Robust Kernel Representaton wth Statstcal Local Features for Face Recognton Meng Yang, Student Member, IEEE, Le Zhang 1, Member, IEEE Smon C. K. Shu, Member, IEEE, and Davd Zhang, Fellow, IEEE Dept. of

More information

Problem Definitions and Evaluation Criteria for Computational Expensive Optimization

Problem Definitions and Evaluation Criteria for Computational Expensive Optimization Problem efntons and Evaluaton Crtera for Computatonal Expensve Optmzaton B. Lu 1, Q. Chen and Q. Zhang 3, J. J. Lang 4, P. N. Suganthan, B. Y. Qu 6 1 epartment of Computng, Glyndwr Unversty, UK Faclty

More information

Integrated Expression-Invariant Face Recognition with Constrained Optical Flow

Integrated Expression-Invariant Face Recognition with Constrained Optical Flow Integrated Expresson-Invarant Face Recognton wth Constraned Optcal Flow Chao-Kue Hseh, Shang-Hong La 2, and Yung-Chang Chen Department of Electrcal Engneerng, Natonal Tsng Hua Unversty, Tawan 2 Department

More information

EYE CENTER LOCALIZATION ON A FACIAL IMAGE BASED ON MULTI-BLOCK LOCAL BINARY PATTERNS

EYE CENTER LOCALIZATION ON A FACIAL IMAGE BASED ON MULTI-BLOCK LOCAL BINARY PATTERNS P.G. Demdov Yaroslavl State Unversty Anatoly Ntn, Vladmr Khryashchev, Olga Stepanova, Igor Kostern EYE CENTER LOCALIZATION ON A FACIAL IMAGE BASED ON MULTI-BLOCK LOCAL BINARY PATTERNS Yaroslavl, 2015 Eye

More information

Outline. Discriminative classifiers for image recognition. Where in the World? A nearest neighbor recognition example 4/14/2011. CS 376 Lecture 22 1

Outline. Discriminative classifiers for image recognition. Where in the World? A nearest neighbor recognition example 4/14/2011. CS 376 Lecture 22 1 4/14/011 Outlne Dscrmnatve classfers for mage recognton Wednesday, Aprl 13 Krsten Grauman UT-Austn Last tme: wndow-based generc obect detecton basc ppelne face detecton wth boostng as case study Today:

More information

Classification / Regression Support Vector Machines

Classification / Regression Support Vector Machines Classfcaton / Regresson Support Vector Machnes Jeff Howbert Introducton to Machne Learnng Wnter 04 Topcs SVM classfers for lnearly separable classes SVM classfers for non-lnearly separable classes SVM

More information

Kernel Collaborative Representation Classification Based on Adaptive Dictionary Learning

Kernel Collaborative Representation Classification Based on Adaptive Dictionary Learning Internatonal Journal of Intellgent Informaton Systems 2018; 7(2): 15-22 http://www.scencepublshnggroup.com/j/js do: 10.11648/j.js.20180702.11 ISSN: 2328-7675 (Prnt); ISSN: 2328-7683 (Onlne) Kernel Collaboratve

More information

Research of Image Recognition Algorithm Based on Depth Learning

Research of Image Recognition Algorithm Based on Depth Learning 208 4th World Conference on Control, Electroncs and Computer Engneerng (WCCECE 208) Research of Image Recognton Algorthm Based on Depth Learnng Zhang Jan, J Xnhao Zhejang Busness College, Hangzhou, Chna,

More information

Support Vector Machines

Support Vector Machines Support Vector Machnes Decson surface s a hyperplane (lne n 2D) n feature space (smlar to the Perceptron) Arguably, the most mportant recent dscovery n machne learnng In a nutshell: map the data to a predetermned

More information

Improvement of Spatial Resolution Using BlockMatching Based Motion Estimation and Frame. Integration

Improvement of Spatial Resolution Using BlockMatching Based Motion Estimation and Frame. Integration Improvement of Spatal Resoluton Usng BlockMatchng Based Moton Estmaton and Frame Integraton Danya Suga and Takayuk Hamamoto Graduate School of Engneerng, Tokyo Unversty of Scence, 6-3-1, Nuku, Katsuska-ku,

More information

A Novel Adaptive Descriptor Algorithm for Ternary Pattern Textures

A Novel Adaptive Descriptor Algorithm for Ternary Pattern Textures A Novel Adaptve Descrptor Algorthm for Ternary Pattern Textures Fahuan Hu 1,2, Guopng Lu 1 *, Zengwen Dong 1 1.School of Mechancal & Electrcal Engneerng, Nanchang Unversty, Nanchang, 330031, Chna; 2. School

More information

Classification of Face Images Based on Gender using Dimensionality Reduction Techniques and SVM

Classification of Face Images Based on Gender using Dimensionality Reduction Techniques and SVM Classfcaton of Face Images Based on Gender usng Dmensonalty Reducton Technques and SVM Fahm Mannan 260 266 294 School of Computer Scence McGll Unversty Abstract Ths report presents gender classfcaton based

More information

Research Article A High-Order CFS Algorithm for Clustering Big Data

Research Article A High-Order CFS Algorithm for Clustering Big Data Moble Informaton Systems Volume 26, Artcle ID 435627, 8 pages http://dx.do.org/.55/26/435627 Research Artcle A Hgh-Order Algorthm for Clusterng Bg Data Fanyu Bu,,2 Zhku Chen, Peng L, Tong Tang, 3 andyngzhang

More information

Face Detection with Deep Learning

Face Detection with Deep Learning Face Detecton wth Deep Learnng Yu Shen Yus122@ucsd.edu A13227146 Kuan-We Chen kuc010@ucsd.edu A99045121 Yzhou Hao y3hao@ucsd.edu A98017773 Mn Hsuan Wu mhwu@ucsd.edu A92424998 Abstract The project here

More information

Edge Detection in Noisy Images Using the Support Vector Machines

Edge Detection in Noisy Images Using the Support Vector Machines Edge Detecton n Nosy Images Usng the Support Vector Machnes Hlaro Gómez-Moreno, Saturnno Maldonado-Bascón, Francsco López-Ferreras Sgnal Theory and Communcatons Department. Unversty of Alcalá Crta. Madrd-Barcelona

More information

FEATURE EXTRACTION. Dr. K.Vijayarekha. Associate Dean School of Electrical and Electronics Engineering SASTRA University, Thanjavur

FEATURE EXTRACTION. Dr. K.Vijayarekha. Associate Dean School of Electrical and Electronics Engineering SASTRA University, Thanjavur FEATURE EXTRACTION Dr. K.Vjayarekha Assocate Dean School of Electrcal and Electroncs Engneerng SASTRA Unversty, Thanjavur613 41 Jont Intatve of IITs and IISc Funded by MHRD Page 1 of 8 Table of Contents

More information

A PATTERN RECOGNITION APPROACH TO IMAGE SEGMENTATION

A PATTERN RECOGNITION APPROACH TO IMAGE SEGMENTATION 1 THE PUBLISHING HOUSE PROCEEDINGS OF THE ROMANIAN ACADEMY, Seres A, OF THE ROMANIAN ACADEMY Volume 4, Number 2/2003, pp.000-000 A PATTERN RECOGNITION APPROACH TO IMAGE SEGMENTATION Tudor BARBU Insttute

More information

Term Weighting Classification System Using the Chi-square Statistic for the Classification Subtask at NTCIR-6 Patent Retrieval Task

Term Weighting Classification System Using the Chi-square Statistic for the Classification Subtask at NTCIR-6 Patent Retrieval Task Proceedngs of NTCIR-6 Workshop Meetng, May 15-18, 2007, Tokyo, Japan Term Weghtng Classfcaton System Usng the Ch-square Statstc for the Classfcaton Subtask at NTCIR-6 Patent Retreval Task Kotaro Hashmoto

More information

Outline. Type of Machine Learning. Examples of Application. Unsupervised Learning

Outline. Type of Machine Learning. Examples of Application. Unsupervised Learning Outlne Artfcal Intellgence and ts applcatons Lecture 8 Unsupervsed Learnng Professor Danel Yeung danyeung@eee.org Dr. Patrck Chan patrckchan@eee.org South Chna Unversty of Technology, Chna Introducton

More information

Helsinki University Of Technology, Systems Analysis Laboratory Mat Independent research projects in applied mathematics (3 cr)

Helsinki University Of Technology, Systems Analysis Laboratory Mat Independent research projects in applied mathematics (3 cr) Helsnk Unversty Of Technology, Systems Analyss Laboratory Mat-2.08 Independent research projects n appled mathematcs (3 cr) "! #$&% Antt Laukkanen 506 R ajlaukka@cc.hut.f 2 Introducton...3 2 Multattrbute

More information

Determining the Optimal Bandwidth Based on Multi-criterion Fusion

Determining the Optimal Bandwidth Based on Multi-criterion Fusion Proceedngs of 01 4th Internatonal Conference on Machne Learnng and Computng IPCSIT vol. 5 (01) (01) IACSIT Press, Sngapore Determnng the Optmal Bandwdth Based on Mult-crteron Fuson Ha-L Lang 1+, Xan-Mn

More information

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY 1. SSDH: Semi-supervised Deep Hashing for Large Scale Image Retrieval

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY 1. SSDH: Semi-supervised Deep Hashing for Large Scale Image Retrieval IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY SSDH: Sem-supervsed Deep Hashng for Large Scale Image Retreval Jan Zhang, and Yuxn Peng arxv:607.08477v2 [cs.cv] 8 Jun 207 Abstract Hashng

More information

Classifier Selection Based on Data Complexity Measures *

Classifier Selection Based on Data Complexity Measures * Classfer Selecton Based on Data Complexty Measures * Edth Hernández-Reyes, J.A. Carrasco-Ochoa, and J.Fco. Martínez-Trndad Natonal Insttute for Astrophyscs, Optcs and Electroncs, Lus Enrque Erro No.1 Sta.

More information

The Discriminate Analysis and Dimension Reduction Methods of High Dimension

The Discriminate Analysis and Dimension Reduction Methods of High Dimension Open Journal of Socal Scences, 015, 3, 7-13 Publshed Onlne March 015 n ScRes. http://www.scrp.org/journal/jss http://dx.do.org/10.436/jss.015.3300 The Dscrmnate Analyss and Dmenson Reducton Methods of

More information

WIRELESS CAPSULE ENDOSCOPY IMAGE CLASSIFICATION BASED ON VECTOR SPARSE CODING.

WIRELESS CAPSULE ENDOSCOPY IMAGE CLASSIFICATION BASED ON VECTOR SPARSE CODING. WIRELESS CAPSULE ENDOSCOPY IMAGE CLASSIFICATION BASED ON VECTOR SPARSE CODING Tao Ma 1, Yuexan Zou 1 *, Zhqang Xang 1, Le L 1 and Y L 1 ADSPLAB/ELIP, School of ECE, Pekng Unversty, Shenzhen 518055, Chna

More information

Infrared face recognition using texture descriptors

Infrared face recognition using texture descriptors Infrared face recognton usng texture descrptors Moulay A. Akhlouf*, Abdelhakm Bendada Computer Vson and Systems Laboratory, Laval Unversty, Quebec, QC, Canada G1V0A6 ABSTRACT Face recognton s an area of

More information

Human Face Recognition Using Generalized. Kernel Fisher Discriminant

Human Face Recognition Using Generalized. Kernel Fisher Discriminant Human Face Recognton Usng Generalzed Kernel Fsher Dscrmnant ng-yu Sun,2 De-Shuang Huang Ln Guo. Insttute of Intellgent Machnes, Chnese Academy of Scences, P.O.ox 30, Hefe, Anhu, Chna. 2. Department of

More information

Combination of Local Multiple Patterns and Exponential Discriminant Analysis for Facial Recognition

Combination of Local Multiple Patterns and Exponential Discriminant Analysis for Facial Recognition Sensors & ransducers 203 by IFSA http://.sensorsportal.com Combnaton of Local Multple Patterns and Exponental Dscrmnant Analyss for Facal Recognton, 2 Lfang Zhou, 2 Bn Fang, 3 Wesheng L, 3 Ldou Wang College

More information

Machine Learning: Algorithms and Applications

Machine Learning: Algorithms and Applications 14/05/1 Machne Learnng: Algorthms and Applcatons Florano Zn Free Unversty of Bozen-Bolzano Faculty of Computer Scence Academc Year 011-01 Lecture 10: 14 May 01 Unsupervsed Learnng cont Sldes courtesy of

More information

Feature Extraction Based on Maximum Nearest Subspace Margin Criterion

Feature Extraction Based on Maximum Nearest Subspace Margin Criterion Neural Process Lett DOI 10.7/s11063-012-9252-y Feature Extracton Based on Maxmum Nearest Subspace Margn Crteron Y Chen Zhenzhen L Zhong Jn Sprnger Scence+Busness Meda New York 2012 Abstract Based on the

More information

Comparing Image Representations for Training a Convolutional Neural Network to Classify Gender

Comparing Image Representations for Training a Convolutional Neural Network to Classify Gender 2013 Frst Internatonal Conference on Artfcal Intellgence, Modellng & Smulaton Comparng Image Representatons for Tranng a Convolutonal Neural Network to Classfy Gender Choon-Boon Ng, Yong-Haur Tay, Bok-Mn

More information

Transformation Networks for Target-Oriented Sentiment Classification ACL / 25

Transformation Networks for Target-Oriented Sentiment Classification ACL / 25 Transformaton Networks for Target-Orented Sentment Classfcaton 1 Xn L 1, Ldong Bng 2, Wa Lam 1, Be Sh 1 1 The Chnese Unversty of Hong Kong 2 Tencent AI Lab ACL 2018 1 Jont work wth Tencent AI Lab Transformaton

More information

Fuzzy Filtering Algorithms for Image Processing: Performance Evaluation of Various Approaches

Fuzzy Filtering Algorithms for Image Processing: Performance Evaluation of Various Approaches Proceedngs of the Internatonal Conference on Cognton and Recognton Fuzzy Flterng Algorthms for Image Processng: Performance Evaluaton of Varous Approaches Rajoo Pandey and Umesh Ghanekar Department of

More information

Lecture 4: Principal components

Lecture 4: Principal components /3/6 Lecture 4: Prncpal components 3..6 Multvarate lnear regresson MLR s optmal for the estmaton data...but poor for handlng collnear data Covarance matrx s not nvertble (large condton number) Robustness

More information

2. Related Work Hand-crafted Features Based Trajectory Prediction Deep Neural Networks Based Trajectory Prediction

2. Related Work Hand-crafted Features Based Trajectory Prediction Deep Neural Networks Based Trajectory Prediction Encodng Crowd Interacton wth Deep Neural Network for Pedestran Trajectory Predcton Yanyu Xu ShanghaTech Unversty xuyy2@shanghatech.edu.cn Zhxn Pao ShanghaTech Unversty paozhx@shanghatech.edu.cn Shenghua

More information

Solving two-person zero-sum game by Matlab

Solving two-person zero-sum game by Matlab Appled Mechancs and Materals Onlne: 2011-02-02 ISSN: 1662-7482, Vols. 50-51, pp 262-265 do:10.4028/www.scentfc.net/amm.50-51.262 2011 Trans Tech Publcatons, Swtzerland Solvng two-person zero-sum game by

More information

Comparison Study of Textural Descriptors for Training Neural Network Classifiers

Comparison Study of Textural Descriptors for Training Neural Network Classifiers Comparson Study of Textural Descrptors for Tranng Neural Network Classfers G.D. MAGOULAS (1) S.A. KARKANIS (1) D.A. KARRAS () and M.N. VRAHATIS (3) (1) Department of Informatcs Unversty of Athens GR-157.84

More information

A Hybrid Collaborative Filtering Model with Deep Structure for Recommender Systems

A Hybrid Collaborative Filtering Model with Deep Structure for Recommender Systems Proceedngs of the Thrty-Frst AAAI Conference on Artfcal Intellgence (AAAI-17) A Hybrd Collaboratve Flterng Model wth Deep Structure for Recommender Systems Xn Dong, Le Yu, Zhonghuo Wu, Yuxa Sun, Lngfeng

More information

Outline. Self-Organizing Maps (SOM) US Hebbian Learning, Cntd. The learning rule is Hebbian like:

Outline. Self-Organizing Maps (SOM) US Hebbian Learning, Cntd. The learning rule is Hebbian like: Self-Organzng Maps (SOM) Turgay İBRİKÇİ, PhD. Outlne Introducton Structures of SOM SOM Archtecture Neghborhoods SOM Algorthm Examples Summary 1 2 Unsupervsed Hebban Learnng US Hebban Learnng, Cntd 3 A

More information

Gender Classification using Interlaced Derivative Patterns

Gender Classification using Interlaced Derivative Patterns Gender Classfcaton usng Interlaced Dervatve Patterns Author Shobernejad, Ameneh, Gao, Yongsheng Publshed 2 Conference Ttle Proceedngs of the 2th Internatonal Conference on Pattern Recognton (ICPR 2) DOI

More information

Two-Dimensional Supervised Discriminant Projection Method For Feature Extraction

Two-Dimensional Supervised Discriminant Projection Method For Feature Extraction Appl. Math. Inf. c. 6 No. pp. 8-85 (0) Appled Mathematcs & Informaton cences An Internatonal Journal @ 0 NP Natural cences Publshng Cor. wo-dmensonal upervsed Dscrmnant Proecton Method For Feature Extracton

More information

Mathematics 256 a course in differential equations for engineering students

Mathematics 256 a course in differential equations for engineering students Mathematcs 56 a course n dfferental equatons for engneerng students Chapter 5. More effcent methods of numercal soluton Euler s method s qute neffcent. Because the error s essentally proportonal to the

More information

SVM-based Learning for Multiple Model Estimation

SVM-based Learning for Multiple Model Estimation SVM-based Learnng for Multple Model Estmaton Vladmr Cherkassky and Yunqan Ma Department of Electrcal and Computer Engneerng Unversty of Mnnesota Mnneapols, MN 55455 {cherkass,myq}@ece.umn.edu Abstract:

More information

The Greedy Method. Outline and Reading. Change Money Problem. Greedy Algorithms. Applications of the Greedy Strategy. The Greedy Method Technique

The Greedy Method. Outline and Reading. Change Money Problem. Greedy Algorithms. Applications of the Greedy Strategy. The Greedy Method Technique //00 :0 AM Outlne and Readng The Greedy Method The Greedy Method Technque (secton.) Fractonal Knapsack Problem (secton..) Task Schedulng (secton..) Mnmum Spannng Trees (secton.) Change Money Problem Greedy

More information

Scale Selective Extended Local Binary Pattern For Texture Classification

Scale Selective Extended Local Binary Pattern For Texture Classification Scale Selectve Extended Local Bnary Pattern For Texture Classfcaton Yutng Hu, Zhlng Long, and Ghassan AlRegb Multmeda & Sensors Lab (MSL) Georga Insttute of Technology 03/09/017 Outlne Texture Representaton

More information

An Entropy-Based Approach to Integrated Information Needs Assessment

An Entropy-Based Approach to Integrated Information Needs Assessment Dstrbuton Statement A: Approved for publc release; dstrbuton s unlmted. An Entropy-Based Approach to ntegrated nformaton Needs Assessment June 8, 2004 Wllam J. Farrell Lockheed Martn Advanced Technology

More information

(a) Input data X n. (b) VersNet. (c) Output data Y n. (d) Supervsed data D n. Fg. 2 Illustraton of tranng for proposed CNN. 2. Related Work In segment

(a) Input data X n. (b) VersNet. (c) Output data Y n. (d) Supervsed data D n. Fg. 2 Illustraton of tranng for proposed CNN. 2. Related Work In segment 一般社団法人電子情報通信学会 THE INSTITUTE OF ELECTRONICS, INFORMATION AND COMMUNICATION ENGINEERS 信学技報 IEICE Techncal Report SANE2017-92 (2018-01) Deep Learnng for End-to-End Automatc Target Recognton from Synthetc

More information

PERFORMANCE EVALUATION FOR SCENE MATCHING ALGORITHMS BY SVM

PERFORMANCE EVALUATION FOR SCENE MATCHING ALGORITHMS BY SVM PERFORMACE EVALUAIO FOR SCEE MACHIG ALGORIHMS BY SVM Zhaohu Yang a, b, *, Yngyng Chen a, Shaomng Zhang a a he Research Center of Remote Sensng and Geomatc, ongj Unversty, Shangha 200092, Chna - yzhac@63.com

More information

Understanding the difficulty of training deep feedforward neural networks

Understanding the difficulty of training deep feedforward neural networks Understandng the dffculty of tranng deep feedforward neural networks Xaver Glorot Yoshua Bengo DIRO, Unversté de Montréal, Montréal, Québec, Canada Abstract Whereas before 2006 t appears that deep multlayer

More information

An Image Fusion Approach Based on Segmentation Region

An Image Fusion Approach Based on Segmentation Region Rong Wang, L-Qun Gao, Shu Yang, Yu-Hua Cha, and Yan-Chun Lu An Image Fuson Approach Based On Segmentaton Regon An Image Fuson Approach Based on Segmentaton Regon Rong Wang, L-Qun Gao, Shu Yang 3, Yu-Hua

More information

Face Recognition using 3D Directional Corner Points

Face Recognition using 3D Directional Corner Points 2014 22nd Internatonal Conference on Pattern Recognton Face Recognton usng 3D Drectonal Corner Ponts Xun Yu, Yongsheng Gao School of Engneerng Grffth Unversty Nathan, QLD, Australa xun.yu@grffthun.edu.au,

More information

Robust Low-Rank Regularized Regression for Face Recognition with Occlusion

Robust Low-Rank Regularized Regression for Face Recognition with Occlusion Robust Low-Rank Regularzed Regresson for ace Recognton wth Occluson Janjun Qan, Jan Yang, anlong Zhang and Zhouchen Ln School of Computer Scence and ngneerng, Nanjng Unversty of Scence and echnology Key

More information

Unsupervised Learning

Unsupervised Learning Pattern Recognton Lecture 8 Outlne Introducton Unsupervsed Learnng Parametrc VS Non-Parametrc Approach Mxture of Denstes Maxmum-Lkelhood Estmates Clusterng Prof. Danel Yeung School of Computer Scence and

More information

Dropout: A Simple Way to Prevent Neural Networks from Overfitting

Dropout: A Simple Way to Prevent Neural Networks from Overfitting Journal of Machne Learnng Research 15 (2014) 1929-1958 Submtted 11/13; Publshed 6/14 Dropout: A Smple Way to Prevent Neural Networks from Overfttng Ntsh Srvastava Geoffrey Hnton Alex Krzhevsky Ilya Sutskever

More information

ALEXNET FEATURE EXTRACTION AND MULTI-KERNEL LEARNING FOR OBJECT- ORIENTED CLASSIFICATION

ALEXNET FEATURE EXTRACTION AND MULTI-KERNEL LEARNING FOR OBJECT- ORIENTED CLASSIFICATION ALEXNET FEATURE EXTRACTION AND MULTI-KERNEL LEARNING FOR OBJECT- ORIENTED CLASSIFICATION Lng Dng 1, Hongy L 2, *, Changmao Hu 2, We Zhang 2, Shumn Wang 1 1 Insttute of Earthquake Forecastng, Chna Earthquake

More information

S1 Note. Basis functions.

S1 Note. Basis functions. S1 Note. Bass functons. Contents Types of bass functons...1 The Fourer bass...2 B-splne bass...3 Power and type I error rates wth dfferent numbers of bass functons...4 Table S1. Smulaton results of type

More information

Skew Angle Estimation and Correction of Hand Written, Textual and Large areas of Non-Textual Document Images: A Novel Approach

Skew Angle Estimation and Correction of Hand Written, Textual and Large areas of Non-Textual Document Images: A Novel Approach Angle Estmaton and Correcton of Hand Wrtten, Textual and Large areas of Non-Textual Document Images: A Novel Approach D.R.Ramesh Babu Pyush M Kumat Mahesh D Dhannawat PES Insttute of Technology Research

More information

Hyperspectral Image Classification Based on Local Binary Patterns and PCANet

Hyperspectral Image Classification Based on Local Binary Patterns and PCANet Hyperspectral Image Classfcaton Based on Local Bnary Patterns and PCANet Huzhen Yang a, Feng Gao a, Junyu Dong a, Yang Yang b a Ocean Unversty of Chna, Department of Computer Scence and Technology b Ocean

More information

Review of approximation techniques

Review of approximation techniques CHAPTER 2 Revew of appromaton technques 2. Introducton Optmzaton problems n engneerng desgn are characterzed by the followng assocated features: the objectve functon and constrants are mplct functons evaluated

More information

Backpropagation: In Search of Performance Parameters

Backpropagation: In Search of Performance Parameters Bacpropagaton: In Search of Performance Parameters ANIL KUMAR ENUMULAPALLY, LINGGUO BU, and KHOSROW KAIKHAH, Ph.D. Computer Scence Department Texas State Unversty-San Marcos San Marcos, TX-78666 USA ae049@txstate.edu,

More information

GSLM Operations Research II Fall 13/14

GSLM Operations Research II Fall 13/14 GSLM 58 Operatons Research II Fall /4 6. Separable Programmng Consder a general NLP mn f(x) s.t. g j (x) b j j =. m. Defnton 6.. The NLP s a separable program f ts objectve functon and all constrants are

More information

A Gradient Difference based Technique for Video Text Detection

A Gradient Difference based Technique for Video Text Detection A Gradent Dfference based Technque for Vdeo Text Detecton Palaahnakote Shvakumara, Trung Quy Phan and Chew Lm Tan School of Computng, Natonal Unversty of Sngapore {shva, phanquyt, tancl }@comp.nus.edu.sg

More information

BAYESIAN MULTI-SOURCE DOMAIN ADAPTATION

BAYESIAN MULTI-SOURCE DOMAIN ADAPTATION BAYESIAN MULTI-SOURCE DOMAIN ADAPTATION SHI-LIANG SUN, HONG-LEI SHI Department of Computer Scence and Technology, East Chna Normal Unversty 500 Dongchuan Road, Shangha 200241, P. R. Chna E-MAIL: slsun@cs.ecnu.edu.cn,

More information

User Authentication Based On Behavioral Mouse Dynamics Biometrics

User Authentication Based On Behavioral Mouse Dynamics Biometrics User Authentcaton Based On Behavoral Mouse Dynamcs Bometrcs Chee-Hyung Yoon Danel Donghyun Km Department of Computer Scence Department of Computer Scence Stanford Unversty Stanford Unversty Stanford, CA

More information