
IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 24, NO. 1, JANUARY 2015

Discriminative Shared Gaussian Processes for Multiview and View-Invariant Facial Expression Recognition

Stefanos Eleftheriadis, Student Member, IEEE, Ognjen Rudovic, Member, IEEE, and Maja Pantic, Fellow, IEEE

Abstract: Images of facial expressions are often captured from various views as a result of either head movements or variable camera position. Existing methods for multiview and/or view-invariant facial expression recognition typically perform classification of the observed expression using either classifiers learned separately for each view or a single classifier learned for all views. However, these approaches ignore the fact that different views of a facial expression are just different manifestations of the same facial expression. By accounting for this redundancy, we can design more effective classifiers for the target task. To this end, we propose a discriminative shared Gaussian process latent variable model (DS-GPLVM) for multiview and view-invariant classification of facial expressions from multiple views. In this model, we first learn a discriminative manifold shared by multiple views of a facial expression. Subsequently, we perform facial expression classification in the expression manifold. Finally, classification of an observed facial expression is carried out either in the view-invariant manner (using only a single view of the expression) or in the multiview manner (using multiple views of the expression). The proposed model can also be used to perform fusion of different facial features in a principled manner. We validate the proposed DS-GPLVM on both posed and spontaneously displayed facial expressions from three publicly available datasets (MultiPIE, Labeled Face Parts in the Wild, and Static Facial Expressions in the Wild). We show that this model outperforms the state-of-the-art methods for multiview and view-invariant facial expression classification, and several state-of-the-art methods for multiview learning and feature fusion.

Index Terms: View-invariant, multi-view learning, facial expression recognition, Gaussian Processes.

I. INTRODUCTION

FACIAL expression recognition (FER) has attracted significant research attention because of its usefulness in many applications, such as human-computer interaction, security and analysis of social interactions, among others [1], [2]. Most existing methods deal with imagery in which the depicted persons are relatively still and exhibit posed expressions in a nearly frontal pose [3]. However, many real-world applications relate to spontaneous interactions (e.g., meeting summarization, political debates analysis, etc.), in which people tend to move their head while being recorded.

(Manuscript received February 27, 2014; revised August 19, 2014 and November 4, 2014; accepted November 8, 2014. Date of publication November 26, 2014; date of current version December 12, 2014. This work was supported by the European Commission within the 7th Framework Programme [FP7/2007-2013] under Grant through the TERESA Project. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Jong Chul Ye. S. Eleftheriadis and O. Rudovic are with the Department of Computing, Imperial College London, London SW7 2AZ, U.K. (e-mail: s.eleftheriadis@imperial.ac.uk; o.rudovic@imperial.ac.uk). M. Pantic is with the Department of Computing, Imperial College London, London SW7 2AZ, U.K., and also with the Faculty of Electrical Engineering, Mathematics and Computer Science, University of Twente, Enschede 7522 NB, The Netherlands (e-mail: m.pantic@imperial.ac.uk).)
Furthermore, depending on the camera position, facial images can be taken from multiple views. For these reasons, there is an ever growing need for automated systems that can accurately perform multiview and view-invariant facial expression recognition. The main challenge here is to perform decoupling of the rigid facial changes due to the head-pose and the non-rigid facial changes due to the expression, as they are non-linearly coupled in 2D images [4]. Another challenge is how to effectively exploit the information from multiple views (or different facial features) in order to facilitate the expression classification. Thus, accounting for the fact that each view of a facial expression is just a different manifestation of the same underlying facial-expression-related content is expected to result in more effective classifiers for the target task.

To date, only a few works that deal with multi-view and/or view-invariant FER have been proposed. These focus mainly on recognition of facial expressions of the six basic emotions [5]. Based on how they deal with variation in head-pose (view) and expressions in 2D images, they can be divided into: (i) methods that perform view-invariant, i.e., per-view, FER [6]-[8], (ii) methods that perform the view normalization before performing FER [9], [10], and (iii) methods that learn a single classifier using data from multiple views [11], [12]. However, the main downside of these approaches is that they fail to explicitly model relationships between different views. This, in turn, results in classifiers that are less robust for the target task, but also more complex in the case of a large number of views/expressions. All this can efficiently be ameliorated using the modeling strategy of multi-view learning methods (see [13], [14]).

In this work, we introduce the Discriminative Shared Gaussian Process Latent Variable Model (DS-GPLVM) for multi-view and view-invariant FER. We adopt the multi-view learning strategy in order to represent multi-view facial expression data on a common expression manifold. To this end, we use the notion of Shared GPs [15], [16], the generative framework for discovering a non-linear subspace shared across different observation spaces (e.g., the facial views or feature representations). Since our ultimate goal is the expression classification, we place a discriminative prior, informed by the expression labels, over the manifold.

Fig. 1. The overview of the proposed DS-GPLVM. The discriminative shared manifold X of facial expressions captured at different views (Y_v, v = 1 ... V) is learned using the framework of shared GPs. The class separation in the shared manifold is enforced by the discriminative shared prior p(X), informed by the data labels. During inference, the facial images from different views are projected onto the shared manifold by using the kernel-based regression, learned for each view separately (g(Y_v)) for the view-invariant approach, or simultaneously from multiple views for the multi-view approach. The classification of the query image is then performed using the kNN classifier.

The classification of an observed expression is then performed in the learned manifold using the kNN classifier. The proposed model is a generalization of the discriminative GP Latent Variable Models (D-GPLVM) [17] for non-linear dimensionality reduction and classification of data from a single observation space. The learning of DS-GPLVM is carried out using the expression data from multiple views. Classification of an observed facial expression, however, can be carried out either in the view-invariant manner (in case only a single view of the observed expression is available at runtime) or in the multi-view manner (in case multiple views of the observed expression are available at runtime). The proposed model can also perform fusion of different facial features in order to improve view-invariant facial expression classification. In order to keep the model computationally tractable in the presence of a large number of views, we propose a learning algorithm that splits the learning into different sub-problems (one for each view), and then employs the Alternating Direction Method (ADM) [18] to optimize each sub-problem separately. The outline of the proposed approach is given in Fig. 1.

The contributions of this work can be summarized as follows.
1) We propose the DS-GPLVM for multi-view and/or view-invariant FER. The proposed model is a generalization of existing discriminative dimensionality reduction methods from single to multiple observation spaces. This is, also, the first approach that exploits the multi-view learning strategy in the context of multi-view FER.
2) We propose a novel learning algorithm for efficient optimization of the model parameters that is based on the ADM strategy. This allows us to solve the model parameter optimization problem for each view as a separate sub-problem, resulting in the model being computationally efficient even in the case of a large number of views.
3) The proposed DS-GPLVM is applicable to a variety of tasks (multi-view classification, multiple-feature fusion, pose-wise classification, etc.). Compared to state-of-the-art methods for multi-view learning, which employ linear techniques to align different views on a manifold, the DS-GPLVM is a kernel-based method, being able to discover non-linear correlations between different views. In contrast to state-of-the-art methods for view-invariant and/or multi-view FER, the DS-GPLVM exploits dependencies between different views, improving the FER performance.

Note that an earlier version of this work appeared in [19]. There are two major extensions introduced: 1) in [19], the projections of data from different views to the shared space are learned independently of the manifold, while in the DS-GPLVM proposed here they are learned simultaneously. We show in our experiments that this results in improved recognition of the target facial expressions. 2) Our previous work in [19] is capable only of view-invariant FER, while here we generalize it to the multi-view and feature fusion settings.
Finally, we use GPs as the basis for our (non-parametric) multi-view learning framework because, in contrast to the majority of parametric models, they allow us to capture subtle details of facial expressions and preserve them on an expression manifold that is largely robust to view/subject differences. Furthermore, due to the probabilistic nature of GPs, different types of priors can seamlessly be integrated into the model for multi-view learning (in our case, discriminative priors over the expression manifold). Last but not least, GPs are known for their ability to generalize quite well even from a small number of training data (on the order of several hundreds) [17]. While this may not seem a big advantage when data are abundant, it is of crucial importance for multi-view FER due to the scarcity of existing datasets containing annotated expressions and poses.

The remainder of the paper is organized as follows. Section II gives an overview of the related work. In Section III we present the theoretical background of the base GPLVM and the D-GPLVM. In Section IV, we introduce the proposed Discriminative Shared Gaussian Process Latent Variable Model for multi-view FER. Section V describes the conducted experiments and shows the results obtained. Finally, in Section VI we conclude the paper.

II. RELATED WORK

A. Multi-View and View-Invariant FER

As mentioned above, recent advances toward multi-view facial expression recognition can be divided into three groups. A representative of the first group is [6], where the authors used Local Binary Patterns (LBP) [20] (and its variants) to perform a two-step facial expression classification. In the first step, they select the closest head-pose to the (discrete) training pose/view by using the Support Vector Machine (SVM) [21] classifier. Once the view is known, they apply the view-specific SVM to perform facial-expression classification. In [7], different appearance features, e.g., Scale Invariant Feature Transform (SIFT) [22], Histogram of Oriented Gradients (HOG) [23], and LBP, are extracted around the locations of characteristic facial points, and used to train various pose-specific classifiers.

Similarly, [8] used per-view-trained 2D Active Appearance Models (AAMs) [24] to locate a set of characteristic facial points, and extract LBP, SIFT and Discrete Cosine Transform (DCT) [25] features around them. By learning separate classifiers for each view, these approaches ignore correlations across different views, which makes them suboptimal for the target task. As shown by [6] and [7], classification of some facial expressions can be performed better in the 15° view than in the frontal view, for instance. Hence, the data from more discriminative views for expression classification can be used during learning to improve the underperforming expression classification in the other views. In the proposed DS-GPLVM, we do so by performing the classification in a discriminative feature space shared across views.

The approaches in the second group [9], [10] first perform view normalization, and then apply facial expression classification in the canonical view, usually chosen to be the frontal one. For the view normalization, the authors propose the Coupled GP (CGP) regression model that exploits pairwise correlations between the views in order to learn robust mappings for projecting facial features (i.e., a set of facial points) from non-frontal views to the frontal view. A limitation of this approach is that the view normalization and the learning of the expression classifier are done independently, thus bounding the accuracy of the expression classification by that of the view normalization. Also, since the view normalization is performed directly in the observed space, errors in the view normalization step can adversely affect the classification. This is even more so due to the high-dimensional noise affecting the view-normalized features. Furthermore, the canonical view has to be selected in advance. This can further limit the accuracy of the expression classification, as such a view may not be the most discriminative for classification of certain facial expression categories, as mentioned above. These limitations are addressed by the proposed DS-GPLVM, which avoids the need for a canonical view as it performs the classification on a shared manifold of facial expressions from multiple views, the topology of which is optimized for classification of the target expressions.

In the third group of methods [11], [12], a single classifier is learned using the expression data from multiple views. Specifically, [11] used variants of dense SIFT [26] features extracted from multi-view facial expression images. Likewise, [12] used the Generic Sparse Coding scheme [27] to learn a dictionary that sparsely encodes the SIFT features extracted from facial images in different views. However, because of the high variation in appearance of facial expressions in different views and of different subjects, the complexity of the learned classifier increases significantly with the number of views/expressions. This can easily lead to overfitting, and, in turn, poor generalization of the classifier to unseen data. On the other hand, the complexity of the classifier in DS-GPLVM is reduced by accounting for the underlying structure of the data (e.g., the correspondences between the views) via the shared manifold.

B. Multi-View Learning

In what follows, we give a short overview of the most popular multi-view learning methods that can be applied to multi-view FER. A common approach in multi-view classification is to learn view-specific projections using paired samples from different views, and to project those samples onto a common latent space, followed by their classification. The paired samples usually refer to samples that come from the same subject (e.g., face images of a person in two different views).
The goal here is to learn a latent space where the paired samples are placed close together if they come from the same class/subject, and far apart otherwise. A widely used unsupervised approach to learn such latent spaces is Canonical Correlation Analysis (CCA) [28] and its non-linear variant Kernel CCA (KCCA) [29]. The goal of these methods is to find a projection to a common subspace where the correlation between the low-dimensional embeddings is maximized. These methods can handle data only in the pair-wise manner (thus, only two views at a time), which makes them unfit for multi-view classification problems with more than two views. A generalization of CCA to the multi-view setting, Multiview CCA (MCCA), has been proposed in [30]. The main idea of MCCA is to find a common subspace where the correlation between the low-dimensional embeddings of any two views is maximized. Apart from CCA-based methods, there are a few works that extend single-view subspace learning to the multi-view case. [31] is a representative of this approach: it is a spectral clustering approach for the multi-view setting, in which the spectral embedding from one view is used to constrain the data of the other view. Note that the methods mentioned above are proposed for unsupervised learning. Thus, in the context of multi-view FER, they are not expected to perform well, as the view alignment produced by these methods is not optimized for classification.

Another group of methods performs supervised multi-view analysis. For instance, Multi-view Fisher Discriminant Analysis (MFDA) [32] learns classifiers in different views by maximizing the agreement between the predicted labels of these classifiers. However, MFDA can only be used for binary problems. In [14], the authors extended Linear Discriminant Analysis (LDA) [33] to the multiview case, named Multiview Discriminant Analysis (MvDA). This model maximizes the between-class and minimizes the within-class variations, across all the views, in the common subspace. Generalized Multiview Analysis (GMA) [13] has also been proposed for extending dimensionality reduction techniques for single views to multiple views. An instance of GMA, the Generalized Multiview LDA (GMLDA), finds a set of projections in each view that attempt to separate the content of different classes and unite different views of the same class in a common subspace. Another example of GMA is the GM Locality Preserving Projections (GMLPP), which extends the LPP [34] model and can be used to find a discriminative data manifold using the labels. Although effective in some tasks, these models are all based on linear projection functions. This can limit their performance when dealing with high-dimensional input features (i.e., appearance-based facial features), as well as their ability to successfully unravel non-linear manifold(s) of multiple views. All this is addressed by the proposed DS-GPLVM model.
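As a concrete point of reference for the correlation-maximizing methods above, the following minimal Python sketch uses scikit-learn's CCA to project paired samples from two views onto a common subspace. The data are randomly generated stand-ins, and the dimensions and component count are illustrative assumptions, not values from the paper.

```python
# Minimal two-view CCA sketch (illustrative; random stand-in data).
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.RandomState(0)
N, D1, D2, q = 200, 50, 40, 5                        # samples, view dims, latent dim (assumed)
Z = rng.randn(N, q)                                  # shared latent signal
Y1 = Z @ rng.randn(q, D1) + 0.1 * rng.randn(N, D1)   # view 1
Y2 = Z @ rng.randn(q, D2) + 0.1 * rng.randn(N, D2)   # view 2 (paired samples)

cca = CCA(n_components=q)
X1, X2 = cca.fit_transform(Y1, Y2)                   # embeddings with maximal pairwise correlation
corr = [np.corrcoef(X1[:, k], X2[:, k])[0, 1] for k in range(q)]
print(np.round(corr, 3))                             # per-dimension canonical correlations
```

As the section notes, such embeddings are unsupervised and pairwise only, which is precisely the limitation the DS-GPLVM is designed to overcome.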

III. THEORETICAL BACKGROUND: GAUSSIAN PROCESS LATENT VARIABLE MODELS (GPLVM)

In this section, we first give a brief overview of the GPLVM [35] for learning a non-linear low-dimensional manifold of a single observation space (e.g., the facial expression data from a single view). We then describe two types of discriminative priors for the manifold, which are used to obtain the discriminative GPLVMs [17], [36] for data classification.

A. GPLVM

The GPLVM [35] is a probabilistic model for non-linear dimensionality reduction. It learns a low-dimensional manifold $X = [x_1, \ldots, x_N]^T \in \mathbb{R}^{N \times q}$, with $q \ll D$, corresponding to the high-dimensional observation space $Y = [y_1, \ldots, y_N]^T \in \mathbb{R}^{N \times D}$. The learning of the manifold and its mapping to the observation space is modeled using the framework of Gaussian Processes (GP) [37]. Specifically, by using the covariance function $k(x_i, x_j)$ of GPs, the likelihood of the observed data, given the manifold, is defined as

$$p(Y|X,\theta) = \frac{1}{\sqrt{(2\pi)^{ND}\,|K|^{D}}} \exp\Big(-\tfrac{1}{2}\,\mathrm{tr}(K^{-1} Y Y^T)\Big), \qquad (1)$$

where $K$ is the kernel matrix, the elements of which are obtained by applying the covariance function $k(x_i, x_j)$ to each training data-pair $(i, j) \in \{1 \ldots N\}$. The covariance function is usually chosen as the sum of the Radial Basis Function (RBF) kernel, bias and noise terms

$$k(x_i, x_j) = \theta_1 \exp\Big(-\tfrac{\theta_2}{2}\,\|x_i - x_j\|^2\Big) + \theta_3 + \frac{\delta_{i,j}}{\theta_4}, \qquad (2)$$

where $\delta_{i,j}$ is the Kronecker delta function, and $\theta = (\theta_1, \theta_2, \theta_3, \theta_4)$ are the kernel parameters [37]. The manifold $X$ is then obtained as the mean of the posterior distribution

$$p(X, \theta | Y) \propto p(Y | X, \theta)\, p(X), \qquad (3)$$

where a spherical Gaussian prior is usually placed over the manifold. This prior prevents the GPLVM from placing latent points infinitely far apart, i.e., latent positions close to the origin are preferred [17]. The learning of the manifold is accomplished by minimizing the negative log-likelihood of the posterior in Eq. (3) w.r.t. the latent coordinates in $X$, which is given by

$$\mathcal{L} = \frac{D}{2} \ln |K| + \frac{1}{2}\,\mathrm{tr}(K^{-1} Y Y^T) - \log(p(X)). \qquad (4)$$

To enforce the latent positions to be a smooth function of the data space, [38] proposed to back-constrain the GPLVM. This ensures that points that are close in the data space are also close on the manifold. More importantly, these constraints allow us to learn the inverse mappings, which are used during the inference step to map the query points from the data space onto the manifold. Specifically, each datum $y_i$ is back-constrained so that it satisfies

$$x_{ij} = g_j(y_i; A_j) = \sum_{m=1}^{N} a_{mj}\, k_{bc}(y_i, y_m), \qquad (5)$$

where $x_{ij}$ is the $j$-th dimension of $x_i \in \mathbb{R}^q$, $g_j$ is the kernel-based regression over $Y$, and $A$ is the matrix that holds the parameters of the regression. Different projection vectors $A_j$ are used for each feature dimension in order to be able to learn different weights for each feature dimension, as in the standard linear kernel regression. To obtain a smooth inverse mapping in the back-constraints, we use the RBF kernel

$$k_{bc}(y_i, y_m) = \exp\Big(-\tfrac{\gamma}{2}\,\|y_i - y_m\|^2\Big), \qquad (6)$$

where $\gamma$ is the inverse width parameter. With such defined back-constraints, the model learning is accomplished either by minimizing the likelihood in Eq. (4) s.t. the back-constraints, or by plugging the expression in Eq. (5) into the likelihood function and solving the unconstrained optimization problem.
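For illustration, here is a minimal NumPy sketch of the covariance of Eq. (2) and the penalized negative log-likelihood of Eq. (4) with a spherical Gaussian prior. The toy data, kernel parameters, and latent initialization are assumptions; a full implementation would optimize X and theta, e.g., with conjugate gradients.

```python
import numpy as np

def gplvm_kernel(X, theta):
    """Eq. (2): RBF + bias + white-noise covariance over latent points X (N x q)."""
    t1, t2, t3, t4 = theta
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return t1 * np.exp(-0.5 * t2 * sq) + t3 + np.eye(len(X)) / t4

def gplvm_neg_log_lik(X, Y, theta):
    """Eq. (4): D/2 log|K| + 1/2 tr(K^{-1} Y Y^T) - log p(X), up to constants."""
    N, D = Y.shape
    K = gplvm_kernel(X, theta)
    _, logdet = np.linalg.slogdet(K)
    data_fit = 0.5 * np.trace(np.linalg.solve(K, Y @ Y.T))
    prior = 0.5 * (X ** 2).sum()        # negative log of the spherical Gaussian prior
    return 0.5 * D * logdet + data_fit + prior

rng = np.random.RandomState(0)
Y = rng.randn(30, 10)                   # toy observations (N=30, D=10)
X = rng.randn(30, 2) * 0.1              # latent initialization (q=2)
print(gplvm_neg_log_lik(X, Y, theta=(1.0, 1.0, 1e-3, 1e2)))
```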
B. Discriminative GPLVM (D-GPLVM)

The GPLVM is a generative model of the data, where a simple spherical Gaussian prior is placed over the manifold [17]. However, this model can be adapted for classification by using a discriminative prior that encourages the latent positions of examples of the same class to be close and those of different classes to be far apart on the manifold. This was first explored in [17], where a prior based on Linear Discriminant Analysis (LDA) is proposed. LDA tries to maximize between-class separability and minimize within-class variability by maximizing

$$J(X) = \mathrm{tr}\big(S_w^{-1} S_b\big), \qquad (7)$$

where $S_w$ and $S_b$ are the within- and between-class scatter matrices, respectively, defined as

$$S_w = \frac{1}{N} \sum_{i=1}^{L} \sum_{k=1}^{N_i} \big(x_k^{(i)} - M_i\big)\big(x_k^{(i)} - M_i\big)^T, \qquad (8)$$

$$S_b = \frac{1}{N} \sum_{i=1}^{L} N_i \big(M_i - M_0\big)\big(M_i - M_0\big)^T. \qquad (9)$$

Here, the $N_i$ training points from class $i$ are stored in $X^{(i)} = [x_1^{(i)}, \ldots, x_{N_i}^{(i)}]$, $M_i$ is the mean of the examples of class $i$, and $M_0$ is the mean of the examples of all the classes. The energy function in Eq. (7) is used to define a discriminative prior over the manifold as

$$p(X) = \frac{1}{Z_d} \exp\Big\{ -\frac{1}{\sigma_d^2}\, J^{-1} \Big\}, \qquad (10)$$

where $Z_d$ is a normalization constant, and $\sigma_d$ represents a global scaling of the prior. Then, the Discriminative GPLVM (D-GPLVM) [17] is obtained by replacing the Gaussian prior in Eq. (3) with the prior in Eq. (10). The authors also proposed a version of the prior based on Generalized Discriminant Analysis (GDA).
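The LDA-based energy and prior can be sketched in a few lines of NumPy; the latent points, labels, and scale sigma_d below are made-up stand-ins for illustration only.

```python
import numpy as np

def lda_prior_energy(X, labels):
    """J(X) = tr(S_w^{-1} S_b) from Eqs. (7)-(9)."""
    N, q = X.shape
    M0 = X.mean(axis=0)
    Sw = np.zeros((q, q)); Sb = np.zeros((q, q))
    for c in np.unique(labels):
        Xc = X[labels == c]
        Mc = Xc.mean(axis=0)
        Sw += (Xc - Mc).T @ (Xc - Mc) / N                 # within-class scatter, Eq. (8)
        Sb += len(Xc) / N * np.outer(Mc - M0, Mc - M0)    # between-class scatter, Eq. (9)
    return np.trace(np.linalg.solve(Sw, Sb))

def lda_log_prior(X, labels, sigma_d=1.0):
    """log p(X) = -J(X)^{-1} / sigma_d^2, up to the constant -log Z_d (Eq. (10))."""
    return -1.0 / (sigma_d ** 2 * lda_prior_energy(X, labels))

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) + 3, rng.randn(20, 2) - 3])  # two well-separated classes
labels = np.array([0] * 20 + [1] * 20)
print(lda_log_prior(X, labels))        # better class separation gives a higher log-prior
```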

A more general prior, based on the notion of the graph Laplacian matrix [39], has been used to derive a discriminative GPLVM model named Gaussian Process Latent Random Field (GPLRF) [36]. To define the prior, an undirected graph $G = (V, E)$ is first constructed, where $V = \{V_1, V_2, \ldots, V_N\}$ is the node set, with node $V_i$ corresponding to a training example $x_i$, and $E = \{(V_i, V_j) \,|\, i, j = 1 \ldots N,\ i \neq j,\ x_i \text{ and } x_j \text{ belong to the same class}\}$ is the edge set. By pairing each node with the random vector $X_k = (X_{1k}, X_{2k}, \ldots, X_{Nk})^T$ (for $k = 1, 2, \ldots, q$), we obtain a Gaussian Markov Random Field (GMRF) [40] w.r.t. the graph $G$. Next, each edge in the graph is associated with a weight (in this case, 1), and the weights are stored in the weight matrix defined as

$$W_{ij} = \begin{cases} 1 & \text{if } x_i \text{ and } x_j,\ i \neq j, \text{ belong to the same class} \\ 0 & \text{otherwise.} \end{cases} \qquad (11)$$

The graph Laplacian matrix is then defined as $L = D - W$, where $D$ is a diagonal matrix with $D_{ii} = \sum_j W_{ij}$. Finally, using $L$, the discriminative GMRF prior is defined as

$$p(X) = \prod_{k=1}^{q} p(X_k) = \frac{1}{Z_q} \exp\Big[ -\frac{\beta}{2}\,\mathrm{tr}(X^T L X) \Big], \qquad (12)$$

where $Z_q$ is a normalization constant and $\beta > 0$ is a scaling parameter. The term $\mathrm{tr}(X^T L X)$ in the discriminative prior in Eq. (12) reflects the sum of the distances between the latent positions of the examples from the same class. Thus, the latent positions from the same class that are closer will be given higher probability. This prior can be seen as a more general version of the LDA prior in Eq. (10), without the restriction on the size of the manifold. Also, the weights used to compute $L$ can be defined using not only the labels, but also the observed data, resulting in additional smoothing constraints. Finally, the cost function of the GPLRF model is obtained by plugging the prior in Eq. (12) into Eq. (4).

IV. DISCRIMINATIVE SHARED GPLVM (DS-GPLVM)

The D-GPLVM from Section III-B is designed for a single observation space. In this section, we generalize the D-GPLVM so that it can simultaneously learn a discriminative manifold of multiple observation spaces. This is attained by using the framework of Shared GPs [15], [16]. In our approach, we assume that the multiple observation spaces (e.g., different views of facial expressions) are dependent, and that they can be aligned on a discriminative shared manifold. In what follows, we first introduce the Shared GP model for alignment (fusion) of multiple observation spaces in the shared manifold, and define the discriminative shared-space prior for the manifold. We then describe learning and inference in the proposed model.

A. Shared-Space GPLVM

Given a set of corresponding features $\mathcal{Y} = \{Y^{(1)}, \ldots, Y^{(V)}\}$, extracted from $V$ views, instead of learning an independent manifold of the data from each view as done in the GPLVM, we learn a single manifold $X$ that is assumed to be shared among the views. Within the Shared GPs framework, the joint likelihood of $\mathcal{Y}$, given the shared manifold $X$, is factorized as follows

$$p(\mathcal{Y} | X, \theta_s) = p(Y^{(1)} | X, \theta^{(1)}) \cdots p(Y^{(V)} | X, \theta^{(V)}), \qquad (13)$$

where $\theta_s = \{\theta^{(1)}, \ldots, \theta^{(V)}\}$ are the kernel parameters for each observation space, and the kernel function is defined as in Eq. (2). It is assumed here that each observation space is generated from the shared manifold via a separate GP. The shared latent space $X$ is then found by minimizing the joint negative log-likelihood penalized with the prior placed over the shared manifold, which is given by

$$\mathcal{L}_s = \sum_{v} \mathcal{L}^{(v)} - \log(p(X)), \qquad (14)$$

where $\mathcal{L}^{(v)}$ is the negative log-likelihood of the data from view $v = 1, \ldots, V$, given by

$$\mathcal{L}^{(v)} = \frac{D}{2} \ln |K^{(v)}| + \frac{1}{2}\,\mathrm{tr}\big[(K^{(v)})^{-1} Y^{(v)} (Y^{(v)})^T\big] + \frac{ND}{2} \ln 2\pi, \qquad (15)$$

where $K^{(v)}$ is the kernel matrix associated with the input data $Y^{(v)}$. Here, a spherical Gaussian prior is usually placed over the manifold. To obtain a shared manifold for multiview classification, in the following we define a discriminative shared-space prior.

B. Discriminative Shared-Space Prior

To define a discriminative shared-space prior for multi-view learning, we generalize the GMRF prior for the single view given by Eq. (11). To this end, we first construct the view-specific weight matrices $W^{(v)}$, $v = 1, \ldots, V$.
Instead of using only the class labels, we also use data-dependent weights. Specifically, the elements of the weight matrix are obtained by applying the RBF kernel to the data from each view as

$$W_{ij}^{(v)} = \begin{cases} \exp\Big( -\dfrac{\|y_i^{(v)} - y_j^{(v)}\|^2}{t^{(v)}} \Big) & \text{if } i \neq j \text{ and } c_i = c_j, \\ 0 & \text{otherwise,} \end{cases} \qquad (16)$$

where $y_i^{(v)}$ is the $i$-th sample (row) in $Y^{(v)}$, $c_i$ is the class label, and $t^{(v)}$ is the kernel width, which is set to the mean squared distance between the training inputs as in [41]. Then, the graph Laplacian for view $v$ is $L^{(v)} = D^{(v)} - W^{(v)}$, where $D^{(v)}$ is a diagonal matrix with $D_{ii}^{(v)} = \sum_j W_{ij}^{(v)}$. Because the graph Laplacians from different views vary in their scale, we use the normalized graph Laplacian, defined as

$$L_N^{(v)} = (D^{(v)})^{-1/2}\, L^{(v)}\, (D^{(v)})^{-1/2}. \qquad (17)$$

Subsequently, we define the (regularized) joint Laplacian as

$$L = L_N^{(1)} + L_N^{(2)} + \cdots + L_N^{(V)} + \xi I = \sum_{v} L_N^{(v)} + \xi I, \qquad (18)$$

with $I$ the identity matrix, and $\xi$ a regularization parameter (typically set to a small value, e.g., $10^{-4}$), which ensures that $L$ is positive-definite [42]. This, in turn, allows us to define the discriminative shared-space prior as

$$p(X) = \prod_{v=1}^{V} p(X | Y^{(v)})^{1/V} = \frac{1}{Z_q} \exp\Big[ -\frac{\beta}{2}\,\mathrm{tr}(X^T L X) \Big]. \qquad (19)$$

Here, $Z_q$ is a normalization constant and $\beta > 0$ is a scaling parameter. The discriminative shared-space prior in (19) aims at maximizing the class separation in the manifold learned from the data from all the views, and it can be regarded as a multi-view kernel extension of the parametric LDA/LPP prior defined for a single view in [17] and [36].
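The construction of Eqs. (16)-(19) can be assembled directly, as in the following NumPy sketch for a toy two-view problem; the data and the values of beta and xi are illustrative assumptions, while the kernel width follows the mean-squared-distance heuristic stated in the text.

```python
import numpy as np

def view_laplacian(Y, labels):
    """Normalized, label-masked graph Laplacian of Eqs. (16)-(17) for one view."""
    sq = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    t = sq.mean()                                   # kernel width as in [41]
    W = np.exp(-sq / t)
    W *= (labels[:, None] == labels[None, :])       # keep same-class edges only
    np.fill_diagonal(W, 0.0)                        # enforce i != j
    d = W.sum(1)
    Dn = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    return Dn @ (np.diag(d) - W) @ Dn

def joint_log_prior_term(X, Ys, labels, beta=300.0, xi=1e-4):
    """(beta/2) tr(X^T L X) with the regularized joint Laplacian of Eq. (18)."""
    L = sum(view_laplacian(Y, labels) for Y in Ys) + xi * np.eye(len(X))
    return 0.5 * beta * np.trace(X.T @ L @ X)

rng = np.random.RandomState(0)
labels = np.repeat([0, 1, 2], 10)
Ys = [rng.randn(30, 40), rng.randn(30, 60)]         # two toy "views"
X = rng.randn(30, 5)                                # shared latent positions (q=5)
print(joint_log_prior_term(X, Ys, labels))
```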

Using this prior, the negative log-likelihood of the proposed DS-GPLVM model is given by

$$\mathcal{L}_s(X) = \sum_{v} \mathcal{L}^{(v)} + \frac{\beta}{2}\,\mathrm{tr}(X^T L X), \qquad (20)$$

where $\mathcal{L}^{(v)}$ is defined by Eq. (15).

C. Back-Constraints

In the GPLVM from Section III-A, the back-constraints, defined by the inverse mappings, ensure that the topology of the output space is preserved on the manifold. In DS-GPLVM, this is achieved by the discriminative shared-space prior, since the weight matrix used to define the prior is built from the input data. However, to perform inference with DS-GPLVM we still need to learn the inverse mappings that project data from different views onto the shared manifold. For this, we consider two scenarios. In the first, we define $V$ sets of constraints (one for each view), which are enforced by separate inverse mappings from each view to the shared space. In the second, we define one set of constraints (for all the views), which is enforced by a single inverse mapping from all the views to the shared space. We refer to the former as independent back-projections (IBP), and to the latter as single back-projection (SBP). These are given by:

IBP, from each view $v = 1, \ldots, V$:

$$X = g(Y^{(v)}; A^{(v)}) = K_{bc}^{(v)} A^{(v)}. \qquad (21)$$

SBP, from all $V$ views:

$$X = g(\mathcal{Y}; A) = \Big( \sum_{v=1}^{V} w_v\, K_{bc}^{(v)} \Big) A = \bar{K} A, \qquad (22)$$

where $g(\cdot, \cdot)$ represents the mapping function(s) learned using the kernel regression. The elements of $K_{bc}^{(v)}$ are given by Eq. (6), and $w_v$ is the (scalar) weight for view $v$. Note that for a single view, the model can be re-parametrized to obtain an unconstrained optimization problem (see Section III-A). Yet, in the case of multiple views, this is not possible, as it would result in a different $X$ for each view. Therefore, we need to solve a constrained optimization problem, the complexity of which increases with the number of views. To efficiently solve this, in the following section we propose an iterative learning algorithm for simultaneous learning of the shared space and the inverse mappings in the proposed model.
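The two back-projection variants of Eqs. (21)-(22) amount to simple kernel-regression mappings; a short sketch follows, assuming the RBF back-constraint kernel of Eq. (6). The toy data, width gamma, regression weights, and view weights w_v are stand-ins.

```python
import numpy as np

def k_bc(Ya, Yb, gamma=0.1):
    """RBF back-constraint kernel of Eq. (6)."""
    sq = ((Ya[:, None, :] - Yb[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * gamma * sq)

rng = np.random.RandomState(0)
Ys = [rng.randn(30, 40), rng.randn(30, 60)]          # training data, two views
As = [rng.randn(30, 5) * 0.01 for _ in Ys]           # per-view regression weights A^(v)
w = np.array([0.5, 0.5])                             # SBP view weights, summing to 1

# IBP (Eq. (21)): each view has its own mapping onto the shared space.
X_ibp = [k_bc(Y, Y) @ A for Y, A in zip(Ys, As)]

# SBP (Eq. (22)): one mapping from the weighted sum of view kernels.
A = rng.randn(30, 5) * 0.01
K_bar = sum(wv * k_bc(Y, Y) for wv, Y in zip(w, Ys))
X_sbp = K_bar @ A
print(X_ibp[0].shape, X_sbp.shape)                   # (30, 5) (30, 5)
```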
D. DS-GPLVM: Learning and Inference

Learning of the model parameters $X$, $\theta_s$ and $A$ consists of minimizing the negative log-likelihood given by Eq. (20) subject to either the IBP or the SBP constraints. Formally, we aim to solve the following minimization problem:

$$\begin{aligned} \arg\min_{X, \theta_s, A}\ & \mathcal{L}_s(X) + R(g) \\ \text{s.t.}\ & \text{IBP}(X, A^{(v)}) \equiv X - K_{bc}^{(v)} A^{(v)} = 0,\quad v = 1, \ldots, V \\ & \text{SBP}(X, A) \equiv X - \bar{K} A = 0,\quad \textstyle\sum_{v=1}^{V} w_v = 1,\ w_v \geq 0, \end{aligned} \qquad (23)$$

where $R(g)$ is a regularization term. To obtain the functional form for $R(g)$, we first derive the solution of the regularized kernel regression from the mapping function of the infinite-dimensional feature space $g(x_i) = \phi(x_i)^T w$, as in [43]. The solution to this problem is of the form $w = \sum_{i=1}^{N} a_i\, \phi(x_i)$. Hence, by applying the Representer Theorem [44] on this space, and by using the Tikhonov regularization for the parameters $w$, we arrive at the optimal functional form for $R(g)$ as

$$R(g) = \begin{cases} \sum_v \dfrac{\lambda^{(v)}}{2}\, r(g^{(v)}),\quad r(g^{(v)}) = \mathrm{tr}\big((A^{(v)})^T K_{bc}^{(v)} A^{(v)}\big), & \text{for IBP} \\[4pt] \dfrac{\lambda}{2}\,\mathrm{tr}(A^T \bar{K} A), & \text{for SBP.} \end{cases} \qquad (24)$$

IBP: Parameter Optimization. We first present the learning procedure for the more general case involving the IBP constraints, and then provide the solution for the SBP case. From Eq. (23), we see that the back-mapping from each view is represented by an independent set of linear constraints. We exploit this to find the model parameters by iteratively solving a set of sub-problems. To this end, we first incorporate the IBP constraints into the regularized log-likelihood in Eq. (23) by using Lagrange multipliers. As a result, we obtain the following augmented Lagrangian function:

$$\mathcal{L}_{IBP}\big(X, \{A^{(v)}, \Lambda^{(v)}\}_{v=1}^{V}\big) = \mathcal{L}_s(X) + R(g) + \sum_{v=1}^{V} \big\langle \Lambda^{(v)}, \text{IBP}(X, A^{(v)}) \big\rangle + \frac{\mu}{2} \sum_{v=1}^{V} \big\| \text{IBP}(X, A^{(v)}) \big\|_F^2, \qquad (25)$$

where $\Lambda^{(v)}$ are the Lagrange multipliers for view $v$, $\langle \cdot, \cdot \rangle$ is the inner product, and $\mu > 0$ is the penalty parameter. We can see from Eq. (25) that the linear constraint has been incorporated into the cost function as a quadratic penalty term without affecting the solution to the problem. The role of the Lagrange multipliers (the inner product term) is to achieve efficiency in obtaining the solution without the requirement of sequentially increasing the penalty parameter to infinity [18]. The standard approach is to minimize the objective in Eq. (25) w.r.t. all the model's parameters simultaneously. Yet, this is impractical, as the fact that the objective function is separable is not exploited to simplify the problem. To remedy this, we employ the Alternating Direction Method (ADM) [18] to decompose the minimization into subproblems, each of which can be solved separately w.r.t. a subset of the model parameters. More specifically, we split the learning of the parameters of the shared space and the back-mappings from each view, by defining the iterations of ADM as follows. We first solve for $X$ and $\theta_s$ as

$$\{X, \theta_s\}_{t+1} = \arg\min_{X, \theta_s}\ \mathcal{L}_s(X) + \frac{\mu_t}{2} \sum_{v=1}^{V} \Big\| \text{IBP}(X, A_t^{(v)}) + \frac{\Lambda_t^{(v)}}{\mu_t} \Big\|_F^2. \qquad (26)$$

Then, for each view $v = 1, \ldots, V$, we solve for $A^{(v)}$ as

$$A_{t+1}^{(v)} = \arg\min_{A^{(v)}}\ r(A^{(v)}) + \frac{\mu_t}{2} \Big\| \text{IBP}(X_{t+1}, A^{(v)}) + \frac{\Lambda_t^{(v)}}{\mu_t} \Big\|_F^2, \qquad (27)$$

and finally update the Lagrange multipliers and the penalty parameter as

$$\Lambda_{t+1}^{(v)} = \Lambda_t^{(v)} + \mu_t\, \text{IBP}\big(X_{t+1}, A_{t+1}^{(v)}\big), \qquad (28)$$

$$\mu_{t+1} = \min(\mu_{max}, \rho\, \mu_t), \qquad (29)$$

respectively. Note that in Eq. (29), $\rho$ is kept constant (it is typically set to $\rho = 1.1$). Since there is no closed-form solution for the problem in Eq. (26), we use the conjugate gradient algorithm (CG) [37] to minimize the objective w.r.t. the latent positions $X$ and the kernel parameters $\theta_s$. (The derivatives of the objective w.r.t. the model parameters are given in the appendix.) On the other hand, the problem in Eq. (27) is similar to that of Kernel Ridge Regression (KRR), and it has a closed-form solution, which is given by

$$A^{(v)} = \big(K_{bc}^{(v)} + \lambda^{(v)} I\big)^{-1} \big(X + \Lambda_t^{(v)}/\mu_t\big). \qquad (30)$$

However, this solution depends on the parameters $\gamma^{(v)}$ and $\lambda^{(v)}$, which would otherwise need to be tuned through costly cross-validation procedures. To alleviate this, we reformulate the optimization problem in Eq. (27). For this, we use the notion of the Leave-One-Out (LOO) cross-validation procedure for the KRR [45] to define the learning of the parameters $\gamma^{(v)}$ and $\lambda^{(v)}$. Once estimated, these parameters are used to compute $A^{(v)}$. The idea of the LOO learning procedure is based on the fact that, given any training set and the corresponding learned regression model, if we add a sample to the training set with the target equal to the output predicted by the model, the latter will not change, since the cost function will not increase [45]. Thus, given the training set with the sample $y_i^{(v)}$ left out, the predicted outputs $\hat{X}^{(-i)}$ (the superscript denotes that the $i$-th sample was left out) will not change if the sample $y_i^{(v)}$ with target $\hat{x}^{(-i)}$ is added to the set. Then, the goal of LOO is to minimize the difference between the predictions $\hat{x}^{(-i)}$ and the actual outputs $x_i$ for all the samples. To compute this, we first need to define the matrix

$$\begin{bmatrix} m_{ii} & \mathbf{m}_i^T \\ \mathbf{m}_i & \mathbf{M}_i \end{bmatrix} = \big(K_{bc}^{(v)} + \lambda^{(v)} I\big), \qquad (31)$$

where we partitioned the inverse matrix from Eq. (36) so that the elements corresponding to the $i$-th sample appear only in the first row and column (the same reordering is applied to $X$ and $\Lambda_t^{(v)}$ in order to place the $i$-th row on top). Furthermore, $\mathbf{M}_i$ is the kernel matrix formed from the remaining elements, $\mathbf{M}_i = K_{bc \setminus i}^{(v)} + \lambda^{(v)} I_{N-1}$. Then, using Eq. (36), the prediction and the actual target for sample $i$ are given by

$$\hat{x}^{(-i)} = \mathbf{m}_i^T \mathbf{M}_i^{-1} \mathbf{m}_i\, a_i^{(v)} + \mathbf{m}_i^T \bar{A}^{(v)}, \qquad (32)$$

$$x_i = m_{ii}\, a_i^{(v)} + \mathbf{m}_i^T \bar{A}^{(v)} - \Lambda_i^{(v)}/\mu_t, \qquad (33)$$

where $a_i^{(v)}$ denotes the first ($i$-th) row of the reordered $A^{(v)}$, and $\bar{A}^{(v)}$ the remaining rows. We can now define the cost for the LOO procedure, which is

$$E_{LOO} = \frac{1}{2} \sum_{i=1}^{N} \big\| x_i - \hat{x}^{(-i)} \big\|^2 = \frac{1}{2} \sum_{i=1}^{N} \Big\| \frac{a_i^{(v)}}{[\mathbf{M}^{-1}]_{ii}} \Big\|^2, \qquad (34)$$

with $\mathbf{M} = K_{bc}^{(v)} + \lambda^{(v)} I$. Minimization of $E_{LOO}$ w.r.t. $\gamma^{(v)}$ and $\lambda^{(v)}$ is accomplished using the CG algorithm again. (The exact derivation of Eqs. (32)-(33), along with the gradients of Eq. (34) w.r.t. $\gamma^{(v)}$ and $\lambda^{(v)}$, is given in the appendix.) By plugging these parameters into Eq. (36), we obtain $A^{(v)}$. Note that by adopting the LOO learning approach, we: (i) avoid the burden of the standard cross-validation procedures, which are time-consuming, and (ii) reduce the chances of overfitting the model parameters by using the additional cost defined in Eq. (34). At this point, it is important to clarify that under the proposed ADM-based optimization scheme we are able to automatically learn the majority of the model's parameters (i.e., $X$, $\theta$, $\mu$, $\lambda$, $\gamma$), avoiding the need for their tuning via validation procedures. The only parameter learned by means of cross-validation is the weight of the prior, $\beta$, while we also need to explore the effect of the dimensionality, $q$, of the manifold.
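The overall ADM cycle for the IBP case can be summarized by the sketch below. The inner conjugate-gradient step on Eq. (26) is abstracted into a caller-supplied placeholder, the A-update uses the closed form of Eq. (30), and the LOO cost uses the standard closed-form KRR leave-one-out identity that Eq. (34) relies on. All data and hyperparameters are stand-ins, and this is a structural sketch rather than the paper's exact implementation.

```python
import numpy as np

def adm_ibp(Ys, X0, k_bc, ls_step, lam=1e-2, mu=1e-3,
            mu_max=1e3, rho=1.1, n_cycles=25):
    """ADM cycle of Eqs. (26)-(29) for the IBP constraints (sketch).
    `ls_step` stands in for one CG line search on Eq. (26)."""
    X = X0.copy()
    N, q = X.shape
    As = [np.zeros((N, q)) for _ in Ys]        # back-mapping weights A^(v)
    Ls = [np.zeros((N, q)) for _ in Ys]        # Lagrange multipliers Lambda^(v)
    Ks = [k_bc(Y, Y) for Y in Ys]
    for _ in range(n_cycles):
        # Eq. (26): update the shared space given current A^(v), Lambda^(v).
        X = ls_step(X, Ys, Ks, As, Ls, mu)
        for v, K in enumerate(Ks):
            # Eq. (30): closed-form KRR update of A^(v) (LOO tuning of lam omitted).
            As[v] = np.linalg.solve(K + lam * np.eye(N), X + Ls[v] / mu)
            # Eq. (28): multiplier update with the constraint residual.
            Ls[v] = Ls[v] + mu * (X - K @ As[v])
        mu = min(mu_max, rho * mu)             # Eq. (29)
    return X, As

def krr_loo_error(K, lam, T):
    """LOO cost in the spirit of Eq. (34), via the KRR identity:
    the i-th LOO residual equals A_i / [(K + lam*I)^{-1}]_{ii}."""
    Minv = np.linalg.inv(K + lam * np.eye(len(K)))
    A = Minv @ T
    res = A / np.diag(Minv)[:, None]
    return 0.5 * (res ** 2).sum()

# Toy usage with a no-op likelihood step (a real implementation would run one
# CG line search on Eq. (26) here, as described in the text).
rng = np.random.RandomState(0)
Ys = [rng.randn(30, 40), rng.randn(30, 60)]
X0 = rng.randn(30, 5) * 0.1
kbc = lambda A, B: np.exp(-0.05 * ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))
X, As = adm_ibp(Ys, X0, kbc, ls_step=lambda X, *a: X)
print(krr_loo_error(kbc(Ys[0], Ys[0]), 1e-2, X))
```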
SBP: Parameter Optimization. Analogous to the IBP case, we define the augmented Lagrangian function for the SBP case using the regularized negative log-likelihood and the SBP constraints from Eq. (23). The resulting function has the same form as Eq. (25), but after dropping the dependences on $v$ and replacing the IBP by the SBP constraints. The model parameters are then found by applying the proposed ADM to the augmented Lagrangian function. For this, the objectives in each iteration of the ADM for the IBP case described above are adjusted accordingly.

To achieve efficiency, when applying the CG algorithm to the objective in each iteration of the ADM, with either IBP or SBP constraints, we stop at the first line search of CG, update the corresponding parameters, and go to the next iteration. The ADM cycle is repeated until convergence of the augmented Lagrangian function.

Inference in the DS-GPLVM is straightforward. The test data (which for the view-invariant case come from a single view $v$, and for the multi-view case from all available views) are first projected to the shared space using the back-mappings defined by Eq. (21) for the IBP, or Eq. (22) for the SBP case. In the second step, classification of the target facial expression is accomplished by using a single classifier trained on the discriminative shared manifold. For this, we use the kNN classifier. (In the model as defined, the resulting posterior is over the manifold and not the class information, so it cannot be used for the classification. For this reason, we need to apply a classifier to the inputs projected onto this manifold during inference. A reasonable choice would be the GP classifier (GPC); however, in our case this would be impractical for two reasons: (i) in the case of more than two classes, the computational complexity of the GPC increases significantly, since we have to learn a different kernel for each class, making it less applicable to a large number of classes/views; (ii) more importantly, since we are not interested in the classification uncertainty, the GPC is expected to perform similarly to the standard kernel regression, as noted in [37]. Thus, we opt for the deterministic kNN classifier, which is the commonly employed classifier in the GPLVM discriminative models (e.g., see GPLRF [36]).) Alg. 1 summarizes the learning and inference of the proposed DS-GPLVM.
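Inference thus reduces to a kernel-regression projection followed by kNN. A minimal sketch for the view-invariant (IBP) case follows; the training quantities, kernel width, and neighbor count are stand-ins.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=1):
    """kNN on the shared manifold (k=1 is the setting used as baseline in the paper)."""
    d = ((X_train - x_query) ** 2).sum(1)
    idx = np.argsort(d)[:k]
    return Counter(y_train[idx]).most_common(1)[0][0]

def infer_view_invariant(y_test, Y_train_v, A_v, X_train, labels, gamma=0.1, k=1):
    """Project a test sample from view v via Eq. (21), then classify with kNN."""
    kvec = np.exp(-0.5 * gamma * ((Y_train_v - y_test) ** 2).sum(1))
    x_query = kvec @ A_v                       # back-projection onto the shared manifold
    return knn_predict(X_train, labels, x_query, k)

rng = np.random.RandomState(0)
Y_train_v = rng.randn(30, 40); A_v = rng.randn(30, 5) * 0.01
X_train = rng.randn(30, 5); labels = np.repeat([0, 1, 2], 10)
print(infer_view_invariant(rng.randn(40), Y_train_v, A_v, X_train, labels))
```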

V. EXPERIMENTS

A. Datasets and Experimental Procedure

We evaluate the performance of the proposed DS-GPLVM on expressive face images from three publicly available datasets: MultiPIE [46], Labeled Face Parts in the Wild (LFPW) [47] and Static Facial Expressions in the Wild (SFEW) [48]. Fig. 2 shows sample images from these datasets.

Algorithm 1: DS-GPLVM Learning and Inference.

Fig. 2. Example images from MultiPIE (top), LFPW (middle) and SFEW (bottom) datasets, with the facial point annotations for the first two.

From the MultiPIE dataset we used images of 270 subjects depicting acted facial expressions of Neutral (NE), Disgust (DI), Surprise (SU), Smile (SM), Scream (SC) and Squint (SQ), captured at pan angles of -30°, -15°, 0°, 15° and 30°, resulting in 1531 images per pose. For all images, we selected the flash from the view of the corresponding camera in order to have the same illumination conditions. The LFPW dataset contains images downloaded from google.com, flickr.com, and yahoo.com, depicting spontaneous facial expressions (mainly smiles), with large variation in poses, illumination and occlusion. We used 200 images of NE and SM expressions from the test set provided by [47]. We manually annotated the images in terms of the poses used in MultiPIE. Lastly, the SFEW dataset consists of 700 images of 95 subjects, extracted from movies containing facial expressions with various head poses, occlusions and illumination conditions. The images have been labeled in terms of the six basic emotion expressions, i.e., Anger (AN), Disgust (DI), Fear (FE), Happiness (HA), Sadness (SA) and Surprise (SU), plus Neutral (NE).

The images from both MultiPIE and LFPW were cropped so as to have equal size, and annotations of the locations of 68 facial landmark points were provided by [49], which were used to align the facial images in each pose using an affine transform. Similarly, the images from SFEW were cropped and aligned using 5 facial landmark points (center of the eyes, tip of the nose, and corners of the mouth) provided by [48].

For the experiments on MultiPIE, we used three sets of features: (I) facial points, (II) LBPs [20], and (III) DCT [25]. More specifically, from each aligned facial image we extracted LBPs and DCT features from local patches around the facial landmarks. For LBPs, we used 8 neighbors with radius 2, and in the case of DCT we kept the first 15 coefficients (zig-zag method) of each patch. We then concatenated the results from all the patches to form the feature vectors. Note that LBP and DCT are complementary features, since the former captures local information between a neighborhood of pixels, while the latter preserves the spatial correlation of the pixels inside the neighborhood. Finally, we applied PCA on the three feature sets, keeping 95% of the total energy, to remove unwanted noise and artifacts and to reduce the dimensionality of the original feature vectors (especially the appearance-based ones). The resulting dimensionality of each set varies among the views: the point features have around 20D, while both appearance feature sets have around 100D. In the experiments conducted on LFPW, we used only feature set (I), while for SFEW we extracted the same local texture descriptors as in [48], i.e., Local Phase Quantization (LPQ) [50] and Pyramid of HOG (PHOG) [51]. To reduce the dimensionality, we again applied PCA, keeping the same amount of energy, i.e., 95%. This results in 47- and 220-D feature vectors, respectively.
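The 95%-energy reduction step can be sketched as follows; the feature matrix is a stand-in, and scikit-learn's fractional n_components is one convenient way to implement the energy criterion (the paper does not specify an implementation).

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
F = rng.randn(600, 1000)                 # stand-in feature matrix (e.g., stacked LBP/DCT)
pca = PCA(n_components=0.95)             # keep 95% of the total energy, as in the text
F_red = pca.fit_transform(F)
print(F_red.shape)                       # the reduced dimensionality varies with the data
```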
The conducted experiments are organized as follows. In Section V-B, we perform a qualitative analysis of the DS-GPLVM using the MultiPIE dataset. In Section V-C, we evaluate the effectiveness of the proposed DS-GPLVM in the task of multi-view FER on MultiPIE. Specifically, we consider two settings: the standard multi-view setting, where images from all the views are available during training/inference, and the view-invariant setting, where images from all the views are available during training but only a single view is available during inference. Furthermore, we also evaluate the model on the feature fusion task, where different types of features extracted within the same view are used. In addition, we challenge the robustness of the model under different illumination, where we evaluate the performance of the model on images with different lighting conditions within the same view. In Section V-E, we test the ability of the DS-GPLVM to generalize to spontaneously displayed facial expressions. For this, we perform a cross-dataset evaluation of the model, where images of the SM and NE classes from MultiPIE are used for training, and images of the corresponding classes from LFPW for testing. Finally, in Section V-F, we evaluate DS-GPLVM on the feature fusion task using real-world images from the SFEW dataset.

Fig. 3. DS-GPLVM. Upper row shows the mean classification rate across all 5 poses from the MultiPIE dataset using feature set (I) as a function of: (a) the number of training data per pose, (b) the dimensionality of the latent space, and (c) the prior scale parameter β. Lower row depicts: (d) the negative log-likelihood, (e) the norms of the constraints in the DS-GPLVM, and (f) the mean classification rate, as a function of the number of ADM cycles.

In the experiments mentioned above, we compare the DS-GPLVM to the state-of-the-art view-invariant and multi-view learning methods. As the baseline method, we use the 1-nearest-neighbor (1-NN) classifier trained/tested in the original feature space. Similarly, we apply the 1-NN classifier to the subspace obtained by LDA [33], supervised LPP [52], and their kernel counterparts, the D-GPLVM [17] with the LDA-based prior, and the GPLRF [36]. These are well-known methods for supervised dimensionality reduction, and we show their performance in the view-invariant version of the experiments. We also compare to our previous work in [19], where the latent space and the back-mappings are learned independently. We denote this model as DS-GPLVM (ind.) to distinguish it from the model proposed here. In the experiments conducted in the multi-view/feature-fusion settings, we compare DS-GPLVM to the baseline methods CCA [28] and KCCA [29]. Since they are designed to deal with only two modalities (feature sets), we follow the pair-wise (PW) evaluation approach, as in [14], i.e., the methods were trained on all combinations of view pairs, and their results were averaged. We also compared DS-GPLVM to the state-of-the-art methods for multi-view learning, namely, the MvDA [14], and the multi-view extensions of LDA (GMLDA) and LPP (GMLPP) proposed in [13].

In all our experiments we performed 5-fold subject-independent cross-validation. We used a separate validation set to tune the parameters of each model. More specifically, for all the GPLVM-based methods (i.e., DS-GPLVM, GPLRF and D-GPLVM) the optimal weight for the prior β was set using a grid search. For the GPLRF and D-GPLVM we additionally performed an extra grid search to tune the parameter of the kernel of the back-mapping (an RBF kernel was used), as in [17]. For the GMA-based methods (i.e., GMLDA and GMLPP) we tuned the parameter that controls the alignment of the subspaces, as suggested in [13]. Finally, in KCCA the width of the employed RBF kernel was cross-validated, while LPP, LDA and MvDA had no parameters to tune. To report the accuracy of FER, we use the classification rate, where the classification was performed on the test set using the 1-NN classifier in all the subspace-based models. The five folds with the corresponding train, validation and test sets were generated once and kept fixed during all the experiments for all the methods, in order to achieve a fair comparison. For the experiments on MultiPIE, the size of the train, validation and test set was 600, 600, and 300 images per view, respectively. For the cross-dataset experiment, since we used only images with SM and NE expressions from MultiPIE to train the models, the resulting train and validation sets were slightly smaller, in particular 500 and 100 images per pose, respectively. The test set was the 200 images from LFPW, and its size varied with the pose. Finally, for the experiments on SFEW we adopted the configuration proposed by the creators of the dataset in [48]. The data were already split into two folds, for training and testing.
Each time, the training fold was further split into 5 folds, to tune the parameters of the models with 5-fold subject-independent cross-validation. The size of the resulting sets was 280, 70 and 350 images, respectively. For this experiment, due to the small size of the dataset, after tuning the parameters with the cross-validation, each model was re-trained on the whole train and validation set (one of the two original folds of the dataset) with the optimal parameters, before reporting the results on the test set.
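Subject-independent splits of the kind described above (no subject appears in both the training and test partitions) can be generated with grouped cross-validation; the sketch below uses synthetic stand-in data and scikit-learn's GroupKFold as one way to guarantee disjoint subject sets, not as the paper's stated tooling.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.RandomState(0)
X = rng.randn(300, 20)                   # stand-in features
y = rng.randint(0, 6, 300)               # stand-in expression labels
subjects = rng.randint(0, 60, 300)       # subject identity per sample
for tr, te in GroupKFold(n_splits=5).split(X, y, groups=subjects):
    assert not set(subjects[tr]) & set(subjects[te])   # subject-independent split
```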

B. DS-GPLVM: Qualitative Analysis

In this section, we evaluate the performance of the proposed DS-GPLVM w.r.t. the various parameter values. For this, we use feature set (I), i.e., the facial points, extracted from the MultiPIE dataset. Fig. 3 shows the average classification rate (across the views) of the DS-GPLVM for different numbers of training samples per view, different sizes of the shared space, and the parameter β = {1, 3, 10, 30, 100, 300, 1000, 10000}.

Fig. 3(a) shows the performance of the SBP and IBP versions of DS-GPLVM, the parameters of which are learned using a varying number of training data, while the manifold size is fixed to 5. We see that the SBP version of DS-GPLVM (multi-view setting) achieves a high classification rate (~87%) when using a relatively small number of training data (i.e., 100 images per view). On the other hand, the IBP version of DS-GPLVM (view-invariant setting) requires more training data (~500 images per view) to achieve a similar performance. This is a consequence of not using the images from all available views during the inference step. However, with the increased number of training data, the model effectively learns the correlations among the views, rendering the information from some views redundant during the inference. From Fig. 3(b), we see how the size of the shared space affects the accuracy of the learned model. It is clear that both the SBP and IBP variants of the model find the 5D shared space optimal for classification. Lower-dimensional manifolds fail to explain the correlations among the views, while manifolds with more than 5 dimensions do not include any additional discriminative information. Fig. 3(c) illustrates the influence of the shared-space discriminative prior on the classification task. In the case of both SBP and IBP, β = 300 results in the best performance of the model, while its further increase leads to a drop in the performance. This is expected, as for high values of β the likelihood term in the DS-GPLVM is fully ignored, resembling LPP. Evidently, such a model is prone to overfitting, mainly because of the strong influence of the labels during training. On the other hand, for small values of β the shared space is not sufficiently informed about the class labels, resulting again in a lower performance. In what follows, we set, for both the SBP and IBP variants of the model, the number of training examples to 500, the size of the shared space to 5, and β = 300.

Fig. 3(d)-(f) illustrate the convergence properties of the DS-GPLVM. We see from Fig. 3(d) that the regularized negative log-likelihood of the model reaches a local minimum in fewer than 25 cycles of the ADM. Fig. 3(e) shows the Frobenius norm [33] of the constraints for the SBP and IBP variants, i.e., the difference between the estimated shared space and the back-mappings. Note that the DS-GPLVM is always initialized in the 15° view (it is found to be the most informative view). Hence, we can see that the norm of this view (black curve) starts from a low value when IBP is used. However, with more cycles of the ADM, the DS-GPLVM learns the shared manifold by taking into account all views, and thus the error of the back-projections from the remaining views to the shared subspace decreases, while the one from the initialized view, i.e., the 15°, increases slightly, a consequence of the model trying to align the manifolds of the different views. The red curve represents the error between the learned subspace and the back-projections in the case of SBP.
It is clear that the SBP variant outperforms the IBP variant of the model, since the former achieves a closer back-projection to the shared discriminative manifold, resulting in a better classification performance. This comes with a larger number of ADM cycles during learning of the DS-GPLVM with SBP, since it uses all views simultaneously to learn the back-mapping. Finally, from Figs. 3(e)-(f), we observe a strong correlation between the norms of the model variants and the classification rate. In all cases, the increased classification performance is achieved by decreasing the gap between the shared space and the back-mappings, with both measures converging synchronously.

TABLE I: AVERAGE CLASSIFICATION RATE ACROSS FIVE VIEWS FROM THE MULTIPIE DATASET FOR THREE FEATURE SETS. THE IBP VERSION OF DS-GPLVM WAS TRAINED USING ALL AVAILABLE VIEWS, AND TESTED PER VIEW. THE REPORTED STANDARD DEVIATION IS ACROSS FIVE VIEWS.

C. Comparisons With Other Multi-View Learning Methods

1) Same Facial Features in Multiple Views: We evaluate the proposed DS-GPLVM model across views in both the view-invariant and the multi-view setting. The former refers to the scenario where data from all views are used for training, while testing is performed using data from each view separately, and the latent space is back-constrained using the IBP. The latter refers to the scenario where data from all views are used during training and testing, and the latent space is back-constrained using the SBP. The same strategy was used for evaluation of the other multi-view techniques, i.e., GMLDA and GMLPP. Table I summarizes the results for the three sets of features, averaged across the five views from MultiPIE. We see that the facial points (feature set (I)) result in a more discriminative descriptor for all methods, although we end up with a higher standard deviation compared to the appearance features (feature sets (II) and (III)). Evidently, DS-GPLVM outperforms the other view-invariant and multi-view models on all three feature sets, showing that it can successfully unravel the discriminative shared space that is better suited for FER. Interestingly, in this experiment the LDA- and LPP-based linear methods achieve high accuracy, which is comparable to that of D-GPLVM and GPLRF. Moreover, GMLDA and GMLPP perform similarly to their single-view-trained counterparts, indicating that they were not able to fully benefit from the presence of additional views. We also observe a similar performance of the MvDA and the standard LDA. Note that the accuracy of DS-GPLVM is higher by 3% than that of GPLRF, which is a special case of DS-GPLVM. We attribute this to the ability of the DS-GPLVM to integrate the discriminative information from multiple views into the shared space.

TABLE II: VIEW-INVARIANT CLASSIFICATION RATE ON THE MULTIPIE DATASET FOR THE BEST FEATURE SET (i.e., FACIAL POINTS (I)). THE IBP VERSION OF DS-GPLVM IS TRAINED USING ALL AVAILABLE VIEWS, AND TESTED PER VIEW. THE REPORTED STANDARD DEVIATION IS ACROSS 5 FOLDS.

TABLE III: CLASSIFICATION RATE FOR THE MULTI-VIEW TESTING SCENARIO USING THE SBP VERSION OF DS-GPLVM. THE REPORTED STANDARD DEVIATION IS ACROSS THE 5 FOLDS.

TABLE IV: ACCURACY OF THE AUGMENTED CLASSIFICATION IN THE FRONTAL POSE. FEATURE FUSION IS ATTAINED WITH THE SBP VERSION OF DS-GPLVM.

We draw similar conclusions from the comparison between DS-GPLVM and DS-GPLVM (ind.), where the latter fails to impose the view constraints on the shared manifold.

Table II shows the performance of the models tested across all views, when feature set (I) (the best for all the models from Table I) is used. It is evident that the proposed DS-GPLVM performs consistently better than the compared models across all views. Note that all models achieve the lowest classification rate in the frontal view. However, the DS-GPLVM significantly improves the performance attained by the other models in this view. We attribute this to the fact that DS-GPLVM performs the classification in the shared space, where the classification of the expressions from the frontal view is facilitated due to the discriminative information learned from the other views. Furthermore, it is worth noting that the models' accuracy on the negative pan angles (the left side of the face) is higher than on the corresponding positive pan angles (the right side of the face). Since MultiPIE contains more examples of negative emotion expressions, this confirms recent findings in [53] showing that the left hemisphere of the face is more informative when it comes to expressing negative emotions (e.g., Disgust), while the right hemisphere is more informative for positive emotions (e.g., Happiness). In other words, due to the imbalance of the emotion categories in the used dataset, the learned classifiers were biased toward negative emotion expressions, and, hence, toward the negative pan angles.

Table III compares the performance of the SBP variant of DS-GPLVM with other multi-view learning methods on the three feature sets. The poor performance of KCCA can be attributed to its inherent propensity to overfit the training data, as also observed in [29]. In addition, both CCA and KCCA do not use any supervisory information during the subspace learning, which further explains their low performance. By comparing GPLRF (with concatenated features from different views) and DS-GPLVM, we see that the former, although not a multi-view method, performs comparably to our DS-GPLVM in the case of feature set (I). We attribute this to the fact that GPLRF can effectively explain the variation in facial points from multiple views using a single GP. Yet, because of the large variation in the appearance of facial expressions from different views, the same is not the case when feature sets (II) and (III) are used. When compared to the state-of-the-art methods for multi-view learning (GMA and MvDA), DS-GPLVM performs similarly or better on all three feature sets. Furthermore, the SBP version of DS-GPLVM succeeds during inference in modeling the complementary information from all available views, resulting in a higher accuracy compared to the best performing view, i.e., 15°, of the IBP variant of DS-GPLVM (see Table II).
2) Feature Fusion: We next evaluate DS-GPLVM on the feature fusion task, where the goal is to augment view-invariant facial expression classification by fusing different feature sets. Specifically, we trained the SBP version of DS-GPLVM using the three feature sets extracted from the frontal view only. This choice was made because the frontal view is not the most informative one (the 15° view is), and hence, there is a lot of room for improvement. From Table IV, we see that the accuracy of DS-GPLVM in the frontal view outperforms that achieved by the GPLRF, where the features are simply concatenated and used as input, by more than 3%. This is because GPLRF cannot fully account for the variation in all three feature sets using a single GP. By contrast, DS-GPLVM learns separate GPs for each feature set, resulting in improved classification performance in the frontal view.

TABLE V: CLASSIFICATION RATE ON THE FRONTAL VIEW UNDER DIFFERENT ILLUMINATION FOR FEATURE SET (III). THE IBP VARIANT OF DS-GPLVM WAS USED. THE REPORTED STANDARD DEVIATION IS ACROSS THE 5 FOLDS.

TABLE VI: COMPARISON OF TESTED METHODS ON THE MULTIPIE DATABASE. THE IBP VERSION OF DS-GPLVM WITH FEATURE SET (III) OUTPERFORMS THE STATE-OF-THE-ART METHODS FOR VIEW-INVARIANT FER. THE REPORTED STANDARD DEVIATION IS ACROSS 5 FOLDS.

It is also important to mention that by training GPLRF using each feature set separately, we obtained the following classification rates: 77.6%, 81.3% and 82.1%, for feature sets (I), (II), and (III), respectively. Compared to the accuracy of DS-GPLVM in Table IV (87.1%), the proposed feature fusion significantly outperforms each of the feature sets used independently. This is expected, since the appearance features (LBPs and DCT), extracted from local patches, do not encode global information about the face geometry, which is efficiently encoded by the facial points. On the other hand, the facial points are not informative about transient changes in facial appearance (e.g., wrinkles and bulges), which are successfully captured by the appearance features. Thus, the combination of these features within the proposed framework turns out to be highly effective. The rest of the multi-view methods also achieve a significant increase in their performance (apart from GMLDA). However, DS-GPLVM outperforms (although marginally in some cases) all these state-of-the-art models.

3) Same Facial Features in Different Illumination: Herein, we evaluate the proposed DS-GPLVM under different illumination on MultiPIE, where the goal is to learn an illumination-free manifold for FER. For the purposes of this experiment, we used only images from the frontal view with two different lighting conditions: (i) no lighting source (dark view), and (ii) lighting from the flash of the corresponding camera (bright view). Each lighting condition was considered as a separate view to train the IBP variant of DS-GPLVM with feature set (III). The DCT features were selected since they are less robust to illumination variations than LBPs, and thus a difference in the performance between the two illumination conditions is expected. From Table V we see that this difference is present in the results of the single-view method, i.e., the GPLRF. The latter was trained separately for each lighting condition, and hence, the two learned manifolds falsely encoded the illumination as important information, resulting in a considerable gap between the performance of the bright and the dark view. Contrary to that, the compared multi-view methods, i.e., GMLDA, GMLPP and MvDA, managed to remove, to some extent, the lighting condition of the views in the common space. This is evidenced by the improvement in the performance of the dark view, although a notable difference between the performance of the two views still exists. On the other hand, the proposed DS-GPLVM not only achieved better results under both illumination conditions, but it also managed to align them by discarding the illumination in the shared space. Note that the DS-GPLVM reports a similar classification rate regardless of the original lighting condition of the view.

D. Comparisons With Other Multi-View Methods

We compare DS-GPLVM (the IBP variant using feature set (III)) to the state-of-the-art methods for view-invariant FER. The results for the LGBP-based method, where the LBP features are extracted from Gabor images, are obtained from [6]. For the method in [12], we extracted the Sparse SIFT (SSIFT) features from the same images we used from MultiPIE.
In both of the aforementioned methods, the target features (LGBP and SSIFT) are extracted per view, and are then fed into view-specific SVM classifiers. We also compare our model to the Coupled GP (CGP) model [9], where view normalization is first performed by projecting a set of facial points (feature set (I)) from the non-frontal views to the canonical view. In our experiments with CGP, we set the canonical view to the most discriminative view among the positive pan angles (i.e., 15°). This was followed by classification using the SVM learned in this view. Table VI shows the comparative results. We first observe that all methods (except [12]) achieve their best results for the 15° view, indicating that, regardless of the method/features employed, this view is the most discriminative (among the positive pan angles) for the target task. We also note that DS-GPLVM outperforms, on average, the other two methods that are based on the appearance features. This difference is in part due to the features used, and in part due to the fact that the methods in [6] and [12] both fail to model correlations between the different views. By contrast, the CGP method accounts for the relations between the views in a pair-wise manner, while DS-GPLVM and DS-GPLVM (ind.) do so for all the views simultaneously. However, the proposed DS-GPLVM shows performance superior to that of DS-GPLVM (ind.), which, in turn, outperforms CGP. This is because CGP performs the view alignment (i) directly in the observation space, and (ii) without using any discriminative criterion during this process. Thus, the effects of high-dimensional noise and the errors of the view normalization adversely affect its performance in the classification task. On the other hand, DS-GPLVM (ind.) aligns the views directly in the shared space optimized for expression classification, while the proposed DS-GPLVM imposes further constraints on the shared manifold, resulting in better performance on the target task. This is also reflected in the confusion matrices in Fig. 4. Note that the main source of confusion are the facial expressions of Disgust and Squint. This is because they are characterized by similar facial changes in the region of the eyes. However, the proposed DS-GPLVM significantly improves the accuracy on Squint compared to the other models.

Fig. 4. Comparative confusion matrices for FER over all angles of view for the (a) DS-GPLVM, (b) CGP, (c) SSIFT and (d) LGBP.

E. Cross-Dataset Experiments on MultiPIE and LFPW

TABLE VII
SMILE DETECTION IN IMAGES FROM THE LFPW DATASET. THE METHODS WERE TRAINED ON THE MULTIPIE DATASET USING FEATURE SET (I). WE USED THE IBP VERSION OF DS-GPLVM FOR THE VIEW-INVARIANT FER.

In this section, we test the ability of DS-GPLVM (the IBP variant) to generalize to unseen real-world spontaneous data. To this end, we evaluate the different models on the smile detection task, where feature set (I), extracted from images from MultiPIE, is used for training. Images from LFPW are used for testing. This is a rather challenging task, mainly because the test images are captured in an uncontrolled environment, which is characterized by large variations in head pose and illumination, and by occlusions of parts of the face. Also, the models are trained using data of posed (deliberately displayed, as opposed to spontaneous and "in the wild") expressions, which can differ considerably in subtlety from the spontaneous expressions used for testing. The difficulty of the task is evidenced by the results in Table VII, where we observe a significant drop in the accuracy of all methods. Furthermore, we observe that the most informative views for smile detection are the ones with positive degrees (the right side of the face). This, again, is for the reasons explained in Section V-C1. However, all methods attain their highest accuracy in the frontal pose. We attribute this to the fact that the faces with non-frontal poses do not exactly belong to the discrete set of poses, but rather to a continuous range from 0° to ±30°. Thus, the accuracy of the pose registration significantly affects the performance of the models. Nevertheless, the proposed DS-GPLVM outperforms the other models by a large margin in all poses except 30°. To explain this, we checked the number of test examples of smiles in this pose, and found that only a few were available (contrary to the other poses, which contained far more examples). Therefore, the misclassification of some of them resulted in a significant drop in the performance of both DS-GPLVM and DS-GPLVM (ind.).

F. Expression Recognition on Real-World Images From SFEW

Finally, we evaluate the models on the feature fusion task, where the features are extracted from images of spontaneously displayed facial expressions in a real-world environment. Specifically, we used LPQ [50] and PHOG [51] features extracted from the expressive images of the SFEW dataset. Contrary to the cross-dataset evaluation in the previous section, here both training and testing are performed using real-world spontaneous expression data. Note that LPQ is a texture descriptor that captures local information over a neighborhood of pixels, which makes it robust to illumination changes. On the other hand, PHOG is a local descriptor that is capable of preserving the spatial layout of the local shapes in an image. Thus, we expect the fusion of the two to achieve improved performance on the target task.

The provided images of SFEW were originally divided into two subject-independent folds, and we report the average results over the folds. We used the SBP variant of DS-GPLVM.

TABLE VIII
CLASSIFICATION RATES PER EXPRESSION CATEGORY OBTAINED BY DIFFERENT MODELS TRAINED/TESTED USING THE SFEW DATASET.

Table VIII shows the results obtained by the different methods. As the baseline, we use the results obtained by the database creators [48], who used a non-linear SVM classifier on the concatenation of the features to report the classification rate on the feature fusion task. We see that all of the employed multi-view learning methods outperform the baseline on average. This is due to their ability to effectively exploit the discriminative information embedded in both feature spaces. However, in most cases, the linear multi-view learning methods are outperformed by the proposed DS-GPLVM. We attribute this to the fact that the linear models are unable to fully unravel the non-linear discriminative manifold of the used feature spaces. By contrast, this is handled better by the non-linear mappings in DS-GPLVM, resulting in its average performance being the best among the tested models. Note, however, that in the case

of Surprise, Fear and Neutral, its performance is lower than that of the linear models. By inspecting the back-projected test examples of these expressions on the shared manifold, we observed that Neutral was spread around the other emotion categories. This is because of the varying level of expressiveness of the different subjects, which results in examples of Neutral being categorized as other expressions displayed with low intensity. As for Surprise and Fear, the learned shared manifold indicated overfitting of these expressions. This is mainly due to subject differences, which adversely affected the ability of the back-mappings to correctly map these expressions onto the shared manifold. Nevertheless, DS-GPLVM outperformed the rest of the models on the remaining expressions, with a considerable improvement on Disgust, Happiness and Sadness.

VI. CONCLUSION

In this paper, we proposed the DS-GPLVM model for learning a discriminative shared manifold of facial expressions from multiple views that is optimized for expression classification. This model is a generalization of latent variable models for learning a discriminative subspace of a single observation space. As such, it constitutes a complete non-parametric multi-view learning framework that can instantiate the compared non-linear single-view methods (i.e., D-GPLVM [17] and GPLRF [36]). As evidenced by our results on posed and spontaneously displayed facial expressions, and by comparisons with the state-of-the-art methods for supervised multi-view learning and facial expression recognition, modeling the manifold shared across different views and/or features using the proposed framework considerably improves both multi- and per-view/feature classification of facial expressions.

APPENDIX A
DERIVATIVES

During the optimization, we need to update X and \theta_s by solving the problem in Eq. (26). The latter is a sum of two terms: the negative log-likelihood given by Eq. (20), and the norm term which, for convenience, we denote as

C = \frac{\mu_t}{2} \sum_{v=1}^{V} \big\| X - \mathrm{IBP}(X, A_t^{(v)}) + \Lambda_t^{(v)} \big\|_F^2.  (35)

Because of the likelihood term, the defined problem does not have a closed-form solution, and thus we need to apply the CG algorithm. Hence, we have to compute the gradients of Eqs. (20) and (35) w.r.t. the latent positions X and the kernel parameters \theta_s:

\frac{\partial L_s}{\partial X} = \sum_{v=1}^{V} \frac{\partial L^{(v)}}{\partial X} + \beta \frac{\partial L_X}{\partial X}, \qquad \frac{\partial L_s}{\partial \theta_s} = \Big[ \frac{\partial L^{(1)}}{\partial \theta^{(1)}}, \ldots, \frac{\partial L^{(V)}}{\partial \theta^{(V)}} \Big]^T,

\frac{\partial C}{\partial X} = \mu_t \sum_{v=1}^{V} \big( X - \mathrm{IBP}(X, A_t^{(v)}) + \Lambda_t^{(v)} \big), \qquad \frac{\partial C}{\partial \theta_s} = 0.

Each likelihood term L^{(v)} is a function of the kernel K^{(v)}; thus, we need to apply the chain rule in order to find the derivatives w.r.t. X and \theta^{(v)}:

\frac{\partial L^{(v)}}{\partial x_j} = \mathrm{tr}\Big[ \Big( \frac{\partial L^{(v)}}{\partial K^{(v)}} \Big)^{T} \frac{\partial K^{(v)}}{\partial x_j} \Big], \qquad \frac{\partial L^{(v)}}{\partial \theta^{(v)}} = \mathrm{tr}\Big[ \Big( \frac{\partial L^{(v)}}{\partial K^{(v)}} \Big)^{T} \frac{\partial K^{(v)}}{\partial \theta^{(v)}} \Big],

with

\frac{\partial L^{(v)}}{\partial K^{(v)}} = \frac{D}{2} (K^{(v)})^{-1} - \frac{1}{2} (K^{(v)})^{-1} Y_v Y_v^T (K^{(v)})^{-1}.

Finally, the derivatives of the selected kernel are

\frac{\partial k^{(v)}(x_i, x_j)}{\partial \theta_1^{(v)}} = \exp\big( -\tfrac{\theta_2}{2} \| x_i - x_j \|^2 \big),

\frac{\partial k^{(v)}(x_i, x_j)}{\partial \theta_2^{(v)}} = -\frac{\theta_1^{(v)}}{2} \| x_i - x_j \|^2 \exp\big( -\tfrac{\theta_2}{2} \| x_i - x_j \|^2 \big),

\frac{\partial k^{(v)}(x_i, x_j)}{\partial \theta_3^{(v)}} = 1, \qquad \frac{\partial k^{(v)}(x_i, x_j)}{\partial \theta_4^{(v)}} = -\frac{1}{(\theta_4^{(v)})^2} \delta_{ij},

and

\frac{\partial k^{(v)}(x_j)}{\partial x_j} = -\theta_2 \big[ (x_j - x_1)\, k^{(v)}(x_j, x_1), \; \ldots, \; (x_j - x_N)\, k^{(v)}(x_j, x_N) \big]^T.
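To make the chain rule above concrete, here is a minimal numpy sketch for a single view: it evaluates \partial L/\partial K for the GP likelihood and assembles \partial L/\partial x_j from the RBF part of the kernel. This is an illustrative re-implementation under our own variable names, not the authors' code; the CG loop, the prior term, and the ADM penalty are omitted.

```python
import numpy as np

def kernel(X, th1, th2, th3, th4):
    """k(x_i,x_j) = th1*exp(-(th2/2)||x_i-x_j||^2) + th3 + delta_ij/th4."""
    sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
    rbf = th1 * np.exp(-0.5 * th2 * sq)
    return rbf + th3 + np.eye(len(X)) / th4, rbf

def grad_X(X, Y, th1, th2, th3, th4):
    """Gradient of L = (D/2) ln|K| + (1/2) tr(K^{-1} Y Y^T) w.r.t. the latent X."""
    N, D = Y.shape
    K, rbf = kernel(X, th1, th2, th3, th4)
    Kinv = np.linalg.inv(K)
    KinvY = Kinv @ Y
    dL_dK = 0.5 * (D * Kinv - KinvY @ KinvY.T)   # dL/dK as in the text
    g = np.zeros_like(X)
    for j in range(N):
        # dk(x_j, x_i)/dx_j = -th2 (x_j - x_i) k_rbf(x_j, x_i); row j of dL/dK
        # contributes twice (row and column), since both K and dL/dK are symmetric.
        dk = -th2 * (X[j] - X) * rbf[j][:, None]  # N x q; its j-th row is zero
        g[j] = 2 * (dL_dK[j][:, None] * dk).sum(axis=0)
    return g

# Tiny smoke test with random data (q = 2 latent dimensions).
rng = np.random.default_rng(0)
g = grad_X(rng.normal(size=(15, 2)), rng.normal(size=(15, 4)), 1.0, 1.0, 0.1, 100.0)
```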
APPENDIX B
LOO SOLUTION OF THE REGRESSION STEP IN ADM

Herein, we derive the solution for the more general form of the IBP case. The same steps can be followed to arrive at the solution of the SBP case. The optimal values of the parameters A^{(v)} are given by the solution of the linear equation

(K_{bc}^{(v)} + \lambda^{(v)} I)\, A^{(v)} = X + \Lambda_t^{(v)}.  (36)

The system of linear equations defined by Eq. (36) is insensitive to permutations of the ordering of the equations and the variables. Thus, at each iteration of the LOO, the i-th left-out sample and the corresponding equation can be placed on top without affecting the result. This enables us to define the matrix M as in Eq. (31). By placing M back in Eq. (36), we end up with the following linear system of equations:

\begin{bmatrix} m_{ii} & \mathbf{m}_i^T \\ \mathbf{m}_i & M_{\setminus i} \end{bmatrix} A^{(v)} = \begin{bmatrix} x_i + \Lambda_i^{(v)} \\ X_{\setminus i} + \Lambda_{\setminus i}^{(v)} \end{bmatrix}.  (37)

Now, the solution for the parameters of the regression with the i-th sample excluded is A_{\setminus i}^{(v)} = M_{\setminus i}^{-1} (X_{\setminus i} + \Lambda_{\setminus i}^{(v)}), and the LOO prediction of the i-th sample is given by

\hat{x}_{(\setminus i)} = \mathbf{m}_i^T A_{\setminus i}^{(v)} = \mathbf{m}_i^T M_{\setminus i}^{-1} (X_{\setminus i} + \Lambda_{\setminus i}^{(v)}).

From Eq. (37) we have

x_i + \Lambda_i^{(v)} = m_{ii} A_i^{(v)} + \mathbf{m}_i^T A_{-i}^{(v)},  (38)

where A_{-i}^{(v)} denotes the remaining rows of the full solution A^{(v)}, and thus the error between the prediction \hat{x}_{(\setminus i)} and the actual output x_i is

x_i - \hat{x}_{(\setminus i)} = \big( m_{ii} - \mathbf{m}_i^T M_{\setminus i}^{-1} \mathbf{m}_i \big) A_i^{(v)} - \Lambda_i^{(v)} = \frac{A_i^{(v)}}{[M^{-1}]_{ii}} - \Lambda_i^{(v)},

where in the last equation we used the Schur complement from the block matrix inversion lemma, and [M^{-1}]_{ii} denotes the i-th diagonal element of the matrix M^{-1}. Finally, we end up with the cost of the LOO for all samples, E_{LOO}, as defined in Eq. (34). For the SBP case we follow exactly the same steps, with the difference that we drop from all the equations the dependences on the view v, and we replace K_{bc}^{(v)} with K = \sum_{v=1}^{V} w_v K_{bc}^{(v)}.

Our final goal is to find the optimal parameters \gamma^{(v)} and \lambda^{(v)} that minimize the error of the LOO cross-validation, defined by Eq. (34). For this, we need to calculate the derivatives of E_{LOO} w.r.t. \gamma^{(v)} and \lambda^{(v)}. We first define the diagonal matrix

D = \mathrm{diag}\big( 1/[M^{-1}]_{11}, \ldots, 1/[M^{-1}]_{NN} \big),

which allows us to reformulate Eq. (34) into

E_{LOO} = \frac{1}{2} \big\| D A^{(v)} - \Lambda^{(v)} \big\|^2.  (39)

Using the chain rule, the derivatives of Eq. (39) are given by

\frac{\partial E_{LOO}}{\partial \lambda^{(v)}} = \mathrm{tr}\Big[ \Big( \frac{\partial E_{LOO}}{\partial A^{(v)}} \Big)^{T} \frac{\partial A^{(v)}}{\partial \lambda^{(v)}} + \Big( \frac{\partial E_{LOO}}{\partial D} \Big)^{T} \frac{\partial D}{\partial \lambda^{(v)}} \Big]

and

\frac{\partial E_{LOO}}{\partial \gamma^{(v)}} = \mathrm{tr}\Big[ \Big( \frac{\partial E_{LOO}}{\partial A^{(v)}} \Big)^{T} \frac{\partial A^{(v)}}{\partial \gamma^{(v)}} + \Big( \frac{\partial E_{LOO}}{\partial D} \Big)^{T} \frac{\partial D}{\partial \gamma^{(v)}} \Big],

while the detailed derivatives inside the trace terms are

\frac{\partial E_{LOO}}{\partial A^{(v)}} = D^T \big( D A^{(v)} - \Lambda^{(v)} \big), \qquad \frac{\partial E_{LOO}}{\partial D} = \big[ ( D A^{(v)} - \Lambda^{(v)} ) (A^{(v)})^T \big] \circ I,

\frac{\partial A^{(v)}}{\partial \lambda^{(v)}} = -M^{-1} \frac{\partial M}{\partial \lambda^{(v)}} M^{-1} (X + \Lambda_t^{(v)}) = -M^{-1} A^{(v)},

\frac{\partial A^{(v)}}{\partial \gamma^{(v)}} = -M^{-1} \frac{\partial M}{\partial \gamma^{(v)}} M^{-1} (X + \Lambda_t^{(v)}) = -M^{-1} \frac{\partial K_{bc}^{(v)}}{\partial \gamma^{(v)}} A^{(v)},

\frac{\partial D}{\partial \lambda^{(v)}} = -(D \circ D) \circ \frac{\partial M^{-1}}{\partial \lambda^{(v)}} = (D \circ D) \circ \big( M^{-1} M^{-1} \big),

\frac{\partial D}{\partial \gamma^{(v)}} = -(D \circ D) \circ \frac{\partial M^{-1}}{\partial \gamma^{(v)}} = (D \circ D) \circ \Big( M^{-1} \frac{\partial K_{bc}^{(v)}}{\partial \gamma^{(v)}} M^{-1} \Big),

where the value of \partial K_{bc}^{(v)}/\partial \gamma^{(v)} for each element of the kernel is given in Appendix A, and \circ denotes the Hadamard product of two matrices. Once we have obtained the optimal parameters \gamma^{(v)} and \lambda^{(v)}, we can compute A^{(v)} from Eq. (36).
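The practical value of this derivation is that all N leave-one-out residuals follow from a single factorization of M, instead of re-solving the system N times. Below is a minimal numpy sketch of this shortcut under our own notation (M = K + \lambda I, with `targets` standing in for X + \Lambda_t); the brute-force check at the end confirms the Schur-complement identity.

```python
import numpy as np

def loo_residuals(K, targets, lam):
    """LOO residuals of the kernel ridge regression (K + lam*I) A = targets.

    Uses the Schur-complement identity from the text: the i-th LOO residual
    equals A_i / [M^{-1}]_ii, so one inverse of M yields all N residuals.
    """
    M = K + lam * np.eye(K.shape[0])
    Minv = np.linalg.inv(M)
    A = Minv @ targets                 # full-data solution of Eq. (36)
    d = np.diag(Minv)                  # the elements [M^{-1}]_ii
    return A / d[:, None]              # residual_i = A_i / [M^{-1}]_ii

# Sanity check against brute-force leave-one-out for one sample.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
K = np.exp(-0.5 * ((X[:, None, :] - X[None, :, :])**2).sum(-1))
T = rng.normal(size=(30, 2))
res = loo_residuals(K, T, lam=0.1)
i, keep = 7, np.arange(30) != 7
A_wo = np.linalg.solve(K[np.ix_(keep, keep)] + 0.1 * np.eye(29), T[keep])
assert np.allclose(res[i], T[i] - K[i, keep] @ A_wo)
```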
REFERENCES

[1] M. Pantic, A. Nijholt, A. Pentland, and T. S. Huang, "Human-centred intelligent human-computer interaction (HCI^2): How far are we from attaining it?" Int. J. Auton. Adapt. Commun. Syst., vol. 1, no. 2, 2007.
[2] A. Vinciarelli, M. Pantic, and H. Bourlard, "Social signal processing: Survey of an emerging domain," Image Vis. Comput., vol. 27, no. 12, Nov. 2009.
[3] Z. Zeng, M. Pantic, G. I. Roisman, and T. S. Huang, "A survey of affect recognition methods: Audio, visual, and spontaneous expressions," IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 1, Jan. 2009.
[4] Z. Zhu and Q. Ji, "Robust real-time face pose and facial expression recovery," in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., vol. 1, Jun. 2006.
[5] P. Ekman and W. V. Friesen, Unmasking the Face: A Guide to Recognizing Emotions from Facial Clues. Los Altos, CA, USA: ISHK.
[6] S. Moore and R. Bowden, "Local binary patterns for multi-view facial expression recognition," Comput. Vis. Image Understand., vol. 115, no. 4, Apr. 2011.
[7] Y. Hu, Z. Zeng, L. Yin, X. Wei, J. Tu, and T. S. Huang, "A study of non-frontal-view facial expressions recognition," in Proc. 19th Int. Conf. Pattern Recognit., Dec. 2008.
[8] N. Hesse, T. Gehrig, H. Gao, and H. K. Ekenel, "Multi-view facial expression recognition using local appearance features," in Proc. 21st Int. Conf. Pattern Recognit., Nov. 2012.
[9] O. Rudovic, M. Pantic, and I. Patras, "Coupled Gaussian processes for pose-invariant facial expression recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 6, Jun. 2013.
[10] O. Rudovic, I. Patras, and M. Pantic, "Regression-based multi-view facial expression recognition," in Proc. 20th Int. Conf. Pattern Recognit. (ICPR), Istanbul, Turkey, Aug. 2010.
[11] W. Zheng, H. Tang, Z. Lin, and T. S. Huang, "Emotion recognition from arbitrary view facial images," in Proc. Eur. Conf. Comput. Vis., 2010.
[12] U. Tariq, J. Yang, and T. S. Huang, "Multi-view facial expression recognition analysis with generic sparse coding feature," in Proc. Eur. Conf. Comput. Vis. (ECCV), Oct. 2012.
[13] A. Sharma, A. Kumar, H. Daumé III, and D. W. Jacobs, "Generalized multiview analysis: A discriminative latent space," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2012.
[14] M. Kan, S. Shan, H. Zhang, S. Lao, and X. Chen, "Multi-view discriminant analysis," in Proc. Eur. Conf. Comput. Vis., Oct. 2012.

[15] A. P. Shon, K. Grochow, A. Hertzmann, and R. P. N. Rao, "Learning shared latent structure for image synthesis and robotic imitation," in Proc. Adv. Neural Inf. Process. Syst.
[16] C. H. Ek, "Shared Gaussian process latent variable models," Ph.D. dissertation, Dept. Comput. Sci., Oxford Brookes Univ., Headington, U.K.
[17] R. Urtasun and T. Darrell, "Discriminative Gaussian process latent variable model for classification," in Proc. 24th Int. Conf. Mach. Learn., Jun. 2007.
[18] D. P. Bertsekas, Constrained Optimization and Lagrange Multiplier Methods (Computer Science and Applied Mathematics), vol. 1. Boston, MA, USA: Academic.
[19] S. Eleftheriadis, O. Rudovic, and M. Pantic, "Shared Gaussian process latent variable model for multi-view facial expression recognition," in Proc. ISVC, Jul. 2013.
[20] T. Ojala, M. Pietikäinen, and T. Mäenpää, "Multiresolution gray-scale and rotation invariant texture classification with local binary patterns," IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 7, Jul. 2002.
[21] C. Cortes and V. Vapnik, "Support-vector networks," Mach. Learn., vol. 20, no. 3, Sep. 1995.
[22] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," Int. J. Comput. Vis., vol. 60, no. 2, Nov. 2004.
[23] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., vol. 1, Jun. 2005.
[24] T. F. Cootes, G. J. Edwards, and C. J. Taylor, "Active appearance models," IEEE Trans. Pattern Anal. Mach. Intell., vol. 23, no. 6, Jun. 2001.
[25] N. Ahmed, T. Natarajan, and K. R. Rao, "Discrete cosine transform," IEEE Trans. Comput., vol. C-23, no. 1, Jan. 1974.
[26] D. G. Lowe, "Object recognition from local scale-invariant features," in Proc. 7th IEEE Int. Conf. Comput. Vis., vol. 2, Sep. 1999.
[27] J. Yang, K. Yu, Y. Gong, and T. Huang, "Linear spatial pyramid matching using sparse coding for image classification," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2009.
[28] H. Hotelling, "Relations between two sets of variates," Biometrika, vol. 28, nos. 3-4, 1936.
[29] D. Hardoon, S. Szedmak, and J. Shawe-Taylor, "Canonical correlation analysis: An overview with application to learning methods," Neural Comput., vol. 16, no. 12, Dec. 2004.
[30] J. Rupnik and J. Shawe-Taylor, "Multi-view canonical correlation analysis," in Proc. Conf. Data Mining Data Warehouses, 2010.
[31] A. Kumar and H. Daumé III, "A co-training approach for multi-view spectral clustering," in Proc. Int. Conf. Mach. Learn., 2011.
[32] T. Diethe, D. R. Hardoon, and J. Shawe-Taylor, "Constructing nonlinear discriminants from multiple data views," in Machine Learning and Knowledge Discovery in Databases. Berlin, Germany: Springer-Verlag, 2010.
[33] C. M. Bishop, Pattern Recognition and Machine Learning, vol. 4. New York, NY, USA: Springer-Verlag, 2006.
[34] X. Niyogi, "Locality preserving projections," in Proc. Neural Inf. Process. Syst.
[35] N. Lawrence, "Probabilistic non-linear principal component analysis with Gaussian process latent variable models," J. Mach. Learn. Res., vol. 6, Nov. 2005.
[36] G. Zhong, W.-J. Li, D.-Y. Yeung, X. Hou, and C.-L. Liu, "Gaussian process latent random field," in Proc. 24th AAAI Conf. Artif. Intell., 2010.
[37] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning, vol. 1. Cambridge, MA, USA: MIT Press, 2006.
[38] N. D. Lawrence and J. Quiñonero-Candela, "Local distance preservation in the GP-LVM through back constraints," in Proc. Int. Conf. Mach. Learn., 2006.
[39] F. Chung, Spectral Graph Theory. Providence, RI, USA: AMS, 1997.
[40] H. Rue and L. Held, Gaussian Markov Random Fields: Theory and Applications. London, U.K.: Chapman & Hall, 2005.
[41] M. Salzmann and R. Urtasun, "Implicitly constrained Gaussian process regression for monocular non-rigid pose estimation," in Proc. Adv. Neural Inf. Process. Syst., 2010.
[42] X. Zhu, J. D. Lafferty, and Z. Ghahramani, "Semi-supervised learning: From Gaussian fields to Gaussian processes," School Comput. Sci., Carnegie Mellon Univ., Pittsburgh, PA, USA, Tech. Rep. CMU-CS.
[43] J. Hainmueller and C. Hazlett, "Kernel regularized least squares: Reducing misspecification bias with a flexible and interpretable machine learning approach," Political Anal., vol. 22, no. 2, 2014.
[44] B. Schölkopf, R. Herbrich, and A. J. Smola, "A generalized representer theorem," in Computational Learning Theory. Berlin, Germany: Springer-Verlag, 2001.
[45] S. Sundararajan and S. Keerthi, "Predictive approaches for choosing hyperparameters in Gaussian processes," Neural Comput., vol. 13, no. 5, May 2001.
[46] R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker, "Multi-PIE," Image Vis. Comput., vol. 28, no. 5, May 2010.
[47] P. N. Belhumeur, D. W. Jacobs, D. Kriegman, and N. Kumar, "Localizing parts of faces using a consensus of exemplars," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2011.
[48] A. Dhall, R. Goecke, S. Lucey, and T. Gedeon, "Static facial expression analysis in tough conditions: Data, evaluation protocol and benchmark," in Proc. Int. Conf. Comput. Vis. Workshops, Nov. 2011.
[49] C. Sagonas, G. Tzimiropoulos, S. Zafeiriou, and M. Pantic, "A semi-automatic methodology for facial landmark annotation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jun. 2013.
[50] V. Ojansivu and J. Heikkilä, "Blur insensitive texture classification using local phase quantization," in Image and Signal Processing. Berlin, Germany: Springer-Verlag, 2008.
[51] A. Bosch, A. Zisserman, and X. Munoz, "Representing shape with a spatial pyramid kernel," in Proc. Int. Conf. Image Video Retr., 2007.
[52] Z. Zheng, F. Yang, W. Tan, J. Jia, and J. Yang, "Gabor feature-based face recognition using supervised locality preserving projection," Signal Process., vol. 87, no. 10, Oct. 2007.
[53] M. Pantic and I. Patras, "Dynamics of facial expression: Recognition of facial actions and their temporal segments from face profile image sequences," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 36, no. 2, Apr. 2006.

Stefanos Eleftheriadis received the Diploma degree in electrical and computer engineering from the Aristotle University of Thessaloniki, Thessaloniki, Greece. He was a recipient of the National Award in Microsoft's Imagine Cup Software Development Competition. He is currently pursuing the Ph.D. degree with the Department of Computing, Imperial College London, London, U.K. His research interests are in automatic human behavior analysis, machine learning, and computer vision.

Ognjen Rudovic received the Ph.D. degree from the Department of Computing, Imperial College London, London, U.K., in 2014, and the M.Sc. degree in computer vision and artificial intelligence from the Computer Vision Center, Barcelona, Spain. He is currently a Research Associate with the Department of Computing, Imperial College London. His research interests are in automatic recognition of human affect, machine learning, and computer vision.

Maja Pantic is currently a Professor of Affective and Behavioral Computing with the Department of Computing, Imperial College London, London, U.K., and the Department of Computer Science, University of Twente, Enschede, The Netherlands. She was a recipient of various awards for her work on automatic analysis of human behavior, including the European Research Council Starting Grant Fellowship in 2008 and the Roger Needham Award. She serves as the Editor-in-Chief of the Image and Vision Computing journal, and as an Associate Editor of the IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS, PART B and the IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE.
