Disentangling factors of variation for facial expression recognition


Salah Rifai, Yoshua Bengio, Aaron Courville, Pascal Vincent, and Mehdi Mirza
Université de Montréal, Department of Computer Science and Operations Research

Abstract. We propose a semi-supervised approach to solve the task of emotion recognition in 2D face images using recent ideas in deep learning for handling the factors of variation present in data. An emotion classification algorithm should be robust both to (1) remaining variations due to the pose of the face in the image after centering and alignment, and (2) the identity or morphology of the face. In order to achieve this invariance, we propose to learn a hierarchy of features in which we gradually filter the factors of variation arising from both (1) and (2). We address (1) by using a multi-scale contractive convolutional network (CCNET) in order to obtain invariance to translations of the facial traits in the image. Using the feature representation produced by the CCNET, we train a Contractive Discriminative Analysis (CDA) feature extractor, a novel variant of the Contractive Auto-Encoder (CAE), designed to learn a representation separating out the emotion-related factors from the others (which mostly capture the subject identity, and what is left of pose after the CCNET). This system beats the state-of-the-art on a recently proposed dataset for facial expression recognition, the Toronto Face Database, moving the state-of-the-art accuracy from 82.4% to 85.0%, while the CCNET and CDA improve the accuracy of a standard CAE by 8%.

Key words: emotion recognition, contractive, convolution, deep learning, auto-encoder, TFD

1 Introduction

A central challenge in computer vision is to disentangle the various factors of variation that explain an image, such as object pose, identity, or various other attributes. This is particularly important for facial expression recognition, the central topic of this paper.
Disentangling can be done by exploiting two sources of information, a priori knowledge about these factors and examples, combined by learning algorithms tailored to this kind of disentangling. While our central contribution is in such learning algorithms, this paper is really about how we can combine the two. It exploits advances in machine learning of representations (i.e., sets of features) at multiple levels, i.e., a form of automatic feature extraction pipeline called deep learning [1]. It is based on unsupervised learning that captures the leading local directions of variation present in the data,

a kind of non-linear manifold learning [2]. The specific unsupervised learning algorithm that we build upon here is the Contractive Auto-Encoder (CAE) [3]. The proposed system is semi-supervised: it uses the emotion label as a hint about the factor of variation of interest and combines that hint with an unsupervised training criterion to separate out one factor from the others. At the top level of the feature hierarchy, two blocks of features are trained, with the features in one block being sensitive to emotion and more invariant to the other factors, while the features in the other block are trained to be insensitive to emotion changes. This is achieved by adding together the discriminant criterion (predicting emotion), a reconstruction error, the CAE contractive penalty, and a novel local orthogonality penalty that encourages the two blocks of features to vary in directions orthogonal to each other, i.e., such that features specialize to some factors (such as emotions) while being insensitive to others. The resulting algorithm is termed Contractive Discriminative Analysis (CDA). The proposed facial expression recognition system is evaluated on a recently proposed benchmark dataset, the Toronto Face Database [4], and yields results that beat the state-of-the-art on this dataset, showing the improvement brought by CDA.

2 Background

2.1 Facial Expression Recognition

Despite receiving considerable attention over the past 15 years [5, 6], the automatic recognition of facial expression remains a very challenging problem domain. One of the major obstacles to performance is that aspects of the data associated with facial expressions are tightly intertwined with other factors evident in the data. These other factors are primarily associated with subject identity (facial morphology) and pose. With respect to the task of facial expression detection these can be considered nuisance factors.
The challenge with facial expression recognition is that these nuisance factors often dominate the representation of the image in pixel space. Two images of different individuals with the same facial expression are likely to be well separated in pixel space, while two images of the same individual showing different expressions may well be found very close together.

2.2 The Toronto Face Database

Beyond the inherent challenges in learning to recognize facial expressions, the relative paucity of easily accessible data has been a significant barrier to progress. Until recently, datasets have been limited to a relatively small number of subjects displaying different expressions, with exogenous factors such as illumination and pose being carefully controlled. The recent introduction of the Toronto Face Database (TFD) [4] is a significant step forward in our ability to build robust recognition systems (examples shown in Fig. 3(a)). The TFD is a conglomeration of a large number of smaller face datasets, with each image aligned and rescaled to a uniform size of 48 × 48 pixels. The dataset consists of 4,178 expression-labeled images, 3,874 of which also possess subject identity labels. There is also a very

large (112,234 image) unlabeled dataset that, while missing expression label information, is preprocessed in the same way as the labeled data. Because we exploit unsupervised and semi-supervised learning procedures, our approach can take advantage of additional unlabeled data to learn better features.

2.3 Invariant Features: the Standard Pipeline

Facial expression recognition is certainly not unique among vision tasks in having to deal with nuisance factors of variation in the data. In other object and scene recognition settings, nuisance factors are typically related to object pose and illumination conditions. In recent years, solution strategies for these tasks have largely converged toward a multistage processing pipeline [7, 8]. First, local low-level features, such as SIFT [9], HoG [10] or, in the case of facial expression recognition, oriented Gabor filter banks [11], are extracted from patches of the image. Next, these features are spatially aggregated, or pooled, over different regions of the image and sometimes at different spatial resolutions. The output of each pooling unit is a sum, mean or maximum over the outputs of a filter bank over a small spatial area. The aggregate features are then mapped into a vector image representation that is used as input to a general purpose classifier such as a linear support vector machine (SVM). The success of this approach can be attributed in good part to the quality of the feature representation used as input to the classifier. If this representation is invariant to the nuisance factors of variation while maintaining sensitivity along the relevant factors, then one would expect the system to generalize well to new examples.

2.4 Invariant Features: Unsupervised Feature Learning

The features used in the above pipeline can be hand-crafted or can be learned.
Much research has been done in recent years in generic as well as image-specific feature learning algorithms. Most of these algorithms exploit unsupervised learning and can therefore be applied even in the absence of labels. When a hierarchy of features is trained, these are called deep models [1]. These hierarchical feature learning approaches are based on the unsupervised training of single layer models such as an RBM, sparse coding or auto-encoder variants. One particularly successful recent variant of the auto-encoder is the contractive auto-encoder (CAE) [3]. The principle underlying the CAE is that locally invariant features can be induced through activity dependent regularization. The regularization penalty discourages changes in the features associated with small changes in the input image. The mathematical details of the CAE are provided below (Section 4). When compared to the standard multistage pipeline, the feature learning strategy pursues a different and complementary strategy toward constructing invariant representations with good generalization properties. While the standard pipeline builds invariance to known nuisance factors of variation by aggregating over lower-level features that vary across these factors, unsupervised feature learning includes all significant factors, including the nuisance factors, but to the extent that these factors are statistically independent in the data, it tends to represent these factors separately. In the ideal scenario, while the learned

representation still contains the nuisance factors, they are disentangled from the relevant factors and can thus be more easily ignored by the subsequent classifier. Both the hand-built features and feature-learning approaches have their advantages. The standard pipeline exploits domain knowledge about the spatial relationship between pixels to construct features that are invariant to simple transformations such as translation. On the other hand, a feature-learning strategy may be more successful at building features that are invariant to factors that do not correspond to simple transformations. In the case of facial expression recognition, many of the factors associated with facial morphology fall into this category. While invariance to simple transformations such as local translations, rotations and scaling goes some way toward removing sensitivity to facial morphology, it stops well short of capturing all variations in facial characteristics across the human population. Yet, unfortunately, due to the tight coupling of the factors of variation underlying facial expression and morphology, even our most successful unsupervised learning-based approaches are unlikely to satisfactorily disentangle these factors on their own.

3 Proposed Approach

In this work, we deal directly with this issue of entangled factors of variation by developing a semi-supervised feature learning strategy that combines the advantages of both a feature learning approach and the feature pooling pipeline. Ours is a hierarchical (or deep) modeling approach. At each layer, the features become increasingly invariant to nuisance factors while maintaining discriminative information with respect to the task of facial expression recognition. Our approach can be broken down into the following three stages of the learning procedure.

1.
We use the CAE algorithm (described in Section 4) to learn locally invariant image features from image patches at multiple resolutions. The CAE-derived feature extractors are applied convolutionally to the entire image to form a series of feature maps, each corresponding to a single learned CAE feature, and to a convolution kernel. These feature maps are then decimated via max-pooling in regular non-overlapping regions to form a local-translation-invariant (LTI) representation. This first stage is termed the Contractive Convolutional Network (CCNET) and is described in more detail in Section 5 below.

2. The LTI features are then used as input to a novel semi-supervised, CAE-based feature-learning approach we call the Contractive Discriminative Analysis (CDA). The basic approach is to divide the features to be learned at this layer into two blocks. While the blocks are trained to cooperate to reconstruct their mutual input, one of these blocks (the discriminative feature block) is also trained to predict the facial expression class on examples where label information is available. Our objective in segregating the features in this way is to tease apart the discriminative features that learn to encode useful information about facial expressions from nuisance features (that are complementary but not task-discriminative). We further include a

novel CAE-inspired penalty that locally encourages the discriminative features and non-discriminative features to encode distinct directions of variation in the input. The resulting CDA learning algorithm is described formally in Section 6.

3. Finally, following the standard pipeline, the discriminative features are used as input to train a linear SVM on the labeled training data.

Once the system has been trained, the learned features form the basis of a multistage classification pipeline, similar to commonly used classification pipelines [7, 8]. To summarize, the computational stages of the classification pipeline (which closely follow the training pipeline outlined above) are as follows.

1. The multi-resolution CAE features are convolved over the entire image. This produces a set of feature maps (one for each CAE feature).
2. The convolutional CAE feature maps are decimated (via max-pooling) to a coarse grid (2 × 2 or 3 × 3) over the image.
3. These decimated features are then concatenated and the CDA discriminative feature encoding is applied to this concatenation. Note that when performing classification, we no longer need to compute the block of non-discriminative CDA features.
4. Finally, the CDA encoding (the discriminative block) is passed to the linear SVM to obtain a class prediction. In the case of facial expression recognition, these class predictions correspond to one of the seven recognized expressions: happy, sad, surprised, anger, disgust, fear and neutral.

Fig. 1 illustrates the classification pipeline.

Fig. 1. Classification pipeline. At left: input image at different resolutions. The next stage contains the output of the K convolutional feature maps, followed by the max-pooling ($y$). The CDA produces the last stage of (discriminant) features $h^{(d)}$ through the weight matrix $W$, which are then fed to a linear SVM.
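The first two stages of the classification pipeline (convolving patch encoders over the image, then max-pooling to a coarse grid) can be sketched in NumPy as follows. This is a minimal illustration and not the authors' implementation: the function names, the random untrained weights, and the small kernel count `K = 4` are assumptions chosen to keep the example fast.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feature_maps(image, A, alpha, size):
    """Apply one patch encoder s(A x + alpha) at every size-by-size patch."""
    # windows has shape (H - size + 1, W - size + 1, size, size)
    windows = np.lib.stride_tricks.sliding_window_view(image, (size, size))
    patches = windows.reshape(windows.shape[0], windows.shape[1], size * size)
    return sigmoid(patches @ A.T + alpha)  # shape (rows, cols, K)

def grid_max_pool(maps, grid):
    """Max-pool each feature map over grid-by-grid non-overlapping regions."""
    row_blocks = np.array_split(np.arange(maps.shape[0]), grid)
    col_blocks = np.array_split(np.arange(maps.shape[1]), grid)
    pooled = [maps[np.ix_(r, c)].max(axis=(0, 1))
              for r in row_blocks for c in col_blocks]
    return np.concatenate(pooled)  # length grid * grid * K

def lti_representation(image, encoders, grid=3):
    """Concatenate the pooled features of every scale into one vector."""
    return np.concatenate([
        grid_max_pool(feature_maps(image, A, alpha, size), grid)
        for size, (A, alpha) in encoders.items()
    ])

rng = np.random.default_rng(0)
image = rng.random((48, 48))  # stand-in for one grayscale face image
K = 4                         # hypothetical number of kernels per scale
encoders = {size: (rng.normal(scale=0.1, size=(K, size * size)), np.zeros(K))
            for size in (14, 18)}
y = lti_representation(image, encoders)
print(y.shape)  # (72,) = 2 scales * 4 kernels * 9 pooling regions
```

With 512 kernels per scale instead of 4, the same code would produce a vector of 2 × 512 × 9 = 9216 features.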
4 Contractive Auto-Encoder (CAE)

In this section, we briefly describe the CAE algorithm that is used for unsupervised feature learning. We closely follow the description of [3] for unsupervised

learning of a non-linear feature extractor from a dataset $D = \{x_1, \ldots, x_n\}$. Examples $x_i \in \mathbb{R}^d$ are i.i.d. samples from an unknown distribution $p(x)$.

4.1 Auto-Encoders

The auto-encoder framework is one of the oldest and simplest techniques for the unsupervised learning of non-linear feature extractors. It learns an encoder function $h$ that maps an input $x \in \mathbb{R}^d$ to a feature vector $h(x) \in \mathbb{R}^{d_h}$, jointly with a decoder function $g$ that maps $h$ back to the input space as $r = g(h(x))$, the reconstruction of $x$. The encoder and decoder's parameters $\theta$ can be optimized by stochastic gradient descent to minimize the average reconstruction error $L(x, g(h(x)))$ for the examples of the training set. The objective being minimized is:

$$J_{AE}(\theta) = \sum_{x \in D} L(x, g(h(x))). \qquad (1)$$

We will use the most common forms of encoder, decoder, and reconstruction error:

Encoder: $h(x) = s(Ax + \alpha)$, where $s$ is the element-wise logistic sigmoid $s(z) = \frac{1}{1 + e^{-z}}$. Parameters are a $d_h \times d$ weight matrix $A$ and bias vector $\alpha \in \mathbb{R}^{d_h}$.
Decoder: $\hat{x} = g(h(x)) = s_2(A^T h(x) + \beta)$. Parameters are $A^T$ (tied weights, shared with the encoder) and bias vector $\beta \in \mathbb{R}^d$. Activation function $s_2$ is either a logistic sigmoid ($s_2 = s$) or the identity (linear decoder).
Loss function: squared error, $L_{RECON}(x, \hat{x}) = \|x - \hat{x}\|^2$.

The set of parameters of such an auto-encoder is $\theta = \{A, \alpha, \beta\}$.

4.2 Contractive Regularization

For auto-encoders to learn something meaningful, they must have low reconstruction error on the training examples but large reconstruction error for most other input configurations. One way to achieve this is with the contractive penalty of the Contractive Auto-Encoder (CAE), introduced by [3]. This penalty term encourages robustness of the feature vector $h(x)$ to small variations of a training input $x$, by penalizing its sensitivity to that input, measured as the Frobenius norm of the encoder's Jacobian $J(x) = \frac{\partial h}{\partial x}(x)$.
The regularized objective minimized by the CAE is the following:

$$J_{CAE}(\theta) = \sum_{x \in D} \left( L(x, g(h(x))) + \lambda \|J(x)\|_F^2 \right), \qquad (2)$$

where $\lambda$ is a non-negative regularization hyper-parameter that controls how strongly the norm of the Jacobian is penalized, and $\|\cdot\|_F^2$ is the squared Frobenius matrix norm (the sum of the squares of the matrix elements). Note that, with the traditional sigmoid encoder form given above, one can easily obtain the Jacobian of the encoder. Its $j$-th row is obtained from the $j$-th row $A_j$ of $A$ as:

$$J(x)_j = \frac{\partial h_j(x)}{\partial x} = h_j(x)(1 - h_j(x)) A_j. \qquad (3)$$
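As a concrete reading of Eqs. (2) and (3), the sketch below evaluates the CAE objective in NumPy for a sigmoid encoder with tied weights. It is an illustration only: the function names, the random data and the choice of a sigmoid decoder ($s_2 = s$) are assumptions, and no training loop is shown. Because row $j$ of the Jacobian is $h_j(1 - h_j)A_j$, the squared Frobenius norm reduces to $\sum_j (h_j(1 - h_j))^2 \|A_j\|^2$, so the Jacobian never has to be formed explicitly.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def explicit_jacobian(x, A, alpha):
    """Jacobian of h(x) = s(A x + alpha) for one input x, following Eq. (3)."""
    h = sigmoid(A @ x + alpha)
    return (h * (1.0 - h))[:, None] * A  # row j is h_j (1 - h_j) A_j

def cae_objective(X, A, alpha, beta, lam):
    """Eq. (2): squared reconstruction error plus lam * ||J(x)||_F^2, summed over rows of X."""
    H = sigmoid(X @ A.T + alpha)        # encoder output, shape (n, d_h)
    X_hat = sigmoid(H @ A + beta)       # tied-weight sigmoid decoder, shape (n, d)
    recon = np.sum((X - X_hat) ** 2)
    row_norms = np.sum(A ** 2, axis=1)  # ||A_j||^2 for each feature j
    contractive = np.sum((H * (1.0 - H)) ** 2 * row_norms)
    return recon + lam * contractive

rng = np.random.default_rng(0)
d, d_h = 6, 4                           # toy sizes, chosen for illustration
X = rng.random((3, d))
A = rng.normal(size=(d_h, d))
alpha, beta = np.zeros(d_h), np.zeros(d)

# The closed form agrees with the explicitly computed Jacobian norm.
penalty_only = cae_objective(X, A, alpha, beta, 1.0) - cae_objective(X, A, alpha, beta, 0.0)
explicit = sum(np.sum(explicit_jacobian(x, A, alpha) ** 2) for x in X)
print(np.isclose(penalty_only, explicit))  # True
```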

Computing the extra penalty term (and its contribution to the gradient) is similar to computing the reconstruction error term (and its contribution to the gradient), and is thus relatively cheap. The effect of training a CAE is that the resulting features tend to be sparsely active: only a few of the features have a significantly non-zero derivative (i.e., when the sigmoid is neither saturated near 0 nor saturated near 1). The set of active features depends on the current input $x$. Those active features respond almost linearly to changes in the input, and they provide a local basis for the variations around $x$, in some privileged directions which are those to which they respond (corresponding to their weight vectors, e.g., $A_j$ for feature $h_j$). When the data congregate near a low-dimensional manifold around $x$, only a few features are active in the neighborhood of $x$. Hence the locally active features form a coordinate system for a region in input space, corresponding to a chart, and the overall set of such regions forms an atlas of charts [3] mapping out a non-linear manifold near which the estimated input density concentrates.

5 Contractive Convolutional Network (CCNET)

Convolutional neural networks are generalizations of neural networks which have been particularly successful in computer vision [12-16]. In a way similar to what has been done to generalize sparse coding and RBMs to the convolutional setting, we generalize CAEs to the convolutional setting, i.e., each convolutional feature output sees a spatially local region in the input image (called the receptive field), while sharing the parameters of that feature (convolution kernel) with other features that have a receptive field located elsewhere. This is equivalent to replacing the neural network matrix multiplication found in linear feature extraction by a series of convolutions, which correspond to sparse structured matrices.
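The equivalence between a convolution and multiplication by a sparse structured matrix can be checked on a one-dimensional toy example. The sketch below is illustrative only (the helper name `correlation_matrix` is hypothetical), and it uses the sliding-window filter response without kernel flipping, which is what convolutional networks usually compute.

```python
import numpy as np

def correlation_matrix(kernel, input_length):
    """Build the banded matrix whose product with x equals the
    'valid' sliding-window response of kernel over x."""
    m = len(kernel)
    n_out = input_length - m + 1
    T = np.zeros((n_out, input_length))
    for i in range(n_out):
        T[i, i:i + m] = kernel  # same shared weights, shifted one step per row
    return T

rng = np.random.default_rng(0)
x = rng.normal(size=10)
kernel = rng.normal(size=3)

T = correlation_matrix(kernel, len(x))
direct = np.correlate(x, kernel, mode="valid")
print(np.allclose(T @ x, direct))  # True
```

Each row of `T` holds the same three weights at a different offset, which is the parameter sharing described above; all other entries are zero.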
Whereas in the usual applications of convolutions for convolutional networks a single common receptive field (convolutional kernel size) is used, we consider multiple sizes, allowing the model to capture structure at different scales (see also [17] for a similar approach, with two scales). As done previously [14, 8], we chose to initialize a convolutional neural network whose filters have been pre-trained by unsupervised learning (here as a CAE) patch-wise, i.e., an ordinary CAE was trained with patches extracted randomly at different locations, and whose size matches that of the convolutional kernels being learned. We pre-train independently a CAE for each kernel size. Assuming that we have a set of $n$ different sizes $s_i$ (the results on TFD were obtained with the scales 14 × 14 and 18 × 18), we denote as follows the output on patches $x$ of the CAE trained with the $i$-th patch size:

$$h^i(x) = s(A^i x + \alpha^i)$$

We compute the corresponding feature maps on the whole image by applying $h^i$ to each $s_i$-by-$s_i$ patch of the input image,

$$f^i(x) = s(\mathrm{conv}(A^i, x) + \alpha^i)$$

followed by a maximum-pooling on $p$ uniformly divided non-overlapping regions $Q = \{q_1, \ldots, q_p\}$:

$$F^i_j(x) = \max_{k \in q_j} f^i_k(x)$$

The final output of the convolutional layer is defined as the stack of all the pooled features of each CAE into one long vector:

$$F(x) = [F^1(x), \ldots, F^n(x)] \quad \text{with} \quad F^i(x) = [F^i_1(x), \ldots, F^i_p(x)]$$

where we take $y = F(x)$ to be the LTI representation given as input to the higher-level CAEs that will perform the disentangling of the remaining factors of variation.

6 Contractive Discriminative Analysis

In this section we describe our major technical contribution, the Contractive Discriminative Analysis (CDA). CDA is a semi-supervised version of CAE that promotes the disentangling of discriminative factors of variation in the data from other prominent factors that may well dominate the discriminative factors. Our goal, in deriving CDA, is to separate the factors of the image that are discriminative with respect to the facial expression recognition task from factors that characterize facial morphology and pose. CDA is an extension of the CAE framework (introduced in Section 4). While the standard CAE encodes the data into a single feature vector $h(x)$, CDA learns an encoder function that maps an input into two (or more) distinct blocks of features: one that encodes discriminative factors of its input, $h^{(d)}(y) = s(Wy + c)$, and one (or more) that encode all other factors, $h^{(o)}(y) = s(Vy + b)$. Both feature blocks are trained to cooperate to reconstruct their common input $y$ with a reconstruction loss function, e.g.,

$$L_{RECON}(y, \hat{y}) = \|y - \hat{y}\|^2 \qquad (4)$$

where $\hat{y}$ is the CDA reconstruction, given by a linear combination of learned features:

$$\hat{y} = g([h^{(d)}(y), h^{(o)}(y)]) = s_2(W^T h^{(d)}(y) + V^T h^{(o)}(y) + \rho), \qquad (5)$$

where $\rho_i$ is an offset to capture the average value of $y_i$.
In addition, the $h^{(d)}(y)$ block is also trained to predict the facial expression label $z(y)$ when that information is available. The class prediction $\hat{z}_i = P(z = i \mid y)$ is given by the sigmoid function $s(\cdot)$ over an affine transformation of the discriminative block, similarly to logistic regression:

$$\hat{z}_i = s\left(U_i h^{(d)}(y) + a_i\right), \qquad (6)$$

where the weight vector $U_i$ maps the discriminative block $h^{(d)}(y)$ to the prediction for class $i$, and $a_i$ is the class-specific bias. The corresponding discriminant component of the overall loss function is:

$$L_{DISC}(z, \hat{z}) = -\sum_{i=1}^{C} \left[ z_i \log \hat{z}_i + (1 - z_i) \log(1 - \hat{z}_i) \right] \qquad (7)$$

with $(x, z) \in \mathcal{L}$, the labeled training set with input image $x$ and expression label $z$ (represented as a one-hot vector), and with the $x$'s in $\mathcal{L}$ a subset of the set of all input examples $D$ (some of which are unlabeled). To obtain semi-supervised training we add a CAE-inspired contractive penalty $J_{CDA}(y)$:

$$J_{CDA}(y) = \left\| \frac{\partial h^{(d)}(y)}{\partial y} \right\|_F^2 + \left\| \frac{\partial h^{(o)}(y)}{\partial y} \right\|_F^2 + \gamma \sum_{i,j} \left( \frac{\partial h^{(d)}_i(y)}{\partial y} \cdot \frac{\partial h^{(o)}_j(y)}{\partial y} \right)^2. \qquad (8)$$

The first two terms penalize sensitivity in $h^{(d)}(y)$ and $h^{(o)}(y)$ respectively to local variations in $y$ (as in the standard CAE), but crucially the third term encourages $h^{(d)}(y)$ and $h^{(o)}(y)$ to represent different directions of variation in the input $y$, by asking each sensitivity vector $\frac{\partial h^{(d)}_i(y)}{\partial y}$ of the $i$-th discriminant feature to prefer being orthogonal to every sensitivity vector $\frac{\partial h^{(o)}_j(y)}{\partial y}$ associated with the $j$-th non-discriminant feature $h^{(o)}_j$. The addition of this term to the CDA cost function is crucial in achieving our performance results. As we discuss later, it regularizes the CDA discriminative features in a manner analogous to how partial least squares can be interpreted as a regularized variant of canonical correlation analysis [18]. The coefficient $\gamma$ modulates the relative contribution of the orthogonalization penalty to the overall CDA contractive penalty. Putting all the components of the CDA loss function together we get:

$$L_{CDA}(\theta) = \sum_{x \in D,\, y = F(x)} \left[ L_{RECON}(y, \hat{y}) + \eta J_{CDA}(y) \right] + \sum_{(x, z) \in \mathcal{L},\, y = F(x)} L_{DISC}(z, \hat{z}) \qquad (9)$$

The coefficient $\eta$ weighs the contribution of the contractive penalty. The set of CDA parameters is $\theta = \{U, V, W, a, b, c, \rho\}$. The CDA training procedure is illustrated in Fig. 2.
As expressed here, CDA strictly disentangles discriminative factors from other prominent factors in the data. However, one could easily generalize the method to incorporate any form of additional side information that could be used to further disentangle factors of variation. This would be achieved by creating additional blocks and associating each of them with a set of predictive parameters that map the features in the block to the values of the factor of interest (like U and a above).
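To make the CDA cost concrete, the following NumPy sketch evaluates the contractive penalty of Eq. (8) and the per-example loss of Eq. (9) for sigmoid feature blocks. It is a sketch under stated assumptions rather than the authors' code: the function names, toy dimensions and random parameters are invented for illustration, the decoder is taken to be linear ($s_2$ = identity), and gradients and training are omitted. The orthogonality term uses the fact that the matrix product of the two block Jacobians contains every pairwise dot product of sensitivity vectors.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def block(y, M, bias):
    """Return a sigmoid feature block s(M y + bias) and its Jacobian w.r.t. y."""
    h = sigmoid(M @ y + bias)
    jac = (h * (1.0 - h))[:, None] * M  # row i is h_i (1 - h_i) M_i
    return h, jac

def cda_penalty(y, W, c, V, b, gamma):
    """Contractive penalty of Eq. (8) for one input y."""
    _, jac_d = block(y, W, c)
    _, jac_o = block(y, V, b)
    # (jac_d @ jac_o.T)[i, j] is the dot product of sensitivity vectors i and j
    ortho = np.sum((jac_d @ jac_o.T) ** 2)
    return np.sum(jac_d ** 2) + np.sum(jac_o ** 2) + gamma * ortho

def cda_loss(y, z, W, c, V, b, rho, U, a, eta, gamma):
    """Per-example term of Eq. (9); pass z=None for an unlabeled example."""
    h_d, _ = block(y, W, c)
    h_o, _ = block(y, V, b)
    y_hat = W.T @ h_d + V.T @ h_o + rho  # assumption: linear decoder
    loss = np.sum((y - y_hat) ** 2) + eta * cda_penalty(y, W, c, V, b, gamma)
    if z is not None:
        z_hat = sigmoid(U @ h_d + a)     # class predictions, Eq. (6)
        loss -= np.sum(z * np.log(z_hat) + (1.0 - z) * np.log(1.0 - z_hat))  # Eq. (7)
    return loss

rng = np.random.default_rng(0)
d, n_d, n_o, n_classes = 12, 5, 4, 7     # toy sizes, chosen for illustration
y = rng.random(d)
W, V = rng.normal(size=(n_d, d)), rng.normal(size=(n_o, d))
c, b, rho = np.zeros(n_d), np.zeros(n_o), np.zeros(d)
U, a = rng.normal(size=(n_classes, n_d)), np.zeros(n_classes)
z = np.eye(n_classes)[2]                 # one-hot expression label

unlabeled = cda_loss(y, None, W, c, V, b, rho, U, a, eta=1.0, gamma=1.0)
labeled = cda_loss(y, z, W, c, V, b, rho, U, a, eta=1.0, gamma=1.0)
print(labeled > unlabeled)  # True: the cross-entropy term is strictly positive
```

If the two blocks respond to disjoint input coordinates, every pairwise dot product vanishes and the penalty no longer depends on `gamma`, which is the behaviour the orthogonality term rewards.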

Fig. 2. Illustration of architecture and training procedure for CDA, which learns and separates two blocks of features (emotion-discriminant features $h^{(d)}$ and features $h^{(o)}$ capturing the other factors).

7 Connections to Previous Work

The name Contractive Discriminative Analysis (CDA) was inspired by the connection between our CAE-based approach and earlier linear methods such as linear discriminant analysis (LDA) [19], canonical correlation analysis (CCA) [20] and partial least squares (PLS) [21]. In fact, in the case of linear activation functions (with $h(y) = Wy$) and no orthogonality penalty ($\eta = 0$), the discriminative features that would be discovered by CDA would span the same subspace (in the non-overcomplete setting) as LDA and CCA (see footnote 2). Our use of the orthogonality-inducing contraction penalty (with CAE contraction coefficient $\lambda > 0$) has an important effect on the learned features. As previously mentioned, it acts as an additional regularization term on the discriminative features learned by CDA. Interestingly, in the linear setting (with $h(y) = Wy$), the effect of this penalty is to encourage the discriminative and non-discriminative features to be mutually orthogonal. This is reminiscent of the difference between CCA, which seeks a linear projection of the input that maximizes correlation with the label encoding, and PLS, which seeks an equivalent linear projection but rather maximizes covariance. PLS is considered a regularized form of CCA since it forces the projection to preserve additional information about the input, specifically in the covariance directions. By penalizing non-orthogonality in the projecting matrix, our CDA penalty acts in a way very similar to PLS. Another interesting connection can be drawn to deep learning techniques that combine a supervised objective with an unsupervised objective when learning a feature set.
This started with the idea of partially supervised training in [23], where the RBM or auto-encoder gradient (or estimated gradient) is added to the gradient of a global supervised objective for the deep network. A related idea was proposed in [24], which made it possible to train fairly deep networks in a semi-supervised setting (where only a few examples have a label, i.e., the supervised gradient is only added on these). A hybrid of discriminant (conditional log-likelihood) and generative (joint log-likelihood) gradients was also used to train discriminant RBMs [25]. The most significant difference between the CDA and these other semi-supervised feature learning strategies is that CDA explicitly deals with nuisance factors by relegating them to the non-discriminative feature set.

Footnote 2: In the discriminative setting, where one of the two projected matrices contains only label information, the CCA and LDA directions are the same [22].

These other approaches use the labels to encourage discriminative features


while relying on model capacity limitations to filter out the nuisance factors. In the CDA there is a deliberate and controlled loss of information (in directions of variance that correspond to these nuisance factors) in the discriminative feature block.

8 Experiments and Results

For our experiments, we use the same setup as [4] and [14]. We use the same 5 standard splits (folds) of the Toronto Face Dataset to repeatedly train our model and evaluate its performance for emotion classification (see footnote 3). For the CCNET training stage, since it is entirely unsupervised, we used the 112,234 unlabeled faces (48 × 48 grayscale images). More specifically, the CCNET was used to learn 512 convolutional kernels of size 14 × 14 and 512 of size 18 × 18. Figure 3 shows some of the learned kernels. Each post-sigmoid feature map obtained by applying one of these 1024 kernels was max-pooled within 3 × 3 regions, yielding 1024 × 9 = 9216 features in total.

Fig. 3. Left: Example images from the Toronto face database [4]. Center-Right: Convolutional kernels learnt by the CAE (center: 14 × 14, right: 18 × 18). Smaller kernel sizes tend to learn features that are more local in the 2D image space.

For the following CDA stage, training examples were sampled with 50% probability from the TFD unlabeled set and 50% probability from the TFD labeled training set (of the considered split). This is to make sure that the less numerous labeled examples get seen often enough during training, since they contain crucial information that we do not want to swamp under the signals brought by the unlabeled examples. For each TFD standard split, the CDA was trained to extract 1000 discriminative features and 1000 non-discriminative features. These features were then fed to a linear SVM for assessing final classification performance.
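The 50/50 sampling scheme for the CDA stage can be sketched as below. This is a hypothetical illustration: the function name, the batch size and the labeled-set size of 3,000 are assumptions (the per-split labeled training set size is not stated here); only the unlabeled set size of 112,234 comes from the text.

```python
import numpy as np

def sample_cda_batch(rng, n_unlabeled, n_labeled, batch_size, p_labeled=0.5):
    """Draw example indices, each taken from the labeled set with probability p_labeled."""
    use_labeled = rng.random(batch_size) < p_labeled
    labeled_idx = rng.integers(0, n_labeled, size=batch_size)
    unlabeled_idx = rng.integers(0, n_unlabeled, size=batch_size)
    # use_labeled[k] says which of the two datasets index k refers to
    return use_labeled, np.where(use_labeled, labeled_idx, unlabeled_idx)

rng = np.random.default_rng(0)
use_labeled, idx = sample_cda_batch(rng, n_unlabeled=112_234, n_labeled=3_000,
                                    batch_size=10_000)
print(use_labeled.mean())  # close to 0.5
```

Sampling uniformly from the pooled data instead would draw a labeled example only about 3% of the time under these assumed sizes, which is the swamping effect the 50/50 scheme avoids.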
Footnote 3: Due to time constraints, we concentrated on the first fold for tuning the model's hyper-parameters. We retained the values of the hyper-parameters that yielded best linear SVM performance on that first fold's validation set, and used them unchanged for the other folds. This strategy was used to select both the CCNET hyper-parameters (kernel sizes, CAE regularization strength $\lambda$, and pooling regions), and the CDA hyper-parameters (number of discriminative and non-discriminative features and $\gamma$) that are reported in the main text.

Performance averaged over the 5 splits was 43.01% accuracy when using

the non-discriminative features versus 85.06% when using the discriminative features. This indicates that the CDA criterion was indeed able to disentangle features most relevant for emotion classification from other aspects of the faces. To qualitatively compare the discriminative to the non-discriminative features learned, we extracted the input directions to which they were most sensitive. This was achieved by extracting the 10 leading singular vectors of the derivative of either discriminative or non-discriminative features with respect to the image input. From Fig. 4 we see that, in general, the sensitivity directions for the expression-discriminative feature block, $h^{(d)}$, are more localized and contain less identity-specific information relative to the non-discriminative feature block $h^{(o)}$. We also see that the expression-discriminative feature block contains expression-targeted detectors such as corner-of-the-mouth smile detectors, toothy-grin detectors, grimace detectors and wide-eye (surprise) detectors.

Fig. 4. Singular vectors associated with the largest singular values of (left) the emotion Jacobian $\frac{\partial h^{(d)}}{\partial x}$ and (right) the other-factors Jacobian $\frac{\partial h^{(o)}}{\partial x}$. We can notice that $h^{(d)}$ is mostly sensitive to face parts associated with emotions, while $h^{(o)}$ captures face structure more likely to correspond to identity.

Classification performance obtained with features extracted after each of the two stages of our model (CCNET+SVM and CCNET+CDA+SVM) is reported in Table 1, and compared to a simpler non-convolutional one-layer CAE (CAE-1+SVM) and a stack of two CAEs (CAE-2+SVM). These results confirm that each successive layer we add helps to disentangle discriminative features, yielding good classification performance. Table 2 compares the performance of our approach to that of established models [14, 11].

Table 1.
Test classification accuracy of several models, trained on TFD, averaged over 5 folds (reported with standard deviation).

Model      CAE-1+SVM   CAE-2+SVM   CCNET+SVM   CCNET+CDA+SVM
Accuracy   ±           ±           ±           ± 0.47
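The sensitivity analysis behind Fig. 4 can be illustrated with a minimal NumPy sketch. It assumes, purely for illustration, a single sigmoid encoder layer h = sigmoid(Wx + b) applied directly to the flattened image, so that the Jacobian is diag(h(1 - h)) W. In the full model the CDA sits on top of the CCNET, and the Jacobian with respect to the image would have to be obtained by the chain rule through all layers. The function names, the random weights, the input size and the block sizes below are hypothetical and not taken from the paper.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def encoder_jacobian(W, b, x):
    """Jacobian dh/dx of h = sigmoid(W x + b); shape (n_hidden, n_input)."""
    h = sigmoid(W @ x + b)
    # Row i of the Jacobian is h_i * (1 - h_i) * W[i, :].
    return (h * (1.0 - h))[:, None] * W

def leading_input_directions(J, k=10):
    """Top-k singular values of J and the matching right singular vectors.

    Each returned row is a unit-norm direction in input space; the feature
    block is most sensitive to input changes along these directions.
    """
    _, s, Vt = np.linalg.svd(J, full_matrices=False)  # s is sorted descending
    return s[:k], Vt[:k]

# Hypothetical sizes and random weights, used only to make the sketch runnable.
rng = np.random.default_rng(0)
n_input, n_disc, n_other = 48 * 48, 30, 50
W = rng.normal(scale=0.01, size=(n_disc + n_other, n_input))
b = np.zeros(n_disc + n_other)
x = rng.random(n_input)  # stand-in for one flattened face image

J = encoder_jacobian(W, b, x)
J_d, J_o = J[:n_disc], J[n_disc:]            # discriminative / other blocks
s_d, V_d = leading_input_directions(J_d)     # directions for h^(d)
s_o, V_o = leading_input_directions(J_o)     # directions for h^(o)
```

Reshaping each row of `V_d` or `V_o` back to the image shape gives pictures comparable in kind to the panels of Fig. 4, although with random weights they carry no meaning.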

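The evaluation protocol described in footnote 3 and summarized in Table 1 (hyper-parameters tuned on the first fold's validation set, reused unchanged on the other folds, test accuracy averaged over the 5 folds) might be organized as in the following sketch. The scoring callback, the grid and the hyper-parameter names `reg` and `n_disc` are hypothetical, and `toy_score` is a synthetic stand-in for training the pipeline and scoring a linear SVM; its values are not experimental results. Whether the reported standard deviation is the population or the sample estimate is not stated in the text, so the sketch assumes the population form.

```python
import numpy as np
from itertools import product

def select_on_first_fold(grid, score_fn, folds):
    """Return the grid point with the best validation score on folds[0] only."""
    names = sorted(grid)
    candidates = [dict(zip(names, values))
                  for values in product(*(grid[n] for n in names))]
    return max(candidates, key=lambda p: score_fn(p, folds[0], "valid"))

def evaluate_all_folds(params, score_fn, folds):
    """Reuse fixed hyper-parameters on every fold; return mean and std of test accuracy."""
    acc = np.array([score_fn(params, fold, "test") for fold in folds])
    return acc.mean(), acc.std()  # assumption: population std (ddof=0)

def toy_score(params, fold, split):
    """Synthetic scorer (NOT real results): peaks at reg=1.0, n_disc=20."""
    quality = 0.8 / (1.0 + abs(np.log10(params["reg"]))
                     + abs(params["n_disc"] - 20) / 20.0)
    offset = 0.01 * fold if split == "test" else 0.0
    return quality - offset

folds = [0, 1, 2, 3, 4]
grid = {"reg": [0.1, 1.0, 10.0], "n_disc": [10, 20, 40]}

best = select_on_first_fold(grid, toy_score, folds)
mean_acc, std_acc = evaluate_all_folds(best, toy_score, folds)
```

Tuning on a single fold keeps the search cost at one fifth of a full per-fold search, at the price of hyper-parameters that may be slightly suboptimal for the remaining folds.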
Fig. 5. The effect of the CDA term on the generalization performance for different values of η. The optimum is found for a non-negligible value of η = 7.

Table 2. Test classification accuracy of established models trained on TFD.

Model               Accuracy
SVM
RBF-SVM
SC+SVM
GFB+PCA+SVM [11]
mpot+dbn
CCNET+CDA+SVM

9 Discussion

In this paper, we have investigated an approach to facial expression recognition based on a feature hierarchy trained to disentangle the factors of variation that give rise to facial expressions from other factors such as those responsible for subject identity, specific facial morphology and subject pose. We introduce contractive discriminative analysis (CDA), a novel semi-supervised learning paradigm that incorporates available label information to define discriminative features while regularizing the feature set with a CAE-inspired penalty to promote good generalization properties. By combining prior knowledge of the spatial topology of images with feature learning schemes designed to recover robust features of facial expressions, we significantly surpass the previous state-of-the-art on the Toronto Face Database [4], achieving a generalization accuracy of 85.0%. We also show how the features recovered by our CDA scheme are invariant to factors such as subject identity and pose while remaining sensitive to changes in facial expression.

References

1. Bengio, Y.: Learning deep architectures for AI. Foundations and Trends in Machine Learning 2 (2009) Also published as a book. Now Publishers.
2. Saul, L., Roweis, S.: Think globally, fit locally: unsupervised learning of low dimensional manifolds. Journal of Machine Learning Research 4 (2002)

3. Rifai, S., Vincent, P., Muller, X., Glorot, X., Bengio, Y.: Contractive auto-encoders: Explicit invariance during feature extraction. In: ICML (2011)
4. Susskind, J., Anderson, A., Hinton, G.E.: The Toronto face dataset. Technical Report UTML TR , U. Toronto (2010)
5. Ranzato, M., Susskind, J., Mnih, V., Hinton, G.E.: On deep generative models with applications to recognition. In: CVPR'11. (2011)
6. Padgett, C., Cottrell, G.W.: A simple neural network models categorical perception of facial expressions. In: Proceedings of the Twentieth Annual Cognitive Science Conference, Erlbaum (1998)
7. Jarrett, K., Kavukcuoglu, K., Ranzato, M., LeCun, Y.: What is the best multi-stage architecture for object recognition? In: ICCV'09. (2009)
8. Coates, A., Lee, H., Ng, A.Y.: An analysis of single-layer networks in unsupervised feature learning. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2011). (2011)
9. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60 (2004)
10. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: CVPR (2005)
11. Dailey, M.N., Cottrell, G.W., Padgett, C., Adolphs, R.: EMPATH: A neural network that categorizes facial expressions. J. Cognitive Neuroscience (2002)
12. LeCun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W., Jackel, L.D.: Backpropagation applied to handwritten zip code recognition. Neural Computation 1 (1989)
13. Wolf, R., Platt, J.: Postal address block location using a convolutional locator network. In: NIPS'93. (1994)
14. Ranzato, M., Huang, F., Boureau, Y., LeCun, Y.: Unsupervised learning of invariant feature hierarchies with applications to object recognition. In: CVPR'07. (2007)
15. Taylor, G., Fergus, R., LeCun, Y., Bregler, C.: Convolutional learning of spatio-temporal features. In: ECCV'10. (2010)
16. Kavukcuoglu, K., Sermanet, P., Boureau, Y.L., Gregor, K., Mathieu, M., LeCun, Y.: Learning convolutional feature hierarchies for visual recognition. In: NIPS (2010)
17. Courville, A., Bergstra, J., Bengio, Y.: Unsupervised models of images by spike-and-slab RBMs. In: ICML (2011)
18. Barker, M., Rayens, W.: Partial least squares for discrimination. Journal of Chemometrics 17 (2003)
19. Fisher, R.A.: The use of multiple measurements in taxonomic problems. Annals of Eugenics 7 (1936)
20. Hotelling, H.: Relations between two sets of variates. Biometrika 28 (1936)
21. Wold, S., Ruhe, A., Wold, H., Dunn, W.J.: The collinearity problem in linear regression. The partial least squares (PLS) approach to generalized inverses. SIAM Journal on Scientific and Statistical Computing 5 (1984)
22. Bartlett, M.S.: Further aspects of the theory of multiple regression. Mathematical Proceedings of the Cambridge Philosophical Society 34 (1938)
23. Bengio, Y., Lamblin, P., Popovici, D., Larochelle, H.: Greedy layer-wise training of deep networks. In: NIPS'06. MIT Press (2007)
24. Weston, J., Ratle, F., Collobert, R.: Deep learning via semi-supervised embedding. In: ICML (2008)
25. Larochelle, H., Bengio, Y.: Classification using discriminative restricted Boltzmann machines. In: ICML (2008)
