Disentangling factors of variation for facial expression recognition

Salah Rifai, Yoshua Bengio, Aaron Courville, Pascal Vincent, and Mehdi Mirza
Université de Montréal, Department of Computer Science and Operations Research

Abstract. We propose a semi-supervised approach to the task of emotion recognition in 2D face images, using recent ideas in deep learning for handling the factors of variation present in data. An emotion classification algorithm should be robust to both (1) the variations in pose that remain after the face has been centered and aligned in the image, and (2) the identity or morphology of the face. To achieve this invariance, we propose to learn a hierarchy of features in which we gradually filter out the factors of variation arising from both (1) and (2). We address (1) by using a multi-scale contractive convolutional network (CCNET) to obtain invariance to translations of the facial traits in the image. Using the feature representation produced by the CCNET, we train a Contractive Discriminative Analysis (CDA) feature extractor, a novel variant of the Contractive Auto-Encoder (CAE), designed to learn a representation that separates the emotion-related factors from the others (which mostly capture the subject identity and what is left of pose after the CCNET). This system beats the state of the art on a recently proposed dataset for facial expression recognition, the Toronto Face Database, moving the state-of-the-art accuracy from 82.4% to 85.0%, while the CCNET and CDA improve the accuracy of a standard CAE by 8%.

Key words: emotion recognition, contractive, convolution, deep learning, auto-encoder, TFD

1 Introduction

A central challenge in computer vision is to disentangle the various factors of variation that explain an image, such as object pose, identity, or various other attributes. This is particularly important for facial expression recognition, the central topic of this paper.
Disentangling can be done by exploiting two sources of information, a priori knowledge about these factors and examples, combined by learning algorithms tailored to this kind of disentangling. While our central contribution is in such learning algorithms, this paper is really about how we can combine the two. It exploits advances in machine learning of representations (i.e., sets of features) at multiple levels, a form of automatic feature extraction pipeline called deep learning [1]. It is based on unsupervised learning that captures the leading local directions of variation present in the data,

a kind of non-linear manifold learning [2]. The specific unsupervised learning algorithm that we build upon here is the Contractive Auto-Encoder (CAE) [3]. The proposed system is semi-supervised: it uses the emotion label as a hint about the factor of variation of interest and combines that hint with an unsupervised training criterion to separate out one factor from the others. At the top level of the feature hierarchy, two blocks of features are trained, with the features in one block being sensitive to emotion and more invariant to the other factors, while the features in the other block are trained to be insensitive to emotion changes. This is achieved by adding together the discriminant criterion (predicting emotion), a reconstruction error, the CAE contractive penalty, and a novel local orthogonality penalty that encourages the two blocks of features to vary in directions orthogonal to each other, i.e., such that features specialize to some factors (such as emotions) while being insensitive to others. The resulting algorithm is termed Contractive Discriminative Analysis (CDA). The proposed facial expression recognition system is evaluated on a recently proposed benchmark dataset, the Toronto Face Database [4], and yields results that beat the state of the art on this dataset, showing the improvement brought by CDA.

2 Background

2.1 Facial Expression Recognition

Despite receiving considerable attention over the past 15 years [5, 6], the automatic recognition of facial expression remains a very challenging problem domain. One of the major obstacles to performance is that aspects of the data associated with facial expressions are tightly intertwined with other factors evident in the data. These other factors are primarily associated with subject identity (facial morphology) and pose. With respect to the task of facial expression detection, these can be considered nuisance factors.
The challenge with facial expression recognition is that these nuisance factors often dominate the representation of the image in pixel space. Two images of different individuals with the same facial expression are likely to be well separated in pixel space, while two images of the same individual showing different expressions may well be found very close together.

2.2 The Toronto Face Database

Beyond the inherent challenges in learning to recognize facial expressions, the relative paucity of easily accessible data has been a significant barrier to progress. Until recently, datasets have been limited to a relatively small number of subjects displaying different expressions, with exogenous factors such as illumination and pose being carefully controlled. The recent introduction of the Toronto Face Database (TFD) [4] is a significant step forward in our ability to build robust recognition systems (examples shown in Fig. 3(a)). The TFD is a conglomeration of a large number of smaller face datasets, with each image aligned and rescaled to a uniform size of 48 × 48 pixels. The dataset consists of 4,178 expression-labeled images, 3,874 of which also possess subject identity labels. There is also a very

large (112,234 image) unlabeled dataset that, while missing expression label information, is preprocessed in the same way as the labeled data. Because we exploit unsupervised and semi-supervised learning procedures, our approach can take advantage of additional unlabeled data to learn better features.

2.3 Invariant Features: the Standard Pipeline

Facial expression recognition is certainly not unique among vision tasks in having to deal with nuisance factors of variation in the data. In other object and scene recognition settings, nuisance factors are typically related to object pose and illumination conditions. In recent years, solution strategies for these tasks have largely converged toward a multistage processing pipeline [7, 8]. First, local low-level features, such as SIFT [9], HoG [10] or, in the case of facial expression recognition, oriented Gabor filter banks [11], are extracted from patches of the image. Next, these features are spatially aggregated, or pooled, over different regions of the image and sometimes at different spatial resolutions. The output of each pooling unit is a sum, mean or maximum over the outputs of a filter bank over a small spatial area. The aggregate features are then mapped into a vector image representation that is used as input to a general-purpose classifier such as a linear support vector machine (SVM). The success of this approach can be attributed in good part to the quality of the feature representation used as input to the classifier. If this representation is invariant to the nuisance factors of variation while maintaining sensitivity along the relevant factors, then one would expect the system to generalize well to new examples.

2.4 Invariant Features: Unsupervised Feature Learning

The features used in the above pipeline can be hand-crafted or can be learned.
Much research has been done in recent years on generic as well as image-specific feature learning algorithms. Most of these algorithms exploit unsupervised learning and can therefore be applied even in the absence of labels. When a hierarchy of features is trained, these are called deep models [1]. These hierarchical feature learning approaches are based on the unsupervised training of single-layer models such as an RBM, sparse coding or auto-encoder variants. One particularly successful recent variant of the auto-encoder is the contractive auto-encoder (CAE) [3]. The principle underlying the CAE is that locally invariant features can be induced through activity-dependent regularization. The regularization penalty discourages changes in the features associated with small changes in the input image. The mathematical details of the CAE are provided below (Section 4).

When compared to the standard multistage pipeline, feature learning pursues a different and complementary strategy toward constructing invariant representations with good generalization properties. While the standard pipeline builds invariance to known nuisance factors of variation by aggregating over lower-level features that vary across these factors, unsupervised feature learning includes all significant factors, including the nuisance factors, but to the extent that these factors are statistically independent in the data, it tends to represent these factors separately. In the ideal scenario, while the learned

representation still contains the nuisance factors, they are disentangled from the relevant factors and can thus be more easily ignored by the subsequent classifier.

Both the hand-built features and feature-learning approaches have their advantages. The standard pipeline exploits domain knowledge about the spatial relationship between pixels to construct features that are invariant to simple transformations such as translation. On the other hand, a feature-learning strategy may be more successful at building features that are invariant to factors that do not correspond to simple transformations. In the case of facial expression recognition, many of the factors associated with facial morphology fall into this category. While invariance to simple transformations such as local translations, rotations and scaling goes some way toward removing sensitivity to facial morphology, it stops well short of capturing all variations in facial characteristics across the human population. Yet, unfortunately, due to the tight coupling of the factors of variation underlying facial expression and morphology, even our most successful unsupervised learning-based approaches are unlikely to satisfactorily disentangle these factors on their own.

3 Proposed Approach

In this work, we deal directly with this issue of entangled factors of variation by developing a semi-supervised feature learning strategy that combines the advantages of both a feature learning approach and the feature pooling pipeline. Ours is a hierarchical (or deep) modeling approach. At each layer, the features become increasingly invariant to nuisance factors while maintaining discriminative information with respect to the task of facial expression recognition. Our approach can be broken down into the following three stages of the learning procedure.

1.
We use the CAE algorithm (described in Section 4) to learn locally invariant image features from image patches at multiple resolutions. The CAE-derived feature extractors are applied convolutionally to the entire image to form a series of feature maps, each corresponding to a single learned CAE feature, i.e., to one convolution kernel. These feature maps are then decimated via max-pooling in regular non-overlapping regions to form a local-translation-invariant (LTI) representation. This first stage is termed the Contractive Convolutional Network (CCNET) and is described in more detail in Section 5 below.

2. The LTI features are then used as input to a novel semi-supervised, CAE-based feature-learning approach we call Contractive Discriminative Analysis (CDA). The basic approach is to divide the features to be learned at this layer into two blocks. While the blocks are trained to cooperate to reconstruct their mutual input, one of these blocks (the discriminative feature block) is also trained to predict the facial expression class on examples where label information is available. Our objective in segregating the features in this way is to tease apart the discriminative features that learn to encode useful information about facial expressions from nuisance features (that are complementary but not task-discriminative). We further include a

novel CAE-inspired penalty that locally encourages the discriminative features and non-discriminative features to encode distinct directions of variation in the input. The resulting CDA learning algorithm is described formally in Section 6.

3. Finally, following the standard pipeline, the discriminative features are used as input to train a linear SVM on the labeled training data.

Once the system has been trained, the learned features form the basis of a multistage classification pipeline, similar to commonly used classification pipelines [7, 8]. To summarize, the computational stages of the classification pipeline (which closely follow the training pipeline outlined above) are as follows.

1. The multi-resolution CAE features are convolved over the entire image. This produces a set of feature maps (one for each CAE feature).

2. The convolutional CAE feature maps are decimated (via max-pooling) to a coarse grid (2 × 2 or 3 × 3) over the image.

3. These decimated features are then concatenated and the CDA discriminative feature encoding is applied to this concatenation. Note that when performing classification, we no longer need to compute the block of non-discriminative CDA features.

4. Finally, the CDA encoding (the discriminative block) is passed to the linear SVM to obtain a class prediction. In the case of facial expression recognition, these class predictions correspond to one of the seven recognized expressions: anger, disgust, fear, happiness, sadness, surprise and neutral.

Fig. 1 illustrates the classification pipeline.

Fig. 1. Classification pipeline. At left: input image at different resolutions. The next stage contains the output of the K convolutional feature maps, followed by the max-pooling ($y$). The CDA produces the last stage of (discriminant) features $h^{(d)}$ through the weight matrix $W$, which are then fed to a linear SVM.
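The first three computational stages listed above can be sketched in NumPy as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names (`feature_maps`, `max_pool_grid`, `lti_representation`, `discriminative_features`) are hypothetical, the kernels and CDA weights are random stand-ins for trained parameters, and only 4 kernels per scale are used to keep the demo fast. The final linear SVM stage is omitted.

```python
import numpy as np


def sigmoid(z):
    """Element-wise logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-z))


def feature_maps(image, kernels, biases):
    """Stage 1: apply K kernels of size s-by-s at every valid image location.

    image: (H, W); kernels: (K, s, s); biases: (K,).
    Returns post-sigmoid maps of shape (K, H - s + 1, W - s + 1).
    """
    s = kernels.shape[1]
    # All s-by-s patches of the image, shape (H - s + 1, W - s + 1, s, s).
    patches = np.lib.stride_tricks.sliding_window_view(image, (s, s))
    pre_activation = np.einsum("abij,kij->kab", patches, kernels)
    return sigmoid(pre_activation + biases[:, None, None])


def max_pool_grid(maps, grid=3):
    """Stage 2: max-pool each map over grid x grid non-overlapping regions."""
    n_maps, height, width = maps.shape
    row_blocks = np.array_split(np.arange(height), grid)
    col_blocks = np.array_split(np.arange(width), grid)
    pooled = np.empty((n_maps, grid, grid))
    for a, rows in enumerate(row_blocks):
        for b, cols in enumerate(col_blocks):
            region = maps[:, rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]
            pooled[:, a, b] = region.max(axis=(1, 2))
    return pooled.reshape(-1)


def lti_representation(image, scales, grid=3):
    """Stage 3 (first half): concatenate the pooled features of every scale."""
    return np.concatenate(
        [max_pool_grid(feature_maps(image, A, alpha), grid) for A, alpha in scales]
    )


def discriminative_features(y, W, c):
    """Stage 3 (second half): discriminative CDA block, the input to the SVM."""
    return sigmoid(W @ y + c)


rng = np.random.default_rng(0)
image = rng.random((48, 48))
# Hypothetical random kernels at the two scales; untrained, for shape checks only.
scales = [(rng.normal(size=(4, s, s)), np.zeros(4)) for s in (14, 18)]
y = lti_representation(image, scales)       # 8 maps x 3 x 3 regions = 72 values
W, c = rng.normal(size=(5, y.size)), np.zeros(5)
h_d = discriminative_features(y, W, c)
print(y.shape, h_d.shape)                   # prints (72,) (5,)
```

With 1024 kernels in place of the 8 used here, the same pooling grid would give 1024 × 9 = 9216 pooled features.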
4 Contractive Auto-Encoder (CAE)

In this section, we briefly describe the CAE algorithm that is used for unsupervised feature learning. We closely follow the description of [3] for unsupervised

learning of a non-linear feature extractor from a dataset $D = \{x_1, \ldots, x_n\}$. Examples $x_i \in \mathbb{R}^d$ are i.i.d. samples from an unknown distribution $p(x)$.

4.1 Auto-Encoders

The auto-encoder framework is one of the oldest and simplest techniques for the unsupervised learning of non-linear feature extractors. It learns an encoder function $h$, that maps an input $x \in \mathbb{R}^d$ to a feature vector $h(x) \in \mathbb{R}^{d_h}$, jointly with a decoder function $g$, that maps $h$ back to the input space as $r = g(h(x))$, the reconstruction of $x$. The encoder and decoder's parameters $\theta$ can be optimized by stochastic gradient descent to minimize the average reconstruction error $L(x, g(h(x)))$ for the examples of the training set. The objective being minimized is:

$$J_{AE}(\theta) = \sum_{x \in D} L(x, g(h(x))). \qquad (1)$$

We will use the most common forms of encoder, decoder, and reconstruction error:

Encoder: $h(x) = s(Ax + \alpha)$, where $s$ is the element-wise logistic sigmoid $s(z) = \frac{1}{1 + e^{-z}}$. Parameters are a $d_h \times d$ weight matrix $A$ and bias vector $\alpha \in \mathbb{R}^{d_h}$.

Decoder: $\hat{x} = g(h(x)) = s_2(A^T h(x) + \beta)$. Parameters are $A^T$ (tied weights, shared with the encoder) and bias vector $\beta \in \mathbb{R}^d$. Activation function $s_2$ is either a logistic sigmoid ($s_2 = s$) or the identity (linear decoder).

Loss function: squared error, $L_{RECON}(x, \hat{x}) = \|x - \hat{x}\|^2$.

The set of parameters of such an auto-encoder is $\theta = \{A, \alpha, \beta\}$.

4.2 Contractive Regularization

For auto-encoders to learn something meaningful, they must have low reconstruction error on the training examples but large reconstruction error for most other input configurations. One way to achieve this is with the contractive penalty of the Contractive Auto-Encoder (CAE), introduced by [3]. This penalty term encourages robustness of the feature vector $h(x)$ to small variations of a training input $x$, by penalizing its sensitivity to that input, measured as the Frobenius norm of the encoder's Jacobian $J(x) = \frac{\partial h}{\partial x}(x)$.
The regularized objective minimized by the CAE is the following:

$$J_{CAE}(\theta) = \sum_{x \in D} L(x, g(h(x))) + \lambda \|J(x)\|_F^2, \qquad (2)$$

where $\lambda$ is a non-negative regularization hyper-parameter that controls how strongly the norm of the Jacobian is penalized, and $\|A\|_F^2$ is the squared Frobenius matrix norm (sum of the squares of the matrix elements). Note that, with the traditional sigmoid encoder form given above, one can easily obtain the Jacobian of the encoder. Its $j$-th row is obtained from the $j$-th row $A_j$ of $A$ as:

$$J(x)_j = \frac{\partial h_j(x)}{\partial x} = h_j(x)\,(1 - h_j(x))\,A_j. \qquad (3)$$
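As a concrete illustration of Eqs. (2) and (3), the following NumPy sketch evaluates the CAE objective for a sigmoid encoder with tied weights. It assumes a sigmoid decoder ($s_2 = s$) and squared-error loss; the function names `cae_objective` and `encoder_jacobian` are hypothetical and chosen only for this example. Because row $j$ of the Jacobian is a scaled copy of $A_j$, the squared Frobenius norm reduces to $\sum_j (h_j(1 - h_j))^2 \|A_j\|^2$, which avoids forming the Jacobian explicitly.

```python
import numpy as np


def sigmoid(z):
    """Element-wise logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-z))


def encoder_jacobian(x, A, alpha):
    """Explicit Jacobian of h(x) = s(Ax + alpha), Eq. (3); shape (d_h, d)."""
    h = sigmoid(A @ x + alpha)
    return (h * (1.0 - h))[:, None] * A


def cae_objective(X, A, alpha, beta, lam):
    """Eq. (2) summed over the rows of X (one example per row).

    Assumes tied weights, a sigmoid decoder and squared reconstruction error.
    """
    H = sigmoid(X @ A.T + alpha)           # encoder outputs, shape (n, d_h)
    X_hat = sigmoid(H @ A + beta)          # decoder A^T h + beta, shape (n, d)
    recon = np.sum((X - X_hat) ** 2)
    # ||J(x)||_F^2 = sum_j (h_j (1 - h_j))^2 * ||A_j||^2, from Eq. (3).
    row_norms = np.sum(A ** 2, axis=1)     # ||A_j||^2 for each feature j
    penalty = np.sum((H * (1.0 - H)) ** 2 * row_norms)
    return recon + lam * penalty


rng = np.random.default_rng(0)
A = rng.normal(size=(5, 8))                # d_h = 5 features, d = 8 inputs
alpha, beta = rng.normal(size=5), rng.normal(size=8)
X = rng.random((3, 8))
print(cae_objective(X, A, alpha, beta, lam=0.1))
```

The closed form costs about as much as one extra forward pass, which matches the remark that the penalty is relatively cheap to compute.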

Computing the extra penalty term (and its contribution to the gradient) is similar to computing the reconstruction error term (and its contribution to the gradient), and is thus relatively cheap. The effect of training a CAE is that the resulting features tend to be sparsely active: only a few of the features have a significantly non-zero derivative (i.e., when the sigmoid is saturated neither near 0 nor near 1). The set of active features depends on the current input $x$. Those active features respond almost linearly to changes in the input, and they provide a local basis for the variations around $x$, in some privileged directions which are those to which they respond (corresponding to their weight vectors, e.g., $A_j$ for feature $h_j$). When the data congregate near a low-dimensional manifold around $x$, only a few features are active in the neighborhood of $x$. Hence the locally active features form a coordinate system for a region in input space, corresponding to a chart, and the overall set of such regions forms an atlas of charts [3] mapping out a non-linear manifold near which the estimated input density concentrates.

5 Contractive Convolutional Network (CCNET)

Convolutional neural networks are generalizations of neural networks which have been particularly successful in computer vision [12-16]. In a way similar to what has been done to generalize sparse coding and RBMs to the convolutional setting, we generalize CAEs to the convolutional setting, i.e., each convolutional feature output sees a spatially local region in the input image (called the receptive field), while sharing the parameters of that feature (convolution kernel) with other features that have a receptive field located elsewhere. This is equivalent to replacing the neural network matrix multiplication found in linear feature extraction by a series of convolutions, which correspond to sparse structured matrices.
Whereas in the usual applications of convolutions for convolutional networks a single common receptive field (convolutional kernel size) is used, we consider multiple sizes, allowing the model to capture structure at different scales (see also [17] for a similar approach, with two scales). As done previously [14, 8], we chose to initialize the convolutional neural network with filters pre-trained patch-wise by unsupervised learning (here as a CAE), i.e., an ordinary CAE was trained with patches extracted randomly at different locations, whose size matches that of the convolutional kernels being learned. We pre-train a CAE independently for each kernel size. Assuming that we have a set of $n$ different sizes $s_i$ (the results on TFD were obtained with the (14, 14) and (18, 18) scales), we denote as follows the output on patches $x$ of the CAE trained with the $i$-th patch size:

$$h_i(x) = s(A_i x + \alpha_i)$$

We compute the corresponding feature maps on the whole image by applying $h_i$ to each $s_i$-by-$s_i$ patch of the input image,

$$f^i(x) = s(\mathrm{conv}(A_i, x) + \alpha_i)$$

followed by a maximum-pooling on $p$ uniformly divided non-overlapping regions $Q = \{q_1, \ldots, q_p\}$:

$$F^i_j(x) = \max_{k \in q_j} f^i_k(x)$$

The final output of the convolutional layer is defined as the stack of all the pooled features of each CAE into one long vector:

$$F(x) = \left[ F^1(x), \ldots, F^n(x) \right] \quad \text{with} \quad F^i(x) = \left[ F^i_1(x), \ldots, F^i_p(x) \right]$$

where we take $y = F(x)$ to be the LTI representation given as input to the higher-level CAEs that will perform the disentangling of the remaining factors of variation.

6 Contractive Discriminative Analysis

In this section we describe our major technical contribution, the Contractive Discriminative Analysis (CDA). CDA is a semi-supervised version of CAE that promotes the disentangling of discriminative factors of variation in the data from other prominent factors that may well dominate the discriminative factors. Our goal, in deriving CDA, is to separate the factors of the image that are discriminative with respect to the facial expression recognition task from factors that characterize facial morphology and pose.

CDA is an extension of the CAE framework (introduced in Sec. 4). While the standard CAE encodes the data into a single feature vector $h(x)$, CDA learns an encoder function that maps an input into two (or more) distinct blocks of features: one that encodes discriminative factors of its input, $h^{(d)}(y) = s(Wy + c)$, and one (or more) that encode all other factors, $h^{(o)}(y) = s(Vy + b)$. Both feature blocks are trained to cooperate to reconstruct their common input $y$ with a reconstruction loss function, e.g.,

$$L_{RECON}(y, \hat{y}) = \|y - \hat{y}\|^2 \qquad (4)$$

where $\hat{y}$ is the CDA reconstruction, given by a linear combination of learned features:

$$\hat{y} = g([h^{(d)}(y), h^{(o)}(y)]) = s_2\left(W^T h^{(d)}(y) + V^T h^{(o)}(y) + \rho\right), \qquad (5)$$

where $\rho_i$ is an offset to capture the average value of $y_i$.
In addition, the $h^{(d)}(y)$ block is also trained to predict the facial expression label $z(y)$ when that information is available. The class prediction $\hat{z}_i = P(z = i \mid y)$ is given by the sigmoid function $s(\cdot)$ applied to an affine transformation of the discriminative block, similarly to logistic regression:

$$\hat{z}_i = s\left(U_i h^{(d)}(y) + a_i\right), \qquad (6)$$

where the weight vector $U_i$ maps the discriminative block $h^{(d)}(y)$ to the prediction for class $i$, and $a_i$ is the class-specific bias. The corresponding discriminant component of the overall loss function is:

$$L_{DISC}(z, \hat{z}) = -\sum_{i=1}^{C} \left[ z_i \log \hat{z}_i + (1 - z_i) \log(1 - \hat{z}_i) \right] \qquad (7)$$

where $(x, z) \in L$, the labeled training set with input image $x$ and expression label $z$ (represented as a one-hot vector), and with the $x$'s in $L$ a subset of the set of all input examples $D$ (some of which are unlabeled). To obtain semi-supervised training we add a CAE-inspired contractive penalty $J_{CDA}(y)$:

$$J_{CDA}(y) = \left\| \frac{\partial h^{(d)}}{\partial y}(y) \right\|_F^2 + \left\| \frac{\partial h^{(o)}}{\partial y}(y) \right\|_F^2 + \gamma \sum_{i,j} \left( \frac{\partial h^{(d)}_i}{\partial y}(y) \cdot \frac{\partial h^{(o)}_j}{\partial y}(y) \right)^2. \qquad (8)$$

The first two terms penalize sensitivity in $h^{(d)}(y)$ and $h^{(o)}(y)$ respectively to local variations in $y$ (as in the standard CAE), but crucially the third term encourages $h^{(d)}(y)$ and $h^{(o)}(y)$ to represent different directions of variation in the input $y$, by asking each sensitivity vector $\frac{\partial h^{(d)}_i}{\partial y}(y)$ of the $i$-th discriminant feature to prefer being orthogonal to every sensitivity vector $\frac{\partial h^{(o)}_j}{\partial y}(y)$ associated with the $j$-th non-discriminant feature $h^{(o)}_j$. The addition of this term to the CDA cost function is crucial in achieving our performance results. As we discuss later, it regularizes the CDA discriminative features in a manner analogous to how partial least squares can be interpreted as a regularized variant of canonical correlation analysis [18]. The coefficient $\gamma$ modulates the relative contribution of the orthogonalization penalty to the overall CDA contractive penalty. Putting all the components of the CDA loss function together we get:

$$L_{CDA}(\theta) = \sum_{x \in D,\; y = F(x)} \left[ L_{RECON}(y, \hat{y}) + \eta\, J_{CDA}(y) \right] + \sum_{(x, z) \in L,\; y = F(x)} L_{DISC}(z, \hat{z}) \qquad (9)$$

The coefficient $\eta$ weighs the contribution of the contractive penalty. The set of CDA parameters is $\theta = \{U, V, W, a, b, c, \rho\}$. The CDA training procedure is illustrated in Fig. 2.
As expressed here, CDA disentangles only the discriminative factors from the other prominent factors in the data. However, one could easily generalize the method to incorporate any form of additional side information that could be used to further disentangle factors of variation. This would be achieved by creating additional blocks and associating each of them with a set of predictive parameters that map the features in the block to the values of the factor of interest (like $U$ and $a$ above).
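The per-example CDA loss of this section (reconstruction, contractive penalty with the orthogonality term, and the discriminant term for labeled examples) can be sketched in NumPy as below. This is a minimal sketch under stated assumptions rather than the authors' code: it assumes a sigmoid decoder ($s_2 = s$), uses the closed-form sigmoid Jacobian of Eq. (3) for each block, and writes the discriminant term as a negative log-likelihood so that it can be minimized. The names `block_jacobian`, `cda_penalty` and `cda_loss` are hypothetical.

```python
import numpy as np


def sigmoid(z):
    """Element-wise logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-z))


def block_jacobian(M, bias, y):
    """Jacobian of a sigmoid block s(My + bias) with respect to y."""
    h = sigmoid(M @ y + bias)
    return (h * (1.0 - h))[:, None] * M


def cda_penalty(W, c, V, b, y, gamma):
    """Contractive penalty with the orthogonality term weighted by gamma."""
    J_d = block_jacobian(W, c, y)          # sensitivities of h^(d)
    J_o = block_jacobian(V, b, y)          # sensitivities of h^(o)
    # Entry (i, j) is the dot product of the i-th discriminant and the
    # j-th non-discriminant sensitivity vectors.
    cross = J_d @ J_o.T
    return np.sum(J_d ** 2) + np.sum(J_o ** 2) + gamma * np.sum(cross ** 2)


def cda_loss(params, y, eta, gamma, z=None):
    """Loss for one example y = F(x); z is a one-hot label or None."""
    W, c, V, b, rho, U, a = params
    h_d = sigmoid(W @ y + c)
    h_o = sigmoid(V @ y + b)
    y_hat = sigmoid(W.T @ h_d + V.T @ h_o + rho)   # sigmoid decoder assumed
    loss = np.sum((y - y_hat) ** 2) + eta * cda_penalty(W, c, V, b, y, gamma)
    if z is not None:                              # labeled example only
        z_hat = sigmoid(U @ h_d + a)
        loss -= np.sum(z * np.log(z_hat) + (1.0 - z) * np.log(1.0 - z_hat))
    return loss


rng = np.random.default_rng(0)
d, n_d, n_o, n_classes = 12, 4, 4, 7
params = (
    rng.normal(size=(n_d, d)), np.zeros(n_d),      # W, c
    rng.normal(size=(n_o, d)), np.zeros(n_o),      # V, b
    np.zeros(d),                                   # rho
    rng.normal(size=(n_classes, n_d)), np.zeros(n_classes),  # U, a
)
y = rng.random(d)
z = np.eye(n_classes)[2]                           # one-hot label for class 2
print(cda_loss(params, y, eta=1.0, gamma=1.0, z=z))
```

Summing `cda_loss` over the unlabeled examples (with `z=None`) and the labeled ones gives the total objective. Additional side-information blocks would add further encoder blocks and one discriminant term per block.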

Fig. 2. Illustration of the architecture and training procedure for CDA, which learns and separates two blocks of features (emotion-discriminant features $h^{(d)}$ and features $h^{(o)}$ capturing the other factors).

7 Connections to Previous Work

The name Contractive Discriminative Analysis (CDA) was inspired by the connection between our CAE-based approach and earlier linear methods such as linear discriminant analysis (LDA) [19], canonical correlation analysis (CCA) [20] and partial least squares (PLS) [21]. In fact, in the case of linear activation functions (with $h(y) = Wy$) and no orthogonality penalty ($\eta = 0$), the discriminative features that would be discovered by CDA would span the same subspace (in the non-overcomplete setting) as LDA and CCA (see footnote 2).

Our use of the orthogonality-inducing contraction penalty (with CAE contraction coefficient $\lambda > 0$) has an important effect on the learned features. As previously mentioned, it acts as an additional regularization term on the discriminative features learned by CDA. Interestingly, in the linear setting (with $h(y) = Wy$), the effect of this penalty is to encourage the discriminative and non-discriminative features to be mutually orthogonal. This is reminiscent of the difference between CCA, which seeks a linear projection of the input that maximizes correlation with the label encoding, and PLS, which seeks an equivalent linear projection but maximizes covariance instead. PLS is considered a regularized form of CCA since it forces the projection to preserve additional information about the input, specifically in the covariance directions. By penalizing non-orthogonality in the projecting matrix, our CDA penalty acts in a way very similar to PLS.

Another interesting connection can be drawn to deep learning techniques that combine a supervised objective with an unsupervised objective when learning a feature set.
This started with the idea of partially supervised training in [23], where the RBM or auto-encoder gradient (or estimated gradient) is added to the gradient of a global supervised objective for the deep network. A related idea was proposed in [24], which made it possible to train fairly deep networks in a semi-supervised setting (where only a few examples have a label, i.e., the supervised gradient is only added on these). A hybrid of discriminant (conditional log-likelihood) and generative (joint log-likelihood) gradients was also used to train discriminant RBMs [25]. The most significant difference between the CDA and these other semi-supervised feature learning strategies is that CDA explicitly deals with nuisance factors by relegating them to the non-discriminative feature set. These other approaches use the labels to encourage discriminative features

Footnote 2: In the discriminative setting, where one of the two projected matrices contains only label information, the CCA and LDA directions are the same [22].

while relying on model capacity limitations to filter out the nuisance factors. In the CAE there is a deliberate and controlled loss of information (in directions of variance that correspond to these nuisance factors) in the discriminative feature block.

Fig. 3. Left: Example images from the Toronto Face Database [4]. Center and right: Convolutional kernels learnt by the CAE (center: 14 × 14, right: 18 × 18). Smaller kernels tend to learn features that are more local in the 2D image space.

8 Experiments and Results

For our experiments, we use the same setup as [4] and [14]. We use the same 5 standard splits (folds) of the Toronto Face Database to repeatedly train our model and evaluate its performance for emotion classification (see footnote 3). For the CCNET training stage, since it is entirely unsupervised, we used the 112,234 unlabeled faces (48 × 48 grayscale images). More specifically, the CCNET was used to learn 512 convolutional kernels of size 14 × 14 and 512 of size 18 × 18. Figure 3 shows some of the learned kernels. Each post-sigmoid feature map obtained by applying one of these 1024 kernels was max-pooled within 3 × 3 regions, yielding 1024 × 9 = 9216 features in total.

For the following CDA stage, training examples were sampled with 50% probability from the TFD unlabeled set and 50% probability from the TFD labeled training set (of the considered split). This is to make sure that the less numerous labeled examples get seen often enough during training, since they contain crucial information that we do not want to swamp under the signals brought by the unlabeled examples. For each TFD standard split, the CDA was trained to extract 1000 discriminative features and 1000 non-discriminative features. These features were then fed to a linear SVM for assessing final classification performance.
Footnote 3: Due to time constraints, we concentrated on the first fold for tuning the model's hyper-parameters. We retained the values of the hyper-parameters that yielded the best linear SVM performance on that first fold's validation set, and used them unchanged for the other folds. This strategy was used to select both the CCNET hyper-parameters (kernel sizes, CAE regularization strength $\lambda$, and pooling regions) and the CDA hyper-parameters (number of discriminative and non-discriminative features, and $\gamma$) that are reported in the main text.

Performance averaged over the 5 splits was 43.01% accuracy when using

12 12 Rifai, S., Bengio, Y., Courville, A., Vincent, P., Mirza, M. Fig. 4. Singular vectors associated with largest singular values of the (left) emotion Jacobian h(e), (right) other factors Jacobian h(o). We can notice that x x h(d) is mostly sensitive to face parts associated with emotions, while h (o) captures face structure more likely to correspond to identity. the non-discriminative features versus 85.06% when using the discriminative features. This indicates that the CDA criterion was indeed able to disentangle features most relevant for emotion classification from other aspects of the faces. To qualitatively compare the discriminative to the non-discriminative features learned, we extracted the input directions to which they were most sensitive. This was achieved by extracting the 10 leading singular vectors of the derivative of either discriminative or non-discriminative features with respect to the image input. From Fig. 4 we see that, in general, the sensitivity directions for the expression-discriminate feature block, h (d), are more localized and contain less identity specific information relative to the non-discriminative feature block h (o). We also see that the expression-discriminate feature block contains expression targeted detectors such as corner of the mouth smile detectors, toothy-grin detectors, grimace detectors and wide-eye (surprise) detectors. Classification performance obtained with features extracted after each of the two stages of our model (CCNET+SVM and CCNET+CDA+SVM) are reported in Table 1, and compared to simpler single non-convolutional one-layer CAE (CAE-1+SVM) and a stack of two CAEs (CAE-2+SVM). These results confirm that each successive layer we add helps to disentangle discriminative features, yielding good classification performance. Table 2 compares the performance of our approach to that of established models[14, 11]. Table 1. 
Test classification accuracy of several models, trained on TFD, averaged over 5 folds (reported with standard deviation). Model CAE-1+SVM CAE-2+SVM CCNET+SVM CCNET+CDA+SVM Accuracy ± ± ± ± 0.47
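The sensitivity analysis used for Fig. 4 (taking the leading singular vectors of the derivative of a feature block with respect to the input) can be illustrated with a minimal sketch. It assumes a single sigmoid feature layer h = sigmoid(Wx + b) applied directly to the image, which is a simplification: in the paper the features sit on top of the CCNET, so the true Jacobian would be propagated through the whole stack. The names `feature_jacobian` and `leading_sensitivity_directions`, the random weights, and all sizes are hypothetical and chosen only for illustration.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def feature_jacobian(W, b, x):
    """Jacobian dh/dx of h = sigmoid(W x + b); shape (n_features, n_inputs)."""
    h = sigmoid(W @ x + b)
    # Chain rule: row j of the Jacobian is h_j * (1 - h_j) * W[j, :].
    return (h * (1.0 - h))[:, None] * W


def leading_sensitivity_directions(W, b, x, k=10):
    """Return the k largest singular values and the matching right singular
    vectors of dh/dx, i.e. the input directions the block is most sensitive to."""
    J = feature_jacobian(W, b, x)
    _, s, Vt = np.linalg.svd(J, full_matrices=False)  # s is sorted descending
    return s[:k], Vt[:k]


# Toy setup (assumption): a 16x16 "image" and a feature vector split into a
# discriminative block h_d and a non-discriminative block h_o.
rng = np.random.default_rng(0)
n_inputs, n_disc, n_other = 16 * 16, 30, 50
W = rng.normal(scale=0.01, size=(n_disc + n_other, n_inputs))
b = np.zeros(n_disc + n_other)
x = rng.random(n_inputs)

s_d, V_d = leading_sensitivity_directions(W[:n_disc], b[:n_disc], x)
s_o, V_o = leading_sensitivity_directions(W[n_disc:], b[n_disc:], x)

# Each row of V_d or V_o can be reshaped to the image shape for display.
first_direction = V_d[0].reshape(16, 16)
```

With trained weights in place of the random ones, the rows of `V_d` and `V_o` would play the role of the left and right panels of Fig. 4 respectively; averaging the Jacobian over several inputs rather than using a single `x` is another possible choice that this sketch does not attempt.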

Fig. 5. The effect of the CDA term on the generalisation performance for different values of η. The optimum is found for a non-negligible value of η = 7.

Table 2. Test classification accuracy of established models trained on TFD.

Model               Accuracy
SVM
RBF-SVM
SC+SVM
GFB+PCA+SVM [11]
mPoT+DBN
CCNET+CDA+SVM

9 Discussion

In this paper, we have investigated an approach to facial expression recognition based on a feature hierarchy trained to disentangle the factors of variation that give rise to facial expressions from other factors such as those responsible for subject identity, specific facial morphology and subject pose. We introduce contractive discriminative analysis (CDA), a novel semi-supervised learning paradigm that incorporates available label information to define discriminative features while regularizing the feature set with a CAE-inspired penalty to promote good generalization properties. By combining prior knowledge of the spatial topology of images with feature learning schemes designed to recover robust features of facial expressions, we significantly surpass the previous state of the art on the Toronto Face Database [4], achieving a generalization accuracy of 85.0%. We also show how the features recovered by our CDA scheme are invariant to factors such as subject identity and pose while remaining sensitive to changes in facial expression.

References

1. Bengio, Y.: Learning deep architectures for AI. Foundations and Trends in Machine Learning 2 (2009). Also published as a book, Now Publishers.
2. Saul, L., Roweis, S.: Think globally, fit locally: unsupervised learning of low dimensional manifolds. Journal of Machine Learning Research 4 (2002)

3. Rifai, S., Vincent, P., Muller, X., Glorot, X., Bengio, Y.: Contractive auto-encoders: Explicit invariance during feature extraction. In: ICML (2011)
4. Susskind, J., Anderson, A., Hinton, G.E.: The Toronto face dataset. Technical Report UTML TR, U. Toronto (2010)
5. Ranzato, M., Susskind, J., Mnih, V., Hinton, G.E.: On deep generative models with applications to recognition. In: CVPR'11 (2011)
6. Padgett, C., Cottrell, G.W.: A simple neural network models categorical perception of facial expressions. In: Proceedings of the Twentieth Annual Cognitive Science Conference, Erlbaum (1998)
7. Jarrett, K., Kavukcuoglu, K., Ranzato, M., LeCun, Y.: What is the best multistage architecture for object recognition? In: ICCV'09 (2009)
8. Coates, A., Lee, H., Ng, A.Y.: An analysis of single-layer networks in unsupervised feature learning. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2011) (2011)
9. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60 (2004)
10. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: CVPR (2005)
11. Dailey, M.N., Cottrell, G.W., Padgett, C., Adolphs, R.: EMPATH: A neural network that categorizes facial expressions. Journal of Cognitive Neuroscience (2002)
12. LeCun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W., Jackel, L.D.: Backpropagation applied to handwritten zip code recognition. Neural Computation 1 (1989)
13. Wolf, R., Platt, J.: Postal address block location using a convolutional locator network. In: NIPS'93 (1994)
14. Ranzato, M., Huang, F., Boureau, Y., LeCun, Y.: Unsupervised learning of invariant feature hierarchies with applications to object recognition. In: CVPR'07 (2007)
15. Taylor, G., Fergus, R., LeCun, Y., Bregler, C.: Convolutional learning of spatiotemporal features. In: ECCV'10 (2010)
16. Kavukcuoglu, K., Sermanet, P., Boureau, Y.L., Gregor, K., Mathieu, M., LeCun, Y.: Learning convolutional feature hierarchies for visual recognition. In: NIPS (2010)
17. Courville, A., Bergstra, J., Bengio, Y.: Unsupervised models of images by spike-and-slab RBMs. In: ICML (2011)
18. Barker, M., Rayens, W.: Partial least squares for discrimination. Journal of Chemometrics 17 (2003)
19. Fisher, R.A.: The use of multiple measurements in taxonomic problems. Annals of Eugenics 7 (1936)
20. Hotelling, H.: Relations between two sets of variates. Biometrika 28 (1936)
21. Wold, S., Ruhe, A., Wold, H., Dunn, W.J.: The collinearity problem in linear regression. The partial least squares (PLS) approach to generalized inverses. SIAM Journal on Scientific and Statistical Computing 5 (1984)
22. Bartlett, M.S.: Further aspects of the theory of multiple regression. Mathematical Proceedings of the Cambridge Philosophical Society 34 (1938)
23. Bengio, Y., Lamblin, P., Popovici, D., Larochelle, H.: Greedy layer-wise training of deep networks. In: NIPS'06. MIT Press (2007)
24. Weston, J., Ratle, F., Collobert, R.: Deep learning via semi-supervised embedding. In: ICML (2008)
25. Larochelle, H., Bengio, Y.: Classification using discriminative restricted Boltzmann machines. In: ICML (2008)

A Sparse and Locally Shift Invariant Feature Extractor Applied to Document Images

A Sparse and Locally Shift Invariant Feature Extractor Applied to Document Images A Sparse and Locally Shift Invariant Feature Extractor Applied to Document Images Marc Aurelio Ranzato Yann LeCun Courant Institute of Mathematical Sciences New York University - New York, NY 10003 Abstract

More information

Multi-Task Learning of Facial Landmarks and Expression

Multi-Task Learning of Facial Landmarks and Expression Multi-Task Learning of Facial Landmarks and Expression Terrance Devries 1, Kumar Biswaranjan 2, and Graham W. Taylor 1 1 School of Engineering, University of Guelph, Guelph, Canada N1G 2W1 2 Department

More information

A Sparse and Locally Shift Invariant Feature Extractor Applied to Document Images

A Sparse and Locally Shift Invariant Feature Extractor Applied to Document Images A Sparse and Locally Shift Invariant Feature Extractor Applied to Document Images Marc Aurelio Ranzato Yann LeCun Courant Institute of Mathematical Sciences New York University - New York, NY 10003 Abstract

More information

Facial Expression Classification with Random Filters Feature Extraction

Facial Expression Classification with Random Filters Feature Extraction Facial Expression Classification with Random Filters Feature Extraction Mengye Ren Facial Monkey mren@cs.toronto.edu Zhi Hao Luo It s Me lzh@cs.toronto.edu I. ABSTRACT In our work, we attempted to tackle

More information

Akarsh Pokkunuru EECS Department Contractive Auto-Encoders: Explicit Invariance During Feature Extraction

Akarsh Pokkunuru EECS Department Contractive Auto-Encoders: Explicit Invariance During Feature Extraction Akarsh Pokkunuru EECS Department 03-16-2017 Contractive Auto-Encoders: Explicit Invariance During Feature Extraction 1 AGENDA Introduction to Auto-encoders Types of Auto-encoders Analysis of different

More information

Contractive Auto-Encoders: Explicit Invariance During Feature Extraction

Contractive Auto-Encoders: Explicit Invariance During Feature Extraction : Explicit Invariance During Feature Extraction Salah Rifai (1) Pascal Vincent (1) Xavier Muller (1) Xavier Glorot (1) Yoshua Bengio (1) (1) Dept. IRO, Université de Montréal. Montréal (QC), H3C 3J7, Canada

More information

Neural Networks: promises of current research

Neural Networks: promises of current research April 2008 www.apstat.com Current research on deep architectures A few labs are currently researching deep neural network training: Geoffrey Hinton s lab at U.Toronto Yann LeCun s lab at NYU Our LISA lab

More information

Autoencoders, denoising autoencoders, and learning deep networks

Autoencoders, denoising autoencoders, and learning deep networks 4 th CiFAR Summer School on Learning and Vision in Biology and Engineering Toronto, August 5-9 2008 Autoencoders, denoising autoencoders, and learning deep networks Part II joint work with Hugo Larochelle,

More information

COMP 551 Applied Machine Learning Lecture 16: Deep Learning

COMP 551 Applied Machine Learning Lecture 16: Deep Learning COMP 551 Applied Machine Learning Lecture 16: Deep Learning Instructor: Ryan Lowe (ryan.lowe@cs.mcgill.ca) Slides mostly by: Class web page: www.cs.mcgill.ca/~hvanho2/comp551 Unless otherwise noted, all

More information

Learning Two-Layer Contractive Encodings

Learning Two-Layer Contractive Encodings In Proceedings of International Conference on Artificial Neural Networks (ICANN), pp. 620-628, September 202. Learning Two-Layer Contractive Encodings Hannes Schulz and Sven Behnke Rheinische Friedrich-Wilhelms-Universität

More information

Multiple Kernel Learning for Emotion Recognition in the Wild

Multiple Kernel Learning for Emotion Recognition in the Wild Multiple Kernel Learning for Emotion Recognition in the Wild Karan Sikka, Karmen Dykstra, Suchitra Sathyanarayana, Gwen Littlewort and Marian S. Bartlett Machine Perception Laboratory UCSD EmotiW Challenge,

More information

A supervised strategy for deep kernel machine

A supervised strategy for deep kernel machine A supervised strategy for deep kernel machine Florian Yger, Maxime Berar, Gilles Gasso and Alain Rakotomamonjy LITIS EA 4108 - Université de Rouen/ INSA de Rouen, 76800 Saint Etienne du Rouvray - France

More information

Deep Tracking: Biologically Inspired Tracking with Deep Convolutional Networks

Deep Tracking: Biologically Inspired Tracking with Deep Convolutional Networks Deep Tracking: Biologically Inspired Tracking with Deep Convolutional Networks Si Chen The George Washington University sichen@gwmail.gwu.edu Meera Hahn Emory University mhahn7@emory.edu Mentor: Afshin

More information

Machine Learning. The Breadth of ML Neural Networks & Deep Learning. Marc Toussaint. Duy Nguyen-Tuong. University of Stuttgart

Machine Learning. The Breadth of ML Neural Networks & Deep Learning. Marc Toussaint. Duy Nguyen-Tuong. University of Stuttgart Machine Learning The Breadth of ML Neural Networks & Deep Learning Marc Toussaint University of Stuttgart Duy Nguyen-Tuong Bosch Center for Artificial Intelligence Summer 2017 Neural Networks Consider

More information

Deep Learning. Deep Learning. Practical Application Automatically Adding Sounds To Silent Movies

Deep Learning. Deep Learning. Practical Application Automatically Adding Sounds To Silent Movies http://blog.csdn.net/zouxy09/article/details/8775360 Automatic Colorization of Black and White Images Automatically Adding Sounds To Silent Movies Traditionally this was done by hand with human effort

More information

Bilevel Sparse Coding

Bilevel Sparse Coding Adobe Research 345 Park Ave, San Jose, CA Mar 15, 2013 Outline 1 2 The learning model The learning algorithm 3 4 Sparse Modeling Many types of sensory data, e.g., images and audio, are in high-dimensional

More information

Facial Expression Recognition Using Non-negative Matrix Factorization

Facial Expression Recognition Using Non-negative Matrix Factorization Facial Expression Recognition Using Non-negative Matrix Factorization Symeon Nikitidis, Anastasios Tefas and Ioannis Pitas Artificial Intelligence & Information Analysis Lab Department of Informatics Aristotle,

More information

Introduction to Deep Learning

Introduction to Deep Learning ENEE698A : Machine Learning Seminar Introduction to Deep Learning Raviteja Vemulapalli Image credit: [LeCun 1998] Resources Unsupervised feature learning and deep learning (UFLDL) tutorial (http://ufldl.stanford.edu/wiki/index.php/ufldl_tutorial)

More information

Exploring Bag of Words Architectures in the Facial Expression Domain

Exploring Bag of Words Architectures in the Facial Expression Domain Exploring Bag of Words Architectures in the Facial Expression Domain Karan Sikka, Tingfan Wu, Josh Susskind, and Marian Bartlett Machine Perception Laboratory, University of California San Diego {ksikka,ting,josh,marni}@mplab.ucsd.edu

More information

Stacked Denoising Autoencoders for Face Pose Normalization

Stacked Denoising Autoencoders for Face Pose Normalization Stacked Denoising Autoencoders for Face Pose Normalization Yoonseop Kang 1, Kang-Tae Lee 2,JihyunEun 2, Sung Eun Park 2 and Seungjin Choi 1 1 Department of Computer Science and Engineering Pohang University

More information

Extracting and Composing Robust Features with Denoising Autoencoders

Extracting and Composing Robust Features with Denoising Autoencoders Presenter: Alexander Truong March 16, 2017 Extracting and Composing Robust Features with Denoising Autoencoders Pascal Vincent, Hugo Larochelle, Yoshua Bengio, Pierre-Antoine Manzagol 1 Outline Introduction

More information

Machine Learning 13. week

Machine Learning 13. week Machine Learning 13. week Deep Learning Convolutional Neural Network Recurrent Neural Network 1 Why Deep Learning is so Popular? 1. Increase in the amount of data Thanks to the Internet, huge amount of

More information

Neural Networks for Machine Learning. Lecture 15a From Principal Components Analysis to Autoencoders

Neural Networks for Machine Learning. Lecture 15a From Principal Components Analysis to Autoencoders Neural Networks for Machine Learning Lecture 15a From Principal Components Analysis to Autoencoders Geoffrey Hinton Nitish Srivastava, Kevin Swersky Tijmen Tieleman Abdel-rahman Mohamed Principal Components

More information

Machine Learning. Deep Learning. Eric Xing (and Pengtao Xie) , Fall Lecture 8, October 6, Eric CMU,

Machine Learning. Deep Learning. Eric Xing (and Pengtao Xie) , Fall Lecture 8, October 6, Eric CMU, Machine Learning 10-701, Fall 2015 Deep Learning Eric Xing (and Pengtao Xie) Lecture 8, October 6, 2015 Eric Xing @ CMU, 2015 1 A perennial challenge in computer vision: feature engineering SIFT Spin image

More information

CHAPTER 5 GLOBAL AND LOCAL FEATURES FOR FACE RECOGNITION

CHAPTER 5 GLOBAL AND LOCAL FEATURES FOR FACE RECOGNITION 122 CHAPTER 5 GLOBAL AND LOCAL FEATURES FOR FACE RECOGNITION 5.1 INTRODUCTION Face recognition, means checking for the presence of a face from a database that contains many faces and could be performed

More information

An Empirical Evaluation of Deep Architectures on Problems with Many Factors of Variation

An Empirical Evaluation of Deep Architectures on Problems with Many Factors of Variation An Empirical Evaluation of Deep Architectures on Problems with Many Factors of Variation Hugo Larochelle, Dumitru Erhan, Aaron Courville, James Bergstra, and Yoshua Bengio Université de Montréal 13/06/2007

More information

Learning to Recognize Faces in Realistic Conditions

Learning to Recognize Faces in Realistic Conditions 000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050

More information

An Empirical Evaluation of Deep Architectures on Problems with Many Factors of Variation

An Empirical Evaluation of Deep Architectures on Problems with Many Factors of Variation An Empirical Evaluation of Deep Architectures on Problems with Many Factors of Variation Hugo Larochelle larocheh@iro.umontreal.ca Dumitru Erhan erhandum@iro.umontreal.ca Aaron Courville courvila@iro.umontreal.ca

More information

Learning Feature Hierarchies for Object Recognition

Learning Feature Hierarchies for Object Recognition Learning Feature Hierarchies for Object Recognition Koray Kavukcuoglu Computer Science Department Courant Institute of Mathematical Sciences New York University Marc Aurelio Ranzato, Kevin Jarrett, Pierre

More information

To be Bernoulli or to be Gaussian, for a Restricted Boltzmann Machine

To be Bernoulli or to be Gaussian, for a Restricted Boltzmann Machine 2014 22nd International Conference on Pattern Recognition To be Bernoulli or to be Gaussian, for a Restricted Boltzmann Machine Takayoshi Yamashita, Masayuki Tanaka, Eiji Yoshida, Yuji Yamauchi and Hironobu

More information

Deep Learning. Vladimir Golkov Technical University of Munich Computer Vision Group

Deep Learning. Vladimir Golkov Technical University of Munich Computer Vision Group Deep Learning Vladimir Golkov Technical University of Munich Computer Vision Group 1D Input, 1D Output target input 2 2D Input, 1D Output: Data Distribution Complexity Imagine many dimensions (data occupies

More information

Deep Learning for Computer Vision II

Deep Learning for Computer Vision II IIIT Hyderabad Deep Learning for Computer Vision II C. V. Jawahar Paradigm Shift Feature Extraction (SIFT, HoG, ) Part Models / Encoding Classifier Sparrow Feature Learning Classifier Sparrow L 1 L 2 L

More information

Deep Learning. Deep Learning provided breakthrough results in speech recognition and image classification. Why?

Deep Learning. Deep Learning provided breakthrough results in speech recognition and image classification. Why? Data Mining Deep Learning Deep Learning provided breakthrough results in speech recognition and image classification. Why? Because Speech recognition and image classification are two basic examples of

More information

3D Object Recognition with Deep Belief Nets

3D Object Recognition with Deep Belief Nets 3D Object Recognition with Deep Belief Nets Vinod Nair and Geoffrey E. Hinton Department of Computer Science, University of Toronto 10 King s College Road, Toronto, M5S 3G5 Canada {vnair,hinton}@cs.toronto.edu

More information

Novel Lossy Compression Algorithms with Stacked Autoencoders

Novel Lossy Compression Algorithms with Stacked Autoencoders Novel Lossy Compression Algorithms with Stacked Autoencoders Anand Atreya and Daniel O Shea {aatreya, djoshea}@stanford.edu 11 December 2009 1. Introduction 1.1. Lossy compression Lossy compression is

More information

Text Detection and Character Recognition in Scene Images with Unsupervised Feature Learning

Text Detection and Character Recognition in Scene Images with Unsupervised Feature Learning Text Detection and Character Recognition in Scene Images with Unsupervised Feature Learning Adam Coates, Blake Carpenter, Carl Case, Sanjeev Satheesh, Bipin Suresh, Tao Wang, Andrew Y. Ng Computer Science

More information

Learning Invariant Representations with Local Transformations

Learning Invariant Representations with Local Transformations Kihyuk Sohn kihyuks@umich.edu Honglak Lee honglak@eecs.umich.edu Dept. of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI 48109, USA Abstract Learning invariant representations

More information

Frequent Inner-Class Approach: A Semi-supervised Learning Technique for One-shot Learning

Frequent Inner-Class Approach: A Semi-supervised Learning Technique for One-shot Learning Frequent Inner-Class Approach: A Semi-supervised Learning Technique for One-shot Learning Izumi Suzuki, Koich Yamada, Muneyuki Unehara Nagaoka University of Technology, 1603-1, Kamitomioka Nagaoka, Niigata

More information

Depth Image Dimension Reduction Using Deep Belief Networks

Depth Image Dimension Reduction Using Deep Belief Networks Depth Image Dimension Reduction Using Deep Belief Networks Isma Hadji* and Akshay Jain** Department of Electrical and Computer Engineering University of Missouri 19 Eng. Building West, Columbia, MO, 65211

More information

Multiresponse Sparse Regression with Application to Multidimensional Scaling

Multiresponse Sparse Regression with Application to Multidimensional Scaling Multiresponse Sparse Regression with Application to Multidimensional Scaling Timo Similä and Jarkko Tikka Helsinki University of Technology, Laboratory of Computer and Information Science P.O. Box 54,

More information

Selection of Scale-Invariant Parts for Object Class Recognition

Selection of Scale-Invariant Parts for Object Class Recognition Selection of Scale-Invariant Parts for Object Class Recognition Gy. Dorkó and C. Schmid INRIA Rhône-Alpes, GRAVIR-CNRS 655, av. de l Europe, 3833 Montbonnot, France fdorko,schmidg@inrialpes.fr Abstract

More information

Image Processing. Image Features

Image Processing. Image Features Image Processing Image Features Preliminaries 2 What are Image Features? Anything. What they are used for? Some statements about image fragments (patches) recognition Search for similar patches matching

More information

CPSC 340: Machine Learning and Data Mining. Principal Component Analysis Fall 2016

CPSC 340: Machine Learning and Data Mining. Principal Component Analysis Fall 2016 CPSC 340: Machine Learning and Data Mining Principal Component Analysis Fall 2016 A2/Midterm: Admin Grades/solutions will be posted after class. Assignment 4: Posted, due November 14. Extra office hours:

More information

arxiv: v1 [cs.lg] 20 Dec 2013

arxiv: v1 [cs.lg] 20 Dec 2013 Unsupervised Feature Learning by Deep Sparse Coding Yunlong He Koray Kavukcuoglu Yun Wang Arthur Szlam Yanjun Qi arxiv:1312.5783v1 [cs.lg] 20 Dec 2013 Abstract In this paper, we propose a new unsupervised

More information

Multi-view Facial Expression Recognition Analysis with Generic Sparse Coding Feature

Multi-view Facial Expression Recognition Analysis with Generic Sparse Coding Feature 0/19.. Multi-view Facial Expression Recognition Analysis with Generic Sparse Coding Feature Usman Tariq, Jianchao Yang, Thomas S. Huang Department of Electrical and Computer Engineering Beckman Institute

More information

Multiview Feature Learning

Multiview Feature Learning Multiview Feature Learning Roland Memisevic Frankfurt, Montreal Tutorial at IPAM 2012 Roland Memisevic (Frankfurt, Montreal) Multiview Feature Learning Tutorial at IPAM 2012 1 / 163 Outline 1 Introduction

More information

Cost-alleviative Learning for Deep Convolutional Neural Network-based Facial Part Labeling

Cost-alleviative Learning for Deep Convolutional Neural Network-based Facial Part Labeling [DOI: 10.2197/ipsjtcva.7.99] Express Paper Cost-alleviative Learning for Deep Convolutional Neural Network-based Facial Part Labeling Takayoshi Yamashita 1,a) Takaya Nakamura 1 Hiroshi Fukui 1,b) Yuji

More information

Partial Least Squares Regression on Grassmannian Manifold for Emotion Recognition

Partial Least Squares Regression on Grassmannian Manifold for Emotion Recognition Emotion Recognition In The Wild Challenge and Workshop (EmotiW 2013) Partial Least Squares Regression on Grassmannian Manifold for Emotion Recognition Mengyi Liu, Ruiping Wang, Zhiwu Huang, Shiguang Shan,

More information

Visual object classification by sparse convolutional neural networks

Visual object classification by sparse convolutional neural networks Visual object classification by sparse convolutional neural networks Alexander Gepperth 1 1- Ruhr-Universität Bochum - Institute for Neural Dynamics Universitätsstraße 150, 44801 Bochum - Germany Abstract.

More information

Face Recognition using SURF Features and SVM Classifier

Face Recognition using SURF Features and SVM Classifier International Journal of Electronics Engineering Research. ISSN 0975-6450 Volume 8, Number 1 (016) pp. 1-8 Research India Publications http://www.ripublication.com Face Recognition using SURF Features

More information

Locally Scale-Invariant Convolutional Neural Networks

Locally Scale-Invariant Convolutional Neural Networks Locally Scale-Invariant Convolutional Neural Networks Angjoo Kanazawa Department of Computer Science University of Maryland, College Park, MD 20740 kanazawa@umiacs.umd.edu Abhishek Sharma Department of

More information

Using the Deformable Part Model with Autoencoded Feature Descriptors for Object Detection

Using the Deformable Part Model with Autoencoded Feature Descriptors for Object Detection Using the Deformable Part Model with Autoencoded Feature Descriptors for Object Detection Hyunghoon Cho and David Wu December 10, 2010 1 Introduction Given its performance in recent years' PASCAL Visual

More information

Deep Learning Basic Lecture - Complex Systems & Artificial Intelligence 2017/18 (VO) Asan Agibetov, PhD.

Deep Learning Basic Lecture - Complex Systems & Artificial Intelligence 2017/18 (VO) Asan Agibetov, PhD. Deep Learning 861.061 Basic Lecture - Complex Systems & Artificial Intelligence 2017/18 (VO) Asan Agibetov, PhD asan.agibetov@meduniwien.ac.at Medical University of Vienna Center for Medical Statistics,

More information

Using Geometric Blur for Point Correspondence

Using Geometric Blur for Point Correspondence 1 Using Geometric Blur for Point Correspondence Nisarg Vyas Electrical and Computer Engineering Department, Carnegie Mellon University, Pittsburgh, PA Abstract In computer vision applications, point correspondence

More information

Deep Generative Models Variational Autoencoders

Deep Generative Models Variational Autoencoders Deep Generative Models Variational Autoencoders Sudeshna Sarkar 5 April 2017 Generative Nets Generative models that represent probability distributions over multiple variables in some way. Directed Generative

More information

Transfer Learning Using Rotated Image Data to Improve Deep Neural Network Performance

Transfer Learning Using Rotated Image Data to Improve Deep Neural Network Performance Transfer Learning Using Rotated Image Data to Improve Deep Neural Network Performance Telmo Amaral¹, Luís M. Silva¹², Luís A. Alexandre³, Chetak Kandaswamy¹, Joaquim Marques de Sá¹ 4, and Jorge M. Santos¹

More information

Aggregating Descriptors with Local Gaussian Metrics

Aggregating Descriptors with Local Gaussian Metrics Aggregating Descriptors with Local Gaussian Metrics Hideki Nakayama Grad. School of Information Science and Technology The University of Tokyo Tokyo, JAPAN nakayama@ci.i.u-tokyo.ac.jp Abstract Recently,

More information

Challenges motivating deep learning. Sargur N. Srihari

Challenges motivating deep learning. Sargur N. Srihari Challenges motivating deep learning Sargur N. srihari@cedar.buffalo.edu 1 Topics In Machine Learning Basics 1. Learning Algorithms 2. Capacity, Overfitting and Underfitting 3. Hyperparameters and Validation

More information

C. Poultney S. Cho pra (NYU Courant Institute) Y. LeCun

C. Poultney S. Cho pra (NYU Courant Institute) Y. LeCun Efficient Learning of Sparse Overcomplete Representations with an Energy-Based Model Marc'Aurelio Ranzato C. Poultney S. Cho pra (NYU Courant Institute) Y. LeCun CIAR Summer School Toronto 2006 Why Extracting

More information

Rotation Invariance Neural Network

Rotation Invariance Neural Network Rotation Invariance Neural Network Shiyuan Li Abstract Rotation invariance and translate invariance have great values in image recognition. In this paper, we bring a new architecture in convolutional neural

More information

Deep Similarity Learning for Multimodal Medical Images

Deep Similarity Learning for Multimodal Medical Images Deep Similarity Learning for Multimodal Medical Images Xi Cheng, Li Zhang, and Yefeng Zheng Siemens Corporation, Corporate Technology, Princeton, NJ, USA Abstract. An effective similarity measure for multi-modal

More information

CS 2750: Machine Learning. Neural Networks. Prof. Adriana Kovashka University of Pittsburgh April 13, 2016

CS 2750: Machine Learning. Neural Networks. Prof. Adriana Kovashka University of Pittsburgh April 13, 2016 CS 2750: Machine Learning Neural Networks Prof. Adriana Kovashka University of Pittsburgh April 13, 2016 Plan for today Neural network definition and examples Training neural networks (backprop) Convolutional

More information

A Keypoint Descriptor Inspired by Retinal Computation

A Keypoint Descriptor Inspired by Retinal Computation A Keypoint Descriptor Inspired by Retinal Computation Bongsoo Suh, Sungjoon Choi, Han Lee Stanford University {bssuh,sungjoonchoi,hanlee}@stanford.edu Abstract. The main goal of our project is to implement

More information

Unsupervised learning in Vision

Unsupervised learning in Vision Chapter 7 Unsupervised learning in Vision The fields of Computer Vision and Machine Learning complement each other in a very natural way: the aim of the former is to extract useful information from visual

More information

On Deep Generative Models with Applications to Recognition

On Deep Generative Models with Applications to Recognition On Deep Generative Models with Applications to Recognition Marc Aurelio Ranzato Joshua Susskind Department of Computer Science University of Toronto ranzato,vmnih,hinton@cs.toronto.edu Volodymyr Mnih Geoffrey

More information

Learning Convolutional Feature Hierarchies for Visual Recognition

Learning Convolutional Feature Hierarchies for Visual Recognition Learning Convolutional Feature Hierarchies for Visual Recognition Koray Kavukcuoglu, Pierre Sermanet, Y-Lan Boureau, Karol Gregor, Michael Mathieu, Yann LeCun Computer Science Department Courant Institute

More information

Supplementary material for the paper Are Sparse Representations Really Relevant for Image Classification?

Supplementary material for the paper Are Sparse Representations Really Relevant for Image Classification? Supplementary material for the paper Are Sparse Representations Really Relevant for Image Classification? Roberto Rigamonti, Matthew A. Brown, Vincent Lepetit CVLab, EPFL Lausanne, Switzerland firstname.lastname@epfl.ch

More information

Neural Networks. CE-725: Statistical Pattern Recognition Sharif University of Technology Spring Soleymani

Neural Networks. CE-725: Statistical Pattern Recognition Sharif University of Technology Spring Soleymani Neural Networks CE-725: Statistical Pattern Recognition Sharif University of Technology Spring 2013 Soleymani Outline Biological and artificial neural networks Feed-forward neural networks Single layer

More information

Stacks of Convolutional Restricted Boltzmann Machines for Shift-Invariant Feature Learning



