Disentangling factors of variation for facial expression recognition
Salah Rifai, Yoshua Bengio, Aaron Courville, Pascal Vincent, and Mehdi Mirza
Université de Montréal, Department of Computer Science and Operations Research

Abstract. We propose a semi-supervised approach to the task of emotion recognition in 2D face images, using recent ideas in deep learning for handling the factors of variation present in data. An emotion classification algorithm should be robust to both (1) the variations in the pose of the face that remain after centering and alignment, and (2) the identity or morphology of the face. To achieve this invariance, we propose to learn a hierarchy of features in which we gradually filter out the factors of variation arising from both (1) and (2). We address (1) by using a multi-scale contractive convolutional network (CCNET) to obtain invariance to translations of the facial traits in the image. Using the feature representation produced by the CCNET, we train a Contractive Discriminative Analysis (CDA) feature extractor, a novel variant of the Contractive Auto-Encoder (CAE), designed to learn a representation that separates the emotion-related factors from the others (which mostly capture the subject identity, and what is left of pose after the CCNET). This system beats the state of the art on a recently proposed dataset for facial expression recognition, the Toronto Face Database, moving the state-of-the-art accuracy from 82.4% to 85.0%, while the CCNET and CDA improve the accuracy of a standard CAE by 8%.

Key words: emotion recognition, contractive, convolution, deep learning, auto-encoder, TFD

1 Introduction

A central challenge in computer vision is to disentangle the various factors of variation that explain an image, such as object pose, identity, or various other attributes. This is particularly important for facial expression recognition, the central topic of this paper.
Disentangling can be done by exploiting two sources of information, a priori knowledge about these factors and examples, combined by learning algorithms tailored to perform this kind of disentangling. While our central contribution is in such learning algorithms, this paper is really about how we can combine the two. It exploits advances in machine learning of representations (i.e., sets of features) at multiple levels, a form of automatic feature extraction pipeline called deep learning [1]. It is based on unsupervised learning that captures the leading local directions of variation present in the data, a kind of non-linear manifold learning [2]. The specific unsupervised learning algorithm that we build upon here is the Contractive Auto-Encoder (CAE) [3].

The proposed system is semi-supervised: it uses the emotion label as a hint about the factor of variation of interest and combines that hint with an unsupervised training criterion to separate out one factor from the others. At the top level of the feature hierarchy, two blocks of features are trained, with the features in one block being sensitive to emotion and more invariant to the other factors, while the features in the other block are trained to be insensitive to emotion changes. This is achieved by adding together the discriminant criterion (predicting emotion), a reconstruction error, the CAE contractive penalty, and a novel local orthogonality penalty that encourages the two blocks of features to vary in directions orthogonal to each other, i.e., such that features specialize to some factors (such as emotions) while being insensitive to others. The resulting algorithm is termed Contractive Discriminative Analysis (CDA). The proposed facial expression recognition system is evaluated on a recently proposed benchmark dataset, the Toronto Face Database [4], and yields results that beat the state of the art on this dataset, showing the improvement brought by CDA.

2 Background

2.1 Facial Expression Recognition

Despite receiving considerable attention over the past 15 years [5, 6], the automatic recognition of facial expression remains a very challenging problem domain. One of the major obstacles to performance is that aspects of the data associated with facial expressions are tightly intertwined with other factors evident in the data. These other factors are primarily associated with subject identity (facial morphology) and pose. With respect to the task of facial expression detection these can be considered nuisance factors.
The challenge with facial expression recognition is that these nuisance factors often dominate the representation of the image in pixel space. Two images of different individuals with the same facial expression are likely to be well separated in pixel space, while two images of the same individual showing different expressions may well be found very close together.

2.2 The Toronto Face Database

Beyond the inherent challenges in learning to recognize facial expressions, the relative paucity of easily accessible data has been a significant barrier to progress. Until recently, datasets have been limited to a relatively small number of subjects displaying different expressions, with exogenous factors such as illumination and pose being carefully controlled. The recent introduction of the Toronto Face Database (TFD) [4] is a significant step forward in our ability to build robust recognition systems (examples shown in Fig. 3(a)). The TFD is a conglomeration of a large number of smaller face datasets, with each image aligned and rescaled to a uniform size of 48 × 48 pixels. The dataset consists of 4,178 expression-labeled images, 3,874 of which also possess subject identity labels. There is also a very large (112,234 image) unlabeled dataset that, while missing expression label information, is preprocessed in the same way as the labeled data. Because we exploit unsupervised and semi-supervised learning procedures, our approach can take advantage of additional unlabeled data to learn better features.

2.3 Invariant Features: the Standard Pipeline

Facial expression recognition is certainly not unique among vision tasks in having to deal with nuisance factors of variation in the data. In other object and scene recognition settings, nuisance factors are typically related to object pose and illumination conditions. In recent years, solution strategies for these tasks have largely converged toward a multistage processing pipeline [7, 8]. First, local low-level features, such as SIFT [9], HoG [10] or, in the case of facial expression recognition, oriented Gabor filter banks [11], are extracted from patches of the image. Next, these features are spatially aggregated, or pooled, over different regions of the image and sometimes at different spatial resolutions. The output of each pooling unit is a sum, mean or maximum over the outputs of a filter bank over a small spatial area. The aggregate features are then mapped into a vector image representation that is used as input to a general-purpose classifier such as a linear support vector machine (SVM). The success of this approach can be attributed in good part to the quality of the feature representation used as input to the classifier. If this representation is invariant to the nuisance factors of variation while maintaining sensitivity along the relevant factors, then one would expect the system to generalize well to new examples.

2.4 Invariant Features: Unsupervised Feature Learning

The features used in the above pipeline can be hand-crafted or can be learned.
Much research has been done in recent years on generic as well as image-specific feature learning algorithms. Most of these algorithms exploit unsupervised learning and can therefore be applied even in the absence of labels. When a hierarchy of features is trained, these are called deep models [1]. These hierarchical feature learning approaches are based on the unsupervised training of single-layer models such as an RBM, sparse coding or auto-encoder variants. One particularly successful recent variant of the auto-encoder is the contractive auto-encoder (CAE) [3]. The principle underlying the CAE is that locally invariant features can be induced through activity-dependent regularization. The regularization penalty discourages changes in the features associated with small changes in the input image. The mathematical details of the CAE are provided below (Section 4). Compared to the standard multistage pipeline, feature learning pursues a different and complementary route toward constructing invariant representations with good generalization properties. While the standard pipeline builds invariance to known nuisance factors of variation by aggregating over lower-level features that vary across these factors, unsupervised feature learning includes all significant factors, including the nuisance factors, but to the extent that these factors are statistically independent in the data, it tends to represent these factors separately. In the ideal scenario, while the learned
representation still contains the nuisance factors, they are disentangled from the relevant factors and can thus be more easily ignored by the subsequent classifier.

Both the hand-built features and feature-learning approaches have their advantages. The standard pipeline exploits domain knowledge about the spatial relationship between pixels to construct features that are invariant to simple transformations such as translation. On the other hand, a feature-learning strategy may be more successful at building features that are invariant to factors that do not correspond to simple transformations. In the case of facial expression recognition, many of the factors associated with facial morphology fall into this category. While invariance to simple transformations such as local translations, rotations and scaling goes some way toward removing sensitivity to facial morphology, it stops well short of capturing all variations in facial characteristics across the human population. Yet, unfortunately, due to the tight coupling of the factors of variation underlying facial expression and morphology, even our most successful unsupervised learning-based approaches are unlikely to satisfactorily disentangle these factors on their own.

3 Proposed Approach

In this work, we deal directly with this issue of entangled factors of variation by developing a semi-supervised feature learning strategy that combines the advantages of both a feature learning approach and the feature pooling pipeline. Ours is a hierarchical (or deep) modeling approach. At each layer, the features become increasingly invariant to nuisance factors while maintaining discriminative information with respect to the task of facial expression recognition. Our approach can be broken down into the following three stages of the learning procedure.

1. We use the CAE algorithm (described in Section 4) to learn locally invariant image features from image patches at multiple resolutions. The CAE-derived feature extractors are applied convolutionally to the entire image to form a series of feature maps, each corresponding to a single learned CAE feature and to a convolution kernel. These feature maps are then decimated via max-pooling in regular non-overlapping regions to form a local-translation-invariant (LTI) representation. This first stage is termed the Contractive Convolutional Network (CCNET) and is described in more detail in Section 5 below.
2. The LTI features are then used as input to a novel semi-supervised, CAE-based feature-learning approach we call Contractive Discriminative Analysis (CDA). The basic approach is to divide the features to be learned at this layer into two blocks. While the blocks are trained to cooperate to reconstruct their mutual input, one of these blocks (the discriminative feature block) is also trained to predict the facial expression class on examples where label information is available. Our objective in segregating the features in this way is to tease apart the discriminative features that learn to encode useful information about facial expressions from nuisance features (that are complementary but not task-discriminative). We further include a novel CAE-inspired penalty that locally encourages the discriminative features and non-discriminative features to encode distinct directions of variation in the input. The resulting CDA learning algorithm is described formally in Section 6.
3. Finally, following the standard pipeline, the discriminative features are used as input to train a linear SVM on the labeled training data.

Fig. 1. Classification pipeline. At left: input image at different resolutions. The next stage contains the output of the K convolutional feature maps, followed by the max-pooling ($y$). The CDA produces the last stage of (discriminant) features $h^{(d)}$ through the weight matrix $W$, which are then fed to a linear SVM.

Once the system has been trained, the learned features form the basis of a multistage classification pipeline, similar to commonly used classification pipelines [7, 8]. To summarize, the computational stages of the classification pipeline (which closely follow the training pipeline outlined above) are as follows.

1. The multi-resolution CAE features are convolved over the entire image. This produces a set of feature maps (one for each CAE feature).
2. The convolutional CAE feature maps are decimated (via max-pooling) to a coarse grid (2 × 2 or 3 × 3) over the image.
3. These decimated features are then concatenated and the CDA discriminative feature encoding is applied to this concatenation. Note that when performing classification, we no longer need to compute the block of non-discriminative CDA features.
4. Finally, the CDA encoding (the discriminative block) is passed to the linear SVM to obtain a class prediction.

In the case of facial expression recognition, these class predictions correspond to one of the seven recognized expressions: happy, sad, surprised, anger, disgust, fear and neutral. Fig. 1 illustrates the classification pipeline.
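The first three computational stages listed above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`conv_feature_maps`, `grid_max_pool`, `lti_features`), the random kernels, the use of only 4 kernels per scale, and the way the pooling grid is split when the map size is not divisible by the grid size are all choices made for the example. The 48 × 48 input, the 14 and 18 kernel sizes and the 3 × 3 grid are the values reported in the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv_feature_maps(image, kernels, biases):
    """Stage 1: apply each s-by-s kernel at every valid image location.

    kernels has shape (K, s, s). Each kernel is applied as a sliding dot
    product (cross-correlation), followed by the logistic sigmoid.
    """
    s = kernels.shape[1]
    # All s-by-s patches of the image: shape (H-s+1, W-s+1, s, s)
    patches = np.lib.stride_tricks.sliding_window_view(image, (s, s))
    pre = np.einsum("hwab,kab->khw", patches, kernels)
    return sigmoid(pre + biases[:, None, None])

def grid_max_pool(maps, grid):
    """Stage 2: max over grid-by-grid non-overlapping regions of each map."""
    k, h, w = maps.shape
    # Assumption: regions are as equal as possible when h or w % grid != 0
    rows = np.array_split(np.arange(h), grid)
    cols = np.array_split(np.arange(w), grid)
    pooled = np.empty((k, grid, grid))
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            region = maps[:, r[0]:r[-1] + 1, c[0]:c[-1] + 1]
            pooled[:, i, j] = region.max(axis=(1, 2))
    return pooled.reshape(k, -1)

def lti_features(image, kernel_banks, grid=3):
    """Stage 3 input: concatenate pooled features over all kernel sizes."""
    parts = [grid_max_pool(conv_feature_maps(image, K, b), grid).ravel()
             for K, b in kernel_banks]
    return np.concatenate(parts)

rng = np.random.default_rng(0)
image = rng.random((48, 48))                      # stand-in for a TFD face
banks = [(0.01 * rng.normal(size=(4, 14, 14)), np.zeros(4)),
         (0.01 * rng.normal(size=(4, 18, 18)), np.zeros(4))]
y = lti_features(image, banks, grid=3)
print(y.shape)  # (72,) = 2 scales x 4 kernels x 9 pooling regions
```

In the full pipeline, the vector `y` would then be passed to the CDA discriminative encoder and the resulting features to a linear SVM.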
4 Contractive Auto-Encoder (CAE)

In this section, we briefly describe the CAE algorithm that is used for unsupervised feature learning. We closely follow the description of [3] for unsupervised learning of a non-linear feature extractor from a dataset $D = \{x_1, \ldots, x_n\}$. Examples $x_i \in \mathbb{R}^d$ are i.i.d. samples from an unknown distribution $p(x)$.

4.1 Auto-Encoders

The auto-encoder framework is one of the oldest and simplest techniques for the unsupervised learning of non-linear feature extractors. It learns an encoder function $h$ that maps an input $x \in \mathbb{R}^d$ to a feature vector $h(x) \in \mathbb{R}^{d_h}$, jointly with a decoder function $g$ that maps $h$ back to the input space as $r = g(h(x))$, the reconstruction of $x$. The encoder and decoder's parameters $\theta$ can be optimized by stochastic gradient descent to minimize the average reconstruction error $L(x, g(h(x)))$ for the examples of the training set. The objective being minimized is:

$$J_{AE}(\theta) = \sum_{x \in D} L(x, g(h(x))). \qquad (1)$$

We will use the most common forms of encoder, decoder, and reconstruction error:

Encoder: $h(x) = s(Ax + \alpha)$, where $s$ is the element-wise logistic sigmoid $s(z) = \frac{1}{1 + e^{-z}}$. Parameters are a $d_h \times d$ weight matrix $A$ and bias vector $\alpha \in \mathbb{R}^{d_h}$.
Decoder: $\hat{x} = g(h(x)) = s_2(A^T h(x) + \beta)$. Parameters are $A^T$ (tied weights, shared with the encoder) and bias vector $\beta \in \mathbb{R}^d$. Activation function $s_2$ is either a logistic sigmoid ($s_2 = s$) or the identity (linear decoder).
Loss function: squared error, $L_{RECON}(x, \hat{x}) = \|x - \hat{x}\|^2$.

The set of parameters of such an auto-encoder is $\theta = \{A, \alpha, \beta\}$.

4.2 Contractive Regularization

For auto-encoders to learn something meaningful, they must have low reconstruction error on the training examples but large reconstruction error for most other input configurations. One way to achieve this is with the contractive penalty of the Contractive Auto-Encoder (CAE), introduced by [3]. This penalty term encourages robustness of the feature vector $h(x)$ to small variations of a training input $x$, by penalizing its sensitivity to that input, measured as the Frobenius norm of the encoder's Jacobian $J(x) = \frac{\partial h}{\partial x}(x)$. The regularized objective minimized by the CAE is the following:

$$J_{CAE}(\theta) = \sum_{x \in D} \left( L(x, g(h(x))) + \lambda \|J(x)\|_F^2 \right), \qquad (2)$$

where $\lambda$ is a non-negative regularization hyper-parameter that controls how strongly the norm of the Jacobian is penalized, and $\|\cdot\|_F^2$ is the squared Frobenius matrix norm (the sum of the squares of the matrix elements). Note that, with the traditional sigmoid encoder form given above, one can easily obtain the Jacobian of the encoder. Its $j$-th row is obtained from the $j$-th row $A_j$ of $A$ as:

$$J(x)_j = \frac{\partial h_j(x)}{\partial x} = h_j(x)(1 - h_j(x)) A_j. \qquad (3)$$
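As an illustration of Eqs. (2) and (3), the sketch below computes the per-example CAE objective for a sigmoid encoder with tied weights, and checks the closed-form contractive penalty against a finite-difference Jacobian. The function names, the toy dimensions and the random parameters are assumptions made for this example only; the decoder uses $s_2 = s$, one of the two options given in the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cae_objective_terms(x, A, alpha, beta, lam):
    """Return (reconstruction error, contractive penalty, total) for one x."""
    h = sigmoid(A @ x + alpha)             # encoder h(x) = s(Ax + alpha)
    x_hat = sigmoid(A.T @ h + beta)        # tied-weight decoder with s_2 = s
    recon = np.sum((x - x_hat) ** 2)       # squared error
    # Eq. (3): row j of J(x) is h_j (1 - h_j) A_j, hence
    # ||J(x)||_F^2 = sum_j (h_j (1 - h_j))^2 * ||A_j||^2
    penalty = np.sum((h * (1.0 - h)) ** 2 * np.sum(A ** 2, axis=1))
    return recon, penalty, recon + lam * penalty

def numerical_jacobian(x, A, alpha, eps=1e-6):
    """Central finite-difference estimate of dh/dx, used only as a check."""
    J = np.empty((A.shape[0], x.size))
    for k in range(x.size):
        e = np.zeros(x.size)
        e[k] = eps
        J[:, k] = (sigmoid(A @ (x + e) + alpha)
                   - sigmoid(A @ (x - e) + alpha)) / (2.0 * eps)
    return J

rng = np.random.default_rng(0)
d, d_h = 5, 3                               # toy sizes, not the paper's
x = rng.random(d)
A = rng.normal(size=(d_h, d))
alpha = rng.normal(size=d_h)
beta = np.zeros(d)

recon, penalty, total = cae_objective_terms(x, A, alpha, beta, lam=0.1)
penalty_fd = np.sum(numerical_jacobian(x, A, alpha) ** 2)
print(np.isclose(penalty, penalty_fd, rtol=1e-5))
```

The closed form costs about as much as one pass through the encoder, which is why the penalty is cheap relative to forming the full Jacobian.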
Computing the extra penalty term (and its contribution to the gradient) is similar to computing the reconstruction error term (and its contribution to the gradient), and is thus relatively cheap.

The effect of training a CAE is that the resulting features tend to be sparsely active: only a few of the features have a significantly non-zero derivative (i.e., their sigmoid is neither saturated near 0 nor saturated near 1). The set of active features depends on the current input $x$. Those active features respond almost linearly to changes in the input, and they provide a local basis for the variations around $x$, in some privileged directions, which are those to which they respond (corresponding to their weight vectors, e.g., $A_j$ for feature $h_j$). When the data congregate near a low-dimensional manifold around $x$, only a few features are active in the neighborhood of $x$. Hence the locally active features form a coordinate system for a region in input space, corresponding to a chart, and the overall set of such regions forms an atlas of charts [3] mapping out a non-linear manifold near which the estimated input density concentrates.

5 Contractive Convolutional Network (CCNET)

Convolutional neural networks are generalizations of neural networks which have been particularly successful in computer vision [12-16]. In a way similar to what has been done to generalize sparse coding and RBMs to the convolutional setting, we generalize CAEs to the convolutional setting, i.e., each convolutional feature output sees a spatially local region in the input image (called the receptive field), while sharing the parameters of that feature (convolution kernel) with other features that have a receptive field located elsewhere. This is equivalent to replacing the neural network matrix multiplication found in linear feature extraction by a series of convolutions, which correspond to sparse structured matrices.
Whereas the usual applications of convolutional networks use a single common receptive field size (convolution kernel size), we consider multiple sizes, allowing the model to capture structure at different scales (see also [17] for a similar approach, with two scales). As done previously [14, 8], we chose to initialize the convolutional network with filters pre-trained patch-wise by unsupervised learning (here as a CAE), i.e., an ordinary CAE was trained on patches extracted randomly at different locations, whose size matches that of the convolutional kernels being learned. We pre-train a CAE independently for each kernel size. Assuming that we have a set of $n$ different sizes $s_i$ (footnote 1), we denote as follows the output on patches $x$ of the CAE trained with the $i$-th patch size:

$$h^i(x) = s(A^i x + \alpha^i)$$

We compute the corresponding feature maps on the whole image by applying $h^i$ to each $s_i$-by-$s_i$ patch of the input image,

$$f^i(x) = s(\mathrm{conv}(A^i, x) + \alpha^i),$$

followed by a maximum-pooling on $p$ uniformly divided non-overlapping regions $Q = \{q_1, \ldots, q_p\}$:

$$F^i_j(x) = \max_{k \in q_j} f^i_k(x)$$

Footnote 1: The results on TFD were obtained with (14, 14) and (18, 18) scales.

The final output of the convolutional layer is defined as the stack of all the pooled features of each CAE into one long vector:

$$F(x) = [F^1(x), \ldots, F^n(x)] \quad \text{with} \quad F^i(x) = [F^i_1(x), \ldots, F^i_p(x)]$$

where we take $y = F(x)$ to be the LTI representation given as input to the higher-level CAEs that will perform the disentangling of the remaining factors of variation.

6 Contractive Discriminative Analysis

In this section we describe our major technical contribution, the Contractive Discriminative Analysis (CDA). CDA is a semi-supervised version of CAE that promotes the disentangling of discriminative factors of variation in the data from other prominent factors that may well dominate the discriminative factors. Our goal, in deriving CDA, is to separate the factors of the image that are discriminative with respect to the facial expression recognition task from factors that characterize facial morphology and pose. CDA is an extension of the CAE framework (introduced in Section 4). While the standard CAE encodes the data into a single feature vector $h(x)$, CDA learns an encoder function that maps an input into two (or more) distinct blocks of features: one that encodes discriminative factors of its input, $h^{(d)}(y) = s(Wy + c)$, and one (or more) that encode all other factors, $h^{(o)}(y) = s(Vy + b)$. Both feature blocks are trained to cooperate to reconstruct their common input $y$ with a reconstruction loss function, e.g.,

$$L_{RECON}(y, \hat{y}) = \|y - \hat{y}\|^2 \qquad (4)$$

where $\hat{y}$ is the CDA reconstruction, given by a linear combination of learned features:

$$\hat{y} = g([h^{(d)}(y), h^{(o)}(y)]) = s_2(W^T h^{(d)}(y) + V^T h^{(o)}(y) + \rho), \qquad (5)$$

where $\rho_i$ is an offset to capture the average value of $y_i$.
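The two-block encoding and the shared reconstruction of Eqs. (4) and (5) can be sketched as follows. This is an illustrative sketch only: the function names, the toy dimensions, the random initialization and the default choice of a linear decoder ($s_2$ = identity) are assumptions, not details confirmed by the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cda_encode(y, W, c, V, b):
    """Encode y into the discriminative block h_d and the other block h_o."""
    h_d = sigmoid(W @ y + c)
    h_o = sigmoid(V @ y + b)
    return h_d, h_o

def cda_reconstruct(h_d, h_o, W, V, rho, linear=True):
    """Eq. (5): both blocks cooperate, through tied weights, to rebuild y."""
    pre = W.T @ h_d + V.T @ h_o + rho
    return pre if linear else sigmoid(pre)

def recon_loss(y, y_hat):
    """Eq. (4): squared reconstruction error."""
    return float(np.sum((y - y_hat) ** 2))

rng = np.random.default_rng(0)
d_y, n_d, n_o = 8, 3, 4                    # toy sizes, not the paper's
y = rng.random(d_y)
W = 0.1 * rng.normal(size=(n_d, d_y)); c = np.zeros(n_d)
V = 0.1 * rng.normal(size=(n_o, d_y)); b = np.zeros(n_o)
rho = np.zeros(d_y)

h_d, h_o = cda_encode(y, W, c, V, b)
y_hat = cda_reconstruct(h_d, h_o, W, V, rho)
loss = recon_loss(y, y_hat)
print(h_d.shape, h_o.shape, y_hat.shape)
```

At classification time only `h_d` is needed, so the `V` and `b` parameters of the non-discriminative block can be skipped.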
In addition, the $h^{(d)}(y)$ block is also trained to predict the facial expression label $z(y)$ when that information is available. The class prediction $\hat{z}_i = P(z = i \mid y)$ is given by the sigmoid function $s(\cdot)$ over an affine transformation of the discriminative block, similarly to logistic regression:

$$\hat{z}_i = s\left(U_i h^{(d)}(y) + a_i\right), \qquad (6)$$

where the vector $U_i$ maps the discriminative block $h^{(d)}(y)$ to the prediction for class $i$, and $a_i$ is the class-specific bias. The corresponding discriminant component of the overall loss function is:

$$L_{DISC}(z, \hat{z}) = -\sum_{i=1}^{C} \left[ z_i \log \hat{z}_i + (1 - z_i) \log(1 - \hat{z}_i) \right] \qquad (7)$$

with $(x, z) \in L$, the labeled training set with input image $x$ and expression label $z$ (represented as a one-hot vector), and with the $x$'s in $L$ a subset of the set of all input examples $D$ (some of which are unlabeled). To obtain semi-supervised training we add a CAE-inspired contractive penalty $J_{CDA}(y)$:

$$J_{CDA}(y) = \left\| \frac{\partial h^{(d)}(y)}{\partial y} \right\|_F^2 + \left\| \frac{\partial h^{(o)}(y)}{\partial y} \right\|_F^2 + \gamma \sum_{i,j} \left( \frac{\partial h^{(d)}_i(y)}{\partial y} \cdot \frac{\partial h^{(o)}_j(y)}{\partial y} \right)^2. \qquad (8)$$

The first two terms penalize sensitivity in $h^{(d)}(y)$ and $h^{(o)}(y)$ respectively to local variations in $y$ (as in the standard CAE), but crucially the third term encourages $h^{(d)}(y)$ and $h^{(o)}(y)$ to represent different directions of variation in the input $y$, by asking each sensitivity vector $\frac{\partial h^{(d)}_i(y)}{\partial y}$ of the $i$-th discriminant feature to prefer being orthogonal to every sensitivity vector $\frac{\partial h^{(o)}_j(y)}{\partial y}$ associated with the $j$-th non-discriminant feature. The addition of this term to the CDA cost function is crucial in achieving our performance results. As we discuss later, it regularizes the CDA discriminative features in a manner analogous to how partial least squares can be interpreted as a regularized variant of canonical correlation analysis [18]. The coefficient $\gamma$ modulates the relative contribution of the orthogonalization penalty to the overall CDA contractive penalty. Putting all the components of the CDA loss function together we get:

$$L_{CDA}(\theta) = \sum_{x \in D,\, y = F(x)} \left[ L_{RECON}(y, \hat{y}) + \eta J_{CDA}(y) \right] + \sum_{(x, z) \in L,\, y = F(x)} L_{DISC}(z, \hat{z}) \qquad (9)$$

The coefficient $\eta$ weighs the contribution of the contractive penalty. The set of CDA parameters is $\theta = \{U, V, W, a, b, c, \rho\}$. The CDA training procedure is illustrated in Fig. 2.
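The contractive penalty of Eq. (8) and the discriminant loss of Eqs. (6) and (7) can be written compactly for sigmoid blocks, since each block's Jacobian has the same closed form as in Eq. (3). The sketch below is a hedged illustration: the function names, the toy dimensions and the random parameters are assumptions. It also shows that when the two blocks read disjoint input coordinates, their sensitivity vectors are orthogonal and the third term vanishes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def block_jacobian(M, bias, y):
    """Jacobian of s(My + bias) w.r.t. y: row i is h_i (1 - h_i) M_i."""
    h = sigmoid(M @ y + bias)
    return (h * (1.0 - h))[:, None] * M

def cda_penalty(y, W, c, V, b, gamma):
    """Eq. (8): two contractive terms plus the orthogonality term."""
    J_d = block_jacobian(W, c, y)
    J_o = block_jacobian(V, b, y)
    cross = J_d @ J_o.T    # entry (i, j) is the dot product of the
                           # i-th discriminant and j-th other sensitivity
    return float(np.sum(J_d ** 2) + np.sum(J_o ** 2)
                 + gamma * np.sum(cross ** 2))

def disc_loss(y, W, c, U, a, z):
    """Eqs. (6)-(7): per-class sigmoid outputs with cross-entropy loss."""
    h_d = sigmoid(W @ y + c)
    z_hat = sigmoid(U @ h_d + a)
    return float(-np.sum(z * np.log(z_hat) + (1 - z) * np.log(1 - z_hat)))

rng = np.random.default_rng(0)
y = rng.random(6)                                   # toy LTI vector
W = rng.normal(size=(3, 6)); c = np.zeros(3)
V = rng.normal(size=(4, 6)); b = np.zeros(4)
U = rng.normal(size=(7, 3)); a = np.zeros(7)
z = np.eye(7)[2]                                    # one-hot label, C = 7

# Blocks restricted to disjoint input coordinates: orthogonal sensitivities
W_split = W.copy(); W_split[:, 3:] = 0.0
V_split = V.copy(); V_split[:, :3] = 0.0

p_plain = cda_penalty(y, W, c, V, b, gamma=0.0)
p_ortho = cda_penalty(y, W, c, V, b, gamma=1.0)
p_split_0 = cda_penalty(y, W_split, c, V_split, b, gamma=0.0)
p_split_5 = cda_penalty(y, W_split, c, V_split, b, gamma=5.0)
loss_disc = disc_loss(y, W, c, U, a, z)
print(p_ortho > p_plain, np.isclose(p_split_0, p_split_5))
```

For one labeled example, the summand of Eq. (9) would then be the reconstruction error plus `eta * cda_penalty(...)` plus `disc_loss(...)`, with `eta` a hypothetical variable holding the coefficient $\eta$.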
As expressed here, CDA strictly disentangles discriminative factors from other prominent factors in the data. However, one could easily generalize the method to incorporate any form of additional side information that could be used to further disentangle factors of variation. This would be achieved by creating additional blocks and associating each of them with a set of predictive parameters that map the features in the block to the values of the factor of interest (like $U$ and $a$ above).
Fig. 2. Illustration of the architecture and training procedure for CDA, which learns and separates two blocks of features (emotion-discriminant features $h^{(d)}$ and features $h^{(o)}$ capturing the other factors).

7 Connections to Previous Work

The name Contractive Discriminative Analysis (CDA) was inspired by the connection between our CAE-based approach and earlier linear methods such as linear discriminant analysis (LDA) [19], canonical correlation analysis (CCA) [20] and partial least squares (PLS) [21]. In fact, in the case of linear activation functions (with $h(y) = Wy$) and no orthogonality penalty ($\eta = 0$), the discriminative features that would be discovered by CDA would span the same subspace (in the non-overcomplete setting) as LDA and CCA (footnote 2).

Our use of the orthogonality-inducing contraction penalty (with CAE contraction coefficient $\lambda > 0$) has an important effect on the learned features. As previously mentioned, it acts as an additional regularization term on the discriminative features learned by CDA. Interestingly, in the linear setting (with $h(y) = Wy$), the effect of this penalty is to encourage the discriminative and non-discriminative features to be mutually orthogonal. This is reminiscent of the difference between CCA, which seeks a linear projection of the input that maximizes correlation with the label encoding, and PLS, which seeks an equivalent linear projection but rather maximizes covariance. PLS is considered a regularized form of CCA since it forces the projection to preserve additional information about the input, specifically in the covariance directions. By penalizing non-orthogonality in the projecting matrix, our CDA penalty acts in a way very similar to PLS.

Another interesting connection can be drawn to deep learning techniques that combine a supervised objective with an unsupervised objective when learning a feature set.
This started with the idea of partially supervised training in [23], where the RBM or auto-encoder gradient (or estimated gradient) is added to the gradient of a global supervised objective for the deep network. A related idea was proposed in [24], which made it possible to train fairly deep networks in a semi-supervised setting (where only a few examples have a label, i.e., the supervised gradient is only added on these). A hybrid of discriminant (conditional log-likelihood) and generative (joint log-likelihood) gradients was also used to train discriminant RBMs [25]. The most significant difference between the CDA and these other semi-supervised feature learning strategies is that CDA explicitly deals with nuisance factors by relegating them to the non-discriminative feature set. These other approaches use the labels to encourage discriminative features while relying on model capacity limitations to filter out the nuisance factors. In the CDA there is a deliberate and controlled loss of information (in directions of variance that correspond to these nuisance factors) in the discriminative feature block.

Footnote 2: In the discriminative setting, where one of the two projected matrices contains only label information, the CCA and LDA directions are the same [22].

Fig. 3. Left: Example images from the Toronto Face Database [4]. Center and right: Convolutional kernels learnt by the CAE (center: 14 × 14, right: 18 × 18). Smaller kernel sizes tend to learn features that are more local in the 2D image space.

8 Experiments and Results

For our experiments, we use the same setup as [4] and [14]. We use the same 5 standard splits (folds) of the Toronto Face Database to repeatedly train our model and evaluate its performance for emotion classification (footnote 3). For the CCNET training stage, since it is entirely unsupervised, we used the 112,234 unlabeled faces (48 × 48 grayscale images). More specifically, the CCNET was used to learn 512 convolutional kernels of size 14 × 14 and 512 of size 18 × 18. Figure 3 shows some of the learned kernels. Each post-sigmoid feature map obtained by applying one of these 1024 kernels was max-pooled within 3 × 3 regions, yielding 1024 × 9 = 9216 features in total.

For the following CDA stage, training examples were sampled with 50% probability from the TFD unlabeled set and 50% probability from the TFD labeled training set (of the considered split). This ensures that the less numerous labeled examples are seen often enough during training, since they contain crucial information that we do not want to be swamped by the signals brought by the unlabeled examples. For each TFD standard split, the CDA was trained to extract 1000 discriminative features and 1000 non-discriminative features. These features were then fed to a linear SVM for assessing final classification performance.
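The size of the LTI representation reported above follows directly from the kernel counts and the pooling grid. The short check below reproduces that arithmetic; the feature-map sizes it prints additionally assume a "valid" convolution with stride 1 (no padding), which the paper does not state explicitly.

```python
image_size = 48
kernels_per_scale = {14: 512, 18: 512}   # kernel size -> number of kernels
pool_grid = 3                            # 3 x 3 pooling regions per map

n_maps = sum(kernels_per_scale.values())
n_features = n_maps * pool_grid * pool_grid

# Assumed: valid convolution, stride 1, so a map has (48 - s + 1) rows
map_sizes = {s: image_size - s + 1 for s in kernels_per_scale}

print(n_maps, n_features)   # 1024 9216
print(map_sizes)            # {14: 35, 18: 31}
```

The 9216-dimensional vector is the input $y$ of the CDA stage, which maps it to 1000 discriminative and 1000 non-discriminative features.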
Performance averaged over the 5 splits was 43.01% accuracy when using the non-discriminative features versus 85.06% when using the discriminative features. This indicates that the CDA criterion was indeed able to disentangle the features most relevant for emotion classification from other aspects of the faces.

3 Due to time constraints, we concentrated on the first fold for tuning the model's hyper-parameters. We retained the hyper-parameter values that yielded the best linear SVM performance on that first fold's validation set, and used them unchanged for the other folds. This strategy was used to select both the CCNET hyper-parameters (kernel sizes, CAE regularization strength λ, and pooling regions) and the CDA hyper-parameters (number of discriminative and non-discriminative features, and γ) that are reported in the main text.

Fig. 4. Singular vectors associated with the largest singular values of the (left) emotion Jacobian $\partial h^{(d)} / \partial x$ and (right) other-factors Jacobian $\partial h^{(o)} / \partial x$. We can notice that $\partial h^{(d)} / \partial x$ is mostly sensitive to face parts associated with emotions, while $\partial h^{(o)} / \partial x$ captures face structure more likely to correspond to identity.

To qualitatively compare the discriminative and non-discriminative features learned, we extracted the input directions to which they were most sensitive. This was achieved by extracting the 10 leading singular vectors of the derivative of either the discriminative or the non-discriminative features with respect to the image input. From Fig. 4 we see that, in general, the sensitivity directions for the expression-discriminative feature block, $h^{(d)}$, are more localized and contain less identity-specific information than those of the non-discriminative feature block, $h^{(o)}$. We also see that the expression-discriminative feature block contains expression-targeted detectors such as corner-of-the-mouth smile detectors, toothy-grin detectors, grimace detectors and wide-eye (surprise) detectors.

Classification performance obtained with features extracted after each of the two stages of our model (CCNET+SVM and CCNET+CDA+SVM) is reported in Table 1, and compared to a simpler non-convolutional one-layer CAE (CAE-1+SVM) and a stack of two CAEs (CAE-2+SVM). These results confirm that each successive layer we add helps to disentangle discriminative features, yielding good classification performance. Table 2 compares the performance of our approach to that of established models [14, 11].

Table 1. Test classification accuracy of several models, trained on TFD, averaged over 5 folds (reported with standard deviation).

Model            Accuracy
CAE-1+SVM        ±
CAE-2+SVM        ±
CCNET+SVM        ±
CCNET+CDA+SVM    ± 0.47
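The sensitivity analysis behind Fig. 4 (leading singular vectors of the feature Jacobian with respect to the input) can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes a single sigmoid encoder layer h = sigmoid(Wx + b) with random weights, and the names `encoder_jacobian`, `leading_input_directions`, `n_disc` and `n_other` are hypothetical. In the actual model the Jacobian would be taken through the full CCNET+CDA stack.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def encoder_jacobian(W, b, x):
    """Jacobian dh/dx of h = sigmoid(W x + b); shape (n_hidden, n_input)."""
    h = sigmoid(W @ x + b)
    # Row j of the Jacobian is h_j * (1 - h_j) * W[j, :].
    return (h * (1.0 - h))[:, None] * W

def leading_input_directions(J, k=10):
    """Return the k largest singular values of J and the matching
    right singular vectors (directions in input space), one per row."""
    _, s, Vt = np.linalg.svd(J, full_matrices=False)
    return s[:k], Vt[:k]

# Toy setup (assumed sizes): the first n_disc hidden units play the role
# of the discriminative block h^(d), the remaining ones of h^(o).
rng = np.random.default_rng(0)
n_input, n_disc, n_other = 64, 20, 30
W = rng.normal(scale=0.1, size=(n_disc + n_other, n_input))
b = np.zeros(n_disc + n_other)
x = rng.normal(size=n_input)

J = encoder_jacobian(W, b, x)
J_d, J_o = J[:n_disc], J[n_disc:]          # block-wise Jacobians
s_d, V_d = leading_input_directions(J_d, k=10)
s_o, V_o = leading_input_directions(J_o, k=10)
```

Each row of `V_d` or `V_o` has the dimensionality of the input, so for image inputs it can be reshaped to the image size and displayed, which is how a figure of sensitivity directions would be produced under these assumptions.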
Fig. 5. The effect of the CDA term on the generalisation performance for different values of η. The optimum is found for a non-negligible value of η = 7.

Table 2. Test classification accuracy of established models trained on TFD.

Model               Accuracy
SVM
RBF-SVM
SC+SVM
GFB+PCA+SVM [11]
mPoT+DBN
CCNET+CDA+SVM

9 Discussion

In this paper, we have investigated an approach to facial expression recognition based on a feature hierarchy trained to disentangle the factors of variation that give rise to facial expressions from other factors, such as those responsible for subject identity, specific facial morphology and subject pose. We introduce contractive discriminative analysis (CDA), a novel semi-supervised learning paradigm that incorporates available label information to define discriminative features while regularizing the feature set with a CAE-inspired penalty to promote good generalization properties. By combining prior knowledge of the spatial topology of images with feature learning schemes designed to recover robust features of facial expressions, we significantly surpass the previous state-of-the-art on the Toronto Face Database [4], achieving a generalization accuracy of 85.0%. We also show how the features recovered by our CDA scheme are invariant to factors such as subject identity and pose while remaining sensitive to changes in facial expression.

References

1. Bengio, Y.: Learning deep architectures for AI. Foundations and Trends in Machine Learning 2 (2009). Also published as a book. Now Publishers.
2. Saul, L., Roweis, S.: Think globally, fit locally: unsupervised learning of low dimensional manifolds. Journal of Machine Learning Research 4 (2002)
3. Rifai, S., Vincent, P., Muller, X., Glorot, X., Bengio, Y.: Contractive auto-encoders: Explicit invariance during feature extraction. In: ICML (2011)
4. Susskind, J., Anderson, A., Hinton, G.E.: The Toronto face dataset. Technical Report UTML TR, U. Toronto (2010)
5. Ranzato, M., Susskind, J., Mnih, V., Hinton, G.E.: On deep generative models with applications to recognition. In: CVPR'11 (2011)
6. Padgett, C., Cottrell, G.W.: A simple neural network models categorical perception of facial expressions. In: Proceedings of the Twentieth Annual Cognitive Science Conference, Erlbaum (1998)
7. Jarrett, K., Kavukcuoglu, K., Ranzato, M., LeCun, Y.: What is the best multistage architecture for object recognition? In: ICCV'09 (2009)
8. Coates, A., Lee, H., Ng, A.Y.: An analysis of single-layer networks in unsupervised feature learning. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2011) (2011)
9. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60 (2004)
10. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: CVPR (2005)
11. Dailey, M.N., Cottrell, G.W., Padgett, C., Adolphs, R.: EMPATH: A neural network that categorizes facial expressions. Journal of Cognitive Neuroscience (2002)
12. LeCun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W., Jackel, L.D.: Backpropagation applied to handwritten zip code recognition. Neural Computation 1 (1989)
13. Wolf, R., Platt, J.: Postal address block location using a convolutional locator network. In: NIPS'93 (1994)
14. Ranzato, M., Huang, F., Boureau, Y., LeCun, Y.: Unsupervised learning of invariant feature hierarchies with applications to object recognition. In: CVPR'07 (2007)
15. Taylor, G., Fergus, R., LeCun, Y., Bregler, C.: Convolutional learning of spatiotemporal features. In: ECCV'10 (2010)
16. Kavukcuoglu, K., Sermanet, P., Boureau, Y.L., Gregor, K., Mathieu, M., LeCun, Y.: Learning convolutional feature hierarchies for visual recognition. In: NIPS (2010)
17. Courville, A., Bergstra, J., Bengio, Y.: Unsupervised models of images by spike-and-slab RBMs. In: ICML (2011)
18. Barker, M., Rayens, W.: Partial least squares for discrimination. Journal of Chemometrics 17 (2003)
19. Fisher, R.A.: The use of multiple measurements in taxonomic problems. Annals of Eugenics 7 (1936)
20. Hotelling, H.: Relations between two sets of variates. Biometrika 28 (1936)
21. Wold, S., Ruhe, A., Wold, H., Dunn, W.J.: The collinearity problem in linear regression. The partial least squares (PLS) approach to generalized inverses. SIAM Journal on Scientific and Statistical Computing 5 (1984)
22. Bartlett, M.S.: Further aspects of the theory of multiple regression. Mathematical Proceedings of the Cambridge Philosophical Society 34 (1938)
23. Bengio, Y., Lamblin, P., Popovici, D., Larochelle, H.: Greedy layer-wise training of deep networks. In: NIPS'06. MIT Press (2007)
24. Weston, J., Ratle, F., Collobert, R.: Deep learning via semi-supervised embedding. In: ICML (2008)
25. Larochelle, H., Bengio, Y.: Classification using discriminative restricted Boltzmann machines. In: ICML (2008)