Facial Expression Classification with Random Filters Feature Extraction


Mengye Ren (Facial Monkey)
Zhi Hao Luo (It's Me), lzh@cs.toronto.edu

I. ABSTRACT

In our work, we tackle the challenging problem of facial expression classification. We were given face images labeled with seven categories, together with unlabeled faces, from the Toronto Faces Dataset [1]. We designed a machine learning model that uses random filters in a convolutional layer, followed by a ReLU and a max pooling layer, with a linear SVM to discriminate between classes. We achieved a classification rate of 84.7% on the public test set (418 images) and 83.7% on the private test set (835 images), ranking first and second respectively among 60 teams.

II. BACKGROUND

Convolutional neural networks are very successful models for image object classification tasks [2]. These networks usually have a convolutional layer with shared weights as the first layer, which can be viewed as a bank of image filters. The filtered images are then passed to a Rectified Linear Unit (ReLU) to obtain a non-linear activation for certain features. Next, to reduce data dimensionality and to make the model robust to small translations, a max pooling layer downsamples the intermediate image by taking the maximum value in each subregion. Lastly, a fully connected softmax layer makes class predictions, with each unit representing the probability that the input belongs to a certain class.

Recent work shows that, for a small number of classes, a set of one-vs-all support vector machines (SVMs) can replace the fully connected layers to reduce training complexity while achieving comparable results [7]. Moreover, a better model architecture (with ReLU and pooling) seems to matter more than pretraining the weights of the convolutional filters. In fact, research shows that randomly generated filters can perform surprisingly well [6]. The reason is that those filters provide an overcomplete set of bases that map the original image to a high-dimensional space [8]. The ReLU and pooling layers increase data sparsity and preserve the most salient features in the image, and these features can then be separated by a linear SVM.

III. MODEL

Our model consists of a convolutional layer with random weights, a ReLU (in both the positive and the negative direction), and a max pooling layer. We then send the processed data into a linear SVM. We compared our model with other common sets of filters, such as Gabor filters and sparse coding filters, and with other popular methods, such as KNN, k-means SVM, random forest and logistic regression. We found that random filters performed the best among all the models we tried. To achieve better results, we also experimented with different hyperparameters such as filter size, number of filters, pooling region, and hierarchical pooling. We show that a set of 1024 randomly generated 8x8 filters with hierarchical pooling performs the best among single classifiers.


Moreover, we explored bagging and constructed an ensemble of 15 classifiers with 40 filters each, which increased the classification rate by 1%.

Preprocessing

First we preprocess the data to ensure that each patch has zero mean and unit variance. Alternatively, we whiten the patches. ZCA whitening transforms the data to have zero mean and an identity covariance matrix.

Convolutional Layer

Given $F$ filters of size $P \times P$, we perform convolution on the original $M \times N$ image. The valid region of the convolution output is an image of size $(M - P + 1) \times (N - P + 1) = M' \times N'$. Each filter output $s$ is passed to a ReLU function: $z^+ = \max(0, s)$. In order to respond to negative features, we also use a negative ReLU: $z^- = \max(0, -s)$. In previous work on image object classification, positive and negative ReLU pairs were shown to be effective [3]. The ReLU sets non-positive values to zero, which creates data sparsity for better data separation. We concatenate the features from the positive and negative ReLU functions, obtaining $2F$ filtered images (see note 1).

After the ReLU, the model gives $2F$ images of size $M' \times N'$. We then downsample them into $2F$ smaller images of size $m \times n$ by selecting the largest value in each $(M'/m) \times (N'/n)$ region. Alternatively, we can use a different downsampling function, since max pooling is in fact the $L_\infty$ norm of the pooling region, and the $L_1$ norm corresponds to sum pooling. We are free to choose any parameter $p$ and compute the $p$-norm of the region as the pooled value.

Lastly, we flatten the $2F \times m \times n$ array into an image feature vector and pass it into an SVM with a linear kernel. To classify between seven classes, we trained seven one-vs-all SVMs. The following diagram shows the entire architecture of our proposed model.

Figure 1: Our proposed model architecture.

Note 1: Passing the filtered data into a positive and a negative ReLU respectively results in $2F$ sets of filtered data.
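The convolution, two-sided ReLU and pooling steps described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the implementation used for the reported results: the function names, the Gaussian filter initialization, the toy 12x12 image and the handling of region boundaries are all assumptions made for the example.

```python
import random

def conv2d_valid(image, kernel):
    """Valid-region filtering: output size is (M-P+1) x (N-P+1).
    Computed as cross-correlation (no kernel flip); for random filters
    this differs from convolution only by a relabeling of the weights."""
    M, N, P = len(image), len(image[0]), len(kernel)
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(P) for b in range(P))
             for j in range(N - P + 1)]
            for i in range(M - P + 1)]

def two_sided_relu(fmap):
    """Return z+ = max(0, s) and z- = max(0, -s) for every response s."""
    pos = [[max(0.0, s) for s in row] for row in fmap]
    neg = [[max(0.0, -s) for s in row] for row in fmap]
    return pos, neg

def p_norm(values, p):
    """p-norm of a pooling region; p = 1 is sum pooling for non-negative inputs."""
    return sum(abs(v) ** p for v in values) ** (1.0 / p)

def pool(fmap, m, n, reduce=max):
    """Split fmap into an m x n grid of non-overlapping regions and
    reduce each region to one value (max pooling by default)."""
    H, W = len(fmap), len(fmap[0])
    rows = [H * r // m for r in range(m + 1)]  # region boundaries (assumed split)
    cols = [W * c // n for c in range(n + 1)]
    return [[reduce([fmap[i][j]
                     for i in range(rows[r], rows[r + 1])
                     for j in range(cols[c], cols[c + 1])])
             for c in range(n)]
            for r in range(m)]

def extract_features(image, filters, m, n):
    """Flatten the 2F pooled maps into a single feature vector."""
    features = []
    for kernel in filters:
        for fmap in two_sided_relu(conv2d_valid(image, kernel)):
            for row in pool(fmap, m, n):
                features.extend(row)
    return features

rng = random.Random(0)
F, P = 4, 3  # toy sizes, far smaller than the 1024 8x8 filters in the text
filters = [[[rng.gauss(0, 1) for _ in range(P)] for _ in range(P)]
           for _ in range(F)]  # Gaussian weights are an assumption
image = [[rng.random() for _ in range(12)] for _ in range(12)]
features = extract_features(image, filters, m=5, n=5)
print(len(features))  # 2 * F * m * n = 200
```

Passing `reduce=lambda region: p_norm(region, 5)` to `pool` swaps max pooling for $L_5$-norm pooling without changing the rest of the pipeline.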


We plot the images of the filters and the activated, downsampled image of a face.

Figure 2: Random filters (left). 5x5 activation after ReLU and max pooling (right).

As we can see, there is no particular pattern in the filter initialization; however, the filters seem to be sensitive to regions such as the eyes, nose and mouth, which helps explain why random filters are effective. The grid-like structure in the filter images resembles a small high-pass image filter, which detects regions of high-frequency variation, such as facial features, and ignores regions of low variation, such as areas of uniform skin colour.

Classification Layer

In the classification layer, we have seven one-vs-all SVM classifiers. Each one predicts whether an image belongs to its class and reports the distance to the decision boundary. As the final prediction for an image, we take the class whose classifier gives the largest distance on the positive side of the decision boundary.

IV. EXPERIMENT

Throughout the project, we experimented with a variety of approaches and methods. In particular, we experimented with different filters, pooling regions, and ensemble methods. The final model was chosen according to the analysis of all these experiments. Besides testing hyperparameters for random filters, we examined the effectiveness of bagging and attempted to learn a gating function using a mixture of experts. We also tried preprocessing with k-means clusters. In the end, these were not included in the final model because their performance did not match that of the selected model and they could not be used to improve it. The detailed experimental results are discussed below.

Filter Size

We found that 8x8 filters work the best for both random and Gabor filters. Filters of this size are large enough to capture an entire eye or nose.
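The decision rule of the classification layer described above (pick the one-vs-all classifier with the largest signed distance to its boundary) can be sketched as follows. The helper names and the example scores are hypothetical, and the fallback when no classifier is on the positive side (still choosing the largest value) is an assumption, since the text does not cover that case.

```python
import math

def signed_distance(w, b, x):
    """Signed distance from x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

def predict_class(distances):
    """Index of the classifier with the largest signed distance.
    If every distance is negative, the least negative one wins
    (an assumption; the text only describes the positive case)."""
    return max(range(len(distances)), key=lambda k: distances[k])

# Hypothetical distances from seven one-vs-all classifiers for one image.
scores = [-1.2, 0.3, -0.5, 1.1, -2.0, 0.9, -0.1]
print(predict_class(scores))  # 3
```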

Effect of Different Filters

We experimented with different filters to extract a new image representation. Besides randomly generated filters, we also tried Gabor filters [9], sparse coding [3], and k-means coding [4]. Gabor filters are well known for edge detection [4]. Sparse coding and k-means coding are popular encoding techniques that map data into a sparse high-dimensional space.

Figure 3: Gabor filters (left) and sparse coding dictionary (right).

| Filters | 2-fold rate |
|---|---|
| 1024 random filters (no whitening) | 79.61% |
| 40 random filters (no whitening) | 76.34% |
| 4x10 Gabor filters (whitened) | 77.01% |
| 40 sparse coding dictionary (whitened) [2] | 73.33% |
| 40 k-means clusters, triangular encoding (note 2) (no whitening) [3] | 49.47% |

Table 1: Effect of different filters.

Note 2: Triangular k-means follows the method outlined in [4], a version of soft k-means that is able to keep some sparsity.

Gabor filters and sparse coding filters perform better on whitened data, whereas random filters perform better when the data is simply shifted by the mean and divided by the standard deviation. Gabor filters have a slight advantage over random filters given the same number of filters, but there is less room for improvement because the shape and form of the filters are rather restricted. With 1024 random filters, we achieved a classification rate of 80% with only half of the data (1460 labeled images).

Number of Filters

As shown above, 40 random filters already achieve a reasonably good classification rate. The benefit of adding filters decays exponentially: in Table 1, going from 40 to 1024 random filters raised the 2-fold rate from 76.34% to 79.61%, an increase of about 3%.

Pooling Regions (note 3)

| Random filters | 2x2 square | 5x5 square | Squares (note 4) | Squares + rectangles (note 5) |
|---|---|---|---|---|
| Held-out rate (note 6) | 75.0% | 81.8% | 83.01% | 84.21% |

Table 2: Effect of different pooling regions.

Note 3: m x n pooling means dissecting images into non-overlapping m x n regions and sampling one value in each region.
Note 4: We used 5x5, 4x4, 3x3 and 2x2 squares and concatenated the features into one feature vector.
Note 5: On top of the squares from 5x5 to 2x2, we used rectangular pooling regions of 5x1, 4x1, 3x1 and 2x1.
Note 6: The held-out set is composed of 418 images, available as the public test set on Kaggle.
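The multi-scale pooling in notes 4 and 5 amounts to pooling the same feature map on several grids and concatenating the results. The sketch below illustrates this under stated assumptions: "m x n" is read as a grid of m rows by n columns of regions, the region boundaries use integer division, and the function names and the 10x10 toy feature map are hypothetical.

```python
def max_pool(fmap, m, n):
    """Max over each cell of an m x n grid of non-overlapping regions."""
    H, W = len(fmap), len(fmap[0])
    rows = [H * r // m for r in range(m + 1)]
    cols = [W * c // n for c in range(n + 1)]
    return [[max(fmap[i][j]
                 for i in range(rows[r], rows[r + 1])
                 for j in range(cols[c], cols[c + 1]))
             for c in range(n)]
            for r in range(m)]

def hierarchical_pool(fmap, grids):
    """Pool one feature map on every grid and concatenate the values."""
    features = []
    for m, n in grids:
        for row in max_pool(fmap, m, n):
            features.extend(row)
    return features

SQUARES = [(5, 5), (4, 4), (3, 3), (2, 2)]
RECTANGLES = [(5, 1), (4, 1), (3, 1), (2, 1)]  # orientation is an assumption

# Toy 10x10 feature map with distinct values.
fmap = [[float(10 * i + j) for j in range(10)] for i in range(10)]

print(len(hierarchical_pool(fmap, SQUARES)))               # 25 + 16 + 9 + 4 = 54
print(len(hierarchical_pool(fmap, SQUARES + RECTANGLES)))  # 54 + 5 + 4 + 3 + 2 = 68
```

Under this reading, each filtered image contributes 68 values, so the feature vector grows quickly with the number of filters; this is one plausible reason for the large model size reported for the 1024-filter configuration.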

To select a better architecture, we experimented with different pooling regions. The above table summarizes how different pooling mechanisms affect the classification rate. 2x2 square max pooling suffers from loss of information. We also found that it is beneficial to pool over regions of several different sizes. With both hierarchical square pooling boxes and rectangular pooling boxes, we achieved a classification rate of 84.21% on the held-out set.

Max Pooling or $L_p$ Norm Pooling

Although max pooling may be prone to loss of local features, we found that $L_p$ norm pooling ($p$ = 5, 10, 20) does not outperform max pooling on the held-out set. The loss of information can be mitigated by using more hierarchical pooling regions.

Training Time

One of the biggest advantages of random filters is that no pretraining is needed to obtain the filters. We list a comparison of training times below.

| | 512 Random Filters (RF) | 512 Gabor Filters (GF) | 512 Sparse Coding (SC) (note 7) | K-means (note 8) |
|---|---|---|---|---|
| Unsupervised training phase | | | hours | 0.5 hour |

Table 3: Unsupervised training time comparison between encoding techniques.

Note 7: We used stochastic coordinate gradient descent with mini-batches.
Note 8: We used fkmeans (fast k-means) from MATLAB Central by Tim Benham to boost speed [5].

In the supervised phase, the SVM can be trained on 3000 images in 5 minutes. The tradeoff of the SVM is the growth of the model size. For 1024 random filters with hierarchical square and rectangular pooling regions, the trained model is approximately 3 GB.

SVM Kernel Selection

We found that the linear kernel performs the best. This suggests that, in our high-dimensional image representation space, the categories are linearly separable. The polynomial kernel does not produce any meaningful results; the RBF kernel with sigma equal to 25 produces a 2-fold classification rate of 66%, which is far below that of the linear kernel.

Ensemble Method

We also experimented with ensembles of classifiers using random filters, Gabor filters, and sparse coding filters. The final prediction is the arithmetic average of the prediction probabilities of all classifiers in the ensemble.

| Bag ratio 0.8 | Single 512 RF | 15x40 RF | 20x40 GF | 10xRF + 5xGF + 5xSC |
|---|---|---|---|---|
| Held-out rate | 83.01% | 84.68% | 82.54% | 82.30% |

Table 4: Ensemble method results.
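The ensemble procedure above (train each classifier on a bag of the training data, then average the predicted class probabilities) can be sketched as follows. The function names and the toy probabilities are hypothetical, and drawing each bag without replacement is an assumption; the text only states a bag ratio of 0.8.

```python
import random

def bag_indices(n_samples, ratio, rng):
    """Pick a random subset of training indices for one classifier.
    Sampling without replacement is an assumption."""
    return rng.sample(range(n_samples), int(ratio * n_samples))

def average_probabilities(prob_lists):
    """Arithmetic mean of per-class probabilities across classifiers."""
    n_models = len(prob_lists)
    n_classes = len(prob_lists[0])
    return [sum(p[c] for p in prob_lists) / n_models for c in range(n_classes)]

def ensemble_predict(prob_lists):
    """Class with the highest averaged probability."""
    avg = average_probabilities(prob_lists)
    return max(range(len(avg)), key=lambda c: avg[c])

rng = random.Random(0)
bag = bag_indices(10, 0.8, rng)
print(len(bag))  # 8

# Hypothetical class probabilities from three classifiers for one image.
probs = [[0.6, 0.3, 0.1],
         [0.2, 0.5, 0.3],
         [0.1, 0.6, 0.3]]
print(ensemble_predict(probs))  # 1
```

In this toy example the first classifier alone would pick class 0, but the average favours class 1, which is the effect the ensemble relies on.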

We found that an ensemble of 15 classifiers (600 filters in total) gives a 1.7% increase in classification rate compared with a single classifier with a similar number of filters.

Comparison with Other Methods

a. K-Nearest Neighbours (KNN), Bag of KNN
The KNN algorithm is provided in the project starter package and serves as the baseline classifier for this task. On top of the baseline KNN, we also built an ensemble of KNN classifiers using bagging.

b. Random Forest
We first obtain the 45 leading principal components (PCA), and then train a bag of 7x200 one-vs-all decision trees using the MATLAB function fitensemble.

c. Polynomial SVM
We first obtain the 45 leading PCA components, and then train 7 one-vs-all SVMs with a 3rd-order polynomial kernel.

d. Multinomial Logistic Regression (LR)
We first pass the images through random filters, then we train 7 one-vs-all SVMs with a linear kernel.

| | RF | KNN (K=5) | Bag KNN | Random Forest | Poly SVM | Multi LR |
|---|---|---|---|---|---|---|
| 10-fold rate | 82.56% | 57.39% | 61.03% | 64.20% | 65.13% | 67.57% |
| Held-out rate | 84.21% | 58.12% | | | | |

Table 5: 10-fold cross-validation comparison of different classifiers.

Most of the methods we explored beat the baseline KNN. The 1024 random filters beat the baseline by more than 25%. Held-out rates of the other methods are not provided because of the submission limit.

V. CONCLUSION

Based on the experiments and analysis above, we found that our proposed model works best. We obtain a new representation of the data by convolving random filters with the original images. The positive and negative features are extracted by passing the filtered data through a positive ReLU and a negative ReLU respectively. The data from all filters are then pooled to create a new vector that becomes the new representation of the original data. To classify the new representations, 7 one-vs-all linear SVMs are trained. Finally, training multiple SVMs on different sets of random filters proved effective: averaging the results of these separate SVM classifiers further boosts the overall classification rate.

REFERENCES

[1] The Toronto Faces Dataset.
[2] LeCun et al. Gradient-based learning applied to document recognition. In Proceedings of the IEEE.
[3] Adam Coates and Andrew Y. Ng. The Importance of Encoding Versus Training with Sparse Coding and Vector Quantization. In ICML.
[4] Adam Coates, Honglak Lee and Andrew Y. Ng. An Analysis of Single Layer Networks in Unsupervised Feature Learning. In 14th ICAIS.
[5] Tim Benham. Fast K-means Implementation With Optional Weights.
[6] Andrew M. Saxe et al. On Random Weights and Unsupervised Feature Learning. In ICML '11.
[7] Yichuan Tang. Deep Learning using Linear Support Vector Machines. ICML '13.
[8] B. Olshausen and D. Field. Sparse Coding with an Overcomplete Basis Set: A Strategy Employed by V1? Vision Research, vol. 37.
[9] M. Haghighat, S. Zonouz, M. Abdel-Mottaleb. "Identification Using Encrypted Biometrics." Computer Analysis of Images and Patterns, Springer Berlin Heidelberg.
