Facial Expression Classification with Random Filters Feature Extraction


Mengye Ren (Facial Monkey), mren@cs.toronto.edu
Zhi Hao Luo (It's Me), lzh@cs.toronto.edu

I. ABSTRACT

In this work we tackle the challenging problem of facial expression classification. We were given 2925 labeled 32x32 face images spanning seven expression categories, together with 98058 unlabeled faces, from the Toronto Faces Dataset [1]. We designed a machine learning model that uses random filters in a convolutional layer, followed by a ReLU and a max pooling layer, with a linear SVM to discriminate between classes. We achieved a classification rate of 84.7% on the public test set (418 images) and 83.7% on the private test set (835 images), ranking first and second respectively among 60 teams.

II. BACKGROUND

Convolutional neural networks are very successful models for image object classification [2]. These networks usually start with a convolutional layer of shared weights, which can be viewed as a bank of image filters. The filtered images are passed through a Rectified Linear Unit (ReLU) to obtain non-linear activations for particular features. Next, to reduce dimensionality and to make the model robust to small translations, a max pooling layer downsamples each intermediate image by keeping the maximum value in each subregion. Lastly, a fully connected softmax layer makes the class predictions, with each unit representing the probability that the input belongs to a given class.

Recent work shows that, for a small number of classes, a set of one-vs-all support vector machines (SVMs) can replace the fully connected layers, reducing training complexity while achieving comparable results [7]. Moreover, a good model architecture (with ReLU and pooling) appears to matter more than pretraining the weights of the convolutional filters; in fact, randomly generated filters can perform surprisingly well [6]. The reason is that such filters provide an overcomplete set of bases that map the original image into a high-dimensional space [8]. The ReLU and pooling layers increase data sparsity and preserve the most salient features in the image, and these features can then be separated by a linear SVM.

III. MODEL

Our model consists of a convolutional layer with random weights, a ReLU (in both the positive and the negative direction), and a max pooling layer. The processed data is then fed into a linear SVM. We compared our model with other common filter sets such as Gabor filters and sparse coding filters, and with other popular methods such as kNN, k-means + SVM, random forests, and logistic regression; the random filters performed best among all models we tried. To achieve better results, we also experimented with different hyperparameters such as filter size, number of filters, pooling region, and hierarchical pooling. A set of 1024 randomly generated 8x8 filters with hierarchical pooling performs best among single classifiers. We also explored bagging and constructed an ensemble of 15 classifiers with 40 filters each, which increased the classification rate by 1%.
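As a concrete illustration of this pipeline, the sketch below extracts random-filter features from 32x32 images and trains a linear SVM. The report's implementation was in MATLAB; this Python/NumPy version with scikit-learn is only a minimal approximation, and the filter count, single pooling grid, and toy data are placeholder choices of ours, not the authors' settings.

import numpy as np
from scipy.signal import convolve2d
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

def random_filters(num_filters=40, size=8):
    # Random filters need no pretraining: i.i.d. Gaussian weights suffice.
    return rng.standard_normal((num_filters, size, size))

def max_pool(z, grid=5):
    # Non-overlapping max pooling onto a grid x grid map.
    h, w = z.shape[0] // grid, z.shape[1] // grid
    return z[:grid * h, :grid * w].reshape(grid, h, grid, w).max(axis=(1, 3))

def features(image, filters):
    feats = []
    for f in filters:
        s = convolve2d(image, f, mode="valid")         # (M-P+1) x (N-P+1) map
        feats.append(max_pool(np.maximum(0, s)))       # positive ReLU
        feats.append(max_pool(np.maximum(0, -s)))      # negative ReLU
    return np.concatenate([z.ravel() for z in feats])  # 2F pooled maps, flattened

# Toy usage with random data standing in for the 32x32 TFD faces.
F = random_filters()
X = np.stack([features(img, F) for img in rng.standard_normal((200, 32, 32))])
y = rng.integers(0, 7, size=200)                       # seven expression classes
svm = LinearSVC().fit(X, y)                            # one-vs-rest linear SVMs
print(svm.score(X, y))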

Preprocessing

First we preprocess the data so that each patch has zero mean and unit variance. Alternatively, we whiten the patches: ZCA whitening transforms the data to have zero mean and an identity covariance matrix.

Convolutional Layer

Given F filters of size P x P, we convolve them with the original M x N image. The valid region of the convolution output is an image of size (M - P + 1) x (N - P + 1) = M' x N'. Each filter output s is passed through a ReLU, z+ = max(0, s). In order to respond to negative features as well, we also use a negative ReLU, z- = max(0, -s); pairs of positive and negative ReLUs have been shown to be effective in previous work on image object classification [3]. The ReLU suppresses non-positive values to zero, creating data sparsity that helps separate the data. We concatenate the features from the positive and negative ReLUs, so passing the filtered data through both gives 2F filtered images.

After the ReLU, the model therefore produces 2F images of size M' x N'. We then downsample them into 2F smaller m x n images by taking the largest value in each (M'/m) x (N'/n) region. Alternatively, a different downsampling function can be used: max pooling is in fact the L-infinity norm of the pooling region, the L1 norm corresponds to sum pooling, and we are free to choose any p and use the p-norm of the region as the pooled value. Lastly, we flatten the 2F m x n maps into a single image feature vector and pass it to an SVM with a linear kernel. To classify the seven classes, we train seven one-vs-all SVMs. The following diagram shows the entire architecture of our proposed model.

Figure 1: Our Proposed Model Architecture.
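The two preprocessing options described at the start of this section (per-patch standardization and ZCA whitening) can be written in a few lines. This is a generic sketch; the epsilon regularizers and function names are our own, not taken from the report.

import numpy as np

def standardize(X):
    # Per-patch zero mean and unit variance (X: n_samples x n_pixels).
    mu = X.mean(axis=1, keepdims=True)
    sd = X.std(axis=1, keepdims=True) + 1e-8
    return (X - mu) / sd

def zca_whiten(X, eps=1e-2):
    # ZCA whitening: zero mean and (approximately) identity covariance,
    # while staying close to the original pixel space.
    X = X - X.mean(axis=0)
    cov = X.T @ X / X.shape[0]
    U, S, _ = np.linalg.svd(cov)                   # cov = U diag(S) U^T
    W = U @ np.diag(1.0 / np.sqrt(S + eps)) @ U.T  # ZCA transform
    return X @ W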

We plot the filters and the activated, downsampled image of a face.

Figure 2: Random Filters (left); 5x5 Activation after ReLU and Max Pooling (right).

As the figure shows, there is no particular pattern in the filter initialization; nevertheless, the filters appear to be sensitive to regions such as the eyes, nose, and mouth, which explains why random filters are effective. The grid-like structure in the filter images resembles a small high-pass image filter, which detects high-frequency regions such as facial features and ignores low-variation regions such as differences in skin colour.

Classification Layer

The classification layer consists of seven one-vs-all SVM classifiers. Each one predicts whether an image belongs to its class and reports the distance to its decision boundary. We take the classifier with the longest distance on the positive side of its decision boundary as the final prediction for the image.
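A hedged sketch of this decision rule in Python with scikit-learn follows. LinearSVC would handle one-vs-rest on its own; the explicit per-class loop below simply mirrors the description above and is our own illustrative construction.

import numpy as np
from sklearn.svm import LinearSVC

def train_one_vs_all(X, y, n_classes=7):
    # One binary linear SVM per expression class (class c vs. the rest).
    return [LinearSVC().fit(X, (y == c).astype(int)) for c in range(n_classes)]

def predict(classifiers, X):
    # Signed distance to each decision boundary; the largest margin wins.
    margins = np.column_stack([clf.decision_function(X) for clf in classifiers])
    return margins.argmax(axis=1)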

IV. EXPERIMENT

Throughout the project we experimented with a variety of approaches and methods, in particular different filters, pooling regions, and ensemble methods; the final model was chosen based on the analysis of all of these experiments. Besides tuning the hyperparameters of the random filters, we examined the effectiveness of bagging, attempted to learn a gating function using a mixture of experts, and tried preprocessing with k-means clusters. In the end, these were not selected for the final model, as their performance did not match that of the selected model and they could not be used to improve it. The detailed experimental results are discussed below.

Filter Size

We found that 8x8 filters work best for both random and Gabor filters; at this size a filter can capture an entire eye or the nose.

Effect of Different Filters

We experimented with different filters for extracting the new image representation. Besides randomly generated filters, we tried Gabor filters [9], sparse coding [3], and k-means coding [4]. Gabor filters are well known for edge detection [4]; sparse coding and k-means coding are popular encoding techniques that map data into a sparse, high-dimensional space.

Figure 3: Gabor Filters (left) and Sparse Coding Dictionary (right).

Table 1: Effect of different filters (2-fold classification rate).

  1024 Random Filters (no whitening)                     79.61%
  40 Random Filters (no whitening)                       76.34%
  4x10 Gabor Filters (whitened)                          77.01%
  40 Sparse Coding Dictionary (whitened)                 73.33%
  40 K-means Cluster Triangular Encoding (no whitening)  49.47%

Note: triangular k-means follows the method outlined in [4], a version of soft k-means that preserves some sparsity.

Gabor filters and sparse coding filters perform better on whitened data, whereas random filters perform better when we simply subtract the mean and divide by the standard deviation. Given the same number of filters, Gabor filters have a slight advantage over random filters, but they leave less room for improvement because their shape and form are rather restricted. With 1024 random filters, we achieved a classification rate of 80% using only half of the labeled data (1460 images).

Number of Filters

As shown above, 40 random filters already achieve a reasonably good classification rate, and the benefit of additional filters decays exponentially; 1024 filters increased the classification rate by 3%.

Pooling Regions

To select a better architecture, we experimented with different pooling regions. The table below summarizes how the different pooling mechanisms affect the classification rate.

Table 2: Effect of different pooling regions (1024 random filters, held-out rate).

  2x2 square                         75.0%
  5x5 square                         81.8%
  5, 4, 3, 2 squares                 83.01%
  5, 4, 3, 2 squares + rectangles    84.21%

Notes: m x n pooling means dissecting the image into non-overlapping m x n regions and sampling one value from each region. The hierarchical square setting concatenates features from 5x5, 4x4, 3x3, and 2x2 grids; the rectangular setting adds 5x1, 4x1, 3x1, and 2x1 regions on top of those squares. The held-out set consists of the 418 images available as the public test set on Kaggle.

2x2 square max pooling suffers from loss of information. We also found it beneficial to pool over several sizes of pooling region: with both hierarchical square pooling boxes and rectangular pooling boxes, we achieved a classification rate of 84.21% on the held-out set.
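Hierarchical pooling as used above simply repeats max pooling at several grid resolutions and concatenates the results. The sketch below follows the 5-4-3-2 squares plus 5x1 to 2x1 rectangles from Table 2; the grid-splitting helper reflects our reading of the pooling-region note and is not the authors' code.

import numpy as np

def pool_grid(z, rows, cols):
    # Split a feature map into a rows x cols grid of non-overlapping cells
    # and keep the maximum of each cell.
    r = np.linspace(0, z.shape[0], rows + 1).astype(int)
    c = np.linspace(0, z.shape[1], cols + 1).astype(int)
    return np.array([z[r[i]:r[i + 1], c[j]:c[j + 1]].max()
                     for i in range(rows) for j in range(cols)])

def hierarchical_pool(z):
    # Squares 5x5 down to 2x2, then rectangles 5x1 down to 2x1, concatenated.
    parts = [pool_grid(z, k, k) for k in (5, 4, 3, 2)]
    parts += [pool_grid(z, k, 1) for k in (5, 4, 3, 2)]
    return np.concatenate(parts)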

Max Pooling or L_p-norm Pooling

Although max pooling may be prone to losing local features, we found that L_p-norm pooling (p = 5, 10, 20) does not outperform max pooling on the held-out set. The loss of information can instead be mitigated by using more hierarchical pooling regions.

Training Time

One of the biggest advantages of random filters is that no pretraining is needed to obtain the filters. We list a comparison of training times below.

Table 3: Unsupervised training time of the different encoding techniques.

  512 Random Filters (RF)   0
  512 Gabor Filters (GF)    0
  512 Sparse Coding (SC)    6 hours
  512 K-means               0.5 hour

Notes: for sparse coding we used stochastic coordinate gradient descent with mini-batches of 25; for k-means we used fkmeans (fast k-means) from MATLAB Central by Tim Benham to boost speed [5].

In the supervised phase, the SVM trains on 3000 images in 5 minutes. The trade-off of the SVM is the growth of the model size: for 1024 random filters with hierarchical square and rectangular pooling regions, the trained model is approximately 3 GB.

SVM Kernel Selection

We found that the linear kernel performs best, which suggests that in our high-dimensional image representation space the categories are linearly separable. The polynomial kernel does not produce any meaningful results, and an RBF kernel with sigma equal to 25 produces a 2-fold classification rate of 66%, far below the linear kernel.

Ensemble Method

We also experimented with an ensemble of classifiers using random filters, Gabor filters, and sparse coding filters. The final prediction is the arithmetic average of the prediction probabilities of all classifiers in the ensemble.

Table 4: Ensemble method results (bag ratio 0.8, held-out rate).

  Single 512 RF          83.01%
  15 x 40 RF             84.68%
  20 x 40 GF             82.54%
  10 RF + 5 GF + 5 SC    82.30%

With an ensemble of 15 classifiers (600 filters in total), we get a 1.7% increase in classification rate compared to a single classifier with a similar number of filters.
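A minimal sketch of the bagging scheme: each ensemble member is trained on a bootstrap sample covering a 0.8 fraction of the training set, and the per-class scores are averaged at prediction time. In the report each member also uses its own set of 40 random filters and the average is taken over predicted probabilities; for brevity the sketch below reuses one fixed feature representation and averages decision scores, so it is an approximation of the procedure rather than the exact method.

import numpy as np
from sklearn.svm import LinearSVC

def train_bagged_svms(X, y, n_models=15, bag_ratio=0.8, seed=0):
    # Train each member on a bootstrap sample of bag_ratio * n examples.
    rng = np.random.default_rng(seed)
    n = int(bag_ratio * len(X))
    models = []
    for _ in range(n_models):
        idx = rng.choice(len(X), size=n, replace=True)
        models.append(LinearSVC().fit(X[idx], y[idx]))
    return models

def ensemble_predict(models, X):
    # Arithmetic average of per-class scores across members, then argmax.
    scores = np.mean([m.decision_function(X) for m in models], axis=0)
    return scores.argmax(axis=1)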

Comparison with Other Methods

a. K-Nearest Neighbours (KNN), Bag of KNN. The KNN algorithm is provided in the project starter package and serves as the baseline classifier for this task. On top of the baseline KNN, we also built an ensemble of KNNs using bagging.

b. Random Forest. We first take the 45 largest principal components (PCA), and then train a bag of 7 x 200 one-vs-all decision trees using the MATLAB function fitensemble.

c. Polynomial SVM. We first take the 45 largest principal components, and then train seven one-vs-all SVMs with a 3rd-order polynomial kernel.

d. Multinomial Logistic Regression (LR). We first pass the images through the random filters, and then train seven one-vs-all logistic regression classifiers.

Table 5: 10-fold cross-validation comparison of different classifiers.

  Method          10-fold rate   Held-out rate
  1024 RF         82.56%         84.21%
  KNN (K=5)       57.39%         58.12%
  Bag KNN         61.03%         -
  Random Forest   64.20%         -
  Poly SVM        65.13%         -
  Multi LR        67.57%         -

Most of the methods we explored beat the baseline KNN; the 1024 random filters beat the baseline by more than 25%. Held-out rates of the other methods are not provided because of the submission limit.

V. CONCLUSION

Based on the experiments and analysis above, our proposed model works best. We obtain a new representation of the data by convolving random filters with the original images. The positive and negative features are extracted by passing the filtered data through a positive and a negative ReLU, respectively. The outputs of all filters are then pooled and concatenated into a vector that becomes the new representation of the original data. To classify these new representations, seven one-vs-all linear SVMs are trained. Finally, training multiple SVMs with different sets of random filters proved effective: averaging the results of these separate SVM classifiers further boosts the overall classification rate.

REFERENCES

[1] The Toronto Faces Dataset, http://aclab.ca/users/josh/tfd.html
[2] Y. LeCun et al., "Gradient-Based Learning Applied to Document Recognition," Proceedings of the IEEE, 1998.
[3] A. Coates and A. Y. Ng, "The Importance of Encoding Versus Training with Sparse Coding and Vector Quantization," ICML 2011.
[4] A. Coates, H. Lee, and A. Y. Ng, "An Analysis of Single-Layer Networks in Unsupervised Feature Learning," 14th ICAIS, 2011.
[5] T. Benham, "Fast K-means Implementation With Optional Weights," http://www.mathworks.com/matlabcentral/fileexchange/31274-fast-k-means
[6] A. M. Saxe et al., "On Random Weights and Unsupervised Feature Learning," ICML 2011.
[7] Y. Tang, "Deep Learning Using Linear Support Vector Machines," ICML 2013.
[8] B. Olshausen and D. Field, "Sparse Coding with an Overcomplete Basis Set: A Strategy Employed by V1?" Vision Research, vol. 37.
[9] M. Haghighat, S. Zonouz, and M. Abdel-Mottaleb, "Identification Using Encrypted Biometrics," Computer Analysis of Images and Patterns, Springer Berlin Heidelberg, pp. 440-448, 2013.