Deep Learning of Human Emotion Recognition in Videos
Yuqing Li, Uppsala University


Abstract

Machine learning in computer vision has made great progress in recent years. Tasks like object detection, object classification and image segmentation have reached near or even above human performance. Meanwhile, tasks like human emotion recognition remain challenging. In this paper, machine learning techniques are used to recognize human emotions in movie images and videos. First, the theoretical background of these techniques is introduced. Secondly, informative content including audio, single video frames and multiple video frames is extracted from videos to represent emotions. In this step, OpenSMILE and an Inception-ResNet-v2 model are used to extract feature vectors from audio and frames respectively. Thirdly, various models are trained to classify the emotions: an SVM classifies the audio feature vectors, Inception-ResNet-v2 recognizes emotions in static images, and a C3D model classifies a sequence of frames (video). After that, the accuracy of these models is reported. Finally, the advantages and disadvantages of these models are discussed, as well as possible improvements for future studies on human emotion recognition.

Contents

1 Introduction
  1.1 Background
  1.2 Previous research
    1.2.1 Emotion categories
    1.2.2 Data set
    1.2.3 Hand-crafted features
    1.2.4 Deep features
  1.3 Problem Formulation
2 Theory
  2.1 Artificial Neural Network
    2.1.1 Architecture
    2.1.2 Neurons
    2.1.3 Training process
  2.2 Convolutional Neural Network
    2.2.1 Convolutional layer and feature map
    2.2.2 Pooling layer
    2.2.3 C3D
  2.3 Deep Neural Network
    2.3.1 Batch Normalization
    2.3.2 Residual Learning
  2.4 RNN
    2.4.1 LSTM
  2.5 Transfer Learning
3 Methodology
  3.1 Data
  3.2 Pre-processing
  3.3 Feature Extraction
    3.3.1 Audio Features
    3.3.2 CNN Deep Features
  3.4 Model Training
    3.4.1 SVM for Audios
    3.4.2 LSTM Model
    3.4.3 C3D
4 Results
  4.1 On Audios
  4.2 Inception ResNet V2 On Static Images
    4.2.1 Testing on SFEW
    4.2.2 Failed Images
  4.3 On Videos
    4.3.1 LSTM
    4.3.2 C3D
5 Discussion
  5.1 Conclusion
    5.1.1 Audios
    5.1.2 Image model
    5.1.3 Video models
  5.2 Future Work
References

1. Introduction

1.1 Background

In recent years, thanks to the rapid development of computer vision and machine learning, tasks like object classification, action recognition and face recognition have produced fruitful achievements. However, human emotion recognition remains one of the most challenging tasks, and a lot of effort has been made to solve this problem. Since the first Emotion Recognition in the Wild (EmotiW) challenge was held in 2013, the accuracy of video emotion classification has increased from the 38% baseline to 59%[2]. Great progress has been made, but the results are still unsatisfying. On the one hand, this is probably due to the lack of labelled video data and the ambiguous nature of human facial expressions. On the other hand, the lack of effective ways to extract facial emotion features also affects model performance. In recent years, pretrained deep convolutional neural networks have been proven to perform well at extracting image features on challenging databases such as ImageNet[1]; the Long Short-Term Memory (LSTM) network shows promising prediction accuracy when analyzing sequential data[6]; the three-dimensional convolutional neural network (C3D) achieves high performance in video action detection[2]. Thus, applying these new techniques and combining them may boost the accuracy of human emotion recognition in videos.

1.2 Previous research

The study of automatic human facial emotion recognition started with defining and categorizing human facial expressions. After that, researchers built databases containing labelled facial expression examples. Finally, various approaches have been used to recognize human emotions.

1.2.1 Emotion categories

The study of facial emotion recognition can be traced back to the 1970s. Paul Ekman and his colleagues[7] found that there are six facial expressions (happiness, sadness, anger, fear, surprise, disgust) that can be understood by people from different cultures. Differences in background influence facial expressions mainly in intensity[10]. For example, when watching the same comedy film, Americans tend to laugh with their mouths wide open while Japanese are more likely to smile without showing their teeth.

The observation that infants are able to show a wide range of facial expressions and respond to facial expressions from others without being taught suggests that the ability to convey and understand emotions via facial expressions is inherent in humans.

1.2.2 Data set

Several data sets have been established to build emotion recognition models and evaluate their performance. The same method can produce dramatically different results on different data sets, due to the variance between them. Facial emotion databases can be divided into two categories: lab data sets like the Cohn-Kanade CK+ database[16] and wild data sets like Acted Facial Expressions in the Wild (AFEW)[5]. In the latter, facial expression images or videos are taken from movies and online videos, which differ significantly in resolution, illumination, head pose etc., while most lab data sets control all these factors carefully. Figure 1.1 shows the difference between the two kinds of data sets.

Figure 1.1. The Cohn-Kanade CK+ database (above) has frontal facial images with stable illumination. The Facial Expression Recognition 2013 (FER-2013) data set (below) has images cropped from movies that vary in head posture and illumination.

1.2.3 Hand-crafted features

There are two approaches to crafting facial features by hand from the original images/videos: the geometric-feature approach and the appearance approach.

Geometric-feature-based methods extract information about facial components and their movements, imitating how humans understand facial emotions. One example is the Facial Action Coding System (FACS), developed to describe facial expressions precisely: each facial expression is broken down into several Action Units (AU), each representing a facial muscular movement[7]. Based on FACS, facial expressions were categorized by recognizing certain facial movements[19]. Before the 1990s, encoding facial expressions with FACS was done manually and thus very inefficient. Moreover, geometric-feature-based methods depend heavily on the accuracy of facial component detection and tracking, which makes them less reliable than appearance-based methods.

Computers became part of the game in the 1990s, and appearance-based methods then became quite popular. Optical flow (OF)[26], 2D Fourier transform coefficients[15], Local Binary Patterns (LBP)[21] and facial motions[8] were popular new features. Among these, optical flow captures the movement of surfaces and objects in video; the 2D Fourier transform converts spatial-domain information into the frequency domain, which allows researchers to decrease the dimension of an image/video significantly; LBP, on the other hand, focuses on comparing a pixel with its nearby pixels and encoding the resulting spatial pattern. New classification models also contributed to the task. The hidden Markov model, a simplified Bayesian network that aims to discover hidden patterns in features, was able to classify facial expressions in near real time[17].

1.2.4 Deep features

There are plenty of challenges in the Computer Vision (CV) area besides human facial emotion recognition, and data sets and competitions have been built for these tasks. The most famous are the ImageNet database and the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), focusing on object recognition in images. ImageNet contains over 10 million images in around 1000 classes. Since AlexNet, a 5-layer convolutional neural network, proved successful in ILSVRC 2012, deep neural networks that can extract more complex features have come into popularity[13] in all CV research areas. For a convolutional neural network, the convolutional layers are considered the feature extractor while the fully connected layers are considered the classifier. If the network consists of several convolutional layers, the output of the last convolutional layer is called a deep feature. For a deep network, multiple convolutional layers mean a larger solution domain; thus deep features (the output tensor of the last non-classification layer) have higher dimensions and contain more information from the input image.

Consequently, deep features have been used in emotion recognition tasks and have significantly improved classification results. Similar to ImageNet and ILSVRC, in the facial emotion recognition area there are the Emotion Recognition in the Wild (EmotiW) challenge and the databases designed for it. The challenge was first held in 2013 with two databases: Acted Facial Expressions in the Wild (AFEW) and Static Facial Expressions in the Wild (SFEW). The baseline accuracy was 38%, with Local Binary Patterns on Three Orthogonal Planes (LBP-TOP) as features and a Support Vector Machine (SVM) as classifier[4]. In the following years of the competition, solutions using deep pretrained neural networks to extract image features and Long Short-Term Memory (LSTM) networks to take temporal influence into account proved efficient with limited labelled data[6]. The winning team of EmotiW 2016 successfully implemented a three-dimensional convolutional neural network (C3D) and achieved the best performance with an accuracy of 59%[2]. In this research, the databases from EmotiW are used to train the models, while the baseline and competition results are used to evaluate the performance of the models trained in this work.

1.3 Problem Formulation

This project consists of several subproblems that need to be solved:
1. Compare various emotion features extracted from videos.
2. Evaluate the Inception-ResNet-v2 model's performance on human facial emotion recognition.
3. Evaluate the performance of the C3D model.

2. Theory

This chapter covers the theoretical background of the models and methods implemented in this project to extract deep features and classify video emotions.

2.1 Artificial Neural Network

An Artificial Neural Network (ANN) is a computational model consisting of a collection of artificial neurons as its basic computation units. There is a set of variants of ANN, such as the Convolutional Neural Network (CNN) and the Recurrent Neural Network (RNN). Based on different architectures and neurons, ANNs can be used to solve different problems.

2.1.1 Architecture

The structure of an ANN is determined by two factors. The first is how many layers the ANN has and how many neurons each layer has. The second is how information/inputs are transferred through the ANN. For the former, the more layers an ANN has, the deeper it is, while the more neurons each layer has, the fatter it is. More neurons mean a larger solution domain at the cost of longer training time. With limited computational power and time (i.e. a limited number of weights that can be trained), thinner and deeper ANNs have been shown to perform better[20]. For the latter, the number of possible ways to connect neurons is enormous, and most of them are not feasible to train at the moment. Of all the feasible networks, the most commonly used and typical ones are the fully connected feed-forward network and the recurrent neural network (RNN), as shown in Figure 2.1.

2.1.2 Neurons

Neurons in an ANN work in a similar way to neuron cells in animal brains. Neuron cells receive stimuli, process them and produce feedback based on them. Artificial neurons do exactly the same thing by summing their inputs, adding a bias and using an activation function to decide the response, as shown in Figure 2.2. Mathematically, the operation of a neuron can be summarized as:

Figure 2.1. A feed-forward network (left) and an RNN (right). As the name indicates, a neuron in a fully connected feed-forward network only takes the outputs of the neurons in its previous layer as input, so the flow of information is unidirectional. In an RNN, a neuron's input may also come from other neurons in the same layer.

Figure 2.2. An artificial neuron.

    y = \varphi\Big(\sum_i w_i x_i + a\Big)    (2.1)

where x_i is an input, w_i is the weight of x_i, a is the bias added in this neuron and \varphi is the activation function.

Activation function
The activation function determines the output of a neuron. Commonly used activation functions include the logistic function (sigmoid), the hyperbolic tangent function (tanh), the ramp function (ReLU) and the normalized exponential function (softmax). These activation functions work as filters that decide whether the information is passed on and how strong the signal is. For instance, a node with ReLU as activation function can be written as:

    y = \max\Big(\sum_i w_i x_i + a,\; 0\Big)    (2.2)

If the sum of the weighted inputs is larger than zero, the signal is passed on without changing its intensity; otherwise, the signal vanishes.

2.1.3 Training process

The training process of an ANN is the process of finding a value for each parameter of the ANN so that its output is optimal. This involves three questions: what initial values should be given to the parameters, how to update the parameters, and how to define the optimal output.

Initial parameters
For smaller ANNs, the initial values of the parameters are normally set to numbers between 0 and 1 or to computer-generated random numbers in a certain range. However, this approach is reported to perform poorly in deep neural networks[9], and it takes more time for networks to converge even in "shallow" networks. In some extreme cases, an ANN with poor initial parameters is not able to converge at all. Apart from this approach, parameters can also be initialized with values from pretrained models, as explained in Section 2.5.

Loss function
Loss functions measure the distance between the actual value (target) and the output value. Take the mean square error (MSE) as an example:

    Loss = \frac{1}{n}\sum_{i=1}^{n}(y_i - p_i)^2

where y_i is the actual label/value and p_i is the prediction of the model. The aim of training is to decrease the loss as much as possible.
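As a concrete illustration of Equations 2.1 and 2.2 and of the MSE loss, the following is a minimal NumPy sketch written for this text (not code from the project); all names and values are arbitrary.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)           # ramp activation, Equation 2.2

def neuron(x, w, a):
    return relu(np.dot(w, x) + a)       # y = phi(sum_i w_i x_i + a), Equation 2.1

def mse(y, p):
    return np.mean((y - p) ** 2)        # mean square error over n samples

x = np.array([0.5, -1.2, 3.0])          # example input
w = np.array([0.4, 0.1, -0.2])          # example weights
print(neuron(x, w, a=0.1))              # forward pass of a single neuron
print(mse(np.array([1.0, 0.0]), np.array([0.8, 0.1])))
```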

In order to achieve this goal, the choice of loss function plays an important role. The cross entropy

    Loss = -\frac{1}{n}\sum_{i=1}^{n} p_i \ln y_i

is capable of representing the loss properly when the output layer is a softmax layer, as shown in Figure 2.3.

Figure 2.3. Cross entropy (black) and square error (red) of a two-layer network. W_1 and W_2 are the weights of the first and second layer.

Gradient descent and learning
Given the parameters and a loss function, the mechanism that links them together is gradient descent, a method for approaching the optimal values of the parameters. There are plenty of optimization methods, but they are all based on gradient descent. With L as the loss, w as a weight and \mu as the learning rate (set manually, normally less than 0.1), gradient descent for a one-layer network consists of the following steps (a small numeric sketch follows below):
1. Compute the derivative of L with respect to w, \partial L / \partial w.
2. Update w by adding -\mu \, \partial L / \partial w.
3. Repeat steps 1-2 until \partial L / \partial w is approximately 0.

In order to simplify the calculation, back-propagation (BP) is introduced. With h_i as the output of layer i, w_i as the weights of layer i, b_i as the bias of layer i and L as the loss, we have h_i = w_i h_{i-1} + b_i. Thus:

    \frac{\partial L}{\partial w_i} = \frac{\partial L}{\partial h_i}\frac{\partial h_i}{\partial w_i} = \frac{\partial L}{\partial h_i}\, h_{i-1}    (2.3)

where

    \frac{\partial L}{\partial h_i} = \frac{\partial L}{\partial h_{i+1}}\frac{\partial h_{i+1}}{\partial h_i} = \frac{\partial L}{\partial h_{i+1}}\, w_{i+1} = \frac{\partial L}{\partial h_l}\prod_{t=i+1}^{l} w_t    (2.4)
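The gradient-descent steps above can be demonstrated on a toy one-weight model; the sketch below is an assumed example for illustration, not the thesis code.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])      # generated with the true weight 2.0

w, mu = 0.0, 0.01                        # initial weight and learning rate
for step in range(200):
    p = w * x                            # model prediction
    grad = np.mean(2 * (p - y) * x)      # dL/dw for the MSE loss
    w -= mu * grad                       # update: w <- w - mu * dL/dw
print(round(w, 3))                       # converges towards 2.0
```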

2.2 Convolutional Neural Network

The CNN is a type of feed-forward ANN inspired by the animal visual cortex and is known for outstanding performance in image classification. Compared to regular fully connected feed-forward ANNs, CNNs are much easier to train due to sparse connectivity and shared weights. Sparse connectivity means that each neuron in a convolutional layer only takes a certain number of the output values of the previous layer as input, instead of all of them as in fully connected ANNs. Meanwhile, CNNs also share weights among hidden units, which means that inputs at different locations are filtered by the same learned kernels. These two properties dramatically decrease the number of parameters that need to be trained. Figure 2.4 shows LeNet-5, a simple convolutional neural network designed for handwritten and machine-printed character recognition.

Figure 2.4. Structure of LeNet-5. Each plane is a feature map.

2.2.1 Convolutional layer and feature map

Feature maps, as shown in Figure 2.4, are the results of applying functions across sub-regions of the entire image. The operations in a convolutional layer are listed below (a small code sketch follows below):
1. Convolve the input image f of size m x n with a linear filter g of size p x q. Mathematically, the 2-dimensional convolution o_{st} for image f at location (s,t) is

    o_{st} = (f_{st} * g)[p,q] = \sum_{u=0}^{p}\sum_{v=0}^{q} f_{st}[u,v]\, g[p-u, q-v]    (2.5)

where f_{st} denotes the p x q sub-region of f at location (s,t).
2. Add a bias b to o_{st}.
3. Apply a non-linear function \varphi (the activation function) to o_{st} + b.
4. Move the location (the values of s,t) by the stride and repeat steps 1-3 until all required locations are exhausted.
5. Change the filter and repeat steps 1-4 until all filters are exhausted.

The number of output feature maps is equal to the number of filters in the convolutional layer.
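The list above corresponds to the following NumPy sketch of a single filter producing one feature map (an assumed illustration, not project code; like most CNN libraries it computes cross-correlation rather than a flipped-kernel convolution, and real implementations are far more efficient).

```python
import numpy as np

def conv2d_feature_map(image, kernel, bias=0.0, stride=1):
    p, q = kernel.shape
    H = (image.shape[0] - p) // stride + 1
    W = (image.shape[1] - q) // stride + 1
    out = np.zeros((H, W))
    for s in range(H):                       # step 4: move the location by the stride
        for t in range(W):
            patch = image[s*stride:s*stride+p, t*stride:t*stride+q]
            out[s, t] = np.sum(patch * kernel) + bias   # steps 1-2
    return np.maximum(out, 0.0)              # step 3: non-linearity (ReLU)

image = np.random.rand(8, 8)
kernel = np.random.randn(3, 3)               # one learned filter -> one feature map
print(conv2d_feature_map(image, kernel).shape)   # (6, 6)
```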

2.2.2 Pooling layer

Pooling layers are used for down-sampling in CNNs. Down-sampling or subsampling decreases the size of the feature maps. Among the different kinds of pooling, average pooling and max pooling are the most commonly used. The benefit of pooling layers lies not only in the reduced dimensionality, which lowers the computation cost, but also in controlling overfitting. The process of pooling is shown in Figure 2.5.

Figure 2.5. Pooling.

2.2.3 C3D

Three-dimensional convolutional neural networks are a special kind of convolutional neural network which performs convolution in three dimensions. These networks extract features not only from the spatial dimensions (images) but also integrate information from the temporal dimension (video), as shown in Figure 2.6. Whereas in a 2D CNN all the filters are two-dimensional, in C3D all the filters are 3D filters. C3D has shown good performance (82.3% top-1 accuracy) on UCF101, a data set of 101 human action classes from videos in the wild[24].

2.3 Deep Neural Network

Deep neural networks (DNN) have shown state-of-the-art accuracy on many challenging databases such as ImageNet due to their larger feature space and solution space[9]. However, the training process of a DNN is much trickier than for "shallow" networks, since deeper networks are more likely to suffer from the vanishing gradient problem and the exploding gradient problem[9]. As shown in Equation 2.3 and Equation 2.4, if the network goes deeper, the product of the weights becomes so influential that the value of \partial L / \partial w_i becomes extremely large when the absolute values of the weights are larger than 1 (exploding gradient problem) and extremely small when they are smaller than 1 (vanishing gradient problem). In this case, the loss of the model oscillates instead of decreasing. Batch normalization and residual learning are two methods to address these problems.

Figure 2.6. A 3D convolutional neural network.

2.3.1 Batch Normalization

Batch normalization was first introduced in [12]; a successful application is Google's Inception network. As its name indicates, batch normalization normalizes data, specifically the layer inputs. Batch normalization makes higher learning rates possible, which accelerates the learning process of a DNN, and it helps prevent the vanishing and exploding gradient problems[12]. For each mini-batch x_1, x_2, ..., x_m, first calculate the mean \mu and the variance \sigma^2:

    \mu = \frac{1}{m}\sum_{i=1}^{m} x_i    (2.6)

    \sigma^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu)^2    (2.7)

Then normalize x_1, x_2, ..., x_m, using a small number \varepsilon in case \sigma = 0:

    \hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \varepsilon}}    (2.8)
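Equations 2.6-2.8 can be written in a few lines of NumPy. The sketch below is an assumed example; in practice batch normalization also applies a learned scale and shift, which is not discussed above and is omitted here.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    mu = x.mean(axis=0)                      # Equation 2.6
    var = ((x - mu) ** 2).mean(axis=0)       # Equation 2.7
    return (x - mu) / np.sqrt(var + eps)     # Equation 2.8

batch = np.random.randn(32, 4) * 5 + 3       # mini-batch of 32 samples, 4 features
x_hat = batch_norm(batch)
print(x_hat.mean(axis=0).round(3), x_hat.std(axis=0).round(3))  # ~0 and ~1
```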

2.3.2 Residual Learning

Residual learning is a framework for deep neural networks first introduced in [11]. It enables DNNs to be trained with high accuracy and to converge in less time[11]. The idea of residual learning is quite simple: instead of mapping inputs to outputs directly with stacked layers, a residual network uses the layers to map the residual fluctuation and thereby gradually builds up the output. The comparison is shown in Figure 2.7.

Figure 2.7. H(x) is any desired mapping; a plain net hopes the two weight layers fit H(x), while a residual net hopes the two weight layers fit F(x) and lets H(x) = F(x) + x.

2.4 RNN

As shown in Figure 2.1, recurrent neural networks are networks that have connections between neurons within the hidden layers. This makes it possible for an RNN to handle sequential information, where the order of the inputs matters and the meaning of the data depends on the "context". While a CNN shares weights by using filters in the spatial dimensions, an RNN shares weights by using the same function to handle information at different times in the time domain.

Figure 2.8. A standard RNN neuron. The RNN shares weights across the sequential data.

Figure 2.8 shows a simple example of a neuron in a standard RNN. With sequential data x_1, x_2, ..., x_m, the output of the node for input x_t is h_t:

    h_t = F_\theta(h_{t-1}, x_t)    (2.9)

For all x_t and h_t, the parameters of F_\theta are the same.
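Equation 2.9 leaves the form of F_\theta open; a common choice is a single tanh layer, which the following assumed sketch uses purely for illustration.

```python
import numpy as np

def rnn_step(h_prev, x_t, W_h, W_x, b):
    # h_t = F_theta(h_{t-1}, x_t), here realized as a tanh cell
    return np.tanh(W_h @ h_prev + W_x @ x_t + b)

d_h, d_x = 4, 3
W_h, W_x, b = np.random.randn(d_h, d_h), np.random.randn(d_h, d_x), np.zeros(d_h)
h = np.zeros(d_h)
for x_t in np.random.randn(5, d_x):      # the same parameters are reused at every t
    h = rnn_step(h, x_t, W_h, W_x, b)
print(h)
```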

2.4.1 LSTM

Long Short-Term Memory networks are a special kind of RNN with a different and more complex structure for the neural cells. An LSTM neuron has three gates: an input gate, a forget gate and an output gate.

Figure 2.9. An LSTM neuron.

Inside an LSTM neuron, three things need to be decided. First, how "clearly" the new information should be remembered. Secondly, how much of the previous memory should be forgotten. Thirdly, what signal should be passed on to influence other neurons. With sequential data x_1, x_2, ..., x_m, the influence of x_t on x_{t+1} is h_t, and c_t is the cell state after processing x_t. The first step is:

    i_t = \sigma(W_i [x_t, h_{t-1}] + b_i)    (2.10)

where \sigma is a sigmoid function producing a number in [0, 1]; 0 means the information conveyed by x_t and h_{t-1} is not used at all, and 1 means it is kept entirely. Then the new information is preprocessed:

    \tilde{c}_t = \tanh(W_c [x_t, h_{t-1}] + b_c)    (2.11)

i_t \tilde{c}_t is the new part that is added to the existing memory. While adding the new part is necessary, the old information in the memory is also affected by the new signals. The forget gate works as:

    f_t = \sigma(W_f [x_t, h_{t-1}] + b_f)    (2.12)

The range of f_t is [0, 1]; 0 means all the old memory is removed and 1 means it is kept entirely. Thus, the memory at step t can be inferred:

    c_t = f_t c_{t-1} + i_t \tilde{c}_t    (2.13)

Finally, the output gate o_t and the message h_t passed from x_t to x_{t+1} are calculated:

    o_t = \sigma(W_o [x_t, h_{t-1}] + b_o)    (2.14)

    h_t = o_t \tanh(c_t)    (2.15)
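The gate equations 2.10-2.15 translate directly into the following NumPy sketch of one LSTM step; the weight shapes and random initialization are assumptions made only for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    z = np.concatenate([x_t, h_prev])            # [x_t, h_{t-1}]
    i = sigmoid(W['i'] @ z + b['i'])             # input gate, Eq. 2.10
    c_tilde = np.tanh(W['c'] @ z + b['c'])       # candidate memory, Eq. 2.11
    f = sigmoid(W['f'] @ z + b['f'])             # forget gate, Eq. 2.12
    c = f * c_prev + i * c_tilde                 # new cell state, Eq. 2.13
    o = sigmoid(W['o'] @ z + b['o'])             # output gate, Eq. 2.14
    h = o * np.tanh(c)                           # hidden output, Eq. 2.15
    return h, c

d_x, d_h = 3, 4
rng = np.random.default_rng(0)
W = {k: rng.standard_normal((d_h, d_x + d_h)) for k in 'icfo'}
b = {k: np.zeros(d_h) for k in 'icfo'}
h, c = np.zeros(d_h), np.zeros(d_h)
for x_t in rng.standard_normal((16, d_x)):       # e.g. 16 per-frame features of a video
    h, c = lstm_step(x_t, h, c, W, b)
print(h)
```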

2.5 Transfer Learning

Transfer learning describes the fact that humans can apply knowledge learned in one field to another field to generate better results. In machine learning, predictions on new data are based on a statistical model trained with previously collected data. Once the problem domain, task or data distribution changes, models normally need to be retrained from scratch with the relevant data. In real-world applications, data collection is extremely resource-consuming, and so is the training process. To solve this problem, data scientists started to apply the transfer learning observed in humans to machine learning. Compared to the traditional learning approach, transfer learning allows knowledge learned from previous tasks to be used in the new target task[18]. The comparison between learning from scratch and transfer learning is shown in Figure 2.10.

Figure 2.10. Different learning approaches: traditional learning (left) and transfer learning (right).

3. Methodology

In this chapter, the specific approach is illustrated. It can be divided into three parts: data collection and pre-processing, feature extraction, and classification and evaluation. This research involves three models. The first is the audio-SVM model: audio features are extracted with OpenSMILE and classified with an SVM. The second is the CNN-LSTM model, which involves three steps: train the feature extractor, a CNN (the Inception-ResNet-v2 model); use the CNN to extract deep features from face images cropped from video frames; and use an LSTM as classifier to integrate the deep features and classify the emotion. The third is the video-C3D model, which uses face frames from videos as input and the C3D model as both feature extractor and classifier.

3.1 Data

Three data sets are involved in training and evaluating the models. The first is AFEW 6.0, the second is Static Facial Expressions in the Wild (SFEW), and the third is the Facial Expression Recognition 2013 data set (FER2013). FER2013 is used to train the CNN (Inception-ResNet-v2) model and SFEW is used to evaluate it. AFEW 6.0 is used for both training (60%) and evaluating (40%) the SVM, LSTM and C3D models.

AFEW 6.0
AFEW 6.0 is a data set consisting of video clips collected from movies and reality TV shows. It is the newest version of the AFEW data set; compared to AFEW 1.0-5.0, reality TV show clips are newly added. There are 1750 video clips in the data set, originally divided into 774 training videos, 383 validation videos and 593 test videos. Each of them is labelled with exactly one of the universally recognized seven emotions, as shown in Table 3.1. Due to the EmotiW contest, the labels of the test videos were not available when this project was conducted. All the videos are 25 fps (25 frames per second) and have a resolution of 720*576.

Emotion            Angry  Disgusted  Fear  Happy  Sad  Surprise  Neutral
Number of videos    197      114      127    213   178    120      207

Table 3.1. Emotion distribution of the AFEW 6.0 data set for training and validation.

SFEW
SFEW is a data set consisting of images collected from movie frames, each with a label from the seven emotions. There are 861 labelled images in total in the train and validation sets. The distribution of the SFEW data set is shown in Table 3.2. All the images are movie frames of 720*567 resolution.

Emotion            Angry  Disgusted  Fear  Happy  Sad  Surprise  Neutral
Number of images    153       53      91    188   143     86      147

Table 3.2. Emotion distribution of the SFEW data set for training and validation.

FER2013
The FER2013 database is an image data set containing 35889 48*48-pixel gray-scale facial expression images labelled with the seven universal emotions above. The facial expression images in FER2013 are also gathered from a wild environment (movies), and thus features learned from it can be applied to the AFEW 6.0 data set. The distribution of images over the categories is shown in Table 3.3.

Emotion            Angry  Disgusted  Fear  Happy  Sad  Surprise  Neutral
Number of images    4952     546     5120   8988  4829   6076     6197

Table 3.3. Emotion distribution of the FER2013 data set.

3.2 Pre-processing

Videos contain rich information. However, in order to train the models more efficiently, only the audio and facial crops from video frames are used for training.

AFEW 6.0
For the AFEW 6.0 data set, ffmpeg, a cross-platform open-source audio-video processing framework, is used to extract audio files and video frames. All the video clips are around 2 seconds long and have a frame rate of 25 frames per second (25 fps). Since not every frame in a video contains at least one human face, and in order to get enough cropped faces to analyze temporal information, all frames were extracted from the videos and the Dlib frontal face detector was used to crop the largest face in each frame. In the end, all the faces are resized to two standard sizes, 48*48 and 299*299, and converted to gray-scale images.

The gray-scale facial images and the audio extracted from the videos are used as input for further feature extraction, as shown in Figure 3.1.

Figure 3.1. Pre-processing of the AFEW 6.0 data.

SFEW
For the SFEW data set, the Dlib frontal face detector is also used to crop the largest face in each frame. The cropped faces are then resized to 299*299 and converted to gray-scale images for later use.

FER2013
For the FER2013 data set, all the faces are resized to 299*299.

Normalization
In addition, all the faces are linearly normalized to decrease the influence of illumination. For a pixel at location (x,y) with intensity value I(x,y), and with I_{max} and I_{min} the largest and smallest intensity values of the original image, the normalized intensity value I'(x,y) of the pixel is calculated as:

    I'(x,y) = \frac{I(x,y) - I_{min}}{I_{max} - I_{min}} \cdot 255    (3.1)
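A hedged sketch of the pre-processing steps above (largest-face crop with Dlib, resize, gray-scale conversion and the linear normalization of Equation 3.1) is given below; it assumes OpenCV and Dlib are available, and the function name is illustrative rather than taken from the project code.

```python
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()

def preprocess_frame(frame_bgr, size=48):
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector(gray, 1)
    if len(faces) == 0:
        return None                                    # no face found in this frame
    r = max(faces, key=lambda d: d.width() * d.height())  # keep the largest face
    crop = gray[max(r.top(), 0):r.bottom(), max(r.left(), 0):r.right()]
    crop = cv2.resize(crop, (size, size)).astype(np.float32)
    i_min, i_max = crop.min(), crop.max()
    return (crop - i_min) / (i_max - i_min + 1e-8) * 255.0   # Equation 3.1
```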

3.3 Feature Extraction

Before training the classifiers, the corresponding features must be extracted. In this research, audio features are extracted using OpenSMILE, while facial image features are extracted from a pre-trained and fine-tuned deep CNN model.

3.3.1 Audio Features

The OpenSMILE (Speech and Music Interpretation by Large-space Extraction) feature extraction toolkit is used to extract audio features. OpenSMILE can extract audio low-level descriptors such as Mel-frequency cepstral coefficients, loudness, perceptual linear predictive cepstral coefficients, line spectral frequencies and formant frequencies. The extracted feature of a 2-second audio clip is a vector of 1582 dimensions.

3.3.2 CNN Deep Features

Inception ResNet V2 network
An Inception-ResNet-v2 network pretrained on the ImageNet database is used for deep feature extraction. Inception-ResNet-v2 is a deep convolutional neural network utilizing residual connections and the Inception deep convolutional architecture[22], as shown in Figure 3.2 and Figure 3.4.

Figure 3.2. The structure of the Inception-ResNet-v2 network. Detailed structures of the repeating blocks are shown in Figure 3.4 and Figure 3.3.

Inception-ResNet-v2 has the highest classification accuracy (top-1 accuracy: 80.4%, top-5 accuracy: 95.3%)[22] on the ILSVRC image classification benchmark so far. Consequently, it can be considered sufficiently capable of extracting features from images of varied content. Thus, the parameter values of Inception-ResNet-v2 pretrained on ImageNet are used as the initial values of the Inception-ResNet-v2 model in this project. However, since ImageNet has 1000 classes while the 7 universal emotions are used as classes in this research, the parameters of the fully connected layers of the ImageNet-pretrained model are not suitable for this model. All the parameters in the AuxLogits block and the Logits block are therefore generated randomly instead of being restored from the pretrained model.

Figure 3.3. On the left is the Stem block of Figure 3.2. On the right are block A (below) and block B (above) of Figure 3.2.

Figure 3.4. The structure of the Inception-ResNet-v2 network blocks. The block on the left is block A of Figure 3.2, the block in the middle is block B, and the block on the right is block C.

Fine-tune
In order to enhance the ability of the Inception-ResNet-v2 model to extract facial expression features, the Facial Expression Recognition 2013 data set (FER2013) is used to fine-tune the Inception-ResNet-v2 model pretrained on ImageNet. All layers of the pretrained model are tuned with the FER2013 database, which is divided into a training set of 28709 images and a validation set of 7188 images. The learning rate is set to 0.01 for steps 1-30000; then, with a learning rate of 0.0001, the model is fine-tuned for another 2000 steps. After fine-tuning, this model is able to classify facial emotions from static images. The output layer produces the classification result, while the output of the convolutional layers serves as the deep emotional features of the facial images.
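The deep-feature extraction described here can be sketched with the Keras Inception-ResNet-v2 application model, whose global-average-pooled output is a 1536-dimensional vector per face, matching the deep feature dimension used later in Section 3.4.2. This is an assumed re-implementation for illustration, not necessarily the exact framework or checkpoint used in the thesis.

```python
import numpy as np
from tensorflow.keras.applications.inception_resnet_v2 import (
    InceptionResNetV2, preprocess_input)

# include_top=False with global average pooling yields a 1536-d vector per image
backbone = InceptionResNetV2(weights='imagenet', include_top=False, pooling='avg')

def deep_feature(gray_face_299):
    """gray_face_299: (299, 299) gray-scale face with values in [0, 255]."""
    rgb = np.repeat(gray_face_299[..., None], 3, axis=-1)    # replicate the channel
    batch = preprocess_input(rgb[None].astype(np.float32))
    return backbone.predict(batch, verbose=0)[0]             # shape (1536,)
```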

3.4 Model Training

3.4.1 SVM for Audios

Input
The audio features explained in Section 3.3.1 are used as model input.

Model details
After extracting the 1582-dimensional audio features, an SVM is trained as classifier. Scikit-learn, an open-source machine learning library, is used to train this SVM model. Specifically, Classification SVM type 1 (C-SVM) is used. Training a C-SVM is a process of minimizing the error function

    \frac{1}{2} w^T w + C \sum_{i=1}^{N} \varepsilon_i    (3.2)

subject to the constraints

    y_i (w^T \Phi(x_i) + b) \geq 1 - \varepsilon_i    (3.3)

    \varepsilon_i \geq 0,\; i = 1, ..., N    (3.4)

where y_i is the class label, x_i is the input data, w is the coefficient vector, b is a bias, \varepsilon_i is the slack variable for a single input, and C is the capacity constant. The kernel \Phi is used to transform the input data into the feature space.

Training Method
To find the optimal parameter set, 10-fold cross validation is used (a scikit-learn sketch follows below). For the linear kernel, the candidate values of C are 1, 10 and 100. For the radial basis function (rbf) kernel, the candidate values of C are 1, 10, 100 and 1000, and the candidate values of gamma are 0.01, 0.001 and 0.0001. The training set of AFEW is used for training, and the validation set of AFEW is used to evaluate the performance of all the SVM models. The best-performing combination of parameters is chosen as the model to test.

Output
The output of the SVM model is a single number indicating the predicted class.
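The grid search described above maps directly onto scikit-learn's GridSearchCV. The sketch below uses placeholder random data in place of the real openSMILE features, purely for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Placeholder data standing in for the 1582-d openSMILE features of the AFEW
# training set (the real features come from the pipeline in Section 3.3.1).
X_train = np.random.randn(140, 1582)
y_train = np.repeat(np.arange(7), 20)            # 7 emotion classes

param_grid = [
    {'kernel': ['linear'], 'C': [1, 10, 100]},
    {'kernel': ['rbf'], 'C': [1, 10, 100, 1000], 'gamma': [0.01, 0.001, 0.0001]},
]
search = GridSearchCV(SVC(), param_grid, cv=10)  # 10-fold cross validation
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```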

3.4.2 LSTM Model

The deep feature of a facial image (a single frame) is extracted from the output of the last flatten layer, which is the input of the dropout layer (the purple layer in Figure 3.2). The deep feature of a single facial image is a 1536-dimensional vector. For each video, 16 facial images are used to build the input sequence. For videos in the AFEW database, each video contains around 50 frames. However, not every frame contains a face, and in some cases the Dlib frontal face detector is not able to locate faces due to head posture, illumination etc. Thus, it is common that some videos have more than 16 faces while others have fewer. For videos with x faces and x >= 16, a random number s between 0 and x - 16 is generated and the deep features from face s to face s + 15 are used as input to the LSTM model. For videos with x faces and x < 16, padding is needed: a randomly selected face from the existing faces is duplicated and added to the face collection until the number of faces reaches 16. Thus, the input of the LSTM is a 16*1536-dimensional feature. A code sketch of this frame-selection rule follows at the end of this section.

Model Details
A one-layer LSTM model is used.

Training Method
The learning rate of the LSTM is set to 0.001, with 40000 training iterations and a batch size of 32.

Output
The output of the LSTM model is seven logits indicating the likelihood of each emotion.

3.4.3 C3D

The input of C3D is the cropped faces of size 48*48 described in Section 3.2. As explained in Section 3.4.2, the number of faces found per video varies, so a similar method is used to get a fixed-size input for C3D. For videos with more than 16 faces, 16 sequential faces are chosen randomly as input. For videos with at least one but fewer than 16 faces, the padding technique of Section 3.4.2 is used. Videos with no face found inside are removed from the training set.

Model Details
There are 7 hidden layers in the C3D model, as shown in Figure 3.5. The first five layers are convolutional layers that extract video features; their kernels are all 3*3*3 with a stride of 1 in every dimension. The last 2 layers are fully connected layers used for classification. This model structure has been proven efficient for classifying videos in the UCF101 database, a database of 101 different kinds of human actions (101 classes), where it achieved an accuracy of 72.6%[25].

Output
The output is seven logits indicating the likelihood of each emotion.
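The frame-selection and padding rule shared by the LSTM and C3D inputs can be sketched as follows (an illustrative helper, not the project code).

```python
import random

def select_16_faces(faces):
    """faces: list of per-frame face crops (or deep feature vectors) for one video."""
    if len(faces) == 0:
        return None                          # video is dropped from the training set
    if len(faces) >= 16:
        s = random.randint(0, len(faces) - 16)
        return faces[s:s + 16]               # 16 consecutive faces from a random start
    padded = list(faces)
    while len(padded) < 16:
        padded.append(random.choice(faces))  # duplicate a randomly chosen face
    return padded
```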

Figure 3.5. The structure of the C3D model.
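For reference, below is a hedged Keras sketch of a C3D-style network with five 3*3*3 convolutional layers and two fully connected layers, matching the structure described in Section 3.4.3; the channel widths and pooling configuration are assumptions, not the thesis configuration.

```python
from tensorflow.keras import layers, models

def build_c3d(num_classes=7, frames=16, size=48):
    m = models.Sequential([layers.Input(shape=(frames, size, size, 1))])
    for filters in (32, 64, 128, 128, 256):           # five 3x3x3 conv layers, stride 1
        m.add(layers.Conv3D(filters, 3, padding='same', activation='relu'))
        m.add(layers.MaxPooling3D(pool_size=2, padding='same'))
    m.add(layers.Flatten())
    m.add(layers.Dense(256, activation='relu'))       # two fully connected layers
    m.add(layers.Dense(num_classes))                  # seven logits
    return m

print(build_c3d().output_shape)                        # (None, 7)
```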

4. Results

In this chapter, the performance of each model is evaluated independently. The accuracy of all models is reported, and the confusion matrices for the Inception-ResNet-v2 model, the LSTM model and the C3D model are displayed as well. The learning processes of the Inception-ResNet-v2 model, the LSTM model and the C3D model are also shown to provide more detail.

4.1 On Audios

Of the 15 SVM models trained, the linear ones have better accuracy, as shown in Table 4.1. Moreover, the accuracy does not vary much with the parameters C and gamma. After the grid search, the linear SVM model with C = 1 is chosen as the classifier due to its accuracy and efficiency.

kernel \ C              1      10     100    1000
linear                0.229   0.229  0.229    -
rbf  gamma=0.01       0.194   0.194  0.194  0.194
rbf  gamma=0.001      0.194   0.194  0.194  0.194
rbf  gamma=0.0001     0.193   0.193  0.193  0.194

Table 4.1. The accuracy of SVM models with different parameters.

With the model trained on the AFEW 6.0 training set, the AFEW 6.0 validation set is used to evaluate the model, as shown in Table 4.2. The accuracy of the SVM model on the validation set is 25%. The classifier is better at angry videos, with an f1-score of 0.37, while the performance on disgust and surprise videos is much worse than average.

            precision  recall  f1-score  support
angry          0.31     0.44     0.37       64
disgust        0.07     0.10     0.08       40
fear           0.24     0.22     0.23       46
happy          0.24     0.30     0.27       63
sad            0.31     0.20     0.24       61
surprise       0.15     0.09     0.11       46
neutral        0.31     0.24     0.27       63
total          0.25     0.24     0.24      383

Table 4.2. Accuracy of the audio SVM model tested on the AFEW validation set.

4.2 Inception ResNet V2 On Static Images

The Inception-ResNet-v2 model is first fine-tuned with FER2013. The training process of the Inception-ResNet-v2 model is shown in Figure 4.1. The total number of training steps is 32000; after 30000 training steps, the learning rate is adjusted to 0.0001 to fine-tune the model. After 20000 steps of training, the loss tends to be stable and the accuracy remains the same.

Figure 4.1. Details of fine-tuning: (a) training loss, (b) learning rate.

After fine-tuning, the accuracy of the model on the validation set of FER2013 is shown in Table 4.3. The model performs better on the disgusted, happy and surprise emotions; the high accuracy on disgusted images stands out. Overall, the accuracy is 60% on 5740 images in FER2013.

Emotion    Angry  Disgusted  Fear  Happy  Sad  Surprise  Neutral  Overall
Accuracy    54%      80%      48%    78%   49%    67%      48%      60%

Table 4.3. Accuracy on the FER2013 test set.

4.2.1 Testing on SFEW

The accuracy of the Inception-ResNet-v2 model shown in Table 4.3 is based on training and testing data that both come from the FER2013 database and are thus cropped using the same method. However, this model is meant to recognize all facial images from movies cropped by the Dlib frontal face detector, and different cropping methods may result in different facial images and may affect the prediction result. Thus, testing the model on SFEW is necessary. All the pre-processed labelled images from SFEW described in Section 3.2 are used for testing the fine-tuned model. The confusion matrix of the fine-tuned Inception-ResNet-v2 model on SFEW is shown in Table 4.4 (values in %).

Label \ Prediction   Angry  Disgust   Fear   Happy    Sad   Surprise  Neutral
Angry                42.48     0     15.69   16.34   11.76    5.88      7.84
Disgust              24.53     0     13.21   24.53   15.09    0        22.64
Fear                  6.59     0     21.98    9.89   21.98   24.18     15.38
Happy                 1.60     0      0.53   86.70    4.79    0.53      5.85
Sad                   0.70     0     22.38   11.89   43.36    4.90     16.78
Surprise              5.81     0     16.28    6.98   13.95   41.86     15.12
Neutral               4.76     0     12.93    4.08   25.85    6.80     45.58

Table 4.4. Results on static facial expressions.

In total 861 images were tested and 413 images were correctly predicted; the overall accuracy over all emotions is 47.97%. As shown in Table 4.4, the Inception-ResNet-v2 model retains satisfying performance on happy images. However, it fails to recognize disgusted images completely, with an accuracy of 0.

4.2.2 Failed Images

In Figure 4.2, some of the failed images from the SFEW database are listed. They are chosen from the movie The Hangover. The prediction of the model and the ground-truth label of each image are listed below.

4.3 On Videos

4.3.1 LSTM

The validation set of AFEW 6.0 is used to test the accuracy of the LSTM. The loss decreases dramatically during the first 5000 training steps, as shown in Figure 4.3, and declines slowly during the rest of the training.

Figure 4.2. Failed images from the SFEW data set. (a) Prediction: Sad. Ground truth: Neutral. (b) Prediction: Sad. Ground truth: Angry. (c) Prediction: Neutral. Ground truth: Disgust. (d) Prediction: Surprise. Ground truth: Fear. (e) Prediction: Angry. Ground truth: Surprise. (f) Prediction: Happy. Ground truth: Fear.

In the end, the loss fluctuates around 1.3. While the loss decreases, the accuracy increases during training, as shown in Figure 4.4. The training accuracy and testing accuracy both increase during the first 5000 steps of training; however, further training fails to increase the accuracy on the testing set. The confusion matrix of the LSTM model is shown in Table 4.5. The LSTM model is more capable of classifying the angry and happy emotions, while it largely fails on the rest. The fact that 41.67% of the videos are classified as neutral indicates that overfitting might exist.

Label \ Prediction   Angry  Disgust  Fear  Happy  Sad  Surprise  Neutral
Angry                  24      1      4      1     0      0         8
Disgust                 8      1      1      3     1      3        14
Fear                    9      1      3      1     1      0        12
Happy                   6      0      2     28     0      2        10
Sad                     9      1      5      0     1      0        14
Surprise                4      1      4      4     5     18        25
Neutral                 4      1      2      3     1      1        32

Table 4.5. The confusion matrix of the LSTM model.

Figure 4.3. The training loss of the LSTM model.

Figure 4.4. The training accuracy and testing accuracy of the LSTM model.

4.3.2 C3D

Two C3D models are trained, with different learning rates.

C3D-1
C3D-1 is the first C3D model trained. It is trained with a learning rate of 0.01, which decays every 2700 steps with a decay rate of 0.1, as shown in Figure 4.5. The loss decreases dramatically during the first 1000 training steps and decreases slowly during the following 9000 steps. Finally, the training loss fluctuates slightly around 1.9. The accuracy of the model increases quickly during the first 3000 training steps; however, the training accuracy is not stable at all. After smoothing the curve, we can see that the accuracy stays around 22%.

After training, the validation set of AFEW 6.0 is used to evaluate the accuracy of C3D-1. The confusion matrix is shown in Table 4.6. The fact that all the videos are labelled as happy shows that the model completely overfits the training data.

Label \ Prediction   Angry  Disgust  Fear  Happy  Sad  Surprise  Neutral
Angry                  0       0      0     48     0      0         0
Disgust                0       0      0     34     0      0         0
Fear                   0       0      0     32     0      0         0
Happy                  0       0      0     53     0      0         0
Sad                    0       0      0     30     0      0         0
Surprise               0       0      0     45     0      0         0
Neutral                0       0      0     39     0      0         0

Table 4.6. C3D-1 results on the AFEW 6.0 validation set.

The overall accuracy of C3D-1 on the AFEW 6.0 validation set is 18.86%; 53 videos are correctly labelled out of 281 videos in total.

C3D-2
In order to further decrease the possibility of overfitting, another C3D model is trained. The training process is shown in Figure 4.6. C3D-2 is trained with a smaller learning rate and more training steps; ideally it should avoid overfitting and converge more slowly. As shown in Figure 4.6, the learning rate of C3D-2 is set to 0.00001 and decays every 1600 steps with a decay rate of 0.1. The average loss drops quickly to around 3.3 during the beginning phase of training but remains the same in the following training steps. After the learning rate drops below 0.000001, the learning of the model can be considered stopped. Both the loss and the training accuracy remain very unstable; by the end of the training process, the training accuracy is around 19%.

After training, the validation set of AFEW 6.0 is used to evaluate the accuracy of C3D-2. The confusion matrix is shown in Table 4.7. No videos are labelled as disgust or fear, and some videos are labelled as surprise correctly.

Figure 4.5. Details of the C3D-1 training process: (a) learning rate, (b) training loss, (c) training accuracy.

Figure 4.6. Details of the C3D-2 training process: (a) learning rate, (b) training loss, (c) training accuracy.

Most of the videos, however, are classified as happy, which clearly still indicates overfitting.

Label \ Prediction   Angry  Disgust  Fear  Happy  Sad  Surprise  Neutral
Angry                  4       0      0     33     0      7         4
Disgust                1       0      0     26     0      6         1
Fear                   3       0      0     17     0     10         2
Happy                  4       0      0     35     0      8         6
Sad                    0       0      0     35     0      0         0
Surprise               0       0      0     30     0      8         7
Neutral                2       0      0     24     0      6         7

Table 4.7. C3D-2 results on the AFEW 6.0 validation set.

The overall accuracy of C3D-2 on the AFEW 6.0 validation set is 21.7%; 61 videos are correctly labelled out of 281 videos in total.

5. Discussion

In this chapter, conclusions are drawn based on the comparison with the work of EmotiW competitors, and future work is proposed based on these conclusions.

5.1 Conclusion

5.1.1 Audios

The SVM model shows a weak ability to distinguish certain emotions. Compared to an accuracy of 14.3% when a given video is randomly assigned an emotion, the model has a better accuracy of 25%. As shown in Table 4.2, the SVM model is more capable of distinguishing the angry emotion than any other, with an f1-score of 0.37, much higher than the average f1-score of 0.24. This might be due to the fact that most angry videos involve someone shouting loudly, which makes it easy for the classifier to find a key feature. On the other hand, the SVM model is not capable of classifying disgust and surprise: hardly any sounds are distinctively "surprised" even to human ears, while there is not enough audio for the model to find the key feature of disgust. Besides, the audio within a certain category can vary even more than audio across categories. For instance, some surprise audio clips are so quiet that they could be considered neutral, while others are so noisy that they could be classified as angry or fear. It is difficult to compare this audio-SVM model with state-of-the-art work, since the EmotiW competitors did not report the precise accuracy of their audio models. Overall, the audio model has limited ability to classify emotions, and with a 1582-dimensional feature for a 2-second audio clip, it is neither efficient nor precise.

5.1.2 Image model

The Inception-ResNet-v2 model shows a good ability to classify static facial emotions under challenging conditions. On the FER2013 database, the accuracy is 60%, 6% lower than the state-of-the-art performance[14]. On the SFEW database, the accuracy is 47.97%, an increase of 9% over the image baseline, while the state-of-the-art result on SFEW from the winning team of EmotiW 2015 is 61.6%[6]. As shown in Table 4.3 and Table 4.4, the deeper model did not improve performance significantly, but it generalizes reasonably well across databases.

The reason the Inception-ResNet-v2 model could not outperform previous models might be the different cropping method, the varying input sizes and the different training method. The state-of-the-art result on SFEW uses faces cropped with the Viola-Jones face detector, while in this research the Dlib frontal face detector is used, which means faces at certain angles might not be detected.

The FER2013 images are 48*48 gray-scale images, whereas the input size of the pretrained Inception-ResNet-v2 model is 299*299*3. After resizing all images to 299*299, every pixel of a FER2013 image is stretched over an area of at least 6*6 pixels. This prevents the first several layers of the Inception-ResNet-v2 model from extracting much useful information, since the strides of those layers are 1 or 2 and the filter sizes are 3*3 or 1*1 pixels: unless a filter sits on the border between two 6*6-pixel areas, it can hardly extract any useful information. On the other hand, the cropped faces from the SFEW database are of different sizes; many of them are large and full of details, and the faces from SFEW are RGB-color images. The Inception-ResNet-v2 model trained with FER2013 is not able to exploit those image details. Besides, ImageNet is a database of both colored and gray-scale images; without color information in the FER2013 database, the full power of the Inception-ResNet-v2 model and of its pretrained parameters is not utilized.

5.1.3 Video models

LSTM
The LSTM model shows a good ability to integrate temporal information. As shown in Section 4.3.1, the accuracy of the LSTM model on AFEW 6.0 is 41.67%, an increase of 4% over the baseline[5], but still 3.67% lower than the state of the art[2]. Compared to the state-of-the-art research, this LSTM model uses less training data, a different cropping method and a different input feature.

The state-of-the-art result using the same method has more training data[2]: they used both the training set and the validation set of the AFEW 6.0 database for training and the test set of AFEW 6.0 for testing. In this research, only the training set is used for training, since the labels of the test set were not available at the time. Also, the state-of-the-art research uses the faces provided in AFEW 6.0, which are cropped with the Viola-Jones face detector, and they additionally developed a face classifier to remove non-faces. In this research, faces are cropped with the Dlib frontal face detector and no face classifier was developed due to the limited time. Besides, the input features of the models are different. The state-of-the-art model uses deep features extracted by VGG16-FACE, a CNN model pretrained on a face database and FER2013. Even though VGG16-FACE has a much higher accuracy (70.74%) on the FER2013 database, the outcome of the LSTM model is not