Deep Learning of Human Emotion Recognition in Videos
Yuqing Li, Uppsala University


Abstract

Machine learning in computer vision has made great progress in recent years. Tasks like object detection, object classification and image segmentation have reached near or even above human performance. Meanwhile, tasks like human emotion recognition remain challenging. In this paper, machine learning techniques are used to recognize human emotions in movie images and videos. First, the theoretical background of these techniques is introduced. Second, informative content, including audio, single video frames and multiple video frames, is extracted from videos to represent emotions. In this step, OpenSMILE and an Inception-ResNet-v2 model are used to extract feature vectors from audio and frames respectively. Third, various models are trained to classify the emotions: an SVM is used to classify audio feature vectors, Inception-ResNet-v2 is used to recognize emotions in static images, and a C3D model is used to classify a sequence of frames (video). After that, the accuracy of these models is reported. Finally, the advantages and disadvantages of these models are discussed, as well as possible improvements for future studies on human emotion recognition.

Contents

1 Introduction
  1.1 Background
  1.2 Previous research
    1.2.1 Emotion categories
    1.2.2 Data set
    1.2.3 Hand-crafted features
    1.2.4 Deep features
  1.3 Problem Formulation
2 Theory
  2.1 Artificial Neural Network
    2.1.1 Architecture
    2.1.2 Neurons
    2.1.3 Training process
  2.2 Convolutional Neural Network
    2.2.1 Convolutional layer and feature map
    2.2.2 Pooling layer
    2.2.3 C3D
  2.3 Deep Neural Network
    2.3.1 Batch Normalization
    2.3.2 Residual Learning
  2.4 RNN
    2.4.1 LSTM
  2.5 Transfer Learning
3 Methodology
  3.1 Data
  3.2 Pre-processing
  3.3 Feature Extraction
    3.3.1 Audio Features
    3.3.2 CNN Deep Features
  3.4 Model Training
    3.4.1 SVM for Audios
    3.4.2 LSTM Model
    3.4.3 C3D
4 Results
  4.1 On Audios
  4.2 Inception ResNet V2 On Static Images
    4.2.1 Testing on SFEW
    4.2.2 Failed Images
  4.3 On Videos
    4.3.1 LSTM
    4.3.2 C3D
5 Discussion
  5.1 Conclusion
    5.1.1 Audios
    5.1.2 Image model
    5.1.3 Video models
  5.2 Future Work
References

1. Introduction

1.1 Background

In recent years, thanks to the rapid development of computer vision and machine learning, tasks like object classification, action recognition and face recognition have produced fruitful achievements. However, human emotion recognition remains one of the most challenging tasks, and a lot of effort has been made to solve this problem. Since the first Emotion Recognition in the Wild (EmotiW) challenge was held in 2013, the accuracy of video emotion classification has increased from the 38% baseline to 59%[2]. This is great progress, but still unsatisfying. On the one hand, this is probably due to the lack of labeled video data and the ambiguous nature of human facial expressions. On the other hand, the lack of effective ways to extract facial emotion features also affects model performance. In recent years, pretrained deep convolutional neural networks have been proven to perform well in extracting image features on challenging databases such as ImageNet[1]; the Long Short-Term Memory (LSTM) network shows exciting prediction accuracy on sequential data[6]; and the three-dimensional convolutional neural network (C3D) achieves high performance in video action detection[2]. Thus, applying these new techniques and combining them together may boost the accuracy of human emotion recognition in videos.

1.2 Previous research

The study of automatic human facial emotion recognition started with defining and categorizing human facial expressions. After that, researchers built databases that contained labelled facial expression examples. Finally, various approaches have been used to recognize human emotions.

1.2.1 Emotion categories

The study of facial emotion recognition can be traced back to the 1970s. Paul Ekman and his colleagues[7] found that there are six facial expressions (happiness, sadness, anger, fear, surprise, disgust) that can be understood by people from different cultures. Differences in background influence facial expressions mainly in intensity[10]. For example, watching the same comedy film, Americans tend to laugh with their mouths wide open while Japanese are more

likely to smile without showing their teeth. The observation that infants are able to show a wide range of facial expressions and respond to facial expressions from others without being taught suggests that the ability to convey and understand emotions via facial expressions is inherent in humans.

1.2.2 Data set

Several data sets have been established to build emotion recognition models and evaluate their performance. The same method can produce dramatically different results on different data sets, due to the variance among them. Facial emotion databases can be divided into two categories: lab data sets like the Cohn-Kanade CK+ database[16] and wild data sets like Acted Facial Expressions in the Wild (AFEW)[5]. In the latter, facial expression images or videos are chosen from movies and online videos, which differ significantly in resolution, illumination, head pose, etc., while most lab data sets control all these factors carefully. Figure 1.1 shows the difference between the two kinds of data sets.

Figure 1.1. The Cohn-Kanade CK+ database (above) has frontal facial images with stable illuminance. The Facial Expression Recognition 2013 (FER-2013) data set (below) has images cropped from movies that vary in head posture and illuminance.

1.2.3 Hand-crafted features

There are two approaches to crafting facial features by hand from original images/videos: the geometric features approach and the appearance approach.

Geometric-feature-based methods extract information about facial components and their movements, imitating how humans understand facial emotions. One example is the Facial Action Coding System (FACS). In order to describe facial expressions precisely, FACS breaks each facial expression down into several Action Units (AU), each representing a facial muscular movement[7]. Based on FACS, categorization of facial expressions was conducted by recognizing certain facial movements[19]. Before the 1990s, encoding facial expressions using FACS was done manually and thus was very inefficient. Moreover, geometric-feature-based methods are highly dependent on the accuracy of facial component recognition and tracking, which makes them less reliable than appearance-based methods. Computers became part of the game in the 1990s, and since then appearance-based methods have been quite popular. Optical flow (OF)[26], 2D Fourier transform coefficients[15], Local Binary Patterns (LBP)[21] and facial motions[8] were popular new features. Among these, optical flow captures the movement of surfaces and objects in video; the 2D Fourier transform converts spatial-domain information into the frequency domain, which allows researchers to decrease the dimension of an image/video significantly; LBP, on the other hand, mainly compares a pixel with its nearby pixels and encodes the unique spatial pattern. New classification models also contributed to the task. The hidden Markov model, a simplified Bayesian network that aims to discover hidden patterns in features, was able to classify facial expressions in near real-time[17].

1.2.4 Deep features

There are plenty of challenges in the Computer Vision (CV) area besides human facial emotion recognition, and there are data sets and competitions built for these tasks. The most famous is the ImageNet database and the ImageNet Large Scale Visual Recognition Competition (ILSVRC), focusing on object recognition in images. ImageNet contains over 10 million images in around 1000 classes. Since AlexNet, a 5-layer convolutional neural network, proved successful in ILSVRC in 2012, deep neural networks, which can extract more complex features, came into popularity[13] in all CV research areas. For a convolutional neural network, the convolutional layers are considered the feature extractors while the fully connected layers are considered the classifiers. If the network consists of several convolutional layers, the output of the last convolutional layer is called a deep feature. For a deep network, multiple convolutional layers mean a large solution domain, thus deep features (the output tensor of the last non-classification layer) have higher dimensions and contain more information from the input image.

Consequently, deep features are used in emotion recognition tasks and significantly improve classification results. Similar to ImageNet and ILSVRC, in the facial emotion recognition area there are the Emotion Recognition in the Wild (EmotiW) Challenge and the EmotiW databases designed for the challenge. The challenge was first held in 2013 with two databases, Acted Facial Expressions in the Wild (AFEW) and Static Facial Expressions in the Wild (SFEW). The baseline accuracy was 38%, with Local Binary Patterns on Three Orthogonal Planes (LBP-TOP) as features and a Support Vector Machine (SVM) as classifier[4]. In the following years of the competition, solutions utilizing deep pretrained neural networks to extract image features and Long Short-Term Memory (LSTM) to take temporal influence into account proved efficient with limited labeled data[6]. The winning team of EmotiW 2016 successfully implemented a three-dimensional convolutional neural network (C3D) and achieved the best performance, with an accuracy of 59%[2]. In this research, the databases from EmotiW will be used to train the models. Meanwhile, the baseline and competition results will be used to evaluate the performance of the models trained in this work.

1.3 Problem Formulation

This project consists of several sub-problems that need to be solved:
1. Compare various emotion features extracted from videos.
2. Evaluate the Inception-ResNet-v2 model's performance on human facial emotion recognition.
3. Evaluate the performance of the C3D model.

2. Theory

This chapter covers the theoretical background of the models and methods implemented in this project to extract deep features and classify video emotions.

2.1 Artificial Neural Network

An Artificial Neural Network (ANN) is a computational model consisting of a collection of artificial neurons as its basic computation units. There are several variants of the ANN, such as the Convolutional Neural Network (CNN) and the Recurrent Neural Network (RNN). Based on different architectures and neurons, ANNs can be used to solve different problems.

2.1.1 Architecture

The structure of an ANN is determined by two factors. The first is how many layers, and how many neurons per layer, the ANN has. The second is how information/inputs are transferred through the ANN. For the former factor, the more layers an ANN has, the deeper it is, while the more neurons in each layer, the fatter it is. More neurons mean a larger solution domain at the cost of longer training time. With limited computational power and time (the number of weights that can be trained), thinner and deeper ANNs are proven to perform better[20]. For the latter factor, the number of possible ways to connect neurons is enormous, and most of them are not feasible to train at the moment. Of all the feasible networks, the most commonly used and typical ones are the fully connected feed-forward network and the recurrent neural network (RNN), as shown in Figure 2.1.

2.1.2 Neurons

Neurons in an ANN work in a similar way to neuron cells in animal brains. Neuron cells receive stimuli, process them and produce feedback based on them. Artificial neurons do exactly the same thing by summing their inputs, adding a bias and using an activation function to decide the response, as shown in Figure 2.2. Mathematically, the operations in a neuron can be summarized as below:

Figure 2.1. The feed-forward network is on the left; the RNN is on the right. As the name indicates, neurons in a fully connected feed-forward network only take the output of all neurons in the previous layer as their input, and the flow of information is unidirectional. Meanwhile, the input of a neuron in an RNN may also come from other neurons in the same layer.

Figure 2.2. Artificial neurons

y = ϕ(∑_i w_i x_i + a)   (2.1)

where x_i is an input, w_i is the weight of x_i, a is the bias added in this neuron and ϕ is the activation function.

Activation function

The activation function determines the output of a neuron. Commonly used activation functions include the logistic function (sigmoid), the hyperbolic tangent function (tanh), the ramp function (ReLU) and the normalized exponential function (softmax). These activation functions work as filters, deciding whether the information will be passed on and how strong the signal will be. For instance, a node with ReLU as activation function can be written as:

y = max(∑_i w_i x_i + a, 0)   (2.2)

If the sum of weighted inputs is larger than zero, the signal is passed on without changing its intensity. Otherwise, the signal vanishes.

2.1.3 Training process

The training process of an ANN is the process of finding a value for each parameter of the ANN so that the output of the ANN is optimal. This involves three problems: what initial values should be given to the parameters, how to update the parameters, and how to define the optimal output.

Initial parameters

For smaller ANNs, the initial values of the parameters are normally set to a number between 0 and 1, or to computer-generated random numbers in a certain range. However, this approach is reported to perform poorly in deep neural networks[9], and it makes networks take longer to converge even in "shallow" networks. In some extreme cases, an ANN with poor initial parameters is not able to converge at all. Alternatively, parameters can be initialized with values from pretrained models, as explained in Chapter 2.5.

Loss function

Loss functions are used to calculate the distance between the actual value (target) and the output value. Take the mean square error (MSE) as an example:

Loss = (1/n) ∑_{i=1}^{n} (y_i − p_i)²

where y_i is the actual label/value and p_i is the prediction of the model. The aim of training is to decrease the loss as much as possible.
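A minimal NumPy sketch of Equation 2.1 with ReLU as ϕ, together with the MSE loss above (all values are toy assumptions):

```python
import numpy as np

def neuron(x, w, a):
    """Equation 2.1 with ReLU as activation: y = max(w.x + a, 0)."""
    return np.maximum(np.dot(w, x) + a, 0.0)

def mse(y, p):
    """Mean square error between targets y and predictions p."""
    return np.mean((y - p) ** 2)

x = np.array([0.5, -1.2, 3.0])   # inputs
w = np.array([0.1, 0.4, -0.2])   # weights
a = 0.05                         # bias
print(neuron(x, w, a))           # activation output of one neuron
print(mse(np.array([1.0]), np.array([neuron(x, w, a)])))
```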

In order to achieve this goal, the choice of loss function plays an important role. The cross entropy

Loss = −(1/n) ∑_{i=1}^{n} p_i ln y_i

is capable of representing the loss properly when the output layer is a softmax layer, as shown in Figure 2.3.

Figure 2.3. Cross entropy (black) and square error (red) of a two-layer network. W_1 and W_2 are the weights of the first and second layer.

Gradient descent and learning

With parameters and a loss function in place, the mechanism that links them together is gradient descent, a method for approximating the optimal values of the parameters. There are plenty of optimization methods, but they are all based on gradient descent. With L as the loss, w as a weight and µ as the learning rate (a speed set manually, normally less than 0.1), gradient descent for a one-layer network consists of the following steps:
1. Compute the derivative of L with respect to w, ∂L/∂w.
2. Update w by subtracting µ ∂L/∂w.
3. Repeat steps 1-2 until ∂L/∂w is approximately 0.

In order to simplify the calculation, back-propagation (BP) is introduced. With h_i as the output of layer i, w_i as the weights of layer i, b_i as the bias of layer i and L as the loss, we have h_i = w_i h_{i−1} + b_i. Thus:

∂L/∂w_i = (∂L/∂h_i)(∂h_i/∂w_i) = (∂L/∂h_i) h_{i−1}   (2.3)

where

∂L/∂h_i = (∂L/∂h_{i+1})(∂h_{i+1}/∂h_i) = (∂L/∂h_{i+1}) w_{i+1} = (∂L/∂h_l) ∏_{t=i+1}^{l} w_t   (2.4)
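A minimal NumPy sketch of these gradient descent steps for a one-weight linear model with the MSE loss; the data here is a toy assumption:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])   # target relation: y = 2x
w, mu = 0.0, 0.05               # initial weight and learning rate

for step in range(100):
    p = w * x                        # forward pass
    grad = np.mean(2 * (p - y) * x)  # step 1: dL/dw for the MSE loss
    w -= mu * grad                   # step 2: move against the gradient
print(w)  # converges towards 2.0
```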

2.2 Convolutional Neural Network

The CNN is a type of feed-forward ANN inspired by the animal visual cortex and is known for outstanding performance in image classification. Compared to regular fully-connected feed-forward ANNs, CNNs are much easier to train due to sparse connectivity and shared weights. Sparse connectivity means that each neuron in a convolutional layer only takes a certain subset of the output values of the previous layer, instead of all of them as in fully-connected ANNs. Meanwhile, CNNs also share weights among hidden layers, which means that inputs at different locations are filtered by the same learned kernels. These two features decrease the number of parameters that need to be trained dramatically. Figure 2.4 shows LeNet-5, a simple convolutional neural network designed for handwritten and machine-printed character recognition.

Figure 2.4. Structure of LeNet-5. Each plane is a feature map.

2.2.1 Convolutional layer and feature map

Feature maps, as shown in Figure 2.4, are the results of applying functions across sub-regions of the entire image. The operations in a convolutional layer are listed below:
1. Convolve the input image f of size m×n with a linear filter g of size p×q. Mathematically, the 2-dimensional convolution o_st for image f at location (s,t) is:

o_st = (f ∗ g)[s,t] = ∑_u ∑_v f[u,v] g[s−u, t−v]   (2.5)

2. Add a bias b to o_st.
3. Apply a non-linear function ϕ (the activation function) to o_st + b.
4. Change location (the values of s,t) according to the stride and repeat steps 1-3 until all required locations are exhausted.
5. Change the filter and repeat steps 1-4 until all filters are exhausted.

The number of output feature maps is equal to the number of filters in the convolutional layer.
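A minimal NumPy sketch of steps 1-4 for a single filter, written (as most CNN libraries do) as cross-correlation, with stride 1 and no padding:

```python
import numpy as np

def conv2d(f, g, b, phi=lambda z: np.maximum(z, 0.0)):
    """One feature map: slide filter g over image f, add bias b, apply phi."""
    m, n = f.shape
    p, q = g.shape
    out = np.zeros((m - p + 1, n - q + 1))
    for s in range(out.shape[0]):                 # step 4: visit every location
        for t in range(out.shape[1]):
            o_st = np.sum(f[s:s+p, t:t+q] * g)    # step 1: filter response
            out[s, t] = phi(o_st + b)             # steps 2-3: bias + activation
    return out

image = np.random.rand(6, 6)
kernel = np.ones((3, 3)) / 9.0   # a simple averaging filter
print(conv2d(image, kernel, b=0.0).shape)   # (4, 4) feature map
```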

2.2.2 Pooling layer

Pooling layers are used for down-sampling in CNNs. Down-sampling, or subsampling, decreases the size of the feature maps. Of the different kinds of pooling methods, average pooling and max pooling are the most commonly used. The benefit of pooling layers lies not only in the reduced dimensionality, which lessens the computational cost, but also in controlling overfitting. The process of pooling is shown in Figure 2.5.

Figure 2.5. Pooling

2.2.3 C3D

Three-dimensional convolutional neural networks are a special kind of convolutional neural network that performs convolution in three dimensions. These networks extract features not only from the spatial dimension (images) but also integrate information from the temporal dimension (video), as shown in Figure 2.6. In a 2D CNN all the filters are two-dimensional, while in C3D all the filters are 3D filters. C3D has shown good performance (82.3% top-1 accuracy) on UCF101, a data set of 101 human action classes from videos in the wild[24].

2.3 Deep Neural Network

Deep neural networks (DNN) have shown state-of-the-art accuracy on many challenging databases such as ImageNet, due to their larger feature space and solution space[9]. However, the training process of a DNN is much trickier compared to "shallow" networks, since deeper networks are more likely to suffer from the vanishing gradient problem and the exploding gradient problem[9]. As shown in Equation 2.3 and Equation 2.4, if the network goes deeper, the product of the weights w becomes so influential that the value of ∂L/∂w_i will be extremely

large when the absolute value of w is larger than 1 (exploding gradient problem) and extremely small when it is smaller than 1 (vanishing gradient problem). In this case, the loss of the model will oscillate instead of decreasing. Batch normalization and residual learning are two methods to solve these problems.

Figure 2.6. 3D convolutional neural network

2.3.1 Batch Normalization

Batch normalization was first introduced in [12]. A successful application is the Inception network by Google. As indicated by its name, batch normalization normalizes data, specifically the layer inputs. Batch normalization makes higher learning rates possible, which accelerates the learning process of a DNN, and it prevents the vanishing and exploding gradient problems[12]. For each mini-batch x_1, x_2, ..., x_m, first calculate the mean µ and variance σ²:

µ = (1/m) ∑_{i=1}^{m} x_i   (2.6)

σ² = (1/m) ∑_{i=1}^{m} (x_i − µ)²   (2.7)

Then normalize x_1, x_2, ..., x_m, using a small number ε in case σ = 0:

x̂_i = (x_i − µ) / √(σ² + ε)   (2.8)
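A minimal NumPy sketch of Equations 2.6-2.8 for one mini-batch (the learnable scale and shift parameters of the full method are omitted):

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    """Normalize a mini-batch x to zero mean and unit variance (Eq. 2.6-2.8)."""
    mu = x.mean(axis=0)                   # Eq. 2.6: per-feature mean
    var = ((x - mu) ** 2).mean(axis=0)    # Eq. 2.7: per-feature variance
    return (x - mu) / np.sqrt(var + eps)  # Eq. 2.8: normalize

batch = np.random.rand(32, 8)   # 32 samples, 8 features
x_hat = batch_norm(batch)
print(x_hat.mean(axis=0).round(6), x_hat.std(axis=0).round(2))
```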

2.3.2 Residual Learning

Residual learning is a deep neural network framework first introduced in [11]. It enables DNNs to be trained with high accuracy and to converge in less time[11]. The idea of residual learning is quite simple: instead of mapping inputs to outputs directly with stacked layers, a residual network uses the layers to fit the residual fluctuation F(x) and builds up the output gradually as H(x) = F(x) + x. The comparison is shown in Figure 2.7.

Figure 2.7. H(x) is any desired mapping; a plain net hopes the 2 weight layers fit H(x), while a residual net hopes the 2 weight layers fit F(x) and lets H(x) = F(x) + x.

2.4 RNN

As shown in Figure 2.1, recurrent neural networks are networks that have connections between neurons within the hidden layers. This makes it possible for an RNN to handle sequential information, where the order of the inputs matters and the meaning of the data depends on its "context". While a CNN shares weights by applying the same filters across the spatial dimensions, an RNN shares weights by using the same function to handle information at different times in the time domain.

Figure 2.8. A standard RNN neuron. RNNs share weights across sequential data.

Figure 2.8 shows a simple example of a neuron in a standard RNN. With sequential data x_1, x_2, ..., x_m, the output of the node with input x_t is h_t:

h_t = F_θ(h_{t−1}, x_t)   (2.9)

For all x_t and h_t, the parameters of F_θ are the same.
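A minimal NumPy sketch of Equation 2.9, with F_θ chosen here (as an assumption) to be a simple tanh cell; note that the same parameters are reused at every step:

```python
import numpy as np

rng = np.random.default_rng(0)
W_x = rng.normal(size=(4, 3))   # input-to-hidden weights (shared over time)
W_h = rng.normal(size=(4, 4))   # hidden-to-hidden weights (shared over time)
b = np.zeros(4)

def F_theta(h_prev, x_t):
    """One recurrent step: h_t = tanh(W_x x_t + W_h h_{t-1} + b)."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

h = np.zeros(4)
for x_t in rng.normal(size=(10, 3)):   # a sequence of 10 inputs
    h = F_theta(h, x_t)                # same F_theta at every step
print(h)
```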

2.4.1 LSTM

Long Short-Term Memory networks are a special kind of RNN with a different and more complex structure for the neural cells. An LSTM neuron has three gates: an input gate, a forget gate and an output gate.

Figure 2.9. An LSTM neuron.

Inside an LSTM neuron, three things need to be decided. First, how "clearly" the new information should be remembered. Second, how much of the previous memory should be forgotten. Third, what signal should be passed on to influence other neurons. With sequential data x_1, x_2, ..., x_m, the influence of x_t on x_{t+1} is h_t, and c_t is the cell state after processing x_t. The first step is:

i_t = σ(W_i[x_t, h_{t−1}] + b_i)   (2.10)

where σ is a sigmoid function producing a number in [0,1]; 0 means the information conveyed by x_t and h_{t−1} will not be used at all, and 1 means it will be kept entirely. Then the new information is preprocessed:

c̃_t = tanh(W_c[x_t, h_{t−1}] + b_c)   (2.11)

i_t c̃_t is the new part to be added to the existing memory. While adding this new part, the old information in the memory is also affected by the new signals. The forget gate works as:

f_t = σ(W_f[x_t, h_{t−1}] + b_f)   (2.12)

The range of f_t is [0,1]; 0 means all the old memory will be removed and 1 means it will be kept entirely. Thus, the memory at step t can be inferred:

c_t = f_t c_{t−1} + i_t c̃_t   (2.13)

Finally, the output o_t for the next layer and the message h_t passed from x_t to x_{t+1} are calculated:

o_t = σ(W_o[x_t, h_{t−1}] + b_o)   (2.14)

h_t = o_t tanh(c_t)   (2.15)

2.5 Transfer Learning

Transfer learning describes the fact that humans can apply the knowledge they learn in one field to another field to generate better results. In machine learning, predictions on new data are based on statistical models trained with previously collected data. Once the problem domain, task or data distribution changes, models need to be retrained from scratch with the relevant data. In real-world applications, data collection is extremely resource-consuming, and so is the training process. In order to solve this problem, data scientists started to apply the transfer learning observed in humans to machine learning. Compared to the traditional learning approach, transfer learning allows knowledge learned from previous tasks to be used in a new target task[18]. The comparison between learning from scratch and transfer learning is shown in Figure 2.10.

Figure 2.10. Different learning approaches: traditional learning (left) and transfer learning (right).
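As a minimal sketch of this idea with TensorFlow/Keras (the library choice is an assumption), an ImageNet-pretrained Inception-ResNet-v2 base can be reused while only a fresh classification head is trained for a new 7-class task:

```python
import tensorflow as tf

# Convolutional base initialized from ImageNet-pretrained weights.
base = tf.keras.applications.InceptionResNetV2(
    weights="imagenet", include_top=False, pooling="avg",
    input_shape=(299, 299, 3))

# New classification head for the 7 emotion classes, trained from scratch.
head = tf.keras.layers.Dense(7, activation="softmax")
model = tf.keras.Sequential([base, head])
model.compile(optimizer="sgd", loss="categorical_crossentropy",
              metrics=["accuracy"])
```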

3. Methodology

In this section, the specific approach is illustrated. It can be divided into three parts: data collection and pre-processing; feature extraction; and classification and evaluation. This research involves three models. The first is the audio-SVM model: audio features are extracted with OpenSMILE and classified with an SVM. The second is the CNN-LSTM model, which involves three steps: train the feature extractor, a CNN model (Inception-ResNet-v2); use the CNN model to extract deep features from face images cropped from video frames; and use an LSTM as classifier to integrate the deep features and classify the emotion. The third is the video-C3D model, which uses face frames from videos as input and the C3D model as both feature extractor and classifier.

3.1 Data

Three data sets are involved in the training and evaluation of the models: AFEW 6.0, Static Facial Expression Recognition in the Wild (SFEW), and the Facial Expression Recognition 2013 data set (FER2013). FER2013 is used to train the CNN model (Inception-ResNet-v2) and SFEW is used to evaluate it. AFEW 6.0 is used for both training (60%) and evaluating (40%) the SVM, LSTM and C3D models.

AFEW 6.0

AFEW 6.0 is a data set consisting of video clips collected from movies and reality TV shows. It is the newest version of the AFEW data set; compared to the previous version, reality TV show clips are newly added. There are 1750 video clips in the data set, originally divided into 774 training videos, 383 validation videos and 593 test videos. Each is labelled with exactly one of the seven universally recognized emotions, as shown in Table 3.1. Due to the EmotiW contest, the labels of the test videos were not available when this project was conducted. All the videos are 25 fps (25 frames per second) with a resolution of 720*576.

Table 3.1. Emotion distribution of the AFEW 6.0 data set for training and validation (number of videos per emotion: Angry, Disgusted, Fear, Happy, Sad, Surprise, Neutral)

SFEW

SFEW is a data set consisting of images collected from movie frames, each with a label from the seven emotions. There are 861 labelled images in total in the train and validation sets. The distribution of the SFEW data set is shown in Table 3.2. All the images are movie frames of 720*576 resolution.

Table 3.2. Emotion distribution of the SFEW data set for training and validation (number of images per emotion: Angry, Disgusted, Fear, Happy, Sad, Surprise, Neutral)

FER2013

The FER2013 database is an image data set containing 48*48 pixel gray-scale facial expression images labelled with the seven universal emotions above. The facial expression images in the FER2013 data set are also gathered from a wild environment (movies), and thus features learned from it can be applied to the AFEW 6.0 data set. The distribution of images over the different categories is shown in Table 3.3.

Table 3.3. Emotion distribution of the FER2013 data set (number of images per emotion: Angry, Disgusted, Fear, Happy, Sad, Surprise, Neutral)

3.2 Pre-processing

Videos contain rich information. However, in order to train the models more efficiently, only the audio and facial crops from video frames are used in training.

AFEW 6.0

For the AFEW 6.0 data set, ffmpeg, a cross-platform open source audio-video processing framework, is used to extract audio files and video frames. All the video clips are around 2 seconds long with a frame rate of 25 frames per second (25 fps). Since not every frame in a video contains at least one human face, in order to get enough cropped faces to analyze temporal information, all frames were extracted from the videos and the Dlib frontal face detector was used to crop the largest face in each frame. In the end, all faces are resized to two standard sizes, 48*48 and 299*299, and converted to gray-scale images.
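A minimal sketch of this per-frame cropping step, assuming Dlib and OpenCV are available (the file names here are hypothetical):

```python
import cv2
import dlib

detector = dlib.get_frontal_face_detector()
frame = cv2.imread("frame_0001.png")             # hypothetical extracted frame
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

faces = detector(gray, 1)                        # upsample once for small faces
if faces:
    # Keep the largest detected face, as done in the pre-processing step.
    r = max(faces, key=lambda d: d.width() * d.height())
    crop = gray[max(r.top(), 0):r.bottom(), max(r.left(), 0):r.right()]
    cv2.imwrite("face_48.png", cv2.resize(crop, (48, 48)))
    cv2.imwrite("face_299.png", cv2.resize(crop, (299, 299)))
```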

Gray-scale facial images and audio from the videos are used as input for further feature extraction, as shown in Figure 3.1.

Figure 3.1. Pre-processing of AFEW 6.0 data

SFEW

For the SFEW data set, the Dlib frontal face detector is also used to crop the largest face in each frame. Cropped faces are then resized to 299*299 and converted to gray-scale images for future use.

FER2013

For the FER2013 data set, all the faces are resized to 299*299.

Normalization

In addition, all the faces are linearly normalized to decrease the influence of illumination. For a pixel at location (x,y) with intensity value I(x,y), where I_max and I_min are the largest and smallest intensity values of the original image, the normalized intensity value I′(x,y) is calculated as:

I′(x,y) = (I(x,y) − I_min) / (I_max − I_min) × 255   (3.1)
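A minimal NumPy sketch of Equation 3.1 (it assumes I_max > I_min, i.e. a non-constant image):

```python
import numpy as np

def linear_normalize(img):
    """Stretch pixel intensities to [0, 255] (Equation 3.1)."""
    i_min, i_max = img.min(), img.max()
    return (img - i_min) / (i_max - i_min) * 255.0

face = np.random.randint(40, 200, size=(48, 48)).astype(np.float64)
out = linear_normalize(face)
print(out.min(), out.max())   # 0.0 255.0
```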

3.3 Feature Extraction

Before the classifiers are trained, features must be extracted. In this research, audio features are extracted using OpenSMILE, while facial image features are extracted from a pretrained and fine-tuned deep CNN model.

3.3.1 Audio Features

The OpenSMILE (Speech and Music Interpretation by Large-space Extraction) feature extraction toolkit is used to extract audio features. OpenSMILE can extract audio low-level descriptors such as Mel-frequency cepstral coefficients, loudness, perceptual linear predictive cepstral coefficients, line spectral frequencies and formant frequencies. The extracted feature vector of a 2-second audio clip has 1582 dimensions.

3.3.2 CNN Deep Features

Inception ResNet V2 network

An Inception-ResNet-v2 network pretrained on the ImageNet database is used for deep feature extraction. Inception-ResNet-v2 is a deep convolutional neural network utilizing residual connections and the Inception deep convolutional architecture[22], as shown in Figure 3.2 and Figure 3.4.

Figure 3.2. The structure of the Inception-ResNet-v2 network. Detailed structures of the repeating blocks are shown in Figure 3.4 and Figure 3.3.

Inception-ResNet-v2 has the highest classification accuracy (top-1 accuracy: 80.4%, top-5 accuracy: 95.3%)[22] on the ILSVRC image classification benchmark so far. Consequently, it can be considered sufficiently able to extract features from images of varied content. Thus, the parameter values of an Inception-ResNet-v2 pretrained on ImageNet are used as the initial values of the Inception-ResNet-v2 model in this project. However, since ImageNet has 1000 classes while the 7 universal emotions are used as classes in this research, the parameters of the fully connected layers of the ImageNet-pretrained model are not suitable for this model.

Figure 3.3. On the left is the Stem block of Figure 3.2. On the right are block A (below) and block B (above) of Figure 3.2.

Figure 3.4. The structure of the Inception-ResNet-v2 network blocks. The block on the left is block A in Figure 3.2, the block in the middle is block B in Figure 3.2, and the block on the right is block C in Figure 3.2.

All the parameters in the AuxLogits block and the Logits block are therefore generated randomly instead of being restored from the pretrained model.

Fine-tune

In order to enhance the ability of the Inception-ResNet-v2 model to extract facial expression features, the Facial Expression Recognition 2013 data set (FER2013) is used to fine-tune the ImageNet-pretrained Inception-ResNet-v2. All layers of the pretrained model are tuned with the FER2013 database, which is divided into a train set and a validation set of 7188 images. The learning rate is first set to 0.01; then, with a lower learning rate, the model is fine-tuned for a further 2000 steps. After fine-tuning, this model is able to classify facial emotions in static images. The output layer produces the classification result, while the output of the convolutional layers serves as the deep emotional feature of a facial image.

3.4 Model Training

3.4.1 SVM for Audios

Input

The audio features explained in Section 3.3.1 are used as model input.

Model details

After the 1582-dimensional audio features are extracted, an SVM is trained as classifier. Scikit-learn, the open source machine learning library, is used to train

this SVM model. To be specific, Classification SVM type 1 (C-SVM) is used. Training a C-SVM is the process of minimizing the error function

(1/2) w^T w + C ∑_{i=1}^{N} ε_i   (3.2)

subject to the constraints

y_i (w^T Φ(x_i) + b) ≥ 1 − ε_i   (3.3)

ε_i ≥ 0,  i = 1, ..., N   (3.4)

where y_i is the class label, x_i is the input data, w is the coefficient vector, b is a bias, ε_i is the slack variable of a single input, and C is the capacity constant. The kernel Φ is used to transform the input data into the feature space.

Training Method

To find the optimal parameter set, 10-fold cross validation is used. For the linear kernel, the candidate values of C are 1, 10 and 100. For the radial basis function (rbf) kernel, four values of C (1, 10, 100 and one larger value) and three values of gamma (including 0.01) are tried. The training set of AFEW is used for training, and the validation set of AFEW is used to evaluate the performance of all the SVM models. The best-performing combination of parameters is chosen as the model to test.

Output

The output of the SVM model is a single number indicating the predicted class.
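A minimal sketch of this training setup with scikit-learn; the feature matrix here is a random placeholder for the OpenSMILE features, and the rbf grid values beyond those named above are assumptions:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Placeholder data standing in for the 1582-dim OpenSMILE features.
X = np.random.rand(350, 1582)
y = np.repeat(np.arange(7), 50)   # balanced placeholder labels, 7 emotions

param_grid = [
    {"kernel": ["linear"], "C": [1, 10, 100]},
    {"kernel": ["rbf"], "C": [1, 10, 100], "gamma": [0.01]},  # grid partly assumed
]
search = GridSearchCV(SVC(), param_grid, cv=10)   # 10-fold cross validation
search.fit(X, y)
print(search.best_params_)
```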

3.4.2 LSTM Model

Deep features of facial images (single frames) are extracted from the output of the last flatten layer, which is the input of the dropout layer (the purple layer in Figure 3.2). The deep feature of a single facial image is a 1536-dimensional vector. For each video, 16 facial images are used to form a deep feature sequence. Each video in the AFEW database contains around 50 frames; however, not every frame contains a face, and in some cases the Dlib frontal face detector is not able to locate a face due to head posture, illuminance, etc. Thus, it is common that some videos have more than 16 faces while others have fewer. For a video with x faces and x ≥ 16, a random number s between 0 and x − 16 is generated, and the deep features from face s to face s + 15 are used as input to the LSTM model. For a video with x faces and x < 16, padding is needed: a random face selected from the existing faces is duplicated and added to the face collection until the number of faces reaches 16. Thus, the input of the LSTM is a 16*1536-dimensional feature.

Model Details

A one-layer LSTM model is used.

Training Method

The learning rate of the LSTM is set to a fixed value for all training iterations. The batch size is 32.

Output

The output of the LSTM model is seven logits indicating the likelihood of each emotion.

3.4.3 C3D

The input of the C3D model is the cropped faces of size 48*48 described in Section 3.2. As explained in Section 3.4.2, the number of faces found per video varies, so a similar method is used to get input of the same size for the C3D model. For videos with more than 16 faces, 16 sequential faces are chosen randomly as input. For videos with at least 1 but fewer than 16 faces, the padding technique from Section 3.4.2 is used. Videos in which no face is found are removed from the training set.

Model Details

There are 7 hidden layers in the C3D model, as shown in Figure 3.5. The first five layers are convolutional layers that extract video features; their kernels are all 3*3*3 with stride 1 in every dimension. The last 2 layers are fully connected layers for classification. This model structure has been proven efficient in classifying videos of the UCF101 database, a database of 101 different kinds of human actions (101 classes), with an accuracy of 72.6%[25].

Output

The output is seven logits indicating the likelihood of each emotion.

Figure 3.5. The structure of the C3D model
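A minimal sketch of a C3D-style network of this shape, assuming TensorFlow/Keras; the filter counts, pooling sizes and hidden width are assumptions, since the text only fixes the 3*3*3 kernels, stride 1, five convolutional layers and two fully connected layers:

```python
import tensorflow as tf
from tensorflow.keras import layers

# Input: sequences of 16 gray-scale 48*48 face crops per video.
model = tf.keras.Sequential([
    # Five convolutional layers, all with 3*3*3 kernels and stride 1.
    layers.Conv3D(64, 3, padding="same", activation="relu",
                  input_shape=(16, 48, 48, 1)),
    layers.MaxPooling3D(pool_size=(1, 2, 2)),
    layers.Conv3D(128, 3, padding="same", activation="relu"),
    layers.MaxPooling3D(pool_size=(2, 2, 2)),
    layers.Conv3D(256, 3, padding="same", activation="relu"),
    layers.Conv3D(256, 3, padding="same", activation="relu"),
    layers.MaxPooling3D(pool_size=(2, 2, 2)),
    layers.Conv3D(256, 3, padding="same", activation="relu"),
    layers.GlobalAveragePooling3D(),
    # Two fully connected layers for classification.
    layers.Dense(512, activation="relu"),
    layers.Dense(7),  # seven logits, one per emotion
])
model.summary()
```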

4. Results

In this chapter, the performance of each model is evaluated independently. The accuracy of all models is reported, and the confusion matrices of the Inception-ResNet-v2 model, the LSTM model and the C3D model are displayed as well. The learning processes of the Inception-ResNet-v2, LSTM and C3D models are also shown to provide more detail.

4.1 On Audios

Of the 15 SVM models trained, the linear ones have better accuracy, as shown in Table 4.1. Moreover, the accuracy does not vary much with the parameters C and gamma. After the grid search, the linear SVM model with C = 1 is chosen as the classifier due to its accuracy and efficiency.

Table 4.1. The accuracy of SVM models with different parameters (linear and rbf kernels over the C and gamma grids)

With the model trained on the AFEW 6.0 training set, the AFEW 6.0 validation set is used to evaluate the model, as shown in Table 4.2. The accuracy of the SVM model on the validation set is 25%. The classifier is better at angry videos, with an f1-score of 0.37, while the performance on disgust and surprise videos is much worse than average.

Table 4.2. Per-class precision, recall, f1-score and support of the audio SVM model on the AFEW validation set (rows: angry, disgust, fear, happy, sad, surprise, neutral, total)

4.2 Inception ResNet V2 On Static Images

The Inception-ResNet-v2 model is first fine-tuned with FER2013. The training process is shown in Figure 4.1. After the initial training steps, the learning rate is lowered to fine-tune the model; after further steps of training, the loss tends to become stable and the accuracy remains the same.

Figure 4.1. Details of fine-tuning: (a) training loss, (b) learning rate.

After fine-tuning, the accuracy of the model on the validation set of FER2013 is shown in Table 4.3. The model performs better on the disgusted, happy and surprise emotions, with the accuracy on disgusted the highest of all. Overall, the accuracy is 60% on 5740 images in FER2013.

Emotion    Angry  Disgusted  Fear  Happy  Sad  Surprise  Neutral  Overall
Accuracy   54%    80%        48%   78%    49%  67%       48%      60%

Table 4.3. Accuracy on the FER2013 test set

4.2.1 Testing on SFEW

The accuracy of the Inception-ResNet-v2 model shown in Table 4.3 is based on training and testing data that both come from the FER2013 database and were thus cropped with the same method. However, this model is meant to recognize all facial images from movies cropped by the Dlib frontal face detector. Different cropping methods may produce different facial images and may have an impact on the prediction result. Thus, testing the model on SFEW is necessary. All the pre-processed labelled images from SFEW described in Section 3.2 are used for testing the fine-tuned model. The confusion matrix of the fine-tuned Inception-ResNet-v2 model on SFEW is shown in Table 4.4.

Table 4.4. Confusion matrix (label vs. prediction over the seven emotions) on static facial expressions

861 images were tested, of which 413 were predicted correctly. The overall accuracy across all emotions is 47.97%. As shown in Table 4.4, the Inception-ResNet-v2 model retains satisfying performance on happy images. However, it fails completely on disgusted images, with an accuracy of 0.

4.2.2 Failed Images

In Figure 4.2, some of the failed images from the SFEW database are listed. They are chosen from the movie The Hangover. The prediction of the model and the ground truth label of each image are listed below.

4.3 On Videos

4.3.1 LSTM

The validation set of AFEW 6.0 is used to test the accuracy of the LSTM. The loss decreases dramatically during the first 5000 training steps, as shown in Figure 4.3, and declines slowly during the rest of the training process; in the end, it fluctuates around 1.3.

Figure 4.2. Failed images from the SFEW data set. (a) Prediction: Sad. Ground truth: Neutral. (b) Prediction: Sad. Ground truth: Angry. (c) Prediction: Neutral. Ground truth: Disgust. (d) Prediction: Surprise. Ground truth: Fear. (e) Prediction: Angry. Ground truth: Surprise. (f) Prediction: Happy. Ground truth: Fear.

While the loss decreases, the accuracy increases during training, as shown in Figure 4.4. The training accuracy and testing accuracy both increase during the first 5000 steps of training; however, further training fails to increase the accuracy on the testing set. The confusion matrix of the LSTM model is shown in Table 4.5. The LSTM model is more capable of classifying the angry and happy emotions, while it fails almost completely on the rest. A large share of the videos are classified as the neutral emotion, which indicates that overfitting might exist.

Table 4.5. The confusion matrix (label vs. prediction over the seven emotions) of the LSTM model

Figure 4.3. The training loss of the LSTM model.

Figure 4.4. The training accuracy and testing accuracy of the LSTM model.

4.3.2 C3D

Two C3D models are trained with different learning rates.

C3D-1

C3D-1 is the first C3D model trained. It is trained with a learning rate of 0.01, which decays every 2700 steps with a decay rate of 0.1, as shown in Figure 4.5. The loss decreases dramatically during the first 1000 training steps and decreases slowly during the following 9000 steps; finally, the training loss fluctuates slightly around 1.9. The accuracy of the model increases quickly during the first 3000 training steps, but the training accuracy is not stable at all. After smoothing the curve, we can see the accuracy stays around 22%. After training, the validation set of AFEW 6.0 is used to evaluate the accuracy of C3D-1. The confusion matrix is shown in Table 4.6. The fact that all the videos are labelled as happy shows that the model overfits the training data completely.

Table 4.6. C3D-1 confusion matrix (label vs. prediction over the seven emotions) on the AFEW 6.0 validation set

The overall accuracy of C3D-1 on the AFEW 6.0 validation set is 18.86%; 53 videos are correctly labelled out of 281 videos in total.

C3D-2

In order to further decrease the possibility of overfitting, another C3D model is trained. The training process is shown in Figure 4.6. C3D-2 is trained with a smaller learning rate and more training steps; ideally, it should avoid overfitting and converge more slowly. As shown in Figure 4.6, the learning rate of C3D-2 is set to a smaller initial value and decays every 1600 steps with a decay rate of 0.1. The average loss drops quickly to around 3.3 during the beginning phase of training but remains the same in the following training steps. After the learning rate drops further, the learning of the model can be considered stopped. Both the loss and the training accuracy remain very unstable. By the end of the training process, the training accuracy is around 19%. After training, the validation set of AFEW 6.0 is used to evaluate the accuracy of C3D-2. The confusion matrix is shown in Table 4.7. No videos are labelled as disgust or fear, and some videos are correctly labelled as surprise.

Figure 4.5. Details of the C3D-1 training process: (a) learning rate, (b) training loss, (c) training accuracy.

Figure 4.6. Details of the C3D-2 training process: (a) learning rate, (b) training loss, (c) training accuracy.

Most of the videos are classified as happy, which clearly still indicates overfitting.

Table 4.7. C3D-2 confusion matrix (label vs. prediction over the seven emotions) on the AFEW 6.0 validation set

The overall accuracy of C3D-2 on the AFEW 6.0 validation set is 21.7%; 61 videos are correctly labelled out of 281 videos in total.

5. Discussion

In this chapter, conclusions are drawn based on comparison with the work of EmotiW competitors, and future work is proposed based on these conclusions.

5.1 Conclusion

5.1.1 Audios

The SVM model shows a weak ability to distinguish certain emotions. Compared to the accuracy of 14.3% obtained by randomly assigning an emotion to a given video, the model has a better accuracy of 25%. As shown in Table 4.2, the SVM model is more capable of distinguishing the angry emotion than any other, with an f1-score of 0.37, much higher than the average. This might be due to the fact that most angry videos involve someone shouting loudly, which makes it easy for the classifier to find the key feature. On the other hand, the SVM model is not capable of classifying disgust and surprise: hardly any sounds are distinctly "surprised" even to human ears, and there are not enough audio samples for the model to find the key feature of disgust. Besides, the audio within a certain category can vary even more than audio across categories. For instance, some surprise audio clips are so quiet that they could be considered neutral, while others are so noisy that they could be classified as angry or fear. It is difficult to compare this audio-SVM model with state-of-the-art work, since EmotiW competitors did not report the precise accuracy of their audio models. Overall, the audio model has limited ability to classify emotions, and with a 1582-dimensional feature for a 2-second audio clip, it is neither efficient nor precise.

5.1.2 Image model

The Inception-ResNet-v2 model shows a good ability to classify static facial emotions under challenging conditions. On the FER2013 database the accuracy is 60%, 6% lower than the state-of-the-art performance[14]. On the SFEW database the accuracy is 47.97%, a 9% increase over the baseline on images, while the state-of-the-art result on SFEW, from the winning team of EmotiW 2015, is 61.6%[6]. As shown in Table 4.3 and Table 4.4, the deeper model did not improve performance significantly, but it generalizes well across databases. The reasons why the

Inception-ResNet-v2 model could not outperform previous models might be the different cropping method, the varied input sizes and the different training method. The state-of-the-art result on SFEW uses faces cropped with the Viola-Jones face detector, while in this research the Dlib frontal face detector is used, which means faces at certain angles might not be detected. FER2013 images are 48*48 gray-scale images, while the input size of the pretrained Inception-ResNet-v2 model is 299*299*3. After all images are resized to 299*299, every pixel of a FER2013 image is stretched over an area of at least 6*6 pixels. This prevents the first several layers of the Inception-ResNet-v2 model from extracting much useful information, since the stride of those layers is 1 or 2 and the filters are 3*3 or 1*1 pixels; unless a filter moves across the border between two 6*6 pixel areas, it can hardly extract any useful information. On the other hand, the cropped faces from the SFEW database are of different sizes; many of them are large and full of details, and faces from SFEW are RGB-color images. The Inception-ResNet-v2 model trained on FER2013 is not able to exploit those image details. Besides, ImageNet is a database of both colored and gray-scale images; without color information in the FER2013 database, the full power of the Inception-ResNet-v2 model is not utilized, and neither are its pretrained parameters.

5.1.3 Video models

LSTM

The LSTM model shows a good ability to integrate temporal information. As shown in Section 4.3.1, the accuracy of the LSTM model on AFEW 6.0 is 41.67%, a 4% increase over the baseline[5], but still 3.67% lower than the state of the art[2]. Compared to the state-of-the-art research, this LSTM model uses less training data, a different cropping method and a different input feature. The state-of-the-art result using the same method had more training data[2]: they used both the training set and the validation set of AFEW 6.0 for training, and the test set of AFEW 6.0 for testing. In this research, only the training set is used for training, since the labels of the test set were not available at the time. Also, the state-of-the-art research uses the faces provided with AFEW 6.0, which are cropped with the Viola-Jones face detector, together with a face classifier they developed to remove non-faces; in this research, faces are cropped with the Dlib frontal face detector and no face classifier was developed due to the limit of time. Besides, the input features of the models are different. The state-of-the-art model uses deep features extracted by VGG16-FACE, a CNN model pretrained on a face database and FER2013. Even though VGG16-FACE has a much higher accuracy (70.74%) on the FER2013 database, the outcome of the LSTM model is not


More information

CENG 783. Special topics in. Deep Learning. AlchemyAPI. Week 11. Sinan Kalkan

CENG 783. Special topics in. Deep Learning. AlchemyAPI. Week 11. Sinan Kalkan CENG 783 Special topics in Deep Learning AlchemyAPI Week 11 Sinan Kalkan TRAINING A CNN Fig: http://www.robots.ox.ac.uk/~vgg/practicals/cnn/ Feed-forward pass Note that this is written in terms of the

More information

Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. By Joa õ Carreira and Andrew Zisserman Presenter: Zhisheng Huang 03/02/2018

Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. By Joa õ Carreira and Andrew Zisserman Presenter: Zhisheng Huang 03/02/2018 Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset By Joa õ Carreira and Andrew Zisserman Presenter: Zhisheng Huang 03/02/2018 Outline: Introduction Action classification architectures

More information

Deep Learning. Deep Learning provided breakthrough results in speech recognition and image classification. Why?

Deep Learning. Deep Learning provided breakthrough results in speech recognition and image classification. Why? Data Mining Deep Learning Deep Learning provided breakthrough results in speech recognition and image classification. Why? Because Speech recognition and image classification are two basic examples of

More information

Hello Edge: Keyword Spotting on Microcontrollers

Hello Edge: Keyword Spotting on Microcontrollers Hello Edge: Keyword Spotting on Microcontrollers Yundong Zhang, Naveen Suda, Liangzhen Lai and Vikas Chandra ARM Research, Stanford University arxiv.org, 2017 Presented by Mohammad Mofrad University of

More information

Deep Learning With Noise

Deep Learning With Noise Deep Learning With Noise Yixin Luo Computer Science Department Carnegie Mellon University yixinluo@cs.cmu.edu Fan Yang Department of Mathematical Sciences Carnegie Mellon University fanyang1@andrew.cmu.edu

More information

Based on improved STN-CNN facial expression recognition

Based on improved STN-CNN facial expression recognition Journal of Computing and Electronic Information Management ISSN: 2413-1660 Based on improved STN-CNN facial expression recognition Jianfei Ding Automated institute, Chongqing University of Posts and Telecommunications,

More information

Action Unit Based Facial Expression Recognition Using Deep Learning

Action Unit Based Facial Expression Recognition Using Deep Learning Action Unit Based Facial Expression Recognition Using Deep Learning Salah Al-Darraji 1, Karsten Berns 1, and Aleksandar Rodić 2 1 Robotics Research Lab, Department of Computer Science, University of Kaiserslautern,

More information

Predict the Likelihood of Responding to Direct Mail Campaign in Consumer Lending Industry

Predict the Likelihood of Responding to Direct Mail Campaign in Consumer Lending Industry Predict the Likelihood of Responding to Direct Mail Campaign in Consumer Lending Industry Jincheng Cao, SCPD Jincheng@stanford.edu 1. INTRODUCTION When running a direct mail campaign, it s common practice

More information

Lecture 20: Neural Networks for NLP. Zubin Pahuja

Lecture 20: Neural Networks for NLP. Zubin Pahuja Lecture 20: Neural Networks for NLP Zubin Pahuja zpahuja2@illinois.edu courses.engr.illinois.edu/cs447 CS447: Natural Language Processing 1 Today s Lecture Feed-forward neural networks as classifiers simple

More information

Deep Learning for Computer Vision II

Deep Learning for Computer Vision II IIIT Hyderabad Deep Learning for Computer Vision II C. V. Jawahar Paradigm Shift Feature Extraction (SIFT, HoG, ) Part Models / Encoding Classifier Sparrow Feature Learning Classifier Sparrow L 1 L 2 L

More information

Facial expression recognition is a key element in human communication.

Facial expression recognition is a key element in human communication. Facial Expression Recognition using Artificial Neural Network Rashi Goyal and Tanushri Mittal rashigoyal03@yahoo.in Abstract Facial expression recognition is a key element in human communication. In order

More information

11. Neural Network Regularization

11. Neural Network Regularization 11. Neural Network Regularization CS 519 Deep Learning, Winter 2016 Fuxin Li With materials from Andrej Karpathy, Zsolt Kira Preventing overfitting Approach 1: Get more data! Always best if possible! If

More information

Facial Expression Recognition Using Non-negative Matrix Factorization

Facial Expression Recognition Using Non-negative Matrix Factorization Facial Expression Recognition Using Non-negative Matrix Factorization Symeon Nikitidis, Anastasios Tefas and Ioannis Pitas Artificial Intelligence & Information Analysis Lab Department of Informatics Aristotle,

More information

Deep Learning. Visualizing and Understanding Convolutional Networks. Christopher Funk. Pennsylvania State University.

Deep Learning. Visualizing and Understanding Convolutional Networks. Christopher Funk. Pennsylvania State University. Visualizing and Understanding Convolutional Networks Christopher Pennsylvania State University February 23, 2015 Some Slide Information taken from Pierre Sermanet (Google) presentation on and Computer

More information

An Exploration of Computer Vision Techniques for Bird Species Classification

An Exploration of Computer Vision Techniques for Bird Species Classification An Exploration of Computer Vision Techniques for Bird Species Classification Anne L. Alter, Karen M. Wang December 15, 2017 Abstract Bird classification, a fine-grained categorization task, is a complex

More information

House Price Prediction Using LSTM

House Price Prediction Using LSTM House Price Prediction Using LSTM Xiaochen Chen Lai Wei The Hong Kong University of Science and Technology Jiaxin Xu ABSTRACT In this paper, we use the house price data ranging from January 2004 to October

More information

Two-Stream Convolutional Networks for Action Recognition in Videos

Two-Stream Convolutional Networks for Action Recognition in Videos Two-Stream Convolutional Networks for Action Recognition in Videos Karen Simonyan Andrew Zisserman Cemil Zalluhoğlu Introduction Aim Extend deep Convolution Networks to action recognition in video. Motivation

More information

Inception and Residual Networks. Hantao Zhang. Deep Learning with Python.

Inception and Residual Networks. Hantao Zhang. Deep Learning with Python. Inception and Residual Networks Hantao Zhang Deep Learning with Python https://en.wikipedia.org/wiki/residual_neural_network Deep Neural Network Progress from Large Scale Visual Recognition Challenge (ILSVRC)

More information

Deep Learning Explained Module 4: Convolution Neural Networks (CNN or Conv Nets)

Deep Learning Explained Module 4: Convolution Neural Networks (CNN or Conv Nets) Deep Learning Explained Module 4: Convolution Neural Networks (CNN or Conv Nets) Sayan D. Pathak, Ph.D., Principal ML Scientist, Microsoft Roland Fernandez, Senior Researcher, Microsoft Module Outline

More information

Lecture 2 Notes. Outline. Neural Networks. The Big Idea. Architecture. Instructors: Parth Shah, Riju Pahwa

Lecture 2 Notes. Outline. Neural Networks. The Big Idea. Architecture. Instructors: Parth Shah, Riju Pahwa Instructors: Parth Shah, Riju Pahwa Lecture 2 Notes Outline 1. Neural Networks The Big Idea Architecture SGD and Backpropagation 2. Convolutional Neural Networks Intuition Architecture 3. Recurrent Neural

More information

Stacked Denoising Autoencoders for Face Pose Normalization

Stacked Denoising Autoencoders for Face Pose Normalization Stacked Denoising Autoencoders for Face Pose Normalization Yoonseop Kang 1, Kang-Tae Lee 2,JihyunEun 2, Sung Eun Park 2 and Seungjin Choi 1 1 Department of Computer Science and Engineering Pohang University

More information

Deep Learning. Volker Tresp Summer 2014

Deep Learning. Volker Tresp Summer 2014 Deep Learning Volker Tresp Summer 2014 1 Neural Network Winter and Revival While Machine Learning was flourishing, there was a Neural Network winter (late 1990 s until late 2000 s) Around 2010 there

More information

Improving the way neural networks learn Srikumar Ramalingam School of Computing University of Utah

Improving the way neural networks learn Srikumar Ramalingam School of Computing University of Utah Improving the way neural networks learn Srikumar Ramalingam School of Computing University of Utah Reference Most of the slides are taken from the third chapter of the online book by Michael Nielson: neuralnetworksanddeeplearning.com

More information

SEMANTIC COMPUTING. Lecture 8: Introduction to Deep Learning. TU Dresden, 7 December Dagmar Gromann International Center For Computational Logic

SEMANTIC COMPUTING. Lecture 8: Introduction to Deep Learning. TU Dresden, 7 December Dagmar Gromann International Center For Computational Logic SEMANTIC COMPUTING Lecture 8: Introduction to Deep Learning Dagmar Gromann International Center For Computational Logic TU Dresden, 7 December 2018 Overview Introduction Deep Learning General Neural Networks

More information

CS6220: DATA MINING TECHNIQUES

CS6220: DATA MINING TECHNIQUES CS6220: DATA MINING TECHNIQUES Image Data: Classification via Neural Networks Instructor: Yizhou Sun yzsun@ccs.neu.edu November 19, 2015 Methods to Learn Classification Clustering Frequent Pattern Mining

More information

Facial Expression Detection Using Implemented (PCA) Algorithm

Facial Expression Detection Using Implemented (PCA) Algorithm Facial Expression Detection Using Implemented (PCA) Algorithm Dileep Gautam (M.Tech Cse) Iftm University Moradabad Up India Abstract: Facial expression plays very important role in the communication with

More information

Deep Learning. Vladimir Golkov Technical University of Munich Computer Vision Group

Deep Learning. Vladimir Golkov Technical University of Munich Computer Vision Group Deep Learning Vladimir Golkov Technical University of Munich Computer Vision Group 1D Input, 1D Output target input 2 2D Input, 1D Output: Data Distribution Complexity Imagine many dimensions (data occupies

More information

Tutorial on Machine Learning Tools

Tutorial on Machine Learning Tools Tutorial on Machine Learning Tools Yanbing Xue Milos Hauskrecht Why do we need these tools? Widely deployed classical models No need to code from scratch Easy-to-use GUI Outline Matlab Apps Weka 3 UI TensorFlow

More information

Facial Expression Classification with Random Filters Feature Extraction

Facial Expression Classification with Random Filters Feature Extraction Facial Expression Classification with Random Filters Feature Extraction Mengye Ren Facial Monkey mren@cs.toronto.edu Zhi Hao Luo It s Me lzh@cs.toronto.edu I. ABSTRACT In our work, we attempted to tackle

More information

Fuzzy Set Theory in Computer Vision: Example 3

Fuzzy Set Theory in Computer Vision: Example 3 Fuzzy Set Theory in Computer Vision: Example 3 Derek T. Anderson and James M. Keller FUZZ-IEEE, July 2017 Overview Purpose of these slides are to make you aware of a few of the different CNN architectures

More information

CSC 578 Neural Networks and Deep Learning

CSC 578 Neural Networks and Deep Learning CSC 578 Neural Networks and Deep Learning Fall 2018/19 7. Recurrent Neural Networks (Some figures adapted from NNDL book) 1 Recurrent Neural Networks 1. Recurrent Neural Networks (RNNs) 2. RNN Training

More information

Study of Residual Networks for Image Recognition

Study of Residual Networks for Image Recognition Study of Residual Networks for Image Recognition Mohammad Sadegh Ebrahimi Stanford University sadegh@stanford.edu Hossein Karkeh Abadi Stanford University hosseink@stanford.edu Abstract Deep neural networks

More information

Advanced Video Analysis & Imaging

Advanced Video Analysis & Imaging Advanced Video Analysis & Imaging (5LSH0), Module 09B Machine Learning with Convolutional Neural Networks (CNNs) - Workout Farhad G. Zanjani, Clint Sebastian, Egor Bondarev, Peter H.N. de With ( p.h.n.de.with@tue.nl

More information

Facial expression recognition using shape and texture information

Facial expression recognition using shape and texture information 1 Facial expression recognition using shape and texture information I. Kotsia 1 and I. Pitas 1 Aristotle University of Thessaloniki pitas@aiia.csd.auth.gr Department of Informatics Box 451 54124 Thessaloniki,

More information

Residual Networks And Attention Models. cs273b Recitation 11/11/2016. Anna Shcherbina

Residual Networks And Attention Models. cs273b Recitation 11/11/2016. Anna Shcherbina Residual Networks And Attention Models cs273b Recitation 11/11/2016 Anna Shcherbina Introduction to ResNets Introduced in 2015 by Microsoft Research Deep Residual Learning for Image Recognition (He, Zhang,

More information

Exploring Bag of Words Architectures in the Facial Expression Domain

Exploring Bag of Words Architectures in the Facial Expression Domain Exploring Bag of Words Architectures in the Facial Expression Domain Karan Sikka, Tingfan Wu, Josh Susskind, and Marian Bartlett Machine Perception Laboratory, University of California San Diego {ksikka,ting,josh,marni}@mplab.ucsd.edu

More information

Convolution Neural Networks for Chinese Handwriting Recognition

Convolution Neural Networks for Chinese Handwriting Recognition Convolution Neural Networks for Chinese Handwriting Recognition Xu Chen Stanford University 450 Serra Mall, Stanford, CA 94305 xchen91@stanford.edu Abstract Convolutional neural networks have been proven

More information

Deep Learning for Computer Vision

Deep Learning for Computer Vision Deep Learning for Computer Vision Lecture 7: Universal Approximation Theorem, More Hidden Units, Multi-Class Classifiers, Softmax, and Regularization Peter Belhumeur Computer Science Columbia University

More information

Spatial Localization and Detection. Lecture 8-1

Spatial Localization and Detection. Lecture 8-1 Lecture 8: Spatial Localization and Detection Lecture 8-1 Administrative - Project Proposals were due on Saturday Homework 2 due Friday 2/5 Homework 1 grades out this week Midterm will be in-class on Wednesday

More information

Facial Emotion Recognition using Eye

Facial Emotion Recognition using Eye Facial Emotion Recognition using Eye Vishnu Priya R 1 and Muralidhar A 2 1 School of Computing Science and Engineering, VIT Chennai Campus, Tamil Nadu, India. Orcid: 0000-0002-2016-0066 2 School of Computing

More information

Sentiment Classification of Food Reviews

Sentiment Classification of Food Reviews Sentiment Classification of Food Reviews Hua Feng Department of Electrical Engineering Stanford University Stanford, CA 94305 fengh15@stanford.edu Ruixi Lin Department of Electrical Engineering Stanford

More information

Supplementary Material for: Video Prediction with Appearance and Motion Conditions

Supplementary Material for: Video Prediction with Appearance and Motion Conditions Supplementary Material for Video Prediction with Appearance and Motion Conditions Yunseok Jang 1 2 Gunhee Kim 2 Yale Song 3 A. Architecture Details (Section 3.2) We provide architecture details of our

More information

Multi-view Facial Expression Recognition Analysis with Generic Sparse Coding Feature

Multi-view Facial Expression Recognition Analysis with Generic Sparse Coding Feature 0/19.. Multi-view Facial Expression Recognition Analysis with Generic Sparse Coding Feature Usman Tariq, Jianchao Yang, Thomas S. Huang Department of Electrical and Computer Engineering Beckman Institute

More information

Disguised Face Identification (DFI) with Facial KeyPoints using Spatial Fusion Convolutional Network. Nathan Sun CIS601

Disguised Face Identification (DFI) with Facial KeyPoints using Spatial Fusion Convolutional Network. Nathan Sun CIS601 Disguised Face Identification (DFI) with Facial KeyPoints using Spatial Fusion Convolutional Network Nathan Sun CIS601 Introduction Face ID is complicated by alterations to an individual s appearance Beard,

More information

DEEP LEARNING REVIEW. Yann LeCun, Yoshua Bengio & Geoffrey Hinton Nature Presented by Divya Chitimalla

DEEP LEARNING REVIEW. Yann LeCun, Yoshua Bengio & Geoffrey Hinton Nature Presented by Divya Chitimalla DEEP LEARNING REVIEW Yann LeCun, Yoshua Bengio & Geoffrey Hinton Nature 2015 -Presented by Divya Chitimalla What is deep learning Deep learning allows computational models that are composed of multiple

More information

Character Recognition Using Convolutional Neural Networks

Character Recognition Using Convolutional Neural Networks Character Recognition Using Convolutional Neural Networks David Bouchain Seminar Statistical Learning Theory University of Ulm, Germany Institute for Neural Information Processing Winter 2006/2007 Abstract

More information

Perceptron: This is convolution!

Perceptron: This is convolution! Perceptron: This is convolution! v v v Shared weights v Filter = local perceptron. Also called kernel. By pooling responses at different locations, we gain robustness to the exact spatial location of image

More information

Artificial Neural Networks. Introduction to Computational Neuroscience Ardi Tampuu

Artificial Neural Networks. Introduction to Computational Neuroscience Ardi Tampuu Artificial Neural Networks Introduction to Computational Neuroscience Ardi Tampuu 7.0.206 Artificial neural network NB! Inspired by biology, not based on biology! Applications Automatic speech recognition

More information

Research on Pruning Convolutional Neural Network, Autoencoder and Capsule Network

Research on Pruning Convolutional Neural Network, Autoencoder and Capsule Network Research on Pruning Convolutional Neural Network, Autoencoder and Capsule Network Tianyu Wang Australia National University, Colledge of Engineering and Computer Science u@anu.edu.au Abstract. Some tasks,

More information

A Real Time Facial Expression Classification System Using Local Binary Patterns

A Real Time Facial Expression Classification System Using Local Binary Patterns A Real Time Facial Expression Classification System Using Local Binary Patterns S L Happy, Anjith George, and Aurobinda Routray Department of Electrical Engineering, IIT Kharagpur, India Abstract Facial

More information

Using Capsule Networks. for Image and Speech Recognition Problems. Yan Xiong

Using Capsule Networks. for Image and Speech Recognition Problems. Yan Xiong Using Capsule Networks for Image and Speech Recognition Problems by Yan Xiong A Thesis Presented in Partial Fulfillment of the Requirements for the Degree Master of Science Approved November 2018 by the

More information

LBP Based Facial Expression Recognition Using k-nn Classifier

LBP Based Facial Expression Recognition Using k-nn Classifier ISSN 2395-1621 LBP Based Facial Expression Recognition Using k-nn Classifier #1 Chethan Singh. A, #2 Gowtham. N, #3 John Freddy. M, #4 Kashinath. N, #5 Mrs. Vijayalakshmi. G.V 1 chethan.singh1994@gmail.com

More information

Computer Vision Lecture 16

Computer Vision Lecture 16 Computer Vision Lecture 16 Deep Learning for Object Categorization 14.01.2016 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Announcements Seminar registration period

More information

A Deep Learning Approach to Vehicle Speed Estimation

A Deep Learning Approach to Vehicle Speed Estimation A Deep Learning Approach to Vehicle Speed Estimation Benjamin Penchas bpenchas@stanford.edu Tobin Bell tbell@stanford.edu Marco Monteiro marcorm@stanford.edu ABSTRACT Given car dashboard video footage,

More information

Neural Network Neurons

Neural Network Neurons Neural Networks Neural Network Neurons 1 Receives n inputs (plus a bias term) Multiplies each input by its weight Applies activation function to the sum of results Outputs result Activation Functions Given

More information

Advanced Introduction to Machine Learning, CMU-10715

Advanced Introduction to Machine Learning, CMU-10715 Advanced Introduction to Machine Learning, CMU-10715 Deep Learning Barnabás Póczos, Sept 17 Credits Many of the pictures, results, and other materials are taken from: Ruslan Salakhutdinov Joshua Bengio

More information

Human Face Classification using Genetic Algorithm

Human Face Classification using Genetic Algorithm Human Face Classification using Genetic Algorithm Tania Akter Setu Dept. of Computer Science and Engineering Jatiya Kabi Kazi Nazrul Islam University Trishal, Mymenshing, Bangladesh Dr. Md. Mijanur Rahman

More information

On the Effectiveness of Neural Networks Classifying the MNIST Dataset

On the Effectiveness of Neural Networks Classifying the MNIST Dataset On the Effectiveness of Neural Networks Classifying the MNIST Dataset Carter W. Blum March 2017 1 Abstract Convolutional Neural Networks (CNNs) are the primary driver of the explosion of computer vision.

More information

CPSC 340: Machine Learning and Data Mining. Deep Learning Fall 2016

CPSC 340: Machine Learning and Data Mining. Deep Learning Fall 2016 CPSC 340: Machine Learning and Data Mining Deep Learning Fall 2016 Assignment 5: Due Friday. Assignment 6: Due next Friday. Final: Admin December 12 (8:30am HEBB 100) Covers Assignments 1-6. Final from

More information

Large-scale Video Classification with Convolutional Neural Networks

Large-scale Video Classification with Convolutional Neural Networks Large-scale Video Classification with Convolutional Neural Networks Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, Li Fei-Fei Note: Slide content mostly from : Bay Area

More information

Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Knowledge Discovery and Data Mining Lecture 13 - Neural Nets Tom Kelsey School of Computer Science University of St Andrews http://tom.home.cs.st-andrews.ac.uk twk@st-andrews.ac.uk Tom Kelsey ID5059-13-NN

More information

Tutorial on Keras CAP ADVANCED COMPUTER VISION SPRING 2018 KISHAN S ATHREY

Tutorial on Keras CAP ADVANCED COMPUTER VISION SPRING 2018 KISHAN S ATHREY Tutorial on Keras CAP 6412 - ADVANCED COMPUTER VISION SPRING 2018 KISHAN S ATHREY Deep learning packages TensorFlow Google PyTorch Facebook AI research Keras Francois Chollet (now at Google) Chainer Company

More information

Machine Learning Classifiers and Boosting

Machine Learning Classifiers and Boosting Machine Learning Classifiers and Boosting Reading Ch 18.6-18.12, 20.1-20.3.2 Outline Different types of learning problems Different types of learning algorithms Supervised learning Decision trees Naïve

More information