Emotion Classification
Shai Savir 038052395
Gil Sadeh 026511469

1. Abstract
Automated facial expression recognition has received increased attention over the past two decades. Facial expressions convey non-verbal cues, which play an important role in interpersonal relations. The objective of this project is to demonstrate the feasibility of a camera-based application that recognizes, in real time, a face in an image and analyzes its emotional expression. One potential outcome is a tool for blind and autistic people who are unable, or have difficulty, recognizing emotions from facial expressions; it would give them real-time feedback on what the person in front of them is feeling. It could also be used as an instructional tool to train people with emotion-recognition disorders and improve their ability to recognize and express emotional expressions.

2. Introduction
2.1. Motivation
Face recognition is a task that humans perform routinely and effortlessly in their daily lives. It is a visual pattern recognition problem in which a three-dimensional object must be identified from its two-dimensional image. In recent years, significant progress has been made in this area, so research is now focusing more on facial expression recognition. The motivation to focus on facial expressions comes from many aspects of our day-to-day lives:
- Nonverbal information prevails over words themselves in human communication.
- The ubiquitous and universal use of computational systems requires improved human-computer interaction.
- Humanizing computers: more human-like human-computer and human-robot interaction.
- Treatment for people with psycho-affective illnesses (e.g., autism).
- Helping the blind experience what people with eyesight take for granted.

2.2.
Project Purpose & Possible Applications
Our project's goal is to demonstrate the feasibility of a camera-based application that recognizes, in real time, a face in an image and analyzes its emotional expression. We tried to achieve this goal by using new techniques discussed in [1] and attempting to improve them. A possible application is a tool for blind and autistic people who are unable, or have difficulty, recognizing emotions from facial expressions; such a tool would give them feedback on
what the person in front of them is feeling in real time. It could also be used as an instructional tool to train people with emotion-recognition disorders and improve their ability to recognize and express emotional expressions.

2.3. Locality preserving projection (LPP)
The LPP method, as described in [3], embeds image sequences of facial expressions from the high-dimensional appearance feature space into a low-dimensional manifold. The goal of creating such a manifold is to discover a latent space in which the topology of the input features x, sometimes also informed by the labels of x, is preserved. Such a data representation may be more discriminative and better suited for modeling dynamic ordinal regression. A face image with N pixels can be considered a point in the N-dimensional image space, and the variations of face images can be represented as low-dimensional manifolds. It is therefore desirable to analyze facial expressions in the low-dimensional subspace rather than in the ambient space.

Given a set of m images x_1, ..., x_m in R^N, we find a transformation matrix A that maps these points to y_1, ..., y_m in R^l (l << N), such that y_i = A^T x_i represents x_i. LPP is a linear approximation of the nonlinear Laplacian Eigenmap. It is achieved by the following algorithm:
1. Constructing an adjacency graph: we construct a graph with m nodes, with an edge between nodes i and j if x_i is close to x_j. Closeness can be determined by the Euclidean norm, ||x_i - x_j||^2 < epsilon (w.r.t. a parameter epsilon), or by considering the k nearest neighbors (w.r.t. a parameter k).
2. Weighting the edges: W is the symmetric weight matrix, where W_ij holds the weight of the edge joining nodes i and j, and W_ij = 0 if there is no edge joining them. There are two variations in choosing the weights:
a. W_ij = 1, if nodes i and j are connected.
b. W_ij = exp(-||x_i - x_j||^2 / t), if nodes i and j are connected (heat-kernel parameter t).
3. Eigenmaps: we compute the eigenvectors and eigenvalues of the generalized eigenvector problem X L X^T a = lambda X D X^T a, where D is a diagonal matrix whose entries are the column sums of W (D_ii = sum_j W_ji), L = D - W is the Laplacian matrix, and X is the matrix whose columns are x_1, ..., x_m.
We then take the eigenvectors a_0, ..., a_{l-1} corresponding to the l smallest eigenvalues and compute y_i = A^T x_i, where A = (a_0, ..., a_{l-1}) is the N x l projection matrix.
The basic idea behind this algorithm is to minimize the objective function sum_ij (y_i - y_j)^2 W_ij, which is the criterion we use for choosing a good map. Further details can be found in [8]. The advantages of this method are that the mapping preserves locality w.r.t. the objective function, that it is much easier to analyze facial expressions in the low-dimensional subspace, and that the method is linear and therefore fast and suitable for practical applications.
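To make the three steps concrete, the following is a minimal NumPy/SciPy sketch of LPP. It is our illustration, not the project's MATLAB code: the function name and parameter choices are assumptions, rows (not columns) hold samples so the eigenproblem reads X^T L X a = lambda X^T D X a, and a small ridge term keeps the right-hand matrix positive definite.

```python
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist

def lpp(X, n_components=2, n_neighbors=5, t=1.0):
    """Locality Preserving Projection (sketch).
    X: (m, N) data matrix, rows are samples.
    Returns A of shape (N, n_components) so that Y = X @ A."""
    m = X.shape[0]
    dist = cdist(X, X, metric="sqeuclidean")
    # Step 1: adjacency graph via k nearest neighbors, symmetrized.
    idx = np.argsort(dist, axis=1)[:, 1:n_neighbors + 1]
    adj = np.zeros((m, m), dtype=bool)
    rows = np.repeat(np.arange(m), n_neighbors)
    adj[rows, idx.ravel()] = True
    adj |= adj.T
    # Step 2: heat-kernel weights on connected pairs.
    W = np.where(adj, np.exp(-dist / t), 0.0)
    # Step 3: generalized eigenproblem X^T L X a = lam X^T D X a
    # (W is symmetric, so row sums equal column sums).
    D = np.diag(W.sum(axis=1))
    L = D - W
    lhs = X.T @ L @ X
    rhs = X.T @ D @ X + 1e-9 * np.eye(X.shape[1])  # ridge for stability
    # eigh returns eigenvalues in ascending order; take the smallest.
    vals, vecs = eigh(lhs, rhs)
    return vecs[:, :n_components]
```

The smallest generalized eigenvalues correspond to the map that minimizes sum_ij (y_i - y_j)^2 W_ij, so nearby inputs stay nearby in the projected space.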
2.4. Supervised locality preserving projection (S-LPP)
For a data set containing images of typical expressions from different subjects, appearance varies greatly across subjects, so there is significant overlap among the different expression classes. Therefore, the original LPP, which operates in an unsupervised manner, fails to embed the data set in a low-dimensional space in which the different expression classes are well clustered. The Supervised Locality Preserving Projections (S-LPP) method, described in [4], solves this problem by not only preserving local structure but also encoding class information in the process. The local neighborhood of a sample from class c should be composed of samples belonging to class c only. This can be achieved by increasing the distances between samples belonging to different classes as follows:

D'(x_i, x_j) = D(x_i, x_j) + alpha * M * delta(x_i, x_j)

where D(x_i, x_j) is the distance between x_i and x_j, M = max_{i,j} D(x_i, x_j), and delta(x_i, x_j) = 1 if x_i and x_j belong to different classes, and 0 otherwise. alpha is a parameter that determines the supervising extent. When alpha = 0 we obtain unsupervised LPP. When alpha = 1 we obtain fully supervised LPP, where distances between samples from different classes are larger than the maximum distance in the entire data set. Varying alpha between 0 and 1 gives a partially supervised LPP, which creates some separation between the classes. By applying S-LPP to the data set of images of typical expressions, a subspace is derived in which the different expression classes are well clustered and separated. The subspace provides global coordinates for the manifolds of the different subjects, which are aligned on one generalized manifold. Image sequences representing facial expressions from beginning to apex are mapped on the generalized manifold as curves from the neutral faces to the clusters of the typical expressions.

3. Our Solution
3.1.
Main idea
The main idea of our solution is to form a large image database of several subjects displaying different emotions, focusing on three basic emotions: happiness, anger and surprise. From each image we extract important features that contain information about the displayed emotion (we elaborate on the features we decided to use later on) and form a feature vector for each image. After collecting a large database of feature vectors, we use a variation of the S-LPP method to obtain a projection that both maps our feature vectors into a lower dimension (important for real-time performance) and obtains the optimal separation between the feature vectors of the different classes (emotions).
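The class-aware distance inflation at the heart of the S-LPP variation mentioned above (Section 2.4) can be sketched as follows; the function name and the choice of Euclidean distances are our illustrative assumptions.

```python
import numpy as np
from scipy.spatial.distance import cdist

def supervised_distances(X, labels, alpha=0.5):
    """Inflate pairwise distances between samples of different
    classes, as in S-LPP: D' = D + alpha * M * delta.
    X: (m, d) samples; labels: length-m class labels."""
    D = cdist(X, X)                  # pairwise Euclidean distances
    M = D.max()                      # largest distance in the data set
    labels = np.asarray(labels)
    delta = (labels[:, None] != labels[None, :]).astype(float)
    return D + alpha * M * delta
```

With alpha = 1, every between-class distance exceeds the largest within-set distance M, so a nearest-neighbor graph built from D' connects only same-class samples.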
After finding this optimal projection, we calculate the mean and covariance matrix of each class; under the assumption of a Gaussian distribution, this yields a normal probability density function (PDF) that approximates the actual PDF of each class. After performing all of the above offline, we can run real-time face tracking, extract the features (the same way we did when collecting our database), and project the feature vector using the optimal projection that we found. Since we achieved a good separation between the classes, we then evaluate the PDF of each class at the location of the projected feature vector. After calculating the three PDFs, we normalize the values so that they sum to one, obtaining the probability that the current frame expresses each emotion.

3.2. Detailed description
In our project we decided to use the Kinect camera, because we believe that more accurate face tracking and facial feature extraction can be achieved with such cameras, thanks to the additional depth information they supply. We were also assisted by the Microsoft Kinect Developer Toolkit, which supplied us with a C++ algorithm for face tracking [2]. We managed to sync the C++ algorithm with the MATLAB command window: we use the face tracking algorithm for real-time face tracking and then convert the information we need (such as the x-y locations of important facial points) to MATLAB MEX variables and send them to the MATLAB workspace. The rest of our project is implemented in MATLAB. The face tracking algorithm also finds 98 facial points (as shown in Figure 1) and returns their 2D locations (x and y coordinates). We took these points to create a large feature vector (of size 4950) of the distances between every two points, reasoning that the distances between points contain information about the displayed emotion.

Figure 1: Tracked Points

We built a database containing 4 subjects.
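The pairwise-distance feature construction described above can be sketched as follows; the function name and NumPy implementation are our illustration, not the project's MATLAB code. For n tracked points it yields n(n-1)/2 distances.

```python
import numpy as np

def distance_feature_vector(points):
    """Build a feature vector of the distances between every pair of
    tracked 2D facial points. points: (n, 2) array of x-y locations."""
    pts = np.asarray(points, dtype=float)
    n = pts.shape[0]
    i, j = np.triu_indices(n, k=1)   # all unordered point pairs
    return np.linalg.norm(pts[i] - pts[j], axis=1)
```

For example, three points yield a vector of 3 distances, and 100 points would yield 100*99/2 = 4950 distances.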
Each subject was filmed in a short 10-second clip for each emotion, and each clip was sampled to obtain many images. For each image we extracted the locations of the 98 facial points and constructed the distance feature vector. After constructing the feature vectors for the entire database, we needed to find the optimal projection that best separates our 3 classes. We did so using a variation of the S-LPP method. We
divided the feature vectors into segments of size 100 (out of the 4950 entries) and performed the S-LPP method on each segment separately, and from each segment we took the two dimensions that gave the best separation. Doing so, we obtained 49 projection matrices of size 100 x 2. To project a feature vector, we take each segment and multiply it by the corresponding projection matrix; by concatenating the projected products we receive a 98-dimensional projected feature vector. After building the projected database, we calculated the mean and covariance matrix of each class; this data, together with the projection matrices, is saved and loaded by the real-time algorithm. The real-time algorithm takes the images and facial points from the FaceTrackingVisualization algorithm and creates the feature vector and projected feature vector for each image. We then calculate the approximated PDF at the projected feature point for every class, and after normalizing these three results we obtain the probability that the image belongs to each class. These results are normalized so that they sum to one. In order to smooth our results, we also calculated the distance from the mean of each class, normalized the distances, and concluded the probability calculation as follows:

P_i(y) = (1/Z_p) * N_i(y) * (1/Z_d) * exp(-s * d_i(y))

where N_i(y) stands for the Gaussian probability density of class i, with the mean and covariance matrix that we extract from the projected database; y is the projected feature vector; d_i(y) is the distance of y from the mean of class i; Z_p is the probability normalization factor; Z_d is the distance normalization factor; and s is a steepness parameter (s > 0).

4. Experiments and results
To test our results, we ran the database through our algorithm and checked the success probability. The hit/miss percentages are summarized in the following table:

          Happy     Angry     Surprise
Hit %     98.9744   98.1061   95.4670
Miss %     1.0256    1.8939    4.5330

We also viewed the separation of the different classes on different dimensions.
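The real-time classification step, segment-wise projection followed by the smoothed per-class probability calculation, can be sketched as follows. This is our illustration, not the project's MATLAB code: the function names, the SciPy multivariate-normal evaluation, and the exact way the normalized Gaussian PDF is combined with the distance-from-mean term are assumptions on our part.

```python
import numpy as np
from scipy.stats import multivariate_normal

def project_features(fv, proj_mats, seg_len=100):
    """Project each length-100 segment of the raw distance feature
    vector with its own (100 x 2) matrix and concatenate the results."""
    parts = [fv[k * seg_len:(k + 1) * seg_len] @ P
             for k, P in enumerate(proj_mats)]
    return np.concatenate(parts)

def emotion_probabilities(y, means, covs, s=1.0):
    """Per-class probabilities for a projected feature vector y:
    normalized Gaussian likelihoods smoothed by an exponential
    distance-from-mean term, renormalized to sum to one."""
    pdfs = np.array([multivariate_normal.pdf(y, mean=m, cov=c)
                     for m, c in zip(means, covs)])
    dists = np.array([np.linalg.norm(y - m) for m in means])
    scores = (pdfs / pdfs.sum()) * np.exp(-s * dists / dists.sum())
    return scores / scores.sum()
```

The distance term suppresses classes whose mean is far from y even when their Gaussian tails overlap, which is one plausible reading of the smoothing described above.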
The graphs below show the separation in 4 different dimensions:
[Graphs: projections of the database onto 4 different dimensions, showing the separation between the three emotion classes]

The pictures below give an idea of how the system works in real time. The results obtained for the three emotions give a good separation and a good determination of the emotion. However, we still have some points to work on:
- Add more emotions: dealing with three classes is significantly easier than dealing with a wider range of classes.
- Implement a function that rotates the face, so that even if the face has some 3D shift we can straighten it; this way our algorithm will be invariant to rotation of the face over a wider range.

5. Conclusions and Future Ideas
We managed to achieve satisfactory results, considering that our database is based on only 4 different people. In order to achieve better results, we will need to make our algorithm invariant to facial rotations, to enlarge our database significantly, and possibly to improve our variation of the S-LPP method in order to achieve separation for a larger number of classes.

6. References
[1] C. Shan, S. Gong, and P. W. McOwan. Appearance manifold of facial expression. Lecture Notes in Computer Science, 3766:221-230, 2005.
[2] Microsoft Kinect Developer Toolkit Browser 1.6, FaceTrackingVisualization sample.