Manifold Learning for Video-to-Video Face Recognition

Abstract. In this work we address the problem of video-based face recognition in which both the training and test sets are video sequences, and propose a novel approach based on manifold learning. The idea is to first learn the intrinsic personal characteristics of each subject from the training video sequences by discovering the hidden low-dimensional nonlinear manifold of each individual. A target face video sequence is then projected onto and compared to the manifold of each subject. The closest manifold, in terms of a recently introduced manifold distance measure, determines the identity of the person in the sequence. Experiments on a large set of talking faces under different image resolutions show very promising results (a recognition rate of 99.8%), outperforming many traditional approaches.

1 Introduction

Recently, there has been increasing interest in video-based face recognition (e.g. [1-3]). This is partially due to the limitations of still image-based methods in handling illumination changes, pose variations and other factors. The most studied scenario in video-based face recognition uses a set of still images as the gallery (enrollment set) and video sequences as the probe (test set). However, in some real-world applications, such as human-computer interaction and content-based video retrieval, both the training and test sets can be video sequences. In such settings, video-to-video matching may be crucial for robust face recognition, but this task is far from trivial. There are several ways of approaching the problem of face recognition in which both the training and test sets are video sequences. One could build an appearance-based system by selecting a few exemplars from the training sequences as gallery models, performing still image-based recognition, and fusing the results over the target video sequence [4].
Obviously, such an approach is not optimal, as important information in the video sequences may be left out. Another direction consists of using spatiotemporal representations to encode the information in both the training and test video sequences [1-3]. Perhaps the most popular approach in this category is based on hidden Markov models (HMMs), which have been successfully applied to face recognition from videos [2]. The idea is quite simple: in the training phase, an HMM is created to learn both the statistics and the temporal dynamics of each individual. During recognition, the temporal characteristics of the face sequence are analyzed over time by the HMM corresponding to each subject, and the likelihood scores provided by the HMMs are compared. The highest score provides
the identity of a face in the video sequence. Unfortunately, most methods using spatiotemporal representations for face recognition have not yet shown their full potential, as they suffer from various drawbacks, such as relying only on global features although local information has been shown to be important in facial image analysis [5], and failing to discriminate between the facial dynamics that are useful for recognition and those that can hinder it [6]. Very recently, inspired by studies in neuroscience emphasizing the manifold ways of visual perception, we introduced in [7] a novel method for gender classification from videos using manifold learning. The idea consists of clustering the face sequences in a low-dimensional space based on the intrinsic characteristics of men and women. A target face sequence is then projected into both the men's and the women's manifold for classification. The proposed approach achieved excellent results not only on the gender recognition problem but also on age and ethnicity classification from face video sequences. In this work, we extend the approach proposed in [7] to the problem of video-to-video face recognition. We propose to first learn and discover the hidden low-dimensional nonlinear manifold of each individual. A target face sequence can then be projected into each manifold for classification, and the closest manifold determines the identity of the person in the target face video sequence. The experiments presented in Section 4 show that such a manifold-based approach yields excellent results, outperforming many traditional methods for video-based face recognition. The rest of this paper is organized as follows. Section 2 explains the notion of a face manifold and discusses some learning methods. We then describe our proposed approach to the problem of video-to-video face recognition and the experimental analysis in Sections 3 and 4, respectively. Finally, we draw conclusions in Section 5.
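The per-subject HMM scoring described above can be sketched with a minimal scaled forward algorithm for discrete observations. The one-state toy models, the two-symbol observation alphabet and the `identify` helper below are illustrative assumptions for the sketch, not the actual setup of [2]:

```python
import numpy as np

def log_likelihood(obs, pi, A, B):
    """Scaled forward algorithm: log P(obs | HMM) for a discrete-observation HMM.
    pi: initial state probabilities, A: state transition matrix, B: emission matrix."""
    alpha = pi * B[:, obs[0]]
    log_p = np.log(alpha.sum())
    alpha /= alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]   # propagate through transitions, weight by emission
        c = alpha.sum()
        log_p += np.log(c)              # accumulate the scaling factors
        alpha /= c
    return log_p

def identify(obs, hmms):
    """Score the probe sequence against every subject's HMM; highest likelihood wins."""
    return max(hmms, key=lambda p: log_likelihood(obs, *hmms[p]))

# Toy example: two one-state "subjects" with different emission statistics.
hmms = {
    "alice": (np.array([1.0]), np.array([[1.0]]), np.array([[0.9, 0.1]])),
    "bob":   (np.array([1.0]), np.array([[1.0]]), np.array([[0.1, 0.9]])),
}
print(identify([0, 0, 0, 1], hmms))  # -> alice
```

In practice each subject's HMM would be trained on quantized features of the gallery frames, but the decision rule is exactly this likelihood comparison.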
2 Face Manifold

Let I(P, s) denote a face image of a person P at configuration s. The variable s describes a combination of factors such as facial expression, pose, illumination, etc. Let

ξ_P = { I(P, s) : s ∈ S }    (1)

be the collection of face images of the person P under all possible configurations S. The set ξ_P thus defined is called the face manifold of person P. Additionally, if we consider the face images of all individuals, we obtain the face manifold ξ:

ξ = ∪_P ξ_P    (2)

Such a manifold ξ resides only in a small subspace of the high-dimensional image space. Consider the example of Fig. 1, showing face images of a person moving his face from left to right. The only obvious degree of freedom in this case is the rotation angle of the face. Therefore, the intrinsic dimensionality of
the faces is very small (close to 1). However, these faces are embedded in a 1600-dimensional image space (since the face images have 40 × 40 = 1600 pixels), which is highly redundant. If one could discover the hidden low-dimensional structure of these faces (the rotation angle of the face) from the input observations, this would greatly facilitate further analysis of the face images, such as visualization, classification and retrieval. Our proposed approach to the problem of video-to-video face recognition, described in Section 3, exploits the properties of face manifolds.

Fig. 1. An example showing a face manifold of a given subject embedded in the high-dimensional image space

Neuroscience studies have also pointed out the manifold ways of visual perception [8]. Indeed, facial images are not isolated patterns in the image space but lie on a nonlinear low-dimensional manifold. The key issue in manifold learning is to discover the low-dimensional manifold embedded in the high-dimensional space. This can be done by projecting the face images into low-dimensional coordinates. Several methods exist for this purpose. The traditional ones are Principal Component Analysis (PCA) and Multidimensional Scaling (MDS). These methods are simple to implement and efficient in discovering the structure of data lying on or near linear subspaces of the high-dimensional input space. However, face images do not satisfy this constraint, as they lie on a complex nonlinear and nonconvex manifold in the high-dimensional space. Therefore, such linear methods generally fail to discover the real structure of the face images in the low-dimensional space. As an alternative to PCA and MDS, one can consider nonlinear dimensionality reduction methods such as Self-Organizing Maps (SOM) [9], Generative Topographic Mapping (GTM) [10], Sammon's Mapping (SM) [11], etc.
Though these methods can also handle nonlinear manifolds, most of them tend to involve several free parameters such as learning rates and convergence criteria. In addition, most of these methods do not have an obvious guarantee of convergence to the global optimum. Fortunately, in the recent
years, a set of new manifold learning algorithms has emerged. These methods are based on an eigendecomposition and combine the major algorithmic features of PCA and MDS (computational efficiency, global optimality, and asymptotic convergence guarantees) with the flexibility to learn a broad class of nonlinear manifolds. Among these algorithms are Locally Linear Embedding (LLE) [12], ISOmetric feature MAPping (ISOMAP) [13] and Laplacian Eigenmaps [14].

3 Proposed Approach to Video-to-Video Face Recognition

We approach the problem of video-to-video face recognition from a manifold learning perspective. We adopt the LLE algorithm for manifold learning due to its demonstrated simplicity and efficiency in recovering meaningful low-dimensional structures hidden in complex, high-dimensional data such as face images. LLE is an unsupervised learning algorithm that maps high-dimensional data onto a low-dimensional, neighborhood-preserving embedding space. In brief, given a set of N face images organized into a matrix X (where each column vector X_i represents a face), the LLE algorithm involves the following three steps:

1. Find the k nearest neighbors of each point X_i.
2. Compute the weights W_ij that best reconstruct each data point from its neighbors, minimizing the cost in Equation (3):

ε(W) = Σ_{i=1}^{N} ‖ X_i − Σ_{j ∈ neighbors(i)} W_ij X_j ‖²    (3)

while enforcing the constraints W_ij = 0 if X_j is not a neighbor of X_i, and Σ_{j=1}^{N} W_ij = 1 for every i (to make the weights translation-invariant).
3. Compute the embedding Y (of lower dimensionality d ≪ D, where D is the dimension of the input data) best reconstructed by the weights W_ij, minimizing the quadratic form in Equation (4):

Φ(Y) = Σ_{i=1}^{N} ‖ Y_i − Σ_{j ∈ neighbors(i)} W_ij Y_j ‖²    (4)

under the constraints Σ_{i=1}^{N} Y_i = 0 (to center the embedding at the origin) and (1/N) Σ_{i=1}^{N} Y_i Y_iᵀ = I (unit covariance).
The aim of the first two steps of the algorithm is to preserve the local geometry of the data in the low-dimensional space, while the last step discovers the global structure by integrating information from overlapping local neighborhoods. LLE is an efficient approach to compute the low-dimensional embeddings of high-dimensional data assumed to lie on a non-linear manifold. Its ability to deal with large sizes of high-dimensional data and its non-iterative way to find the embeddings make it attractive.
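The three steps can be sketched in plain NumPy. The sketch below stores one face per row (the transpose of the column convention above), and the regularization term and the toy curve data are illustrative choices rather than part of the original algorithm description:

```python
import numpy as np

def lle(X, k=5, d=2, reg=1e-3):
    """Sketch of the three LLE steps (Roweis & Saul); X holds one sample per row."""
    N = X.shape[0]
    # Step 1: k nearest neighbors of every point (excluding the point itself).
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)
    neighbors = np.argsort(dists, axis=1)[:, :k]
    # Step 2: reconstruction weights minimizing Eq. (3), each row summing to one.
    W = np.zeros((N, N))
    for i in range(N):
        Z = X[neighbors[i]] - X[i]           # center the neighbors on X_i
        C = Z @ Z.T                          # local Gram matrix
        C += reg * np.trace(C) * np.eye(k)   # regularize for numerical stability
        w = np.linalg.solve(C, np.ones(k))
        W[i, neighbors[i]] = w / w.sum()     # enforce the sum-to-one constraint
    # Step 3: embedding minimizing Eq. (4): bottom eigenvectors of (I-W)^T (I-W).
    M = (np.eye(N) - W).T @ (np.eye(N) - W)
    vals, vecs = np.linalg.eigh(M)
    return vecs[:, 1:d + 1]                  # skip the constant bottom eigenvector

# Toy data: noisy points along a 1-D curve embedded in a 10-D space.
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0, 3, 60))
X = np.stack([np.sin(t), np.cos(t)] + [0.01 * rng.standard_normal(60) for _ in range(8)], axis=1)
Y = lle(X, k=6, d=1)
print(Y.shape)  # -> (60, 1)
```

Dropping the constant eigenvector and taking the next d eigenvectors of M satisfies the centering and unit-covariance constraints of step 3.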
We are given a set of training face video sequences, with one or more sequences per person. For each person, we first apply the LLE algorithm to all of his/her face images in the training set. We then obtain coordinates in the low-dimensional space, defining a face manifold of the person. Let us denote the obtained embedding for a given person P as ξ^P. Note that the calculation of ξ^P involves only two free parameters: the number of neighbors (k) and the dimension of the embedding space (d). A discussion of the values of these two parameters can be found in [7]. To determine the identity of an unknown person in a given face sequence {Face_frame(1), Face_frame(2), ..., Face_frame(L)}, we first project every face instance Face_frame(i) into the face manifold of each subject in the low-dimensional space. The closest manifold then determines the identity of the person in the sequence. Fig. 2 shows an example of the embedding results of three video sequences of the subjects shown in Fig. 3. The projection of the target face sequence into the manifold of person P is done using the following steps:

a. Let X_i be the column vector representing the face image Face_frame(i) from the new sequence.
b. Find the k nearest neighbors of each point X_i among the training face samples of person P.
c. Compute the weights W_ij that best reconstruct each data point X_i from its neighbors using Equation (3).
d. Use the obtained weights W_ij to compute the embedding Y_i^P of each point X_i (i.e. Face_frame(i)) as:

Y_i^P = Σ_{j ∈ neighbors(X_i)} W_ij ξ_j^P    (5)

where ξ_j^P refers to the embedding point of the j-th neighbor of the point X_i in the face manifold of person P. As a result, we obtain the embedding Y^P of the new face sequence in every face manifold ξ^P.
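Steps (a)-(d) and the closest-manifold decision can be sketched as follows. The weights are solved from the same regularized local Gram system used for Equation (3); the regularizer, the mean-of-nearest-points distance helper, and the toy arrays in the test are our illustrative assumptions:

```python
import numpy as np

def project_frame(x_new, X_train, xi_train, k=3, reg=1e-3):
    """Steps (a)-(d): embed one probe frame x_new into person P's manifold,
    given P's training faces X_train (one per row) and their embeddings xi_train."""
    nbrs = np.argsort(np.linalg.norm(X_train - x_new, axis=1))[:k]  # step (b)
    Z = X_train[nbrs] - x_new                                       # step (c): Eq. (3) weights
    C = Z @ Z.T
    C += reg * np.trace(C) * np.eye(k)
    w = np.linalg.solve(C, np.ones(k))
    w /= w.sum()                                                    # enforce sum-to-one
    return w @ xi_train[nbrs]                                       # step (d): Eq. (5)

def manifold_distance(Y, xi):
    """Mean distance from each embedded probe frame to its closest manifold point."""
    d = np.linalg.norm(Y[:, None, :] - xi[None, :, :], axis=-1)
    return d.min(axis=1).mean()

def identify(probe, gallery, k=3):
    """Project the probe sequence into every subject's manifold; closest one wins.
    gallery: dict mapping person -> (training faces, their LLE embedding)."""
    scores = {}
    for p, (X_tr, xi) in gallery.items():
        Y = np.stack([project_frame(x, X_tr, xi, k) for x in probe])
        scores[p] = manifold_distance(Y, xi)
    return min(scores, key=scores.get)
```

A probe frame far from a subject's training faces gets wildly extrapolated reconstruction weights, so its embedding lands far from that subject's manifold; this is what makes the distance discriminative.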
Then we compute how close the embedding Y^P is to the face manifold ξ^P using:

D_P = (1/L) Σ_{i=1}^{L} ‖ Y_i^P − ξ_j^{P(i)} ‖    (6)

where L is the length of the target face sequence, Y_i^P is the embedding of the point X_i in the low-dimensional space, and ξ_j^{P(i)} is the closest point (in terms of Euclidean distance) on the manifold ξ^P to Y_i^P. Finally, the identity of the person in the target face sequence is given by argmin_P D_P.

Fig. 2. Examples of the embedding results of three sequences of the subjects shown in Fig. 3

4 Experimental Analysis

For the experimental analysis, we considered the VidTIMIT [15] face video database containing 43 talking subjects (19 female and 24 male) reciting ten short sentences in three sessions, with an average delay of a week between sessions, allowing for appearance and mood changes. In total, there are ten face sequences per person. For each sequence, we automatically detected the eye positions in the first frame. The detected eye positions were then used to crop the facial area in the whole sequence, yielding face images that are not perfectly aligned. Finally, we scaled the resulting images to four different resolutions: 20 × 20, 40 × 40, 60 × 60 and 80 × 80 pixels. Examples of face images from some sequences are shown in Fig. 3.

Fig. 3. Examples of facial images extracted from videos of three different subjects

For evaluation, we randomly selected one face sequence per person for training, while the rest were used for testing. In all our experiments, we report the average recognition rates over 100 random permutations. For a comparative study, we also implemented several state-of-the-art methods, including three still image-based methods (PCA, LDA and LBP [16]) and two spatiotemporal approaches (HMM [2] and ARMA [1]). For the still image-based analysis, we adopted the scheme proposed in [4] to perform appearance-based face recognition from videos. The approach consists of performing unsupervised learning to extract a set of K most representative samples (or exemplars) from the raw gallery videos (K = 3 in our experiments). Once these exemplars are extracted, we build a view-based system and use a probabilistic voting strategy to recognize the individuals in the probe video sequences.

The performance of our proposed approach, and of the other considered methods, under the four different resolutions is plotted in Fig. 4. The results show that all the methods perform quite well, but the proposed manifold-based approach significantly outperforms all the other methods in all image resolution configurations. For instance, at an image resolution of 60 × 60, our approach yielded a recognition rate of 99.8%, while PCA, LDA, LBP, HMM and ARMA yielded recognition rates of 94.2%, 94.0%, 97.6%, 92.9% and 95.8%, respectively. It is worth noting that, in addition to its efficiency, our approach involves only two free parameters, which are quite easy to determine [7]. The results also show that the spatiotemporal methods (HMM and ARMA) do not always perform better than the PCA, LDA and LBP based methods. This supports the conclusions of other researchers indicating that using spatiotemporal representations does not systematically enhance recognition performance. Our results also show that low image resolutions affect all the methods, and the best results using the proposed manifold-based approach are obtained at an image resolution of 60 × 60 pixels.

Table 1. The performance of different methods at an image resolution of 60 × 60 pixels

Method              Recognition rate
PCA                 94.2%
LDA                 94.0%
LBP [16]            97.6%
HMM [2]             92.9%
ARMA [1]            95.8%
Manifold Learning   99.8%
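The exemplar-selection step of the still image-based baseline [4] can be sketched with a plain k-means pass over the gallery frames. The clustering details below (random initialization, Euclidean distance, keeping the frame nearest each centroid) are our illustrative assumptions rather than the exact procedure of [4]:

```python
import numpy as np

def select_exemplars(frames, K=3, iters=20, seed=0):
    """Pick K representative frames from a gallery video: cluster the frames
    (one per row) with k-means and keep the frame nearest to each centroid."""
    rng = np.random.default_rng(seed)
    centroids = frames[rng.choice(len(frames), size=K, replace=False)]
    for _ in range(iters):
        # assign each frame to its nearest centroid
        labels = np.linalg.norm(frames[:, None] - centroids[None], axis=-1).argmin(axis=1)
        # recompute centroids, keeping the old one if a cluster goes empty
        centroids = np.stack([
            frames[labels == c].mean(axis=0) if np.any(labels == c) else centroids[c]
            for c in range(K)
        ])
    # indices of the gallery frames closest to the centroids are the exemplars
    return np.array([np.linalg.norm(frames - c, axis=1).argmin() for c in centroids])
```

At recognition time, each probe frame would be matched against every subject's exemplars and the frame-level decisions fused by voting over the sequence.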
Fig. 4. Performance of the considered methods under four different resolutions

5 Conclusion

To overcome the limitations of traditional video-based face recognition methods, we introduced a novel video-to-video matching approach based on manifold learning. Our approach consists of first learning the hidden low-dimensional manifold of each individual. A target face sequence is then projected into each manifold for classification, and the closest manifold determines the identity of the person in the target face video sequence. Experiments on a large set of talking faces under different resolutions showed excellent results, outperforming state-of-the-art approaches. Our future work consists of extending our approach to multi-view face recognition from videos and experimenting with much larger databases.

References

1. Aggarwal, G., Chowdhury, A.R., Chellappa, R.: A system identification approach for video-based face recognition. In: 17th ICPR, Volume 4 (2004) 175-178
2. Liu, X., Chen, T.: Video-based face recognition using adaptive hidden Markov models. In: IEEE Int. Conf. on CVPR (2003) 340-345
3. Lee, K.C., Ho, J., Yang, M.H., Kriegman, D.: Video-based face recognition using probabilistic appearance manifolds. In: IEEE Int. Conf. on CVPR (2003) 313-320
4. Hadid, A., Pietikäinen, M.: Selecting models from videos for appearance-based face recognition. In: 17th ICPR (2004) 304-308
5. Heisele, B., Ho, P., Wu, J., Poggio, T.: Face recognition: Component-based versus global approaches. CVIU 91(1-2) (2003) 6-21
6. Hadid, A., Pietikäinen, M.: An experimental investigation about the integration of facial dynamics in video-based face recognition. ELCVIA 5(1) (2005) 1-13
7. Anonymous
8. Seung, H.S., Lee, D.D.: The manifold ways of perception. Science 290(5500) (2000) 2268-2269
9. Kohonen, T.: Self-Organizing Maps. Springer-Verlag, Berlin (1997)
10. Bishop, C.M., Svensén, M., Williams, C.K.I.: GTM: The generative topographic mapping. Neural Computation 10(1) (1998) 215-234
11. Sammon, J.: A nonlinear mapping for data structure analysis. IEEE Transactions on Computers 18(5) (1969) 401-409
12. Roweis, S.T., Saul, L.K.: Nonlinear dimensionality reduction by locally linear embedding. Science 290(5500) (2000) 2323-2326
13. Tenenbaum, J.B., de Silva, V., Langford, J.C.: A global geometric framework for nonlinear dimensionality reduction. Science 290(5500) (2000) 2319-2323
14. Belkin, M., Niyogi, P.: Laplacian eigenmaps and spectral techniques for embedding and clustering. In: Advances in NIPS 14 (2002) 585-591
15. Sanderson, C.: Biometric Person Recognition: Face, Speech and Fusion. VDM Verlag (2008)
16. Ahonen, T., Hadid, A., Pietikäinen, M.: Face description with local binary patterns: Application to face recognition. IEEE TPAMI 28(12) (2006) 2037-2041