Face Recognition in Low-Resolution Videos Using Learning-Based Likelihood Measurement Model

Soma Biswas, Gaurav Aggarwal and Patrick J. Flynn
Department of Computer Science and Engineering, University of Notre Dame, Notre Dame
{sbiswas, gaggarwa, flynn}@nd.edu

Abstract

Low-resolution surveillance videos with uncontrolled pose and illumination present a significant challenge to both face tracking and recognition algorithms. The considerable appearance difference between the probe videos and the high-resolution controlled images in the gallery acquired during enrollment makes the problem even harder. In this paper, we extend the simultaneous tracking and recognition framework [22] to address the problem of matching high-resolution gallery images with surveillance quality probe videos. We propose using a learning-based likelihood measurement model to handle the large appearance and resolution difference between the gallery images and probe videos. The measurement model consists of a mapping which transforms the gallery and probe features to a space in which their inter-Euclidean distances approximate the distances that would have been obtained had all the descriptors been computed from good quality frontal images. Experimental results on real surveillance quality videos and comparisons with related approaches show the effectiveness of the proposed framework.

This research was funded by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), through the Army Research Laboratory (ARL). The views and conclusions contained in this document are those of the authors and should not be interpreted as representing official policies, either expressed or implied, of IARPA, the ODNI, the Army Research Laboratory, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.

1. Introduction

The wide range of applications in law enforcement and security has made face recognition (FR) a very important area of research in the field of computer vision and pattern recognition. The ubiquitous use of surveillance cameras for improved security has shifted the focus of face recognition from controlled scenarios to the uncontrolled environments typical of surveillance settings [17]. Typically, the images or videos captured by surveillance systems have non-frontal pose and uncontrolled illumination, in addition to low resolution due to the distance of the subjects from the cameras. On the other hand, good high-resolution images of the subjects may be added to the gallery during enrollment. This presents the challenge of matching gallery and probe images or videos which differ significantly in resolution, pose and illumination. In this paper, we consider the scenario in which the gallery consists of one or more high-resolution frontal images, while the probe consists of low-resolution videos with uncontrolled pose and illumination, as is typically obtained in surveillance systems.

Most of the research in video-based face recognition has focused on dealing with one or more challenges like uncontrolled pose, illumination, etc. [23], but there are very few approaches which simultaneously deal with all the challenges. Some of the recent approaches which handle the resolution difference between the gallery and probe are either restricted to frontal images [6] or require videos for enrollment [2]. For video-based FR, a tracking-then-recognition paradigm is typically followed, in which the faces are first tracked and then used for recognition. But both tracking and recognition are very challenging for low-quality videos with low resolution and significant variations in pose and illumination.
In this paper, we extend the simultaneous tracking and recognition framework [22], which performs the two tasks of tracking and recognition in a single unified framework, to address these challenges. We propose using distance learning based techniques for better modeling the appearance changes between the frames of the low-resolution probe videos and the high-resolution gallery images for better recognition and tracking accuracy. Multidimensional Scaling [4] is used to learn a mapping from training images which transforms the gallery and probe features to a space in which their inter-Euclidean distances approximate the distances that would have been obtained had all the descriptors been computed from high-resolution frontal images. We evaluate the effectiveness of the proposed approach on surveillance quality videos from the MBGC data [16]. We observe that the proposed approach performs significantly better in terms of both tracking and recognition accuracy as compared to standard appearance modeling approaches.

The rest of the paper is organized as follows. An overview of the related approaches is given in Section 2. The details of the proposed approach are provided in Section 3. The results of experimental evaluation are presented in Section 4. The paper concludes with a brief summary and discussion.

2. Previous Work

In this section, we discuss the related work in the literature. For brevity, we will refer to high-resolution as HR and low-resolution as LR. There has been a considerable amount of work in general video-based FR addressing two kinds of scenarios: (1) both the gallery and probe are video sequences [11] [13] [10] [18], and (2) the probe videos are compared with one or multiple still images in the gallery [22]. For tracking and recognizing faces in real-world, noisy videos, Kim et al. [10] propose a tracker that adaptively builds a target model reflecting changes in appearance typical of a video setting. In the subsequent recognition phase, the identity of the tracked subject is established by fusing pose-discriminant and person-discriminant features over the duration of a video sequence. Stallkamp et al. [18] classify faces using a local appearance-based FR algorithm for real-time video-based face identification. The confidence scores obtained from each classification are progressively combined to provide an identity estimate for the entire sequence. Many researchers have also addressed the problem of video-based FR by treating the videos as image sets [20]. Most of the current approaches which address the problem of LR still face recognition follow a super-resolution approach.
Given an LR face image, Jia and Gong [8] propose directly computing a maximum likelihood identity parameter vector in the HR tensor space which can be used for recognition and reconstruction of HR face images. Liu et al. [12] propose a two-step statistical modeling approach for hallucinating an HR face image from an LR input. The relationship between the HR images and their corresponding LR images is learned using a global linear model, and the residual high-frequency content is modeled by a patch-based non-parametric Markov network. Several other super-resolution techniques have also been proposed [5] [9]. The main aim of these techniques is to produce a high-resolution image from the low-resolution input using assumptions about the image content, and they are usually not designed from a matching perspective. A Multidimensional Scaling (MDS)-based approach has recently been proposed to improve matching performance for still LR images, but it does not deal with matching an HR gallery image with an LR probe video [3]. Recently, Hennings-Yeomans et al. [6] proposed an approach to perform super-resolution and recognition simultaneously. Using features from the face and super-resolution priors, they extract an HR template that simultaneously fits the super-resolution as well as the face-feature constraints. The formulation was extended to use multiple frames, and the authors showed that it can also be generalized to use multiple image formation processes, modeling different cameras [7]. But this approach assumes that the probe and gallery images are in the same pose, making it not directly applicable to more general scenarios. Arandjelovic and Cipolla [2] propose a generative model for separating the illumination and down-sampling effects for the problem of matching a face in an LR query video sequence against a set of HR gallery sequences.
It is an extension of the Generic Shape-Illumination Manifold framework [1] which was used to describe the appearance variations due to the combined effects of facial shape and illumination. As noted in [7], a limitation of this approach is that it requires a video sequence at enrollment.

3. Proposed Approach

For matching LR probe videos having significant pose and illumination variations with HR frontal gallery images, we propose to use learning-based appearance modeling in a simultaneous tracking and recognition framework.

3.1. Simultaneous Tracking and Recognition

First, we briefly describe the tracking and recognition framework [22] which uses a modified version of the CONDENSATION algorithm for tracking the facial features across the frames in the poor quality probe video and for recognition. The filtering framework consists of a motion model which characterizes the motion of the subject in the video. The overall state vector of this unified tracking and recognition framework consists of an identity variable in addition to the usual motion parameters. The observation model determines the measurement likelihood, i.e., the likelihood of observing the particular measurement given the current state consisting of the motion and identity variables.

Motion Model: The motion model is given by the first-order Markov chain

θ_t = θ_{t-1} + u_t;  t ≥ 1    (1)

Here affine motion parameters are used and so θ = (a_1, a_2, a_3, a_4, t_x, t_y), where {a_1, a_2, a_3, a_4} are deformation parameters and {t_x, t_y} are 2D translation parameters. u_t is the noise in the motion model.
Identity equation: Assuming that the identity does not change as time proceeds, the identity equation is given by

n_t = n_{t-1};  t ≥ 1    (2)

Observation Model: Assuming that the transformed observation is a noise-corrupted version of some still template in the gallery, the observation equation can be written as

T_{θ_t}{z_t} = I_{n_t} + v_t;  t ≥ 1    (3)

where v_t is the observation noise at time t and T_{θ_t}{z_t} is a transformed version of the observation z_t. Here T_{θ_t}{z_t} is composed of (1) an affine transform of z_t using {a_1, a_2, a_3, a_4}, (2) cropping the region of interest at position {t_x, t_y} with the same size as the still template, and (3) performing zero-mean-unit-variance normalization.

In this modified version of the CONDENSATION algorithm, random samples are propagated on the motion vector while the samples on the identity variable are kept fixed. Although only the marginal distribution is propagated for motion tracking, the joint distribution is propagated for recognition purposes. This results in a considerable improvement in computation over propagating random samples on both the motion vector and the identity variable for large databases. The different steps of the simultaneous tracking and recognition framework are given in Figure 1. The mean of the Gaussian distributed prior comes from the initial detector, and its covariance matrix is manually specified. Please refer to [22] for more details of the algorithm.

3.2. Traditional Likelihood Measurement

If there is no significant facial appearance difference between the probe frames and the gallery templates, a simple likelihood measurement like a truncated Laplacian is sufficient [22]. More sophisticated likelihood measurement models like the probabilistic subspace density approach are required to handle greater appearance differences between the probe and the gallery [22]. In this approach, the intra-personal variations are learned using the available gallery and one frame of the video sequences.
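To make the observation model concrete, the following is a minimal numpy/scipy sketch (not the authors' code) of the transformation T_{θ_t}{z_t} in Eq. (3) and of the truncated Laplacian score mentioned above; the exact warp and crop conventions of [22] may differ.

```python
import numpy as np
from scipy.ndimage import affine_transform

def transform_observation(frame, theta, patch_shape):
    """T_theta{z_t} of Eq. (3): affine-warp the frame with {a1..a4},
    crop a template-sized patch, then apply zero-mean-unit-variance
    normalization."""
    a1, a2, a3, a4, tx, ty = theta
    # scipy maps output coordinates through the matrix, so the 2x2
    # deformation block plays the role of the affine parameters.
    warped = affine_transform(np.asarray(frame, dtype=float),
                              np.array([[a1, a2], [a3, a4]]),
                              offset=(ty, tx), order=1)
    h, w = patch_shape
    patch = warped[:h, :w]                 # crop the region of interest
    patch = patch - patch.mean()           # zero mean
    return patch / (patch.std() + 1e-8)    # unit variance

def lap_likelihood(x, sigma, tau):
    """Truncated Laplacian of Sec. 3.2: exponential decay in the residual
    distance x, clipped at tau to be robust to outliers."""
    return np.exp(-min(x / sigma, tau)) / sigma
```

With the identity parameters θ = (1, 0, 0, 1, 0, 0) the patch is simply the normalized top-left crop of the frame; the residual distance between such a patch and a gallery template I_{n_t} is what the likelihood models of Sec. 3.2 then score.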
Usually, surveillance videos have very poor resolution, in addition to large variations in pose and illumination, which results in a decrease in both tracking and recognition performance. Here we propose a multidimensional scaling (MDS)-based approach for computing the measurement likelihood, which results in better modeling of the appearance difference between the gallery and probe and thereby in both better tracking and recognition.

Initialize a sample set S_0 = {θ^{(j)}_0}_{j=1}^J according to the prior distribution p(θ_0 | z_0), which is assumed to be Gaussian. The particle weights for each subject {w^{(j)}_{0,n}}_{j=1}^J, n = 1, ..., N, are initialized to 1. J and N denote the number of particles and subjects respectively.
1. Predict: sample by drawing θ^{(j)}_t from the motion state transition probability p(θ_t | θ^{(j)}_{t-1}) and compute the transformed image T corresponding to the predicted sample.
2. Update: the weights using α^{(j)}_{t,n} = w^{(j)}_{t-1,n} p(z_t | n, θ^{(j)}_t) (measurement likelihood) for each subject in the gallery. The normalized weights are given by w^{(j)}_{t,n} = α^{(j)}_{t,n} / Σ_{n=1}^N Σ_{j=1}^J α^{(j)}_{t,n}. The measurement likelihood is learned from a set of HR training images (Section 3.3).
3. Resample: particles for all subjects are resampled to obtain samples with new weights w̃^{(j)}_{t,n} = w^{(j)}_{t,n} / w^{(j)}_t, where the denominator is given by w^{(j)}_t = Σ_{n=1}^N w^{(j)}_{t,n}. Marginalize over θ_t to obtain the weights for n_t and hence the probe identity.
Figure 1. Simultaneous tracking and recognition framework [22].

3.3. Learning-Based Likelihood Measurement

In this work, we use local SIFT features [14] at seven fiducial locations for representing a face (Figure 2). SIFT descriptors are fairly robust to modest variations in pose and resolution, and this kind of representation has been shown to be useful for matching facial images in uncontrolled scenarios.
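The predict-update-resample loop of Figure 1 can be sketched as follows. This is a simplified illustration under assumed interfaces: a hypothetical `likelihood(theta, n)` callback stands in for the learned measurement model, and the post-resampling reweighting only approximates the scheme of [22].

```python
import numpy as np

rng = np.random.default_rng(0)

def step(particles, weights, likelihood, motion_noise=0.05):
    """One predict/update/resample step of the joint tracking-and-recognition
    filter. particles: (J, 6) affine motion states; weights: (J, N) joint
    particle/subject weights; likelihood(theta, n) -> p(z_t | n, theta)."""
    J, N = weights.shape
    # 1. Predict: first-order Markov motion model theta_t = theta_{t-1} + u_t.
    particles = particles + motion_noise * rng.standard_normal(particles.shape)
    # 2. Update: alpha_{t,n}^{(j)} = w_{t-1,n}^{(j)} * p(z_t | n, theta^{(j)}),
    #    then normalize over all particles and subjects.
    alpha = np.array([[weights[j, n] * likelihood(particles[j], n)
                       for n in range(N)] for j in range(J)])
    weights = alpha / alpha.sum()
    # 3. Resample motion states using the per-particle marginals w_t^{(j)}.
    w_j = weights.sum(axis=1)
    idx = rng.choice(J, size=J, p=w_j)
    particles = particles[idx]
    weights = weights[idx] / weights[idx].sum()   # keep joint weights normalized
    # Posterior over identities: marginalize the joint weights over theta_t.
    posterior = weights.sum(axis=0)
    return particles, weights, posterior
```

Note that only the motion states are resampled; the identity variable is kept fixed per particle column, so the identity posterior falls out of a simple marginalization, which is what makes the joint filter cheap for large galleries.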
But the large variations in pose, illumination and resolution observed in surveillance quality videos result in a significant decrease in recognition performance using SIFT descriptors. The MDS-based approach transforms the SIFT descriptors extracted from the gallery/probe images to a space in which their inter-Euclidean distances approximate the distances had all the descriptors been computed using HR frontal images. The transformation is learned from a set of HR and corresponding LR training images.

Figure 2. SIFT features at fiducial locations used for representing the face.

Let the HR frontal images be denoted by I^{(h,f)} and the LR non-frontal images by I^{(l,p)}. The corresponding SIFT-based feature descriptors are denoted by x^{(h,f)} and x^{(l,p)}. Let f : R^d → R^m denote the mapping from the
input feature space R^d to the embedded Euclidean space R^m:

f(x; W) = W^T φ(x)    (4)

Here φ(x) can be a linear or non-linear function of the feature vectors and W is the matrix of weights to be determined. The goal is to simultaneously transform the feature vectors from I^{(h,f)}_i and I^{(l,p)}_j such that the Euclidean distance between the transformed feature vectors approximates d^{(h,f)}_{i,j} (the distance if both the images were frontal and high resolution). Thus the objective function to be minimized is given by the distance preserving term J_DP, which ensures that the distance between the transformed feature vectors approximates d^{(h,f)}_{i,j}:

J_DP(W) = Σ_i Σ_j (q_{i,j}(W) - d^{(h,f)}_{i,j})^2    (5)

Here q_{i,j}(W) is the distance between the transformed feature vectors of the images I^{(h,f)}_i and I^{(l,p)}_j. An optional class separability term J_CS can also be incorporated in the objective function to further facilitate discriminability:

J_CS(W) = Σ_i Σ_j δ(ω_i, ω_j) q_{i,j}^2(W)    (6)

This term tries to minimize the distance between feature vectors belonging to the same class [21]. Here δ(ω_i, ω_j) = 1 when ω_i = ω_j and 0 otherwise (ω_i denotes the class label of the i-th image). Combining the above two terms, the transformation is obtained by minimizing the following objective function:

J(W) = λ J_DP(W) + (1 - λ) J_CS(W)    (7)

The relative effect of the two terms in the objective function is controlled by the parameter λ. The iterative majorization algorithm [21] is used to minimize the objective function (7) to solve for the transformation matrix W.

Figure 3. Flow chart showing the steps of the proposed algorithm.

To compute the measurement likelihood, the SIFT descriptors of the gallery and the affine-transformed probe frame are mapped using the learned transformation W, followed by computation of the Euclidean distance between the transformed features.
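With the identity kernel φ(x) = x, the minimization of Eq. (7) can be sketched as below. This is a hedged illustration, not the authors' implementation: plain gradient descent stands in for the iterative majorization algorithm of [21], and the array shapes, learning rate and iteration count are illustrative assumptions.

```python
import numpy as np

def learn_mds_mapping(X_hr, X_lr, d_target, labels, lam=0.8, m=100,
                      lr=1e-3, iters=200):
    """Learn W minimizing J(W) = lam*J_DP + (1-lam)*J_CS (Eqs. 5-7) with
    phi(x) = x. X_hr: (n, d) HR frontal descriptors, X_lr: (n, d) LR
    descriptors, d_target[i, j]: target HR-frontal distance for pair (i, j)."""
    n, d = X_hr.shape
    rng = np.random.default_rng(0)
    W = 0.01 * rng.standard_normal((d, m))
    # delta(w_i, w_j): 1 for same-class pairs, 0 otherwise (Eq. 6).
    same = (labels[:, None] == labels[None, :]).astype(float)
    for _ in range(iters):
        G = X_hr @ W                                  # mapped gallery features
        P = X_lr @ W                                  # mapped probe features
        diff = G[:, None, :] - P[None, :, :]          # (n, n, m) pair residuals
        q = np.linalg.norm(diff, axis=2) + 1e-12      # q_ij(W)
        # dJ/dq_ij for the distance-preserving and class-separability terms.
        coeff = 2 * lam * (q - d_target) + 2 * (1 - lam) * same * q
        E = coeff[..., None] * (diff / q[..., None])  # chain rule through q_ij
        gW = (np.einsum('id,ijm->dm', X_hr, E)
              - np.einsum('jd,ijm->dm', X_lr, E))     # gradient wrt W
        W -= lr * gW
    return W
```

At test time the learned W is applied to both the gallery descriptors and the descriptors of each affine-transformed probe frame, and their Euclidean distance in the mapped space drives the measurement likelihood.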
The measurement likelihood p(z_t | n_t, θ_t) is thus computed from the distance

‖W^T [φ(T_{θ_t}{z_t}) - φ(x_{n_t})]‖    (8)

Figure 3 shows a flow chart of the proposed learning-based simultaneous tracking and recognition framework.

4. Experimental Evaluation

In this section, we discuss in detail the experimental evaluation of the proposed approach.

4.1. Dataset Used

For our experiments, we use 50 surveillance quality videos (each 40-100 frames, from 50 subjects) from the
Multiple Biometric Grand Challenge (MBGC) [16] video challenge data for the probe videos. Figure 4 shows some sample frames from a video sequence. Since the MBGC video challenge data does not contain the high-resolution frontal still images needed to form the HR gallery set, we select images of the same subjects from the FRGC data, which has considerable subject overlap with the MBGC data. Figure 5 (top row) shows some sample gallery images from the dataset used, and the bottom row shows cropped face regions from the corresponding probe videos. We see that there is a considerable difference in pose, illumination and resolution between the gallery images and the probe videos.

Figure 4. Example frames from MBGC video challenge [16].

Figure 5. (Top) Example high resolution gallery images; (Bottom) Cropped facial regions from the corresponding low resolution probe videos.

4.2. Recognition and Tracking Accuracy

Here we report both the tracking and recognition performance of the proposed approach. The proposed learning-based likelihood measurement model is compared with the following two approaches for computing the likelihood measurement [22]:

1. Truncated Laplacian likelihood: Here the likelihood measurement model is given by [22]

p(z_t | n_t, θ_t) = LAP(‖T_{θ_t}{z_t} - I_{n_t}‖; σ_1, τ_1)    (9)

Here ‖·‖ is the absolute distance and

LAP(x; σ, τ) = σ^{-1} exp(-x/σ) if x ≤ τσ; σ^{-1} exp(-τ) otherwise.

2. Probabilistic subspace density based likelihood: To handle significant appearance differences between the facial images in the gallery and probe, Zhou et al. [22] proposed using the probabilistic subspace density based approach of Moghaddam et al. [15] due to its computational efficiency and high recognition accuracy. The available gallery and one video frame were used for constructing the intra-personal space (IPS). Using this approach, the measurement likelihood can be written as

p(z_t | n_t, θ_t) = PS(T_{θ_t}{z_t} - I_{n_t})    (10)

where

PS(x) = exp(-(1/2) Σ_{i=1}^s y_i^2/λ_i) / ((2π)^{s/2} Π_{i=1}^s λ_i^{1/2})

Here {λ_i, e_i}_{i=1}^s are the top s eigenvalues and the corresponding eigenvectors obtained by performing regular Principal Component Analysis [19] on the IPS, and y_i = e_i^T x is the i-th principal component of x.

We build upon the code provided on the authors' website (http://www.cfar.umd.edu/shaohua/sourcecodes.html). For all experiments, the kernel mapping φ is set to identity (i.e., φ(x) = x) to highlight just the performance improvement due to the proposed learning approach. Training is done using images from a separate set of 50 subjects. For the computation of the transformation matrix using the iterative majorization algorithm, we observe that the objective function decreases till around 20 iterations and then stabilizes. The value of the parameter λ is set to 0.8 and the output dimension m is set to 100. The number of particles for the particle filtering framework is taken to be 200.

The recognition performance of the proposed approach is shown in Table 1. Comparisons with the two different kinds of likelihood models are also shown. The three approaches label each video as belonging to one of the subjects in the gallery. The recognition rate is calculated as the percentage of correct labels out of all videos. We see that the recognition performance of the proposed learning-based simultaneous tracking and recognition framework is considerably better than the other approaches due to better
modeling of the appearance difference between the gallery and the probe images.

Method                                      Rank-1 Recog. Accuracy   Tracking Accuracy
Truncated Laplacian based likelihood        24%                      4.8
Probabilistic subspace density likelihood   40%                      5.8
Proposed approach                           68%                      2.8

Table 1. Rank-1 recognition accuracy and tracking accuracy (pixels/frame) using the proposed approach. Comparisons with other approaches are also provided.

To compute the tracking error, we manually marked three fiducial locations (the centers of the two eyes and the bottom of the nose) in every fifth frame of each video. For each probe video, we measured the difference between the manually marked ground truth locations and the locations given by the tracker. For a probe video, the tracking error is given by the average difference in the fiducial locations (averaged over all the frames). Figure 6 shows the tracking results for a few frames of a probe video for the proposed approach. Figure 7 shows the tracking error for the proposed approach and for the truncated Laplacian-based and probabilistic subspace density-based likelihood models. We see that for 49 out of 50 videos, the proposed approach achieves a lower tracking error as compared to the other approaches. The mean tracking error (in pixels) over all the probe videos for all the approaches is shown in Table 1.

Figure 7. Average tracking accuracy of the proposed learning-based approach. Comparisons with the other approaches are also provided.

5. Summary and Discussion

In this paper, we consider the problem of matching faces in low-resolution surveillance videos with good high-resolution images in the gallery. Tracking and recognizing faces in low-resolution videos with considerable variations in pose, illumination, expression, etc. is a very challenging problem. Performing tracking and recognition simultaneously in a unified framework, as opposed to first performing tracking and then recognition, has been shown to improve both the tracking and recognition performance. But simple likelihood measurement models like the truncated Laplacian, IPS, etc. fail to give satisfactory performance for cases where there is a significant difference between the appearance of the gallery images and the faces in the probe videos. In this paper, we propose using a learning-based likelihood measurement model to improve both the recognition and tracking accuracy for surveillance quality videos. In the training stage, a transformation is learned to simultaneously transform the features from the poor quality probe images and the high quality gallery images in such a manner that the distances between them approximate the distances had the probe videos been captured in the same conditions as the gallery images. In the testing stage, the learned transformation matrix is used to transform the features from the gallery images and the different particles to compute the likelihood of each particle in the modified particle-filtering framework. Experiments on surveillance quality videos show the usefulness of the proposed approach.

References

[1] O. Arandjelovic and R. Cipolla. Face recognition from video using the generic shape-illumination manifold. In European Conf. on Computer Vision, pages 27-40, 2006.
[2] O. Arandjelovic and R. Cipolla. A manifold approach to face recognition from low quality video across illumination and pose using implicit super-resolution. In IEEE International Conf. on Computer Vision, 2007.
[3] S. Biswas, K. W. Bowyer, and P. J. Flynn. Multidimensional scaling for matching low-resolution facial images. In IEEE International Conf. on Biometrics: Theory, Applications and Systems, 2010.
[4] I. Borg and P. Groenen. Modern Multidimensional Scaling: Theory and Applications. Springer, Second Edition, New York, NY, 2005.
[5] B. Gunturk, A. Batur, Y. Altunbasak, M. Hayes, and R. Mersereau. Eigenface-domain super-resolution for face recognition. IEEE Trans.
on Image Processing, 12(5):597-606, May 2003.
[6] P. Hennings-Yeomans, S. Baker, and B. Kumar. Simultaneous super-resolution and feature extraction for recognition of low-resolution faces. In IEEE Conf. on Computer Vision and Pattern Recognition, pages 1-8, 2008.
Figure 6. A few frames showing the tracking results obtained using the proposed approach. Here only the region of the frames containing the person is shown for better visualization.

[7] P. Hennings-Yeomans, B. Kumar, and S. Baker. Recognition of low-resolution faces using multiple still images and multiple cameras. In IEEE International Conf. on Biometrics: Theory, Applications and Systems, pages 1-6, 2008.
[8] K. Jia and S. Gong. Multi-modal tensor face for simultaneous super-resolution and recognition. In IEEE International Conf. on Computer Vision, pages 1683-1690, 2005.
[9] K. Jia and S. Gong. Generalized face super-resolution. IEEE Trans. on Image Processing, 17(6):873-886, June 2008.
[10] M. Kim, S. Kumar, V. Pavlovic, and H. Rowley. Face tracking and recognition with visual constraints in real-world videos. In IEEE Conf. on Computer Vision and Pattern Recognition, pages 1-8, 2008.
[11] K. C. Lee, J. Ho, M. H. Yang, and D. Kriegman. Video-based face recognition using probabilistic appearance manifolds. In IEEE Conf. on Computer Vision and Pattern Recognition, pages 313-320, 2003.
[12] C. Liu, H. Y. Shum, and W. T. Freeman. Face hallucination: Theory and practice. International Journal of Computer Vision, 75(1):115-134, 2007.
[13] X. Liu and T. Chen. Video-based face recognition using adaptive hidden Markov models. In IEEE Conf. on Computer Vision and Pattern Recognition, pages 340-345, 2003.
[14] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91-110, 2004.
[15] B. Moghaddam. Principal manifolds and probabilistic subspaces for visual recognition. IEEE Trans. on Pattern Analysis and Machine Intelligence, 24(6):780-788, June 2002.
[16] P. J. Phillips, P. J. Flynn, J. R. Beveridge, W. T. Scruggs, A. J. O'Toole, D. S. Bolme, K. W. Bowyer, B. A. Draper, G. H. Givens, Y. M. Lui, H. Sahibzada, J. A. Scallan, and S. Weimer. Overview of the multiple biometrics grand challenge.
In International Conference on Biometrics, pages 705-714, 2009.
[17] P. J. Phillips, W. T. Scruggs, A. J. O'Toole, P. J. Flynn, K. W. Bowyer, C. L. Schott, and M. Sharpe. FRVT 2006 and ICE 2006 large-scale experimental results. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(5):831-846, 2010.
[18] J. Stallkamp, H. K. Ekenel, and R. Stiefelhagen. Video-based face recognition on real-world data. In IEEE International Conf. on Computer Vision, 2007.
[19] M. Turk and A. Pentland. Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1):71-86, 1991.
[20] R. Wang, S. Shan, X. Chen, and W. Gao. Manifold-manifold distance with application to face recognition based on image set. In IEEE Conf. on Computer Vision and Pattern Recognition, 2008.
[21] A. Webb. Multidimensional scaling by iterative majorization using radial basis functions. Pattern Recognition, 28(5):753-759, May 1995.
[22] S. K. Zhou, V. Krueger, and R. Chellappa. Probabilistic recognition of human faces from video. Computer Vision and Image Understanding, 91:214-245, 2003.
[23] W. Zhao, R. Chellappa, P. Phillips, and A. Rosenfeld. Face recognition: A literature survey. ACM Computing Surveys, 35(4):399-458, 2003.