A 2D+3D FACE IDENTIFICATION SYSTEM FOR SURVEILLANCE APPLICATIONS

Filareti Tsalakanidou, Sotiris Malassiotis and Michael G. Strintzis
Informatics and Telematics Institute, Centre for Research and Technology Hellas, 1st Km Thermi-Panorama Rd, 57001 Thessaloniki, Greece
filareti@iti.gr, malasiot@iti.gr, strintzi@eng.auth.gr

Abstract

A novel surveillance system integrating 2D and 3D facial data is presented in this paper, based on a low-cost sensor capable of real-time acquisition of 3D images and associated color images of a scene. Depth data is used for robust face detection, localization and 3D pose estimation, as well as for compensating pose and illumination variations of facial images prior to classification. The proposed system was tested under an open-set identification scenario for surveillance of humans passing through a relatively constrained area. Experimental results demonstrate the accuracy and robustness of the system under a variety of conditions usually encountered in surveillance applications.

1. Introduction

During the last 10 years, state-of-the-art face recognition systems using 2D intensity images have advanced to be fairly accurate under controlled conditions. However, public face recognition benchmarks have shown that their performance degrades dramatically for images subject to pose, illumination or facial expression variations [1]. To improve performance under these conditions, 3D face recognition has been proposed [2]. The use of 3D images for personal identification is mainly motivated by the fact that the 3D structure of the human face can be highly discriminatory and is inherently insensitive to illumination variations and face pigment. Moreover, the use of 3D data can significantly aid facial feature detection, facial pose estimation and pose compensation. These characteristics are highly desirable for applications demanding dependable operation in uncontrolled environments, such as surveillance of public places.
Nevertheless, implementation of 3D face recognition techniques for such applications is not widespread, mainly due to limitations imposed by 3D sensors. Unlike the widely-used CCTV cameras, 3D imaging technologies are not mature enough to provide high image quality at low cost. Moreover, 3D sensors usually have a limited field of view, which constrains their working area to a radius of 1-2 meters for images of acceptable quality. In this paper, a complete 3D face recognition system suitable for small-scale surveillance applications requiring unobtrusive recognition of uncooperative users within a limited area (e.g. monitoring of passengers validating their tickets in metro stations) is proposed. The system relies on a novel low-cost sensor capable of real-time acquisition of 3D images and associated color images of a scene [3]. Unlike feature-based techniques [4], which rely on high-quality depth images usually acquired by 3D scanners, we propose an appearance-based approach that demonstrates reliable performance even with low-cost sensors. The main problem with such methods is the requirement for accurate alignment between probe and gallery images. A pose and illumination compensation scheme is proposed to achieve such geometric and photometric alignment. We also propose the combination of 2D and 3D images for reliable multimodal authentication. The combination of 2D and 3D images for PCA-based face recognition was also proposed in [5], where a database of significant size was used to produce comparative results of face recognition using eigenfaces for 2D, 3D and their combination. This test, however, considered only frontal images captured under constant illumination. In [6], a database containing several appearance and environmental variations was used. The Embedded Hidden Markov Model method was applied to both color and depth images and high recognition rates were reported.
To cope with intrapersonal variations due to viewpoint or lighting conditions, the face database was enriched with artificially generated examples depicting variations in pose and illumination. In this paper, we describe and evaluate a complete face recognition system using a combination of 2D color and 3D
range images captured in real-time. A completely different approach compared to [6] is adopted. Variations of facial images due to changes in pose or illumination are compensated prior to classification. Several novel techniques are proposed which, taking as input a pair of 2D and 3D images, produce a pair of normalized images depicting frontal pose and illumination. Also, a novel face detection and localization method relying exclusively on depth data is presented. The efficiency and robustness of the proposed system are demonstrated on a data set of significant size depicting several variations, representative of the conditions encountered in real-world applications.

The paper is organized as follows. In Section 2 the application scenario and 3D acquisition set-up are described. Pose and illumination compensation algorithms are presented in Sections 3 and 4 respectively. Face classification using 2D and 3D images is outlined in Section 5, while the performance of the system is extensively evaluated in Section 6. Finally, Section 7 concludes the paper.

2. Application scenario

In this paper we propose a watch-list identification system for surveillance applications requiring unsupervised recognition of humans passing through a relatively constrained area. The proposed system employs a novel low-cost sensor capable of quasi-synchronous acquisition of 3D and color images, based on the Color Coded Light Approach [3]. The sensor is built from a standard CCTV color camera and a video projector, both embedded in a mechanical construction. A color coded light pattern is projected on the scene and its deformation on the object surfaces is measured, thus generating a 3D image of the scene. By rapidly alternating between this pattern and a white light, a color image may be captured almost simultaneously (see Figure 1). The resolution of the generated images is 580×780 pixels.
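The registered depth and color pair delivered by the sensor can be turned into a colored 3D point cloud by back-projecting each valid depth pixel through a pinhole camera model. The following minimal sketch assumes hypothetical intrinsic parameters (fx, fy, cx, cy); the actual calibration of the coded-light sensor [3] is not given here.

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth image (in meters) into an N x 3 array of 3D points.

    fx, fy, cx, cy are hypothetical pinhole intrinsics, used only for
    illustration. Zero-valued pixels are treated as undetermined depth
    values and skipped.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    valid = depth > 0
    z = depth[valid]
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy
    return np.stack([x, y, z], axis=1)

# Toy example: a flat surface located 1 meter from the camera
depth = np.full((480, 640), 1.0)
pts = depth_to_point_cloud(depth, fx=600.0, fy=600.0, cx=320.0, cy=240.0)
```

The same (u, v) indices can be used to look up the corresponding color pixel, since the two images are registered.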
The sensor was optimized for the application scenario described below, leading to an average depth accuracy of 0.5mm for objects located about 1 meter from the camera, in an effective working space of 75cm × 75cm × 50cm. According to this scenario, the surveillance system comprises a network of 3D sensors located above specific entrance or exit points. People are allowed to pass through these areas one at a time. The sensor continuously records pairs of depth and color images of the scene located inside its working volume. In parallel, the face recognition software processes each image pair to detect the presence of human faces. We note here that the most important limitation of surveillance systems is that they have to operate in an unconstrained environment and without the user's cooperation. This leads to environmental variations caused by changes in lighting conditions, as well as facial appearance variations due to facial pose. To cope with such variations, a pose and illumination compensation step is subsequently applied to each image pair, resulting in a pair of images depicting frontal pose and illumination. The normalized images are then compared against a watch-list to determine whether the person appearing in these images is one of the enrolled persons (open-set 1:N identification). In case of a positive answer, an alarm signal is sent to the surveillance center. Although there are obviously some limitations on the working conditions under which the surveillance system operates, e.g. large facial poses (more than 35°), the proposed system does not impose any strict constraints on user movement.

3. Face localization and pose compensation

In the proposed system, face detection and localization is based solely on 3D data, thus being relatively insensitive to illumination changes and occlusion of facial features.
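The per-frame flow of the scenario above, from capture to the alarm decision, can be sketched as follows. All function names here are illustrative stand-ins, not the system's actual interfaces; the stubs simply make the control flow concrete.

```python
def detect_face(depth):
    """Stand-in for the depth-only face detector of Section 3."""
    return {"nose_tip": (0, 0, 1.0)} if depth is not None else None

def compensate(color, depth, face):
    """Stand-in for pose and illumination compensation (Sections 3-4)."""
    return color, depth

def match_watch_list(color_n, depth_n, watch_list):
    """Stand-in for the 2D+3D classifier of Section 5.

    watch_list here is a hypothetical dict mapping enrolled person id to the
    similarity score of the normalized probe against that person's template;
    the best match is returned.
    """
    return max(watch_list.items(), key=lambda kv: kv[1])

def process_frame(color, depth, watch_list, threshold):
    face = detect_face(depth)
    if face is None:
        return None                        # no face in the working volume
    color_n, depth_n = compensate(color, depth, face)
    person_id, score = match_watch_list(color_n, depth_n, watch_list)
    # Open-set decision: alarm only if the best score clears the threshold
    return person_id if score > threshold else None
```

A returned id would trigger the alarm signal to the surveillance center; `None` means the probe is rejected as a person not on the watch-list.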
Segmentation of the head from the body relies on modelling the head-torso points as a mixture of Gaussians, while the model parameters are estimated using the Expectation-Maximization algorithm and a priori knowledge of body and head geometry [7].

Figure 1: Pose compensation examples. The original pair of images and the resulting frontal views are shown. In the range image, warmer colors correspond to points closer to the camera, while black pixels correspond to undetermined depth values.

To localize the face and estimate its pose, the tip of the nose is first accurately detected. Then, a 3D line is fitted to the 3D points comprising the ridge of the nose, which defines two of the three degrees of freedom of the face orientation. The third degree of freedom, that
is the rotation angle around the nose axis, is estimated by exploiting the inherent bilateral symmetry of the face [7]. In order to compensate for any head rotation and generate a frontal view of the face, we first define a local coordinate system centered on the tip of the nose and aligned with the estimated face orientation. Then, we align this local frame with a reference coordinate frame using a warping transform. The resulting 3D image contains missing pixels, which can be filled either by using face symmetry or by interpolating their values from neighboring pixels (see Figure 1). Gallery images are also aligned with the reference frame during the training phase of the algorithm; this last alignment is performed manually. The face detection-localization algorithm runs near real-time (about 10 fps). For each frame captured, an image quality measure is computed as follows: the estimated face pose parameters are used to create a normalized frontal depth image (dimensions 40×50) by means of 3D warping. This image is subsequently projected on a face subspace created using a set of training depth images. Based on the distance from the face subspace, a face-ness measure is computed and used as a quality measure [8]. Note that if a non-face surface is in front of the camera, pose estimation will normally fail and thus the resulting quality measure will be very low. From subsequent frames of adequate quality, the best is selected and used for identification. In Figure 2, subsequent frames of an individual's face passing through the system's working volume can be seen; the green rectangular frame corresponds to the result of face detection.

Figure 2: Subsequent frames of an individual's face passing through the system's working volume. The green rectangular frame corresponds to the result of face detection.

4. Illumination compensation

Inspired by recent work on scene relighting for realistic image renderings [9], the proposed illumination compensation method is based on the generation of a new face image relit from a frontal direction. First, we recover the scene illumination from a pair of pose compensated color and depth images. Assuming that the scene is illuminated by a single light source, an example-based regression technique is employed for estimating the non-linear relationship between the image brightness and the light source direction L, using a set of artificially generated bootstrap images. Given the frontal light direction L_0 and the estimate of the light source direction L, the illumination compensated image Ĩ_C is given by multiplication of the input color image I_C with a ratio image:

Ĩ_C(u) = I_C(u) · R(I_D, L_0, u) / R(I_D, L, u)

where I_C, I_D are the pose compensated color and depth images and R is a rendering of the surface with constant albedo [3]. An important advantage of this algorithm is its flexibility in coping with complex illumination conditions by adaptation of the rendering function R. Figure 3 illustrates the relighting of side-illuminated images.

Figure 3: Illumination compensation examples. (a) Original image, (b) R(I_D, L, u), (c) novel image relit by frontal light.

5. Face classification

A multimodal classification scheme integrating 2D and 3D face images is proposed. Two independent classifiers, one for color and one for depth images, are used. Among several state-of-the-art face recognition algorithms examined, the Probabilistic Matching (PM) algorithm [8] gave the lowest error rates, while being computationally efficient. The PM algorithm is applied independently to the pose and illumination compensated depth and color images. The scores computed by each classifier are subsequently normalized using the quadric-line-quadric (QLQ) normalization technique proposed in [10]. QLQ is an adaptive normalization procedure aiming to decrease the overlap of the genuine and impostor distributions, while still mapping the scores in [0, 1]. Fusion of color and depth data is achieved
by simply adding the two normalized scores (simple-sum fusion rule). The block diagram of Figure 4 illustrates the various steps of the face recognition algorithm.

Figure 4: Block diagram illustrating the various steps of the proposed face identification system.

Figure 5: Examples of test images depicting several appearance variations.

6. Experimental results

In this section, we evaluate the efficiency of the proposed identification system using a face database of significant size, compiled under conditions similar to those encountered in real-world applications. The acquired images contain several facial appearance as well as environmental variations: frontal neutral views, pose variations, facial expressions, illumination variations, and images with/without eyeglasses (see Figure 5). In total, more than 3000 image pairs were recorded for 73 enrolled users. The PM algorithm was trained using 2 frontal images per person. Testing was performed using the remaining images. The open-set 1:N identification scenario [11], known as watch-list identification, was adopted for the evaluation of the proposed algorithms. According to this scenario, not every probe (unknown test face image) has a match in the gallery database (set of enrolled users), i.e. there are images in the testing set belonging to individuals not known to the system (impostors). In this case, the recognition task involves a) negative identification (rejection) of people not belonging to the gallery and b) correct identification of people that make up the watch-list [12]. In surveillance applications, the watch-list of wanted individuals is much smaller than the number of people expected to be screened by the system.
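The QLQ normalization and simple-sum fusion used in the classification stage can be sketched as follows. The piecewise quadric-line-quadric mapping below follows the general shape described in [10], but the overlap-zone parameters c and w are assumptions here; the cited method estimates them adaptively from the score distributions.

```python
def qlq_normalize(s, c=0.5, w=0.2):
    """Quadric-line-quadric mapping of a min-max normalized score s in [0, 1].

    Scores inside the assumed genuine/impostor overlap zone [c - w/2, c + w/2]
    are left unchanged (linear segment); scores below it are pushed down and
    scores above it are pushed up by quadratic segments that join the linear
    part continuously, spreading the two distributions apart while keeping
    the output in [0, 1].
    """
    a, b = c - w / 2.0, c + w / 2.0
    if s < a:
        return (s * s) / a                     # quadric through (0,0) and (a,a)
    if s <= b:
        return s                               # linear on the overlap zone
    return 1.0 - (1.0 - s) ** 2 / (1.0 - b)    # quadric through (b,b) and (1,1)

def fuse(color_score, depth_score):
    """Simple-sum fusion of the two QLQ-normalized modality scores."""
    return qlq_normalize(color_score) + qlq_normalize(depth_score)
```

The fused score (in [0, 2]) is then compared against the operating threshold t to accept or reject the probe.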
People not belonging to the watch-list are regarded as impostors, whose negative identification is sought, while people in the watch-list are regarded as enrolled users, whom the system should recognize correctly. For our experiments, the size of the watch-list varies from 5 to 25 subjects out of a set of 73 individuals. System performance is measured over the impostor and watch-list populations using the correct rejection rate and the correct identification rate, respectively. The correct rejection rate is defined as the fraction of impostor probe images for which the similarity measures, derived from their comparison to enrolled users, are all below a threshold t. The correct identification rate is the fraction of enrolled-user test images for which the similarity measure corresponding to their own gallery image (same id) is greater than t. In the following, we report the mean (average) correct identification and correct rejection rates obtained over 200 randomized runs of the algorithm. In each run, the watch-list subjects are randomly selected from the 73 individuals participating in our database. Table 1 shows the mean correct rejection and recognition rates obtained for different watch-list sizes, as well as the corresponding system accuracy (detection and identification rate). For a probe set of N images (test images of both impostors and enrolled users), the accuracy is (average number of correct rejections + average number of correct recognitions)/N. All reported rates are obtained using
a fixed threshold value. The last column of Table 1 tabulates the equal correct rate values of the proposed system. The equal correct rate is defined as the correct rejection rate at the operating point (threshold value) for which the correct rejection and correct recognition rates are equal.

Table 1: Average correct rejection and correct recognition rates and corresponding detection and identification accuracy. The threshold was set to a fixed value.

Watch-List   Correct Recognition   Correct Rejection   Accuracy   Equal Correct
Size         Rate (%)              Rate (%)            (%)        Rate (%)
5            96.92                 94.45               94.62      96.58
10           97.1                  90.03               91.00      96.5
15           97.07                 86.22               88.45      95.82
20           96.90                 83.2                86.88      95.30
25           97.0                  80.1                85.9       95.4

Table 2 tabulates the correct recognition rates obtained for specific values of the correct rejection rate and vice versa. The corresponding detection and identification accuracy is also reported. Figure 6 illustrates the system's performance at different operating points: the average correct recognition rate is plotted against the average correct rejection rate for varying threshold values and different watch-list sizes. It can be observed that as the watch-list size increases, the system's performance gradually decreases, as expected.

Figure 6: Correct recognition rate vs. correct rejection rate for various watch-list sizes.

In Figure 7, the performance of the multimodal color+depth identification system is compared to that of the unimodal systems for a watch-list of 5 subjects. The average correct recognition rate is plotted

Table 2: Average correct recognition rates obtained for specific values of the average correct rejection rate and vice versa.
Watch-List   Correct Recognition   Correct Rejection   Accuracy
Size         Rate (%)              Rate (%)            (%)
5            98.00                 82.98               84.0
             95.00                 98.94               98.67
             92.00                 99.85               99.3
             96.06                 98.00               97.83
             96.90                 95.00               95.02
10           98.00                 75.34               78.45
             95.00                 98.0                97.70
             92.00                 99.73               98.66
             95.1                  98.00               97.62
             96.53                 95.00               95.32
15           98.00                 65.0                72.45
             95.00                 97.42               96.85
             92.00                 99.58               98.03
             94.6                  98.00               97.3
             96.00                 95.00               97.4
20           98.00                 61.03               70
             95.00                 96.0                95.48
             92.00                 99.33               97.36
             93.92                 98.00               96.87
             95.56                 95.00               95.83
25           98.00                 56.70               70.60
             95.00                 95.48               95.50
             92.00                 99.28               96.72
             93.60                 98.00               98.76
             95.4                  95.00               95.0

against the average correct rejection rate for varying threshold values. It is clearly seen that the combined use of depth and color facial data leads to increased identification accuracy and high recognition rates.

7. Conclusions

In this paper, a complete 2D+3D face recognition system based on automatic image normalization algorithms that exploit the availability of 3D information was presented. The proposed system is particularly suitable for small-scale surveillance applications requiring unobtrusive recognition of uncooperative users within a limited area, since it successfully addresses two major limitations of existing applications: pose and illumination variability. Given a pair of color and depth images, the pose of the face is first estimated using depth data only, and is then compensated, resulting in a frontal view. Finally, the scene illumination is recovered from the pair of images, and a new color image,
relit from a frontal direction, is generated. Significant improvements in face classification accuracy were obtained using this scheme, as shown by experimental evaluation based on a database of significant size containing several facial appearance as well as environmental variations.

Figure 7: Correct recognition rate vs. correct rejection rate for a watch-list of 5 subjects. The performance of the multimodal color+depth identification system is compared to the performance of the unimodal systems.

Acknowledgments

This work was supported by the research project PASION - Psychologically Augmented Social Interaction over Networks (contract No. FP6-027654) under the Information Society Technologies (IST) priority of the 6th Framework Programme of the European Community.

References

[1] P. J. Phillips, P. J. Grother, R. J. Micheals, D. M. Blackburn, E. Tabassi, and J. M. Bone, Face recognition vendor test 2002, Evaluation report, Defense Advanced Research Projects Agency and National Institute of Justice, March 2003.

[2] K. W. Bowyer, K. Chang, and P. J. Flynn, A survey of approaches and challenges in 3D and multi-modal 3D+2D face recognition, Computer Vision and Image Understanding, Vol. 101, No. 1, pp. 1-15, Jan. 2006.

[3] F. Tsalakanidou, F. Forster, S. Malassiotis, and M. G. Strintzis, Real-time acquisition of depth and color images using structured light and its application to 3D face recognition, Elsevier Real-Time Imaging, Special Issue on Multi-Dimensional Image Processing, Vol. 11, No. 5-6, pp. 358-369, Oct.-Dec. 2005.

[4] T. K. Kim, S. C. Kee, and S. R. Kim, Real-time normalization and feature extraction of 3D face data using curvature characteristics, in Proc. 10th IEEE Int. Workshop on Robot and Human Interactive Communication, Sept. 2001, pp. 74-79.

[5] K. I. Chang, K. W. Bowyer, and P. J. Flynn, Face recognition using 2D and 3D facial data, in Proc. ACM Workshop on Multimodal User Authentication, Santa Barbara, California, Dec. 2003, pp. 25-32.

[6] F. Tsalakanidou, S. Malassiotis, and M. G. Strintzis, Face localization and authentication using color and depth images, IEEE Trans. on Image Processing, Vol. 14, No. 2, pp. 152-168, Feb. 2005.

[7] S. Malassiotis and M. G. Strintzis, Robust face recognition using 2D and 3D data: Pose and illumination compensation, Pattern Recognition, Vol. 38, No. 12, pp. 2537-2548, Dec. 2005.

[8] B. Moghaddam, W. Wahid, and A. Pentland, Beyond eigenfaces: probabilistic matching for face recognition, in Proc. Int. Conf. on Automatic Face and Gesture Recognition (FGR 98), Nara, Japan, April 1998, pp. 30-35.

[9] L. Zhang and D. Samaras, Face recognition under variable lighting using harmonic image exemplars, in Proc. IEEE Int. Conf. on Computer Vision and Pattern Recognition (CVPR 03), June 2003, Vol. 1, pp. 19-25.

[10] R. Snelick, U. Uludag, A. Mink, M. Indovina, and A. K. Jain, Large-scale evaluation of multimodal biometric authentication using state-of-the-art systems, IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 27, No. 3, pp. 450-455, March 2005.

[11] P. Grother, Face Recognition Vendor Test 2002, Supplemental Report, NISTIR 7083, National Institute of Standards and Technology, US, Feb. 2004.

[12] F. Li and H. Wechsler, Open set face recognition using transduction, IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 27, No. 11, pp. 1686-1697, Nov. 2005.