
NEW MULTI-BIOMETRIC APPROACHES FOR IMPROVED PERSON IDENTIFICATION

A Dissertation
Submitted to the Graduate School of the University of Notre Dame
in Partial Fulfillment of the Requirements for the Degree of
Doctor of Philosophy

by
Kyong I. Chang, B.S., M.S.

Kevin W. Bowyer, Director
Patrick J. Flynn, Director

Graduate Program in Computer Science and Engineering
Notre Dame, Indiana
December 2004

NEW MULTI-BIOMETRIC APPROACHES FOR IMPROVED PERSON IDENTIFICATION

Abstract

by Kyong I. Chang

Multiple-modality biometric approaches are proposed that integrate two-dimensional face appearance with ear appearance, three-dimensional face shape, and the pattern of heat emission from the face. A biometric recognition method based on a single source, such as the face, has been shown to improve its identification rate when other biometric sources are incorporated. The investigation of multi-modal biometrics involves a variety of sensors. For the recognition task, each sensor captures a different aspect of human facial features: appearance represents the levels of brightness produced by surface reflectance under a light source, shape data represent depth values defined at points on an object, and infra-red imagery represents the pattern of heat emitted from an object. The results of the multiple-biometric approaches shown in this investigation appear to support the conclusion that the path to higher accuracy and robustness in biometrics lies in the use of multiple biometrics rather than in the best possible sensor and algorithm for a single biometric. A new evaluation scheme is designed to assess the improvement gained by multiple biometrics. Because multi-modal recognition employs multiple samples of facial data, it is also possible that the improvement is achieved simply by considering multiple samples, rather than multiple modalities, for recognition. Therefore, this evaluation scheme determines the recognition accuracy gained by the multiple-modality approach and by the multiple-sample approach separately.

Also, a new algorithm for 3D face recognition is proposed for handling expression variation. It uses a surface registration-based technique for 3D face recognition. We evaluate and compare the performance of approaches to 3D face recognition based on PCA and on the iterative closest point (ICP) algorithm. The proposed 3D face recognition method is fully automatic, locating the facial regions used to initialize the 3D matching. The evaluation results show that the proposed algorithm substantially improves performance in the case of varying facial expression. This is the first study to compare the PCA and ICP approaches to 3D face recognition, and the first to propose a multiple-region approach to coping with expression variation in 3D face recognition. The proposed method outperforms 3D eigenfaces when 3D face scans are acquired at different times, both without and with expression changes.

DEDICATION

To my wife Jung-A, our lovely children June and Hoon

CONTENTS

FIGURES
TABLES
CHAPTER 1: A SURVEY OF 3D FACE RECOGNITION AND MULTIPLE BIOMETRICS METHODS
  1.1 Introduction
  1.2 Survey of 3D Face Recognition
  1.3 Survey of Multi-biometrics
  1.4 Summary
  1.5 Conclusion
CHAPTER 2: MULTIPLE BIOMETRICS USING FACE AND EAR
  2.1 Introduction
  2.2 Eigen-faces and Eigen-ears
  Experimental Results: Face versus Ear
  Experimental Results: Face Plus Ear Multi-modal Biometric
  Summary and Discussion
CHAPTER 3: MULTI-MODAL 2D+3D FACE RECOGNITION
  Introduction
  Methodology
    Normalization
      Normalization of 2D Images
      Normalization of 3D Images
    Eigenvector Tuning
    Distance Metrics
    Data Fusion
  Data Collection
  Experiments
    3.4.1 X Y Resolution Effects in Identification Accuracy
    3.4.2 Depth Resolution Effects in Identification Accuracy
  Experimental Results: 2D versus 3D face - Single probe study
  Experimental Results: Multi-modal biometrics using 2D and 3D
  Experimental Results: 2D face versus 3D face in biometrics - multiple probe study
  Summary and Discussion
CHAPTER 4: MULTIPLE BIOMETRICS USING 2D, 3D AND INFRA-RED FACE
  Introduction
  Methodology
    Normalization of IR Images
    Eigenvector Tuning
    Data Fusion
  Data Collection
  Experiments
  Experimental Results: Single biometrics
  Experimental Results: Multiple biometrics
  Summary and Discussion
CHAPTER 5: MULTI-MODAL 2D+3D VERSUS MULTI-SAMPLE 2D
  Introduction
  Methodology
    Data Collection
    Eigenvector Tuning
  Experiments
    Single-Modality and Single-Sample (SMSS)
    Single-Modality and Multiple-Sample (SMMS)
    Performance Effect on Number of Samples in SMMS
    Multiple-Modality and Single-Sample (MMSS)
  Results and Summary
  Limitations of Current 3D Sensing Technology
  Illumination Invariance
  Summary and Discussion
CHAPTER 6: A LOCAL SURFACE BASED 3D FACE RECOGNITION ALGORITHM
  Introduction
  Related Works
    Baseline Performance: ICP and PCA
    Facial Expression Analysis
  6.3 Methods Description
    Overall Framework
    Skin Region Detection
    Preprocessing
    Face Segmentation Using Surface Curvature
      Surface Curvature
      Local Curvature Estimation
      Surface Segmentation
      Surface Classification
    Searching for Regions of Interest
    Pose Correction
    Surface Extraction
    Face Matching in Identification
  Data Collection
  Experiment
    Performance Effects on Time Variation
    Performance Effects on Expression Variation
    Scalability of 3D Face Recognition
    Multiple Probe Study
  Summary and Discussion
    Multiple Surfaces for Matching
    Feasibility of Other Areas for Matching
CHAPTER 7: CONCLUSIONS AND FUTURE RESEARCH
  Contributions of Dissertation
  Future Research
APPENDIX A: INTRODUCTION TO SURFACE CURVATURE
  A.1 Gaussian Curvature and Mean Curvature
  A.2 Local Coordinate System Using PCA
  A.3 Least Squares Fit
  A.4 3D Geometric Surface Feature - Curvature Estimation
BIBLIOGRAPHY

FIGURES

1.1 Example 2D appearance image and different representations of the 3D data of the same person
2.1 Illustration of points used for geometric normalization of face and ear images. The Triangular Fossa is the upper point on the ear image, and the Antitragus is the lower point
2.2 An example of the gallery and probe face and ear images used in this study
2.3 Recognition performance comparison between face and ear
2.4 Recognition performance of face, ear and combined face-ear
2.5 Performance based on different selection of eigenvectors in face and ear spaces
3.1 Landmark (control) points specified in (A) 2D image and (B) 3D image
3.2 2D Pose normalization. The input image is rotated around the optical axis using two manually selected points on the eye outer tips so that the line passing through the two eye points is parallel to a horizontal line (X-axis). Once the pose is corrected, the image is scaled so that the distance between those two points is approximately 110 pixels. Once the face region is extracted, histogram equalization is applied to normalize the brightness level. Finally, a template mask is applied, producing a final image in (D)
3.3 3D Pose normalization
3.4 After a face region is extracted, spikes are removed first and then missing data points are defined by linear interpolation
3.5 Examples of masked images in 2D and 3D
3.6 2D and 3D images in different X Y (spatial) resolutions
3.7 Experiment in X Y (spatial) resolution changes
3.8 Example of images in different depth resolutions (in mm). They are shown from top to bottom and left to right: 10^-6, 10^-5, 10^-4, 10^-3, 10^-2, 10^-1, 0.5, 1, 2, 3, 4, 5, 6, 7 (mm)
3.9 Performance results in different depth resolutions
3.10 Performance results in single probe study
3.11 Performance results in multiple probe study
3.12 Performance results of fusion schemes used
4.1 Examples of 2D, 3D (2nd row) and IR (3rd row) faces; one gallery image and four probe images of a subject. Dark intensity value represents greater depth value in the 3D image and lower temperature in the IR image
4.2 The eigenvector images of the seven largest eigenvalues used for each facespace
4.3 Single biometrics performance results in ROC and CMC curves
4.4 Multi-modal biometrics performance results in ROC and CMC curves
4.5 Example matches improved by integrated schemes; while the 3 images shown in the left column are the gallery images of each modality, the 3 images in the right column are the probe images
5.1 Two different sessions acquired 6 weeks apart. Four 2D images (left to right: FALM, FBLM, FALF and FBLF) and one 3D image (rightmost: FALT) of a subject are acquired in each session
5.2 Rank-one performance rates during the eigenvector tuning process. M is 3 due to the early performance drop shown in 3D. N is chosen at 6 since all the rank-one recognition rates maintained the original performance rate up to that point
5.3 Three biometric approaches experimented in this study
5.4 This SMMS approach makes a decision based on 4 matches when a subject is represented by 2 images in the gallery and the probe
5.5 Example matches improved by integrated schemes; while the matches shown in the left column illustrate that both SMMS and MMSS improved the baseline performance rates, the matches in the right column show the effectiveness of the MMSS scheme
5.6 The baseline performance rates of the multiple probe study reported in CMC and ROC curves. (The SMSS result with FALM:FALM, the SMMS result with FALM FALF:FALM FALF based on four matches and the MMSS result with FALM FALT(3D):FALM FALT(3D) for a multiple probe study are reported.)
5.7 Example of 3D shape models of the same person produced under different lighting conditions by three different current commercial 3D sensors
6.1 Sample images (a gallery on the left and a probe on the right in each column) used for the baseline performance. Nearly the entire face region is considered for both recognition methods. For the ICP-baseline, the probe face has approximately 10% smaller coverage area than the face in the gallery set
6.2 Performance degradation with expression change. Recognition rates when the probe has an expression change are approximately 30% lower than without an expression change. Probes with neutral expression (top), probes with non-neutral expression (bottom)
6.3 Example images in 2D and 3D with different expressions. There are seven different expressions considered in this study. Left to right, these are neutral, angry, happy, sad, surprised, disgusted and puffy cheek. The first six of these correspond to some of Ekman's basic human expressions [32]
6.4 The overview of the proposed method
6.5 Face region extraction
6.6 The outlier removal process. (A) outliers are scattered to the back of the face. (B) outliers are removed (subsampled version)
6.7 Images of a single person with different expressions rendered based on surface types. Many surface types change as the deformation is introduced. As the cheeks are lifted, shown in happy and disgusted, peaks are detected at the upper cheeks on both sides or in the lips
6.8 Example of different surface types classified and detected. The pit regions in the eye cavities are found first, followed by the nose tip classified as a peak region. Lastly, the nose bridge is detected. The ROIs are automatically extracted based on the locations of these regions
6.9 Two examples of pose correction. Each pair of two sub-regions is registered and a transformation matrix is produced
6.10 Matching surface extraction for a gallery and three probes
6.11 As three local surfaces are matched against a gallery, different fusion strategies may be considered to combine the three RMS values produced by each probe surface
6.12 Rank-one recognition with the same neutral expression as the gallery. The number in parentheses is the number of probe images. The product rule obtained 94.6% by ICP-auto and 96.5% by ICP-manual. Refer to Section for the baseline rates
6.13 Rank-one identification results obtained with different expressions. The product rule obtained 81.6% by ICP-auto and 86.1% by ICP-manual. Refer to Section for the baseline rates
6.14 Rank-one identification rates obtained for probes of different sizes. Probes with neutral expression (top), probes with non-neutral expression (bottom)
6.15 Multiple probe study with neutral and non-neutral expressions performed by different methods
6.16 Two pairs of distributions based on expression changes in probe sets. The arrow shows the increase in the distance metric of correct matches as expressions are varied. The distributions in this figure are generated with the PCA-baseline
6.17 One of the cases where localized face regions for matching are more advantageous than using the whole face, which is obscured by hair. This subject was incorrectly identified with the PCA-based method but was successfully matched by our new ICP-based method. Problems like facial occlusion due to hair or mustache can be resolved by local region based matching. The intensity image is shown for illustration purposes only
6.18 The shapes (in different sizes) that we attempted for relatively static regions; performance rates obtained with these shapes were lower than with the shapes used in Figure
6.19 Other locations also show relatively static movement under facial expression changes; however, they are not ideal candidate regions for matching. (A) a chin obscured by beard, (B) a forehead obscured by hair, (C) unreliable sensing in regions near the silhouette, such as the temples
6.20 Example cases where the automatic facial feature finding method failed (132 out of 4,485 scans) to extract ROI regions for face matching. Failed in (A) skin detection, (B) pose correction, (C) ROI extraction

TABLES

1.1 3D FACE RECOGNITION STUDIES
1.2 3D FACE RECOGNITION STUDIES (CONT.)
1.3 MULTIPLE BIOMETRICS STUDIES
1.4 MULTIPLE BIOMETRICS STUDIES (CONT.)
A NUMBER OF EIGENVECTORS USED TO CREATE THE EIGENSPACE
RANK-ONE RECOGNITION RATES ACHIEVED BY FUSION METHODS IN 2D+3D+IR
SMSS RANK-ONE RECOGNITION RATES. Labels written in the top row (in italics) represent gallery sets and labels in the first column represent probe sets
SMMS RANK-ONE RATES IN 2-GALLERY AND 2-PROBE MODE. Rank-one rates are produced by a different number of matches that use 2 gallery images and 2 probe images. Rates written inside the parentheses are based on two matches and other rates are based on four matches. Labels written in the top row (in italics) represent gallery sets and labels in the first column represent probe sets. Results are obtained by a sum rule after a linear score normalization
SMMS RANK-ONE RATES IN 3-GALLERY AND 3-PROBE MODE. Rank-one rates are produced by a different number of matches while 3 galleries and 3 probes are available. Rates written inside the parentheses are based on three matches and other rates are based on nine matches. Labels written in the top row (in italics) represent gallery sets and labels in the first column represent probe sets. Results are obtained by a sum rule after a linear score normalization
MMSS RANK-ONE RECOGNITION RATES. Labels written in the top row (in italics) represent gallery sets and labels in the first column represent probe sets. Results are obtained by a sum rule after a linear score normalization
6.1 H AND K SIGN TEST FOR SURFACE CLASSES [8]
6.2 STATISTICS OF THE DATASET USED IN THIS STUDY
6.3 DEMOGRAPHY OF PARTICIPANTS BY AGE
6.4 DEMOGRAPHY OF PARTICIPANTS BY RACE
6.5 DEMOGRAPHY OF PARTICIPANTS BY GENDER
6.6 ICP-AUTO RESULTS BY EACH PROBE SURFACE IN NEUTRAL EXPRESSION
6.7 ICP-AUTO RESULTS BY EACH PROBE SURFACE IN NON-NEUTRAL EXPRESSION

CHAPTER 1

A SURVEY OF 3D FACE RECOGNITION AND MULTIPLE BIOMETRICS METHODS

1.1 Introduction

Evaluations such as the Face Recognition Vendor Test 2002 [66] make it clear that the current state of the art in face recognition is not yet sufficient for the more demanding applications, such as surveillance at airports or border crossings. (This chapter is based on the paper "A Survey of Approaches to Three-Dimensional Face Recognition," presented at the International Conference on Pattern Recognition.) However, biometric technologies that currently offer greater accuracy, such as fingerprint and iris, require much greater explicit cooperation from the user. For example, fingerprint recognition requires that the subject cooperate in making physical contact with the sensor surface. This raises issues of how to keep the surface clean and germ-free in a high-throughput application. Iris imaging currently requires that the subject cooperate to carefully position their eye relative to the sensor. This can also cause problems in a high-throughput application. Thus it appears that there is significant potential application-driven demand for improved performance in face recognition.

The term face recognition is used informally to refer to two different application scenarios, one of which is called recognition, identification, or 1-to-many matching, and the other of which is called authentication, verification, or 1-to-1 matching.

In either scenario, face images of known persons are initially enrolled into the system, and this set of persons is sometimes referred to as the gallery. Later, images of these or other persons are used as probes to match against images in the gallery. In a recognition scenario, the matching is one-to-many, in the sense that a probe is matched against all of the gallery to find the best match above some threshold. In an authentication scenario, the matching is one-to-one, in the sense that the probe is matched against the gallery entry for a claimed identity, and the claimed identity is taken to be authenticated if the quality of the match exceeds some threshold. The recognition scenario is more technically challenging than the authentication scenario. One reason is that in a recognition scenario a larger gallery tends to present more chances for incorrect recognition. Another reason is that the whole gallery must be searched in some manner on each recognition attempt. Most research results are presented in the context of either recognition or authentication, but the core 3D representation and matching issues are essentially the same. In fact, the raw matching scores underlying the cumulative match characteristic (CMC) curve for a recognition experiment can readily be tabulated in a different format to produce the receiver operating characteristic (ROC) curve for an authentication experiment. The CMC curve summarizes the percent of a set of probes that is considered to be correctly matched as a function of the match rank that is counted as a correct match. The rank-one recognition rate is the most commonly stated single number from the CMC curve. The ROC curve summarizes the percent of a set of probes that is correctly authenticated as a tradeoff against the percent that is incorrectly denied authentication. The equal error rate (EER) is the most commonly stated single number from the ROC curve; it is the point on the ROC curve at which the false accept rate (FAR) equals the false reject rate (FRR).
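To make these summary statistics concrete, the sketch below shows one way the rank-one recognition rate and an approximate equal error rate can be computed from raw match scores. It is an illustrative example only, not the evaluation code used in this dissertation; the variable names, the toy data, and the convention that larger scores mean better matches are assumptions.

```python
import numpy as np

def rank_one_rate(scores, probe_ids, gallery_ids):
    """Fraction of probes whose best-scoring gallery entry has the correct identity.

    scores[i, j] is the similarity of probe i to gallery entry j (larger = better).
    The full CMC curve generalizes this to "correct match within the top k ranks".
    """
    best = np.argmax(scores, axis=1)
    return np.mean(gallery_ids[best] == probe_ids)

def equal_error_rate(genuine, impostor):
    """Approximate EER: the operating point where the false accept rate
    (impostor scores at or above threshold) equals the false reject rate
    (genuine scores below threshold)."""
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    far = np.array([np.mean(impostor >= t) for t in thresholds])
    frr = np.array([np.mean(genuine < t) for t in thresholds])
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0

# Toy example: 3 probes matched against a 3-person gallery.
gallery_ids = np.array([0, 1, 2])
probe_ids = np.array([0, 1, 2])
scores = np.array([[0.9, 0.2, 0.1],
                   [0.3, 0.8, 0.4],
                   [0.2, 0.6, 0.5]])   # probe 2 is matched incorrectly
print(rank_one_rate(scores, probe_ids, gallery_ids))  # prints 0.666...
```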

The vast majority of face recognition research, and all of the major commercial face recognition systems, use normal intensity images of the face. We will refer to these as 2D images. In contrast, a 3D image of the face is one that represents three-dimensional shape. A distinction can be made between representations that include only the surface of the face and those that include the whole head. In this distinction, the face surface would be 2.5D and the whole head would be 3D. We will ignore this distinction here, and refer to the shape of the face surface as 3D. The 3D shape of the face is commonly sensed in combination with a 2D intensity image. In this case, the 2D image can be thought of as a texture map overlaid on the 3D shape. An example of a 2D image and the corresponding 3D shape represented in different views is shown in Figure 1.1, with the 3D shape in the form of points defined as depth measurements from a sensor and in the form of a shaded 3D model. A depth image, also sometimes called a range image, is an image in which the pixel value reflects the distance from the sensor to the imaged surface. A range image, a shaded model, and a wire-frame mesh are common alternatives for displaying 3D face data. A recent survey of face recognition research is given in [88], but it does not include research efforts based on matching 3D shape. Here, we focus specifically on algorithms that match 3D shape descriptions. We are particularly interested in 3D face recognition because it is commonly thought that the use of 3D has the potential for greater recognition accuracy than the use of 2D.

Figure 1.1. Example 2D appearance image and different representations of the 3D data of the same person (from left to right: 2D intensity, 3D points, 3D mesh, 3D surface).

For example, one paper states, "Because we are working in 3D, we overcome limitations due to viewpoint and lighting variations" [58]. Another paper describing a different approach to 3D face recognition states, "Range images have the advantage of capturing shape variation irrespective of illumination variabilities" [43]. Similarly, a third paper states, "Depth and curvature features have several advantages over more traditional intensity based features. Specifically, curvature descriptors 1) have the potential for higher accuracy in describing surface based events, 2) are better suited to describe properties of the face in areas such as the cheeks, forehead, and chin, and 3) are viewpoint invariant" [39].

1.2 Survey of 3D Face Recognition

This section reviews 3D (only) face recognition methods. Face recognition in computer vision can be described as an instance of the object recognition problem. In the early 1980s, as range sensors that acquire depth information became available, the level of expectation for good solutions to the object recognition problem increased in the computer vision research community. There is growing interest in 3D face recognition in the biometric community, for several reasons: (1) cost reductions in 3D sensors and increases in the computing power available to process 3D data, (2) certain limitations of 2D image-based approaches, and (3) the possibility of integrating 3D into an existing biometric system to improve recognition accuracy. The availability of depth information about a scene would seem only to provide additional information that can be exploited to solve the problem. Early work on 3D face recognition was done over a decade ago, e.g. [20, 39, 53, 62]. Recent years have seen a huge growth of interest in this area, as evidenced in Tables 1.1, 1.2, 1.3 and 1.4.

However, the amount of published work on 3D face recognition is still small relative to the enormous body of work dealing with face recognition using 2D intensity images. We have attempted to identify and discuss only the most recent and readily accessible publications. A level of expectation similar to that seen previously in the object recognition community can currently be observed in the biometric research community. Assertions commonly made about the advantage of 3D over 2D in the context of face recognition include pose invariance and illumination invariance. However, one of the critical elements of 3D data is sensing, and issues in sensing have generally been overlooked when claiming such advantages.

Cartoux et al. [20] introduced 3D face recognition through the independent use of the 3D face and the profile. They do not combine these modes explicitly; rather, the profile is used to aid the overall process of face matching. First, a symmetry plane of the face through the nose tip is detected and profiles (curves formed by intersecting a plane with the face surface) are obtained. The location of a profile is iteratively refined to produce the transformation matrix later used for face matching. Separate results are presented for profile (curve) matching and for face (surface) matching using the nearest neighbor of each face surface. They claim that 100% accuracy is achieved on their dataset and that performance is affected by data quality more for the profile than for the frontal face.

Lee and Milios proposed to represent faces by Extended Gaussian Images (EGI) for matching two face surfaces [53]. Faces are segmented by the signs of the principal curvatures into a set of convex regions (nose, cheek, chin and forehead), over each of which an EGI is defined. The similarity of two regions is measured during matching using the correlation coefficient of their EGIs. An OR rule is used to determine the correct match among those convex regions.

TABLE 1.1

3D FACE RECOGNITION STUDIES

(Each entry lists source (year) [ref], numbers of subjects and images, data variations, reported performance and sensor type.)

Cartoux ('89) [20]: 5 subjects; 18 training images (3-4 ea.); none; 100%; range sensor
Lee ('90) [53]: 6 subjects; 6 images; none; not reported; range sensor
Gordon ('91) [38]: 26 train / 8 test subjects; 26 train / 24 test images (3 ea.); none; 100%; Cyberware
Nagamine ('92) [62]: (10 images ea.); none; 100%; range sensor
Achermann ('97) [4]: 24 subjects; 120 test images (5 ea.); 5 poses; not reported; range sensor
Chua ('98) [30]: (4 images ea.); expression variation; range sensor
Tanaka ('98) [77]: 37 subjects; none; not reported; range sensor
Hesher ('03) [44]: (6 images ea.); expression variation; range sensor
Lao ('00) [52]: (4 images ea.); poses and 2 illuminations; 91%; 3-camera stereo

TABLE 1.2

3D FACE RECOGNITION STUDIES (CONT.)

(Each entry lists source (year) [ref], numbers of subjects and images, data variations, reported performance and sensor type.)

Lee ('03) [54]: (2 images ea.); time variation; 94% at rank 5; range sensor
Medioni ('03) [58]: (7 images ea.); pose variation; 2-camera stereo
Moreno ('03) [61]: (7 images ea.); expression and 3 poses; 78%; range sensor
Pan ('03) [64]: pose variation; 3% to 7% EER; range sensor
Xu ('04) [86]: expression variation; 96% on 30 subjects, 72% on 120 subjects; range sensor
Russ ('04) [72]: time variation; 98% on 200 subjects; range sensor
Lu ('04) [55]: expression and pose variation; 96%; range sensor

Gordon [38] proposed geometrical features (principal curvatures), surface descriptors (ridge lines and valley lines) and structural descriptors (location, width and height) to characterize a face modeled as a surface. Umbilic points (points on a surface at which the curvature is the same in every direction) are also calculated to obtain rich information for describing the human face. Depth template comparison and feature difference comparison are applied to both source and target surfaces, and the volume difference is calculated to decide the identity of the subject in question. Despite such sophisticated features, rather limited experimentation is reported in the evaluation. It is also not clear how the method automatically locates the eyes starting from a nose position.

Nagamine et al. proposed a 3D face identification method using the curves formed by intersecting a vertical plane, a horizontal plane and a cylinder with the face surface [62]. After averaging nine images (out of ten available) for each subject, feature vectors consisting of section curves are extracted. In matching, the Euclidean distance is computed between the feature vector set of the averaged image and all others. While both the vertical profile and the circle around the nose achieve 100% accuracy, the horizontal profile achieves a 96.3% recognition rate. The location of the profile needs to be extracted accurately and repeatably for such an approach to be considered robust.

Achermann et al. compared the use of eigenface and Hidden Markov Model (HMM) methods on range images [4]. The motivation of this study is to make a face recognition system invariant to illumination or visual appearance changes by using 3D range data, in which shape representations of a face are possible. Given 3D range data points from two views, one view is merged into the other using a z-buffering technique.

In the experiment, data for 24 subjects with 5 different angle views were compiled. The results show that the eigenface method on range images performs better than the HMM method.

Chua et al. utilized the point signature, a representation for free-form surfaces, to characterize the surface constructed from a 3D human face [30]. The proposed method first registers faces with different expressions by using point signatures in the rigid surface regions of the face, since the point signature represents rigid 3D objects. Once the model is constructed, an adaptive threshold is used to classify a given surface area as rigid or non-rigid. In the matching process, the point signatures are compared in a voting scheme in order to verify the observed model among the others in the model database.

Tanaka et al. treat 3D face recognition as a 3D shape recognition problem involving free-form curved surfaces [77]. Each 3D face is represented by an Extended Gaussian Image (EGI) and transformed into a set of 3D vectors containing the principal curvatures and their directions. The points used are sampled at high curvature values (12% of the total surface points) and are shown to be effective. Two unit spheres are constructed from the minimum and maximum curvatures, and the similarity measure uses Fisher's spherical correlation in the matching process. They claim that the method is robust to external factors such as glasses, facial hair or hair style. The results for the 37 subjects are not clearly stated; however, they report that the average similarity for a correct match is 44% and for incorrect matches is 13%.

Hesher et al. examined rendered 3D frontal images with six expressions (the types of expressions are not reported) for recognition using PCA and independent component analysis (ICA) [44]. After the images are geometrically normalized, each face is translated to the center of the X Y plane and to a predefined Z position.

The size of the training set is varied in the experiments. Their results show that the best performance is achieved when the largest training set is used together with ICA using the first 10 independent components. On their dataset, ICA performs slightly better than PCA. However, performance changes with respect to different expressions are not reported.

Lao et al. proposed 3D template matching for face recognition using face models acquired by a 3-camera stereo system [52]. Once the face models are constructed, the system searches for the irises and the center of the mouth. The locations of these parts are used to correct pose variations. The template matching involves (i) placing the models in the same coordinate system, (ii) partitioning the models into non-overlapping regions of 5mm x 5mm, (iii) computing the mean distance between local regions, and (iv) selecting the smallest mean depth distance as the answer. An overall accuracy rate is not reported; they claim that the method shows reliable performance under pose variations (tested at +/- 15 and 30 degrees in both left-right and up-down directions), ranging from 87% to 96% for one subject.

Statistical features of 3D face data are used by Lee et al. to recognize a person's identity [54]. First, the nose tip is detected using an adaptive threshold. Once the face is geometrically normalized, the enclosed boundary of the face region is extracted for further computation. Average and variance features are computed with a 5 x 5 kernel after the data points are subsampled. The similarity measure for matching considers the Euclidean distance at given contour-line threshold values. The results show that the overall performance improves when a larger facial area, defined by larger contours, is used.

Medioni and Waupotitsch used a stereo camera system to create face models.

The models are then used for face identification and verification based on statistics produced by a surface matching between two meshes [58]. The system is evaluated in experiments under measured lighting conditions, and its accuracy is measured under different poses.

The 3D face recognition system proposed by Moreno et al. [61] involves (1) segmentation of predefined regions in terms of (two) lines and (seven) areas, (2) feature extraction from those regions and (3) linear classification by matching the computed features using the Euclidean distance. The regions and lines of interest are selected based on the degree of shape change as measured by computed curvature; thus most of these areas are located around the eyes and nose rather than around the forehead or chin. The 3D shape features include areas, distances, angles, curvature averages of a region, etc. The features are sorted by discriminating power as measured by Fisher coefficients. The best recognition rate was achieved using a training set in which 35 features (out of 86) are used. The results show that 78% accuracy was obtained with frontal-view images while 62% accuracy was obtained with smile expressions. Results for other expressions or poses are not clearly reported.

Pan et al. [64] experiment with 3D face recognition using both a Hausdorff distance approach and a PCA-based approach. In experiments with images from the M2VTS database [59] they report an equal error rate (EER) in the range of 3% to 5% for the Hausdorff distance approach and an EER in the range of 5% to 7% for the PCA-based approach.

Xu and co-workers [86] developed a method for 3D face recognition and evaluated it using the database from Beumier and Acheroy [11]. The original 3D point cloud is converted to a regular mesh. The nose region is found and used as an anchor to find other local regions. A feature vector is computed from the data in the local regions of the mouth, nose, left eye and right eye.

Feature space dimensionality is reduced using PCA, and matching is based on a minimum distance using both global and local shape components. Experimental results are reported for the 120 persons in the dataset and for a subset of 30 persons, with performance of 72% and 96%, respectively. This illustrates the general point that reported experimental performance can be highly dependent on dataset size; most other works have not considered performance variation with dataset size. It should be mentioned that the reported performance was obtained with five images of a person used for enrollment in the gallery. Performance would generally be expected to be lower with only one image used to enroll a person.

Russ and co-workers [72] present results of Hausdorff matching on range images. They use portions of the Notre Dame dataset used in [26] in their experiments. In a verification experiment, 200 persons were enrolled in the gallery, and the same 200 persons plus another 68 impostors were represented in the probe set. A probability of correct verification as high as 98% (of the 200) was achieved at a false alarm rate of 0 (of the 68). In a recognition experiment, 30 persons were enrolled in the gallery and the same 30 persons, imaged at a later time, were represented in the probe set. A 50% probability of recognition was achieved at a false alarm rate of 0. The recognition experiment uses a subset of the available data because of the computational cost of the current algorithm [72].

Lu and co-workers [55] report results of an ICP-based approach to 3D face recognition. This approach assumes that the gallery 3D image is a more complete face model and that the probe 3D image is a frontal view that is likely a subset of the gallery image. In experiments with images from 18 persons, with multiple probe images per person, a recognition rate of 97% was achieved.
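Several of the studies above, as well as the recognition algorithm proposed later in this dissertation, align two 3D face surfaces with an iterative closest point (ICP) procedure and use the residual alignment error as the match score. The sketch below is a generic point-to-point ICP loop, not the specific implementation evaluated in later chapters; the iteration count, convergence tolerance and brute-force nearest-neighbor search are illustrative assumptions.

```python
import numpy as np

def best_rigid_transform(src, dst):
    """Least-squares rotation R and translation t mapping src onto dst
    (Kabsch/SVD solution for paired 3D points)."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:          # avoid returning a reflection
        Vt[-1, :] *= -1
        R = Vt.T @ U.T
    t = dst_c - R @ src_c
    return R, t

def icp_rms(probe, gallery, iterations=30, tol=1e-6):
    """Align probe points (N x 3) to gallery points (M x 3) and return the
    RMS closest-point distance, usable as a smaller-is-better match score."""
    probe = probe.copy()
    prev_rms = np.inf
    rms = np.inf
    for _ in range(iterations):
        # Brute-force nearest neighbours; a k-d tree would be used in practice.
        d = np.linalg.norm(probe[:, None, :] - gallery[None, :, :], axis=2)
        nearest = gallery[np.argmin(d, axis=1)]
        rms = np.sqrt(np.mean(np.min(d, axis=1) ** 2))
        if abs(prev_rms - rms) < tol:
            break
        prev_rms = rms
        R, t = best_rigid_transform(probe, nearest)
        probe = probe @ R.T + t
    return rms

# In an identification experiment, a probe surface is aligned to every enrolled
# gallery surface and the gallery entry with the smallest RMS is the best match.
```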

1.3 Survey of Multi-biometrics

This section reviews studies that use multiple biometrics in order to improve the robustness and accuracy of an existing single-biometric method. Methods that use multiple types of biometric sources for identification (multi-modal biometrics) are reviewed. The term multi-modal biometrics is used here to refer to the use of different sensor types, without necessarily implying that different parts of the body are sensed, for example the appearance and the shape of the face. The term also includes studies that use different parts of the body with the same sensor, such as face and speech. The important aspects of these multi-modal studies are summarized in Tables 1.3 and 1.4.

A person identification system using two biometrics, face and speech, is presented by Brunelli and Falavigna [17]. A text-independent speaker recognition method based on Vector Quantization (VQ) is used. Static and dynamic codebooks designed with VQ represent a person's acoustic features as 8-dimensional vectors. The matching process compares the distances of the static and dynamic vector sequences of a given input stream to the codebooks of the corresponding person. The face recognition method uses template matching in local windows containing the eyes, nose and mouth; after these sub-images are normalized, the L1 distance is calculated. During the fusion process, rank- and measurement-based integration is considered. A significant improvement is achieved when multiple biometric sources are used for person identification.

Chang et al. [27] brought the so-called phase-out vector filter, which had been used as a discriminating filter in signal processing, into face recognition. This is face identification using range data and intensity images. Vector phase-out filters have been widely used in matched-filter applications.

TABLE 1.3

MULTIPLE BIOMETRICS STUDIES

(Each entry lists source (year) [ref]: biometric sources; combination scheme; fusion level; no. of subjects.)

Brunelli ('95) [17]: Face + Voice; Average; Metric/Rank; 89
Chang ('97) [27]: 3D facial shape + 2D face; Vector combination; Pixel; 10
Kittler ('98) [49]: Face + Profile + Voice; Sum, Min, Max, Product, Median; Metric; 37
Hong ('98) [45]: Face + Fingerprint; N/A (replacement); Metric/Rank; 64
Ben-Yacoub ('99) [6]: Face + Voice; SVM; Metric; 37
Verlinde ('99) [80]: Face + Profile + Speech; Weighted sum; Metric; 37
Frischholz ('00) [36]: Face + Voice + Lip movement; Product; Metric; 150
Beumier ('00) [10]: 2D frontal + 3D profiles; Weighted sum; Metric; 120
Ross ('01) [71]: Face + Hands + Fingerprint; Sum, Decision tree; Metric; 50
Shakhnarovich ('02) [74]: Face + Gait; Sum, Min, Max, Product, Mean; Metric; 26

TABLE 1.4

MULTIPLE BIOMETRICS STUDIES (CONT.)

(Each entry lists source (year) [ref]: biometric sources; combination scheme; fusion level; no. of subjects.)

Poh ('02) [68]: Face + Speech (multi-sample); MLP; Metric; 30
Bronstein ('03) [16]: 2D face + 3D face shape; Unknown; Metric; 157
Wang ('02) [83]: 2D face + 3D face shape; SVM; Metric; 90
Sanderson ('03) [73]: Face + Speech; Sum; Metric; 12
Chang ('03) [21]: Face + Ear; Concatenation; Pixel; 308
Iwano ('03) [48]: Ear + Speech; Weighted sum; Metric; 38
Aleksic ('03) [5]: Face + Speech; HMM; Metric; Unknown
Chang ('03) [22]: 2D face + 3D facial shape; Sum, Product, Min; Metric/Rank; 275
Chen ('03) [28]: 2D face + Facial heat pattern; Sum; Rank; 383
Hazen ('03) [42]: Face + Speech; Product; Metric; 35

These filters are computed from the normal vectors of the range images and are used to discriminate between range faces. Given rendered faces, range faces and intensity faces, each input image is normalized prior to recognition. Normal vectors computed from the range data points are used to create rendered images, and gray images are also produced from the x, y and z components of the range image. Finally, a correlation vector is used to measure the similarity of the extracted images: range, rendered and intensity.

Kittler et al. formalized a framework for combining multiple classifiers [49]. They experimented with the framework by combining three biometrics: frontal face, face profile and voice. The sum rule performs best among the rules evaluated. The combination rules were later compared by combining multiple classifiers on handwritten digits. They concluded that the sum rule is reliable in decision making and remarkably robust.

Hong et al. experimented with fingerprint, face and hand geometry [45]. Hand geometry involves matching the geometric structure of the hands; fourteen feature values (length and width of fingers, width of palm, etc.) are extracted and compared. The verification accuracy of the individual biometrics shows that fingerprint performs better than face and hand geometry, and their tri-modal combination shows the highest performance, exceeding the bi-modal combinations as well as the individual biometrics.

Ben-Yacoub et al. proposed a multi-modal person authentication system in which a number of experts (face recognizer, speaker recognizer, etc.) give their opinions about the identity [6]. The opinions are combined to form a final decision (rejecting or accepting the claim). They show that the final decision is a binary classification problem and propose to solve it with a Support Vector Machine (SVM). The approach is compared with other methods on an identical verification task, and the results show that it leads to considerably higher performance.

Verlinde and Chollet [80] proposed to combine three biometric sources (profile, frontal face and speech). A k-NN classifier, a decision tree and a logistic regression model-based classifier are compared for this tri-modal application. The logistic regression model gives the lowest total error rate. A weight in the logistic regression model is estimated for each modality in the training process; speech was the most important modality and the frontal face the least important in the decision process. Regardless of the classifier used, however, the system accuracy improves when multiple biometrics are considered.

Frischholz and Dieckmann proposed a multi-modal biometric identification system called BioID, which uses physical features to check a person's identity and offers much greater security than password and number systems [36]. Biometric features such as the face or a fingerprint can be stored on a microchip in a credit card, for example. A single feature, however, sometimes fails to be exact enough for identification, and another disadvantage of using only one feature is that the chosen feature is not always readable. BioID is a multi-modal identification system that uses three different features - face, voice and lip movement - to identify people, and with its three modalities it achieves much greater accuracy than single-feature systems. The authors also explain the classification principles used for the optical features and the sensor fusion options (the combinations of the three results - face, voice and lip movement - used to obtain varying levels of security).

Beumier estimated 3D profiles by incorporating intensity profiles extracted from an intensity image [10]. The full facial surface is constructed from geometric features of the external contour along with a profile-based approach.

For the recognition phase, the histogram difference and the distance values between profiles determine the degree of similarity. A weighted sum of the 2D and 3D scores is used in the fusion process. They report that performance improves when 3D profiles and 2D intensity profiles are combined.

The multi-biometrics introduced by Ross and Jain [71] use three biometric sources: face, fingerprint and hand geometry. The eigenface method is used for face verification; a match is determined by the Euclidean distance between points in the face space created with an optimized set of eigenvectors. Fingerprint verification uses the unique pattern of ridges and furrows on the finger tips.

A system that combines face and gait recognition using a four-camera system is described by Shakhnarovich et al. [74]. The normalization technique operates on the input image sequences. At the fusion step, a set of images belonging to the same subject is combined to establish a single class; this integrates multiple images acquired at different times to be more representative of a subject. The results show that the combined method, whether the product or the sum rule is considered, outperforms a single biometric source.

An interesting approach, which can be classified as a hybrid multiple biometric, was proposed by Poh et al. [68]. Multiple samples are collected from multiple biometrics: five samples each of face and voice. A Multi-Layer Perceptron (MLP) is used as the classification method. Two facts are interesting to observe from the presented results. First, as the number of samples is increased, the accuracy improvement rate for face is faster than that for speech. Second, a faster improvement rate is observed with multiple biometrics than with multiple samples. This improvement comes in part from the fact that multi-biometrics increases both the number of biometric sources and the number of images (samples).

Frontal views acquired with both a CCD camera and a structured-light range camera are used for face identification by Wang et al. [83]. Gabor filters applied at ten manually selected points in the intensity images are used to compute 2D features, and point signatures are estimated as 3D features around four manually selected points in the range data. A similarity function and an SVM determine the identity of a target person from the computed features. They report a higher true-positive rate from the SVM than from the similarity function, and they claim that performance is better when the 2D and 3D visual data are combined than with 2D alone.

Bronstein et al. proposed a 3D face recognition system, used in conjunction with 2D face images, that is claimed to be invariant to facial expression [16]. The 3D face is represented with isometric surfaces so that surface deformations due to facial expression (i.e. isometric transformations) might be mitigated. Eigenspaces for 2D (flattened texture) and 3D (canonical image) are constructed, and the weighted Euclidean distance between two subjects is measured during the matching process. However, no experimental results are reported.

Sanderson et al. show that a face and speech based multi-biometric method gives higher accuracy in a biometric authentication task [73]. The speech recognition method uses a Bayesian Maximum a Posteriori (MAP) approach, and two HMMs are used in synchronous alignment to retrieve the best path through the models. The eigenface method is used for face recognition. In the fusion process, a weight is assigned based on the recognition accuracy of each individual modality. As part of the fusion process, the system determines the weights depending on a hypothesis test per modality under the assumption of Gaussian distributions (e.g. genuine access versus impostor access).

The results reported in this study show that the combined method yields better performance than the individual biometrics.

An integrated biometric authentication system using a person's ear and speech is described by Iwano et al. [48]. While speech can be a useful biometric source for revealing a person's identity in a mobile environment, it needs to be robust in practical settings. Audio features are modeled with HMMs, and ear features obtained by the PCA approach are modeled with Gaussian Mixture Models (GMM). Matching scores are estimated via the product of the log-likelihoods of the posterior probabilities, and weights are applied to the individual scores during integration. Their experimental results show that the overall system accuracy becomes more robust to additive white noise in the speech when the ear is integrated.

A multi-biometric system using face and speech is proposed by Aleksic et al. [5]. The person's facial features, represented as facial animation parameters (FAP), and the speech features are modeled using HMMs. The motivation for using FAP is to examine the facial movement while the observed person speaks. After the reduced facial features are interpolated to synchronize with the speech, the observation feature vectors of each modality are integrated and passed to an HMM engine for training. The identification and verification scores are computed independently from the log-likelihoods of the posterior probabilities. The robustness of the speech sub-system improves when facial features are integrated, in both identification and verification.

Different sensors acquire different aspects of the human face. In Chang et al. [22], frontal face shape and appearance are combined to improve identification accuracy over that of a single biometric.

Range cameras acquire the depth from the camera to the subject, while a CCD camera captures the appearance. After normalization of pose and brightness, face appearance and shape recognition are performed using eigenfaces. When the two biometric sources are combined at the metric level, the identification rate improves significantly over that of a single biometric.

Chen et al. [29] adopted an approach similar to that in [22], but the heat pattern of the face obtained with an infra-red sensor is combined with the face appearance. The discrete values acquired by the infra-red sensor represent the thermal flux emitted from an object or a scene. The motivation for using the heat pattern is to examine the uniqueness of the heat pattern observed in an individual face for identification purposes. The recognition rate of the combined method is higher than that of either appearance or heat pattern alone. However, they report that the heat pattern provides poor recognition accuracy when probe and gallery images are taken with a time lapse.

A multi-biometric method combining face and speech is proposed by Hazen et al. [42]. The system is designed for person identification in mobile environments. The utterance is a useful source of variation that can be used to characterize a person's identity, along with vocal tract size and vocal fold length. Speech recognition in this study matches a spoken utterance against a prompted utterance; after a phonetic transcription in a text-dependent system is generated from the input utterance, the method determines a score for a given speech segment. An SVM classifier-based face recognition method is used. Once the SVM is trained, a test image is ranked based on n selected normalized distances from the zero-mean decision hyperplane. Speech recognition outperforms face recognition; however, the combined method produces a higher user verification rate than speech recognition alone.

1.4 Summary

All of the studies reviewed in Tables 1.3 and 1.4 claim that multi-biometrics improves over an individual biometric. Biological features of the human face or other body parts have different properties under different sensors, and each property can be characterized as better or worse depending on how the data are acquired for identification purposes. The integration of multiple biometric sources clearly increases robustness and accuracy in recognition rate relative to a single biometric.

The performance of face recognition fluctuates severely depending on the dataset used. Most of the studies in this review have a limited number of subjects and images in the evaluation dataset. Designing and collecting the evaluation dataset is a very critical process for any face recognition system.

Another important component of multi-biometric methods is what to combine and how to combine multiple biometric sources. There are different ways of fusing multiple biometric sources and multiple algorithms. Four different approaches have been proposed, depending on the number of samples and the number of biometric sources used in multi-biometric studies [68]: single-sample single-source (SSSS), multi-sample single-source (MSSS) [50, 63], single-sample multi-source (SSMS) and multi-sample multi-source (MSMS). Multiple decision makers or experts (i.e. classifiers) applied to the same biometric source [56, 57, 70, 81] have also been proposed; this can be considered single-source multi-algorithm (SSMA). The different types of multiple biometric approaches are revisited in Chapter 5.
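As an illustration of the metric-level fusion evaluated in many of the studies above (for example, the sum, product, min and max rules of Kittler et al. [49]), the sketch below normalizes per-modality distance scores and combines them with a chosen rule. The min-max normalization, the choice of rules, and all names and toy values are illustrative assumptions, not a prescription from this dissertation.

```python
import numpy as np

def min_max_normalize(scores):
    """Linearly rescale one modality's scores to [0, 1]."""
    lo, hi = scores.min(), scores.max()
    return (scores - lo) / (hi - lo) if hi > lo else np.zeros_like(scores)

def fuse(score_sets, rule="sum"):
    """Combine normalized scores from several modalities for one probe.

    score_sets: list of 1-D arrays, one per modality, each giving the
    distance of this probe to every gallery subject (smaller = better).
    Returns the fused distances to every gallery subject.
    """
    s = np.vstack([min_max_normalize(x) for x in score_sets])
    rules = {
        "sum": s.sum(axis=0),
        "product": s.prod(axis=0),   # product/min rules are often applied to
        "min": s.min(axis=0),        # similarity or probability scores instead
        "max": s.max(axis=0),
    }
    return rules[rule]

# Toy example: 2D-face and 3D-shape distances of one probe to 4 gallery subjects.
face_2d = np.array([12.0, 30.0, 25.0, 40.0])
shape_3d = np.array([0.8, 0.5, 1.9, 2.2])
fused = fuse([face_2d, shape_3d], rule="sum")
print(np.argmin(fused))   # index of the best-matching gallery subject
```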

1.5 Conclusion

Studies using the shape of the face as a biometric feature have been reviewed in the context of person identification, and the important elements are summarized in Tables 1.1, 1.2, 1.3 and 1.4. Among the studies reported in the previous sections, four [16, 30, 44, 61] considered datasets with facial expression changes, and only one, by [61], explicitly devised its method according to facial expressions. (The study by [16] reports no results for its method.) There is therefore plenty of room to improve the overall accuracy of 3D recognition methods in coping with facial expressions.

The first study investigates face recognition in combination with recognition based on one's right ear (Chapter 2). Then, Chapter 3 and Chapter 4 combine different features of the face: appearance (2D) + shape (3D), and appearance + shape + heat pattern (IR), respectively. Chapter 5 discusses the value of multiple biometric sources (appearance and shape) compared to multiple samples of 2D appearance images. A newly proposed 3D face recognition method using local facial surfaces is described in Chapter 6. This method is motivated by the problematic situations where facial expression changes are present in 3D face shape; the proposed method involves multiple local surface regions of the face, selected based on the muscle movement observed in expressions.

The datasets tested in each chapter are selected independently, so the number of subjects and the conditions of image quality are not the same across experiments. However, all experimental designs are the same in the sense that a given technique is used for face recognition with a gallery and a probe set. The results are presented by the CMC curve for a gallery and probe set, or by the ROC curve for a verification study.

The statistical significance of the difference between the rank-one recognition rates obtained by a single biometric and by multiple biometrics is determined.
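The excerpt above does not state which significance test is used. As one common choice for comparing paired recognition outcomes on the same probe set, the sketch below applies an exact McNemar test to the per-probe correct/incorrect outcomes of a single-biometric matcher and a multi-biometric matcher; the function name and the toy data are hypothetical.

```python
from math import comb

def mcnemar_exact(single_correct, multi_correct):
    """Exact McNemar test on paired rank-one outcomes (one entry per probe,
    truthy = correctly recognized). Returns a two-sided p-value based only
    on the discordant pairs."""
    b = sum(bool(s) and not bool(m) for s, m in zip(single_correct, multi_correct))
    c = sum(bool(m) and not bool(s) for s, m in zip(single_correct, multi_correct))
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Two-sided exact binomial p-value with success probability 0.5.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2.0 ** n
    return min(1.0, 2.0 * tail)

# Hypothetical per-probe outcomes for 10 probes.
face_only  = [1, 1, 0, 1, 0, 1, 0, 0, 1, 1]
multimodal = [1, 1, 1, 1, 1, 1, 0, 1, 1, 1]
print(mcnemar_exact(face_only, multimodal))
```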

CHAPTER 2

MULTIPLE BIOMETRICS USING FACE AND EAR

2.1 Introduction

While good face recognition performance has been reported under certain conditions, there is still a great need for better performance from biometrics appropriate for use in video surveillance. (This chapter is based on the paper "Multimodal Biometrics in Face and Ear Using PCA," published in IEEE Transactions on Pattern Analysis and Machine Intelligence.) Possible avenues for improved performance include the use of a different source of biometric information and/or the combination of information from multiple sources. One other possible biometric source is the ear. Iannarelli performed important early research on a manual approach to using the ear for human identification [47]. Recent works that explore computer vision techniques for ear biometrics include those of Burge and Burger [18] and Hurley et al. [46]. In particular, Burge and Burger assert that the ear offers the promise of performance similar to the face:

"Facial biometrics fail due to the changes in features caused by expressions, cosmetics, hair styles, and the growth of facial hair as well as the difficulty of reliably extracting them in an unconstrained environment exhibiting imaging problems such as lighting and shadowing.... Therefore, we propose a new class of biometrics for passive identification based upon ears which have both reliable and robust features which are extractable from a distance.... identification by ear biometrics is promising because it is passive like face recognition, but instead of the difficult to extract face biometrics, robust and simply extracted biometrics like those in fingerprints can be used." ([18], page 275)

In the context of Iannarelli's earlier work and the current popularity of face recognition research, this assertion that the ear could offer improved biometric performance relative to the face deserves careful evaluation. The experiments reported here are aimed at (1) testing the hypothesis that images of the ear provide better biometric performance than images of the face, and (2) exploring whether a combination of ear and face images may provide better performance than either one individually.

The results reported here follow up on those of an earlier study [82]. Using larger data sets and more rigorous assurance of similar relative quality in the ear and face images, we obtain somewhat different results than in the earlier study. In the experiments reported here, recognition performance is not statistically significantly different using ear images or face images, and combining the two for multi-modal recognition results in a statistically significant performance improvement. For example, in one experiment the rank-one recognition rates for face and ear were 70.5% and 71.6%, respectively, whereas the corresponding multi-modal recognition rate was 90.9%.

2.2 Eigen-faces and Eigen-ears

Extensive work has been done on face recognition algorithms based on principal component analysis (PCA), popularly known as eigenfaces [79]. The FERET evaluation protocol [67] is the de facto standard for the evaluation of face recognition algorithms, and currently uses PCA-based recognition performance as a baseline. A standard implementation of the PCA-based algorithm [12] is used in the experiments reported here. This implementation requires the location of two landmark points for image registration.

Figure 2.1. Illustration of points used for geometric normalization of face and ear images. The Triangular Fossa is the upper point on the ear image, and the Antitragus is the lower point.

For the face images, the landmark points are the centers of the eyes. Manually identified eye center coordinates are supplied with the face images in the Human ID database. For the ear images, the manually identified coordinates of the Triangular Fossa and the Antitragus [47] are used. See Figure 2.1 for an illustration of the landmark points.

The PCA-based approach begins by using a set of training images to create a face space or ear space. First, the landmark points are identified and used to crop the image to a standard size located around the landmark points. In our experiments, original face images are cropped to 768x1024 and original ear images to 400x500. In these images, one pixel covers essentially the same size area on the face or the ear. Next, the cropped images are resized to the 130x150 size used by the PCA software; at this point, one pixel in an ear image represents a finer-grained metric area than one pixel in a face image. The normalized images are then masked to gray out the background and leave only the face or ear, respectively.

with the standard implementation [12]. For the ear images, we experimented with several different levels of masking in order to tune this algorithm parameter for good performance. Lastly, the image is histogram equalized. The eigenvalues and eigenvectors are computed for the set of training images, and a face space or ear space is selected based on the eigenvectors associated with the largest eigenvalues. Following the FERET approach, we use the eigenvectors corresponding to the first 60% of the largest eigenvalues and drop the first eigenvector, as it typically represents illumination variation [67]. This approach uses the same dimension of face space and ear space, 117 in this case (Table 2.1). Another approach is to use whatever number of eigenvectors accounts for some fixed percent of the total variation, resulting in a different dimension of face space and ear space. Which of these approaches is used does not substantially affect our conclusions, as is shown later in this chapter. The set of training images consists of data for 197 subjects, each of whom had both a face image and an ear image taken under the same conditions at the same image acquisition session. These images were acquired at the University of South Florida (USF) between August 2000 and November. A subject's images were dropped from our study if either the face or ear was substantially obscured by hair, if the subject wore an earring or analogous face jewelry, or if either image had technical problems. Some of the gallery and probe images for the first experiment were acquired at USF during the same time frame. Additional gallery and probe images for the first experiment, and all gallery and probe images for the second and third experiments, were acquired at the University of Notre Dame in November. There is a separate (gallery, probe) dataset for each of three experiments. The

gallery images represent the watch list, that is, the people who are enrolled in the system in order to be recognized. A probe image is an image given to the system to be matched against the gallery. Each of the three experiments represents a single factor being varied in a consistent way between the gallery and probe. For the day variation experiment, eighty-eight subjects had both an ear and a face image taken under the same conditions in one acquisition session, and then another ear and face image taken under the same conditions on a different day. The face images are the standard FERET FA ("normal expression") images [67]. The ear images are of the right ear. For each subject, the earlier image is used as the gallery image and the later image is used as the probe image. This experiment looks at the recognition rate when gallery and probe images of a subject are obtained on different days, but under similar conditions of pose and lighting. For the lighting variation experiment, 111 subjects had an ear and a face image taken under the same conditions in one session, and then another face and ear image taken in the same session, but under a different lighting condition. The standard lighting uses two side spotlights and one above-center spotlight, and the altered lighting uses just the above-center spotlight. The images taken under the standard lighting are gallery images, and the images taken under altered lighting are probe images. This experiment looks at the recognition rate when gallery and probe images of a subject are obtained in the same session and with similar pose, but under distinctly different lighting. For the pose variation experiment, 101 subjects had both an ear and a face image taken under the same conditions in one acquisition session, and then another face and ear image taken at 22.5 degree rotation in the same acquisition session.

Figure 2.2. An example of the gallery and probe face and ear images used in this study (gallery image; day variation, lighting variation, and pose variation probes).

The images taken from a straight-on view are the gallery set, and the images taken at a 22.5 degree rotation are the probe set. This experiment looks at the recognition rate when gallery and probe images of a subject are obtained in the same session and with the same lighting, but with a different pose. An example of the gallery and different probe conditions for one subject appears in Figure 2.2. Not all subjects attended all acquisition sessions, and some subjects were dropped from some experiments after image quality control checks, and so the three experiments have different numbers of subjects. The same standard face and ear images of some subjects may appear in the gallery set for each of the three experiments. However, since the probe sets are the changed conditions, there are no images in common across the three probe sets.
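For concreteness, the following is a minimal sketch of the eigenspace construction and projection outlined above, assuming the training images have already been geometrically normalized, masked, and histogram equalized. The NumPy-based snapshot formulation and the function names are illustrative; they are not a description of the standard implementation [12].

```python
import numpy as np

def build_eigenspace(train_vecs, keep_frac=0.60, drop_first=1):
    """train_vecs: one flattened, normalized 130 x 150 image per row.
    Keeps the eigenvectors for the first 60% of the eigenvalues and drops the
    first one, which typically captures illumination variation."""
    mean = train_vecs.mean(axis=0)
    centered = train_vecs - mean
    # Snapshot method: eigen-decompose the small n x n matrix, then map back.
    evals, small_vecs = np.linalg.eigh(centered @ centered.T)
    order = np.argsort(evals)[::-1]                      # descending eigenvalue order
    n_keep = int(round(keep_frac * len(evals)))
    basis = centered.T @ small_vecs[:, order[:n_keep]]   # eigenvectors in image space
    basis /= np.linalg.norm(basis, axis=0)
    return mean, basis[:, drop_first:], evals[order[:n_keep]][drop_first:]

def project(mean, basis, image_vec):
    """Coefficients of a gallery or probe image in the face (or ear) space."""
    return basis.T @ (image_vec - mean)
```

Recognition then reduces to projecting each gallery and probe image, comparing the coefficient vectors with a distance measure, and ranking the gallery for each probe.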

2.3 Experimental Results: Face versus Ear

The null hypothesis for these experiments is that there is no significant difference in performance between using the face or the ear as a biometric, given (1) use of the same PCA-based algorithm implementation, (2) the same subject pool represented in both the gallery and probe sets, and (3) controlled variation in one parameter of image acquisition between the gallery and probe images. The recognition experiment is to compute the cumulative match characteristic (CMC) curve for the gallery and probe set, and to consider the statistical significance of the difference in rank-one recognition rates. The baseline is the day variation experiment. This experiment looks at the recognition performance for gallery and probe images taken under the same conditions but on different days. The CMC curves for face and ear recognition are shown in Figure 2.3. The CMC curves are computed in two ways. One uses the 197-image training set that has no subjects in common with the gallery and probe sets. The other uses the gallery set as the training set. There is no substantial difference in the results between the two training methods. Numbers reported for statistical significance tests are taken from the results using the 197-image training set. Note that the ear and face performance represented in the CMC curves is quite similar, with the curves actually crossing at some point. The rank-one recognition rates of 70.5% for face and 71.6% for ear are not statistically significantly different at the 0.05 level using a McNemar test [13]. Relative to the baseline experiment, the lighting variation experiment looks at how a lighting change between the gallery image and the probe image affects the recognition rate. Performance for either the face or the ear is slightly lower than in the baseline experiment. Similar to the baseline experiment, there is relatively

Figure 2.3. Recognition performance comparison between face and ear: CMC curves for (A) the day variation, (B) the lighting variation, and (C) the pose variation experiments, with training on the 197-subject training set and on the gallery set.

little difference between the CMC curves for the face and the ear, especially at lower ranks. The rank-one recognition rates of 64.9% for face and 68.5% for ear are not statistically significantly different at the 0.05 level using a McNemar test. Relative to the baseline experiment, the pose variation experiment looks at how a 22.5 degree rotation to the left between the gallery and the probe images affects the recognition rate. Performance in this case is much lower than for either the baseline or the lighting change experiment. There also appears to be a larger gap between face and ear performance than in the other two experiments, but still the difference is not statistically significant. In any case, performance at this low a level is not likely to be practically meaningful. Overall, the results of our experiments do not provide any significant evidence for rejecting the null hypothesis that the face and the ear have equal potential as the source for appearance-based biometric recognition. Of course, there may still be some biometric algorithm, other than PCA, for which one of the face or the ear offers significantly better recognition performance than the other. Also, there may be particular application scenarios in which it is not practical to acquire ear and face images that meet similar quality control conditions. For example, in an outdoor sports context many people may wear sunglasses, or in a formal indoor event many people may wear earrings.

2.4 Experimental Results: Face Plus Ear Multi-modal Biometric

Another experiment was performed to investigate the value of a multi-modal biometric using the face and ear images. A very simple multi-modal combination technique is used. The normalized, masked ear and face images of a subject are concatenated to form a combined face-plus-ear image. This was done with the

data from each of the three experiments, and Figure 2.4 shows the resulting CMC curves. The CMC curves for the day variation and lighting variation experiments suggest that the multi-modal biometric offers substantial performance gain. The difference in the rank-one recognition rates for the day variation experiment using the 197-image training sets is 90.9% for the multi-modal biometric versus 71.6% for the ear and 70.5% for the face. A McNemar's test for significance of the difference in accuracy in the rank-one match between the multi-modal biometric and either the ear or the face alone shows that multi-modal performance is significantly greater at the 0.05 level. Of the 88 probes, the multi-modal and the ear are both correct on 62, both incorrect on 6, multi-modal only is correct on 18, and ear only is correct on 2. The difference between the multi-modal biometric and either the face or the ear alone is again statistically significant in the lighting change experiment, 87.4% rank-one recognition rate versus 64.9% or 68.5% for the face or ear, respectively. However, because the overall performance is so low, the difference in the pose change experiment is not statistically significant. These results suggest that it is worthwhile to explore the combination of multiple biometric sources that could be acquired in a surveillance scenario.

2.5 Summary and Discussion

Overall, our experimental results suggest that the ear and the face may have similar value for biometric recognition. Our ear recognition results do not support a conclusion that an ear-based or face-based biometric should necessarily offer better performance than the other. Of course, this is not the same as proving that there is no useful biometric algorithm for which one would offer better performance. Research into new algorithms that take advantage of specific features of

Figure 2.4. Recognition performance of face, ear, and combined face-plus-ear: CMC curves for (A) the day variation, (B) the lighting variation, and (C) the pose variation experiments, with training on the 197-subject training set and on the gallery set.

the ear, or the face, may produce improved performance using one or the other. The results presented so far are based on using the same fixed number of eigenvectors for both the face and ear space. It is also possible to create the spaces based on the same percent of energy, allowing the number of eigenvectors to vary as appropriate. CMC curves computed using both spaces for the day variation experiment appear in Figure 2.5.

Figure 2.5. Performance based on different selections of eigenvectors in the face and ear spaces for the day variation experiment (training on 197 subjects): the first 60% of the total eigenvectors (117 for face, 117 for ear) versus the eigenvectors retaining 90% of the energy variation (86 for face, 76 for ear).

Performance is essentially the same whether the spaces are created based on a fixed number of eigenvectors, as in this case, or a floating number of eigenvectors corresponding to a fixed percent of total energy. The PCA-based face recognition approach has been informally tuned through use over time, and inevitably an accumulation of expertise is embedded in the standard implementation [12]. Several options were explored in an attempt to ensure that the use of the PCA approach was appropriately tuned for use with ear images. For example, five different levels of masking for the ear images were tried. Also, a total of four landmark points were marked on each ear image

TABLE 2.1. NUMBER OF EIGENVECTORS USED TO CREATE THE EIGENSPACE (eigenvector selections — the first 60% of the total eigenvectors, and the eigenvectors retaining 90% of the energy variation — for the face, ear, and face-plus-ear spaces).

and experiments were run with a different pair of landmark points. The results reported here are for the best level of masking and pair of landmark points. Our results are obtained using the PCA-based algorithm, whereas Burge and Burger [18] and Hurley et al. [46] each propose a different approach. Thus one possible reservation to our conclusion is that it may be dependent on the particular algorithmic approach. However, we know of no experimental results in the literature for either of the other proposed approaches. In our own efforts to implement one of the approaches, we found the basic ear description used to be rather unstable. The description is an attributed graph obtained from the Voronoi diagram of the edges detected in the ear image [18]. One problem is that the edges detected from the ear image can be very different for relatively small changes in camera-to-ear orientation or in lighting. The edges detected in an image of the ear arise mostly from occluding contours, rather than from surface discontinuities or surface-marking-like effects. Thus the edges will naturally be substantially different if there are changes in orientation or lighting.

We have tried to make the face versus ear aspect of this experiment fair in the sense of having equivalent quality control rules for each type of image. For example, all images are of subjects not wearing earrings or any face jewelry, and all images had no substantial amount of the ear or face obscured by hair. These restrictions are in one sense equal in terms of the quality of images used in the experiments, but are not necessarily equal in the sense of being equally likely to be true of images acquired in practice.
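The significance statements in this chapter (and in the following chapters) rest on McNemar's test applied to paired per-probe outcomes. The following is a minimal sketch of that test, assuming each matcher's rank-one correctness is available as one boolean per probe; the exact two-sided binomial form shown here is one common variant, and the names are illustrative.

```python
from scipy.stats import binom

def mcnemar_p_value(correct_a, correct_b):
    """Exact two-sided McNemar test on paired rank-one outcomes.
    correct_a, correct_b: per-probe booleans for two matchers on the same probes."""
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)  # A right, B wrong
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)  # B right, A wrong
    n = b + c
    if n == 0:
        return 1.0   # no discordant probes, hence no evidence of a difference
    return min(1.0, 2.0 * binom.cdf(min(b, c), n, 0.5))
```

In the day variation experiment of Section 2.4, for example, the discordant counts of 18 (multi-modal only correct) and 2 (ear only correct) are what drive the significance of the multi-modal improvement.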

CHAPTER 3

MULTI-MODAL 2D+3D FACE RECOGNITION

3.1 Introduction

This chapter investigates face recognition using 2D and 3D. Each modality captures different aspects of facial features, 2D intensity representing surface reflectance and 3D depth values representing face shape. Even though each modality has its own advantages and disadvantages depending on certain circumstances, there is often some expectation that 3D data should yield better performance: Not only are depth relationships between range-image regions explicit, the three-dimensional (3D) shape of image regions approximates the 3D shape of the corresponding object surfaces in the field of view. Since correct depth information depends only on geometry and is independent of illumination and reflectivity, intensity-image problems with shadows and surface markings do not occur. Therefore, the process of recognizing objects by their shape should be less difficult in range images than in intensity images. ([7], page 76) However, no rigorous experimental study has been reported to validate this expectation. The experiments reported in this study are aimed at (1) examining the spatial / depth resolution needed for 3D face recognition, (2) testing the hypothesis that 3D face data provides better biometric performance than 2D face data, using the PCA-based method, and (3) exploring whether a combination of

1 This chapter is based on the paper "Multimodal 2D and 3D Biometrics for Face Recognition," presented at the ACM Workshop on Multimodal User Authentication.

2D and 3D face data may provide better performance than either one individually, in both a single probe study and a multiple probe study.

3.2 Methodology

The PCA method is used as in Chapter 2, for both 2D and 3D face recognition. The method can be easily adapted to 3D images. Every pixel in a 3D image represents a distance value from the 3D image plane to the object in the scene.

Normalization

The main objective of the normalization process is to minimize the uncontrolled variations that occur during the acquisition process and to maintain the variations observed in facial feature differences between individuals. The normalized images are masked to omit the background and leave only the face region (see Figure 3.2-(D)). While each subject is asked to gaze at the camera during the acquisition, it is inevitable that the data contain some level of pose variation between acquisition sessions.

Normalization of 2D Images

The main purpose of the normalization step is to minimize, to the degree possible, the uncontrolled variations that occur during the acquisition process and to maintain the variations in facial appearance between individuals. After color images are converted into gray level images, normalization steps for geometry and brightness are applied to the 2D images. The 2D images are treated as having pose variation only around the Z axis, the optical axis. Two control points (1 and 2) at the centers of the eyes are selected manually for geometric normalization

to correct for rotation, scale, and position of the face (as shown in Figures 3.1 and 3.2). Finally, median filtering is applied with a 7 x 7 kernel window to suppress noise.

Figure 3.1. Landmark (control) points specified in (A) the 2D image and (B) the 3D image.

The face region is interpolated into a template that masks out the background. This scales the original image so that the pixel distance between the eye centers is 80. Histogram equalization is applied to standardize the intensity distribution. This attempts to minimize the variation caused by illumination changes between images.
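A minimal sketch of this 2D normalization chain is given below. It assumes OpenCV is available and that the manually marked eye centers are supplied; the cropping and template placement are simplified, the masking step is omitted, and the parameter and function names are illustrative (the inter-eye distance of 80 pixels follows the text above).

```python
import numpy as np
import cv2

def normalize_2d(gray, left_eye, right_eye, target_eye_dist=80, out_size=(130, 150)):
    """Geometric and photometric normalization of an 8-bit grayscale face image.
    left_eye / right_eye: (x, y) coordinates of the manually marked eye centers."""
    dx = float(right_eye[0] - left_eye[0])
    dy = float(right_eye[1] - left_eye[1])
    angle = np.degrees(np.arctan2(dy, dx))           # in-plane (Z-axis) rotation of the eye line
    scale = target_eye_dist / np.hypot(dx, dy)       # fix the pixel distance between the eyes
    center = ((left_eye[0] + right_eye[0]) / 2.0, (left_eye[1] + right_eye[1]) / 2.0)
    M = cv2.getRotationMatrix2D(center, angle, scale)
    aligned = cv2.warpAffine(gray, M, (gray.shape[1], gray.shape[0]))
    face = cv2.resize(aligned, out_size)             # resample to the 130 x 150 template (placement simplified)
    face = cv2.medianBlur(face, 7)                   # 7 x 7 median filter to suppress noise
    return cv2.equalizeHist(face)                    # standardize the intensity distribution
```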

Figure 3.2. 2D pose normalization: (A) input image, (B) rotated image, (C) resized image, (D) final image (dimension: 130 x 150). The input image is rotated around the optical axis using two manually selected points on the outer eye tips so that the line passing through the two eye points is parallel to a horizontal line (X-axis). Once the pose is corrected, the image is scaled so that the distance between those two points is approximately 110 pixels. Once the face region is extracted, histogram equalization is applied to normalize the brightness level. Finally, a template mask is applied, producing the final image in (D).

Normalization of 3D Images

Each point defined in 3D space for a range image has a depth value along the Z-axis. Only geometric normalization is needed to correct the pose variation. Four control points are manually selected to accomplish the task, as shown in Figure 3.1-(B). The pose of a 3D face image is normalized as follows. A transformation matrix is first computed based on the surface normal angle difference in X (roll) and Y (pitch) between manually selected landmark points (1, 2 and 3 in Figure 3.1-(B)) and predefined reference points of a standard face pose and location.

The outer eye corners rather than the eye centers are used as landmark points because the eyeball is an artifact-prone region for the range sensor, whereas the eye corners marked on the skin are more reliable. The landmark points for the eye corners and the center of the chin are used to place the raw 3D image in a standard pose. Pose variation around the Z axis (yaw) is corrected by measuring the angle difference between the line across the two eye points and a horizontal line. At the end of the pose normalization, the nose tip (point 4 in Figure 3.1-(B)) of every subject is translated to the same point in 3D relative to the sensor (see Figure 3.3). The geometric normalization in 2D gives the same pixel distance between eye locations to all faces. This is necessary because the absolute scale of the face is unknown in 2D. However, this is not the case with a 3D face image, and so the eye locations may naturally be at different pixel locations in depth images of different faces. Thus, the geometric scaling was not imposed on the 3D data points as it was on the 2D. The Minolta sensor produces registered 2D and 3D images. Thus, in principle, it is possible to create a fully pose-corrected 2D image by projecting the color texture from the pose-corrected 3D. However, there are missing data points in the 3D. In an initial study, we found that missing data problems with fully pose-corrected 2D outweighed the gains from the additional pose correction [22], and so we use the typical Z-rotation corrected 2D. Problems with the 3D are alleviated to some degree by preprocessing the 3D data to fill in holes (a region where there is missing 3D data during sensing) and remove spikes (see Figure 3.4). The raw 3D image is converted to a range image by the following process. The outer eye corners, nose tip, and the center of the chin are marked as landmark points on the raw image

Figure 3.3. 3D pose normalization: (A) X-Y plane and (B) Y-Z plane views of the initial pose of a subject in 3D space; (C) X-Y plane and (D) Y-Z plane views of the corrected pose.

as shown in Figure 3.1-(B). Then a region around the marked nose tip is searched to refine the nose tip location by comparing Z values, if needed. The refined nose tip gives the centerline for cropping a region from the raw 3D image to create a range image from the depth values. The next step attempts to remove spike artifacts that can occur in the 3D where there are too-light or too-dark regions in the 2D images of the structured-light pattern. The variance in the Z value of the 3D is computed for a window around each pixel. If the

variance is larger than a threshold value, then the current pixel is considered to be part of a spike artifact and is eliminated, leaving holes in the data. Lastly, these holes and any originally occurring holes are removed by linear interpolation of missing values from good values around the edges of the hole. The process of creating the range image is fully automated after the eye, nose and chin points are marked. As noted above, we found that missing data problems with fully pose-corrected 2D outweighed the gains from the additional pose correction [23], and so we use the typical Z-rotation corrected 2D.

Eigenvector Tuning

There are various ways of selecting the set of eigenvectors used to create a face space. For instance, a face space can be created using a sequence of eigenvectors that retain a certain amount of energy, as used in Chapter 2 (see Table 2.1). For the experiments reported in this chapter, the face space is created from a training set of 2D and 3D images. Initially, eigenvectors with the largest eigenvalues are dropped in descending order of eigenvalue, and the rank-one recognition rate of each modality is recomputed each time, continuing until a point is reached where the rank-one recognition rate drops. The number of dropped eigenvectors with the largest eigenvalues is denoted as M. Then, one vector at a time is dropped from the eigenvectors with the smallest eigenvalues in order of ascending eigenvalue, and the rank-one recognition is recomputed each time, continuing until a point is reached where the rank-one recognition rate gets worse rather than better. The number of dropped eigenvectors with the smallest eigenvalues is denoted as N. This process applies to both 2D and 3D in the same manner. After M and N are determined

for each modality, these numbers are fixed for all experiments. The face space for each modality was created individually, and the ranges of eigenvectors for 2D and 3D are selected.

Figure 3.4. (A) Spike noise removal and (B) hole filling by interpolation in range data. After a face region is extracted, spikes are removed first and then missing data points are filled in by linear interpolation.

Distance Metrics

The characteristics of the face space for PCA could potentially be very different between the modalities. Thus, during the decision process, certain metrics might perform better in one space than in the other. In this experiment, the Mahalanobis cosine distance metric was used during the matching process [87].
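As one concrete reading of this metric, the sketch below follows the commonly used definition of the Mahalanobis cosine distance for PCA coefficient vectors: each coefficient is divided by the standard deviation along its eigenvector (a whitening step), and the negative cosine of the angle between the whitened vectors is the distance. The NumPy formulation and the names are illustrative.

```python
import numpy as np

def mahalanobis_cosine(u, v, eigenvalues):
    """Distance between two PCA coefficient vectors u and v; eigenvalues holds the
    variances of the retained eigenvectors. Smaller (more negative) values mean a
    better match."""
    sigma = np.sqrt(eigenvalues)
    m, n = u / sigma, v / sigma        # whiten each axis by its standard deviation
    return -float(np.dot(m, n) / (np.linalg.norm(m) * np.linalg.norm(n)))
```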

It is our experience that the Mahalanobis cosine distance metric consistently outperformed other metrics, such as L1 and L2, for both 2D and 3D face recognition.

Data Fusion

The pixel level provides perhaps the simplest approach to combining the information from multiple image-based biometrics. The images can simply be concatenated together to form one larger aggregate 2D-plus-3D face image. The fusion method applied in Chapter 2 combines pixel data to represent the combined biometrics. This approach is considered early fusion. Metric level fusion (late fusion) focuses instead on combining the match distances that are found in the individual spaces. Given distance metrics from two or more different spaces, a rule for combining the distances across the different biometrics for each person in the gallery can be applied. The ranks can then be determined based on the combined distances. Scores from each modality need to be normalized to be comparable to each other prior to fusion. There are several ways of transforming the scores, including linear, logarithmic, exponential, logistic, etc. [3]. The scores are normalized so that the range is [0, 100] for each modality. There are several ways of combining different metrics to achieve the best decision process, including the majority vote, sum rule, multiplication rule, median rule, min rule, average rule and other combination rules (see Tables 1.3 and 1.4 in Chapter 1). It is known that the sum rule and the multiplication rule provide generally plausible results [3, 49, 74]. A weight is estimated based on the distribution of the top three ranks in each space. The motivation is that a larger distance between the first- and second-ranked matches implies greater certainty that the first-ranked match is correct. The level

of the certainty can be considered as a weight representing that certainty. The weight can be applied to the scores (metrics) as the combination rules are applied. The multi-modal decision is made as follows. First the probe is matched against the gallery in each of the spaces. This gives one set of N distances per space, where N is the size of the gallery. A plain sum-of-distances rule would sum the distances for each gallery subject and select the subject with the smallest sum. We use a confidence-weighted variation of the sum-, multiplication- and min-of-distances rules. For each space, a confidence is computed using the first three distances as follows:

Weight_ij = (metric_2 - metric_1) / (metric_3 - metric_1)

for the ith gallery in the jth modality, where metric_k is the kth closest distance measure to a gallery from the observed probes. If the difference between the first and second distance metric is large compared to the typical distance, then this confidence value will be large. The confidence values are used as weights in the metric fusion.

3.3 Data Collection

Images were acquired at the University of Notre Dame between January and May. Two four-week sessions were conducted for data collection, approximately six weeks apart. The first session is to collect gallery images and the second session is to collect probe images for a single probe study. For a study with multiple probes, an image acquired in the first week is used as a gallery and images acquired in multiple later weeks are used as probes. Thus, in the single probe study, there are at least six and as many as thirteen weeks time lapse between the acquisition of a gallery image and its probe image, and at least one and as many as thirteen weeks time lapse between the gallery and the probe in the

multiple probe study. A total of 275 different subjects participated in one or more data acquisition sessions. Among the 275 subjects, 200 participated in both a gallery acquisition and a probe acquisition. Thus, there are 200 individuals in the single probe set, the same 200 individuals in the gallery, and 275 individuals in the training set. The training set contains the 200 gallery images plus an additional 75 for subjects for whom good data was not acquired in both the gallery and probe sessions. For the multiple probe study, 476 new probes are added to the 200 probes, yielding 676 probes in total. The training set of 275 subjects is the same as the set used in the single probe study. In each acquisition session, subjects were imaged using a Minolta Vivid 900 range scanner. Subjects stood approximately 1.5 meters from the camera, against a plain gray background, with one front-above-center spotlight lighting their face, and were asked to have a normal facial expression (FA in FERET terminology [67]) and to look directly at the camera. Almost all images were taken using the Minolta's Medium lens, and a small number of images were taken with its Tele lens. The height of the Minolta Vivid scanner was adjusted to the approximate height of the subject's face. The Minolta Vivid 900 uses a projected light stripe to acquire triangulation-based range data. It also captures a color image near-simultaneously with the range data capture. The result is a 640 by 480 sampling of range data and a registered 640 by 480 color image. The average number of 3D data points acquired per image is approximately 83,000.

3.4 Experiments

There are three main parts to this study. The first part is to examine how the recognition performance is affected by the X-Y resolution in both 2D and 3D and

Figure 3.5. Examples of masked images in 2D and 3D: a study of one gallery with four probes, and a study of one gallery with three probes.

depth resolution in 3D data. The second part is to evaluate the performance of 2D and 3D independently in both single and multiple probe studies. Data fusion is considered, in the third part, to combine results at the metric level with different fusion strategies. The eigenvectors for each face space are tuned by dropping the first M and last N eigenvectors to obtain a tuned set of eigenvectors. Thus, in general we expect to have a different set of eigenvectors representing the 2D face space versus the 3D face space. The cumulative match characteristic (CMC) curve is

generated to present the results. Both the M and N values are fixed throughout the experiments.

X-Y Resolution Effects in Identification Accuracy

This experiment looks at the performance rate changes while the spatial resolution is varied in the texture and shape images. On average, one pixel along the X axis produced by the Minolta Vivid 900 covers 0.98mm of the surface, and one pixel along the Y axis covers 0.98mm. A typical template size that we initially used was 130 x 150 pixels (a face coverage area of approximately 12.7cm x 14.7cm). Figure 3.6 shows examples of both 2D (top row) and 3D (bottom row) images used for this experiment, starting from the rightmost, converted to 25%, 50%, 75% and 100% of the original dimension. Thus, every pixel is retrieved in steps of 3.97mm, 1.96mm, 1.31mm and 0.98mm from the original X and Y data points in each image set. The performance results are shown in Figure 3.7. The graph is plotted using the first rank match performance rate. Both performance curves begin to drop at the resolution of 1.31mm in X-Y (in 2D, from 89.0% to 85.0%, and in 3D from 94.5% to 89.5%). However, the spatial resolution changes attempted in both 2D and 3D suggest that there is no significant difference in performance rates from the original resolution. We believe that performance degradation results from undersampling the face and missing differentiating features. The steep performance drop between 50% and 25% seems to be due to insufficient facial features for the PCA method to differentiate between subjects. However, the original template size (130 x 150) is used for the rest of the experiments.
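A minimal sketch of one way to emulate these coarser X-Y samplings is shown below: the image is downsampled to the desired fraction of the original grid and then resampled back to the template size, so that the PCA input dimension is unchanged. The use of scipy.ndimage and the resampling orders are assumptions for illustration, not a description of the exact procedure used here.

```python
from scipy import ndimage

def coarsen_xy(image, fraction):
    """fraction: 1.0, 0.75, 0.5 or 0.25 of the original X-Y sampling."""
    h, w = image.shape
    small = ndimage.zoom(image, fraction, order=1)       # drop spatial detail
    # nearest-neighbour resampling back to the original 130 x 150 template
    return ndimage.zoom(small, (h / small.shape[0], w / small.shape[1]), order=0)
```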

Figure 3.6. 2D and 3D images in different X-Y (spatial) resolutions.

Depth Resolution Effects in Identification Accuracy

This experiment examines the depth resolution required to maintain the performance rate of the original depth resolution. According to the Minolta Vivid 900 specifications, it is capable of achieving a depth resolution accuracy of 0.35mm. One way to vary the original resolution is to change the precision level in the floating point values of the Z coordinate. A lower limit on the reported precision could be 10^-6 mm. However, the camera-to-subject distance and lens combination used in our acquisition likely support an actual depth resolution of no better than about 0.5mm on average. Fourteen different resolutions were examined, so that every pixel value representing the actual coordinate is rounded to the nearest 10^-6 mm, 10^-5 mm, 10^-4 mm, 10^-3 mm, 10^-2 mm, 10^-1 mm, 0.5mm, 1mm, 2mm, 3mm, 4mm, 5mm, 6mm and 7mm, as shown in Figure 3.8. As shown in Figure 3.9, the general trend is that the performance rate decreases as the depth resolution gets coarser.

Figure 3.7. Rank-one recognition rates for 2D and 3D under X-Y (spatial) resolution changes (default resolution, 75%, 50%, and 25%).

Between 10^-6 mm and 0.5mm depth resolution accuracy, the performance rate is essentially flat. There is a drop in performance going from 0.5mm to 1.0mm, which would coincide with our estimate of the average depth resolution accuracy. Performance degradation is especially pronounced after 3mm. However, it is interesting to note that the performance rates between 0.5mm and 3mm remain remarkably close to that at the original resolution (within 2.5%). This may be partially because, as the resolution gets coarser, random noise is suppressed. As it gets even coarser, the face surface becomes overly contoured and identification suffers from such coarsely quantized surfaces.
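The depth-coarsening step itself is a simple rounding of the Z coordinate; a minimal sketch (names illustrative) is:

```python
import numpy as np

def quantize_depth(z_image, step_mm):
    """Round every depth value to the nearest multiple of step_mm
    (e.g. 1e-6, 0.5, 1, ... 7 mm), leaving the X-Y sampling untouched."""
    return np.round(z_image / step_mm) * step_mm
```

Sweeping step_mm over the fourteen values listed above and re-running the PCA matching at each setting would produce the kind of curve plotted in Figure 3.9.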

Figure 3.8. Examples of images in different depth resolutions (in mm), shown from top to bottom and left to right: 10^-6, 10^-5, 10^-4, 10^-3, 10^-2, 10^-1, 0.5, 1, 2, 3, 4, 5, 6 and 7 (mm).

Experimental Results: 2D versus 3D face - Single probe study

The purpose of this experiment is to investigate the performance of the individual 2D eigenface and 3D eigenface methods, given (1) the use of the same PCA-based algorithm implementation, (2) the same subject pool represented in the training, gallery and probe sets, and (3) the controlled variation in one parameter, time of image acquisition, between the gallery and probe images. A similar comparison experiment between 2D and 3D acquired using a stereo-based system was also reported by Medioni et al. [58]. During the eigenvector tuning process, the rank-one recognition rate remains basically constant with from one to 20 eigenvectors dropped from the end of the

Figure 3.9. Performance results (rank-one match) in different depth resolutions (unit: mm).

Figure 3.10. Performance results in the single probe study: CMC curves for 2D eigenfaces, 3D eigenfaces, and fusion (linear sum).

list. This probably means that more eigenvectors could be dropped from the end to create a lower-dimension face space, which would make the overall process simpler and faster. The rank-one recognition rate for dropping some of the first eigenvectors tends to improve at the beginning, but it starts to decline as M gets larger. After the eigenvectors are tuned, both 2D and 3D coincide at M = 3 and N = 0 for creating the face spaces. With the given optimal set of eigenvectors in 2D and 3D, the results show that the rank-one recognition rate is 89.0% for 2D and 94.5% for 3D (see Figure 3.10).

Experimental Results: Multi-modal biometrics using 2D and 3D

The purpose of this experiment is to investigate the value of a multi-modal biometric using 2D and 3D face images, compared against the individual biometrics. The null hypothesis for this experiment is that there is no significant difference in the performance rate between uni-biometrics (2D or 3D alone) and multi-biometrics (both 2D and 3D together). According to Hall [41], a fusion can be usefully done if an individual probability of correct inference is between 50% and 95% with one to seven classifiers. Based on this rule of thumb, it is reasonable to fuse our two individual biometrics. Figure 3.10 shows the CMC with the rank-one recognition rate of 98.5% for the multi-modal biometric, achieved by combining modalities at the distance metric level. Of the fusion methods that we considered, the multiplication rule was the most consistent across different score transformations. The min rule gave lower performance than any of the other rules across different score transformations (see Figure 3.12). Also, when the distance metrics were weighted based on the confidence level during the decision process,

all the rules resulted in significantly better performance than the individual biometrics. A McNemar's test for significance of the difference in accuracy in the rank-one match between the multi-modal biometric and either the 2D face or the 3D face alone shows that multi-modal performance is significantly greater, at the 0.05 level. However, there is no significant difference between the 2D and 3D biometrics in the single probe study.

Figure 3.11. Performance results in the multiple probe study: CMC curves for 2D eigenfaces, 3D eigenfaces, and fusion (linear sum).

Experimental Results: 2D face versus 3D face in biometrics - multiple probe study

In these experiments, there will be one or more probes for a subject who appears in the gallery, with each probe being acquired in a different acquisition session separated by a week or more. The multiple probe dataset consists of

676 probes in total. Subjects might have a different number of probes. For example, there are 200 subjects with 1 or more probes, 167 subjects with 2 or more probes, 121 with 3, 86 with 4, 59 with 5, 29 with 6, 13 with 7 and 1 with 8 probes, yielding 676 in total. In the probe dataset, the number of probes can be up to 7 per subject. There might be different rules to determine a correct match given several probes to a gallery. In this experiment, a correct match is measured based on each individual probe rather than on some function of all probes per subject. By using the same set of eigenvectors tuned in the single probe study, we achieved results similar to those in the previous sections. While 3D performance dropped a little, to 92.8%, 2D performance is slightly better than in the previous experiment, at 89.5% (see Figure 3.11). After combining these two biometrics in the multiple probe study, we again obtained significantly better performance, at 98.8%, than for either 2D or 3D alone. The results of the 2D and 3D combination show very similar performance behavior to the single probe study. The product rule performs better than the minimum rule regardless of the score transformation (see Figure 3.12). Most combination methods consistently perform significantly better than the single biometrics. A McNemar's test for significance of the difference in accuracy in the rank-one match between the multi-modal biometric and either the 2D face or the 3D face alone shows that multi-modal performance is significantly greater, at the 0.05 level. Thus, significant performance improvement has been accomplished by combining 2D and 3D facial data in both the single and multiple probe studies. Also, there is a significant difference between the 2D and 3D biometrics in the multiple probe study.

Figure 3.12. Performance results (rank-one match) of the fusion schemes used: sum (S), product (P), and minimum (M) rules under exponential, linear, logarithmic, and weighted linear score transformations, for the single and multiple probe studies (asterisks mark the best performance rate).

3.5 Summary and Discussion

The value of multi-modal biometrics with 2D and 3D facial data is examined in a single probe study and a multiple probe study. The results show that each modality of facial data has roughly similar value as an appearance-based biometric. The combination of the face data from both modalities results in a statistically significant improvement over either individual biometric. In general, our results appear to support the conclusion that the path to higher accuracy and robustness in biometrics involves the use of multiple biometrics rather than the best possible sensor and algorithm for a single biometric. We also have investigated the effect of spatial and depth resolution on recognition performance. This was done by producing successively coarser versions of the original image. The original image has a depth accuracy of 0.35mm. We found that performance drops only slightly in going to a depth resolution of 0.5mm, but begins to drop drastically at 3mm. The pattern of results suggests that it would be interesting to determine the sensor accuracy level needed to meet a specific requirement of face recognition tasks. The accuracy requirement might vary under different conditions, such as the number of subjects, facial muscle movement, or imaging condition changes. Additional investigation of resolution variation would result in a more objectively decided resolution level for further experiments. The overall quality of 3D data collected using a range camera is perhaps not as reliable as 2D intensity data. 3D sensors in the current market are not as mature as 2D sensors. Common problems with typical range finder images include missing data in the eyes, cheeks, or forehead, as well as several types of noise. These problems would lower the 3D recognition rate in general, even though there exist ways of recovering some data in such areas.

CHAPTER 4

MULTIPLE BIOMETRICS USING 2D, 3D AND INFRA-RED FACE

4.1 Introduction

This chapter extends the multi-biometrics approach proposed in Chapter 3 by considering long-wave infra-red (IR) images that capture the pattern of emitted heat. Chen et al. have looked at multi-modal face recognition using infra-red (IR) plus 2D [29]. They found that IR face recognition did not perform as well as 2D face recognition when there was substantial time lapse between the acquisition of the enrollment image and the image to be recognized, but that the combination of IR plus 2D performed better than either one alone. Other IR-related studies for the task of person identification can be found in [75, 85]. Our own previous work in 3D plus 2D [22] found that face recognition using 3D performed slightly better than using 2D, and also found that the combination of the two performed better than either one individually. This chapter extends the approach taken in Chapter 3 to the comparison and combination of all three modalities, 2D, 3D and IR [25]. One null hypothesis statement for this work is that there is no significant difference in recognition performance between the three individual modalities. Another

1 This chapter is based on the paper "Multimodal Biometrics Using Appearance, Shape and Temperature," presented at The 6th IEEE International Conference on Automatic Face and Gesture Recognition.

is that there is no difference between multi-modal performance and individual performance. We use a PCA-based eigen-face algorithm as the recognition engine for all three image modalities, and tune each face space using the same size set of training images from the same set of people. Thus, in a sense we are looking at the relative power of each modality given an equalized power of the recognition engine. Our test data specifically incorporates time lapse between the acquisition of gallery and probe images of a person. For any given person, the gallery image is our earliest image acquired for the person and all probe images are acquired at least one week later. Where we have multiple probe instances for a person, each successive probe is acquired at least an additional week later. It is already known that time lapse between gallery and probe images results in lower performance than if the gallery and probe images are acquired in the same session [24, 60, 66].

4.2 Methodology

4.2.1 Normalization of IR Images

As described in Chapters 2 and 3, the PCA method is used for the 2D and 3D face recognition experiments in this chapter. The method can be easily adapted to 2D and 3D images as well as to images of the heat pattern. Every pixel in a depth image represents a distance value from the 3D image plane to the object in the scene, whereas temperature values are normalized to the range 0 to 255 in the IR images. Although each subject is asked to look directly at the camera during the acquisition, it is inevitable that the data contain some level of pose variation between acquisition sessions. Both 2D and IR image data are typically treated as having pose variation only around the Z axis, the optical axis. The PCA software uses

two landmark points (the eye centers) for geometric normalization to correct for rotation, scale, and position of the face for 2D matching. However, while histogram equalization is applied to normalize the brightness level in 2D images, only geometric normalization is applied to IR images.

Figure 4.1. Examples of 2D facial appearance (first row), 3D facial shape (second row) and IR facial heat pattern (third row): one gallery image and four probe images of a subject. Darker intensity represents a greater depth value in the 3D image and a lower temperature in the IR image.

4.2.2 Eigenvector Tuning

In this chapter, in order to select the optimal set of eigenvectors used in face space creation, a validation set is introduced. It consists of a gallery set of single images of 29 distinct subjects and a probe set of 3 images of the same 29 subjects, for each modality. All of the images used in the validation set are excluded from the later experiments. At first, one vector at a time is dropped from the eigenvectors with the largest eigenvalues, and the rank-one recognition rate is recomputed on the validation set each time, continuing until a point is reached where the rank-one recognition rate gets worse rather than better. We denote the number of dropped eigenvectors with the largest eigenvalues as M. Also, one vector at a time is dropped from the eigenvectors with the smallest eigenvalues, and the rank-one recognition is recomputed on the validation set each time, continuing until a point is reached where the rank-one recognition rate gets worse rather than better. We denote the number of dropped eigenvectors with the smallest eigenvalues as N. The rank-one recognition rate for dropping some of the first eigenvectors tends to improve at the beginning, but it starts to decline as M gets larger. After M and N are determined separately for each modality, these numbers are used for the face space creation in the later experiments without any further tuning (M = 4 and N = 0 for 2D, M = 1 and N = 4 for 3D, and M = 1 and N = 4 for IR).

4.2.3 Data Fusion

The same procedure as described in the Data Fusion section of Chapter 3 was used in this experiment.
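For reference, a minimal sketch of that confidence-weighted fusion — the weight defined in Chapter 3 applied as a per-modality multiplier in a sum-of-distances rule — is shown below. The score normalization to a common range is assumed to have been done already, and all names are illustrative.

```python
import numpy as np

def confidence_weighted_sum(distances_per_modality):
    """distances_per_modality: list of 1-D arrays, one per modality, each holding the
    (already normalized) distances from one probe to every gallery subject.
    Returns the index of the gallery subject with the smallest weighted sum."""
    fused = np.zeros(len(distances_per_modality[0]), dtype=float)
    for d in distances_per_modality:
        d = np.asarray(d, dtype=float)
        first, second, third = np.sort(d)[:3]
        # Weight from Chapter 3: a large gap between the 1st and 2nd match, relative
        # to the 1st-to-3rd gap, means higher confidence in this modality's decision.
        weight = (second - first) / max(third - first, 1e-12)
        fused += weight * d
    return int(np.argmin(fused))
```

The product and min rules differ only in how the weighted distances are accumulated across modalities.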

Figure 4.2. The eigenvector images for the seven largest eigenvalues used for each face space: 2D (5th through 11th eigenvectors), 3D (2nd through 8th), and IR (2nd through 8th).

4.3 Data Collection

Images were acquired at the University of Notre Dame between January and May. Two four-week sessions were conducted for data collection, approximately six weeks apart. Thus, in the multiple probe study, there is at least one and as many as thirteen weeks time lapse between the gallery and the probe. A total of 191 different subjects participated in one or more data acquisition sessions. Twenty-nine subjects are used for the validation set (see the Eigenvector Tuning subsection), which consists of a gallery set of 29 images and a probe set of 87 images (3 images per subject). The remaining 162 subjects, who are not included in the validation set, are used in the later experiments. The 127 subjects who participated more than once are in the

gallery set, and their subsequent images, 297 in total, are included in the probe set. The other 35 subjects, for whom good data was not acquired in both the gallery and probe sessions, are used along with the gallery as the training set. In each acquisition session, subjects were imaged using a Minolta Vivid 900 range scanner for 2D and 3D images. Also, IR images were acquired with a Merlin Uncooled long wavelength IR camera, which provided a real-time, 60Hz, 12-bit digital data stream. Subjects stood approximately 1.5 meters from the camera, against a plain gray background, and were asked to have a normal facial expression (FA in FERET terminology [67]) and to look directly at the camera. The Minolta Vivid 900 uses a projected light stripe to acquire triangulation-based range data. It also captures a color image near-simultaneously with the range data capture. The result is a 640 x 480 sampling of range data with a registered 640 x 480 color image; the IR images are produced separately by the Merlin camera.

4.4 Experiments

There are two parts to the experimental results. The first part looks at the performance of the individual image modalities. The second part looks at multi-modal performance. Results are presented in the form of CMC curves and ROC curves. The ROC curve is relevant to a verification scenario; that is, a scenario in which someone makes a claim for their identity and the biometric system is used to verify the identity (one-to-one matching). The CMC curve is relevant to an identification scenario; that is, a scenario in which no initial identity claim is made and the observed biometric needs to be compared against all enrolled images (one-to-many matching). The rank-one recognition rate is the fraction of the probe images for which the closest match in the gallery is the correct match.
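A minimal sketch of how the rank-one rate and the CMC curve can be computed from a probe-by-gallery distance matrix follows; the variable names are illustrative.

```python
import numpy as np

def cmc_curve(dist, probe_ids, gallery_ids):
    """dist[i, j]: distance between probe i and gallery subject j.
    Returns cumulative match rates; the entry at index 0 is the rank-one rate."""
    gallery_ids = np.asarray(gallery_ids)
    ranks = []
    for i, pid in enumerate(probe_ids):
        order = np.argsort(dist[i])                      # gallery sorted best match first
        ranks.append(int(np.where(gallery_ids[order] == pid)[0][0]))
    ranks = np.asarray(ranks)
    return np.array([np.mean(ranks <= r) for r in range(len(gallery_ids))])
```

Here cmc_curve(dist, probe_ids, gallery_ids)[0] gives the identification-scenario rank-one rate quoted throughout this chapter; the verification-scenario ROC is obtained instead by sweeping a threshold over the match and non-match distances.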

A McNemar's test is used to test for the statistical significance of the observed difference between rank-one recognition rates.

Experimental Results: Single biometrics

This experiment is to investigate the performance of the individual 2D, 3D and IR eigenfaces. The null hypothesis is that there is no significant difference in the recognition rate between the modalities, given (1) the use of the same PCA-based algorithm implementation, (2) the same subject pool represented in the training, gallery and probe sets, (3) the controlled variation in image acquisition time between the gallery and probe images, and (4) an individually tuned face space for each modality using the validation set.

Figure 4.3. Single biometrics performance results in ROC and CMC curves (2D, 3D and IR using PCA, and 2D using FaceIt).

The EERs in the verification scenario for each of the single

modalities are 0.02 for 2D, 0.03 for 3D and 0.06 for IR. The results show that the rank-one recognition rates of each modality in the identification scenario are 90.6% for 2D, 91.9% for 3D and 71.0% for IR (see Figure 4.3). The difference between 2D and 3D in rank-one recognition rates is clearly not statistically significant. However, IR shows significantly lower performance than 2D or 3D. Thus, the results of our experiment provide evidence for rejecting the null hypothesis in this case. We find a statistically significant difference in accuracy in PCA-based recognition using 2D or 3D face data compared to IR face data. A commercial face identification software package, FaceIt (Version G3), is considered as a separate experiment here to provide a relative performance of the given dataset against the eigen-face method used in the three modalities. FaceIt performs at a rank-one recognition rate of 84.5% on the same 2D gallery and probe set, which is lower than our tuned 2D eigenface method.

Experimental Results: Multiple biometrics

In this experiment, the value of a multi-modal biometric using 2D, 3D and IR face images is investigated and compared against the individual biometrics. The null hypothesis for this experiment is that there is no significant difference in the performance rate between single biometrics and multi-biometrics. The ROC curves obtained by the multi-modal biometrics are presented in Figure 4.4. The EERs for the multi-modal biometrics were also computed for 2D+3D, 2D+IR and 3D+IR. Also, the CMC curves are shown with the rank-one recognition rates of 98.7%, 96.6% and 98.0% for 2D+3D, 2D+IR and 3D+IR, respectively. A McNemar's test for significance of the difference in accuracy between the dual modalities and the single modalities shows that dual modality performance is significantly greater, at the 0.05 level.

Figure 4.4. Multi-modal biometrics performance results in ROC and CMC curves (2D+3D, 2D+IR, 3D+IR and 2D+3D+IR; dashed curves show the single biometrics).

Later, all the modalities are combined to form a multi-modal biometric with all three facial features. Figure 4.4 shows a 100% rank-one match rate and the equal-error rate for the given gallery and probe set. Because of the high performance already observed for the bi-modal results (i.e., the rates were already saturated), the improvement shown in this combination is not significant. However, the product rule consistently shows the best performance regardless of the normalization method (Table 4.1).

4.5 Summary and Discussion

We first compared face recognition performance using the 2D images, 3D images, and IR images individually. Images were acquired at image acquisition sessions scheduled on a weekly basis. For a given person at a given acquisition session, IR, 2D, and 3D images are acquired in a short time period (about 10 minutes). This allows the training, gallery, and probe sets for each modality to

Figure 4.5. Example matches improved by the integrated schemes; the three images in the left column are the gallery images of each modality, and the three images in the right column are the probe images. In the first example, the IR probe was not recognized correctly, but when its metrics were combined with those of 2D or 3D the probe was correctly identified. In the second, the IR and 2D probes each failed, and the person was correctly identified only when the multiple biometrics included 3D (2D+3D and 2D+3D+IR). In the third, all of the single-modal biometrics failed, yet all of the multi-modal biometrics correctly identified the subject. In the fourth, the 2D and IR probes were identified correctly while the 3D probe was not, and all of the combined methods correctly identified the subject.

contain comparable images of the same persons. We found that there was no statistically significant difference in performance between 2D and 3D, but that the performance of IR was statistically significantly lower than either of these. We also compared the performance of the individual modalities with multiple modalities. We found that each of the multi-modal combinations improved over all of the individual modalities, and that the multi-modal 2D+3D+IR performed best of all. The differences between the various multi-modal performances were found not to be statistically significant. However, all of the performances were high in an absolute sense, and so the lack of significant differences may be due to saturation. The overall high performance is likely due at least in part to the inclusion of the gallery images in the training set for the face space. A larger total data set would allow experiments to be performed with a training set whose subjects are all distinct from the test set, and this would give more realistic performance estimates. Like the other multi-modal biometric studies listed in Tables 1.3 and 1.4, our results showed that multi-modal biometrics obtained higher performance than the individual biometrics. However, the relevance of this type of result to practical application has to be considered carefully. An important question is the degree to which the performance improvement seen by going to multi-modal biometrics is due to the additional imaging modality versus simply the additional image. It is known that combining the recognition results from multiple images of the same modality improves on the performance from using a single image [24, 49, 60, 68]. Also, acquisition of images of a single modality is almost certainly simpler and cheaper than acquisition of multi-modal images. Therefore, a better experimental comparison to test whether multi-modal truly improves over an individual

modality is to compare against multiple images of the baseline modality. More specifically, the results of multi-modal 2D+3D+IR recognition might be compared against the results of recognition using three 2D images. This would equalize the number of images used to obtain each recognition result. It would mean that for recognition using 2D images, a person would be represented by a set of three 2D images in the gallery, and a probe instance for attempting to recognize the person would also consist of three images. The particular method for controlling the variation among the three 2D images used to represent a person needs to be decided, and will have a direct impact on the results. The method for matching the multiple-image-per-person probe to the multiple-image-per-person gallery will also impact the results. The 2D+3D+IR result comes from the sum of three match distances, one in each modality's face space. But using three 2D images to represent a person in the gallery and as a probe means that the final match decision can be based on a sum of nine match distances. One element of our planned future research is to evaluate the multi-modal recognition result using this more rigorous method of comparison to multi-image 2D recognition.

There are still limitations in projecting the presented results to a practical setting. However, the results show a pattern of improvement as reasonable biometric sources that represent a certain degree of complementary facial information are combined. There may still be some biometrics algorithm, other than PCA, for which one of the 2D face or the 3D face offers statistically significantly better recognition performance than the other. Also, there may be particular application scenarios in which it is not practical to acquire 2D and 3D face images that meet similar quality control conditions. Even though images were collected with attempts to control lighting, background and facial expression, there is still some

TABLE 4.1
RANK-ONE RECOGNITION RATES ACHIEVED BY FUSION METHODS IN 2D+3D+IR

(Normalization)   Sum       Product   Minimum
Linear            99.7%     100.0%    91.6%
Logarithmic       100.0%    100.0%    91.6%
Exponential       91.6%     99.7%     91.6%
Weighted          99.0%     99.7%     96.3%

degree of limitation that cannot be controlled, such as slight movement around the lips area or eyes area. This affects the performance rate since it actually changes the shape of the face data around the missing area.

It is generally accepted that performance estimates for face recognition will be higher when the gallery and probe images are acquired in the same acquisition session, compared to performance when the probe image is acquired after some passage of time [66]. As little as a week's time is enough to cause a substantial degradation in performance [33]. While many performance results reported in the literature are obtained with datasets where the probe and gallery images are acquired in the same session, most envisioned applications seem to occur in a scenario in which the probe image would be acquired some time after the gallery image.
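As a concrete illustration of the score normalization and fusion rules compared in Table 4.1 above, the following is a minimal sketch in Python. It is not the implementation used in this work: the min-max (linear) normalization and the sum, product and minimum rules follow their standard definitions, match scores are treated as distances (smaller is better), and all function names and values are illustrative.

```python
import numpy as np

def linear_normalize(scores):
    """Min-max (linear) normalization of a vector of match distances to [0, 1]."""
    scores = np.asarray(scores, dtype=float)
    lo, hi = scores.min(), scores.max()
    return (scores - lo) / (hi - lo) if hi > lo else np.zeros_like(scores)

def fuse(score_lists, rule="sum"):
    """Combine normalized match scores from several modalities for one probe.

    score_lists: list of 1-D arrays, one per modality, each holding the
    normalized distance of the probe to every gallery subject."""
    stacked = np.vstack([linear_normalize(s) for s in score_lists])
    if rule == "sum":
        return stacked.sum(axis=0)
    if rule == "product":
        return stacked.prod(axis=0)
    if rule == "minimum":
        return stacked.min(axis=0)
    raise ValueError("unknown rule")

# Toy example: distances of one probe to five gallery subjects in 2D, 3D and IR.
scores_2d = [0.41, 0.12, 0.55, 0.48, 0.60]
scores_3d = [0.30, 0.10, 0.52, 0.44, 0.47]
scores_ir = [0.25, 0.33, 0.29, 0.51, 0.39]
fused = fuse([scores_2d, scores_3d, scores_ir], rule="sum")
print("identified gallery index:", int(np.argmin(fused)))  # smallest combined distance wins
```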

CHAPTER 5

MULTI-MODAL 2D+3D VERSUS MULTI-SAMPLE 2D

5.1 Introduction

It is often proposed that more accurate biometric recognition can be achieved by integrating multiple biometric sources, an approach called multi-biometrics. (This chapter is based on the paper "An Evaluation of Multi-modal 2D+3D Face Biometrics," published in IEEE Transactions on Pattern Analysis and Machine Intelligence.) Various studies on different multi-biometric possibilities have all shown that a combination of biometrics can outperform individual biometrics. Tables 1.3 and 1.4 summarize important details of prior studies that report on multi-modal biometrics. One main purpose of multi-biometrics is to reduce the ambiguity between domain experts (the individual matchers) and thereby improve accuracy. The use of multiple biometric sources should also make it harder for an intruder to fake multiple biometric traits simultaneously [71]. Also, by considering multiple biometric sources, the recognition system can enroll and recognize more subjects.

All of the studies reported in Tables 1.3 and 1.4 claim that the multi-modal performance improves over that of a single modality. However, at least two effects are mixed together in such results: (1) improvement resulting from the presence of multiple samples of data, and (2) improvement resulting from samples collected from multiple modes of sensing. It is not sufficient for the multi-modal

2D+3D face recognition to show improvement over single-sample 2D face recognition. Instead, since multi-modal 2D+3D recognition uses two samples of data, it should show improvement over using two 2D samples for recognition. One goal of this evaluation scheme is to determine how much of the improvement gained by multiple-modality recognition actually comes from the complementary information acquired by different sensors, versus simply using multiple image samples of a single mode to represent each subject.

5.2 Methodology

The PCA method and normalization steps described in Chapter 3.2 were applied in this experiment for both 2D and 3D face recognition.

5.2.1 Data Collection

Two four-week sessions were conducted for data collection, with approximately six weeks' time lapse between these two sessions. A gallery image is one image that is enrolled into the system. A probe image is one test image to be matched against the gallery for recognition. The gallery image of a person is selected from the earliest session in which valid images were acquired for the person. In our experiment that has one probe instance for each person (the single probe study, i.e., an experiment whose probe set has a single image per subject), there are at least six and as many as fourteen weeks of time lapse between the acquisition of a gallery image and its corresponding probe image. In our experiment that has multiple probe instances per subject (the multiple probe study, i.e., an experiment whose probe set has multiple images per subject; multiple-sample, in contrast, refers to a type of multiple biometrics in which multiple images are used to represent a single person), there may be up to seven probe instances for a subject. In this case,

Figure 5.1. Two different sessions, acquired about 6 weeks apart. Four 2D images (left to right: FALM, FBLM, FALF and FBLF) and one 3D image (rightmost: FALT) of a subject are acquired in each session.

there is at least one and as many as fourteen weeks of time lapse between the gallery and a probe. For each subject, one 3D scan and four 2D images were acquired in each acquisition session. Subjects were imaged using a Minolta Vivid range scanner for the 3D scan. The height of the Minolta Vivid 900 scanner was adjusted to the approximate height of the subject's face, if needed. The 2D images were acquired with a Canon PowerShot G2 digital camera. (Specific manufacturers and models are mentioned only to specify the work in detail and do not imply an endorsement of the equipment or vendors.) Each subject was asked to have one normal expression (FA) and one smile expression (FB), once with three spotlights turned on (LM) and a second time with two side spotlights turned on (LF). The dimension of each imaging modality is different. A range image is produced by the 3D scanner, and 1,704 × 2,272 (approximately 3.9 mega-pixel resolution) color

images are produced by the 2D camera. A total of 275 different subjects participated in one or more data acquisition sessions. Of the 275 subjects, 198 had usable data acquired in 2 or more acquisition sessions. Thus, the single probe study has 198 individuals in the probe set, the same 198 individuals in the gallery, and 77 individuals in the training set who are not in the gallery. For the multiple probe study, 473 probes are added to the single probe study dataset, yielding 670 probes in total. The training set of 77 subjects is the same as that used in the single probe study.

Because 3D image acquisition takes more time, just one 3D image was acquired for each subject at each acquisition session, with normal facial expression and one top light on (FALT). There are a total of 945 3D face images used in the study. There are 275 images, one for each of the 275 subjects: 77 images used as a training set and 198 images used as a gallery set. There are another 670 images of the 198 subjects in the gallery used for the probe set.

There are four 2D images taken of each subject at each acquisition session, representing combinations of two facial expressions and two lighting conditions: FALM, FBLM, FALF, and FBLF (see Figure 5.1). This allows for some choice of which 2D image(s) are used to represent a subject in the gallery and probe sets for a 2D recognition experiment. Combining all possible choices of image condition for both the gallery and the probe set, there are sixteen possible variations of the basic 2D recognition experiment. It is also possible to formulate a 2D recognition experiment in which more than one image is used to represent a person; this is discussed more in a later section.
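The following is a small, hypothetical sketch of how gallery and probe sets could be assembled from acquisition-session records according to the rules just described (the earliest valid session enrolls the gallery image; later sessions supply probes). The record format, the field names, and the choice of the earliest later session for the single probe study are assumptions made only for this illustration.

```python
from collections import defaultdict

def build_sets(records):
    """records: iterable of (subject_id, session_week, image_path) tuples (hypothetical format)."""
    by_subject = defaultdict(list)
    for subject_id, session_week, image_path in records:
        by_subject[subject_id].append((session_week, image_path))

    gallery, single_probe, multiple_probe = {}, {}, defaultdict(list)
    for subject_id, shots in by_subject.items():
        shots.sort()                      # earliest acquisition first
        if len(shots) < 2:
            continue                      # subjects with only one session cannot be probed
        gallery[subject_id] = shots[0][1]          # earliest valid image enrolls the subject
        single_probe[subject_id] = shots[1][1]     # one later image per subject (assumed choice)
        multiple_probe[subject_id] = [p for _, p in shots[1:]]  # all later images
    return gallery, single_probe, multiple_probe

records = [("s001", 1, "s001_w1.png"), ("s001", 8, "s001_w8.png"),
           ("s002", 2, "s002_w2.png"), ("s002", 9, "s002_w9.png"), ("s002", 11, "s002_w11.png")]
gallery, single_probe, multiple_probe = build_sets(records)
print(len(gallery), len(single_probe), sum(len(v) for v in multiple_probe.values()))
```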

5.2.2 Eigenvector Tuning

For the experiments reported in this chapter, the face space is created from a training set of 2D and 3D images for 77 subjects. This is similar to the approach described in Chapter 3. When images from all four acquisition conditions for 2D are used for training, there are 308 (77 × 4) training images. Tuning the face space resulting from these 308 training images gives M = 26 and N = 62 for the 2D image face space, and M = 3 and N = 6 for the 3D image face space (see Figure 5.2). Previous researchers have reported dropping fewer than 26 of the largest eigenvectors in tuning the face space. However, note that the set of training images used here explicitly incorporates variation in lighting condition and facial expression. This naturally leads to an increased number of eigenvectors being dropped among those with the largest eigenvalues, which represent image variation that is not relevant to subject identity.

5.3 Experiments

The baseline performance for our experiments is 2D face recognition using FALM images in the gallery and the probe set. We will refer to this as FALM:FALM, where the labels are interpreted as a match between GALLERY and PROBE. (Imaging conditions denoted with four letters written in italics indicate a gallery set, e.g. FBLM; otherwise a probe set.) Use of a single image of a single modality, single-modality-single-sample, to represent a person in the gallery and as a probe is denoted SMSS. SMSS performance is investigated for 2D and 3D independently in Section 5.3.1. Next, single-modality-multiple-sample (denoted SMMS) in 2D is investigated by adding a second 2D image to the representation of a person.
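As a concrete illustration of the eigenvector tuning described in Section 5.2.2 above, the following sketch builds a PCA face space from a training matrix, drops the first M and last N eigenvectors, and performs nearest-neighbour identification in the reduced space. It is a simplified outline rather than the implementation used in this work: plain Euclidean distance in face space is assumed as the metric, random data stand in for normalized face images, and all names are illustrative.

```python
import numpy as np

def build_face_space(train, M, N):
    """train: (num_images, num_pixels) matrix of normalized training images.
    Returns the mean image and the retained eigenvectors (as columns)."""
    mean = train.mean(axis=0)
    centered = train - mean
    # SVD yields the covariance eigenvectors sorted by decreasing eigenvalue.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    eigvecs = vt.T                                   # one eigenvector per column
    kept = eigvecs[:, M:eigvecs.shape[1] - N]        # drop the first M and last N
    return mean, kept

def project(images, mean, basis):
    return (images - mean) @ basis

def rank_one_ids(gallery_feats, probe_feats):
    """Nearest-neighbour identification using Euclidean distance (an assumed metric)."""
    d = np.linalg.norm(probe_feats[:, None, :] - gallery_feats[None, :, :], axis=2)
    return d.argmin(axis=1)

# Toy usage with random data standing in for normalized face images.
rng = np.random.default_rng(0)
train = rng.normal(size=(308, 4096))                 # e.g. 308 training images of 64x64 pixels
mean, basis = build_face_space(train, M=26, N=62)
gallery = project(rng.normal(size=(198, 4096)), mean, basis)
probes = project(rng.normal(size=(198, 4096)), mean, basis)
print(rank_one_ids(gallery, probes)[:5])
```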

Figure 5.2. Rank-one performance rates during the eigenvector tuning process: (A) rank-one rates obtained by dropping the first M eigenvectors; (B) rank-one rates obtained by dropping the last N eigenvectors (curves shown for 2D FALM, 2D FALF, 2D FBLM, 2D FBLF and 3D FALT). M is 3 due to the early performance drop shown in 3D; N is chosen as 6 since all the rank-one recognition rates maintained their original level up to that point.

SMMS is a multi-biometrics approach that fuses multiple image samples of a single biometric source for recognition. Also, multiple-modality-single-sample (MMSS) performance is investigated by combining match scores in 2D and 3D. In MMSS systems, single images acquired with each of multiple sensors are integrated for recognition. The results of these multiple-sample and multiple-modality schemes are reported in the sections that follow. One of our goals is to find out how much of the improvement gained by multiple-modality recognition actually comes from the complementary information acquired by different sensors, versus simply using multiple image samples from a single modality.

Four possible single-image-per-subject gallery sets are collected based on imaging conditions (FALM, FALF, FBLM and FBLF). Four corresponding probe sets are then collected with images for the same subjects enrolled in the gallery but acquired in a later session. SMSS recognition using these galleries and probes

Figure 5.3. The three biometric approaches examined in this study: (A) single-modality single-sample (SMSS); (B) single-modality multiple-sample (SMMS); (C) multiple-modality single-sample (MMSS).

shows the recognition results based on illumination and expression variation in images acquired at different times. For example, in the 2D SMSS scheme shown in Figure 5.3-(A), a probe image (FALM) is matched against the gallery images (FALM). To ensure fairness in the evaluation of the two integrated approaches (SMMS and MMSS), an equal number of images is used for both schemes. The identification is performed by matching two probe images separately against two galleries. After the matches are complete, the scores are normalized. For the SMMS scheme, matching scores from the FALM:FALM system and the FBLM:FBLM system are combined during the decision fusion process, as shown in Figure 5.3-(B). For MMSS, individual results of any of the four 2D images can be integrated with the 3D matching results of every subject acquired on the same day. Figure 5.3-(C) shows one example of MMSS recognition: it fuses the matching scores of 2D and 3D face images, FALM:FALM and FALT:FALT, respectively. Finally, the overall performance rates are presented and analyzed using both CMC and ROC curves.

During the face space creation using 2D images, there are two ways of designing the training set. The first option would use training images acquired under the same imaging condition as the gallery sets. For instance, when the FALF gallery is used for identification, 77 FALF training images can be used to create the face space. This assumes that the imaging conditions of the gallery and the probe are known prior to identification. The other option would include all four sets of 77 training images in one face space, so that the training is based on multiple possible imaging conditions. This does not assume that the imaging conditions of the gallery and the probe are known. These two approaches to creating the face space are compared in Section 5.3.1, and an approach selected

for use throughout the 2D experiments in the SMMS and MMSS schemes.

5.3.1 Single-Modality and Single-Sample (SMSS)

This experiment examines two separate SMSS schemes employing 2D face appearance and 3D face shape. In 2D, four different gallery and probe sets are collected, reflecting two choices of expression (FA or FB) and two choices of illumination (LM or LF). Thus, there are 16 possible recognition results for our 2D data. There is only one pair of gallery and probe in 3D (see Table 5.1). As described in the previous section, face spaces are created in two ways, considering either (1) training images acquired under the same imaging conditions as the gallery images or (2) training images acquired under all four possible conditions. (The rank-one match rates in the former case are written in the parentheses in both Table 5.1 and Table 5.4.)

Both 2D and 3D obtained similar performance rates when only the lighting condition is varied. However, significant performance degradation was observed in 2D when matching subjects with a different facial expression (FA**:FB** or FB**:FA**). The experimental results show that expression variation causes a significant performance drop (40% to 50%) relative to the baseline performance (numbers written in boldface in the tables), yielding average performance rates of 41.5% and 46.9% for the single probe and multiple probe studies, respectively. The performance rates obtained under illumination changes were maintained close to the baseline performance rate. Lighting variation with constant expression affects the performance rate by a relatively insignificant amount, partly because there was no significant illumination change between a gallery and a probe.

When the face space is created by training images acquired under four different

TABLE 5.1
SMSS RANK-ONE RECOGNITION RATES

Labels in the top row represent gallery sets and labels in the first column represent probe sets.

Results with one time-lapse probe per subject

Probe \ Gallery   FALM          FALF          FBLM          FBLF          FALT(3D)
FALM              90.9(86.7)%   81.3(75.8)%   63.6(41.9)%   62.6(34.9)%   N/A
FALF              83.3(81.3)%   82.3(83.3)%   60.6(40.4)%   66.7(38.4)%   N/A
FBLM              68.7(44.4)%   62.6(43.9)%   86.4(82.8)%   83.8(73.7)%   N/A
FBLF              62.6(40.9)%   66.7(47.5)%   86.4(75.8)%   83.8(79.3)%   N/A
FALT (3D)         N/A           N/A           N/A           N/A           88.9%

Results with one or more time-lapse probes per subject

Probe \ Gallery   FALM          FALF          FBLM          FBLF          FALT(3D)
FALM              92.4(88.4)%   87.9(81.8)%   68.7(48.5)%   64.1(38.9)%   N/A
FALF              91.4(85.4)%   86.9(86.9)%   69.2(47.5)%   71.2(45.5)%   N/A
FBLM              72.7(49.0)%   68.7(47.5)%   92.4(87.4)%   91.9(85.9)%   N/A
FBLF              71.2(47.5)%   71.2(51.0)%   93.4(83.3)%   92.9(87.4)%   N/A
FALT (3D)         N/A           N/A           N/A           N/A           87.4%

conditions, the overall performance rates were improved. These results suggest that when a face space is created from a group of subjects with uniform expression, it is susceptible to performance degradation when the subject being matched has a different expression, as discussed in the previous section. Thus, face spaces trained with diverse expressions show more robust performance in 2D. Even though there are still significant performance changes relative to the baseline performance, the performance differences between expression changes and lighting changes have been reduced. Thus, the face space created by using images

acquired under all four imaging conditions is used for 2D in the rest of the experiments.

Single-Modality and Multiple-Sample (SMMS)

As the number of images used to represent a person increases, the performance rate tends to improve [60]. Therefore, this experiment examines the performance gained by an SMMS scheme when each person is represented by two 2D images. A score normalization step might be omitted in the SMMS fusion process, since the matching scores are produced by essentially the same modality. With four gallery sets available in 2D, gallery sets can be formed by combining any 2 out of the 4 image conditions (4 choose 2), yielding 6 different combinations. Therefore, the set of possible galleries can be listed as FALM-FALF, FALM-FBLM, FALM-FBLF, FALF-FBLM, FALF-FBLF and FBLM-FBLF, where the labels are interpreted as 1st GALLERY-2nd GALLERY. The same combination process is performed to prepare probe sets with the images acquired in later sessions. A pair of galleries can be matched with any of the 6 probe pair sets acquired in different sessions. Thus, a total of 36 (6 × 6) matches is performed. Each experiment involves two separate probe-to-gallery matches followed by a decision fusion based on the two sets of matching scores obtained (see Figure 5.3-(B)). In effect, this experiment assumes that a separate gallery is kept for each lighting-expression combination and that the lighting-expression condition of a probe image is known, so that the probe can be matched to the corresponding gallery.

The rank-one rates of the 36 matches (written inside the parentheses) are summarized in Table 5.2. There is a general and substantial improvement over the results from the SMSS scheme. The average performance rate improves from 80.4% to 92.1% in the multiple probe study. The improvement can be interpreted as coming from the fact that an individual

becomes better represented during the match, on average, by adding the other images acquired under different conditions.

Performance Effect on Number of Samples in SMMS

The previous section showed that performance rates tend to improve as a person is represented by more than a single image during the recognition process. In the experiment above, the final score is based on the combination of two matches: each of the two probe images is matched to the gallery for the same lighting-expression condition. However, given two gallery images and two probe images of each person, recognition could also be based on four matches (see Figure 5.4). Likewise, when three images per person are used, recognition could be based on nine matches (and on sixteen matches when four images per person are used). In this case, each probe image is matched against multiple gallery images of varying lighting-expression condition. Tables 5.2 and 5.3 show that overall rank-one rates are improved by this procedure.

In the experiment using three images out of four to represent a person, there are 4 (4 choose 3) possible ways of choosing images from our data, and so there are 16 (4 × 4) matches in the experiments. These results are shown in Table 5.3. Note that in Table 5.2 the performance ranges from 71.7% to 95.0%, with an average rank-one recognition rate of 88.2% for one probe instance per person, whereas in Table 5.3 the performance ranges from 90.9% to 94.0%, with an average rank-one recognition rate of 92.6%. Thus, it appears that representing a person by an increasing number of images does more to prevent poor outlier cases than to raise the best performance, while still improving the overall performance.
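The following sketch illustrates, under simplified assumptions, the multiple-sample decision rule just described: every probe image of a person is matched against every gallery image of each enrolled person (4 matches for 2x2 images, 9 for 3x3, and so on) and the distances are combined with a sum rule before ranking. Euclidean distance between face-space feature vectors is assumed as the per-image score, and the function and variable names are illustrative.

```python
import numpy as np

def multi_sample_distance(probe_feats, gallery_feats):
    """Sum-rule distance between a multi-image probe and a multi-image gallery entry.

    probe_feats:   (p, d) feature vectors for the p probe images of one person
    gallery_feats: (g, d) feature vectors for the g gallery images of one enrolled person
    Returns the sum of all p*g pairwise distances (4 matches for 2x2, 9 for 3x3, ...)."""
    diffs = probe_feats[:, None, :] - gallery_feats[None, :, :]
    return np.linalg.norm(diffs, axis=2).sum()

def identify(probe_feats, gallery):
    """gallery: dict mapping subject id -> (g, d) array. Returns the best-matching id."""
    scores = {sid: multi_sample_distance(probe_feats, feats) for sid, feats in gallery.items()}
    return min(scores, key=scores.get)

# Toy usage: 2 probe images per person, 2 gallery images per enrolled subject.
rng = np.random.default_rng(1)
gallery = {"s1": rng.normal(size=(2, 8)), "s2": rng.normal(size=(2, 8))}
probe = gallery["s2"] + 0.05 * rng.normal(size=(2, 8))   # a noisy copy of subject s2
print(identify(probe, gallery))                           # expected to print "s2"
```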

Figure 5.4. Updated scheme in single-modality multiple-sample (SMMS): this SMMS approach makes a decision based on 4 matches when a subject is represented by 2 images in the gallery and 2 in the probe.

We can also perform a single-result experiment in which all four images are used to represent a person. For this experiment, using four images to represent a person in the gallery and four to represent a person as a probe, the rank-one recognition rate is 95.5% based on sixteen matches and 93.9% based on four matches in the single probe study (97.0% and 95.0% for the multiple probe study, respectively).

Multiple-Modality and Single-Sample (MMSS)

The baseline performance rates for the multiple probe study are 92.4% and 87.4% in 2D and 3D, respectively. In this experiment, one instance of a different modality acquired on the same day is added to an existing SMSS scheme. A modality that has higher confidence might help the other modality when the decision is not clear. The result of this experiment can be compared to SMMS recognition, but with an additional modality replacing an additional sample. Four sets of gallery and probe are available in our 2D SMSS scheme. A new modality is added by including a set of 3D face shape images acquired in the same

session. The experimental results show a significant performance improvement over SMSS across all the scenarios, and are reported in Table 5.4. The overall performance rates obtained by the integration of 3D images show a significant improvement, from 93.6% to 97.2% in average rank-one rate for the multiple probe study. Comparing Table 5.4 to Table 5.2, it appears that representing a person by one 2D image plus one 3D image is generally preferable to representing a person by two 2D images.

Results and Summary

Both SMMS and MMSS show significant improvement over the baseline SMSS results. This is to be expected, because additional information is used to represent a person. Illumination seems to be relatively insignificant to PCA-based face recognition, possibly because much of the variation caused by illumination can be removed by histogram equalization and by dropping selected eigenvectors in creating the face space. However, facial expression changes degrade performance greatly. Handling facial expression change is an area that requires greater attention in order to make face recognition more robust.

For the comparison of SMMS with two 2D images and MMSS with one 3D image and one 2D image, a set of matches that have common imaging conditions is considered. A total of 4 matches that have FA** in both gallery and probe in MMSS (Table 5.4) can be compared to one of the SMMS schemes, FALM-FALF:FALM-FALF, in Table 5.2. Clearly, the MMSS scheme (97.5% on average for those four matches) shows a higher accuracy than the rank-one rate (94.4%) in SMMS recognition for the multiple probe study. Another set of matches that have FA**-FB**:FA**-FB** in both gallery and probe can be examined. Two

matches that have FB**-FALT:FB**-FALT are then compared against four SMMS recognition rates in Table 5.2. The MMSS scheme, with a rank-one rate of 98.3%, shows better recognition than the average performance rate of 94.3% obtained by the SMMS scheme. Figure 5.5 shows examples of improved cases in which both the SMMS and MMSS schemes provide better recognition than the SMSS scheme. However, the images in the right column show that even better recognition can be obtained by combining an additional modality rather than by combining additional samples of a single modality. For instance, the 8th-ranked 2D match is improved to the 5th rank when the second set of 2D images is considered, but is matched correctly when the 2D image is combined with 3D.

Figure 5.5. Example matches improved by integrated schemes; while the matches shown in the left column illustrate that both SMMS and MMSS improved the baseline performance rates, the matches in the right column show the effectiveness of the MMSS scheme. (Ranks annotated in the figure: SMSS baseline 0th, 22nd, 8th, 0th; SMMS 0th, 5th; MMSS 0th, 0th.)

The CMC curves in Figure 5.6-(A) summarize these results. The best rank-one correct identification rate for the baseline SMMS scheme is 94.4%, versus 97.5% for the MMSS scheme. McNemar's test for the significance of the difference in the rank-one match between the integrated biometrics (both the MMSS and SMMS schemes) and either the 2D face or the 3D face alone shows that the multi-modal performance is significantly greater (α = 0.05). However, we found no significant difference between 2D alone and 3D alone in SMSS recognition.

For a verification scenario, FAR, FRR, and EER are summarized in the ROC curves in Figure 5.6-(B). As the ROC curves show, the MMSS scheme (0.019) achieves a significantly lower EER than SMSS (0.043 for 2D, and a comparable value for 3D) and SMMS (0.048). Also, given the image sets used in the experiments here, the multiple-modality result clearly shows a higher accuracy than the multiple-sample result in both FAR and FRR. The EER of 3D SMSS shows very similar performance accuracy to that of 2D SMSS. However, the 2D rank-one match rate is greater than the 3D rank-one match rate, as shown in the CMC curves. It is important to note that results presented as EERs should be analyzed carefully, because the EERs represent only one operating point on the ROC curves being compared. The operating point, as a function of FAR and FRR, will be changed to meet the requirements of an application.

Among the fusion rules that we tested, the sum rule with linear score transformation provides the overall best performance. Both the sum rule and the product rule consistently show the best performance across different score normalization methods. The minimum rule, however, shows lower performance than the other rules.
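For reference, the following is a small sketch of how the two summary numbers used above, the rank-one identification rate (for CMC) and the equal error rate (for ROC), can be computed from match distances. It is a simplified illustration with assumed variable names, not the evaluation code used in this work.

```python
import numpy as np

def rank_one_rate(dist, probe_labels, gallery_labels):
    """dist[i, j] = distance between probe i and gallery entry j."""
    best = np.argmin(dist, axis=1)
    return np.mean(np.asarray(probe_labels) == np.asarray(gallery_labels)[best])

def equal_error_rate(genuine, impostor):
    """Approximate EER from genuine and impostor distance scores (lower = better match)."""
    genuine, impostor = np.asarray(genuine), np.asarray(impostor)
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    best_gap, eer = np.inf, 1.0
    for t in thresholds:
        frr = np.mean(genuine > t)      # true matches rejected at threshold t
        far = np.mean(impostor <= t)    # false matches accepted at threshold t
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2.0
    return eer

# Toy usage with synthetic distances.
dist = np.array([[0.2, 0.9], [0.8, 0.1]])
print(rank_one_rate(dist, ["a", "b"], ["a", "b"]))   # -> 1.0
rng = np.random.default_rng(2)
genuine = rng.normal(0.3, 0.1, 200)     # distances for same-person comparisons
impostor = rng.normal(0.7, 0.1, 2000)   # distances for different-person comparisons
print("EER ~", round(equal_error_rate(genuine, impostor), 3))
```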

Figure 5.6. The baseline performance rates of the multiple probe study reported in CMC and ROC curves: (A) performance rates in CMC; (B) performance rates in ROC (curves for SMSS (2D), SMSS (3D), SMMS and MMSS). The SMSS result uses FALM:FALM, the SMMS result uses FALM-FALF:FALM-FALF based on four matches, and the MMSS result uses FALM-FALT(3D):FALM-FALT(3D), all for the multiple probe study.

The results show the effectiveness of combining multiple biometric sources acquired by different sensors. The overall performance rates provided by MMSS are higher than those of SMMS in all cases. However, the performance rate of SMMS may be improved by increasing the number and diversity (pose, expression, etc.) of image samples. Because the cost of scaling SMMS recognition in 2D is likely lower than that of MMSS in 2D+3D, increasing the number of samples is a feasible approach to improving the performance of SMMS. Also, MMSS identification accuracy can be improved as the number of samples used to represent a subject in each modality increases (multiple-modality-multiple-sample). Our results suggest that 3D imaging has significant potential to improve face recognition performance. Our results do not point to using 3D in place of 2D, so much as 3D in combination with 2D. However, it is probably important to

acknowledge that current 3D sensing technology is not as robust, in a practical sense, as current 2D camera technology. This issue is discussed in more detail in the next section.

5.4 Limitations of Current 3D Sensing Technology

An ideal 3D face sensor would combine at least the following properties: (1) an image acquisition time similar to that of a typical 2D camera; (2) a large depth of field, e.g., a meter or more in which there is essentially no loss in accuracy of depth resolution; (3) dense sampling of depth values, perhaps 1,000 × 1,000; (4) no eye safety issues arising from projected light; and (5) depth resolution of better than 1 mm. Evaluated by these criteria, there is currently no ideal 3D sensor for the task of person identification [14].

One important issue is whether or not the sensor is an active one; that is, whether it projects light of some form onto the scene. If it projects coherent light, then there are potential eye safety issues. If it does not project coherent light, then issues of the depth-versus-accuracy tradeoff become more important. If the sensor projects a sequence of light stripes or patterns and acquires an image of each, then the effective acquisition time increases. In general, shorter acquisition times are better than longer acquisition times, in order to minimize the artifacts due to subject motion. The shortest image acquisition time possible would seem to be that of a single image, or multiple images taken truly simultaneously. In this regard, a stereo-based system would seem to have an advantage. However, stereo-based systems can have trouble getting a true dense sampling of the face surface. Systems that depend on structured light typically have trouble in regions such as eyebrows, and often generate spike artifacts when light undergoes multi-

ple reflections. Systems that depend on stereo correspondence often have sparse sampling of points in regions where there is not much natural texture, and may generate surfaces that are too smooth in such cases.

There are of course many sources of natural variation in the 3D shape of the face. Subjects may have different facial expressions at different times. Subjects may have swelling or puffiness in some regions of the face at one time and not at another. Dramatic weight gain or loss, as well as disease symptoms, may affect face shape.

Most 3D sensors make some attempt to address illumination variations by controlling the ambient illumination, at least at the moment of image acquisition. For example, the Cyberware has its own relatively bright illumination [1], the Geometrix can be equipped with additional illumination [2], and so forth. Creating a sensor that automatically adjusts to variations in the ambient illumination is certainly a major practical area for advances in 3D sensor technologies.

Illumination Invariance

It is sometimes asserted that 3D sensing is, or should be, inherently better than 2D for purposes of face recognition [16, 44, 58]. One reason often asserted for the superiority of 3D sensing is that 3D is illumination independent, whereas 2D appearance can be affected by illumination in various ways. It is true that 3D shape is illumination independent, in the sense that a given 3D shape exists the same regardless of how it is illuminated. However, the sensing of 3D shape is generally not illumination independent: changes in the illumination of a 3D shape can greatly affect the shape description that is acquired by a 3D sensor. This is true for approaches based on stereo or on structured light.

The acquisition of 3D shape by either stereo or structured light involves taking one or more standard 2D intensity images. The 2D images are typically taken with commercially available digital cameras, possibly with filters for a particular type of light. A 2D sensor can receive light of an intensity that saturates the detector, and can also receive light levels too low to produce high quality images. The 2D image can have artifacts due to illumination, and the artifacts in the 2D images can lead to artifacts in the 3D images. The types of artifacts that can arise in the 2D image and the 3D image are different but related. The determination of which type of image inherently has more frequent or more important artifacts due to illumination is not clear, and is possibly sensor and application dependent.

Figure 5.7 makes the point that the shape model acquired by currently available 3D sensors can be greatly affected by changes in illumination. The images in the first column are the shape models acquired by various sensors, using subject position and ambient lighting that are appropriate to the given sensor. The images in the second column are the shape models of the same subject under essentially the same conditions, but with one additional studio spotlight turned on. In both columns, the shape models have been rendered as smooth-shaded 3D meshes without any superimposed texture map. Models were converted to VRML format and then rendered as a shaded image.

5.5 Summary and Discussion

This chapter has presented results of the largest experimental study to date of 3D and multi-modal 2D+3D face recognition. A data set in which 198 distinct persons are enrolled in the gallery is used. Results are presented for (a) an experiment in which recognition is attempted with one time-lapse probe per person, and

(b) an experiment in which recognition is attempted with as many different time-lapse probes as are available for each person. For each image acquisition session, multiple 2D images were acquired under different lighting and facial expression conditions. Therefore, we are able to consider 2D recognition results over a range of experimental conditions.

Perhaps the simplest level of results is to compare recognition using a single 2D image versus recognition using a single 3D image (see Table 5.1). Our results suggest that 2D recognition performance can be roughly equivalent to 3D recognition performance if the 2D face space training is sufficient and the lighting and expression conditions are well matched between enrollment and probe images.

Multi-modal 2D+3D recognition significantly outperforms plain 2D or 3D recognition, as evidenced by comparing Table 5.1 and Table 5.4. However, this simple comparison is biased in favor of the multi-modal results, because a person is represented by two images, one 2D and one 3D, rather than by a single 2D or single 3D image. Our data collection allows us to explore results from 2D recognition based on using multiple images to represent a person, as summarized in Tables 5.2 and 5.3. Using two 2D images to represent a person erases roughly half of the performance difference between simple 2D recognition and multi-modal 2D+3D recognition. Using three or four 2D images to represent a person erases still more of the performance gap, but generally does not reach the performance level of multi-modal 2D+3D.

Overall, we are led to conclude that improved face recognition performance will result from (a) the combination of 2D+3D imaging, and also (b) representing a person by multiple images taken under varied lighting and facial expression. Both of these topics should be the subject of substantial additional future work. The

topic of 3D face recognition has been only very lightly explored [14], at least in comparison to 2D face recognition [88]. The topic of multi-image representations of a person for face recognition is also not well explored.

Due to the small number of public image databases available for multiple-biometrics research, it can be difficult to investigate practical issues in face recognition. One such question is: "What variations (pose or lighting) of a person would bring the most benefit when considering multiple samples of a single biometric source, with different numbers of images stored in the gallery and probe?" Another is: "How does the performance accuracy change with respect to the training set size and the variety of multiple images collected?" These are some of the problems that the research community might be able to resolve with rigorous experiments on rich databases.

Three-dimensional face recognition research is currently limited by the robustness of practical 3D sensing. Currently, usable 3D images cannot be acquired under the same variety of conditions of lighting, depth of field, and timing. Thus, another important area of future research should be the development of better 3D sensing technology.

TABLE 5.2
SMMS RANK-ONE RATES IN 2-GALLERY AND 2-PROBE MODE

Rank-one rates produced with 2 gallery images and 2 probe images per subject. Rates written inside the parentheses are based on two matches; the other rates are based on four matches. Labels in the top row represent gallery sets and labels in the first column represent probe sets. Results are obtained by a sum rule after a linear score normalization.

Results with one time-lapse probe per subject

Probe \ Gallery   FALM-FALF     FALM-FBLM     FALF-FBLM     FALM-FBLF     FALF-FBLF     FBLM-FBLF
FALM-FALF         92.9(92.9)%   93.4(89.9)%   86.9(82.3)%   93.9(89.9)%   87.4(83.8)%   74.2(72.7)%
FALM-FBLM         93.4(91.9)%   93.9(93.9)%   91.4(92.4)%   93.9(92.9)%   90.9(88.9)%   91.9(86.4)%
FALF-FBLM         88.9(85.4)%   94.4(92.4)%   91.9(90.4)%   91.4(90.4)%   86.9(87.9)%   91.4(86.9)%
FALM-FBLF         92.9(90.9)%   95.5(95.0)%   91.9(90.4)%   95.0(93.9)%   90.9(89.9)%   90.4(86.4)%
FALF-FBLF         86.4(84.3)%   91.4(90.9)%   89.4(88.9)%   89.4(88.4)%   88.9(87.4)%   88.9(85.4)%
FBLM-FBLF         74.2(71.7)%   91.9(88.9)%   91.9(87.4)%   89.4(88.4)%   88.4(85.9)%   91.9(90.4)%

Results with one or more time-lapse probes per subject

Probe \ Gallery   FALM-FALF     FALM-FBLM     FALF-FBLM     FALM-FBLF     FALF-FBLF     FBLM-FBLF
FALM-FALF         94.4(92.9)%   94.4(89.4)%   90.9(85.4)%   96.0(92.9)%   90.4(88.4)%   76.8(76.7)%
FALM-FBLM         94.4(89.9)%   95.0(93.9)%   93.4(93.9)%   95.0(96.0)%   93.9(94.4)%   96.5(95.5)%
FALF-FBLM         92.9(90.4)%   96.0(93.4)%   93.4(93.4)%   95.5(96.0)%   91.9(93.4)%   96.0(92.4)%
FALM-FBLF         94.4(93.4)%   95.5(94.4)%   95.5(94.4)%   95.5(95.5)%   94.4(95.5)%   97.0(93.4)%
FALF-FBLF         90.9(90.4)%   96.0(93.4)%   93.4(93.4)%   95.5(95.0)%   93.4(92.9)%   96.0(92.4)%
FBLM-FBLF         75.3(76.8)%   97.0(92.9)%   95.5(91.9)%   95.0(91.4)%   95.0(92.9)%   97.5(95.5)%

TABLE 5.3
SMMS RANK-ONE RATES IN 3-GALLERY AND 3-PROBE MODE

Rank-one rates produced with 3 gallery images and 3 probe images per subject. Rates written inside the parentheses are based on three matches; the other rates are based on nine matches. Labels in the top row represent gallery sets and labels in the first column represent probe sets. Results are obtained by a sum rule after a linear score normalization.

Results with one time-lapse probe per subject

Probe \ Gallery   FALM-FALF-FBLM   FALM-FALF-FBLF   FALF-FBLM-FBLF   FALM-FBLM-FBLF
FALM-FALF-FBLM    94.4(92.4)%      95.0(90.9)%      93.9(91.9)%      95.5(93.4)%
FALM-FALF-FBLF    94.4(92.9)%      94.4(93.4)%      91.9(91.9)%      94.9(93.9)%
FALM-FBLM-FBLF    95.5(92.0)%      95.0(91.9)%      93.9(91.9)%      95.5(92.4)%
FALF-FBLM-FBLF    93.4(92.9)%      92.4(92.4)%      92.4(92.4)%      93.4(93.9)%

Results with one or more time-lapse probes per subject

Probe \ Gallery   FALM-FALF-FBLM   FALM-FALF-FBLF   FALF-FBLM-FBLF   FALM-FBLM-FBLF
FALM-FALF-FBLM    95.0(96.0)%      96.0(94.4)%      95.0(93.4)%      96.5(95.5)%
FALM-FALF-FBLF    95.5(94.4)%      96.5(95.0)%      96.5(93.9)%      97.0(94.4)%
FALM-FBLM-FBLF    96.5(94.4)%      97.0(96.0)%      98.0(92.9)%      97.5(95.0)%
FALF-FBLM-FBLF    96.0(96.5)%      96.5(94.4)%      96.5(93.4)%      97.5(96.0)%

TABLE 5.4
MMSS RANK-ONE RECOGNITION RATES

Labels in the top row represent gallery sets and labels in the first column represent probe sets. Results are obtained by a sum rule after a linear score normalization.

Results with one time-lapse probe per subject

Probe \ Gallery     FALM-FALT(3D)   FALF-FALT(3D)   FBLM-FALT(3D)   FBLF-FALT(3D)
FALM-FALT(3D)       95.0(94.4)%     95.0(96.0)%     90.9(87.4)%     90.4(76.3)%
FALF-FALT(3D)       95.0(94.4)%     95.0(96.0)%     92.4(86.4)%     88.9(77.8)%
FBLM-FALT(3D)       93.9(88.9)%     92.4(87.4)%     96.5(96.5)%     96.0(95.0)%
FBLF-FALT(3D)       91.4(87.4)%     91.9(86.9)%     96.5(94.9)%     98.5(96.0)%

Results with one or more time-lapse probes per subject

Probe \ Gallery     FALM-FALT(3D)   FALF-FALT(3D)   FBLM-FALT(3D)   FBLF-FALT(3D)
FALM-FALT(3D)       97.5(96.0)%     97.5(97.0)%     95.5(87.9)%     94.4(81.3)%
FALF-FALT(3D)       98.0(97.5)%     97.0(97.5)%     96.0(91.9)%     96.0(83.3)%
FBLM-FALT(3D)       97.0(90.9)%     96.0(89.9)%     99.0(98.0)%     99.0(99.0)%
FBLF-FALT(3D)       97.0(90.9)%     97.0(89.9)%     98.5(98.0)%     99.5(98.5)%

Figure 5.7. Example of 3D shape models of the same person produced under different lighting conditions (recommended lighting versus an additional directed top light source) by three different current commercial 3D sensors (sensor #1, sensor #2 and sensor #3).

CHAPTER 6

A LOCAL SURFACE BASED 3D FACE RECOGNITION ALGORITHM

6.1 Introduction

This chapter specifically examines the issue of facial expression variation in 3D face recognition. (This chapter is based on the paper "Effects on facial expression in 3D face recognition using ICP and PCA," to appear in the SPIE Conference on Biometric Technology for Human Identification.) We also compare the performance of approaches to 3D face recognition based on PCA ("eigenfaces") and on iterative closest point (ICP) algorithms. The method proposed in this chapter is fully automated, including the initialization of the 3D matching.

A recent study by Givens et al. [37] reported that the factors that most affect face recognition are expression changes, eyelids open/closed, and mouth open/closed. All of these are related to facial expression to a certain extent. The results agree with our work in Chapter 5: we found that different expressions between the gallery and probe sets degrade rank-one recognition rates in 2D face recognition by as much as 15%. Also, a similar study performed for 3D face recognition shows that performance degrades by as much as 33% [14]. One of the conclusions reported by the Face Recognition Vendor Test 2002 [66] is that the number of subjects in the database and the time lapse between gallery and probe affect the overall performance rates:

"For identification and watch list tasks, performance decreases linearly in the logarithm of the gallery size" [66]. The average performance decrease for 2D face recognition in identification is 15% when the time lapse between gallery and probe reaches around 500 days, according to the FRVT 2002 report [66]. Note that time lapse implies more than just the temporal aspect: pose, facial make-up, facial hair, and/or lighting condition are factors associated with time lapse in the evaluation.

These points about facial expression and time lapse highlight a problem with the existing literature on 3D face recognition, since most of the 3D approaches reviewed in Chapter 1 considered only neutral expressions, with a limited number of subjects and no time lapse. The approach developed in this chapter is based on the idea that there is some subset of the face that is relatively rigid between two expressions, and it uses multiple regions to allow flexibility across different expressions. Also, weeks of time lapse are built into the data acquisition.

There are at least three general methods that one might employ in an attempt to handle the problem of varying facial expression. One approach would be to simply concentrate on regions of the face whose shape changes the least with varying facial expression. For example, one might simply ignore the lips and mouth region, since its shape varies greatly with expression. Of course, there is no large subset of the face that is perfectly shape invariant across a broad range of normal expressions, and so this approach will not be perfect. Another approach would be to enroll a person into the gallery by intentionally sampling a set of different facial expressions, and to match a probe against the set of shapes representing a person. This approach requires some cooperation on the part of the subject in order to obtain the set of different facial expressions. This approach also runs into the problem that, however large the set of facial expressions sampled for enrollment,

the probe shape may represent an expression other than those sampled. Thus, this approach also does not seem to allow the possibility of a perfect solution. A third approach would be to have a general model of 3D facial expression that can be applied to any person's image(s). The search for a match between a gallery and a probe shape could then be done over the set of parameters controlling the instantiation of expression-specific shape. This approach seems destined to run into problems as well. There likely is no general model to predict, for example, how each person's neutral expression image is transformed into their smiling image. A smile means different things for different persons' facial shapes, and different things for the same person at different times and in different cultural contexts.

Given that there does not seem to be any single correct approach, the question is which approach or combination of approaches can be used to achieve the desired level of performance. In this study, we first document the extent to which facial expression change degrades performance in 3D face recognition. Then, we address this problem by considering 3D face matching only in localized facial regions that show relatively less variation across expressions. Such regions are detected automatically using 3D geometrical features, unlike other facial feature finding methods that use both 2D color and 3D depth images acquired at the same time [31].

Related Works

There appear to be three main categories of approach to 3D face recognition in the literature (see Tables 1.1 and 1.2). One is that a three-dimensional face can be thought of as a group of points defined in 3D space, which can be matched using a registration technique [55, 58, 65]. Another is that the eigenface approach

can be extended to accomplish the recognition by measuring the depth (shape) variations observed in range images [16, 22, 44, 64, 78]. Finally, there is a group of studies that use a set of features computed from the 3D geometry of the face surface to measure the similarity [38, 54, 61, 62, 72]. Even though a handful of 3D face recognition studies [16, 30, 61] consider expression variations, there is no rigorous evaluation study on a large dataset that explicitly attempts to resolve the facial expression problem.

6.2 Baseline Performance: ICP and PCA

An experiment is conducted to establish the baseline performance obtained by using the whole face. The face is cropped using manually selected points on the two outer eye tips and the nose tip (see Figure 6.1). The face region considered is basically that used in the PCA baseline algorithm in 2D face recognition studies. Similarity between gallery and probe surfaces is measured using an iterative closest point (ICP) surface registration technique. The PCA-based approach to 3D face recognition is also included in the baseline to show the difference in recognition accuracy between the two approaches.

The graphs shown in Figure 6.2 are obtained with an ICP algorithm (denoted ICP-baseline) and the PCA-based method (denoted PCA-baseline) matching the whole frontal face region, using manually selected landmark points for the initial rotation and translation estimate given to both recognition methods. Successive probe sets have a longer elapsed time between acquisition of the gallery image and the probe image. The same gallery images are used with all probe sets, and all gallery images have a neutral expression.
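To make the ICP-based similarity measure concrete, the following is a compact point-to-point ICP sketch; it is a simplified illustration, not the registration code used in this work. It assumes both surfaces are given as N x 3 point arrays, that a rough initial alignment (such as the landmark-based estimate mentioned above) has already been applied, and that scipy's cKDTree is available for nearest-neighbour search; the root-mean-square residual after alignment is returned as the match distance.

```python
import numpy as np
from scipy.spatial import cKDTree

def best_rigid_transform(src, dst):
    """Least-squares rotation R and translation t mapping src points onto dst points."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                 # avoid returning a reflection
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = dst_c - R @ src_c
    return R, t

def icp_distance(probe, gallery, iterations=30):
    """Iteratively align the probe points to the gallery surface and return the RMS residual."""
    tree = cKDTree(gallery)
    moved = probe.copy()
    for _ in range(iterations):
        _, idx = tree.query(moved)           # closest gallery point for each probe point
        R, t = best_rigid_transform(moved, gallery[idx])
        moved = moved @ R.T + t
    d, _ = tree.query(moved)
    return np.sqrt(np.mean(d ** 2))

# Toy usage: a slightly rotated and shifted copy of the same point set should give a small distance.
rng = np.random.default_rng(3)
gallery = rng.normal(size=(500, 3))
angle = np.deg2rad(5.0)
Rz = np.array([[np.cos(angle), -np.sin(angle), 0], [np.sin(angle), np.cos(angle), 0], [0, 0, 1]])
probe = gallery @ Rz.T + np.array([0.01, 0.0, 0.0])
print(round(icp_distance(probe, gallery), 4))
```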

Figure 6.1. Sample images (a gallery image on the left and a probe image on the right in each column) used for the baseline performance, for the ICP-baseline and the PCA-baseline. Nearly the entire face region is considered for both recognition methods. For the ICP-baseline, the probe face covers approximately 10% less area than the face in the gallery set.

The results show that ICP outperforms PCA when the whole face region is used for matching under no expression changes. The higher performance is obtained by the ICP-baseline because pose variation observed between gallery and probe is minimized during the iterative registration. The PCA-baseline method, however, has no such opportunity to correct pose variation during the matching process, since the pose is normalized only once, and some level of variation is inevitably introduced when the manual points are selected. There is a significant performance drop for both methods when expression varies between gallery and probe, from an average of 91.0% to 61.5% for the ICP-baseline and from 77.7% to 61.3% for the PCA-baseline. This clearly shows the limitation of both methods when the whole face is considered in the presence of expression change.

Facial Expression Analysis

Changes due to facial expressions influence accuracy not only for 3D shape but also for 2D appearance. Even though it is extremely hard to generalize about the expressions of a person, we have visual evidence of different degrees of muscle

Figure 6.2. Performance degradation with expression change. Recognition rates when the probe has an expression change are approximately 30% lower than without expression change. Rank-one rates of the ICP-baseline and PCA-baseline are plotted for probes with neutral expression (top, Probe #1 through Probe #9) and for probes with non-neutral expressions (bottom, Probe #1 through Probe #8).

movement. For instance, regions around the mouth would not seem to be reliable for matching, since an open mouth due to a smile deforms the shape significantly (see Figure 6.7). This suggests that the sample points considered for facial surface matching should be collected from relatively static regions. Other expressions, such as surprised, sad and disgusted (see Figure 6.3), contract or expand regions including the mouth, forehead and cheeks. Specifically, a surprised expression has contracted forehead muscles producing wrinkles, lifted eyebrows and cheeks, and possibly an open mouth. In the case of a sad expression, the muscles between the eyebrows are contracted,

Figure 6.3. Example images in 2D and 3D with different expressions. There are seven different expressions considered in this study. Left to right, these are neutral, angry, happy, sad, surprised, disgusted and puffy cheek. The first six of these correspond to some of Ekman's basic human expressions [32].

the lips are generally deformed, and the cheeks are possibly lifted up. Therefore, the ideal region for reliable matching, based on our qualitative evaluation, would be an area around the nose, which displays less movement under expression than other areas. Because only the nose area is considered for matching and the other points are eliminated, the number of points used for matching is greatly reduced (from about 10^5 to 10^3 on average).

6.3 Methods Description

A new 3D face recognition algorithm is proposed to cope with expression change between gallery and probe images. It uses an ICP surface registration technique to measure the similarity between two 3D face regions of interest (ROIs) extracted for face matching. There are two separate ICP-based methods, depending on how the ROIs are extracted. The first method uses manually labeled ground truth points (two outer eye tips and a nose tip) to extract ROI regions for match-

ing. The second method finds ROIs automatically using our facial feature finding method (described below, in particular in Section 6.3.4). The following sections describe how our automated feature finding method extracts ROIs and how these surfaces are matched to recognize a person (see Figure 6.4).

Overall Framework

The problem of varying facial expression can be minimized by considering sample points chosen in facial regions that are relatively static under expression change. We use the nose region for this purpose. The following steps are involved in accomplishing the task of person identification under expression changes.

Figure 6.4. Overview of the proposed method.

Figure 6.5. Face region extraction.

Skin Region Detection

First, a group of skin regions is located by a skin detection method using the corresponding 2D color image. Skin pixels are identified after transforming the image into the YCbCr color space [69]. Pixels in the 2D color image are used in the skin detection method only if they have a valid 3D point. A group of 3D points in a skin region specified by a rectangular area is then processed to compute 3D geometrical features (see Figure 6.5). This step not only removes regions irrelevant for matching, such as the shoulder or hair areas, but also reduces the computing time for later steps.

Preprocessing

Once the skin regions are detected, there are three processes to prepare for later steps. The first process is to subsample the 3D points in the skin region detected in the previous step. This greatly reduces the number of 3D data points. A subsampling rate of S indicates that a point is chosen in every S × S window.
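The following is a minimal sketch of the kind of YCbCr-based skin masking and S × S subsampling described above. It is illustrative only: the Cb/Cr threshold ranges are common values from the literature and are assumptions here (the thresholds actually used are not specified in the text), only pixels with a valid 3D point are kept, and a structured range image layout is assumed.

```python
import numpy as np

def rgb_to_ycbcr(rgb):
    """rgb: (H, W, 3) uint8 image. Returns Y, Cb, Cr as float arrays (ITU-R BT.601)."""
    rgb = rgb.astype(float)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr

def skin_mask(rgb, valid_3d):
    """valid_3d: (H, W) boolean mask of pixels that have a valid range measurement."""
    _, cb, cr = rgb_to_ycbcr(rgb)
    # Assumed Cb/Cr skin ranges; typical literature values, not those used in the dissertation.
    skin = (cb > 77) & (cb < 127) & (cr > 133) & (cr < 173)
    return skin & valid_3d

def subsample(points_3d, mask, s=4):
    """Keep one point per s x s window of the (H, W, 3) structured range image."""
    sampled = np.zeros_like(mask)
    sampled[::s, ::s] = True
    keep = mask & sampled
    return points_3d[keep]

# Toy usage with random data standing in for a registered 2D color / 3D range pair.
rng = np.random.default_rng(4)
rgb = rng.integers(0, 256, size=(120, 160, 3), dtype=np.uint8)
points = rng.normal(size=(120, 160, 3))
valid = rng.random((120, 160)) > 0.1
print(subsample(points, skin_mask(rgb, valid)).shape)
```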

The second process is to detect outliers. Outliers are usually caused by noise in the 3D sensing process. Specifically, when a 3D scanner projects a single light stripe to compute the depth values in the scene, the projected light stripe needs to be detected by the sensor for the triangulation. However, when the light stripe is projected onto a specular surface, the sensor usually cannot compute reliable depth values. This causes outliers in the data points. There are different ways to find outliers, such as flagging points whose distance or surface normal angle exceeds certain threshold values. In this study, a point is declared an outlier when the angle between the optical axis and the surface normal at the observed point is greater than a certain threshold. Such points are treated as outliers and removed from the model (see Figure 6.6). Finally, a local Gaussian filter (11 × 11) is used to smooth the data.

Face Segmentation Using Surface Curvature

This section describes a methodology that extracts (segments) face regions that can always be found as the same curvature region type (peak, pit, saddle, and so on) in images with different facial expressions. The related steps of surface curvature estimation and region segmentation based on curvature values are explained. The purpose of the facial surface segmentation in the proposed method is to detect reliable and universal regions that have certain surface characteristics, such as the nose peak. Once the locations of such regions are successfully identified, we have better knowledge of where to extract ROIs on a given surface. The actual extraction process of the ROIs is explained in a later section.
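Returning to the outlier-removal and smoothing steps of the preprocessing stage described above, the following sketch flags points whose surface normal makes too large an angle with the optical axis and then smooths the depth channel. The 80-degree threshold, the gradient-based normal estimation, and the use of scipy's gaussian_filter in place of the 11 × 11 local Gaussian filter are all assumptions made for this illustration.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def normals_from_range(points):
    """points: (H, W, 3) structured range image. Per-pixel normals from neighbour differences."""
    dx = np.gradient(points, axis=1)          # change along image columns
    dy = np.gradient(points, axis=0)          # change along image rows
    n = np.cross(dx, dy)
    norm = np.linalg.norm(n, axis=2, keepdims=True)
    return n / np.clip(norm, 1e-9, None)

def remove_normal_outliers(points, max_angle_deg=80.0):
    """Keep points whose normal is within max_angle_deg of the optical axis (0, 0, 1).
    The 80-degree threshold is an assumed value for illustration."""
    n = normals_from_range(points)
    cos_angle = np.abs(n[..., 2])             # |n . z| since the optical axis is taken as the z axis
    return cos_angle >= np.cos(np.deg2rad(max_angle_deg))

def smooth_depth(points, sigma=2.0):
    """Stand-in for the 11 x 11 local Gaussian filter: smooth the depth channel."""
    smoothed = points.copy()
    smoothed[..., 2] = gaussian_filter(points[..., 2], sigma=sigma)
    return smoothed

# Toy usage on a synthetic range image (a smooth bump standing in for face depth).
h, w = 60, 80
xx, yy = np.meshgrid(np.linspace(-1, 1, w), np.linspace(-1, 1, h))
points = np.dstack([xx, yy, 0.2 * np.exp(-(xx**2 + yy**2))])
keep = remove_normal_outliers(points)
print(points[keep].shape, smooth_depth(points).shape)
```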

Figure 6.6. The outlier removal process: (A) the original image, with outliers scattered behind the face; (B) outliers removed (subsampled version).

Surface Curvature

Surface curvature information has been used extensively in range segmentation and 3D object recognition studies. Surface curvature carries useful information for understanding and describing the characteristics of observed surface regions. According to Flynn and Jain [35], there are two different ways of estimating surface curvature: analytic estimation and discrete (numerical) estimation. The analytical approach computes a curvature value defined at a point in a local

surface based on the parameters of the fitted surface. The discrete approach considers a set of curvature values computed from a set of directional curvature estimates. A brief introduction to curvature estimation is given in Appendix A. Our curvature estimation is based on the analytical approach, and the detailed steps are described in the following sections.

Local Curvature Estimation

The curvature property is measured from a set of points observed in a local region of the surface. More specifically, a curvature value is estimated in relation to the shape defined by the neighboring points. The neighborhood points are collected within a small region (say, K × K), and a local coordinate system for the small region is established prior to the curvature computation. The local coordinate system provides a means to identify the principal directions of the distribution of the 3D data points. It is formed by the tangent plane (X-Y plane) and the surface normal (Z axis) of the local surface at the point. We employ a PCA method to define the three axes for those K^2 points: the X and Y axes are the eigenvectors with the two largest eigenvalues, and the Z axis is the eigenvector with the smallest eigenvalue, the surface normal (see Appendix A.2). Once the local coordinate system is established by PCA, the curvature at a given point can be estimated, since the current point is placed at the origin of the local coordinate system, the X-Y plane is tangent to the local surface there, and the Z axis is its surface normal. The points observed in the small region are then fitted to a quadratic surface. This step uses a least-squares surface fitting technique to obtain the surface equation by finding the coefficients of the quadratic form.
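The following sketch illustrates this local estimation under simplifying assumptions: a PCA-defined local frame, a least-squares fit of a full quadratic patch z = a*x^2 + b*x*y + c*y^2 + d*x + e*y + f (the exact quadratic form used in the dissertation is not reproduced here), and the standard Monge-patch formulas for the Gaussian curvature K and the mean curvature H from the partial derivatives at the local origin.

```python
import numpy as np

def local_curvature(neighborhood):
    """neighborhood: (n, 3) 3D points around a center point.
    Returns (H, K) estimated at the centroid of the neighborhood."""
    centered = neighborhood - neighborhood.mean(axis=0)
    # PCA local frame: principal directions from SVD; the last one is the surface normal.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    local = centered @ vt.T                          # coordinates in the (x, y, z) local frame
    x, y, z = local[:, 0], local[:, 1], local[:, 2]
    # Least-squares fit of z = a x^2 + b x y + c y^2 + d x + e y + f.
    A = np.column_stack([x**2, x*y, y**2, x, y, np.ones_like(x)])
    (a, b, c, d, e, f), *_ = np.linalg.lstsq(A, z, rcond=None)
    # Partial derivatives of the fitted patch at the origin of the local frame.
    fx, fy, fxx, fxy, fyy = d, e, 2*a, b, 2*c
    denom = 1.0 + fx**2 + fy**2
    K = (fxx * fyy - fxy**2) / denom**2
    H = ((1 + fy**2) * fxx - 2*fx*fy*fxy + (1 + fx**2) * fyy) / (2 * denom**1.5)
    return H, K

# Toy usage: points from a sphere of radius 2 should give K ~ 1/4 and |H| ~ 1/2
# (the sign of H depends on the orientation of the PCA normal).
rng = np.random.default_rng(5)
u = 0.2 * rng.normal(size=(200, 2))
pts = np.column_stack([u[:, 0], u[:, 1], np.sqrt(np.clip(4 - u[:, 0]**2 - u[:, 1]**2, 0, None))])
H, K = local_curvature(pts)
print(round(H, 3), round(K, 3))
```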

are computed to estimate the mean curvature H and the Gaussian curvature K (see Appendix A.3).

Surface Segmentation

Up to this point, all curvature values are estimated and computed at the point level. Points with similar curvature values are then grouped together to form homogeneous regions. Many studies use region-based grouping techniques (clustering, region growing or split-and-merge), boundary-based grouping techniques (edge finding) or a hybrid of the two. The primary purpose of surface segmentation is to group a set of 3D points into disjoint regions that share similar or common properties. In this task, the properties may be further described by the characteristics of the surface (i.e., surface types). Surface segmentation yields a group of contiguous regions on a surface that have homogeneous properties or features; it has been used in many applications such as 3D object recognition and terrain region finding. The task turns surface properties defined at the point level into a more meaningful, region-level representation of the surface. In our case, a surface is segmented by classifying each pixel based on the signs of H and K at that point, followed by region merging through a voting technique [76].

Surface Classification

Surface classification in 3D has been used in many applications such as 3D object recognition, face recognition and terrain region finding. A set of 3D points measuring depth values may be represented as a surface. Given such a surface, geometric features such as ridges, valleys or peaks can be estimated to characterize it. However, it is a non-trivial task to extract geometric features without
obtaining reliable surface curvatures.

TABLE 6.1
H AND K SIGN TEST FOR SURFACE CLASSES [8]

          H < 0          H = 0             H > 0
  K < 0   saddle ridge   minimal surface   saddle valley
  K = 0   ridge          plane             valley
  K > 0   peak           impossible        pit

Searching for Regions of Interest

Gaussian curvature (K) and mean curvature (H) are computed, and the geometric shape type can be identified by surface classification (see Figure 6.7). Threshold values are applied in the sign test that determines the surface class types: a nose tip is expected to be a peak region (K > T_K and H < T_H), a pair of eye cavities should appear as a pair of pit regions (K > T_K and H > T_H) and the nose bridge should be a saddle region (K < T_K and H > T_H), where T_K and T_H are fixed threshold values. Once 3D surface classification is complete, ROIs are detected for the nose tip (peak region), the eye cavities (pit regions) and the nose bridge (saddle region). Figure 6.8 shows each step in detecting these region types. Eye cavities have the region class type pit; however, several such regions may exist on the face. Among the multiple regions classified as pit, a systematic way of finding the eye cavities is used. First, small regions (< 80 points) are removed. Second, a pair of regions that has similar values in both Y and
Z is found. Third, if multiple regions still remain, the pair with the higher Y values is selected. After the eye cavities are found, the nose tip is searched for: starting midway between the left and right eye cavities, region types are checked while traversing downward, looking for a peak surface region with the largest difference in Z value from the center of the pit region. Lastly, the nose bridge is searched for in the area located between the two eye cavities.

Figure 6.7. Images of a single person with different expressions (neutral, happy, disgusted) rendered according to surface types. Many surface types change as deformation is introduced; as the cheeks are lifted in the happy and disgusted expressions, peaks are detected at the upper cheeks on both sides and on the lips.

Pose Correction

Pose correction serves to identify the orientation of the face and to help make the ROI extraction (described in Section 6.3.7) more accurate. Pose
correction is performed by aligning an input surface to a generic 3D face model. Using the curvature-based facial feature finding method, the nose tip is detected, and a circular region around the nose tip is extracted. Only these surfaces are then registered by ICP to correct the pose of the given face (see Figure 6.9). The surface registration produces rotations around the X, Y and Z axes as well as a translation (from the given data to the known model), and the input data points are then transformed by this matrix.

Figure 6.8. Examples of the different surface types classified and detected: eye cavities (pit), nose tip (peak) and nose bridge (saddle), for frontal and 3/4 views. The pit regions at the eye cavities are found first, followed by the nose tip classified as a peak region; lastly, the nose bridge is detected. The ROIs are automatically extracted based on the locations of these regions.
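To make the curvature-based classification described above concrete, the sketch below fits a quadratic patch z = a*x^2 + b*x*y + c*y^2 to a neighborhood already expressed in its local PCA frame, derives H and K from the fitted coefficients, and applies the sign test of Table 6.1. The quadratic form, the threshold values and the helper names are illustrative assumptions rather than the exact formulation used in this work.

    import numpy as np

    def fit_quadratic(xy, z):
        """Least-squares fit of z = a*x^2 + b*x*y + c*y^2 to points expressed
        in the local PCA frame (origin at the point of interest, Z = normal)."""
        x, y = xy[:, 0], xy[:, 1]
        A = np.column_stack([x * x, x * y, y * y])
        (a, b, c), *_ = np.linalg.lstsq(A, z, rcond=None)
        return a, b, c

    def mean_gaussian_curvature(a, b, c):
        """At the origin of the local frame the first derivatives vanish, so
        H = (f_xx + f_yy)/2 = a + c and K = f_xx*f_yy - f_xy^2 = 4*a*c - b^2."""
        return a + c, 4.0 * a * c - b * b

    def classify_surface(H, K, t_h=0.01, t_k=0.0001):
        """Sign test of Table 6.1; t_h and t_k are illustrative thresholds."""
        if K > t_k:
            return "peak" if H < -t_h else ("pit" if H > t_h else "impossible")
        if K < -t_k:
            if H < -t_h:
                return "saddle ridge"
            return "saddle valley" if H > t_h else "minimal surface"
        return "ridge" if H < -t_h else ("valley" if H > t_h else "plane")

For a nose-tip-like neighborhood, for example, classify_surface(*mean_gaussian_curvature(*fit_quadratic(xy, z))) would return "peak".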

Figure 6.9. Two examples of pose correction. Each pair of sub-regions is registered and a transformation matrix is produced.

Surface Extraction

Once the ROIs are identified, a mathematical function can be used to clip out parts of the nose region. Depending on the role of the surface in matching (gallery or probe), different regions are considered. Because of the nature of ICP matching, the points of a probe surface should be a subset of the gallery surface. Therefore, the gallery surface is prepared as an ellipsoidal shape that covers as much facial area as possible, including the general nose region (see Figure 6.10). For a probe surface, three local facial surfaces are extracted
around the nose peak by a set of predefined functions, and each should be a subset of the gallery surface. For example, the second probe surface is determined by five equations: after the region inside a rectangle is extracted, four elliptical regions (labels A, B, C and D in Figure 6.10) are clipped away, yielding the surface region for probe #2 enclosed by the dashed line in the nose area, which is a subset of the gallery surface. Each of the equations is defined by constant offset values from the centroids of the ROIs.

Figure 6.10. Matching surface extraction for a gallery surface and three probe surfaces.

Even though we have argued that the ideal regions for face matching under varying expressions are the areas around the nose, parts of
the nose still show a certain degree of muscle movement (nose bridge/nostrils). This problem can be addressed by considering multiple local surfaces around the general nose area; for instance, the third probe surface excludes the nostril portion of the nose. Considering several different surfaces provides a chance to select the best match among them. These are the regions of interest produced by the facial feature finding. This method of finding the curvature-based face regions is automated and has been evaluated on 4,485 3D face images of 546 people with a variety of facial expressions; the (nose tip, nose bridge, eye pit) regions were successfully found in 97.1% of the images (4,353 of 4,485).

Figure 6.11. As three local surfaces are matched against a gallery, different fusion strategies may be considered to combine the three RMS values produced by the probe surfaces.
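As an illustration of the surface extraction step, the sketch below clips a pose-corrected point cloud to a large elliptical gallery region centered on the nose tip and to a smaller probe region built from a rectangle with four elliptical cut-outs, in the spirit of probe surface #2 (a rectangle plus four ellipses gives the five equations mentioned above). The offsets and radii are made-up placeholders; the actual regions are defined by constant offsets from the ROI centroids determined empirically in this work.

    import numpy as np

    def clip_ellipse(points, center, rx, ry, inside=True):
        """Keep points inside (or outside) an axis-aligned ellipse in the X-Y plane."""
        d = ((points[:, 0] - center[0]) / rx) ** 2 + ((points[:, 1] - center[1]) / ry) ** 2
        mask = d <= 1.0 if inside else d > 1.0
        return points[mask]

    def extract_gallery_surface(points, nose_tip, rx=70.0, ry=90.0):
        """Gallery: a large ellipse around the nose tip covering most of the face."""
        return clip_ellipse(points, nose_tip[:2], rx, ry, inside=True)

    def extract_probe_surface_2(points, nose_tip, half_w=35.0, half_h=45.0,
                                cutouts=((-30, 40, 18, 14), (30, 40, 18, 14),
                                         (-30, -40, 18, 14), (30, -40, 18, 14))):
        """Probe #2: a rectangle around the nose tip with four elliptical regions
        (cf. labels A-D in Figure 6.10) clipped away; placeholder dimensions in mm."""
        cx, cy = nose_tip[:2]
        in_rect = (np.abs(points[:, 0] - cx) <= half_w) & (np.abs(points[:, 1] - cy) <= half_h)
        surf = points[in_rect]
        for dx, dy, ex, ey in cutouts:
            surf = clip_ellipse(surf, (cx + dx, cy + dy), ex, ey, inside=False)
        return surf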

6.3.8 Face Matching in Identification

ICP is an iterative alignment algorithm that works in three phases: (1) establish correspondences between pairs of features in the two models to be aligned, (2) estimate the rigid transformation that best maps the first member of each pair onto the second, and (3) apply the transformation to all features in the first model. These three steps are iterated until convergence. Although simple, the algorithm works quite effectively when a good initial estimate is available. More formally, the point matching algorithm specified by ICP is as follows. Let P be a set of N_p points {p_1, ..., p_{N_p}} and let M be the model. Let ||p - m|| be the Euclidean distance between point p on P and point m on M, and let CP(p, M) be the closest point in M to the scene point p.

1. Let T[0] be an initial estimate of the rigid transformation.
2. Repeat for k = 1 ... k_max iterations, or until convergence:
   (a) Compute the set C of correspondences given by the closest points on M for all p.
   (b) Compute the new Euclidean transformation T[k] that minimizes the root mean square (RMS) error or the absolute value difference (AVD) between the point pairs in C.

Given a pair of surfaces to be matched, the initial registration is performed by translating the centroid of the probe surface to the centroid of the gallery surface. Iterative alignment based on the point differences between the two surfaces is performed, producing a rigid transformation matrix at each iteration. At the end of each
iteration, the RMS difference between the two surfaces is computed. The iteration halts when there is little or no change. Because a probe has three local surfaces that must each be matched to a gallery surface, decision fusion is required to combine the three RMS error values into the final similarity value (see Figure 6.11). During the decision process of matching each probe to one of the gallery entries, some fusion or voting rule must be used. We considered the sum rule, the minimum rule, and the product rule. The sum rule takes the sum of the RMS differences for the three regions of the probe as the probe-to-gallery match value; the minimum rule takes the smallest of the three RMS differences; the product rule takes their product. In initial experiments, the product rule gave the highest recognition accuracy.

TABLE 6.2
STATISTICS OF THE DATASET USED IN THIS STUDY

  Neutral expression sets:       2,798 scans of 449 subjects *
  Non-neutral expression sets:   1,590 scans of 355 subjects
  Subjects with 2 or more scans: 449
  Subjects with only one scan:   97
  Total number of scans:         4,485 scans
  * gallery set
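The matching loop and the fusion rules can be sketched as follows. The ICP step below is a bare-bones nearest-neighbor variant (centroid initialization, SVD-based rigid update, RMS convergence check), not the exact implementation used in this work, and the fusion function simply realizes the sum, minimum and product rules over the three probe-surface RMS values.

    import numpy as np
    from scipy.spatial import cKDTree

    def icp_rms(probe, gallery, max_iter=30, tol=1e-5):
        """Align probe points (N x 3) to gallery points (M x 3) with a simple ICP
        and return the final RMS point-to-closest-point distance."""
        tree = cKDTree(gallery)
        # Initial registration: translate the probe centroid onto the gallery centroid.
        p = probe - probe.mean(axis=0) + gallery.mean(axis=0)
        prev_rms, rms = np.inf, np.inf
        for _ in range(max_iter):
            dist, idx = tree.query(p)              # closest-point correspondences
            rms = np.sqrt(np.mean(dist ** 2))      # RMS error for this iteration
            if abs(prev_rms - rms) < tol:          # halt when there is little change
                break
            prev_rms = rms
            q = gallery[idx]
            # Best rigid transform (rotation + translation) via SVD (Kabsch).
            pc, qc = p.mean(axis=0), q.mean(axis=0)
            U, _, Vt = np.linalg.svd((p - pc).T @ (q - qc))
            R = Vt.T @ U.T
            if np.linalg.det(R) < 0:               # avoid reflections
                Vt[-1] *= -1
                R = Vt.T @ U.T
            p = (p - pc) @ R.T + qc
        return rms

    def fuse(rms_values, rule="product"):
        """Combine the three probe-surface RMS errors into one match score
        (lower is better) using the sum, minimum or product rule."""
        if rule == "sum":
            return float(np.sum(rms_values))
        if rule == "min":
            return float(np.min(rms_values))
        return float(np.prod(rms_values))

A probe would then be assigned to the gallery entry with the smallest fused score, e.g. the gallery surface minimizing fuse([icp_rms(s, g) for s in probe_surfaces]).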

TABLE 6.3
DEMOGRAPHY OF PARTICIPANTS BY AGE

  Age      76.6%   7.5%   2.2%   1.5%   0.1%   0.1%

TABLE 6.4
DEMOGRAPHY OF PARTICIPANTS BY RACE

  Race     White   Asian   Hispanic   Black   Other
           68.1%   23.8%   2.5%       1.5%    4.1%

TABLE 6.5
DEMOGRAPHY OF PARTICIPANTS BY GENDER

  Gender   Male    Female
           56.5%   43.5%

6.4 Data Collection

A total of 546 different subjects participated in one or more data acquisition sessions, yielding the 4,485 3D scans used in this chapter. Among the 546 subjects, 449 participated in both a gallery acquisition and one or more probe acquisitions. The earliest scan with a neutral expression is used as the gallery image. Subjects who have only non-neutral expressions are dropped, since gallery images are acquired under a neutral expression; subjects who have only one image are also dropped. There are two classes of probes, depending on the expression changes asked of the subjects at acquisition time. The first class consists of 9 probe sets, each containing 3D scans acquired under neutral expression in different weeks. This class has a gallery of the 449 subjects and a total of 2,349 probes across the nine probe sets. The second class consists of 8 probe sets, each containing 3D scans acquired while subjects were asked to produce one of the expressions described by Ekman (angry, happy, sad, surprised and disgusted) [32], plus one additional variation, a puffed cheek. The second class has the same gallery as the first class and a total of 1,590 probes acquired in later weeks from 355 subjects, a subset of the 449 subjects in the gallery. The training set, needed only for the PCA method, contains the 449 gallery images plus an additional 97 images of subjects for whom only one good scan was available. These additional 97 images are used only to create the face space for the PCA method.
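The gallery/probe partitioning described above can be expressed compactly. The sketch below assumes each scan record carries a subject id, an acquisition date and an expression label (illustrative field names); it selects the earliest neutral scan per subject as the gallery and drops subjects with no neutral scan or only a single scan.

    from collections import defaultdict

    def partition_scans(scans):
        """scans: list of dicts with keys 'subject', 'date', 'expression'.
        Returns (gallery, probes) following the protocol described above."""
        by_subject = defaultdict(list)
        for s in scans:
            by_subject[s["subject"]].append(s)

        gallery, probes = {}, []
        for subject, items in by_subject.items():
            neutrals = sorted((s for s in items if s["expression"] == "neutral"),
                              key=lambda s: s["date"])
            # Drop subjects with no neutral scan or with only one scan in total.
            if not neutrals or len(items) < 2:
                continue
            gallery[subject] = neutrals[0]          # earliest neutral scan
            probes.extend(s for s in items if s is not neutrals[0])
        return gallery, probes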

6.5 Experiment

The methods considered in the experiments are (1) ICP-baseline, which uses manually selected landmark points and matches the whole face, (2) PCA-baseline, which uses manually selected points and matches the whole face as shown in Figure 6.1 (see Section 6.2 for the baseline methods), (3) ICP-auto, which automatically finds the ROIs and extracts the multiple nose-area regions for matching, and (4) ICP-manual, which uses manually selected landmark points to extract the multiple nose regions for matching. Four experiments are conducted to evaluate recognition rates in various situations, such as time lapse and expression variation between gallery and probe. The first experiment investigates how recognition performance is affected by time variation only, with no expression change. The second experiment evaluates the performance of the ICP-baseline and the two ICP-based methods when both time and expression are varied. In the third experiment, the effect of the number of probes on the performance of the 3D face recognition methods is examined. Finally, the probes are collected into one single pool; there are one or more probes for each subject who appears in the gallery, with each probe acquired in a different acquisition session separated by a week or more.

Performance Effects on Time Variation

This experiment evaluates the performance across 9 probe sets. Probe set #2 has a greater elapsed time between gallery and probe image acquisition than probe set #1, and so on. The results are shown in Figure 6.12. As described earlier in Section 6.2, the ICP-baseline averages 91.0% and the PCA-baseline 77.7%. Both the ICP-auto and ICP-manual methods performed better than the 3D
ICP-baseline, with rank-one recognition rates of 94.6% for ICP-auto and 96.5% for ICP-manual.

Figure 6.12. Rank-one recognition across the 9 probe sets with the same neutral expression as the gallery; the number in parentheses below each probe set is the number of probe images. The product rule obtained 94.6% for ICP-auto and 96.5% for ICP-manual; refer to Section 6.2 for the baseline rates.

Considering the other possible fusion rules, the sum rule obtained 94.5% for ICP-auto and 96.2% for ICP-manual, while the minimum rule obtained 94.1% for ICP-auto and 96.0% for ICP-manual. The differences here are unlikely to be statistically significant. The two algorithms differ only in that ICP-manual uses manually selected landmark points to initialize the ICP matching, while ICP-auto is fully automated. These results show that (1) all ICP-based methods, including the ICP-baseline, outperform the PCA-baseline under neutral expression, (2) there is only a marginal performance difference between the automated and manual ICP methods, reflecting that our automated 3D facial feature finding is 97.1% successful, and (3) the rank-one recognition rates were maintained surprisingly well as the elapsed time between gallery and probe increased. This is a potentially important difference between 3D and 2D face recognition: we do not see any evidence, in this study, of
decreasing recognition rate with increased elapsed time between gallery and probe.

Performance Effects on Expression Variation

This experiment examines our ICP-based methods and the PCA-baseline when subjects have a different expression in their probe image than in their gallery image. The experimental design is the same as in the previous experiment, except that there are 8 probe sets. As shown in Figure 6.13, our proposed ICP-based methods (-auto and -manual) clearly outperform the ICP-baseline and PCA-baseline methods. The average rank-one match rates of the baseline methods are 61.5% and 61.3% for ICP-baseline and PCA-baseline, respectively, while ICP-auto achieves 81.6% on average (86.1% for ICP-manual). The sum rule obtained 80.6% for ICP-auto and 85.6% for ICP-manual, while the minimum rule obtained 78.0% for ICP-auto and 83.6% for ICP-manual. We found that (1) our ICP-based methods also have better identification accuracy than the ICP-baseline and PCA-baseline under varying expressions, (2) expression changes do cause performance to deteriorate for all methods, and (3) the rank-one rates for 3D shape are more consistent in the presence of time lapse than the 2D rates reported in another study [66]; this is observed for both neutral and non-neutral expressions.

Scalability of 3D Face Recognition

The results of this experiment summarize how performance is affected by the number of subjects enrolled in a probe set. Probe set #1 in each class is selected, and every other subject in it is included, so that one half of the original subjects form the second probe set. The next probe set is prepared from
every other probe of the second set, and so on. Each class has five probe sets, and the rank-one rates are plotted in Figure 6.14. Our results indicate a tendency toward higher performance as the gallery size decreases, which agrees with the evaluation results reported in the FRVT [66]. This behavior is more prominent when expressions are varied. However, the decrease in recognition rate that accompanies a doubling of the gallery set size is much smaller here for 3D than has been reported by others for 2D.

Figure 6.13. Rank-one identification results across the 8 probe sets with non-neutral expressions. The product rule obtained 81.6% for ICP-auto and 86.1% for ICP-manual; refer to Section 6.2 for the baseline rates.

Multiple Probe Study

This models the situation where a person is enrolled in the system with one neutral-expression image and attempts are then made to recognize the person at various later points in time from the image acquired at that time, with possibly varying facial expression. For the multiple probe study, the gallery images are acquired in the first week and all probes acquired in later weeks are collected
into a single pool, yielding 3,939 (2,349 + 1,590) probes (see Table 6.2 for the dataset description). A correct or incorrect match is then recorded for each probe, and the performance rate is reported as the correct-match rate averaged over all subjects (see Figure 6.15). The ICP-auto and ICP-manual methods achieve similar performance rates and show higher accuracy than the methods that use the whole face in matching.

Figure 6.14. Rank-one identification rates obtained for probe sets of decreasing size: probes with neutral expression (top) and probes with non-neutral expressions (bottom).
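A minimal sketch of this pooled evaluation, assuming a match function that returns the identity of the best-scoring gallery entry for a probe: each probe is scored correct or incorrect, a per-subject correct-match rate is computed, and the reported figure is the average of those per-subject rates. The exact averaging convention follows the description above; the function and field names are illustrative.

    from collections import defaultdict

    def pooled_correct_match_rate(probes, match):
        """probes: list of dicts with keys 'subject' (true identity) and 'scan'.
        match(scan) -> predicted gallery identity (e.g. argmin of fused ICP RMS).
        Returns the correct-match rate averaged over subjects."""
        correct = defaultdict(int)
        total = defaultdict(int)
        for p in probes:
            total[p["subject"]] += 1
            if match(p["scan"]) == p["subject"]:
                correct[p["subject"]] += 1
        per_subject = [correct[s] / total[s] for s in total]
        return sum(per_subject) / len(per_subject)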

Figure 6.15. Multiple probe study with neutral and non-neutral expressions for the different methods (ICP-baseline, PCA-baseline, ICP-auto, ICP-manual).

6.6 Summary and Discussion

The baseline experiments evaluate PCA- and ICP-based algorithms similar to ones recently reported in the literature [22, 44, 55, 58]. Using a PCA-based algorithm and an ICP-baseline method, each with manually selected landmark points, gives results that show slightly better performance for the ICP-based approach when there is no expression change. These algorithms use the whole frontal face area in matching. The results further show that, in the presence of expression change, the recognition performance of these baseline algorithms drops dramatically. This is because they treat the whole frontal face region as a rigid surface, and of course the face as a whole is not rigid over expression change. This performance deterioration due to expression change can also be seen as a change in the distributions of correct-match versus mismatch distances. Figure 6.16 shows that, for neutral expressions, there is a well-separated boundary between the correct-match distribution and the mismatch distribution.
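The distribution analysis above can be reproduced with a few lines of code: given a matrix of probe-to-gallery distances and the true identities, separate the genuine (correct-match) distances from the impostor (mismatch) distances and compare their histograms. The sketch below assumes NumPy arrays and is illustrative only.

    import numpy as np

    def match_nonmatch_distances(dist, probe_ids, gallery_ids):
        """dist: (num_probes x num_gallery) distance matrix.
        Returns (genuine, impostor) distance arrays for histogramming."""
        same = np.asarray(probe_ids)[:, None] == np.asarray(gallery_ids)[None, :]
        return dist[same], dist[~same]

    # Example: histogram the two distributions to visualize their separation.
    # genuine, impostor = match_nonmatch_distances(dist, probe_ids, gallery_ids)
    # g_hist, edges = np.histogram(genuine, bins=50)
    # i_hist, _ = np.histogram(impostor, bins=edges)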

Figure 6.16. Two pairs of distance distributions (correct match and mismatch) for neutral and expression-varying probe sets, generated with the PCA-baseline. The arrow indicates the increase in the distance metric of correct matches as expressions are varied.

However, as expressions are varied in the probe sets, the correct-match distribution merges into the mismatch region, causing performance degradation. Our new algorithm focuses on the general nose region of the face, as the region most rigid across expression change. We actually use three different shapes around the nose area, to allow for the fact that even the nose region is not perfectly rigid across expression change. The three regions are matched independently from probe to gallery, and the three match values (RMS differences) are combined to recognize the probe identity. We evaluate the new algorithm once using manually selected points to initialize the surface matching, and a second time as a fully automated algorithm that uses no manually selected points. The fully automated version has only slightly lower performance than the version using
manually selected points. This reflects the overall 97.1% accuracy of the automated point selection. The new algorithm, which matches general nose regions, also shows a better performance rate than the PCA-baseline, which indicates that our method has higher discriminatory power. When expression varies between acquisitions of a subject, the recognition rate does not degrade as much as for the ICP-baseline or PCA-baseline using the whole face region. The failure cases of ICP-auto are presented in Figure 6.20 later in this chapter.

Figure 6.17. One of the cases in which localized face regions are more advantageous for matching than the whole face, here obscured by hair (gallery, correctly matched probe, incorrectly matched probe). This subject was incorrectly identified by the PCA-based method but was successfully matched by our new ICP-based method. Problems such as facial occlusion due to hair or a mustache can be resolved by local-region-based matching. The intensity image is shown for illustration purposes only.

Multiple Surfaces for Matching

When subjects show expression variation between gallery and probe images, both ICP-based methods show a performance drop of 10 to 13%, while the PCA-baseline and ICP-baseline both drop by nearly 30% in recognition accuracy. This indicates that multiple local surfaces around the general nose area are
effective in coping with expression changes. This section revisits the time-variation and expression-variation experiments reported above and examines how well each individual probe surface performed in the two cases, neutral versus non-neutral expression. For the neutral-expression probe sets, all three probe surfaces show similar recognition rates, with the highest rate obtained by probe surface #3. The insignificant shape deformation in the absence of expression change makes the three probe surfaces largely redundant, i.e., there is little complementary information among them. Therefore, little improvement is gained when the similarity results are combined (see Table 6.6). Table 6.7 shows the recognition rates based on individual surfaces and the combined results for the probes with non-neutral expressions. Better fusion results are obtained with the three probe surfaces; probe surfaces #2 and #3 are very similar in accuracy when facial expression varies between gallery and probe and contribute the most to the fusion results, while probe surface #1 does not perform as well because it contains facial area (the frontal cheek muscles connected to the nose) whose shape changes with expression. This shows that considering multiple parts of the nose region improves rates in the decision fusion process, which is possible because certain parts of the three probe surfaces remain stationary in shape under expression change.

Feasibility of Other Areas for Matching

One surprising element of this work is that such good performance can be achieved using only a small portion of the whole face with an ICP-based method. Other regions that are relatively rigid under expression change can also be considered for matching, such as the forehead, the chin region or the temple (the soft flat areas at both sides of the forehead)
area.

TABLE 6.6
ICP-AUTO RESULTS BY EACH PROBE SURFACE IN NEUTRAL EXPRESSION

  probe set   surface #1   surface #2   surface #3   combined (product rule)
  #1          92.4%        91.8%        92.2%        93.1%
  #2          91.0%        92.1%        93.1%        93.8%
  #3          93.2%        92.0%        93.8%        94.3%
  #4          94.5%        93.7%        93.4%        94.1%
  #5          92.2%        91.0%        93.8%        95.1%
  #6          94.2%        93.2%        94.2%        94.1%
  #7          91.9%        92.5%        94.2%        94.8%
  #8          94.5%        93.1%        93.8%        93.8%
  #9          97.6%        96.8%        98.4%        98.4%
  Mean        93.5%        92.9%        94.1%        94.6%

In addition to the locations of the ROIs, we are also interested in their ideal shape and size. In order to determine the ideal location and configuration (size and shape) of the ROIs, a small dataset of 257 images (30 individuals, including 15 Whites, 9 Asians and 6 Others) was collected. The searching process is repeated each time the size, shape and location are varied. At the end of this experiment, the three probe region shapes shown in Figure 6.10 were the most effective regions for matching. Various shapes that we attempted but that yielded lower performance rates are shown in Figure 6.18. Both the chin and the forehead regions are sometimes obscured by beard and hair, respectively (see Figure 6.19 (A), (B)). Also, because
Figure 6.18. The shapes (in different sizes) that we attempted for relatively static regions; the performance rates obtained with these shapes were lower than with the shapes used in Figure 6.10.

Figure 6.19. Other locations also remain relatively static under facial expression changes, but they are not ideal candidate regions for matching: (A) a chin obscured by a beard, (B) a forehead obscured by hair, (C) unreliable sensing in regions near the silhouette, such as the temples.

of unreliable sensing on hairy regions, they are not ideal regions for matching. The 3D images from the Minolta range scanner are frontal views, making the temple and cheekbone regions (located on the edge-on silhouette) difficult to use, as shown in Figure 6.19-(C).

TABLE 6.7
ICP-AUTO RESULTS BY EACH PROBE SURFACE IN NON-NEUTRAL EXPRESSION

  probe set   surface #1   surface #2   surface #3   combined (product rule)
  #1          60.6%        71.8%        72.3%        76.9%
  #2          71.7%        77.3%        77.9%        79.1%
  #3          71.4%        74.8%        76.7%        78.2%
  #4          73.7%        75.6%        78.5%        82.3%
  #5          77.4%        85.7%        80.4%        83.3%
  #6          76.0%        84.8%        81.6%        84.8%
  #7          76.9%        78.0%        79.1%        82.4%
  #8          81.8%        85.5%        78.2%        85.5%
  Mean        73.7%        79.2%        78.1%        81.6%

As a final note, there are two areas that can be improved: ICP execution speed and our facial feature finding method. Even though the ICP surface matching itself takes between 0.4 and 0.7 seconds for about 30 iterations, the preprocessing steps (data cleanup, skin detection, facial feature extraction and ROI extraction) take approximately 20 to 30 seconds to complete. Making 3D processing computationally efficient may be a challenging task. One way to
improve face matching is to apply a spatial search technique using a specialized data structure for the ICP method [40]. During the geometric feature computation, only the H and K measures are considered. It might be interesting to examine the effect of the shape index S and the curvedness C on facial data [51]. S and C can be computed simply from the principal curvatures \kappa_1 and \kappa_2:

S = \frac{2}{\pi} \arctan\!\left( \frac{\kappa_1 + \kappa_2}{\kappa_1 - \kappa_2} \right)   (6.1)

C = \sqrt{ \frac{\kappa_1^2 + \kappa_2^2}{2} }   (6.2)

Mathematically, a surface might be classified differently by S-C than by H-K, so S and C could replace or be integrated with H and K to enhance accuracy. Comparing with H and K, Cantzler and Fisher reported that S and C have some advantages at low thresholds and in complex scenes with noise [19].
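As a small worked example, the shape index and curvedness of Equations 6.1 and 6.2 can be computed directly from the principal curvatures, which are themselves recoverable from H and K via kappa_{1,2} = H +/- sqrt(H^2 - K). The sketch below is illustrative; the handling of the umbilic case (kappa_1 = kappa_2, where the shape index is undefined) is an assumption.

    import math

    def principal_curvatures(H, K):
        """kappa_1,2 = H +/- sqrt(H^2 - K); clamp small negative radicands to zero."""
        d = math.sqrt(max(H * H - K, 0.0))
        return H + d, H - d

    def shape_index_and_curvedness(k1, k2):
        """Equations 6.1 and 6.2: S in [-1, 1], C >= 0."""
        C = math.sqrt((k1 * k1 + k2 * k2) / 2.0)
        if k1 == k2:                 # umbilic point: shape index undefined
            return None, C
        S = (2.0 / math.pi) * math.atan((k1 + k2) / (k1 - k2))
        return S, C

    # Example: a peak-like point (H < 0, K > 0 per Table 6.1).
    k1, k2 = principal_curvatures(-0.5, 0.2)
    S, C = shape_index_and_curvedness(k1, k2)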

Figure 6.20. Example cases where the automatic facial feature finding method failed (132 of 4,485 scans) to extract ROI regions for face matching: failures in (A) skin detection, (B) pose correction and (C) ROI extraction. (A) Only half of the face was detected as skin region (left); the curvature map image of the skin region (middle); probe surface #1 extracted from a wrong region (right). (B) Pose incorrectly adjusted (middle); probe surface #1 extracted from a hair region (right). (C) ROI extracted in a wrong area; a peak region is classified in the curvature map image (middle); probe surface #1 extracted from a lip region (right).

More information

Partial Face Matching between Near Infrared and Visual Images in MBGC Portal Challenge

Partial Face Matching between Near Infrared and Visual Images in MBGC Portal Challenge Partial Face Matching between Near Infrared and Visual Images in MBGC Portal Challenge Dong Yi, Shengcai Liao, Zhen Lei, Jitao Sang, and Stan Z. Li Center for Biometrics and Security Research, Institute

More information

Correspondence. CS 468 Geometry Processing Algorithms. Maks Ovsjanikov

Correspondence. CS 468 Geometry Processing Algorithms. Maks Ovsjanikov Shape Matching & Correspondence CS 468 Geometry Processing Algorithms Maks Ovsjanikov Wednesday, October 27 th 2010 Overall Goal Given two shapes, find correspondences between them. Overall Goal Given

More information

CPSC 695. Geometric Algorithms in Biometrics. Dr. Marina L. Gavrilova

CPSC 695. Geometric Algorithms in Biometrics. Dr. Marina L. Gavrilova CPSC 695 Geometric Algorithms in Biometrics Dr. Marina L. Gavrilova Biometric goals Verify users Identify users Synthesis - recently Biometric identifiers Courtesy of Bromba GmbH Classification of identifiers

More information

Online Signature Verification Technique

Online Signature Verification Technique Volume 3, Issue 1 ISSN: 2320-5288 International Journal of Engineering Technology & Management Research Journal homepage: www.ijetmr.org Online Signature Verification Technique Ankit Soni M Tech Student,

More information

Research on Emotion Recognition for Facial Expression Images Based on Hidden Markov Model

Research on Emotion Recognition for Facial Expression Images Based on Hidden Markov Model e-issn: 2349-9745 p-issn: 2393-8161 Scientific Journal Impact Factor (SJIF): 1.711 International Journal of Modern Trends in Engineering and Research www.ijmter.com Research on Emotion Recognition for

More information

Decision Level Fusion of Face and Palmprint Images for User Identification

Decision Level Fusion of Face and Palmprint Images for User Identification XI Biennial Conference of the International Biometric Society (Indian Region) on Computational Statistics and Bio-Sciences, March 8-9, 2012 83 Decision Level Fusion of Face and Palmprint Images for User

More information

3D Models and Matching

3D Models and Matching 3D Models and Matching representations for 3D object models particular matching techniques alignment-based systems appearance-based systems GC model of a screwdriver 1 3D Models Many different representations

More information

1.1 Performances of a Biometric System

1.1 Performances of a Biometric System Performance Analysis of Multimodal Biometric Based Authentication System Mrs.R.Manju, a Mr.A.Rajendran a,dr.a.shajin Narguna b, a Asst.Prof, Department Of EIE,Noorul Islam University, a Lecturer, Department

More information

FILTERBANK-BASED FINGERPRINT MATCHING. Dinesh Kapoor(2005EET2920) Sachin Gajjar(2005EET3194) Himanshu Bhatnagar(2005EET3239)

FILTERBANK-BASED FINGERPRINT MATCHING. Dinesh Kapoor(2005EET2920) Sachin Gajjar(2005EET3194) Himanshu Bhatnagar(2005EET3239) FILTERBANK-BASED FINGERPRINT MATCHING Dinesh Kapoor(2005EET2920) Sachin Gajjar(2005EET3194) Himanshu Bhatnagar(2005EET3239) Papers Selected FINGERPRINT MATCHING USING MINUTIAE AND TEXTURE FEATURES By Anil

More information

Multiple Kernel Learning for Emotion Recognition in the Wild

Multiple Kernel Learning for Emotion Recognition in the Wild Multiple Kernel Learning for Emotion Recognition in the Wild Karan Sikka, Karmen Dykstra, Suchitra Sathyanarayana, Gwen Littlewort and Marian S. Bartlett Machine Perception Laboratory UCSD EmotiW Challenge,

More information

Mobile Human Detection Systems based on Sliding Windows Approach-A Review

Mobile Human Detection Systems based on Sliding Windows Approach-A Review Mobile Human Detection Systems based on Sliding Windows Approach-A Review Seminar: Mobile Human detection systems Njieutcheu Tassi cedrique Rovile Department of Computer Engineering University of Heidelberg

More information

Palm Vein Recognition with Local Binary Patterns and Local Derivative Patterns

Palm Vein Recognition with Local Binary Patterns and Local Derivative Patterns Palm Vein Recognition with Local Binary Patterns and Local Derivative Patterns Leila Mirmohamadsadeghi and Andrzej Drygajlo Swiss Federal Institude of Technology Lausanne (EPFL) CH-1015 Lausanne, Switzerland

More information