
Flexible and Robust 3D Face Recognition

A Dissertation
Submitted to the Graduate School
of the University of Notre Dame
in Partial Fulfillment of the Requirements
for the Degree of
Doctor of Philosophy

by
Timothy C. Faltemier, M.S.

Dr. Kevin W. Bowyer, Co-Director
Dr. Patrick J. Flynn, Co-Director

Graduate Program in Computer Science and Engineering
Notre Dame, Indiana
September 2007

© Copyright by Timothy C. Faltemier 2007
All Rights Reserved

Flexible and Robust 3D Face Recognition

Abstract

by Timothy C. Faltemier

Face recognition is one of the least intrusive modalities in biometrics. Our research shows that using 3D face data for recognition provides a promising route to improved performance. In this dissertation, we introduce a new system for 3D face recognition that addresses many of the current challenges preventing this technology from becoming viable. The first challenge is to determine the feasibility of combining data from multiple sensors. Next, we create a technique named Region Ensemble for FacE Recognition (REFER) that is capable of matching 3D face images in the presence of expressions and occlusion. Accurate feature detection and pose recognition are accomplished through a novel technique that we have created named Rotated Profile Signatures (RPS). Scalability issues are mitigated by combining a feature based indexing technique with desktop grid processing to greatly reduce the amount of time required for recognition experiments. Finally, we investigate the potential benefits of using a multi-instance enrollment approach that can be used to further increase the performance of our system. These solutions result in a final system that is capable of deployment in a variety of realistic biometric scenarios.

To my parents Ed and Sharon, my wife Emily, family, friends, co-workers, and advisors. None of this would be possible without your help.

CONTENTS

FIGURES

TABLES

PREFACE

ACKNOWLEDGMENTS

CHAPTER 1: INTRODUCTION
  1.1 3D Face Recognition
  1.2 Dissertation

CHAPTER 2: RELATED WORKS AND LITERATURE REVIEW
  2.1 Image Matching Background
  2.2 Feature Detection
  2.3 3D Face Recognition
  2.4 2D Face Recognition
  2.5 Object and Face Indexing Methods

CHAPTER 3: DATA ACQUISITION AND IMAGE PREPROCESSING
  3.1 Data Sets
    3.1.1 FRGC v2.0
    3.1.2 ND-2006
    3.1.3 NDOff-2007
  3.2 Data Preprocessing and Feature Extraction
    3.2.1 Frontal Preprocessing and Feature Detection
    3.2.2 Improved Pose Invariant Preprocessing and Feature Detection

CHAPTER 4: FEASIBILITY OF CROSS SENSOR FACE RECOGNITION
  4.1 Introduction
  4.2 Experimental Setup
  4.3 Results
  4.4 Summary

CHAPTER 5: REGION ENSEMBLE FOR FACE RECOGNITION
  5.1 Introduction
  5.2 The Region Committee Voting Approach
    5.2.1 Region Extraction
    5.2.2 Individual Region Matching
    5.2.3 Combining Results From a Committee of Regions
  5.3 Experimental Results
    5.3.1 Experiment 1: Verification
    5.3.2 Experiment 2: Identification
    5.3.3 Experiment 3: FRGC ROC III
    5.3.4 Experiment 4: Expression Variation
    5.3.5 Experiment 5: Missing Data
    5.3.6 Comparison to Results Published on the FRGC v2.0 Data Set
  5.4 Summary

CHAPTER 6: ROTATED PROFILE SIGNATURES FOR MULTI-VIEW 3D FACE RECOGNITION
  6.1 Introduction
  6.2 Survey
  6.3 Contribution
  6.4 Experimental Design and Results
    6.4.1 Experiment 1: Baseline REFER Performance
    6.4.2 Experiment 2: RPS + REFER
    6.4.3 Experiment 3: RPS + REFER on FRGC v2.0 Data Set
  6.5 Summary

CHAPTER 7: 3D FACE RECOGNITION SCALABILITY
  7.1 Improved Image Matching through Indexing
  7.2 3D Face Indexing Methods
    7.2.1 PCA Subspace Projection
    7.2.2 Geometric Distance Measurements
    7.2.3 k-means Clustering
    7.2.4 k-Nearest Neighbor Analysis
  7.3 Experimental Setup and Results
  7.4 Reduced Processing Time using a Desktop Grid Architecture
  7.5 Summary

CHAPTER 8: USING A MULTI-INSTANCE ENROLLMENT REPRESENTATION TO IMPROVE 3D FACE RECOGNITION
  8.1 Introduction
  8.2 Multi-Gallery Data Preprocessing
  8.3 Experimental Design and Results
    8.3.1 Multi-Instance Gallery Using Varied Expressions
    8.3.2 Comparison of Happy and Sad for Multi-Instance Gallery
    8.3.3 Multi-Instance Gallery Versus Component-Based Approaches
    8.3.4 Statistical Significance
  8.4 Conclusions and Future Work

CHAPTER 9: CONCLUSIONS AND FUTURE RESEARCH
  9.1 Contributions
  9.2 Future Research

CHAPTER 10: Appendix
  10.1 Geometrix Category Separation

REFERENCES

FIGURES

3.1 Examples of images captured with the Vivid 910 by Minolta: (a and c) 3D shape data for two different subjects, (b and d) associated 2D color texture information
3.2 Sensors used in our experiment: (a) Qlonerator by 3Q, (b) Vivid 910 by Konica Minolta
3.3 Example of 3Q image types used in our experiments (Subject 04261)
3.4 Example of Minolta image types used in our experiments (Subject 04261)
3.5 Examples of image categories manually labeled by Geometrix: (a) Neutral image, (b) Small expression, (c) Large expression
3.6 Examples of images that contain artifacts in the FRGC v2.0 data set
3.7 Different types of expressions gathered for a subject and their 3D shape images
3.8 Different types of expressions gathered for a subject and their 2D color texture images
3.9 Examples of the 2D pose space (Subject 05078), showing changes in yaw (90° to -90°) and pitch (45° to -45°)
3.10 Examples of the corresponding 3D pose space (Subject 05078) for Figure 3.9
3.11 Example screenshots demonstrating the first two steps of our preprocessing method
3.12 Screenshots demonstrating the region of interest extraction process
3.13 Screenshots demonstrating the steps used in our image alignment process
3.14 Preprocessing steps used on the ND-2006 and NDOff-2007 data sets
3.15 Skin detection results
4.1 3D face image examples used in this experiment (Subject 02463)
4.2 Examples of the types of artifacts present in the images
5.1 A diagram representing the required operations for the REFER algorithm
5.2 Location of the 28 centroids used by the REFER algorithm during region of interest (ROI) selection
5.3 Radius size examples (25mm, 35mm, 45mm)
5.4 Example of the Borda Count algorithm on a sample set of data
5.5 Experiment 1 ROC curve for the All vs. All gallery and probe configuration
5.6 Experiment 2 CMC curve for the First vs. All gallery and probe configuration
5.7 Experiment 4 CMC curves: (a) Neutral vs. Neutral, (b) Neutral vs. Non-Neutral
5.8 Experiment 4 CMC curves: (c) Neutral vs. All, (d) Non-Neutral vs. Neutral
5.9 Experiment 4 CMC curves: (e) Non-Neutral vs. Non-Neutral, (f) Non-Neutral vs. All
6.1 Nose detection examples using frontal and non-frontal images (Subject 04385)
6.2 Example of the rotation process for a subject (05078) with 15° yaw and 45° pitch
6.3 Example of the rotation process for a subject (05078) with 15° yaw and 0° pitch
6.4 Example of the rotation process for a subject (05078) with 15° yaw and -45° pitch
6.5 Model m1 extracted from an image with 0° pitch and model m2 extracted from an image with 45° pitch
6.6 Example of the matching algorithm as model m1 is matched to different configurations on the face
6.7 Profile scores from matching Figure 6.3 with the model
6.8 Example images where the nose tip was incorrectly identified
6.9 Selected images where the nose was incorrectly identified by the RPS algorithm in the FRGC v2.0 data set
7.1 Rough 3D box cropped nose images for three different subjects
7.2 Reference model used for ICP alignment: (a) front view, (b) profile view
7.3 Examples of different intensity images for the same subjects if cropped before alignment
7.4 Example intensity images for three different subjects
7.5 Normalized PCA preprocessed PGM images of three different subjects
7.6 Visualization of 3541 images projected into PCA subspace with select individuals highlighted
7.7 Automatically located features used for geometric distance indexing
7.8 CMC All vs. All performance with no indexing
7.9 Rank-1 recognition rate as the number of nearest neighbors increases
7.10 I/O scalability of data partitioning and access
8.1 Gallery and probe images used in the multiple-instance experiment

TABLES

3.1 Types of expressions in the ND-2006 data set
3.2 Distribution of the 7,317 scans by pose
4.1 Overall Image Information by Sensor
4.2 ICP Matching Results - Iteration Test
4.3 ICP Matching Results - Subsample Test - 20 Iterations
5.1 Running times of components in the REFER algorithm for a verification scenario
5.2 ICP Probe Region Information
5.3 Individual Region Matching Performance
5.4 Fusion Rule Results, All vs. All
5.5 REFER ROC Values at Increasing Confidence Levels
5.6 Fusion recognition rate as the number of regions included in the committee increases one by one, starting with the highest individual recognition rate
5.7 Experiment 1: Verification Results
5.8 Experiment 2: Identification Results
5.9 Experiment 3: ROC III Results
5.10 Experiment 4: Expression Variation Results
5.11 Experiment 5: Simulated Missing Data Results
6.1 Rotated Profile Signatures nose detection results (in %) by pose (in degrees)
6.2 Baseline REFER Rank-One Performance (in %) by Pose
6.3 RPS + REFER Rank-One Performance (in %) by Pose
8.1 Experiment 1 - Multiple Expression Experiments (3921 Probe Images) - Overall Results
8.2 Experiment 1 - Multiple Expression Experiments (3921 Probe Images)
8.3 Experiment 2 - Neutral and Happiness Experiment (5968 Probe Images) - Overall Results
8.4 Experiment 2 - Neutral and Happiness Experiment (5968 Probe Images)
8.5 Experiment 3 - Multi-Instance vs. Single Gallery

PREFACE

ACKNOWLEDGMENTS

I would like to acknowledge my advisors, Dr. Kevin W. Bowyer and Dr. Patrick J. Flynn, for their direction and countless hours spent with me on this dissertation and my research. Without their patience and guidance this dissertation would not have been possible. I would also like to express my gratitude to each of my committee members for their time and suggestions during this process: Dr. Douglas Thain, Dr. Surendar Chandra, and Dr. Nitesh Chawla.

I would like to acknowledge and thank the members of the Computer Vision Research Laboratory and graduate students in the Computer Science Department at the University of Notre Dame for their hard work, thoughtful suggestions, and contributions to my research. In particular, I thank Karsten Steinhaeuser, David Cieslak, and Ryan Kennedy for their patience and helpful recommendations while listening to me discuss the advantages and disadvantages of various 3D face recognition strategies to exhaustion.

My parents, Ed and Sharon, receive my deepest gratitude and love for their dedication, suggestions, support, and funding over the countless years of my academic career. Finally, I would like to thank my wife Emily for her support, love, and encouragement that made this dissertation possible.

Biometrics research at the University of Notre Dame is supported by the National Science Foundation under grant CNS , by the Central Intelligence Agency, by the National Geo-Spatial Intelligence Agency, by UNISYS Corp., and by the US Department of Justice under grants 2005-DD-BX-1224 and 2005-DD-CX-K078.

CHAPTER 1

INTRODUCTION

There is a growing need for an accurate and automatic human recognition procedure. Biometrics is a science that could assist in this recognition process. The word biometrics comes from the Greek words bios, meaning life, and metron, meaning measure. Biometrics uses distinguishing human anatomy or traits to determine or verify the identity of an individual. It is believed that demographics (or "soft biometrics") was first used in ancient Egypt during the construction of the great pyramids. Administrators providing food and supplies to the workforce kept records of every worker's name, age, place of origin, work unit, occupation, and when the worker received their last allowance [1]. This process ensured that no one received more than their share of benefits. Today, thousands of individuals are victims of identity theft, and biometrics is sometimes touted as a possible solution.

Biometrics is generally used in two different ways: subject authentication and subject identification. In the authentication scenario, users present themselves to the biometric system and the biometric system confirms their identity. There are generally three ways to authenticate a user. The first relies on the user's knowledge (passwords, social security numbers, birth dates, mother's maiden name, etc.) for subject identification. The problem with this method of authentication is that knowledge is easily stolen. The next method is based on

possession of a physical artifact. This method relies on the user being given a special card or device, which may be swiped or entered to allow access to a resource. The problem with this method is that the device can be easily lost, stolen, or broken. The last method is based on a measurement of who or what the user is.

The authentication scenario described above is one of the typical applications for biometrics. The other is subject identification. In this scenario, no initial guess at the individual's identity is made. Typically, a biometric measurement is presented to the system and it must determine the best match by comparing it to each comparable signature in the database. Of the two scenarios described, this is generally believed to be the more challenging. Iris recognition, fingerprint matching, and DNA matching are among the most accurate biometric methods; however, they also require the most user interaction. Face and ear recognition are much more desirable in surveillance applications where subject cooperation may not be possible. In this dissertation, we focus our research on 3D face recognition.

1.1 3D Face Recognition

The majority of face recognition research has focused on 2D face images. However, Medioni et al. [2] and Jain et al. [3] motivate possible advantages of 3D face recognition over 2D face recognition. First, the shape is defined (and can supposedly be acquired) independent of lighting, whereas photometric appearance is not. Second, 3D data allows face or head pose correction more easily than 2D data. Third, 3D face shape seems likely to change less with variations in cosmetic use, skin coloration, and similar surface reflectance factors than 2D face appearance.

Typically, a biometric system can be separated into two phases: enrollment (where subjects are presented to the system as known users) and identification

(where an unknown subject is presented to the system for verification). During both phases, the image of the user is processed to improve its quality (fill holes, perform smoothing, remove non-skin regions such as hair) as well as to extract the region(s) of interest for matching. When matching occurs, a score is generated to tell the system operator how similar the input image is to the chosen gallery image.

1.2 Dissertation

This dissertation describes a body of original work in the field of 3D face recognition and presents the components required for a complete biometric system. In Chapter 2, we provide a detailed literature survey of the current state of face recognition and object indexing. Chapter 3 discusses our data acquisition methods and procedures in addition to our preprocessing techniques and algorithms. In Chapter 4, we present our research in the area of cross sensor 3D face recognition. In this experiment, we discuss the feasibility of using data from various sensors in a single recognition experiment. As biometric systems are deployed, this issue will become more common. In addition, we compare the performance of two sensors: the Vivid 910 by Konica Minolta [4] and the Qlonerator by 3Q [5]. We based our data acquisition procedures and experimental configurations on the results from these baseline comparisons.

One of the most prevalent problems in 3D face recognition is the inability to recognize subjects displaying different expressions. Chapter 5 provides our solution to this problem through the use of an ensemble matching technique called Region Ensemble for FacE Recognition (REFER). This technique uses an ensemble of regions to provide expression and occlusion invariant recognition results. Most current matching algorithms rely heavily on either manually or automatically

detected feature points. These points usually include the nose tip, eyes, mouth, chin, and many others. When these fiducial points are incorrectly located, the algorithm is likely to produce poor recognition performance. Chapter 6 presents a novel and efficient method, named Rotated Profile Signatures (RPS), for handling the problem of pose invariant feature detection. This method uses extracted 2D facial signatures to automatically locate feature points on the face with a high level of accuracy regardless of pose.

It is widely accepted that 3D data provides more information than its 2D counterpart. While this additional information can be used to improve performance, the increase in data often increases the amount of time required for recognition. To address this issue, Chapter 7 presents two approaches for coping with the scalability issue of 3D face identification. Next, most biometric experiments assume that the data captured during the acquisition phase contain frontal images with neutral expressions. Realistic biometric identification scenarios, however, cannot guarantee this level of cooperation from participants. For this reason, Chapter 8 investigates how using a variety of different expressions in the gallery set affects recognition performance. In addition, we determine the recognition benefits that can be achieved by enrolling multiple images of a subject that contain different expressions. Finally, Chapter 9 summarizes the work performed in this dissertation and proposes areas of future research that can further the field of 3D face recognition.

CHAPTER 2

RELATED WORKS AND LITERATURE REVIEW

A recent broad survey of face recognition research is given in [6] and a survey focusing specifically on face recognition using 3D data is given in [7]. This chapter focuses on the most closely related work in 2D and 3D face recognition and current methods for feature detection, and provides an overview of current research in the area of human and object indexing methods and algorithms.

2.1 Image Matching Background

In this dissertation, we employ two techniques for image comparison: the Iterative Closest Point (ICP) algorithm [8] and Principal Component Analysis (PCA) [9]. The ICP algorithm is a 3D technique that takes two point clouds or meshes, the probe and the gallery, and iteratively attempts to align the probe to the gallery. To accomplish this task, ICP first finds the closest point in the gallery set for each of the n points in the probe set. When implemented with a k-d tree [10] data structure, each nearest neighbor search can be accomplished in O(log(n)) time. Beginning with a starting estimate, the algorithm calculates a sequence of rigid transformations T_i until there is no additional improvement in the mean square distance between the two meshes. ICP may result in an improper registration if the initial alignment is not sufficient, because the algorithm can converge to a local minimum.

For this reason, the probe and gallery centroids are frequently aligned before ICP is performed to ensure an accurate final registration estimate.

The PCA algorithm [9] identifies the directions of maximum variance in a group of data points scattered in a space and obtains a projection along the axes where the variance is maximized. A new coordinate system is created to represent the data, such that the greatest amount of variance lies along the first coordinate, the second greatest variance along the second coordinate, and so on. This process transforms correlated variables into uncorrelated variables, and can also be used to reduce the dimensionality of a data set for simplified analysis. In face recognition, this technique is used to determine the level of similarity between images. First, a set of 2D gallery images is decomposed into vectors whose values are the grayscale pixels of each image. These vectors are analyzed and a new coordinate system is created. Next, probe images are decomposed and projected into the new coordinate system. Using a nearest neighbor technique, the gallery image most similar to the probe image will be the one closest to it in the new coordinate system.
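To make the ICP matching described at the start of this section concrete, the sketch below implements one common formulation: k-d tree nearest-neighbor correspondences followed by a least-squares (SVD-based) rigid transform, iterated until the mean square distance stops improving. This is an illustrative sketch, not the VTK-based implementation used in the experiments later in this dissertation; the function names, iteration limit, and convergence tolerance are our own choices.

```python
import numpy as np
from scipy.spatial import cKDTree

def best_rigid_transform(src, dst):
    """Least-squares rotation R and translation t mapping src onto dst
    (corresponding rows), via the SVD of the cross-covariance matrix."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    M = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(M)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:            # avoid reflections
        Vt[-1, :] *= -1
        R = Vt.T @ U.T
    t = dst_c - R @ src_c
    return R, t

def icp(probe, gallery, max_iters=50, tol=1e-6):
    """Align probe (Nx3) to gallery (Mx3).  Centroids are pre-aligned first,
    as described above, to reduce the chance of a poor local minimum."""
    probe = probe + (gallery.mean(axis=0) - probe.mean(axis=0))
    tree = cKDTree(gallery)             # O(log M) nearest-neighbor queries
    prev_err, err = np.inf, np.inf
    for _ in range(max_iters):
        dists, idx = tree.query(probe)  # closest gallery point for each probe point
        err = np.mean(dists ** 2)       # mean square distance between surfaces
        if prev_err - err < tol:        # no additional improvement: stop
            break
        prev_err = err
        R, t = best_rigid_transform(probe, gallery[idx])
        probe = probe @ R.T + t
    return probe, err                   # aligned probe and final match score
```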

2.2 Feature Detection

Xu et al. [11] propose an approach for locating the nose tip in 3D facial data. Their method uses a hierarchical filtering scheme combining two rules to extract the points that distinguish the nose from other salient points. The first rule states that the nose tip will be the highest point in a certain direction that is determined by finding the normals on the face. This rule eliminates many points, leaving a limited number of candidate points (the chin, the forehead, the cheeks, hair, etc.). The next rule attempts to model the cap-like shape of the nose tip itself. Each candidate point is characterized by a feature vector containing the mean and variance of its neighboring points. The vectors are projected into mean-variance space and a Support Vector Machine (SVM) is used to determine the boundary between nose tips and non-nose tips. The authors note that this rule is also challenged by wrinkles, clothing, or other cap-like areas on the face. The authors use three databases to test their algorithm. The largest database, the 3D Pose and Expression Face Models (3DPEF), contains 300 images of 30 subjects with small changes in pitch, yaw, and roll, and a 99.3% nose detection rate is reported.

Rajwade et al. [12] demonstrate a method for automatic pose detection and correction using support vector regression on wavelet sub-bands. This technique is able to classify the 3D pose of subjects in an identity-invariant manner with an accuracy of ±9° in the x and y directions. The authors report correct classification results of up to 99% using data from two frontal data sources: the Freiburg database [13] and the FRGC v1.0 data set. The authors synthetically create different 3D poses by rotating the complete 3D models from 0° to 90° around the Y axis and -30° to 30° around the X axis. This method is not representative of realistic biometric acquisition and may be challenged when only partial face data is available due to variation in facial pose.

2.3 3D Face Recognition

Jain et al. [14] used the ICP algorithm to align 3D meshes containing face geometry. Their algorithm is based on four main steps: feature point detection in the probe images, rough alignment of probe to gallery by moving the probe centroid to match, iterative adjustment based on closest point matching (ICP),

and using known points (i.e., the eyes, tip of the nose, and the mouth) to verify the match. Once this process is run, the ICP algorithm reports an average root-mean-square distance that represents the separation between the gallery and probe meshes (i.e., the quality of the match). After running this process against their database of images with one gallery image and one probe image per subject, they achieved a 95.6% rank one recognition rate with 108 images of 113 participants.

Chang et al. [15] use multiple overlapping nose regions and obtain increased performance relative to using one whole-frontal-face region. These regions include a nose circle, nose ellipse, and a region composed of just the nose itself. This method uses the ICP algorithm to perform image matching and reports results on a superset of the FRGC v2 data set containing 4,485 3D face images. 2D skin detection is performed for automated removal of hair and other non-skin artifacts on the 3D scan. They report results of 97.1% rank one recognition on a neutral gallery matched to neutral probes and 87.1% rank one recognition on non-neutral probes matched to a neutral gallery. The product fusion metric [16] was used to process the results from multiple regions. When the neutral gallery was matched to the neutral probe set, maximum performance was reported when only two of the three regions were combined. The authors mention that increased performance may be gained by using additional regions, but did not explore anything beyond three overlapping nose regions.

Lu et al. [17] created a method for face recognition that uses a combination of 2.5D scans to create a single 3D image for gallery enrollment. For their experiments, they employed 2.5D probe models of various poses and expressions that were matched to 200 3D gallery models collected at the authors' institution. They found that by using the full 3D image of a subject in the gallery and

implementations of the ICP and Linear Discriminant Analysis (LDA) algorithms, they were able to achieve a 90% rank one recognition rate using a data set consisting of arbitrary poses. The authors report that nearly all of the errors in recognition were caused by a change in expression between the probe and the gallery images.

In [18], Lu et al. present an algorithm for matching 2.5D scans in the presence of expressions and pose variation using deformable face models. A small control set is used to synthesize a unique deformation template for a desired expression class (smile, surprise, etc.). A thin-plate-spline (TPS) mapping technique drives the deformation process. The deformation template is then applied to a neutral gallery image to generate a subject-specific 3D deformation model. The model is then matched to a given test scan using the ICP algorithm. The authors report results on three different types of experiments. The first data set contains 10 subjects with 3 different poses and seven different expressions. Rank one results of 92.1% are reported when deformation modeling is used, compared to 87.6% when it is not. The second data set consists of 90 subjects in the gallery and a set of 2.5D test scans, and similar results are reported. Data for the first two experiments was gathered at the authors' institution. The data for the final experiment was taken from the FRGC v2 data set and consisted of 50 randomly chosen subjects in the gallery and a set of 2.5D test scans. When deformation modeling is employed, a rank one recognition rate of 97% is reported, compared to 81% when it is not.

Gökberk et al. [19] perform a comparative evaluation of five face shape representations (point clouds, surface normals, facial profiles, PCA, and LDA) using the well known 3D-RMA data set [20] of 571 images from 106 subjects. They find that the ICP and LDA approaches offer the best average performance. They also perform various fusion techniques for combining the results from different shape

representations to achieve a rank-one recognition rate of 99.0%.

Bronstein et al. [21] describe a technique to transform the facial surface to a space where the representation is invariant to isometric transformations (i.e., expressions or manipulations of the face). They obtain geometric invariants in the images that allow multi-modal 2D+3D recognition using 2D face texture images mapped onto a 3D face. Once this combined image is generated, they use eigendecomposition of canonical and flattened texture images. Experiments showed that the proposed technique outperforms a 2D PCA (eigenfaces) approach.

Hesher et al. [22] used multiple range images per subject with a principal component analysis (PCA) based matcher to allow a greater possibility of matching. Once the sensor acquires the range images and they are normalized, PCA is used to reduce the dimensionality of the image representation and facilitate matching. Noise and background information were documented as factors that degraded performance. The authors perform experiments on the FSU 3D face database containing 222 scans of 37 unique subjects displaying a total of 6 facial expressions. They report a range of identification percentages based on the size of the training and testing sets. For a training set containing 185 scans and a testing set containing 37 scans, a maximum identification rate of 94% is reported.

Lu et al. [23] propose a method of feature extraction based on the directional maximum in a 3D image. A nose profile is represented by different subspaces and a nearest neighbor approach is used to select the best candidates for the nose tip. Of the nose candidates, the point that best fits the statistical feature location model (i.e., the nose should be below the eyes and above the mouth) is selected as the final nose tip. The authors claim recognition results similar to those achieved from manually marked features on the FRGC v1.0 face image data set (953 frontal

scans from 277 subjects) and the MSU data set (300 scans from 100 subjects with varying changes in yaw). The authors state that this method is only robust to changes in yaw, and that changes in pitch would result in an expensive brute force search.

Mian et al. [24] propose an expression invariant approach to 3D face recognition that reports results on the FRGC v2.0 data set [25]. They perform automatic nose detection by slicing the 3D image horizontally, smoothing uniformly, and filling holes using linear interpolation. A circle centered at the maximum value of the slice is used to find the triangle with the largest area. The three triangle points consist of the circle center and the locations where the circle intersects the slice. A line is fit to the candidate nose tips (which should follow the nose bridge) and the point with the maximum confidence level (based on the triangle altitude) is selected as the final nose tip. Because this method relies heavily on the triangle altitude for nose tip determination, it may perform poorly in the presence of large changes in the pitch or roll of an image.

Wong et al. [26] created a 3D system that uses multiple regions of the facial surface to improve recognition performance. Their algorithm uses linear discriminant analysis (LDA) to fuse the results from each of the 10 regions and to provide a weighted combination of the similarity scores. The authors found that using a combination of regions outperforms the results obtained using any single region. In addition, they found that using LDA to determine the fusion weighting parameters provided higher results than the standard sum rule. The authors perform experiments on the FRGC v2.0 data set [25] and report a verification rate of 90.0% at 0.1% false accept rate on experiment 3, ROC III of the FRGC program.

As with Chang [15] and Lu et al. [17], we are specifically interested in 3D face

recognition in the presence of varying facial expression between gallery and probe images. Since Martinez [27] and Heisele [28] found good performance in 2D face recognition by matching a large number of face regions, we consider something similar for 3D. Whereas Chang considered just three regions, all overlapping the nose area, we initially consider 38 regions representing various areas of the face. We consider a variety of approaches for fusion of the committee results, and find that methods different from those previously used [15, 28] give the best results. We also find better performance than a number of previous papers that also use the FRGC v2 data set. Details of these performance comparisons can be seen in Chapter 5.

2.4 2D Face Recognition

Martinez [27] uses multiple local region patches to perform 2D face recognition in the presence of expressions and occlusion. The motivation for this is that different facial expressions influence some parts of the face more than others. His algorithm addresses this observation by more heavily weighting areas that are less affected by the currently displayed emotion. Reported results show that up to one-sixth of the face can be occluded without a loss in recognition, and one-third of the face can be occluded with only a minimal loss. This work used the well known AR database of 2D images.

Yacoob et al. [29] use data sets containing 2D images of 20 subjects and 60 subjects to demonstrate that using a single non-neutral expression (such as a smile) for both the probe and the gallery images in a biometric experiment has more discriminatory power than a neutral expression. The experiments use the PCA algorithm for comparison and the authors define a metric called discrimination power to capture the relationship between the within-class and between-class

scatters of the images in subspace. This work also used the AR database of 2D images.

Heisele et al. [28] demonstrate a novel approach for face recognition that combines the techniques of 3D morphable models and component-based recognition. The authors use three 2D face images to generate a 3D head model of each subject. That subject is then rendered under varying illumination and pose conditions to build a large gallery of synthetic images. Recognition is performed on a single global image as well as 35 individual facial components. Results show that when fused, the individual facial components perform better than the single global image. For this experiment, results were reported on 2000 images (10 subjects, 200 images per subject) collected at the authors' institution.

2.5 Object and Face Indexing Methods

Of the 57 3D face recognition papers reviewed in [7], none mention a method other than brute force comparison against all gallery entries for a recognition scenario. However, indexing methods of various types are used in other application contexts.

Mhatre et al. [30] propose a method to reduce the search space of a database by placing images in different bins based on the subject's hand geometry and written signature. First, feature vectors are found by taking various measurements on each type of image (distance between palm and fingertip, major stroke length of signature, etc.). Once the feature vectors are calculated, the authors use the k-means clustering algorithm to cluster images. The authors used a database representing 50 users, each having 5 training images and 5 testing images, for a total of 500 images. They are able to reduce the search space to 5% of the original size while not affecting the false reject rate (FRR).

Jain et al. [31] propose the use of ancillary information (gender, height,

weight, age, and ethnicity) to complement the recognition performance of current biometric techniques (fingerprint, face, gait, etc.). These soft biometrics are unable to uniquely identify the subject. However, they are able to eliminate potential flaws in other matching engines. First, a primary biometric system is used to determine the probability of a match. Next, the soft biometrics are extracted and used to generate a second probability. The results from these two systems are combined using decision fusion and the user is either accepted or rejected. The authors conducted experiments on 263 users of a fingerprint system and show that by using soft biometrics for post-screening, recognition performance can be increased by roughly 5%.

Funkhouser et al. [32] create a method to search for 3D models using global shape descriptors based on spherical harmonics to index into a database of shapes. They are able to return query results in under one second from a repository of over 20,000 3D models. The harmonic descriptor is created by decomposing a 3D model into functions based on concentric spheres. This descriptor allows for fast, accurate, orientation-invariant indexing and retrieval of 3D shape information.

Belongie et al. [33] study an approach for determining similarity between two objects. Using a point matching system, their technique calculates the similarity between two objects and a nearest neighbor method is used to determine the closest object prototype. This method has been used in recognizing trademarks, silhouettes, handwritten digits, and the COIL data set [34], but to our knowledge has not been applied to face recognition.

Jain et al. [35] provide an overview of clustering methods used intensively in pattern recognition and image recognition applications. Among these clustering methods, k-means [36] is particularly popular for its essentially linear time require-

ment of O(nkl), where n, the number of patterns, is variable, while k, the number of clusters, and l, the number of iterations, are fixed; its low space requirement of O(n + k); and its order independence. However, k-means is limited by its sensitivity to initial cluster seedings and its hyperspheric cluster shape. In this dissertation we apply k-means clustering to improve image recognition performance.

Matei et al. [37] propose a method for rapid 3D automobile indexing using the Locality Sensitive Hashing (LSH) algorithm to probabilistically find database descriptors similar to a given model. They test their algorithm using real 3D LADAR data in an uncontrolled environment. With 20 possible models returned, they report an indexing performance of 87.06% on a database containing 89 vehicles and more than 180,000 unique features.

A separate category of clustering methods is hierarchical algorithms. Most techniques within this group are based on single-link [38], complete-link [39], and minimum-variance methods [40]. Such methods generate a dendrogram, which represents the nesting of patterns and depicts relative levels of similarity between them.
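To make the O(nkl) cost structure concrete, the sketch below is a minimal Lloyd's-algorithm k-means: each of the l iterations assigns all n patterns to the nearest of the k centroids and then recomputes each centroid. This is a generic illustration, not the indexing code used later in this dissertation; the random seeding shown here is exactly the sensitivity to initial seedings noted above.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Minimal k-means: O(n*k) work per iteration for n patterns, run for a
    fixed number of iterations l, hence O(nkl) overall.  (The vectorized
    distance matrix below trades the O(n + k) space bound for simplicity.)"""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # Assign each pattern to its nearest centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned patterns
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return labels, centroids
```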

CHAPTER 3

DATA ACQUISITION AND IMAGE PREPROCESSING

All images used in this work were taken under an IRB-approved experimental protocol at the University of Notre Dame. For our experiments, we use two different scanners: the Vivid 910 by Konica Minolta [4] and the Qlonerator by 3Q [5]. Each of the cameras uses a different image acquisition technique.

The Minolta 910 scanner uses triangulation with a laser stripe projector to build a 3D model of the face from a sequence of profiles. Both color (r, g, b) and 3D location (x, y, z) coordinates are captured, but not perfectly simultaneously, and the laser stripe requires a few seconds to cross the face. The resolution of the Minolta camera is 640x480, yielding a maximum of roughly 300,000 possible sample points. The number of 3D points on a frontal image of the face taken by the Minolta camera is typically around 112,000, and depends on the lens used as well as the standoff. Additional vertices arise from hair, clothing, and background objects. Example images from this sensor can be seen in Figure 3.1.

The 3Q camera uses stereo imaging assisted by texture pattern projection to simplify the correspondence problem. Images are acquired in approximately 10ms and require a minute (per stereo pair) of post processing on a 2.4GHz Pentium 4 processor to obtain the 3D mesh. This sensor is optimized for face scanning and has the ability to take an ear-to-ear scan. For the 3Q device, the average number of points is around 25,000 for a full image and 10,000 for the face. 3Q images

generally include the neck and shoulder region in addition to the face itself. It is possible to increase the resolution of the scanner by changing the texture pattern projected. However, the resolution cannot currently be increased easily to match that of the Minolta.

(a) 04831d152 (b) 04831d152 (c) 04701d157 (d) 04701d157
Figure 3.1. Examples of images captured with the Vivid 910 by Minolta: (a and c) 3D shape data for two different subjects; (b and d) associated 2D color texture information.

3.1 Data Sets

In this section, we discuss the various data sets used in this dissertation, their sizes, and their potential impact on the biometric research community.

(a) 3Q (b) Minolta
Figure 3.2. Sensors used in our experiment: (a) Qlonerator by 3Q, (b) Vivid 910 by Konica Minolta.

3.1.1 FRGC v2.0

The first data set used in our experiments is the Face Recognition Grand Challenge (FRGC) v2.0 [25], which includes a 3D shape and texture image for each of 4007 face scans of 466 unique subjects. This data set has been distributed to organizations across the world to promote consistency in experimental evaluation of face recognition systems [41]. The FRGC program sought to document an increase in face recognition performance by an order of magnitude over the previous Face Recognition Vendor Test (FRVT 2002) [42].

(a) Full 3Q Image (b) Cropped Gallery Image (110mm) (c) Cropped Probe Image (85mm)
Figure 3.3. Example of 3Q image types used in our experiments (Subject 04261).

(a) Full Minolta Image (b) Cropped Gallery Image (110mm) (c) Cropped Probe Image (85mm)
Figure 3.4. Example of Minolta image types used in our experiments (Subject 04261).

(a) 02463d550 (b) 02463d560 (c) 02463d666
Figure 3.5. Examples of image categories manually labeled by Geometrix: (a) Neutral image, (b) Small expression, (c) Large expression.

The top FRVT 2002 performance on 2D face recognition was an 80% verification rate at a false accept rate (FAR) of 0.1%. The goal of the FRGC program was to increase the verification rate to 98% at a fixed FAR of 0.1%.

This data set includes 4007 3D face images of 466 distinct persons, with as many as 22 images per subject. The images were acquired with a Minolta Vivid 910 [4]. Of the 4007 FRGC v2.0 images, 1538 have non-neutral expressions (e.g., disgust, happiness, surprise). Examples of images displaying neutral and non-neutral expressions can be seen in Figure 3.5. A source of complications is found in images containing shape artifacts due to subject motion during scanning. Examples of images containing artifacts can be seen in Figure 3.6.

Maurer et al. [43] manually separated the FRGC v2.0 data set into three different categories [44] based on the subject's expression. These categories were based on visual inspection of each image for the amount of expression present. They classified 2469 images as neutral, 796 images as small expressions, and 742 images as large expressions. They report results [43] based on these three

(a) 04475d114 (b) 04749d72 (c) 04760d76 (d) 04812d42
Figure 3.6. Examples of images that contain artifacts in the FRGC v2.0 data set.

image categories.

3.1.2 ND-2006

The ND-2006 data set is a superset of the FRGC v2.0 data set [25] created to allow for large experiments containing a variety of expressions. The ND-2006 data set contains a total of 13,450 images with 6 different types of expressions (Neutral, Happiness, Sadness, Surprise, Disgust, and Other), as seen in Figures 3.7 and 3.8. A total of 888 distinct persons, with as many as 63 images per subject, are available in the ND-2006 data set. A table containing the number of images based

(a) Neutral (N)-2D 04514d325 (b) Happiness (H)-2D 04514d329 (c) Sadness (Sd)-2D 04514d337 (d) Surprise (Sp)-2D 04514d333 (e) Disgust (D)-2D 04514d335 (f) Other (O)-2D 04514d444
Figure 3.7. Different types of expressions gathered for subject 04514 and their 3D shape images.

(a) Neutral (N)-3D 04514d324 (b) Happiness (H)-3D 04514d328 (c) Sadness (Sd)-3D 04514d336 (d) Surprise (Sp)-3D 04514d332 (e) Disgust (D)-3D 04514d334 (f) Other (O)-3D 04514d444
Figure 3.8. Different types of expressions gathered for subject 04514 and their 2D color texture images.

on expression type is shown in Table 3.1. While the Neutral category contains the largest number of images, expressions account for more than a quarter of the data set. Information about obtaining a copy of this data set will be available soon [45].

TABLE 3.1
Types of expressions in the ND-2006 data set

Expression      Number of Images
Neutral (N)     9889
Happiness (H)   2230
Disgust (D)     205
Sadness (Sd)    180
Surprise (Sp)   389
Other (O)       557

3.1.3 NDOff-2007

The final data set used in this dissertation contains images taken from the largest 3D pose variation face database to date: NDOff-2007. Standard 3D face databases such as the Face Recognition Grand Challenge (FRGC) v2.0 [25] or ND-2006 [45] do not contain non-frontal images that would allow for a significant experiment on pose variation. The NDOff-2007 data set contains a total of

6,911 non-frontal images displaying neutral expressions and a single frontal neutral image for each of 406 distinct subjects. Examples of different poses in the NDOff-2007 data set can be seen in Figures 3.9 and 3.10, and the number of images per pose is given in Table 3.2. The (0°, 0°) pitch and yaw values correspond to frontality. The pose space is not uniformly sampled. However, due to facial symmetry, results from the left should be similar to those from the right.

TABLE 3.2
Distribution of the 7,317 scans by pose (yaw vs. pitch)

3.2 Data Preprocessing and Feature Extraction

Each of the algorithms in this dissertation operates automatically using the 3D shape data from the face, and 2D data for certain algorithms. In this section, we describe the two preprocessing methods used in our experiments. The first method

used is similar to Chang et al. [15] and works only for frontal images. The second algorithm improves on this work to provide a faster and more accurate estimation of the 3D surface. Feature detection for the second algorithm is described in Chapter 6.

3.2.1 Frontal Preprocessing and Feature Detection

First, the input scan is subsampled. This is done solely to reduce the computation in later steps. Next, we fill holes in the face using a Delaunay triangle approximation algorithm. Spikes and outliers are also removed, and surface smoothing using Gaussian weighting with σ = 1.5mm and the neighboring points within a 5mm radius is performed at each point. Smoothing is important because hair and other skin attributes can cause the scanner to produce erroneous depth estimates during image acquisition. We found that smoothing results in better curvature estimates.

Surface curvature estimation yields the following values for every point: Gaussian curvature, mean curvature, shape index, and curvedness. The surface curvature algorithm uses a Monge patch technique [46] to find the Gaussian (K) and mean (H) curvature. Following Flynn [47], given a list of N 3D points (x'_i, y'_i, z'_i) in the local coordinate frame defined at p = (0, 0, 0), we solve the linear least squares problem associated with the bicubic surface fit:

z' = f(x', y') = a_1 x'^3 + a_2 x'^2 y' + a_3 x' y'^2 + a_4 y'^3 + a_5 x'^2 + a_6 x' y' + a_7 y'^2 + a_8 x' + a_9 y' + a_10.

The mean and Gaussian curvatures are related to the two invariants of the

Hessian matrix of the graph function f:

    [ f_x'x'   f_x'y' ]
    [ f_y'x'   f_y'y' ]

which, evaluated at the origin of the local frame, has entries f_x'x' = 2a_5, f_x'y' = f_y'x' = a_6, and f_y'y' = 2a_7. The Gaussian curvature K is the determinant of this matrix, and the mean curvature H is one-half its trace. The two principal curvatures are obtained from H and K by

k_1 = H + sqrt(H^2 - K),    k_2 = H - sqrt(H^2 - K).

For every point, we create a list containing all contiguous points that lie within a neighborhood of 10mm. For each list, we find the mean of the data set, form the scatter matrix, compare the result with the normals of each point, and finally perform a least squares surface fit. This procedure allows us to solve for the H and K values. Using these two values, the surface patch can be classified into one of eight standard shapes. The shape index (S) and curvedness (C) [48] are computed from the principal curvatures k_min and k_max, which are in turn computed from H and K:

k_min = H - sqrt(H^2 - K),    k_max = H + sqrt(H^2 - K)

S = (2 / π) arctan( (k_min + k_max) / (k_min - k_max) )

C = sqrt( (k_min^2 + k_max^2) / 2 )
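As an illustration of the curvature computation described above, the sketch below fits the bicubic surface to a point's neighbors by linear least squares and evaluates H, K, the shape index, and the curvedness at the patch origin. It is a simplified stand-in for the actual implementation: it assumes the neighbors are already expressed in the local coordinate frame centered at the point of interest (with at least 10 neighbors available), and the function and variable names are ours.

```python
import numpy as np

def monge_patch_curvature(neighbors):
    """Fit z' = f(x', y') (the bicubic above) to a point's neighbors and return
    mean curvature H, Gaussian curvature K, shape index S, and curvedness C,
    all evaluated at the origin of the local frame."""
    x, y, z = neighbors[:, 0], neighbors[:, 1], neighbors[:, 2]
    # Design matrix for z' = a1 x'^3 + a2 x'^2 y' + ... + a9 y' + a10
    A = np.column_stack([x**3, x**2 * y, x * y**2, y**3,
                         x**2, x * y, y**2, x, y, np.ones_like(x)])
    a, *_ = np.linalg.lstsq(A, z, rcond=None)   # coefficients a_1 ... a_10

    # Second-order partials of the fitted surface at the origin
    fxx, fxy, fyy = 2.0 * a[4], a[5], 2.0 * a[6]
    K = fxx * fyy - fxy**2             # Gaussian curvature: det of the Hessian
    H = 0.5 * (fxx + fyy)              # mean curvature: half the trace

    root = np.sqrt(max(H**2 - K, 0.0))
    k_max, k_min = H + root, H - root  # principal curvatures
    if k_min == k_max:                 # umbilic point: shape index is degenerate
        S = 1.0
    else:
        S = (2.0 / np.pi) * np.arctan((k_min + k_max) / (k_min - k_max))
    C = np.sqrt((k_min**2 + k_max**2) / 2.0)
    return H, K, S, C
```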

Once the preprocessing is complete, we next attempt to segment the face into various regions of interest. We find the nose tip point using a consensus of three methods.

Hypothesize: We use the curvature information to find a candidate nose saddle and peak. This step searches for the region with a shape index between a fixed lower threshold and 0.0 and a curvedness greater than 0.2, which has been shown by Jain et al. [14] to return the regions of the face that have a shape similar to the tip of the nose. A colored curvature map is seen in Figure 3.13, where the red shaded area indicates the regions that are classified by our algorithm as potential nose tips. We return the largest region as the nose tip and label it c_n.

Verify: We match the image to a default model that contains well defined facial features with very little noise. This model has already been placed in a standard orientation before matching occurs. ICP is used to align the image with the model. The highest Z value (which should now be the nose tip) is labelled p_n. The distance between c_n and p_n is then computed; if it is less than 20mm, it is assumed that the nose tip has been correctly found. If this distance is greater than 20mm, we apply the following tiebreaker algorithm.

Tiebreaker: We label the location of the nose tip on the default model (seen in Figure 3.13) as m_n. Therefore, m_n indicates the approximate region of the nose tip on both the input image and the model. Distances between m_n, p_n, and c_n are calculated, and the pair with the smallest distance is averaged and reported as b_n, the best nose tip.
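The decision logic of this consensus can be summarized in a few lines. The sketch below is only an illustration of the hypothesize/verify/tiebreaker flow: the text above does not state exactly which point is reported when c_n and p_n agree, so this version returns their midpoint, and the function and argument names are ours.

```python
import numpy as np

def consensus_nose_tip(c_n, p_n, m_n, agree_mm=20.0):
    """c_n: curvature-based candidate, p_n: highest-Z point after ICP alignment
    to the default model, m_n: nose tip location of the default model."""
    c_n, p_n, m_n = map(np.asarray, (c_n, p_n, m_n))
    if np.linalg.norm(c_n - p_n) < agree_mm:
        # Hypothesis and verification agree: accept (midpoint used here)
        return (c_n + p_n) / 2.0
    # Tiebreaker: average the closest pair among (m_n, p_n, c_n)
    pairs = [(m_n, p_n), (m_n, c_n), (p_n, c_n)]
    a, b = min(pairs, key=lambda pq: np.linalg.norm(pq[0] - pq[1]))
    return (a + b) / 2.0   # b_n, the best nose tip estimate
```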

3.2.2 Improved Pose Invariant Preprocessing and Feature Detection

The previous feature detection algorithm performs well when provided with images containing a frontal pose. However, in Chapter 6, experiments are performed on the NDOff-2007 data set, which contains a large number of semi-frontal and non-frontal images. When the previous algorithm is applied to non-frontal data, it results in poor performance. This is due to the inability of the algorithm to compensate for the rotated pose. In these images, the highest Z value is not necessarily located on the nose tip, as it would be in a frontal image. In addition, when a frontal nose model is matched using ICP to a non-frontal face, incorrect registrations result.

To address these issues, we developed an improved pose and expression robust feature detection algorithm that operates automatically using 3D face shape data and 2D color texture information for non-skin removal. Unlike the previous algorithm, we do not perform subsampling; we have experimentally determined that as the data size decreases, so does the recognition performance. This improved feature detection is used in all of the experiments in Chapter 6.

First, small holes in the range image are filled by locating missing points that are surrounded by 4 or more good points. The x, y, and z coordinates of the missing point are interpolated from its valid neighbors. Boundary points are initially determined by a sweep through the range image row by row to find the first and last valid 3D point. This process is repeated until no additional points are created. Once hole filling is performed, a final pass over the range image with a 3x3 median filter smooths the data and removes spikes in the z-coordinate. The process of hole filling and filtering is illustrated in Figure 3.14.

Finally, we use a skin detection algorithm described by Boehnen et al. [49] to minimize the possibility of the RPS algorithm incorrectly finding the nose due to clothing or hair. This algorithm first segments the foreground information from the background using the z-coordinate of the range image data. Then it performs skin classification on each pixel in 2D in the YCbCr color space and masks out the other regions as non-valid. To fill holes, it finds the maximum and minimum x-coordinates of the skin pixels on each row and, using these boundary values, classifies all pixels between these points as valid skin pixels. The final result of the skin detection preprocessing step can be seen in Figure 3.15.
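A minimal sketch of the hole-filling and spike-removal step described above is given below. It is an illustrative simplification (the boundary-point sweep is omitted, and the function and argument names are ours): missing range-image points with 4 or more valid 8-neighbors are interpolated from those neighbors, the pass repeats until no new points are created, and a 3x3 median filter on the z-coordinate then removes spikes.

```python
import numpy as np
from scipy.ndimage import median_filter

def fill_and_smooth(range_img, valid):
    """range_img: HxWx3 array of (x, y, z) coordinates; valid: HxW boolean mask
    of points returned by the scanner.  Returns the repaired image and mask."""
    img = range_img.copy()
    filled = valid.copy()
    h, w = filled.shape
    changed = True
    while changed:                      # repeat until no additional points are created
        changed = False
        new = filled.copy()
        for r in range(h):
            for c in range(w):
                if filled[r, c]:
                    continue
                r0, r1 = max(r - 1, 0), min(r + 2, h)
                c0, c1 = max(c - 1, 0), min(c + 2, w)
                mask = filled[r0:r1, c0:c1]
                if mask.sum() >= 4:     # enough valid neighbors to interpolate
                    img[r, c] = img[r0:r1, c0:c1][mask].mean(axis=0)
                    new[r, c] = True
                    changed = True
        filled = new
    # Final pass: 3x3 median filter on the z-coordinate to remove spikes
    img[..., 2] = median_filter(img[..., 2], size=3)
    return img, filled
```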

Figure 3.9. Examples of the 2D pose space (Subject 05078). The images show the changes in yaw (90° to -90°) and changes in pitch (45° to -45°). Labeling from left to right and top to bottom, the image filenames are 05078d403, 05078d401, 05078d405, 05078d427, 05078d453, 05078d429, 05078d399, 05078d459, 05078d463, 05078d431, 05078d

Figure 3.10. Examples of the corresponding 3D pose space (Subject 05078) for Figure 3.9. The images show the changes in yaw (90° to -90°) and changes in pitch (45° to -45°).

(a) Original Image (02463d546) (b) Subsampled every 4th row and column (c) Smoothed and tessellated image
Figure 3.11. Example screenshots demonstrating the first two steps of our preprocessing method.

(a) Region of interest (nose) found (b) Cropped gallery image (c) Cropped probe image (d) Default model used for ICP rotation and translation
Figure 3.12. Screenshots demonstrating the region of interest extraction process.

(a) Shaded curvature map (b) Pose rectified image (c) Default alignment model (d) Resulting image after alignment process
Figure 3.13. Screenshots demonstrating the steps used in our image alignment process.

(a) Original Image d546 (b) Image after hole filling (c) Result of surface smoothing
Figure 3.14. Preprocessing steps used on the ND-2006 and NDOff-2007 data sets.

(a) Original 2D image (b) 3D image before skin detection (c) Final preprocessed 3D image (3D data masked by 2D pixels classified as skin)
Figure 3.15. Skin detection results.

CHAPTER 4

FEASIBILITY OF CROSS SENSOR FACE RECOGNITION

4.1 Introduction

Practical deployment of 3D face recognition will almost certainly require accommodation of data sets from different sensors. Different 3D sensors generate a variety of data formats, densities, and accuracies [50]. There are no prior papers addressing the issue of matching 3D face scans from different sensors; current work in the literature uses only a single sensor for both probe and gallery image acquisition. The majority of previous research in 3D face recognition has been in algorithm development and in characterization of the types of images that sensors can return for use with those algorithms. We chose to combine those efforts and investigate the effect of sensor variation on algorithm performance.

Figure 4.1 shows the difference between images of the same person taken on the same date with two different 3D cameras: the Qlonerator from 3Q [5] and the Vivid 910 from Konica Minolta [4]. To the eye, these images appear to be quite different. Different noise properties, sampling densities on the face, and resolution accuracies are among the issues that we discuss in this chapter.

[1] This section is based on the paper, Cross Sensor 3D Face Recognition Performance, Proc. International Society for Optical Engineering (SPIE), 6102.

(a) An image from the Qlonerator by 3Q (b) An image from the Minolta Vivid 910
Figure 4.1. 3D face image examples used in this experiment (Subject 02463).

In the following pages we will present the results of our findings and discuss the issues that must be addressed to successfully perform cross sensor 3D face recognition today.

4.2 Experimental Setup

Each subject had their face image taken with both the 3Q and Minolta cameras. The subjects were asked to display a neutral expression during acquisition. The images from the first acquisition session are used as the gallery set. Four weeks later, face images were acquired for the same set of 120 subjects. This set of images is the probe set. A time lapse between gallery and probe image acquisition should be more realistic, in terms of possible applications, than acquiring both images in the same session.

This experiment examines the rank-one recognition rate for matching scans in all of the possible combinations: Minolta probe to Minolta gallery, 3Q to 3Q, Minolta to 3Q, and 3Q to Minolta. Ideally, there would be no difference in recognition rate when matching occurs across sensors instead of within an individual sensor.

The recognition engine for the experiments employs the Iterative Closest Point (ICP) algorithm [8]. Before matching, the raw data from both sensors is converted to the VTK data format for use with the ICP algorithm implemented in the VTK software environment [51]. Since the ICP algorithm attempts to match points in the probe set to points in the gallery set, there is a need to eliminate non-relevant data. The 3Q image includes the participant's face and upper shoulder region. In order to have a standard landmark identified with the same accuracy for each scanner, we automatically found the tip of the nose on each participant by using curvature information with a modified version of an existing technique [52]. Using this feature, we cropped the data into gallery and probe sets as illustrated in Figure 3.3. The gallery is cropped to a spherical radius of 110mm from the nose tip and the probes are cropped to a spherical radius of 85mm from the nose tip. This ensures that the probe data is a subset of the gallery data, as assumed by the ICP matching.

There are many internal and external factors that can affect recognition performance. In this experiment, we test two of them. The first test varies the number of iterations that ICP is allowed to run. This explores the number of iterations required before the alignment converges and which combination of sensors yields the best results. The second experiment investigates the effect of subsampling on the Minolta data. As can be seen in Table 4.1, a Minolta face scan contains

TABLE 4.1
Overall Image Information by Sensor

File               Size      # of Points   Typical Artifacts
3Q Full            1.8 MB    28,000        Holes
3Q Gallery         660 KB    10,500        Holes
3Q Probe           420 KB    6,800         Holes
Min Full           13 MB     175,000       Holes + Spikes
Min Gallery        10 MB     140,000       Holes + Spikes
Min Probe          5.5 MB    81,000        Holes + Spikes
Min Gallery 1:4    900 KB    14,000        Holes + Spikes
Min Probe 1:4      680 KB    11,000        Holes + Spikes
Min Gallery 1:16             4,000         Holes + Spikes
Min Probe 1:16               3,600         Holes + Spikes

As can be seen in Table 4.1, a Minolta face scan contains nearly 14 times as many 3D points as a 3Q face scan. We applied several levels of subsampling to see how a Minolta image performs when it is subsampled to match the density of the 3Q. 1:4 subsampling is obtained by discarding every other row and column in the Minolta image's 3D + texture image array, and 1:16 subsampling is obtained by discarding 3 of every 4 consecutive rows and columns. An example of a subsampled image is shown in a later figure.
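Both preprocessing steps are simple to state in code. The sketch below assumes the face is stored as an unstructured point list for cropping and as a structured H x W x 3 range-image array for subsampling; the function names and variable names are illustrative only.

```python
import numpy as np

def crop_sphere(points, nose_tip, radius_mm):
    """Keep only the 3D points within radius_mm of the detected nose tip.
    points: (N, 3) array; nose_tip: (3,) array."""
    keep = np.linalg.norm(points - nose_tip, axis=1) <= radius_mm
    return points[keep]

def subsample_range_image(range_image, factor):
    """Discard rows and columns of a structured Minolta range image.
    factor=2 keeps every 2nd row/column (1:4 subsampling in points);
    factor=4 keeps every 4th row/column (1:16 subsampling)."""
    return range_image[::factor, ::factor]

# Example: build the gallery (110mm) and probe (85mm) crops from one scan.
# 'scan_points' and 'nose_tip' are assumed to come from the preprocessing step.
# gallery_points = crop_sphere(scan_points, nose_tip, 110.0)
# probe_points   = crop_sphere(scan_points, nose_tip, 85.0)
```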

Figure 4.2. Examples of the types of artifacts present in the images. (a) The arrows indicate holes in the mesh due to hair. (b) The arrows represent spikes in the mesh caused by the scanner.

4.3 Results

We found several interesting results from the experiments. First, there is a noticeable difference in the quality of data produced by the two sensors with respect to the goal of 3D face recognition using ICP. Subjectively, the Minolta camera produces a mesh that is more recognizable to the naked eye as the actual participant. The major problem with the raw data is the presence of large spikes (as seen in Figure 4.2) on the sides of the face as well as in non-scannable regions (i.e., the eyes). For the majority of cases, cropping of the images eliminated the first problem,

as only a portion of the face was used. However, the second problem still exists. The 3Q data was able to complete the ICP iterations much faster than the Minolta data, since there are many fewer points in a cropped 3Q face image (see Table 4.2). The problem with the 3Q image is that it tends to have many holes, as seen in Figure 4.2. The eyes and hair caused problems in many images. For example, many female participants had portions of their face missing due to hair over the face, as seen in Figure 4.2. This causes differences between the images from the two cameras. With the Minolta camera, hair is often rendered as part of the mesh and is used as points in the ICP match. The 3Q camera, however, frequently has trouble capturing the hair in its images and therefore leaves a hole, which does not participate in the ICP match. This causes cross scanner performance to decrease, as a portion of the face containing hair (in Minolta) is being matched to one containing a hole (in 3Q) and vice versa. Of the 120 persons in the experiment, 22 subjects yielded problematic Minolta images with hair appearing in the mesh, and 24 subjects yielded problematic 3Q images with holes where the hair is. However, only 14 of the subjects yielded problematic images in both 3Q and Minolta.

The data for this experiment was processed on a Linux cluster of 31 nodes. Fifteen of these nodes are dual 2.8 GHz Xeon processors with 2 GB of memory, while the other 16 nodes are dual 1.0 GHz Pentium 3s with 1.5 GB of memory.

The 3Q data is noisier and contains fewer points than a Minolta image of the same person at a similar standoff. With an ICP-based recognition algorithm in which the smaller probe is matched to the larger gallery image and not vice versa, it appears crucial to have the highest density image (Minolta) available as the gallery.

Table 4.2 shows the results when the number of iterations that ICP is allowed to run is manipulated.

When no modifications are made to the images and the ICP settings are equivalent, the Minolta camera yields more rank-one matches than the 3Q. As we increase the number of iterations, both the number of rank-one matches and the time per match increase. After 20 iterations the number of rank-one matches does not increase significantly, which leads us to conclude that for this ICP implementation and these data sets, 20 iterations is sufficient. When the Minolta probe to Minolta gallery test is run, it takes only 5 iterations to yield results that are superior to the 3Q to 3Q matching at 40 iterations. The Minolta-to-Minolta time per match at 9 seconds is greater than the 5 seconds required for the 3Q-to-3Q match at 40 ICP iterations. The cross scanner matching of 3Q probe to Minolta gallery performs reasonably well at 90.8% for 40 iterations. However, the time for each match is 33 seconds; in this time, 6 matches of the 3Q to 3Q or 3 matches of the Minolta to 3Q data can be performed. When the 3Q data is used as a gallery, the highest rank-one percentage that it can provide is 75%. This suggests that there is no gain in having a probe sensor with a higher density than the gallery sensor.

Table 4.3 shows the results of an experiment exploring the extent to which the Minolta data can be subsampled and still provide results similar to those of the 3Q scanner. The results were surprising. If subsampling by a factor of 16 is performed on the Minolta data, a higher recognition rate and a lower time per match than the 3Q scanner are obtained. Subsampling the Minolta probe did little to change the performance because of the accuracy of the points provided by the Minolta scanner. This experiment demonstrates that as long as a matching point exists in the gallery, the number of probe points employed is less important: the time to match was similar, as was the number of rank-one matches. The real reduction in computation time came when the Minolta gallery was subsampled.

As the size of the files was decreased, the time per match decreased as well. In the 3Q to Minolta matching experiment, the best subsampling-to-performance ratio occurred at a subsampling factor of 16:1. The results were almost identical to those of the non-subsampled data, yet took half the time to compute. In the Minolta to Minolta test, 4:1 subsampling yielded similar performance at less than half the processing time, while 16:1 subsampling decreased the processing time to one fifth of the original while decreasing the recognition rate by 6%.

Upon inspecting the incorrect matches for the various tests, we found that the failures were due to several causes. For the majority of the 3Q to 3Q matches, hair over the participant's forehead prevented rank-one matches from taking place. For the Minolta to 3Q matches, we believe that the larger number of points in the Minolta data provided a greater opportunity for matching in wrong locations on different participants, thereby throwing off those individual matches. The ICP algorithm assumes that each point in the probe will be able to find a matching point in the gallery; if there are holes in the gallery face but not in the probe face, then those points will be forced into an incorrect match. Many incorrect 3Q matches also were due to small artifacts evident on the subject's forehead and cheeks. As usual, spikes and holes played a large part in the incorrect matches.

4.4 Summary

The two cameras produce different qualities of data for purposes of face recognition. That matching 3Q gallery and probe images with up to 40 iterations of ICP gives 75% rank-one recognition, whereas matching Minolta gallery and probe images gives 95% rank-one recognition, provides evidence for this. In addition, even when

the number of points in the Minolta scan is subsampled to be less than that of the 3Q, Minolta scans give 87.5% rank-one recognition compared to 74.2% with 3Q.

It is possible under certain (limited) conditions to match face images from different scanners without a significant loss in recognition accuracy. Evidence for this is that matching 3Q probe images to 3Q gallery images gives 75% rank-one recognition, and matching Minolta probe to 3Q gallery gives 74.2%.

When matching face images from scanners with different densities of sample points, it is important that the higher density image be used as the gallery. We found that the recognition rate depends on the ability of the probe to find a correct (or close) match in the gallery, and if this match is not available, the recognition performance decreases. Evidence for this is that matching Minolta probe to 3Q gallery gives 74.2%, while 3Q probe to Minolta gallery gives 90.8%. When a Minolta probe is matched to a Minolta gallery, peak recognition rates of up to 94.1% can be achieved. This unbalanced result leads us to the conclusion that, for optimal recognition performance, data from the Minolta scanner should be employed for all other experiments in this dissertation.

TABLE 4.2
ICP Matching Results - Iteration Test

Probe Cam   Gallery Cam   # of Iter.   # Rank One Matches    Seconds per Match
3Q          3Q            5            67/120 = 55.8%        1
3Q          3Q            10           85/120 = 70.8%        2
3Q          3Q            20           89/120 = 74.2%        3
3Q          3Q            40           90/120 = 75.0%        5
Min         3Q            5            66/120 = 55.0%        2
Min         3Q            10           78/120 = 65.0%        4
Min         3Q            20           89/120 = 74.2%        5
Min         3Q            40           89/120 = 74.2%        9
3Q          Min           5            78/120 = 65.0%        10
3Q          Min           10           89/120 = 74.2%        13
3Q          Min           20           106/120 = 88.3%       18
3Q          Min           40           109/120 = 90.8%       33
Min         Min           5            98/120 = 81.6%        9
Min         Min           10           111/120 = 92.5%       13
Min         Min           20           113/120 = 94.1%       16
Min         Min           40           114/120 = 95.0%       28

TABLE 4.3
ICP Matching Results - Subsample Test - 20 Iterations

Probe Cam   Gallery Cam   Amt Sub   # Rank One Matches    Seconds per Match
3Q          3Q            None      89/120 = 74.2%        3
Min         3Q            None      89/120 = 74.2%        5
Min         3Q            4:1       89/120 = 74.2%        5
Min         3Q            16:1      88/120 = 73.3%        4
3Q          Min           None      106/120 = 88.3%       18
3Q          Min           4:1       107/120 = 89.1%       9
3Q          Min           16:1      98/120 = 81.7%        4
Min         Min           None      113/120 = 94.1%       16
Min         Min           4:1       112/120 = 93.3%       7
Min         Min           16:1      105/120 = 87.5%       3

CHAPTER 5

REGION ENSEMBLE FOR FACE RECOGNITION

5.1 Introduction

In this chapter, we introduce a new system for 3D face recognition based on the fusion of results from a committee of regions that have been independently matched. Experimental results demonstrate that a committee of 28 small regions on the face allows for the highest level of 3D face recognition performance. Score-based fusion is performed on the individual region match scores, and experimental results show that the Borda Count and Consensus Voting methods yield higher performance than the standard Sum, Product, and Min fusion rules. In addition, results are reported that demonstrate the robustness of our algorithm by simulating occlusion and artifacts in images. To our knowledge, no other work has been published that uses a large number of 3D face regions for high performance expression-invariant face matching. Rank-one recognition rates of 97.2% and verification rates of 93.2% at a 0.1% false accept rate are reported and compared to other methods published on the Face Recognition Grand Challenge v2.0 data set.

Face recognition in 3D has been addressed using a variety of methods, including alignment, subregion matching, mapping, and Principal Component Analysis (PCA).

Footnote 1: This section is based on the paper, 3D Face Recognition with Region Committee Voting, Proc. of the Third International Symposium on 3D Data Processing, Visualization and Transmission (3DPVT).

The Face Recognition Grand Challenge v2.0 data set [25] is the largest publicly available data set for 3D face recognition research. This set contains images exhibiting substantial expression variation, which can cause problems for many recognition algorithms. Our approach exploits subregions of the face that remain relatively consistent in the presence of expressions and uses a committee of classifiers based on these regions to improve performance in the presence of expression variation. This concept is called Region Ensemble for FacE Recognition (REFER). This chapter extends the work in [53] by increasing the number of facial regions considered to 38, selecting the best-performing subset of size 28, and discussing results achieved through the use of additional fusion methods found in the literature [19, 54]. Each region is matched independently to a gallery surface using the ICP algorithm [8], resulting in a committee of unique error distances for a single probe-to-gallery comparison. Based on a threshold for each match, individual regions vote for a given identity. The results presented in this chapter significantly outperform those in our previous work [53]. For all verification experiments in this chapter, we report results using an operating point of FAR = 0.1%. We show improved 3D face recognition performance over previously published papers [15, 24, 43, 55, 56] on the FRGC v2.0 data set, using the experiments defined in those papers.

5.2 The Region Committee Voting Approach

Figure 5.1 shows a diagram of the REFER algorithm when it is used in a verification scenario. An example of the verification scenario can be seen at airport checkpoints (x-rays or check-in). In this situation, a user approaches the verification device and presents the system with a document stating his or her identity.

Figure 5.1. A diagram representing the required operations for the REFER algorithm.

The system will then capture a current image of the user and compare this instance to one stored in a secure location. If the images match within an acceptable threshold, the user is verified and access is granted to the restricted area. Table 5.1 shows the running time of the REFER algorithm in this scenario. From preprocessing to final decision, the verification process takes less than 10 seconds on a 2.4 GHz Pentium IV processor with 1 GB of memory. This code is not optimized, and there are many enhancements that could be made to reduce execution time to a deployable level.

TABLE 5.1
Running times of components in the REFER algorithm for a verification scenario

Step                           Time (in ms)
Data Preprocessing             7,520
REFER Matching (28 regions)    2,380
Result Fusion

5.2.1 Region Extraction

Once the nose tip is successfully found in the preprocessing step, we translate the incoming image to the origin and crop a gallery region, which is defined by a sphere of radius 100mm centered at the nose tip. To find the best committee of local regions, we consider 38 regions on the face (some of which overlap) whose centroids are shown in Figure 5.2. Examples of the relative radius sizes used in our experiments are seen in Figure 5.3. Together, this collection of regions densely covers the face. We experimented with additional regions surrounding the mouth; however, none led to additional performance gains. Table 5.2 shows the cropping parameters used to generate each region. OffsetX and OffsetY determine the new sphere center in relation to the origin, and SphereRadius determines the new sphere radius. By selecting multiple small regions on the face, any errors caused by a single region can be compensated for when combining the matching scores from the other regions, making the system more robust to image artifacts, wrinkles, facial hair, or expression variations.
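The region cropping itself is straightforward; the sketch below illustrates how a committee of regions could be generated from a nose-tip-centered face scan using (OffsetX, OffsetY, SphereRadius) entries such as those in Table 5.2. The function names and the three example parameter triples are illustrative assumptions, not the actual values from Table 5.2.

```python
import numpy as np

def crop_region(points, offset_x, offset_y, sphere_radius):
    """Crop one committee region from a face scan that has already been
    translated so the nose tip sits at the origin.
    points: (N, 3) array of 3D points in mm."""
    center = np.array([offset_x, offset_y, 0.0])
    keep = np.linalg.norm(points - center, axis=1) <= sphere_radius
    return points[keep]

def extract_region_ensemble(face_points, region_params):
    """Return the list of cropped probe regions for one face.
    region_params: iterable of (offset_x, offset_y, sphere_radius) triples,
    e.g. the 38 rows of Table 5.2."""
    return [crop_region(face_points, ox, oy, r) for (ox, oy, r) in region_params]

# Placeholder parameters only, for illustration.
example_params = [(0.0, 0.0, 25.0), (0.0, 20.0, 35.0), (-30.0, 10.0, 45.0)]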

Figure 5.2. Image number 324 of subject 04514 (04514d324) displaying the location of the 28 centroids used by the REFER algorithm during the region of interest (ROI) selection. Each number corresponds to the region location information in Table 5.2. The circles show the relative size and coverage of the three spheres for the selected location.

Figure 5.3. Radius size examples: (a) 25mm, (b) 35mm, and (c) 45mm sphere radius.

5.2.2 Individual Region Matching

Once all of the regions have been cropped and the ICP algorithm has been run on each probe-to-gallery combination, we determine how well each individual region performs in a verification experiment with the FAR fixed at 0.1% (the FRGC performance point [25]) and in an identification experiment reporting the rank-one recognition rate. A baseline region (a sphere cropped at 100mm from the nose tip) is also included to show the effect of matching the entire face to the gallery rather than using smaller regions. The results in Table 5.3 show that no individual region is able to demonstrate performance greater than 84.8% TAR at 0.1% FAR, or 90.2% rank-one recognition. This motivates the use of fusion to further increase performance.
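For reference, the two per-region figures of merit used here can be computed directly from a probe-by-gallery matrix of ICP error distances. The sketch below assumes smaller scores are better and that probe and gallery identity labels are available; it is a simplified illustration, not the evaluation code used to produce the reported numbers.

```python
import numpy as np

def rank_one_rate(scores, probe_ids, gallery_ids):
    """scores[i, j] = ICP error between probe i and gallery j (lower = better)."""
    best = np.argmin(scores, axis=1)
    return np.mean(probe_ids == gallery_ids[best])

def tar_at_far(scores, probe_ids, gallery_ids, target_far=0.001):
    """True accept rate at a fixed false accept rate (e.g. 0.1%)."""
    genuine = probe_ids[:, None] == gallery_ids[None, :]
    impostor_scores = np.sort(scores[~genuine])
    # Threshold that accepts roughly target_far of the impostor comparisons.
    k = max(int(np.floor(target_far * impostor_scores.size)) - 1, 0)
    threshold = impostor_scores[k]
    return np.mean(scores[genuine] <= threshold)
```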

TABLE 5.2
ICP Probe Region Information (for each of the 38 numbered regions and the Full Frontal baseline: X offset, Y offset, and sphere radius)

TABLE 5.3
Individual Region Matching Performance (TAR at 0.1% FAR and rank-one recognition rate for each of the 38 regions; Full Frontal baseline: 60.9% TAR at 0.1% FAR, 73.4% rank-one recognition)

5.2.3 Combining Results From a Committee of Regions

Choosing the best fusion method when processing a large number of regions is a challenging task. In this chapter, we have experimented with many of the fusion techniques described by Gökberk et al. [19] and Jain et al. [54] for biometric applications and describe the details of each selected algorithm below.

The sum rule computes the final fusion score based on the formula

    s_i = \sum_{k=1}^{K} s_{ki},    i = 1, ..., N,

where K is the number of regions and N is the number of images to process. Similarly, the product rule is based on the formula

    s_i = \prod_{k=1}^{K} s_{ki},    i = 1, ..., N.

These methods can result in a large score range depending on the input values. Chang et al. [15] use the product rule to generate their results. The min rule simply reports the smallest value among the K region scores as the final score. This method is highly dependent on data normalization; without this step, the regions with overall smaller similarity values will dominate the final scores. The consensus voting (CV) method casts a vote for the closest match in the gallery for each of the K regions, and the image with the highest number of votes is declared the match. The Borda Count (BC) technique [19, 54, 57-59] sorts all of the similarity scores in the gallery and adds the rank obtained for each region; the image with the lowest sum of ranks is declared the match. The CV and BC methods both require knowledge of similarity scores from other individuals in the gallery to make a decision. Hence, the standard implementations of these two methods are unacceptable for use with a verification experiment (since only the match score(s) of the current subject are known).
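As a concrete reference, a minimal sketch of these standard rules is given below, assuming a K x N array of region match scores in which smaller values indicate better matches; the array layout and function names are illustrative assumptions.

```python
import numpy as np

# Illustrative implementations of the standard fusion rules described above.
# 'scores' is a (K, N) array: K region match scores for each of N gallery images,
# where smaller scores indicate better matches.

def sum_rule(scores):
    return scores.sum(axis=0)

def product_rule(scores):
    return scores.prod(axis=0)

def min_rule(scores):
    return scores.min(axis=0)

def consensus_voting(scores):
    """Each region votes for its best-matching gallery image; the gallery image
    with the most votes is declared the match."""
    votes = np.zeros(scores.shape[1], dtype=int)
    for region_scores in scores:
        votes[np.argmin(region_scores)] += 1
    return votes  # identification decision: np.argmax(votes)
```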

Figure 5.4. Example of the Borda Count algorithm on a sample set of data.

In this chapter, we use a modified version of the BC method. Unlike the standard BC method, which assigns a rank score (1st place, 2nd place, 3rd place, ..., nth place) to each probe-to-gallery similarity score entry in the set from 1..n, where n is the number of gallery images, we only give a rank score to the first α entries in the gallery (in our experiments α = 4). A limitation of the BC method is that a probe region must be matched to each gallery image before the first α entries can be awarded a score. The traditional BC method takes the sum of the ranks to determine a final score. Our version distributes votes such that first place gets α² votes, second place gets (α - 1)² votes, and so on until the number of votes reaches 0. This modification yielded experimentally higher results than the traditional BC method. We believe this result is due to the larger weight given to rank-one matches. In our modification, a rank-one match and a rank-three match to the same subject outperform two rank-two matches to a different subject. An example of this method can be seen in Figure 5.4. In this example, G1 is the gallery image that is a match to the incoming probe image. Regions R1, R2, and R3 are individual regions that are matched to each gallery. After distributing the points using the method described above, the final fusion ranks are displayed in the last row in boldface.
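The following sketch illustrates the modified Borda Count just described: only the top α gallery entries of each region receive votes, and the votes decay quadratically (α², (α - 1)², ..., 1). The function name and score layout are assumptions for illustration.

```python
import numpy as np

def modified_borda_count(scores, alpha=4):
    """scores: (K, N) array of region-to-gallery match scores (lower = better).
    Returns the fused vote total for each of the N gallery images."""
    num_regions, num_gallery = scores.shape
    votes = np.zeros(num_gallery)
    for region_scores in scores:
        ranking = np.argsort(region_scores)               # gallery indices, best first
        for place, gallery_idx in enumerate(ranking[:alpha]):
            votes[gallery_idx] += (alpha - place) ** 2    # alpha^2, (alpha-1)^2, ..., 1
    return votes  # identification decision: np.argmax(votes)
```

With α = 4, a rank-one plus a rank-three vote for the same subject (16 + 4 = 20) outranks two rank-two votes for a different subject (9 + 9 = 18), matching the behavior described above.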

For verification experiments, we train the set of region-specific thresholds {t_1, t_2, ..., t_i} using images from the FRGC v1 data set, which consists of 943 images containing neutral expressions. We believe that the verification results presented in this chapter could be further improved by using a more representative training set (one that contains expressions).

Our committee-based algorithm is similar to the consensus voting method. There is a set of region-specific thresholds t_i that are independently tuned to optimize performance on each region. The t_i for each region is fixed once the desired operating point of 0.1% FAR is reached by the tuning procedure. For each probe-to-gallery face match, we have m matching scores, and we compare each matching score m_j to its corresponding region threshold t_i. If a matching score is below its threshold, we report that region as a correct match (1); if the distance is above the eligibility threshold, we output that region's match as incorrect (0). If we wish to decrease the FAR, we simply increase the vote threshold required for a match to be considered correct. The number of votes for a particular match can be seen as a relative measure of confidence (the more votes an image receives, the greater the confidence in our final decision).

5.3 Experimental Results

In this chapter, we perform two types of recognition experiments. The first is a verification experiment, in which the system's performance is quoted as a true accept rate (TAR) at a fixed false accept rate (FAR). The second type of experiment is an identification experiment, for which performance is quoted as a rank-one recognition rate.

The FRGC v2.0 [25] protocol defines a set of standard verification experiments for face recognition.

For pure 3D face recognition, Experiment 3 is most applicable. In the ROC III experiment, the gallery images come from the fall 2003 semester and the probe entries come from the spring 2004 semester. Thus, gallery images are guaranteed to have been taken before probe images, and there is at least a month of time lapse between the gallery and the probe. The result of this experiment is a Receiver Operating Characteristic (ROC) curve. The second experiment ("All vs. All") employs the entire FRGC v2.0 data set: all images appear in both the gallery and the probe sets. For this setup, we perform both verification and identification experiments. In the identification experiment, we take the earliest image of every participant and use it as the gallery image for that subject; all subsequent images for each subject are used as probes. The result is a Cumulative Match Characteristic (CMC) curve. For the verification experiment, we match all 4007 images in the FRGC v2.0 data set to all 4007 images regardless of expression, which can be summarized in a ROC curve. Unlike the identification experiment, the gallery images in the All vs. All experiment may not have been taken before the probe images.

In order to determine statistical significance for the experiments performed in this dissertation, we use a standard Z-test, which has been used in similar situations by Yan et al. and Yambor et al. [60, 61], who based their use of the statistic on Devore and Peck [62]. However, this test is valid only on the assumption that the binomial distribution converges to a normal distribution. Likewise, the rank-one recognition and verification rate comparisons can be viewed as a binomial distribution problem under the assumption that the probability (p) of a correct match (either correctly identifying or verifying a subject) is constant across all subjects. If p varied significantly, the binomial assumption might be weak;

however, empirical estimation of p from multiple subsets showed p to be reasonably consistent across the subjects. Assuming this, the probability of an incorrect match is (1 - p). It is assumed that if a sufficiently large sample size (N) is used, the binomial distribution will converge to a normal distribution and therefore the population standard deviation can be approximated. For a large enough sample size N, a binomial variable X is approximately distributed as N(Np, sqrt(Npq)). Yan et al. [60] use the assumption that good comparative results are achieved when sqrt(Npq) >= 3; for all of our experiments, sqrt(Npq) >= 3. For identification result comparisons, p̂ is the empirical (sample) estimate of p, the probability that a subject is correctly identified from a gallery set. For verification result comparisons, p̂ is the probability that a probe-to-gallery match results in a true accept at a given false accept rate (i.e., 94.5% at a 0.1% FAR would result in p̂ = 0.945 for that comparison). Verification experiments tend to have a higher N, as each comparison is included in the total number of samples, whereas identification experiments contribute only a single sample for each probe-to-gallery set match. This discrepancy causes experimental verification results to require a much smaller difference than identification results to be statistically significant. Given two results, with sample sizes N_1 and N_2 and percentages of observed correct matches p̂_1 and p̂_2, the test statistic for H_0: p_1 = p_2 using a 0.05 level of significance is

    z = (p̂_1 - p̂_2) / sqrt[ ((N_1 + N_2)/(N_1 N_2)) ((X_1 + X_2)/(N_1 + N_2)) (1 - (X_1 + X_2)/(N_1 + N_2)) ],

where X_1 = p̂_1 N_1 and X_2 = p̂_2 N_2. If z <= 1.64, then it is assumed that there is no statistically significant difference between the pair of results, given their associated sample sizes.
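The sketch below restates this two-proportion test in code form, assuming the pooled-proportion formulation given above; it is provided only as a convenience for checking significance claims, and the example values are drawn from the 3541-probe identification experiment described later.

```python
import math

def two_proportion_z(p1_hat, n1, p2_hat, n2):
    """Z statistic for H0: p1 = p2, given observed rates p1_hat, p2_hat and
    sample sizes n1, n2 (pooled-proportion form)."""
    x1, x2 = p1_hat * n1, p2_hat * n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    return (p1_hat - p2_hat) / se

# Example: two rank-one rates of 0.975 and 0.968 on 3541 probes each give
# |z| of about 1.77 > 1.64, i.e. a significant difference at the 0.05 level
# under the stated assumptions.
```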

TABLE 5.4
Fusion Rule Results

Fusion Rule              Best Rank-One       Committee   Rank-One Recognition
                         Recognition Rate    Size        Rate (All 38 Regions)
Sum Rule                 93.0%
Product Rule             93.9%
Min Rule                 93.5%
Consensus Voting         96.7%
Borda Count (Orig.)      93.2%
Borda Count (Modified)   97.5%               29

To test the performance of each fusion method presented in the previous section, we created an experiment in which the first image of each participant is in the gallery set and all subsequent images are in the probe set. This yields 466 images in the gallery set and 3541 in the probe set, and ensures that every image in the probe set has a correct rank-one match. The rank-one recognition results of the fusion methods discussed in the previous section are listed in Table 5.4. To demonstrate how the number and quality of regions affect the different fusion rules, the rank-one recognition rates of the individual regions (seen in Table 5.3) are initially sorted from best to worst. Each fusion rule is then performed repeatedly until all of the sorted regions have been included. Table 5.4 shows each fusion method, its best rank-one recognition rate achieved, the associated committee size,

and the overall recognition rate when all 38 regions are employed. The sum, product, and min rules each show high recognition with a limited number of regions; however, as the quality (based on the individual rank-one recognition rate) of additional regions decreases, so does the recognition rate. These fusion rules are effective only if the individual regions are pre-screened for quality and contribution to the final match. Unlike the modified BC method, the traditional BC method can be strongly affected by regions that perform poorly. For example, if five regions each report a rank-one match for a given subject, and a sixth region is occluded and reports a 200th rank, the total score is 205, which is likely to result in a mismatch. The modified BC method avoids this problem by rewarding only the top α matches with scores (maximum performance was achieved in our experiments when α = 4). While the peak performance of the CV method is similar to that of the modified BC method, a statistically significant difference in recognition rates still exists; the difference between 97.1% and 96.2% is statistically significant at the 0.05 level. Unlike the modified BC method, the CV method does not reward a region if it is not the best match in the set. This distinction likely accounts for the discrepancy in best reported rank-one recognition rates. Based on the results shown in Table 5.4, we are able to conclude that maximum performance for our REFER algorithm is achieved when using the modified BC fusion method.

Table 5.5 shows the tradeoff between confidence (number of votes for a particular match) and the number of false accepts. These results show the number of false accepts divided by the total number of matches performed in this experiment. The total number of votes for an individual probe-to-gallery match will range from 0 to n, where n is equal to the number of regions that are employed in

TABLE 5.5
All vs. All REFER ROC Values at Increasing Confidence Levels (columns: confidence votes, false accept rate, verification rate)

the fusion technique. The votes are calculated using the CV method, where a vote is awarded if a region match score is below the predetermined region match threshold t_i.

Table 5.6 shows how the number (and quality) of regions included in the fusion affects the rank-one performance when using the modified BC method. The symbol (ns) in Table 5.6 signifies that there is no statistically significant difference between the performance of that experiment and the peak rank-one recognition rate for the table. The results show that although the peak recognition rate is 97.5% when employing 29 regions, there is no statistically significant difference from the score achieved by using 28 regions (97.2%). There is, however, a statistically significant difference between 29 regions (97.5%) and the scores achieved when using 27 regions (96.8%) or fewer. These results allow us to conclude that

TABLE 5.6
Fusion recognition rate as the number of regions included in the committee increases one by one, starting with the highest individual recognition rate (columns: number of regions, region added, rank-one recognition rate; (ns) marks rates with no statistically significant difference from the peak)

maximum performance can be achieved in future experiments by creating a 28-region ensemble.

Figure 5.5. Experiment 1 ROC curve for the All vs. All gallery and probe configuration.

5.3.1 Experiment 1: Verification

Our first experiment is the All vs. All verification experiment employed by others [24, 43, 55]. Using 28 regions and the consensus voting fusion approach of matching multiple 3D regions of the face, our REFER algorithm was able to achieve a verification rate of 93.2% at a FAR of 0.1%, as shown in Table 5.7 and Figure 5.5.

5.3.2 Experiment 2: Identification

Our second experiment is an identification experiment also performed by other authors. Mian et al. [24] and Cook et al. [55] use a gallery comprised of the first

TABLE 5.7
Experiment 1: Verification Results

Paper               Experiment    Stated Performance:             Our Performance:
                                  Verification Rate at 0.1% FAR   Verification Rate at 0.1% FAR
Mian et al. [24]    All vs. All   86.6%                           93.2%
Maurer et al. [43]  All vs. All   87.0%                           93.2%
Cook et al. [55]    All vs. All   92.31%                          93.2%

image of each subject (466) and set the remaining images as probes (3581). Chang et al. [15] report results using a gallery comprised of the first neutral image of each subject and set the remaining images as probes. In this chapter, we reproduce both experiments using the BC fusion method; the results can be seen in Table 5.8 and the corresponding CMC curve in Figure 5.6.

5.3.3 Experiment 3: FRGC ROC III

Our third experiment, described in the FRGC program [25] as Experiment 3, was also used in papers by other research groups [56, 64]. In this verification experiment, the gallery images come from one semester and the probe entries come from the following semester. This ensures that the time sequence between gallery and probe is maintained, and the average time lapse is increased over the All vs. All experiment. Increased time lapse between gallery and probe is generally considered to make the experiment more difficult. Because this experiment was designed to determine the verification rate at 0.1% FAR, we employed the CV fusion method to produce results. Our algorithm was able to achieve a verification rate of 94.8% at a FAR of 0.1%, as shown in Table 5.9.

TABLE 5.8
Experiment 2: Identification Results

Paper               Experiment                                 Stated Performance:          Our Performance:
                                                               Rank-One Recognition Rate    Rank-One Recognition Rate
Mian et al. [24]    Earliest neutral scan as gallery,          96.2%                        98.1%
                    remainder as probe
Chang et al. [63]   Earliest neutral scan as gallery,          91.9%                        98.1%
                    remainder as probe (FRGC Superset)
Cook et al. [55]    Earliest neutral scan as gallery,          94.6%                        98.1%
                    remainder as probe
Cook et al. [55]    Earliest scan (neutral or non-neutral)     92.9%                        97.2%
                    as gallery, remainder as probe

TABLE 5.9
Experiment 3: ROC III Results

Paper                   Experiment                                Stated Performance:             Our Performance:
                                                                  Verification Rate at 0.1% FAR   Verification Rate at 0.1% FAR
Husken et al. [56]      First semester scans as galleries,        86.9%                           94.8%
                        second semester scans as probes
Kakadiaris et al. [64]  First semester scans as galleries,        97.0%                           94.8%
                        second semester scans as probes

Figure 5.6. Experiment 2 CMC curve for the First vs. All gallery and probe configuration.

5.3.4 Experiment 4: Expression Variation

Experiment 4 examines the performance of our algorithm in the presence of non-neutral facial expressions. For this test, we create two gallery sets, one containing the first neutral image of each subject and one containing the first non-neutral image of each subject. Only subjects with at least one neutral and one non-neutral expression are considered for this experiment. Three probe sets are formed in a similar manner: one containing only the remaining neutral images, one containing only the remaining non-neutral images, and one containing all remaining images. This experiment uses the expression classification data provided with the FRGC v2.0 [25] data set to determine the probe and gallery image separation.

Figure 5.7. Experiment 4 CMC curves: (a) Neutral vs. Neutral, (b) Neutral vs. Non-Neutral.

Figure 5.8. Experiment 4 CMC curves: (c) Neutral vs. All, (d) Non-Neutral vs. Neutral.

Figure 5.9. Experiment 4 CMC curves: (e) Non-Neutral vs. Non-Neutral, (f) Non-Neutral vs. All.

TABLE 5.10
Experiment 4: Expression Variation Results

Gallery Set                Probe Set                   Rank-One Recognition Rate
Neutral (369 images)       Neutral (1889 images)       99.2%
Neutral (369 images)       Non-Neutral (1206 images)   96.3%
Neutral (369 images)       All (3095 images)           98.4%
Non-Neutral (369 images)   Neutral (1889 images)       95.9%
Non-Neutral (369 images)   Non-Neutral (1206 images)   95.0%
Non-Neutral (369 images)   All (3095 images)           95.6%

For this experiment, we run a standard identification test (employing the BC fusion method) on each of the six possible scenarios. The results from this experiment are seen in Table 5.10, and their associated CMC curves can be seen in Figures 5.7, 5.8, and 5.9. The results show only a minimal difference in recognition rate between the gallery subsets. Our algorithm is able to perform well when matching a neutral expression probe to a neutral expression gallery; in neutral to neutral matching, there is minimal change in the 3D shape of the face. Our algorithm performed less well when matching gallery and probe across different facial expressions. This is primarily due to the fact that the selected region of the face is not perfectly rigid across facial expressions.

5.3.5 Experiment 5: Missing Data

Finally, Experiment 5 explores how the performance of our algorithm is affected when it is limited to using only certain regions of the face. This experiment uses a

gallery containing the first image of each subject (regardless of expression) and a probe set containing the remaining images. We simulate missing data on the face by manually excluding regions of interest. Table 5.11 lists the regions of the face segmented into different categories based on the location of the probe centroid, along with the rank-one recognition rate achieved for each subset. This experiment employs the BC method for score-based fusion. The best locality performance (95.2%) is found using only the regions located in the center of the face. This recognition rate is similar to the overall performance that resulted when using the best 28 regions. The results also show that both the left only and right only sets are able to perform well (88.0% and 89.2%, respectively) even though they only contain regions on one side of the face. In the future, we plan to further investigate the robustness of the REFER algorithm on non-frontal 3D pose data.

5.3.6 Comparison to results published on the FRGC v2.0 data set

Although Phillips et al. [25] defined a set of standard experiments for FRGC v2.0, researchers are free to define their own experiments and use the FRGC data set. Using the recognition and image refinement methods previously discussed, we replicated the experiments from recent publications. Results can be seen in Tables 5.7, 5.8, and 5.9. Using the standard z-test previously discussed [60], we are able to confirm that there is a statistically significant difference between each pair of results. The results achieved by our algorithm outperform each of the methods presented except for that of Kakadiaris et al. [64].

Kakadiaris et al. [64] present an algorithm for 3D face recognition that uses an annotated face model (AFM) to create a unique representation of a human

TABLE 5.11
Experiment 5: Simulated Missing Data Results

Subset Label          Included Regions                                          Rank-One Recognition Rate
Best 28 Regions       21, 23, 22, 8, 26, 24, 28, 27, 10, 15, 18, 25, 17,
                      34, 19, 2, 35, 3, 31, 33, 32, 4, 1, 6, 16, 5, ...
Left Only             30, 16, 31, 34, 17, ...                                   88.0%
Left + Partial Nose   30, 16, 31, 34, 17, 37, 27, 15, 11, ...
Right Only            29, 20, 32, 33, 19, ...                                   89.2%
Right + Partial Nose  29, 20, 32, 33, 19, 36, 28, 18, 12, ...
Center Only           38, 6, 7, 4, 5, 1, 2, 3, 8, 9, 10, 21, 22, 23, 24,        95.2%
                      25, 26, 35
Non-Center            30, 16, 31, 34, 17, 37, 27, 15, 11, 14, 29, 20, 32,       94.7%
                      33, 19, 36, 28, 18, 12, ...
Eye                   31, 34, 27, 21, 22, 23, 24, 25, 26, ...
Below Eye             30, 16, 15, 11, 14, 37, 1, 2, 3, 4, 5, 6, 7, 38, 18,
                      12, 13, 20, 29, ...

face. This algorithm consists of five steps. First, a preprocessing step constructs a 3D mesh of the geometry of the face from the data. Second, an AFM is fitted to the mesh. Third, a 2D parameterization is used on the AFM to generate a three-channel deformation image encoding the shape information. Fourth, the deformation data is processed with two different wavelet transforms (Pyramid and Haar) to extract a signature of the participant. Finally, the signature is matched to other signatures by using an L1 distance metric (for the Haar wavelet)

and a complex wavelet structural similarity index algorithm [65] (for the Pyramid wavelet). They reported a 97.0% verification rate at a false accept rate of 0.1% on Experiment 3 from the FRGC v2.0 program [25]. These results were achieved by fusing the individual scores from the Pyramid and Haar wavelets. The results of our algorithm are similar to those reported by Kakadiaris et al. [64]. Relative to the Kakadiaris algorithm, we use a simpler approach that does not require an annotated face model. While the results reported by those authors outperform those presented here by a small margin, our algorithm shows the additional potential of being capable of dealing with large holes and missing data in images, which is typical in realistic biometric applications.

Maurer et al. [43] created an algorithm that uses fusion of 2D and 3D face data for multi-modal face recognition. Their algorithm first cleans each mesh, extracts relevant face data, and then performs ICP on the 3D set to generate a distance map between the two aligned meshes, which allows a score to be generated from the results. The 2D component of their algorithm uses the recognition system created by Neven Vision [66] and fuses the results with those of the 3D matcher based on the quality of each match. If the 3D match was very good, then the match is considered correct and the 2D score is not used; if this is not the case, then the results are fused together to return a combined score. They report results on their 3D algorithm, as well as the 2D component's contribution to the 3D performance. They achieved an 87.0% verification rate at a false accept rate of 0.1% using 3D face information, based on the complete 4007 x 4007 matrix of matching scores compiled from the images in the FRGC v2.0 data set.

Husken et al. [56] created an algorithm that operates primarily on 2D face data. Their approach uses a version of hierarchical graph matching (HGM) created

by Wiskott et al. [67]. It is based on creating an elastic graph that holds texture information and the positions of facial landmarks in an image. Distances are then taken between graphs to determine the similarity between models. When using 3D shape data alone, they reported an 86.9% verification rate at a false accept rate of 0.1% on Experiment 3, which is defined in the FRGC v2.0 program [25] to match images across academic semesters.

Mian et al. [24] automatically detect the nose, perform pose correction and normalization in both 2D and 3D, create a rejection classifier to reduce the overall processing time, and finally segment the 3D images into two regions (nose and eye/forehead) and match them independently to increase the overall recognition rate. They report verification rates of 98.5% at 0.1% FAR and rank-one identification rates of 96.2% based on a neutral gallery and a probe set comprising the remaining images, using their R3D algorithm. In addition, the authors report that the eye/forehead and nose regions of the face contain the maximum discriminating features necessary for expression-invariant face recognition.

Cook et al. [55] present a novel method based on Log-Gabor templates for handling expression variation in 3D face recognition. The authors apply 18 Log-Gabor filters on 49 square windows to generate 147 feature vectors comprising 100 dimensions. After matching is performed, they report results using the FRGC v2.0 data set. When the 4007 x 4007 similarity matrix is calculated, they report a 92.31% verification rate at 0.1% FAR. In the identification scenario, the authors employ the first image of each subject in the gallery set (466 images) and the remainder in the probe set (3581), for a rank-one recognition rate of 92.93%. They also discuss how the best performance is achieved when using windows surrounding the upper nose area, while the inclusion of outer areas adversely affects the accuracy

of the system.

Chang et al. [63] perform four experiments on a superset of the FRGC v2.0 Experiment 3 data containing 4,485 total scans, 2,798 neutral image sets (449 subjects), and 1,590 non-neutral expression sets (355 subjects). The authors' first experiment examines the change in rank-one recognition rate when observing a time lapse between gallery and probe images. The second looks at the change in rank-one rate when the images are separated by expression. The third experiment shows how their Adaptive Rigid Multi-Region Selection (ARMS) algorithm performs when presented with an increasing set of data, to demonstrate scalability. Chang et al. also report results [15] on two neutral-gallery experiments on the same data superset. When the probe set contains only neutral expressions, a rank-one recognition rate of 97.1% is achieved; when the probe set is restricted to images containing only non-neutral expressions, the best reported rank-one recognition rate drops to 87.1%. Table 5.10 shows that the REFER algorithm significantly outperforms the results reported by Chang et al., with a 99.2% rank-one recognition rate for a neutral gallery matched to a neutral probe set and a 96.3% rank-one recognition rate for a neutral gallery matched to a non-neutral probe set on the same experiments.

While the concept of independently matching multiple regions mentioned by Chang et al. is similar to the technique presented in this chapter, we present many differences that extend their research in this area. Chang et al. limit their experiments to three nose-based regions on the face. In this chapter, we explore the efficacy of including additional regions (38 total) that cover the entire face rather than simply the center. Chang et al. found that maximum results occurred when only two of the three regions were combined. We believe this is a result of

region saturation. When there is too much overlap on a selected region (the nose in this case), the descriptiveness will begin to drop. When we add a region from an alternate location (the forehead, for example), the addition of independent results boosts the overall performance. Finally, the authors found that using the product rule for fusion provided the highest results. Our results suggest that this method is inferior to the fusion techniques used in this chapter (BC and CV), especially when a large number of regions is used.

5.4 Summary

We have presented the results of an approach to 3D face recognition designed to handle variations in facial expression between gallery and probe images. Our approach automatically finds the nose tip and selects 28 different regions around the face. The algorithm has been evaluated on the FRGC v2.0 data set containing 4007 3D face scans from 466 unique subjects, representing a variety of facial expressions. For the identification experiment, we achieve a rank-one recognition rate of 97.2%. The results of our algorithm, using only the single 3D shape modality, outperform those reported in many other publications [15, 24, 43, 55, 56] on the same data set.

Incomplete facial data and artifacts are still a major issue in realistic biometric experiments. We have performed an extensive study on how individual regions across the face affect the recognizability of a subject, and how manually removing certain regions simulates large amounts of missing data or facial pose variation. In the presence of both obstacles, our REFER algorithm still provides a high level of recognition due to its piece-wise robustness. In addition, we found that variants of the Borda Count and Consensus Voting techniques provide maximum

performance when fusing results from multiple 3D face regions.

CHAPTER 6

ROTATED PROFILE SIGNATURES FOR MULTI-VIEW 3D FACE RECOGNITION

6.1 Introduction

In this chapter, we introduce a new system for 3D face recognition that is robust to facial pose variation. Large degrees of facial pose variation may cause a significant fraction of the features visible in frontal images to be occluded. High accuracy automatic feature and pose detection is performed by a new technique called Rotated Profile Signatures (RPS). Upon the completion of the feature detection phase, we employ the previously developed Region Ensemble for FacE Recognition (REFER) technique for highly accurate image matching. The REFER algorithm has been shown to perform well in the presence of expression variation. We demonstrate how both expression variation and pose variation can be handled using the REFER method if the feature points are accurately detected. Experiments are performed on the largest available database of 3D faces acquired under varying pose. This database contains over 7,300 total images of 406 unique subjects gathered at the University of Notre Dame. Experimental results show that the RPS detection algorithm is capable of performing nose detection with greater than 96.5% accuracy across the pose variation represented in the data set used.

Footnote 1: This section is based on the paper, Rotated Profile Signatures for 3D Face Recognition Robust to Pose Variation, under review.

When the RPS detection results are combined with the REFER algorithm, rank-one identification rates of greater than 94% are achieved for most poses, including profile images where less than half of the face is available for matching.

6.2 Survey

The majority of 3D face databases available to researchers [20, 25, 45] contain primarily frontal images with predominantly neutral facial expressions. Other image scenarios (containing differences in expression, pose, and lighting) must be addressed if this technology is to become successful in unconstrained environments. A biometric application must be efficient (low query times), accurate (high recognition performance), robust (able to compensate for various changes in the images), and automatic (no user intervention required). Many current 3D face recognition algorithms are able to automatically locate fiducial points based on the assumption that they will be provided a frontal image with a neutral expression. For example, one approach to nose detection is to assume that the nose tip is the point closest to the camera [20, 22]. Such a heuristic enables fast detection of the hypothesized nose tip. Under non-frontal pose, however, the heuristic can fail. Figure 6.1 shows examples where this assumption is valid (a, b) and where it is not (c, d). Another method of automatic nose detection uses the curvature information at each point on the face [15, 68-71]. The use of curvature information solves many problems associated with changes in pitch (up/down); however, noise, holes, and changes in yaw (left/right) can be problematic. The Iterative Closest Point (ICP) matching algorithm [8] used in many 3D face recognition systems has been shown to perform poorly when an accurate initial alignment is not provided.

Figure 6.1. Nose detection examples using frontal and non-frontal images (Subject 04385): (a) frontal 2D, (b) frontal 3D, (c) non-frontal 2D, (d) non-frontal 3D. The nose tip detected by the RPS algorithm is signified by the blue sphere, and the nose tip located by the Z-heuristic is signified by the green sphere.

Many times, the initial alignment is provided by automatically detected feature points in the image. This suggests that an application's recognition performance is often limited by the accuracy of its feature detection module.

6.3 Contribution

In this chapter, we demonstrate a system for 3D face recognition robust to non-frontal face poses. It includes an algorithm that is able to locate the nose tip automatically, with a high level of accuracy, in the presence of pose or expression variation and occlusion, using a technique we call Rotated Profile Signatures (RPS). This method rotates the 3D face image through 180° in 5° increments, extracting the rightmost profile points of the image at each step. These profiles are then matched to the nose models described below, resulting in a similarity score. As the nose is rotated into view, the similarity score reaches a minimum, indicating the correct nose location. Once the nose detection is complete, matching is performed using a previously developed 3D face recognition technique that is a modified version of the Region Ensemble for FacE Recognition (REFER) algorithm [53, 72]. This algorithm divides a single 3D face image into 28 overlapping components, which are independently matched to 3D gallery faces using ICP. The Borda Count [19, 54] technique is used to fuse the results of this committee of matchers. The benefit of using the REFER algorithm for this process is that it performs well in the presence of expressions or occlusion [72].

In this section, we describe a new technique for locating the nose tip in 3D face images containing a wide range of poses. The following steps are performed for θ ∈ {0°, 5°, 10°, ..., 180°}, and a total of 37 profile signatures are collected. Assume P_θ is a set of points sorted by y-coordinate p.y, forming ordered tuples (p.x, p.y, p.z). Let P_θ,i be the set of points p = (p.x, p.y, p.z) such that p ∈ P_θ and the y-coordinate of p quantizes to the integer i, with y_min,θ ≤ i ≤ y_max,θ (recall that measurement units are in mm).

Let j be the index of the point in P_θ,i with the largest x-coordinate. The point at index j is stored in S_θ,i, forming the profile signature for the given rotation. The 3D image is then rotated by the increment ∆θ about the vertical axis and the process is repeated until each profile signature has been created. In our experiments, we chose ∆θ = 5° as the rotation increment. We have experimented with larger and smaller increments and found that if the model is rotated more than 15° at a time, small noses may be mislabeled, and if it is rotated less than 5° at a time, the execution process takes longer with the same result. Abbreviated examples of the rotation process are depicted in Figures 6.2, 6.3, and 6.4. The top images show the 3D shape images and the bottom images show the corresponding profile images at each rotation increment.

Once the profile signatures S_θ have been created, each profile can be matched to two different model signatures, as seen in Figure 6.5. These signatures were manually extracted from a single subject image rotated to a known profile position. Model m_1, seen in Figure 6.5(a), is taken from the nose of an image with a 0° pitch, and model m_2, seen in Figure 6.5(b), is taken from the nose of an image with a 45° pitch. The matching process and nose tip determination are performed as follows. For each S_θ, determine the point B_θ,z best matching model m_z, such that

    B_θ,z = arg min_{1 <= i <= |S_θ| - |m_z|} \sum_{j=1}^{|m_z|} || S_{θ,i+j} - m_{z,j} ||.    (6.1)

Figure 6.2. An example of the rotation process given a subject (05078) with 15° yaw and 45° pitch, at rotations of 0°, 15°, 60°, 90°, and 105°. The top images show the 3D models with the extracted profile outlined in red and the bottom images show the associated extracted profiles.

For each θ, if

    | B_{θ,m_1} - B_{θ,m_2} | < δ,    (6.2)

where δ was experimentally determined to be 5mm, then the found nose tip is

    f_θ = ( B_{θ,m_1} + B_{θ,m_2} ) / 2.    (6.3)

Otherwise, f_θ is either B_{θ,m_1} or B_{θ,m_2}. We find β_{θ,m_1} and β_{θ,m_2}, the next local minima in S_θ below B_{θ,m_1} and B_{θ,m_2}, respectively. f_θ selects the point for

Figure 6.3. An example of the rotation process given a subject (05078) with 15° yaw and 0° pitch, at rotations of 0°, 15°, 60°, 90°, and 105°. The top images show the 3D models with the extracted profile outlined in red and the bottom images show the associated extracted profiles.

which the distance between the point and the corresponding local minimum,

    \sqrt{ (B_{θ,m_z}.x - β_{θ,m_z}.x)^2 + (B_{θ,m_z}.y - β_{θ,m_z}.y)^2 },    (6.4)

is minimized. We perform matching in this manner rather than using correlation because of the structure and simplicity of the problem. We assume that the y-coordinates of both the model and the probe are scaled appropriately during the initialization phase, and therefore we do not wish to incur the additional processing time required to calculate correlation values. An example of this process can be seen in Figure 6.6. The images show model m_1 repeatedly matched to selected configurations starting at the top of the face and moving down to the bottom. Figure 6.6(c) shows the configuration resulting

Figure 6.4. An example of the rotation process given a subject (05078) with 15° yaw and -45° pitch, at rotations of 0°, 15°, 60°, 90°, and 105°. The top images show the 3D models with the extracted profile outlined in red and the bottom images show the associated extracted profiles.

in the minimum matching score and is reported as the nose tip. Each distinct value of θ yields a nose tip matching score N_θ. The vector of N_θ values exhibits a global minimum at the correct rotation; however, noise can affect the smoothness of the entries in the vector. We therefore identify the run of four consecutive θ values whose sum of scores is minimal, and select the final rotation as the θ within that run yielding the minimum matching score. The original yaw orientation of the image can be calculated by subtracting the known profile yaw (90°) from the resulting θ value.

An example of the RPS algorithm applied to the image in Figure 6.3 (which contained a 15° rotation to the left) is as follows. Figure 6.7 shows that the selected rotation is r = 105°. This suggests that, in order to rotate the input image to the optimal profile configuration (90° to the right), a 105° rotation must be performed on the input image. This is equivalent to the original image being rotated 105° - 90° = 15° to the left.
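A compact sketch of the RPS nose detection loop is given below. It follows the steps described above (rotate, extract the rightmost profile per 1mm y-bin, slide the model along the profile, and pick the rotation with the smallest run-of-four smoothed score), but the helper names, the use of point-to-point norms as the dissimilarity, and the simple smoothing are illustrative assumptions rather than the exact implementation.

```python
import numpy as np

def profile_signature(points):
    """Rightmost profile of a point cloud: for each 1mm y-bin, keep the point
    with the largest x-coordinate. Returns an array ordered by y."""
    y_bins = np.round(points[:, 1]).astype(int)
    signature = []
    for y in range(y_bins.min(), y_bins.max() + 1):
        in_bin = points[y_bins == y]
        if len(in_bin):
            signature.append(in_bin[np.argmax(in_bin[:, 0])])
    return np.array(signature)

def slide_match(signature, model):
    """Best placement of the model along the signature; returns (index, score)."""
    best_i, best = 0, np.inf
    for i in range(len(signature) - len(model)):
        score = np.sum(np.linalg.norm(signature[i:i + len(model)] - model, axis=1))
        if score < best:
            best_i, best = i, score
    return best_i, best

def rps_detect(points, model, step_deg=5):
    """Scan rotations 0..180 degrees about the vertical (y) axis and return the
    rotation whose run-of-four summed score is smallest."""
    scores = []
    for theta in range(0, 181, step_deg):
        a = np.deg2rad(theta)
        rot = np.array([[np.cos(a), 0, np.sin(a)],
                        [0, 1, 0],
                        [-np.sin(a), 0, np.cos(a)]])
        sig = profile_signature(points @ rot.T)
        scores.append(slide_match(sig, model)[1])
    runs = [sum(scores[i:i + 4]) for i in range(len(scores) - 3)]
    start = int(np.argmin(runs))
    best = start + int(np.argmin(scores[start:start + 4]))
    return best * step_deg, scores
```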

Figure 6.5. The two model signatures: (a) model m_1, extracted from an image with a 0° pitch, and (b) model m_2, extracted from an image with a 45° pitch.

In order to determine how well the RPS algorithm locates the nose tip, each of the 7,317 images was first manually annotated with a ground truth nose point. Upon completion of the RPS process, the automatically detected nose point is produced. The distance between the manually annotated nose tip and the RPS-detected nose tip is calculated; if it is less than 10mm, it is considered a match. The benefit of using the RPS method for nose detection is its indifference to any amount of model yaw, provided that the nose is visible. The results seen in Table 6.1 confirm this statement and show the detection rate at each pose in the NDOff2007 data set, where the numbers along the top represent changes in yaw and the numbers along the side represent changes in pitch. With a 0° pitch, each pose reports detection results of almost 100%, regardless of changes in yaw from 90° to -90°. The high level of detection is consistent for each pose variation except (60°, 60°). We believe that this irregularity is due to the limited amount of 3D point information available on the surfaces of the faces in that extreme pose. Examples of selected misidentified images for this pose set are seen in Figure 6.8. Due to the functionality of the laser scanner, facial points that are obscured due to pose

do not produce valid 3D points, which leaves large portions of the face missing in the resulting data. The combination of large changes in pitch and yaw seen here greatly reduces the accuracy of the RPS algorithm.

Figure 6.6. An example of the matching algorithm as model m_1 is matched to different configurations on the face; panels (a)-(e) show successive point ranges, beginning with points 1-57.

6.4 Experimental Design and Results

The experiments in this section are designed to test the accuracy of the proposed system. The basic REFER matcher (described in Chapter 5) has been shown to perform very well on frontal images and in the presence of expression variation [53, 72]; however, the first experiment shows that when there are significant changes in pitch or yaw, the recognition rate of the basic algorithm drops significantly.

Figure 6.7. The profile scores from matching Figure 6.3 with Model-0.

The second experiment demonstrates the effectiveness of the system when it incorporates the RPS method for pose-invariant nose detection. Finally, we test the system on the FRGC v2.0 data set [25] to allow for comparisons with other organizations.

Each experiment performed with the NDOff2007 database uses the same set of images for the gallery and probe sets. The gallery is comprised of the first neutral-expression, frontal (0°, 0°) image of each subject, for a total of 406 unique images. The probe set is independently matched to the gallery, and the results are reported as the rank-one percentage of correctly identified subjects. The total number of probes per pose can be found in the accompanying table.
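To make the reported numbers concrete, the rank-one rate can be computed from a probe-by-gallery score matrix as in the short sketch below (hypothetical names; it assumes lower scores indicate better matches and one gallery image per subject).

```python
import numpy as np

def rank_one_rate(scores, probe_ids, gallery_ids):
    """scores: (n_probes, n_gallery) distance matrix, lower is better.
    probe_ids / gallery_ids: subject labels for each row / column."""
    best = np.argmin(scores, axis=1)                       # best gallery match per probe
    correct = [probe_ids[i] == gallery_ids[j] for i, j in enumerate(best)]
    return 100.0 * np.mean(correct)                        # rank-one rate in percent
```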

TABLE 6.1. Rotated Profile Signatures nose detection results (in %) by pose (in degrees). Columns give yaw; rows give pitch.

6.4.1 Experiment 1: Baseline REFER Performance

In the first experiment, we show the performance when the previously developed nose identification algorithm described in [53] is used on the NDOff2007 data set. It locates the nose tip in frontal poses by using a combination of three different techniques: the shape index (curvature), the highest z coordinate, and fitting a model (in which the nose tip is known) to the image using the ICP algorithm. As expected, the results near the center of the table show a high level of correct nose detection and thus correct matching. The original REFER algorithm was able to accurately locate the nose tip in approximately frontal views, and the ensemble voting scheme provided effective recognition results. As the pose rotates past the

45° and -45° positions (in both pitch and yaw), the recognition rates begin to fall quickly. This is due mainly to inaccurate nose tip detection and the inability of ICP to recover from a bad pose estimate. Frequently, in these profile poses, the ear or chin is mistaken for the nose tip. This experiment allows us to conclude that if the supplied images are at least semi-frontal, the REFER algorithm is not greatly hindered by the small levels of occlusion present in the images.

TABLE 6.2. Baseline REFER rank-one performance (in %) by pose. Columns give yaw; rows give pitch.

6.4.2 Experiment 2: RPS + REFER

In the next experiment, we demonstrate and exploit the high nose detection accuracy provided by the RPS algorithm. This experiment provides an accurate

portrayal of the occlusion invariance provided by the REFER algorithm. If presented with large amounts of image occlusion (as in the case of (90°, 0°), (60°, 0°), (45°, 0°), etc., due to the scanner's inability to see the far side of the face), the algorithm will exclude from the ensemble any region that is not in view, which allows the performance to degrade gracefully. As expected, with accurate nose detection, the matching performance on images of faces with non-frontal poses is vastly improved. The largest improvement for this data set is in the profile pose (-90°, 0°). The baseline REFER algorithm was only able to achieve a 3.4% rank-one recognition rate at this pose. When the RPS nose detection algorithm is applied, the rank-one recognition rate for this profile pose set (in which only half of the face contains valid 3D points) increases to 94.0%. The only pose that still had a low recognition rate was (60°, 60°); it marks a point at which the accuracy of the RPS algorithm begins to degrade. We attribute the low recognition rate at this pose to the limited amount of valid 3D data available on the surface of the face. These results allow us to conclude that the REFER algorithm can perform well in the presence of large amounts of occlusion; however, it still requires accurate feature detection.

6.4.3 Experiment 3: RPS + REFER on FRGC v2.0 Data Set

In the final experiment, we demonstrate the effectiveness of the REFER algorithm using RPS feature detection on the widely published [15, 24, 43, 55, 56, 64] FRGC v2.0 data set [25]. The data set consists of 4007 frontal 3D face images containing a variety of expressions. The original REFER nose detection algorithm [53, 72] described in the previous section achieved a 98.2% (3935/4007) correct detection rate. The RPS algorithm applied to the same data set results

in a statistically significant improvement in detection rate, to 3995/4007 = 99.7%. This demonstrates that RPS is also capable of expression-invariant feature detection. Examples of images in which the nose was incorrectly identified can be seen in Figure 6.9. Each of these images is misclassified because certain features of the face (mustache, lips, open mouth, etc.) have nose-like properties that cause the algorithm to fail.

TABLE 6.3. RPS + REFER rank-one performance (in %) by pose. Columns give yaw; rows give pitch.

Once the noses have been successfully identified, we perform an identification experiment using the REFER matching algorithm. The experimental gallery is comprised of the first image acquired of each subject in the data set (466), and the remaining images are set as probes (3581). By increasing the accuracy of the feature detection step with RPS, the REFER algorithm reports a 97.6%

rank-one recognition rate, compared to the original 97.2% reported in [72]. We attribute the relatively small increase in recognition performance to the 28-region ensemble matching employed by the REFER algorithm. This experiment allows us to conclude that, when using REFER on frontal images, incorrect feature detection can still result in correct identification.

6.5 Summary

In this chapter we have demonstrated an accurate and efficient method for 3D nose detection, pose categorization, and face recognition. On a dual-processor 2.4 GHz Xeon machine, nose detection can be performed in under one second per image after image preprocessing. This processing time can be decreased even further if a multi-resolution approach is employed. We have demonstrated the increased robustness of the REFER algorithm [72] in the presence of occlusion due to pose or sensor noise. Rank-one recognition rates greater than 94.0% are achieved at almost every pose, including profile images (where only half of the face is available). The experiments in this chapter are performed on the largest data set to date with various degrees of pose variation, consisting of over 7,300 3D face images.

Figure 6.8. Example images where the nose tip was incorrectly identified. The top row (a)-(c) shows the original (60°, 60°) images and the bottom row (d)-(f) shows the corresponding manually pose-corrected images. The nose tip detected by the RPS algorithm is indicated by the blue sphere.

Figure 6.9. Selected images where the nose was incorrectly identified by the RPS algorithm in the FRGC v2.0 data set. The incorrect nose location is represented by the blue circle on the face.

CHAPTER 7

3D FACE RECOGNITION SCALABILITY

1. Part of this section is based on the paper "Challenges in Executing Data Intensive Biometric Workloads on a Desktop Grid," Workshop on Large-Scale and Volatile Desktop Grids (PCGRID).

The ICP algorithm is often used in 3D face recognition [43], [17], [15]; however, it is computationally expensive. In realistic biometric identification applications, large data sets will be common and results are often needed in real time. As a contribution toward feasible large-scale 3D face recognition, this chapter presents two methods to decrease the processing time for 3D face identification without a significant loss in the overall recognition rate. The first method is a pre-matching indexing step designed to reduce the overall number of matches that the algorithm is required to perform. The second method demonstrates the effectiveness and constraints of performing large-scale identification experiments using a desktop grid architecture [73].

7.1 Improved Image Matching through Indexing

In Chapter 3, we performed ICP alignment on the input image as part of the initial preprocessing step. This produced a rough estimate of the nose location. However, because the entire input image was used (as the location of the nose tip was unknown at the time), the alignment produced is not sufficient for Principal

Component Analysis (PCA), which is sensitive to fluctuations in pose. Now that the location of the nose tip, b_n, is known for each image, we are able to refine the pose alignment. The first step is to translate the image so that the nose tip b_n lies at the origin (x, y, z) = (0, 0, 0). This ensures that the 2D scaling step for PCA may be eliminated, as the location and size of the nose are known in 3D.

Figure 7.1. Rough 3D box-cropped nose images for three different subjects (04201d442, 04237d151, 04302d144).

Next, we crop a box around the nose, seen in Figure 7.1, to perform precise ICP alignment. We do this by creating four cropping planes based on experimentally determined offsets from the nose tip. If the nose tip point is (x, y), the cropped region is (x ± 20mm, y ± 25mm). By cropping a box around the nose, we ensure that when matching each image to a predefined model (Figure 7.2), only the points on the nose affect the final rotation and translation. We do not use this particular cropped image as the final image due to possible pose problems that exist before the ICP iteration. Our cropping function uses cropping planes parallel to the z-axis. If two images of the same subject are cropped before being precisely aligned using ICP, distinctly different images will result, as can be seen in Figure 7.3.
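As an illustration of this cropping step, the following sketch (Python/NumPy; the function and parameter names are hypothetical, and the scan is assumed to be an N×3 array of points in millimeters) translates the nose tip to the origin and keeps only the points inside the ±20mm / ±25mm box before ICP refinement.

```python
import numpy as np

def crop_nose_box(points, nose_tip, dx=20.0, dy=25.0):
    """Translate the scan so the nose tip is at the origin, then keep the
    points within +/- dx mm in x and +/- dy mm in y (all z values kept,
    matching cropping planes parallel to the z-axis)."""
    pts = points - np.asarray(nose_tip)              # nose tip -> (0, 0, 0)
    keep = (np.abs(pts[:, 0]) <= dx) & (np.abs(pts[:, 1]) <= dy)
    return pts[keep]
```

The cropped nose points would then be registered to the reference model (Figure 7.2) with ICP, and the resulting transformation applied to the full preprocessed scan before the final crop, as described next.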

Figure 7.2. Reference model used for ICP alignment: (a) front view; (b) profile view.

To address this issue, we save the final transformation matrix found by the ICP algorithm and apply this matrix to the original preprocessed image before performing the final cropping function. At this point, a scaled and aligned 2D range image is created from each 3D image by scaling each Z value to a fixed grayscale range. An example of this can be seen in Figure 7.4.
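A rough sketch of the range-image step follows (Python/NumPy; the image dimensions, 1 mm pixel spacing, 0-255 grayscale range, and nearest-surface accumulation rule are all assumptions, since only the general procedure is described above).

```python
import numpy as np

def to_range_image(points, width=130, height=150, pixel_mm=1.0):
    """Rasterize an aligned, nose-centered point cloud (N x 3, mm) into a 2D
    range image: each pixel keeps the largest Z value falling into it, and
    the valid pixels are rescaled to 0-255."""
    depth = np.full((height, width), -np.inf, dtype=np.float32)
    cols = np.round(points[:, 0] / pixel_mm + width / 2).astype(int)
    rows = np.round(points[:, 1] / pixel_mm + height / 2).astype(int)
    ok = (cols >= 0) & (cols < width) & (rows >= 0) & (rows < height)
    for r, c, z in zip(rows[ok], cols[ok], points[ok, 2]):
        depth[r, c] = max(depth[r, c], z)
    image = np.zeros((height, width), dtype=np.uint8)   # background stays 0
    valid = np.isfinite(depth)
    if valid.any():
        zmin, zmax = depth[valid].min(), depth[valid].max()
        image[valid] = ((depth[valid] - zmin) / max(zmax - zmin, 1e-6) * 255).astype(np.uint8)
    return image
```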

Figure 7.3. Examples of different intensity images for the same subjects when cropped before alignment: (a,b) Subject 1; (c,d) Subject 2; (e,f) Subject 3.

7.2 3D Face Indexing Methods

The efficiency of pattern recognition is crucial when there is a large number of classes to be identified and when the scale of recognition is particularly large [74]. This extends to the problem of face recognition. In 3D face recognition, the identification scenario is particularly applicable: a user is scanned by the system, and their image is compared with the entire gallery in hopes of finding a match. A rejector [74], a generalized form of a classifier, rapidly prunes a large segment of the total data set and allows for more efficient recognition on a smaller set of data. We describe an identification experiment that employs 410 gallery images (subjects from the FRGC v2.0 data set with 2 or more images) and 3541 probe images. A brute-force approach would perform 410 × 3541 = 1,451,810 total matches. In this section, we present a low-cost rejector for reducing the processing time for 3D face recognition.

Figure 7.4. Example intensity images for three different subjects: (a,b) Subject 1; (c,d) Subject 2; (e,f) Subject 3.

7.2.1 PCA Subspace Projection

Once a 2D range image is created for each nose image, contrast equalization is applied. The images have already been properly scaled and aligned in the previous step. A 130 x 150 template centered at the tip of the nose is used to mask out unnecessary information. Results of this procedure can be seen in Figure 7.5. Next, a "nose space" is created from a training set consisting of the first image of each of the 466 subjects, retaining 40% of the eigenvectors and resulting in a 188-dimensional subspace (the CSU [75] face identification software was used to create the nose space). Finally, we project each of the 4007 images into the subspace and obtain a 188-dimensional vector for each image.
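The projection step can be sketched with an off-the-shelf PCA implementation; the dissertation used the CSU software, so the scikit-learn call below is only a stand-in, and the variable names are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA

def build_nose_space(train, all_imgs, keep=0.40):
    """train: (466, 130*150) flattened training range images, one per subject.
    all_imgs: (4007, 130*150) flattened range images to be indexed.
    Retains roughly 40% of the eigenvectors (the text reports 188 dimensions)
    and returns the fitted model plus one low-dimensional vector per image."""
    n_components = max(1, int(keep * train.shape[0]))
    pca = PCA(n_components=n_components).fit(train)
    return pca, pca.transform(all_imgs)
```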

Figure 7.5. Normalized, PCA-preprocessed PGM images of three different subjects: (a,b) Subject 1; (c,d) Subject 2; (e,f) Subject 3.

Through experimentation, we have found that retaining 40% of the eigenvectors is the smallest percentage that does not result in a significant loss in performance. Figure 7.6 shows a scatter plot of all 3541 probe images with only the first two dimensions (the most significant values) displayed. The highlighted points represent images of the same subject. The tight grouping of these points suggests the feasibility of a k-Nearest Neighbor approach.

7.2.2 Geometric Distance Measurements

The next experiment was based on the hypothesis that geometric distance measurements on the face remain constant in the presence of expressions. Using the points that were automatically found in Chapter 3 (eye_l,

eye_r, and nose_t (b_n)), we compute the distances and angles shown in Figure 7.7. Combining these measurements yields a 6-dimensional vector for each image. While computing these measurements, we were also able to validate the points found by the feature detection algorithm. Due to the geometric symmetry displayed by most faces, we determined that if the distances involving eye_l and eye_r differed by more than 9mm, the detected points were incorrect, and we mark the image as unknown. Putting an image into the unknown category forces it to be matched against all items in the gallery rather than only those in its cluster. While this decreases the percentage of matches saved, it increases the recognition rate of our system.

Figure 7.6. Visualization of 3541 images projected into the PCA subspace, with select individuals highlighted.
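The sketch below illustrates one way to assemble such a 6-dimensional vector (Python/NumPy; the exact pairing of the three distances and three angles is an assumption, since the original measurements are defined only by Figure 7.7): it takes the pairwise distances among eye_l, eye_r, and the nose tip, the three interior angles via the law of cosines, and the symmetry check used to flag unknown images.

```python
import numpy as np

def geometric_features(eye_l, eye_r, nose, tol_mm=9.0):
    """Return a 6-D vector (3 distances + 3 angles) for the detected landmarks,
    plus a flag marking the image 'unknown' when the two nose-to-eye
    distances differ by more than tol_mm (symmetry check)."""
    eye_l, eye_r, nose = map(np.asarray, (eye_l, eye_r, nose))
    d_le = np.linalg.norm(nose - eye_l)       # nose to left eye
    d_re = np.linalg.norm(nose - eye_r)       # nose to right eye
    d_ee = np.linalg.norm(eye_l - eye_r)      # eye to eye

    def angle(a, b, c):                        # angle opposite side c (law of cosines)
        return np.degrees(np.arccos(np.clip((a * a + b * b - c * c) / (2 * a * b), -1.0, 1.0)))

    feats = np.array([d_le, d_re, d_ee,
                      angle(d_le, d_re, d_ee),   # angle at the nose tip
                      angle(d_le, d_ee, d_re),   # angle at the left eye
                      angle(d_re, d_ee, d_le)])  # angle at the right eye
    unknown = abs(d_le - d_re) > tol_mm
    return feats, unknown
```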

Figure 7.7. An image showing the automatically located features used for geometric distance indexing.

cos θ = (l_AB² + l_BC² − l_AC²) / (2 · l_AB · l_BC)

7.2.3 k-means Clustering

Once a reduced eigenvector representation for each nose has been established, clustering was performed on the set of 188-dimensional vectors via k-means [36] using a Euclidean distance metric. This algorithm initially seeds k centroids c_1, c_2, ..., c_k by random selection from the set of vectors and labels all n vectors,

where L_i = j such that dist(v_i, c_j) ≤ dist(v_i, c_m) for every centroid c_m in C. Based on this initial assignment, each centroid is recomputed as c_j = avg(V_j), where V_j is the set of vectors with label j. This process of cluster assignment and centroid recalculation is repeated until convergence or until a user-defined number of iterations is reached. As this algorithm is sensitive to the initial seeding and to the number of clusters k, a range of experiments was performed to survey the meaningful parameter space. Values of k from 3 to 15 were tested, with numerous repetitions of each to vary the initial seeding. As the centroids converged in relatively few iterations, no iteration cut-off was set.

7.2.4 k-Nearest Neighbor Analysis

We present the idea of k-nearest neighbor (k-NN) indexing by projecting each point in the gallery into the 188-dimensional search space. Algorithmically, this is performed by inserting each point into a modified version of a k-d tree [10]. The basic k-d tree data structure allows for tree traversal (nearest neighbor searches) in O(log(n)) time. We modify the basic k-d tree to take advantage of the fact that the entire gallery can be processed a priori: for each gallery node, we precompute its list of nearest neighbors. This modification allows O(log(n)) time for the traversal to the nearest gallery node and O(1) time for retrieval of that node's list of nearest neighbors for each incoming probe lookup. Once a probe image is presented to the k-d tree, a search is performed to reduce the number of matches to only the k nearest neighbors of the given point. When a point is projected into the subspace, additional points from the same subject should be closer to the given point than points from other subjects. Once this classification is performed, only the images within the k nearest neighbors are matched using the ICP algorithm.
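A rough sketch of this index (Python; scipy's k-d tree stands in for the modified structure described above, and the class and parameter names are hypothetical):

```python
import numpy as np
from scipy.spatial import cKDTree

class GalleryIndex:
    """Gallery pruning with a k-d tree over the 188-D PCA vectors.
    Neighbor lists for every gallery vector are precomputed, so a probe
    lookup is one tree query plus a table read."""
    def __init__(self, gallery_vecs, k=50):
        self.tree = cKDTree(gallery_vecs)
        # k + 1 because each gallery point is its own nearest neighbor
        _, nbrs = self.tree.query(gallery_vecs, k=k + 1)
        self.neighbor_lists = nbrs[:, 1:]

    def candidates(self, probe_vec):
        """Return gallery indices to pass on to the (expensive) ICP matcher."""
        _, nearest_gallery = self.tree.query(probe_vec, k=1)
        return self.neighbor_lists[nearest_gallery]
```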

The other images are instantly discarded as incorrect matches. This approach allows the user to specify what percentage of the total number of matches they wish to perform. As the number of neighbors increases, the recognition rate increases as well. Another benefit of indexing the data set is that, in certain situations, indexing can actually increase the recognition rate by removing images that would have been falsely accepted before they are presented to the matching algorithm. This rejection classifier is significantly more efficient than processing each image using ICP. After the images are initially preprocessed and the k-d tree is built, the entire gallery (410 images) can be classified for a single probe in 0.02 seconds. The results of this algorithm can be seen in the following sections.

7.3 Experimental Setup and Results

In Figure 7.9 we show how applying our k-NN indexing algorithm affects the CMC identification results. As k increases, so does the rank-one recognition rate. From this graph, we can reduce the search space by 50% with no loss in recognition performance (94.9% rank-one recognition). With only a 1% loss in recognition, we can eliminate 66% of the original matches. Of the experiments performed in this chapter, this approach yields the best ratio of matches saved to identification rate. Experimental results from the geometric distance measurement method and k-means clustering were determined to be of significantly lower quality than those of the k-NN indexing algorithm. The problem with the geometric distance measurement method is that it requires very accurate feature points. When these points are

automatically determined, the variance between images of the same subject is high. The problem with the k-means clustering algorithm is seen in Figure 7.6: there is no clear separation between clusters of the various nose types. While we are able to create different categories, the individuals on the boundaries continue to be problematic.

Figure 7.8. CMC all-vs-all performance with no indexing.

7.4 Reduced Processing Time using a Desktop Grid Architecture

As we have shown in previous sections, 3D image matching and processing requires a very large amount of processing power. As data sets continue to grow in size and quality, the required processing time will increase greatly. A single machine is not able to effectively match a single subject to a gallery of 4,000 images (let alone 100,000). For example, our REFER algorithm, when the data is

not subsampled, can perform a single probe-to-gallery match in roughly 3 seconds on a 2.4 GHz Pentium 4 processor. In a realistic biometric scenario, subjects will pass through a checkpoint where their images are acquired. Each subject will then be matched against a watch list containing a significant number of subjects (currently the US has a terrorism watch list of nearly 325,000 individuals [76]). In our single-computer scenario, matching an individual subject to the entire gallery of 325,000 individuals would take roughly 52 days to process, and this is only for a single individual. Roughly 50,000 passengers pass through O'Hare airport in Chicago each day. This scenario suggests two solutions for a realistic biometric application. One option is to decrease the amount of time required per match or to screen out a large number of matches during preprocessing (as presented in the previous section).

Figure 7.9. Rank-one recognition rate as the number of nearest neighbors increases.

The second option is to vastly increase the number of computers available for processing the data. Assuming thousands of computers were allocated to the task of processing this biometric data, the question becomes how to coordinate and deploy the data in a manner that maximizes throughput. For this study, we performed a verification experiment on the FRGC v2.0 data set using the REFER method. Each of the 4007 3D images uses approximately 500KB of disk space, yielding a total database size of roughly 2GB. After matching is complete, a 4007 x 4007 similarity matrix containing the distance from each image to every other image is generated. Through experimental study, we have found that the biggest problem facing desktop grid processing for biometric research is how to distribute many terabytes (TB) of data to each of the grid nodes in an efficient manner. We consider four methods to attack this problem, which vary in terms of data partitioning and access.

Single Package Single Server (SPSS). In a desktop grid, not all workstations have the storage resources needed for all workloads. Even smaller data sets often consume too many resources to allocate on every grid host long-term. The simplest solution is to ship data directly to each host as it is chosen from the queue for computation. This naïve approach sends each available machine the packaged experiment (a single file archive of all the gallery images, a probe image, the executable, and the required library files) with instructions as to which image to process. On completion, the machine sends the results back to the source. This approach has limitations. A single server may not be able to support the continuous bandwidth needed by a workload. Even if it can support the aggregate, a batch system may start many jobs at once, resulting in simultaneous requests for terabytes of data over thousands of connections. This burstiness slows the

response to all of the requests, due to finite memory and bandwidth on the data server.

Single Package Multiple Server (SPMS). To solve the problem of efficiently responding to a large batch of simultaneous matches, we can rely on pre-staging data at various points across the network. Having a sufficiently large number of pre-staged data servers ensures that data transfer can occur when a job is fetched from the queue by a grid node, without overburdening the submitting machine with the large overhead requirements. Once the data is transferred to our pre-staged data servers, the same matching process is completed: transferring the packaged experiment to the local machine, unpacking it, and then completing the matching for each of the images in the gallery. On completion, the result is sent back to the original submitting machine. If a single machine is both client and server, the server should always choose its local copy rather than going to a random server on the network. Though this method solves the problem of overloading the central server, it does not address the fundamental issue of the storage requirement. We must account for the case that many grid nodes will have less disk space than is consumed by the data set. Requiring the entire data set to fit on a grid node results in poor utilization of the grid pool.

Single File Multiple Server (SFMS). Instead of sending the entire data set in advance, which requires it to fit on the grid node, we can send each image individually as needed. This approach is similar to a conventional distributed file system in that it performs accesses at the file level (as opposed to the data set level), but it differs in that it creates a copy of the remote data on the local grid node (like our previous methods) rather than conducting exclusively remote

operations. This is a palatable solution for read-only data. The pre-staging still solves the initial throughput issue raised against the naïve approach, and the image-by-image on-demand access allows us to restrict grid node space usage to that necessary for the current computation. This model over-corrects from the last observation, however. Whereas all-at-once access is too restrictive from the standpoint of grid node space requirements (resulting in low resource utilization), the problem with this approach is one of performance. The entirety of the gallery will eventually (by the end of the job) be compared to the image, but this model does not make use of that fact. It requires a new connection to download each gallery image to the grid node individually. The number of network connections scales linearly with the number of comparison images in the gallery, increasing overhead relative to the previous models. A variant of this method that makes use of prefetching is feasible: future images are buffered while the current computation is occurring, storing only a fraction of the total gallery at any one time (but more than the two-image requirement of the non-prefetching case described above). Adding prefetching, however, still does not alleviate the basic problem of connections scaling with comparisons (for either the computing node or the data server). It also does not account for the detriment to the computation from sharing CPU resources with the prefetching process.

Multiple Package Multiple Server (MPMS). Instead of creating a single file containing the entire data set, it may be broken down into pieces that can fit onto the large majority of the grid nodes. If we use the same technique as in SPMS, but on chunks of the data set instead of the entire package, we can balance the low overhead of SPMS with the ability to harness many more nodes. An additional advantage of MPMS is a minimum of repeated work. With SPMS, a single failure requires resubmitting the entire set of comparisons; here, only the specific comparison that failed must be resubmitted.
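A minimal sketch of MPMS-style partitioning (Python; the packaging format is not specified in the text, so the function name, greedy grouping strategy, and 50MB size budget are assumptions) that splits the gallery file list into packages small enough to fit on most grid nodes:

```python
import os

def make_packages(image_paths, max_bytes=50 * 1024 * 1024):
    """Greedily group gallery files into packages no larger than max_bytes,
    so each package can be pre-staged on servers and fetched by nodes
    with limited local disk."""
    packages, current, size = [], [], 0
    for path in image_paths:
        nbytes = os.path.getsize(path)
        if current and size + nbytes > max_bytes:
            packages.append(current)
            current, size = [], 0
        current.append(path)
        size += nbytes
    if current:
        packages.append(current)
    return packages
```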

On the other hand, MPMS is more complicated and requires network connections between many hosts in the system.

Network Feasibility and Scale

We hypothesize that the various methods will result in different levels of I/O throughput, as measured by completed tasks over a given time period, and we want to find whether any single method will deliver the highest throughput under increasing loads. The metric of how many tasks can be completed within a time limit is used for two reasons. The first is that grid operations have timeouts to protect against hosts that go offline or otherwise become dysfunctional. The second is to give a sense of how a method performs under bursty traffic. If a server cannot deliver files in parallel in a timely manner, it leaves CPUs unutilized and is at risk of falling progressively farther behind within a burst. To determine this, we measured the sustainable throughput of each method by completing that method's I/O pattern over a variable number of clients. Each of the multiple-server methods served data from five servers. For SPSS and SPMS, each host read a series of 500MB files; for MPMS, each host read a series of 50MB files; and for SFMS, each host read a series of 250KB files. The number of completed tasks can be multiplied by the file size and divided over the time period to obtain throughput. Figure 7.10 shows the I/O capacity provided by our four methods running on the production grid. The fifth plot is the MPMS method with the replica selection algorithm adjusted from a random server (the default choice for MPMS) to one that recognizes cluster locality and picks the appropriate intra-cluster replica.
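The throughput metric used in these measurements is simply completed transfers multiplied by file size over the measurement window; a small helper (hypothetical names) makes this explicit.

```python
def throughput_mb_per_s(completed_tasks, file_size_mb, elapsed_s):
    """Aggregate I/O throughput: completed transfers times file size,
    divided by the measurement window."""
    return completed_tasks * file_size_mb / elapsed_s

# e.g. 120 completed 50 MB MPMS reads in a 30 s window -> 200 MB/s
print(throughput_mb_per_s(120, 50, 30))
```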

Figure 7.10. I/O scalability of data partitioning and access.

The inherent limitation of a single-server setup (SPSS) is evident. Its peak throughput is lower than that of the multiple-server version (SPMS), at just under 100MB/s, and it can maintain this only up to approximately a load factor of 25, after which it drops off to less than 5MB/s. SPMS reaches a peak throughput of over 120MB/s around a load factor of 35, and can sustain a throughput of 80MB/s beyond a load factor of 80. MPMS reaches a peak throughput above 190MB/s, which does not occur until a load factor between 75 and 80. Utilizing cluster locality, we can push the peak throughput of MPMS beyond 230MB/s. We reach the point at which both of these MPMS curves have levelled off, but we have not reached the point at which they begin to drop off significantly. The SFMS method peaks at only 13MB/s, around a load factor of 55, and maintains up to half of that peak through a load factor of 75. The data demonstrate that MPMS scales to both the highest peak throughput and the highest load factor before levelling off. If greater

throughput is needed, more servers can be added. In terms of peak throughput, SPMS fails to reach even double the single server's value, and it reaches its peak at a load factor within a factor of two of the single server's. The early precipitous decline of SPSS and the early levelling of SPMS are likely due to the larger file size: the large file cannot be transferred completely to each node within the time limit when under heavy load. The SFMS method cannot serve sufficiently many files to supply an entire data set to a large number of nodes on demand. This is due to an implementation detail: the job execution script creates a new TCP connection to fetch each file. If a single TCP connection could be re-used, a single server could support a large number of operations; the right graph in Figure 4 shows that a single server can serve 25,000 RPC/s under heavy load. However, re-using the same connection multiple times within a script is not trivial with conventional software technologies.

7.5 Summary

In this chapter, we have demonstrated the utility of the k-Nearest Neighbor search as a pre-matching indexing step in 3D face recognition. Our work has shown that k-Nearest Neighbor search implemented with a k-d tree data structure is an effective method for vastly reducing the total processing time required for subject identification. This indexing scheme is not algorithm-dependent and can be coupled with other recognition modalities. A larger and more diverse data set could be used to further validate this method. We are currently evaluating other methods of indexing and hope to fuse these methods for even greater discrimination.

If desktop grids are to be used to decrease the amount of time required for

processing large data sets, we have reached three conclusions about I/O access in desktop grids:

Single servers do not scale. A single server cannot provide sufficient throughput over its network link to sustain a large desktop grid. Even though a single high-performance server can accommodate a large load with its storage and memory, it is limited by the bandwidth of its network connection. Further, multiple small servers are more cost-effective.

A single file per connection causes significant overhead. While accessing files one by one allows for minimal space requirements on grid nodes, it comes at the cost of the large overhead of creating a separate connection for each file. To manage data efficiently, even for relatively large files, the system must maintain connection state wherever possible.

Cluster locality improves scalability. A node within a fast cluster suffers moderately when it must fetch data from a slow cluster. Choosing a replica that is many network hops away, even over a fast connection, introduces unnecessary latency. Thus, we should utilize intra-cluster replicas when possible to maximize the performance of the fast networks.

In summary, we have determined that the MPMS distribution model is the most efficient method for scalable biometric computing. Using desktop grids for biometric workloads is still a relatively new problem, and advances are being made to simplify the submission process as well as to increase the productivity of the system.

CHAPTER 8

USING A MULTI-INSTANCE ENROLLMENT REPRESENTATION TO IMPROVE 3D FACE RECOGNITION

1. This section is based on the paper "Using a Multi-Instance Enrollment Representation to Improve 3D Face Recognition," submitted to Computer Vision and Image Understanding.

8.1 Introduction

Many 3D face recognition experiments make explicit or implicit assumptions about the type of facial expression expected in the gallery image. A neutral expression is often used because it is generally accepted that the subject will be cooperative in the enrollment phase. However, a neutral expression may not match a non-neutral expression from the same subject as well as it matches neutral-expression images of other subjects [43]. The Iterative Closest Point (ICP) registration algorithm [8] is often used in 3D face recognition [15, 17, 24, 43, 53, 77]. ICP iteratively attempts to align the probe to the gallery by estimating a rigid transformation. Non-neutral expressions or occlusion in the probe or gallery images can result in a poor match. One of our hypotheses is that, by using multiple 3D images to enroll a subject and varying the expressions among the gallery images, we will be able to achieve a better overall recognition rate.

Component-based recognition works by splitting a probe image into many small pieces, which are independently matched to complete images present in the

gallery. The scores from the component matching results are fused, and a final decision is made. Component-based approaches have been explored in both 2D [27, 28] and 3D [15, 53, 72] face recognition and found to improve performance. In particular, component-based matching has been proposed as a method of dealing with expression variation between the enrolled image and the image to be recognized [72]. We compare the multi-instance enrollment scheme explored here against state-of-the-art component-based approaches [15, 53, 72]. Multi-instance enrollment has been found to improve performance in both 2D and 3D face recognition [22, 78], but it has not previously been evaluated as a means of dealing with expression variation.

This chapter uses the ND-2006 3D face data set [45], containing 13,450 total 3D images and their corresponding 2D images, to explore multi-instance enrollment as a means of dealing with expression variation in 3D face recognition. The performance of the multi-instance enrollment approach is evaluated against a state-of-the-art component-based recognition algorithm [53, 72] and found to be superior.

8.2 Multi-Gallery Data Preprocessing

The data preprocessing performed in this chapter employs the second method described in Chapter 3. Individual images are cropped differently based on their role in each experimental setup. If the image is being used to enroll a person in the gallery, a sphere of 100mm radius centered at the nose tip is extracted. If the image is a probe for our multiple-gallery experiment, a sphere of 40mm radius centered at the nose tip is extracted. Examples of gallery and probe images can be seen in Figure 8.1. A significantly smaller surface is used for the probe than for the gallery to ensure that the probe image will always be a subset of the

gallery image. Another benefit of using a smaller probe region is that the area subject to variation across expressions (i.e., the cheek and nose bridge) is limited.

Figure 8.1. Gallery and probe images used in the multiple-instance experiment: (a) cropped gallery image (100mm); (b) cropped probe image (40mm).

For the single neutral-gallery experiment, a modified version of the REFER method (described in Chapter 5), employing an ensemble of 28 overlapping probe regions on the face, is used. The REFER method uses the same gallery image preprocessing as the multiple-gallery experiment (100mm cropped from the nose tip). The REFER algorithm exploits the presence of subregions on the face that are relatively expression-invariant and uses a committee of classifiers based on these regions to improve performance.
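A minimal sketch of the two cropping radii (Python/NumPy; names are hypothetical, and the scan is assumed to be an N×3 array of points in millimeters with the nose tip already located):

```python
import numpy as np

def crop_sphere(points, nose_tip, radius_mm):
    """Keep only the points within radius_mm of the nose tip."""
    d = np.linalg.norm(points - np.asarray(nose_tip), axis=1)
    return points[d <= radius_mm]

# Example with a synthetic scan: gallery images keep a 100 mm region,
# probes keep a 40 mm region so the probe is always a subset of the gallery.
scan_points = np.random.uniform(-120, 120, size=(50000, 3))  # stand-in for a real scan (mm)
nose_tip = np.array([0.0, 0.0, 0.0])
gallery_face = crop_sphere(scan_points, nose_tip, 100.0)
probe_face = crop_sphere(scan_points, nose_tip, 40.0)
```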


More information

Human Identification Based on Three- Dimensional Ear and Face Models

Human Identification Based on Three- Dimensional Ear and Face Models University of Miami Scholarly Repository Open Access Dissertations Electronic Theses and Dissertations 2011-05-05 Human Identification Based on Three- Dimensional Ear and Face Models Steven Cadavid University

More information

Peg-Free Hand Geometry Verification System

Peg-Free Hand Geometry Verification System Peg-Free Hand Geometry Verification System Pavan K Rudravaram Venu Govindaraju Center for Unified Biometrics and Sensors (CUBS), University at Buffalo,New York,USA. {pkr, govind} @cedar.buffalo.edu http://www.cubs.buffalo.edu

More information

3D Face Recognition Using Spherical Vector Norms Map *

3D Face Recognition Using Spherical Vector Norms Map * JOURNAL OF INFORMATION SCIENCE AND ENGINEERING 3, XXXX-XXXX (016) 3D Face Recognition Using Spherical Vector Norms Map * XUE-QIAO WANG ab, JIA-ZHENG YUAN ac AND QING LI ab a Beijing Key Laboratory of Information

More information

Robust biometric image watermarking for fingerprint and face template protection

Robust biometric image watermarking for fingerprint and face template protection Robust biometric image watermarking for fingerprint and face template protection Mayank Vatsa 1, Richa Singh 1, Afzel Noore 1a),MaxM.Houck 2, and Keith Morris 2 1 West Virginia University, Morgantown,

More information

CPSC 695. Geometric Algorithms in Biometrics. Dr. Marina L. Gavrilova

CPSC 695. Geometric Algorithms in Biometrics. Dr. Marina L. Gavrilova CPSC 695 Geometric Algorithms in Biometrics Dr. Marina L. Gavrilova Biometric goals Verify users Identify users Synthesis - recently Biometric identifiers Courtesy of Bromba GmbH Classification of identifiers

More information

Component-based Face Recognition with 3D Morphable Models

Component-based Face Recognition with 3D Morphable Models Component-based Face Recognition with 3D Morphable Models Jennifer Huang 1, Bernd Heisele 1,2, and Volker Blanz 3 1 Center for Biological and Computational Learning, M.I.T., Cambridge, MA, USA 2 Honda

More information

Fast and Accurate 3D Face Recognition

Fast and Accurate 3D Face Recognition Int J Comput Vis (2011) 93: 389 414 DOI 10.1007/s11263-011-0426-2 Fast and Accurate 3D Face Recognition Using Registration to an Intrinsic Coordinate System and Fusion of Multiple Region Classifiers Luuk

More information

A Hierarchical Face Identification System Based on Facial Components

A Hierarchical Face Identification System Based on Facial Components A Hierarchical Face Identification System Based on Facial Components Mehrtash T. Harandi, Majid Nili Ahmadabadi, and Babak N. Araabi Control and Intelligent Processing Center of Excellence Department of

More information

Local Features: Detection, Description & Matching

Local Features: Detection, Description & Matching Local Features: Detection, Description & Matching Lecture 08 Computer Vision Material Citations Dr George Stockman Professor Emeritus, Michigan State University Dr David Lowe Professor, University of British

More information

Two- and Three-Dimensional Face Recognition under Expression Variation. Narges Hoda Mohammadzade

Two- and Three-Dimensional Face Recognition under Expression Variation. Narges Hoda Mohammadzade Two- and Three-Dimensional Face Recognition under Expression Variation by Narges Hoda Mohammadzade A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy Graduate

More information

Overview of the Face Recognition Grand Challenge

Overview of the Face Recognition Grand Challenge To appear: IEEE Conference on Computer Vision and Pattern Recognition 2005. Overview of the Face Recognition Grand Challenge P. Jonathon Phillips 1, Patrick J. Flynn 2, Todd Scruggs 3, Kevin W. Bowyer

More information

Chapter 3 Image Registration. Chapter 3 Image Registration

Chapter 3 Image Registration. Chapter 3 Image Registration Chapter 3 Image Registration Distributed Algorithms for Introduction (1) Definition: Image Registration Input: 2 images of the same scene but taken from different perspectives Goal: Identify transformation

More information

Gurmeet Kaur 1, Parikshit 2, Dr. Chander Kant 3 1 M.tech Scholar, Assistant Professor 2, 3

Gurmeet Kaur 1, Parikshit 2, Dr. Chander Kant 3 1 M.tech Scholar, Assistant Professor 2, 3 Volume 8 Issue 2 March 2017 - Sept 2017 pp. 72-80 available online at www.csjournals.com A Novel Approach to Improve the Biometric Security using Liveness Detection Gurmeet Kaur 1, Parikshit 2, Dr. Chander

More information

Illumination invariant face detection

Illumination invariant face detection University of Wollongong Research Online University of Wollongong Thesis Collection 1954-2016 University of Wollongong Thesis Collections 2009 Illumination invariant face detection Alister Cordiner University

More information

Computer Animation Visualization. Lecture 5. Facial animation

Computer Animation Visualization. Lecture 5. Facial animation Computer Animation Visualization Lecture 5 Facial animation Taku Komura Facial Animation The face is deformable Need to decide how all the vertices on the surface shall move Manually create them Muscle-based

More information

CS 223B Computer Vision Problem Set 3

CS 223B Computer Vision Problem Set 3 CS 223B Computer Vision Problem Set 3 Due: Feb. 22 nd, 2011 1 Probabilistic Recursion for Tracking In this problem you will derive a method for tracking a point of interest through a sequence of images.

More information

Advances in Face Recognition Research

Advances in Face Recognition Research The face recognition company Advances in Face Recognition Research Presentation for the 2 nd End User Group Meeting Juergen Richter Cognitec Systems GmbH For legal reasons some pictures shown on the presentation

More information

Graph Matching Iris Image Blocks with Local Binary Pattern

Graph Matching Iris Image Blocks with Local Binary Pattern Graph Matching Iris Image Blocs with Local Binary Pattern Zhenan Sun, Tieniu Tan, and Xianchao Qiu Center for Biometrics and Security Research, National Laboratory of Pattern Recognition, Institute of

More information

A coarse-to-fine curvature analysis-based rotation invariant 3D face landmarking

A coarse-to-fine curvature analysis-based rotation invariant 3D face landmarking A coarse-to-fine curvature analysis-based rotation invariant 3D face landmarking Przemyslaw Szeptycki, Mohsen Ardabilian and Liming Chen Abstract Automatic 2.5D face landmarking aims at locating facial

More information

Exploring Facial Expression Effects in 3D Face Recognition Using Partial ICP

Exploring Facial Expression Effects in 3D Face Recognition Using Partial ICP Exploring Facial Expression Effects in 3D Face Recognition Using Partial ICP Yueming Wang 1, Gang Pan 1,, Zhaohui Wu 1, and Yigang Wang 2 1 Dept. of Computer Science, Zhejiang University, Hangzhou, 310027,

More information

VeriLook 5.3/MegaMatcher 4.4 Algorithm Demo

VeriLook 5.3/MegaMatcher 4.4 Algorithm Demo VeriLook 5.3/MegaMatcher 4.4 Algorithm Demo User's guide User's guide version: 5.3.0.0 Publish date: 1/29/2013 Table of Contents 1 Introduction 1 1.1 System Requirements 1 2 IP Cameras Configuration 2

More information

FACE recognition with 2D images is a challenging problem

FACE recognition with 2D images is a challenging problem 1 RGB-D Face Recognition with Texture and Attribute Features Gaurav Goswami, Student Member, IEEE, Mayank Vatsa, Senior Member, IEEE, and Richa Singh, Senior Member, IEEE Abstract Face recognition algorithms

More information

3D Assisted Face Recognition: Dealing With Expression Variations

3D Assisted Face Recognition: Dealing With Expression Variations DRAFT 1 3D Assisted Face Recognition: Dealing With Expression Variations Nesli Erdogmus, Member, IEEE, Jean-Luc Dugelay, Fellow Member, IEEE Abstract One of the most critical sources of variation in face

More information

A 2D+3D FACE IDENTIFICATION SYSTEM FOR SURVEILLANCE APPLICATIONS

A 2D+3D FACE IDENTIFICATION SYSTEM FOR SURVEILLANCE APPLICATIONS A 2D+3D FACE IDENTIFICATION SYSTEM FOR SURVEILLANCE APPLICATIONS Filareti Tsalakanidou, Sotiris Malassiotis and Michael G. Strintzis Informatics and Telematics Institute Centre for Research and Technology

More information

EE795: Computer Vision and Intelligent Systems

EE795: Computer Vision and Intelligent Systems EE795: Computer Vision and Intelligent Systems Spring 2012 TTh 17:30-18:45 FDH 204 Lecture 10 130221 http://www.ee.unlv.edu/~b1morris/ecg795/ 2 Outline Review Canny Edge Detector Hough Transform Feature-Based

More information

CHAPTER 4 FACE RECOGNITION DESIGN AND ANALYSIS

CHAPTER 4 FACE RECOGNITION DESIGN AND ANALYSIS CHAPTER 4 FACE RECOGNITION DESIGN AND ANALYSIS As explained previously in the scope, this thesis will also create a prototype about face recognition system. The face recognition system itself has several

More information

ATINER's Conference Paper Series COM

ATINER's Conference Paper Series COM Athens Institute for Education and Research ATINER ATINER's Conference Paper Series COM2012-0049 A Multi-Level Hierarchical Biometric Fusion Model for Medical Applications Security Sorin Soviany, Senior

More information

Training Algorithms for Robust Face Recognition using a Template-matching Approach

Training Algorithms for Robust Face Recognition using a Template-matching Approach Training Algorithms for Robust Face Recognition using a Template-matching Approach Xiaoyan Mu, Mehmet Artiklar, Metin Artiklar, and Mohamad H. Hassoun Department of Electrical and Computer Engineering

More information

CS 231A Computer Vision (Fall 2012) Problem Set 3

CS 231A Computer Vision (Fall 2012) Problem Set 3 CS 231A Computer Vision (Fall 2012) Problem Set 3 Due: Nov. 13 th, 2012 (2:15pm) 1 Probabilistic Recursion for Tracking (20 points) In this problem you will derive a method for tracking a point of interest

More information

DEFORMABLE MATCHING OF HAND SHAPES FOR USER VERIFICATION. Ani1 K. Jain and Nicolae Duta

DEFORMABLE MATCHING OF HAND SHAPES FOR USER VERIFICATION. Ani1 K. Jain and Nicolae Duta DEFORMABLE MATCHING OF HAND SHAPES FOR USER VERIFICATION Ani1 K. Jain and Nicolae Duta Department of Computer Science and Engineering Michigan State University, East Lansing, MI 48824-1026, USA E-mail:

More information

Shape Context Matching For Efficient OCR

Shape Context Matching For Efficient OCR Matching For Efficient OCR May 14, 2012 Matching For Efficient OCR Table of contents 1 Motivation Background 2 What is a? Matching s Simliarity Measure 3 Matching s via Pyramid Matching Matching For Efficient

More information

Learning-based Neuroimage Registration

Learning-based Neuroimage Registration Learning-based Neuroimage Registration Leonid Teverovskiy and Yanxi Liu 1 October 2004 CMU-CALD-04-108, CMU-RI-TR-04-59 School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213 Abstract

More information

Detecting Printed and Handwritten Partial Copies of Line Drawings Embedded in Complex Backgrounds

Detecting Printed and Handwritten Partial Copies of Line Drawings Embedded in Complex Backgrounds 9 1th International Conference on Document Analysis and Recognition Detecting Printed and Handwritten Partial Copies of Line Drawings Embedded in Complex Backgrounds Weihan Sun, Koichi Kise Graduate School

More information

CHAPTER 3 PRINCIPAL COMPONENT ANALYSIS AND FISHER LINEAR DISCRIMINANT ANALYSIS

CHAPTER 3 PRINCIPAL COMPONENT ANALYSIS AND FISHER LINEAR DISCRIMINANT ANALYSIS 38 CHAPTER 3 PRINCIPAL COMPONENT ANALYSIS AND FISHER LINEAR DISCRIMINANT ANALYSIS 3.1 PRINCIPAL COMPONENT ANALYSIS (PCA) 3.1.1 Introduction In the previous chapter, a brief literature review on conventional

More information