Mobile Biometrics (MoBio): Joint Face and Voice Verification for a Mobile Platform

P. A. Tresadern, et al.
December 14, 2010

Abstract

The Mobile Biometrics (MoBio) project combines real-time face and voice verification for better security of personal data stored on, or accessible from, a mobile platform.

1 Introduction

Modern smartphones not only have the memory capacity to store large amounts of sensitive data (e.g. contact details, personal photographs) but also provide access, via mobile internet, to even larger quantities of personal data stored remotely (e.g. on internet banking and social networking sites). Though passwords provide some protection against unauthorized access to this data, an attractive alternative is to authenticate yourself using biometrics: physical characteristics, such as fingerprints, that are unique to you and easy to remember, yet hard for you to lose or for someone else to steal. In the Mobile Biometrics project, we use the camera and microphone on a mobile device to capture your face and voice, and combine these two biometrics for secure yet rapid user verification, ensuring that someone who wants to access the data is entitled to do so.

This work was motivated by the rapid expansion in internet applications that provide access to personal and sensitive data. Because of the growth in these sites, it is now almost impossible to remember a unique password for every site. One option is to use the same password for every site, with the risk that someone need only discover that one password to access all of your personal data. A more common alternative is to allow the browser software to remember your password for you. Although this may be acceptable for most desktop systems, it is much less suitable for mobile devices that are easily lost or stolen. As a result, alternative verification methods, such as those based on biometrics, become highly attractive.

Though biometric systems require data capture facilities, modern smartphones conveniently come equipped with a video camera and microphone that we can use to capture the face and voice.

Figure 1: The MoBio identity verification system computes feature vectors from the captured and normalized face and voice, compares the features to stored models, fuses the scores for improved robustness, and performs a bimodal verification. (Stages shown: data capture, feature extraction, unimodal model comparison, data fusion, bimodal verification.)

The bigger challenge in biometric verification, however, is to capture the signal in such a way that the system is not confused by day-to-day variations. Faces, for example, look different depending on identity, expression and lighting, and also change over time (e.g. when growing a beard). Similarly, voices sound different depending on your physical condition (e.g. having a sore throat) and can be difficult to separate from background noise in noisy environments. In addition, we must make the system robust to spoofing by impostors: checking that the lips move in a video, for example, ensures that a photo cannot be used to fool the system.

Our aim in the MoBio project is to develop a software verification layer that uses your face and voice, captured by the mobile, to ensure that you are who you claim to be (Figure 1). This article describes how we authenticate the face and voice independently, combine these biometrics to make the system more robust, update system models to allow for changes in appearance over time, and implement the system within the limitations of a mobile platform.

2 Identity Verification

We first need to clarify the difference between recognition and verification. Identity recognition asks "who is this?", and searches through a database of stored models to find the best match. Identity verification, however, is a simpler problem that asks "is this John Smith?", and therefore needs only to compare that person to the model for John Smith.
Using some goodness-of-fit measure, we can compute a score of how well the person matches the model of who they claim to be, and decide whether they are a client or an impostor. Clients are given access to the resource they need; impostors are not.

3 Face Analysis

3.1 Face Detection

To capture the user's appearance, we begin with an image that contains (somewhere) the user's face. Our first job is then to localize (e.g. draw a box around) the face in the image to give us a rough estimate of its position and size (Figure 2). The difficulty in this problem stems from the possible variation in a face's appearance in the image: our system must be able to detect faces regardless of shape and size, identity, skin colour, expression and lighting conditions. Ideally, it should also be able to handle a wide variety of orientations and occlusion, but in the case of mobile verification we assume the person is looking almost directly into the camera almost all of the time.

Figure 2: A window slides across the image, and the underlying region is sampled and reduced to a feature vector. This feature vector feeds into a simple classifier that rejects obvious non-faces; subwindows that are accepted then feed into a succession of more complex classifiers until all non-faces have been rejected, leaving only the true faces.

Detecting the face is usually approached as a classification task: for every plausible region of the image, classify it as either face or not face based on the image properties. Though hand-crafted classifiers were the norm in previous years [9], modern methods are typically based on Pattern Recognition and learn which characteristics differentiate faces from things that are not faces using a database of labelled examples [8]. Two key considerations in such systems are how to summarize the properties of the image region in a compact form (i.e. compute its feature vector) and how to classify the image region based on the image features.
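
As an illustration of the cascade shown in Figure 2, the following minimal sketch passes each window's feature vector through a chain of stages and rejects it at the first stage that fails. The stage functions, thresholds and toy feature vectors are hypothetical stand-ins for trained classifiers; this is not the MoBio detector itself.

```python
from typing import Callable, List, Sequence, Tuple

# Each stage is (score_function, threshold). Both are hypothetical stand-ins
# for trained classifiers of increasing complexity.
Stage = Tuple[Callable[[Sequence[float]], float], float]


def cascade_accepts(features: Sequence[float], stages: List[Stage]) -> bool:
    """Return True only if every stage in the chain accepts the feature vector."""
    for score, threshold in stages:
        if score(features) < threshold:
            return False  # rejected early; later stages never run
    return True


def detect(windows: List[Sequence[float]], stages: List[Stage]) -> List[int]:
    """Return the indices of the windows that survive the whole cascade."""
    return [i for i, f in enumerate(windows) if cascade_accepts(f, stages)]


# Toy stages: a cheap test on one feature, then a costlier test on all of them.
toy_stages: List[Stage] = [
    (lambda f: f[0], 0.5),
    (lambda f: sum(f) / len(f), 0.6),
]
toy_windows = [[0.1, 0.9, 0.9], [0.9, 0.2, 0.1], [0.8, 0.7, 0.9]]
print(detect(toy_windows, toy_stages))
```

In this toy run the first window is discarded by the cheap first stage, the second by the costlier second stage, and only the third is reported as a detection.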
When searching an image for the first time, there are tens or hundreds of thousands of possible locations for the face and so it is important that each image region can be summarized quickly. By computing a summed area table [3] (or integral image [8]), responses to a filter similar to the Haar wavelet can be computed in constant time, making them a popular choice. An alternative is to use a Census Transform [10] or Local Binary Pattern (LBP [6]) that captures the local statistics of image gradients: using a given pixel as a reference, we compute a binary code that indicates which of its neighbours (usually eight pixels, distributed equally on a circle of fixed radius; see Figure 4) are brighter and which are darker than the reference value. In this work, we take the mean intensity as a reference and include the centre pixel to give a more discriminative 9-bit descriptor known as the Modified Census Transform (MCT [4]).

Though there may be thousands of possible locations for the face, if we already have an estimate of the location of the face (e.g. from the previous frame of a video or by instructing the user to keep their face within a specified region) we could restrict the search to a small region around our current estimate and therefore use a more complex image representation.

Having summarized the image region, we feed its corresponding feature vector into a classifier to decide whether the region should be labelled as face or not face. Since there are many possible locations in the image, we use a cascade of increasingly complex classifiers to discard non-face regions efficiently (Figure 2). In principle, a region is considered by a classifier only if it has been accepted by all previous classifiers in the chain. In practice, this means that we reject the majority of image regions (that look nothing at all like a face) using very simple and efficient classifiers in the early stages of the cascade; the more accurate but computationally demanding classifiers are reserved only for those image regions that look most face-like, where classification is more challenging. Our experiments on standard datasets (e.g. BANCA [1] and XM2VTS [5]) suggest that, for a given image, we fail to detect at most around 3% of the true faces.

3.2 Face Normalization

Once we have found the approximate position and size of the face in the image, our next goal is to normalize the face so that it has similar properties (with respect to shape and texture) to the stored model (Figure 3). Our first step is to localize individual facial features such as the eyes, nose, mouth and jawline so that we can crop out only the region of the image covered by the face, removing any background that is not relevant for face verification. By stretching this region to fit a pre-defined shape, we compensate for differences due to the orientation of the head in three dimensions, the person's expression, and the shape of the individual's face (a weak cue for verification). Finally, we normalize the lighting of the image by setting brightness and contrast to some fixed values. After this normalization, we end up with a face image that has a known shape with standardized lighting and can therefore be directly compared with a similarly normalized model image for more accurate verification.
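
Returning briefly to the descriptor of Section 3.1, the Modified Census Transform can be sketched in a few lines. This sketch assumes a 3 × 3 neighbourhood and a row-major bit order (both are our assumptions for illustration); each of the nine pixels, centre included, contributes one bit that is set when the pixel is brighter than the neighbourhood mean.

```python
from typing import List


def modified_census_transform(image: List[List[float]]) -> List[List[int]]:
    """Return a 9-bit MCT code for every interior pixel of a grayscale image."""
    height, width = len(image), len(image[0])
    codes = [[0] * (width - 2) for _ in range(height - 2)]
    for r in range(1, height - 1):
        for c in range(1, width - 1):
            # 3 x 3 neighbourhood in row-major order, centre pixel included.
            patch = [image[r + dr][c + dc] for dr in (-1, 0, 1) for dc in (-1, 0, 1)]
            mean = sum(patch) / 9.0
            code = 0
            for value in patch:
                # One bit per pixel: 1 if brighter than the local mean.
                code = (code << 1) | (1 if value > mean else 0)
            codes[r - 1][c - 1] = code
    return codes


flat = [[5, 5, 5], [5, 5, 5], [5, 5, 5]]
centre_bright = [[0, 0, 0], [0, 9, 0], [0, 0, 0]]
top_bright = [[9, 9, 9], [0, 0, 0], [0, 0, 0]]
print(modified_census_transform(centre_bright))
```

Because every bit is relative to the local mean, adding a constant to all pixels leaves the codes unchanged, which is what makes this kind of descriptor attractive under varying illumination.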

Figure 3: Statistical models of shape and texture are estimated from training data and fit to a new image using the Active Appearance Model. The underlying image can then be sampled to remove background information, warped to remove irrelevant shape information (e.g. due to expression), and normalized to standardized brightness and contrast levels. (Stages shown: shape model, texture model, fit to image, crop image, normalize shape, normalize lighting.)

To find the facial features, we base our approach on the popular Active Appearance Model (AAM [2]) that fits a deformable model of a face to the image. The AAM contains statistical models of shape and texture variation (learned from a training set of example images whose facial features have been labelled by hand) that allow us to describe a face using only a few parameters. By displacing parameters from their true values in each training image and sampling the underlying pixels, we train the AAM to predict corrections to the model parameters when presented with a misaligned image sample.

When fitting the model to a new image, we start from an initial guess (e.g. aligned with the face detection result), and crop and normalize the corresponding part of the image. From this misaligned sample, the AAM then predicts a correction to the shape and pose parameters, and we update our estimate accordingly. By repeating this sample-predict-update cycle several times, we converge on the true feature locations that give us a normalized texture sample for verification.

To predict updates to the model parameters from sampled image data, we typically train a linear regressor using hand-labelled examples. To stabilize the regression, we either apply regularization (e.g. via ridge regression) or regress the desired predictions against a low-dimensional feature vector. One simple approach is to apply Principal Component Analysis (PCA) to the training examples and keep only those components that are associated with the highest variance.
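
To make the regularized regression concrete, the sketch below fits a ridge regressor that maps a small feature vector to a single parameter correction by solving the normal equations (X^T X + lam I) w = X^T y. The two-feature toy data, the helper names and the value of lam are illustrative assumptions; a real AAM regressor would predict many parameters from many more features.

```python
from typing import List


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def ridge_fit(samples: List[List[float]], targets: List[float], lam: float) -> List[float]:
    """Return weights w minimising ||X w - y||^2 + lam * ||w||^2."""
    d = len(samples[0])
    xtx = [[sum(s[i] * s[j] for s in samples) + (lam if i == j else 0.0)
            for j in range(d)] for i in range(d)]
    xty = [sum(s[i] * t for s, t in zip(samples, targets)) for i in range(d)]
    return solve(xtx, xty)


# Toy data: a parameter correction that depends linearly on two features.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
y = [2.0 * a - 1.0 * b for a, b in X]
w_plain = ridge_fit(X, y, 0.0)  # ordinary least squares
w_ridge = ridge_fit(X, y, 1.0)  # regularized: weights shrink towards zero
print(w_plain, w_ridge)
```

With lam = 0 the fit recovers the generating weights exactly; increasing lam trades some of that fit for smaller, more stable weights, which is the stabilizing effect referred to above.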
An alternative, however, is to use Haar-like features (described in Section 3.1) that have the added benefit of being well-suited to a mobile platform. Since there are more Haar-like features than pixels, we select the most useful features via greedy function approximation that generates an ensemble of linear regressors whose combined output predicts the desired parameter updates.

For efficiency, our implementation uses a multi-resolution approach: early iterations sample image data at low resolution and predict coarse shape parameters such as the overall position and orientation of the face; later stages sample higher resolution image data and pick out the fine shape details of the individual's face. Typically, we are able to localize key facial features to within 5% of the distance between the eyes.

3.3 Face Verification

Given a normalized image of the face, the final step is to assign a score describing how well it matches the stored model for the claimed identity, and to use that score to decide whether to accept or reject that person's claim (Figure 4). Again, we treat this as a classification problem, but here we want to label that person as a client or an impostor based on what they look like, as summarized by their image feature vector.

Figure 4: A cropped face window is subdivided into blocks, each of which is processed with a Local Binary Pattern operator at several scales. We then capture the distributions of LBP values in histograms that we concatenate and reduce in dimensionality (e.g. via Principal Component Analysis) before comparing with the stored model. (Stages shown: image subdivision, multiscale LBP histograms, dimensionality reduction, image-to-model comparison.)

Since illumination conditions can drastically alter someone's appearance, we first apply gamma correction, Difference of Gaussian filtering and variance equalization to remove as many lighting effects as possible. For added robustness, we compute the Local Binary Pattern value for every pixel in the image over three scales and subdivide the processed images into non-overlapping subwindows of 8 × 8 pixels to make the descriptor more robust to occlusion. We then summarize every window by its histogram and concatenate all histograms to give a feature vector for the image (Figure 4).

When classifying an observed feature vector as client or impostor, we compute its distance (e.g. via histogram intersection) to the stored model of the claimed identity; for numerical conditioning, we may reduce dimensionality first via Principal Component Analysis and use a different distance metric. Though we could make a decision based on this similarity measure, to make the system more robust we may instead use a likelihood ratio whereby the distance to a background (or "world") model provides a point of reference that expresses how much more than average the observation matches the claimed identity, thus indicating our confidence in the classification. Applying this system to the BANCA dataset [1], we were able to achieve half total error rates (where false acceptances are as likely as false rejections) of around 5%.

4 Voice Analysis

4.1 Voice Activity Detection

Though face verification technology is maturing, we also exploit the fact that we have a microphone at our disposal by including voice-based speaker verification in our system. Given a sound sample that was captured using the mobile's microphone, our first step is to separate speech (which is useful for speaker recognition) from background noise (which is not). As in face detection, however, speech detection is complicated by variation from speaker to speaker (e.g. due to characteristics of the vocal tract, learned habits, physiological traits such as a lisp, and language) and from session to session for the same speaker (e.g. as a result of having a cold).

To summarize the sound of someone's voice, we can use the shape of the vocal tract, which varies slowly during speech. In particular, we can represent its shape by a feature vector that summarizes the frequency characteristics over a small window (on the order of a few tens of milliseconds) around any given time. A popular representation known as cepstral analysis computes the spectrum of the windowed signal via a Fourier Transform and decomposes the logarithm of this spectrum by a second Fourier Transform or a Discrete Cosine Transform.
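
The cepstral analysis just described can be sketched as follows: window one frame, take the magnitude spectrum, take its logarithm, and decompose the result with a Discrete Cosine Transform. The Hamming window, the direct (slow) DFT, the number of coefficients and the 8 kHz toy signal are our assumptions for illustration, and no mel-scale mapping is applied.

```python
import cmath
import math
from typing import List


def dft_magnitude(frame: List[float]) -> List[float]:
    """Magnitude spectrum of a real frame (first N/2 + 1 bins) via a direct DFT."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(acc))
    return spectrum


def dct_ii(values: List[float], num_coeffs: int) -> List[float]:
    """Unnormalised type-II Discrete Cosine Transform, keeping num_coeffs terms."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / n) for i, v in enumerate(values))
            for k in range(num_coeffs)]


def cepstrum(frame: List[float], num_coeffs: int = 8) -> List[float]:
    """Window the frame, take the log magnitude spectrum, then apply a DCT."""
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * t / (n - 1)))
                for t, x in enumerate(frame)]  # Hamming window
    log_spec = [math.log(m + 1e-10) for m in dft_magnitude(windowed)]
    return dct_ii(log_spec, num_coeffs)


# Toy frame: 20 ms of a 440 Hz tone sampled at 8 kHz (160 samples).
frame = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(160)]
coeffs = cepstrum(frame)
louder = cepstrum([2.0 * x for x in frame])
print(coeffs[:3])
```

One useful property of the logarithm step is visible here: doubling the signal amplitude shifts only the zeroth coefficient (by a constant offset) and leaves the remaining coefficients essentially unchanged.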
Mapping the spectrum into the mel scale (where distances more closely match perceived differences in pitch) before the second decomposition gives mel-frequency cepstral coefficients (MFCCs).

A popular approach to classifying these feature vectors as speech or non-speech uses a Gaussian Mixture Model for each of the two classes, discarding the temporal ordering of feature vectors for simplicity. Combined with low-pass smoothing of the classifier output, this has proved to be an effective technique for examples with a high signal-to-noise ratio; in environments with a lot of background noise, however, more complex methods are required that use more than the energies present in the signal. We therefore use an Artificial Neural Network (ANN) to classify MFCC vectors, derived from a longer temporal context of around 300 ms, as either one of 29 phonemes or as non-speech. The output of the ANN is a vector of posterior probabilities corresponding to the 29 phonemes plus a non-speech class. These vectors are smoothed over time using a Hidden Markov Model to account for the (language-dependent) known frequency of phoneme orderings learnt from training data, and the 29 phoneme classes are then merged to form the speech samples.

4.2 Speaker Verification

Having discarded background noise, we then use the remaining useful segments of speech to compute how well the person's voice matches that of the claimed identity, and decide whether to accept or reject their claim. Again, we use the MFCC representation to describe the sound of the client's voice. More specifically, we use 19 MFCCs plus an energy coefficient, augmented with their first and second derivatives to give a 60-dimensional feature vector. These coefficients are computed using a sliding window of width 20 ms with a shift of 10 ms. After removing silence frames with voice activity detection, we apply a short-time cepstral mean and variance normalization over a window of 300 frames.

To classify the claimant's feature vectors, we use Joint Factor Analysis based on a parametric Gaussian Mixture Model where the weights and covariances of the mixture components are learned at the outset but the centres are specified as a function of the data (ideally, of the speaker and the session). The weights, covariances and means of the mixture components are learned using a large cohort of individuals. The subject-subspace is then learned using a database of known speakers, pooling over sessions to reduce inter-session variability. The session-subspace is learned from most of what is left. When classifying, we go through each training example and estimate the speaker and session in order to generate a client-specific model. We then ignore the session estimate (since we are not interested in whether the sessions match, only the speaker) and compute the likelihood of the test example given the speaker-specific model.
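
The feature pipeline described earlier in this section (20 static coefficients extended to 60 dimensions, followed by short-time mean and variance normalization) might be sketched as below. The central-difference derivative, the centring of the 300-frame window and the synthetic input are assumptions made for illustration; the source does not specify these details.

```python
import math
from typing import List

Frame = List[float]


def deltas(frames: List[Frame]) -> List[Frame]:
    """Central-difference derivative of each coefficient, clamped at the ends."""
    last = len(frames) - 1
    return [[(frames[min(t + 1, last)][i] - frames[max(t - 1, 0)][i]) / 2.0
             for i in range(len(frames[t]))] for t in range(len(frames))]


def add_dynamics(static: List[Frame]) -> List[Frame]:
    """Append first and second derivatives: 20 static values become 60."""
    d1 = deltas(static)
    d2 = deltas(d1)
    return [s + a + b for s, a, b in zip(static, d1, d2)]


def sliding_cmvn(frames: List[Frame], window: int = 300) -> List[Frame]:
    """Normalise each coefficient to zero mean and unit variance over a
    window of frames centred (where possible) on the current frame."""
    out = []
    half = window // 2
    for t in range(len(frames)):
        lo = max(0, t - half)
        hi = min(len(frames), lo + window)
        lo = max(0, hi - window)  # keep the window full near the end
        block = frames[lo:hi]
        row = []
        for i, value in enumerate(frames[t]):
            column = [f[i] for f in block]
            mean = sum(column) / len(column)
            var = sum((v - mean) ** 2 for v in column) / len(column)
            row.append((value - mean) / math.sqrt(var + 1e-10))
        out.append(row)
    return out


# Synthetic stand-in for 50 frames of 19 MFCCs plus one energy coefficient.
static = [[math.sin(0.1 * t * (i + 1)) for i in range(20)] for t in range(50)]
features = sliding_cmvn(add_dynamics(static))
print(len(features), len(features[0]))
```

When the utterance is shorter than the window, as in this toy example, the normalization reduces to ordinary per-utterance mean and variance normalization.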
Score normalization then gives a measure that we can use for classification. On the BANCA dataset [1], we achieved equal error rates of around 3-4% for speaker verification.

5 Model Adaptation

One major challenge with biometric verification systems is accommodating the various ways that someone's appearance might change over time, either intentionally (e.g. growing a beard or wearing make-up) or unintentionally (e.g. developing wrinkles); similarly, having a cold or sore throat changes the sound of your voice. Any practical verification system needs to factor in these changes over time and adjust its criteria for accepting or rejecting a claim accordingly.

One approach is to perform matching in a feature space that is invariant to the transformations that the data may undergo. Normalizing the lighting, for example, is a common step in face normalization, whereas more complex operations such as pose correction (where a frontal view of the face is inferred) and glasses removal may also make the system more robust. This form of data adaptation is most suitable for short-term effects that are likely to vary randomly from one verification session to the next.

Other factors, however, do not vary randomly from one session to the next. Growing a beard, for example, is likely to be a long-term change such that we can expect to see similar differences between the observed data and the model in the future. Under these circumstances, it is more intuitive to modify the model to more closely match the observed data, in effect remembering long-term changes that are likely to be present in future sessions. This does, however, pose a number of challenges such as deciding when a model should be adapted: we do not want to adapt the model to more closely resemble an impostor.

We examined two scenarios of model adaptation: supervised and unsupervised. In the supervised case, the model is adapted using only those samples that are manually labelled for a given client. Under these ideal conditions, we observed a relative improvement of up to 85% in reducing the half total error rate (i.e. the average of false acceptance and false rejection rates) at an operating point where false acceptance and false rejection rates were equal. In the more realistic unsupervised situation, however, we adapted the model only when we had sufficient confidence in our classification and achieved a smaller relative gain of approximately 40%.

A third form of adaptation, known as quality-based model adaptation, involves computing a quality measure for every available training sample, clustering these quality measures into groups (that we hope correspond to different capture conditions) and learning a classifier for every condition.
During testing, we compute the same quality measures for the observed data and use them to weight the contribution of each of the learnt models to the final classification, giving more weight to those models whose quality measures match those that were observed. Such quality-based model adaptation is often formulated in a Bayesian framework [7].

6 Data Fusion

At this point, for every sample in the video sequence we have one score that tells us how much the person looks like their claimed identity and another score for how much they sound like their claimed identity. To give us a system that performs better than either biometric on its own, we fuse these two modalities either by classifying each modality independently and feeding the resulting pair of scores into a third classifier (score-level fusion), or by fusing the features and passing the result to a single classifier (feature-level fusion). Since we are concerned with video sequences, it is also beneficial to fuse scores (or features) over time.

A naïve approach to score-level fusion might be to pool data over time by averaging scores over the sequence. More principled methods instead model the distribution of scores over the observed sequence and compare this to distributions, learnt from training data, that correspond to true and false matches. Since this requires measuring the distance between distributions (which is not well-defined), we investigated an alternative approach that computes non-parametric statistics (such as mean, variance and inter-quartile range) of the score distribution and trains a linear discriminative classifier (via logistic regression) to separate true and false matches.

Since many face verification systems rely on proprietary software libraries where the internal classifier workings (including feature extraction) are hidden, score-level fusion is a popular choice in the first instance. In contrast, feature-level fusion is more flexible and can capture relationships between the two modalities. It may, however, result in a large joint feature space where the curse of dimensionality becomes problematic and where we must take care when fusing features derived from sources with different sampling rates (e.g. video and audio).

In our work, we have developed a feature-level fusion technique that searches over the space of feature pairs (one face and one speech) to find the pair for which a quadratic discriminant analysis (QDA) classifier minimizes the misclassification rate; iterating this process while reweighting training samples (known as boosting) gives a boosted slice classifier. Although this classifier performed no better under controlled conditions, our feature-level fusion strategy outperformed a baseline score-level fusion reference system when one modality was corrupted, suggesting that including both modalities makes the system more robust, as hoped.
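
As a rough illustration of the score-level approach described in this section, the sketch below summarizes each modality's per-frame scores by mean, variance and inter-quartile range, concatenates the two summaries, and trains a logistic regression classifier by plain gradient descent. The toy score sequences, the quartile rule, the learning rate and the number of steps are all hypothetical choices, not values from the MoBio system.

```python
import math
from typing import List, Tuple


def sequence_stats(scores: List[float]) -> List[float]:
    """Summarise per-frame scores by mean, variance and inter-quartile range."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / n
    ordered = sorted(scores)
    iqr = ordered[(3 * (n - 1)) // 4] - ordered[(n - 1) // 4]
    return [mean, var, iqr]


def fusion_features(face: List[float], voice: List[float]) -> List[float]:
    """Concatenate the summary statistics of both modalities."""
    return sequence_stats(face) + sequence_stats(voice)


def train_logistic(x: List[List[float]], y: List[int],
                   steps: int = 2000, lr: float = 0.5) -> List[float]:
    """Fit weights (last entry is the bias) by batch gradient descent on the log loss."""
    w = [0.0] * (len(x[0]) + 1)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for features, label in zip(x, y):
            z = sum(wi * fi for wi, fi in zip(w, features)) + w[-1]
            err = 1.0 / (1.0 + math.exp(-z)) - label
            for i, fi in enumerate(features):
                grad[i] += err * fi
            grad[-1] += err
        w = [wi - lr * g / len(x) for wi, g in zip(w, grad)]
    return w


def accept(w: List[float], face: List[float], voice: List[float]) -> bool:
    """Accept the claim when the fused linear score is positive."""
    f = fusion_features(face, voice)
    return sum(wi * fi for wi, fi in zip(w, f)) + w[-1] > 0.0


# Toy training sequences: (face scores, voice scores) with label 1 for true matches.
Pair = Tuple[List[float], List[float]]
clients: List[Pair] = [([0.8, 0.9, 0.7, 0.85], [0.7, 0.75, 0.8, 0.9]),
                       ([0.75, 0.8, 0.9, 0.7], [0.85, 0.8, 0.7, 0.9])]
impostors: List[Pair] = [([0.2, 0.3, 0.1, 0.25], [0.3, 0.2, 0.25, 0.1]),
                         ([0.3, 0.2, 0.35, 0.15], [0.1, 0.2, 0.3, 0.2])]
train_x = [fusion_features(f, v) for f, v in clients + impostors]
train_y = [1, 1, 0, 0]
weights = train_logistic(train_x, train_y)
print(accept(weights, [0.9, 0.8, 0.85, 0.8], [0.8, 0.7, 0.9, 0.8]))
```

Summarizing the sequence first means the classifier sees a fixed-length input regardless of how many frames were captured, which sidesteps the problem of comparing whole score distributions directly.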

7 Mobile Platform Implementation

Since we want to run the system on a mobile device, we need to consider the limitations of the available hardware such as low-power processing, a fixed-point architecture and limited memory. We therefore carried out a number of experiments that looked at the effect on accuracy when making approximations that would make the system more efficient.

One very effective modification was to implement as many methods as possible using fixed-point (rather than floating-point) arithmetic. Although some modern devices are equipped with floating-point units, they are far from common and are noticeably less efficient than using fixed-point. Other engineering solutions include an early stopping criterion for the face detection and reducing the dimensionality of the models used in face normalization and speech verification.

As well as reducing computation, we also investigated ways to reduce memory consumption (which also has performance benefits). In face detection, for example, we reduced the number of LBP scales, the number of subwindows and the final dimensionality of the feature vector. In face normalization, we varied parameters such as the number of modes in the linear shape model and the number of points tracked on the face. For the speech systems, reducing the number of Gaussian mixture components was an obvious way to reduce computational demands, in addition to reducing the size of the acoustic vector and the number of vectors used for channel compensation.

To evaluate the effects of these approximations we produced 48 scaled face systems and 27 speech systems to give 1296 combinations, each of which was rated by two criteria: an abstract cost reflecting both memory consumption and speed; and resultant generalization performance measured by equal error rate (Figure 5).

Figure 5: Accuracy vs. efficiency trade-off for various systems tested, confirming that better accuracy comes at a cost in efficiency. (Axes: cost against EER (%).)

As a result of such approximations, we were able to develop a prototype application for the Nokia N900, which has a Texas Instruments OMAP3 microprocessor with a 600 MHz ARM Cortex-A8 core, 256 MB RAM and a front-facing VGA camera for video capture. Using GTK for the user interface and gstreamer to handle video capture (Figure 6), we were able to achieve near frame-rate operation for the whole identity verification system and frame-rate operation for some modules (e.g. facial feature localization).

Figure 6: Mobile Biometrics interface demonstrating face detection, facial feature localization (for shape normalization) and the user interface with automated login and logout for popular websites such as e-mail and social networking.
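
The fixed-point arithmetic mentioned earlier in this section can be illustrated with a small sketch. It assumes a Q15 format (15 fractional bits), which is our choice for illustration and not necessarily the format used in the MoBio prototype; Python integers stand in for the fixed-width registers of a real device, so overflow is not modelled.

```python
from typing import List

Q = 15        # number of fractional bits (Q15 format; an assumption)
ONE = 1 << Q  # fixed-point representation of 1.0


def to_fixed(x: float) -> int:
    """Convert a float to a fixed-point integer by scaling and rounding."""
    return int(round(x * ONE))


def to_float(a: int) -> float:
    """Convert a fixed-point integer back to a float."""
    return a / ONE


def fixed_mul(a: int, b: int) -> int:
    """Multiply two Q15 numbers. The raw product has 30 fractional bits,
    so add half an LSB and shift right by 15 to return to Q15."""
    return (a * b + (1 << (Q - 1))) >> Q


def fixed_dot(xs: List[int], ws: List[int]) -> int:
    """Dot product that accumulates at full precision and rescales once."""
    acc = sum(x * w for x, w in zip(xs, ws))
    return (acc + (1 << (Q - 1))) >> Q


half = to_fixed(0.5)
quarter = fixed_mul(half, half)
xs = [to_fixed(v) for v in (0.5, -0.25, 0.125)]
ws = [to_fixed(v) for v in (0.5, 0.5, 0.5)]
print(to_float(quarter), to_float(fixed_dot(xs, ws)))
```

Accumulating the dot product before rescaling, rather than rescaling each product, avoids compounding rounding errors, a common choice when porting classifiers such as linear models to integer hardware.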


Figure 7: Screenshots from database, showing the unconstrained nature of the indoor environments and uncontrolled lighting conditions.

8 MoBio Database and Protocol

One major difference between the MoBio project and other related projects is that the MoBio system is a bimodal system that uses the face and the voice. In contrast, many publicly available datasets that are designed for recognition or verification contain either face data or voice data but not both. Moreover, the data are often captured using a high-quality camera or microphone under controlled conditions and therefore are not realistic for our application: we are limited to a low-quality, hand-held camera. Even the few that come close (e.g. the BANCA dataset [1]) use a static camera and so do not have the image jitter, caused by small hand movements, that we have to deal with.

Since we anticipate other mobile recognition and verification applications in the future, we used a handheld mobile device (the Nokia N93i) to collect a new database that is realistic and is publicly available (footnote 1) for research purposes (Figure 7). This database was collected over a period of 18 months from six sites across Europe, contains over 150 subjects and was collected in two phases for each subject: the first phase includes 21 videos per session for six sessions; the second contains 11 videos per session for six sessions.

Footnote 1: From where?

8.1 Testing Protocols

To ensure a like-for-like comparison in other studies, we have designed an evaluation protocol that specifies how the database should be used to evaluate algorithms. The database was split into three non-overlapping sets (training, development (or validation) and testing) by taking the data from two of the six sites for each set. No information regarding individuals or conditions is shared between any of the three sets, as is required for an accurate evaluation of any machine learning algorithm.

The training set is typically used to build a background (or world) model. Since this is not client-specific, the data can be used in any way the system designer chooses.

The development set is used to adapt the background model to be client-specific so that we can compute suitable values for system parameters (e.g. thresholds). More specifically, we use questions 1-5 of Session 1 to enrol each client before computing error rates for a range of parameter values using the 105 free speech questions in Sessions [...]. We can then select system parameters that correspond to the best operating point, defined in terms of the desired ratio of false rejections (i.e. rejecting the true client) to false acceptances (i.e. giving access to an impostor). With N [...] 30 subjects in the development set, there are approximately [...] possible test scores with which to tune parameters.

Having selected thresholds for the classification algorithm, the test set is used to derive the final set of scores that define algorithm performance. Again, we use questions 1-5 of Session 1 to enrol each client and test the system with the free speech questions from Sessions [...]. In total, [...] scores (footnote 2) are available with which to estimate the performance of the verification algorithm.

Footnote 2: How did we get this number?

9 Summary

This paper has outlined a new system for identity verification on a mobile platform that uses both face and voice characteristics. The face verification thread detects the face, normalizes it with respect to shape and illumination, and assesses the quality of its match with respect to both the claimed identity model and a generic world model. Similarly, the speaker verification thread detects the presence of speech (as opposed to background noise) and compares the captured speech signal with the client model. Model adaptation in the system improves classification accuracy by assessing the capture conditions (such as lighting and background noise levels) and modifying system parameters (e.g. classification thresholds) accordingly. To make the system more secure we fuse the face and voice modalities, either at the score level or at the feature level, to give a verification system that is more accurate than using either modality by itself. Since the system is implemented on a mobile platform, we also undertook a scalability study to identify ways in which the system could be made more efficient without sacrificing too much accuracy. Finally, we have presented a new database of challenging video sequences that were captured on a mobile phone, covering 150 subjects over six sites.

The aim of the MoBio project is to develop a robust and secure verification system for mobile applications. Mobile internet is an obvious example where biometric verification may complement (or replace) traditional access methods such as passwords. Other potential applications include using biometrics to lock and unlock the phone, and mobile money transactions.

Acknowledgements

This work has been performed by the MOBIO project 7th Framework Research Programme of the European Union (EU), grant agreement number [...]. The authors would like to thank the EU for the financial support and the partners within the consortium for a fruitful collaboration. For more information about the MOBIO consortium please visit [...].

References

[1] E. Bailly-Bailliere, S. Bengio, F. Bimbot, M. Hamouz, J. Kittler, J. Mariethoz, J. Matas, K. Messer, V. Popovici, F. Poree, B. Ruiz, and J.-P. Thiran. The BANCA database and evaluation protocol. In Proc. Int'l Conf. on Audio- and Video-based Biometric Person Authentication.
[2] T. Cootes, G. Edwards, and C. Taylor. Active appearance models. IEEE Trans. Pattern Anal. Mach. Intell., 23(6), June.
[3] F. C. Crow. Summed-area tables for texture mapping. In Proc. Conf. of ACM SIGGRAPH, volume 18, July.
[4] B. Froba and A. Ernst. Face detection with the modified census transform. In Proc. IEEE Int'l Conf. on Automatic Face and Gesture Recognition, pages 91-96, May.
[5] K. Messer, J. Matas, J. Kittler, J. Luettin, and G. Maitre. XM2VTSDB: The extended M2VTS database. In Proc. Int'l Conf. on Audio- and Video-based Biometric Person Authentication.
[6] T. Ojala, M. Pietikainen, and D. Harwood. A comparative study of texture measures with classification based on feature distributions. Pattern Recogn., 29(1):51-59.
[7] N. Poh, R. Wong, J. Kittler, and F. Roli. Challenges and research directions for adaptive biometric recognition systems. In Proc. Int'l Conf. on Biometrics.
[8] P. Viola and M. J. Jones. Robust real-time face detection. Int. J. Comput. Vis., 57(2), May.
[9] M.-H. Yang, D. Kriegman, and N. Ahuja. Detecting faces in images: A survey. IEEE Trans. Pattern Anal. Mach. Intell., 24(1):34-58, January.
[10] R. Zabih and J. Woodfill. Non-parametric local transforms for computing visual correspondence. In Proc. European Conf. on Computer Vision.
