Dynamic Local Ternary Pattern for Face Recognition and Verification

Mohammad Ibrahim, Md. Iftekharul Alam Efat, Humayun Kayesh Shamol, Shah Mostafa Khaled, Mohammad Shoyaib
Institute of Information Technology, University of Dhaka, Bangladesh
ibrahim iit@yahoo.com, iftekhar.efat@gmail.com, hkayesh@gmail.com, khaled@univdhaka.edu, shoyaib@du.ac.bd

M. Abdullah-Al-Wadud
Department of Industrial and Management Engineering, Hankuk University of Foreign Studies, South Korea
wadud@hufs.ac.kr

Abstract: Face recognition and verification algorithms use a variety of features to describe a face. Many of these features are affected by changes of illumination and by intensity fluctuations due to noise. Inspired by properties of the human visual system, a dynamic local ternary pattern is proposed that not only encodes important texture features but also reduces the effect of noise in face recognition and/or verification. The descriptor is constructed with the aid of Weber's law and has been evaluated on a benchmark face verification dataset, Labeled Faces in the Wild (LFW), yielding very promising performance even in uncontrolled environments.

Key Words: Face recognition, Local Ternary Pattern, Human vision constant, LFW dataset

1 Introduction

Local image descriptors have long been studied in machine learning and computer vision. Numerous applications rely on local feature descriptors, such as visual registration, reconstruction and object recognition. A face verification system can be affected by many factors, including noise, illumination, facial expression, occlusion, rotation and size; a descriptor should be robust against these issues. Existing descriptors can be divided mainly into two types: sparse descriptors, which first detect the significant points in a given image and then describe the features around them; and dense descriptors, which extract local characteristics of every pixel in the input image.
One of the most widely used local feature descriptors is SIFT [1], which uses automatic scale selection, orientation normalization and gradient information. SIFT is recognized for its usefulness and reliable recognition; however, it involves a high computational cost. Among the dense descriptors, Gabor wavelets [2] and the Local Binary Pattern (LBP) [3, 4] are the most popular. Gabor, nevertheless, has a large computational complexity. On the other hand, LBP is very efficient with acceptable accuracy and handles illumination variation well. LBP, however, is very sensitive to noise in uniform and near-uniform image regions. Addressing this issue, an extension of LBP called Local Ternary Patterns (LTP) has been proposed. LTP [5], a 3-valued coding system, includes a constant ±5 threshold for improved resistance to noise. However, a fixed threshold may not be appropriate in many cases. To overcome this potential limitation of LTP, we focus on a simple but dynamic local descriptor that is able to recognize near-frontal images taken under uncontrolled conditions; Figure 1 shows a few such examples. To achieve this, we adopt the essence of Weber's law [6] for calculating the thresholds, with the hope that it will easily distinguish the similarity/dissimilarity to the neighbors and thus encode proper textures.

Figure 1: Images from the same person may look quite different due to pose (upper left), occlusion (upper right), illumination (lower left), and expression (lower right)

ISBN: 978-960-474-361-2 146

The rest of this paper is structured as follows: in Section 2, some state-of-the-art descriptors, including their pros and cons, are reviewed. In Section 3, we present the proposed local descriptor with dynamic threshold values that produces potentially significant textures for face recognition. Experimental results are presented in Section 4. Finally, in Section 5 we conclude the paper.

2 Related Work

Face verification and recognition have been heavily researched problems in the last few years. A number of algorithms have been proposed to address and mitigate the problem. These algorithms can broadly be divided into three categories: holistic approaches, feature-based approaches and hybrid approaches. Holistic methods generally consider the whole face, feature-based methods consider parts of the face separately, and hybrid methods combine both frameworks. For face image analysis, a face might be processed in different ways. For example, Turk and Pentland [7] present an approach that detects and identifies faces based on variations in known features. The process treats human faces as a collection of two-dimensional characteristics and projects face images into a feature space called eigenfaces. This face space is used to extract the variations among faces.
However, the Eigenface method, which uses principal component analysis (PCA) for dimensionality reduction, yields projection directions that maximize the total scatter across all classes, i.e., across all images of all faces. In choosing the projection that maximizes total scatter, PCA may retain unwanted variations. In the method described in [8], each face is represented in terms of multi-region probabilistic histograms of visual words, followed by a normalized distance calculation between the histograms of two faces. The authors also proposed a fast histogram approximation method that minimizes the computational burden with minimal impact on discrimination performance. Additionally, the use of multiple regions increases accuracy in most cases. For face verification, One-Shot Similarity (OSS) is proposed in [9] and extended to Two-Shot Similarity (TSS) in [10], which shows comparatively poor performance on the LFW dataset [11]. In combination with the baseline similarities and the OSS, however, the TSS improves performance considerably. The algorithm proposed in [12] measures the similarity between image pairs by finding corresponding local regions. Images are then clustered in a collection of randomized trees, and a global similarity measure between the two images is computed by combining cluster memberships. The algorithm automatically selects and combines geometry and SIFT features while measuring similarity. The positioning of features is important for most face recognition algorithms. The method proposed in [13] reduces this dependency: it achieves this positioning for poorly aligned examples of a class with no additional labeling. The authors calculated the SIFT descriptor over an 8×8 window for each pixel, and this SIFT descriptor [1] gives additional robustness to lighting. Illumination variations are big obstacles for face recognition, and in recent years a variety of approaches have been proposed as solutions to the illumination variation problem.
In general, those approaches fall into three categories: face modeling [14], pre-processing and normalization, and invariant feature extraction [4]. The disadvantage of face modeling-based approaches is the requirement of training images under varying lighting conditions or of 3-D shape information [15], which limits their application in practical face recognition systems. In holistic approaches, the features of the entire face are extracted and used as a single vector for classification. Here the face is usually divided into a number of non-overlapping blocks, and different types of features, such as Gabor jets, LBP or LTP, are extracted and used as a whole for verification and/or recognition purposes. Among the holistic approaches, Eigenface [16], Fisherfaces [17, 18], LBP [4, 19] and LTP [5] based face recognition produced competitive results. Among them, LBP-based holistic approaches [4, 19] have become popular for their low computational complexity and high accuracy. LBP-based methods usually compare a pixel with its neighboring pixels to generate the code. However, real-world images contain a variety of noise originating from many sources, even the camera sensor itself, causing intensity fluctuations that degrade performance. To handle this issue, different
methods such as LTP and the Noise Tolerant Ternary Pattern (NTTP) [20] introduced three-level quantization. However, these quantizations may not work well in many cases, especially in face recognition. The focus of this paper is therefore to introduce a dynamic threshold, inspired by the human visual system, to obtain better performance.

3 Proposed Method

Prior to the recognition of faces, it is necessary to extract facial characteristics, usually termed features. In general, a feature-based holistic face verification system consists of three parts: face detection, feature extraction and feature classification. This process is summarized in Figure 2. In this paper, we contribute to the feature extraction and feature classification phases, outlined in bold in Figure 2.

t_HVC(x_c) = x_c · γ    (1)

This t_HVC is used to generate the DLTP code using Equation (2), where the pixel values within x_c ± t_HVC are set to 0, those above x_c + t_HVC are set to +1, and those below x_c − t_HVC are set to −1.

DLTP(a) = +1, if a ≥ x_c + t_HVC
          −1, if a ≤ x_c − t_HVC
           0, otherwise    (2)

Here, a is a neighboring gray level of the center pixel x_c, and t_HVC is the dynamic threshold for x_c. The DLTP code generation process is shown in Figure 3(a-b), where we consider γ = 0.1, so the threshold value for this center pixel is 8.5 ≈ 9. Considering this threshold value, the DLTP code has been generated in Figure 3(c-d).

Figure 2: The three stages of a face processing system

Our proposal in this paper is inspired by the fact that even though the human eye cannot distinguish minute intensity variations on an object's surface, humans can still recognize the object confidently. This observation led us to the hypothesis that slight intensity variations on the surface of an object may also be discarded in machine vision. This is also reconfirmed by Weber's law.
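As a minimal sketch, the coding of Equations (1) and (2) can be written as follows; the sample neighborhood intensities below are hypothetical, while γ = 0.1 matches the example above:

```python
# Sketch of the DLTP coding of Equations (1)-(2). The neighborhood
# values are illustrative; gamma = 0.1 as in the paper's example.

def dltp_code(x_c, neighbors, gamma=0.1):
    """Ternary-code each neighbor a of center pixel x_c using the
    dynamic threshold t_HVC = x_c * gamma (Equation (1))."""
    t = x_c * gamma
    return [+1 if a >= x_c + t else -1 if a <= x_c - t else 0
            for a in neighbors]

def split_codes(code):
    """Split a ternary code into its positive and negative binary patterns."""
    pos = [1 if c == +1 else 0 for c in code]
    neg = [1 if c == -1 else 0 for c in code]
    return pos, neg

# Center pixel 85 gives t_HVC = 8.5; neighbors within 85 +/- 8.5 map to 0.
code = dltp_code(85, [95, 80, 100, 88, 70, 86, 84, 120])
print(code)               # [1, 0, 1, 0, -1, 0, 0, 1]
print(split_codes(code))  # ([1, 0, 1, 0, 0, 0, 0, 1], [0, 0, 0, 0, 1, 0, 0, 0])
```

The split into positive and negative binary patterns mirrors Figure 3(c-d): each part can then be histogrammed like an ordinary LBP code.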
Weber's law is expressed as ΔI / I = k, where ΔI is the just-noticeable difference for discrimination, I represents the initial stimulus intensity, and the ratio k remains constant. We assume that identification of an object is independent of a fixed amount of intensity variation. Therefore, in the case of LBP or its variants, the calculation of the difference between the center pixel x_c and its neighbor x_i should obey Weber's law, and thus ΔI is generalized as x_i − x_c when I is considered as x_c. We develop our proposal, a dynamic LTP (DLTP), in the light of this observation. In this dynamic LTP, the difference x_i − x_c produces an important texture if its value is reasonably high. The question, however, is what value of this difference is reasonable enough to produce a significant texture. We claim that this difference threshold depends on the pixel intensity. Therefore, in this proposal the threshold is calculated dynamically as x_c · k. We adopt the notation γ for the constant k in the case of computer vision. The calculation of the threshold value (t_HVC, the threshold for the human-vision constant) is presented in Equation (1).

Figure 3: Illustration of the DLTP operator: (a) original image, (b) DLTP code = (-1)(-1)101000, (c) positive code = 00101000, (d) negative code = 11000000

Each DLTP code is further split into its corresponding positive and negative parts, which are treated as two separate binary patterns. This reduces the feature size from 3^n to 2 × 2^n. The generation of the positive and negative codes is shown in Figure 3(c-d). Further, the value of t_HVC should not be unbounded, which is a natural assumption from Weber's law, and thus we restrict γ to the range [γ_1, γ_2]. For example, if the center pixel value is 200 and the constant is 0.1, then the threshold becomes 20. Similarly,
if the center pixel value is 10 and the constant is 0.1, then the threshold becomes 1. In both extreme cases the performance of DLTP degrades, which agrees with our previous assumption. Now, for feature classification, let us assume that P_A, P_B are a face image pair. Suppose we have a boolean decision function φ, which returns true if P_A and P_B belong to the same person and false otherwise. Thus we define the training image set T consisting of tuples P_j ∈ T, P_j = {P_jA, P_jB, φ}. We apply Algorithm 1 to produce a classification-feature vector for face verification.

Algorithm 1 Classification-feature generation
Input: Image pair {P_A, P_B}
Output: Classification-feature vector V
Begin
Step 1. Divide P_A and P_B into n blocks B_i(P_A), B_i(P_B) respectively, for i = 1, ..., n.
Step 2. Calculate histograms H_i(P_A), H_i(P_B) for each block B_i(P_A), B_i(P_B) respectively using DLTP, for i = 1, ..., n.
Step 3. Calculate the square root of the χ² distances between histograms H_i(P_A) and H_i(P_B) (i = 1, ..., n) to obtain the classification-feature vector V of length n.
End

Algorithm 1 produces a feature vector V_j for each P_j ∈ T together with the classification information φ_j, for all j = 1, ..., m. Thus we have a set ξ containing tuples {V_j, φ_j} for the matched and unmatched pairs. We use the set ξ to train a Support Vector Machine (SVM) [21] and thereby classify the V_j's in accordance with φ_j. The test data comprise a pair of images {Q_A, Q_B}. Applying Algorithm 1 to the test data produces a classification-feature vector ν, from which the trained SVM produces a boolean decision σ that tells whether Q_A and Q_B belong to the same person or not.

4 Experimental Results and Discussion

In this section we present a comprehensive experimental evaluation on the Labeled Faces in the Wild (LFW) dataset, which is designed for unconstrained face verification [11], and compare our results with previous approaches. The dataset has two parts: View 1 for training the algorithm and View 2 for testing the performance.
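Algorithm 1 from the previous section can be sketched as follows, under the assumption that a DLTP "code image" is already available as a grid of small non-negative bin indices; the 2×2 block grid and bin count are illustrative choices, not the paper's settings:

```python
# Sketch of Algorithm 1 (classification-feature generation). A code image
# is assumed to be a list of lists of small integer bin indices; the grid
# and bin count below are illustrative, not the paper's parameters.
import math

def block_histogram(img, r0, r1, c0, c1, bins):
    """Histogram of the code values inside one rectangular block."""
    h = [0] * bins
    for r in range(r0, r1):
        for c in range(c0, c1):
            h[img[r][c] % bins] += 1
    return h

def chi2(h1, h2):
    """Chi-square distance between two histograms (skipping empty bins)."""
    return sum((a - b) ** 2 / (a + b) for a, b in zip(h1, h2) if a + b > 0)

def classification_features(img_a, img_b, grid=2, bins=4):
    """Steps 1-3: per-block histograms, then the square root of the
    chi-square distance for each corresponding block pair."""
    rows, cols = len(img_a), len(img_a[0])
    rs, cs = rows // grid, cols // grid
    feats = []
    for i in range(grid):
        for j in range(grid):
            ha = block_histogram(img_a, i * rs, (i + 1) * rs, j * cs, (j + 1) * cs, bins)
            hb = block_histogram(img_b, i * rs, (i + 1) * rs, j * cs, (j + 1) * cs, bins)
            feats.append(math.sqrt(chi2(ha, hb)))
    return feats  # classification-feature vector V of length grid * grid
```

The resulting vectors V_j, paired with their labels φ_j, are what the SVM is trained on; an identical image pair yields the all-zero vector.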
View 1 comprises 1100 matched and 1100 mismatched pairs for training, and 500 matched and 500 mismatched pairs for testing. View 2, on the other hand, has 10 sets of data, each carrying 300 pairs of images. For performance checking, nine sets are used for training and the remaining one for testing.

Table 1: Accuracy with different values of γ
Value of γ    Accuracy
0.02          26.37%
0.05          35.52%
0.08          60.41%
0.1           74.85%
0.15          59.66%
0.2           41.75%

Table 2: State-of-the-art accuracies on the LFW dataset
Approach/Method                               Accuracy
LTP, funneled                                 0.7112 ± 0.0045
Eigenfaces, original [7]                      0.6002 ± 0.0079
Nowak, original [12]                          0.7245 ± 0.0040
Nowak, funneled [13]                          0.7393 ± 0.0049
Hybrid descriptor-based, funneled* [22]       0.7847 ± 0.0051
3x3 Multi-Region Histograms (1024) [8]        0.7295 ± 0.0055
Pixels/MKL, funneled [23]                     0.6822 ± 0.0041
DLTP with HVC value, funneled                 0.7485 ± 0.0056

The main goal of this proposal is to achieve a dynamic threshold value for LTP that generates the same code for the same image feature irrespective of noisy fluctuations. To show the robustness of the proposed scheme compared to the state-of-the-art methods, all parameters are kept the same; here we compare the basic coding schemes only. In the experiments, eight neighbors two pixels apart (n = 8 and r = 2) are considered and uniform patterns are used for all the methods. No preprocessing is performed on the images unless stated otherwise. In the case of LTP, we use the pre-processing suggested in [5]. The main reason behind the good performance of LTP is its fixed band definition, which helps to absorb some noisy intensity fluctuations during the similarity classification. We have examined the accuracy of the proposed method for different values of γ, as presented in Table 1. We see that γ = 0.1 gives the best result; however, rigorous experiments are needed to fix this value, which we leave for future work.
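The difference between LTP's fixed band and the proposed dynamic one can be illustrated with assumed intensity values; the ±5 threshold is LTP's constant from [5], while the noise magnitude below is hypothetical:

```python
# Illustration (assumed values): a fixed LTP band of +/-5 versus the
# dynamic band x_c * gamma on the same small fluctuation around a
# bright center pixel.

def ternary(x_c, a, t):
    """Three-level quantization of neighbor a against center x_c with band t."""
    return +1 if a >= x_c + t else -1 if a <= x_c - t else 0

# A 3% sensor fluctuation around a bright pixel (200 -> 206):
print(ternary(200, 206, 5))          # fixed LTP band: prints 1 (noise coded as texture)
print(ternary(200, 206, 200 * 0.1))  # dynamic DLTP band: prints 0 (fluctuation absorbed)
```

For bright pixels the fixed band codes small relative fluctuations as texture, while the intensity-scaled band absorbs them, which is the behavior the dynamic threshold is designed to achieve.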
Table 2 describes the accuracy of different methods on the LFW dataset (View 2). For each case, 10-fold cross-validation is performed. From the results presented in Table 2, it is easily observed that our proposed method outperforms all except the hybrid method (marked with *). It can be assumed that such a hybrid method gives higher accuracy at the cost of computation. Therefore, it can be claimed that, compared to the standard performance, the proposed DLTP with the Human Vision Constant value is better. Further, Figure 4 shows the effect of various thresholds in LTP.

Figure 4: Output of LTP with different threshold values

5 Conclusion and Future Research

In this paper, we proposed a human-vision-inspired coding system for texture-based facial feature extraction. This descriptor is robust against different types of noise and monotonic illumination change. The performance of the proposed method may be further enhanced by incorporating other facial feature information. It can be improved even more by using variable-sized blocks instead of fixed-sized ones. Further research on this work may be directed towards using the proposed method for other texture-based recognition and classification problems, such as content-based object recognition, defect detection, medical image diagnosis and facial expression recognition.

References:
[1] Y. Ke and R. Sukthankar, Pca-sift: A more distinctive representation for local image descriptors, in Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on, vol. 2, pp. II-506, IEEE, 2004.
[2] B. Manjunath and W. Ma, Texture features for browsing and retrieval of image data, Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 18, no. 8, pp. 837–842, 1996.
[3] T. Ojala, M. Pietikainen, and T. Maenpaa, Multiresolution gray-scale and rotation invariant texture classification with local binary patterns, Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 24, no. 7, pp. 971–987, 2002.
[4] T. Ahonen, A. Hadid, and M. Pietikäinen, Face recognition with local binary patterns, Computer Vision – ECCV 2004, pp. 469–481, 2004.
[5] X. Tan and B. Triggs, Enhanced local texture feature sets for face recognition under difficult lighting conditions, Analysis and Modeling of Faces and Gestures, pp. 168–182, 2007.
[6] J. Chen, S. Shan, C. He, G. Zhao, M. Pietikainen, X. Chen, and W. Gao, Wld: a robust local image descriptor, Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 32, no. 9, pp.
1705–1720, 2010.
[7] M. Turk and A. Pentland, Face recognition using eigenfaces, in Computer Vision and Pattern Recognition, 1991. Proceedings CVPR '91, IEEE Computer Society Conference on, pp. 586–591, IEEE, 1991.
[8] C. Sanderson and B. Lovell, Multi-region probabilistic histograms for robust and scalable identity inference, Advances in Biometrics, pp. 199–208, 2009.
[9] L. Wolf, T. Hassner, and Y. Taigman, The one-shot similarity kernel, in Computer Vision, 2009 IEEE 12th International Conference on, pp. 897–902, IEEE, 2009.
[10] L. Wolf, T. Hassner, and Y. Taigman, Similarity scores based on background samples, Computer Vision – ACCV 2009, pp. 88–97, 2010.
[11] G. Huang, M. Mattar, T. Berg, E. Learned-Miller, et al., Labeled faces in the wild: A database for studying face recognition in unconstrained environments, in Workshop on Faces in Real-Life Images: Detection, Alignment, and Recognition, 2008.
[12] E. Nowak and F. Jurie, Learning visual similarity measures for comparing never seen objects, in Computer Vision and Pattern Recognition, 2007. CVPR '07. IEEE Conference on, pp. 1–8, IEEE, 2007.
[13] G. Huang, V. Jain, and E. Learned-Miller, Unsupervised joint alignment of complex images, in Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, pp. 1–8, IEEE, 2007.
[14] R. Basri and D. Jacobs, Lambertian reflectance and linear subspaces, Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 25, no. 2, pp. 218–233, 2003.
[15] J. Ruiz-del-Solar and J. Quinteros, Illumination compensation and normalization in eigenspace-based face recognition: A comparative study of different pre-processing approaches, Pattern Recognition Letters, vol. 29, no. 14, pp. 1966–1979, 2008.
[16] M. Turk and A. Pentland, Eigenfaces for recognition, Journal of Cognitive Neuroscience, vol. 3, no. 1, pp. 71–86, 1991.
[17] P. Belhumeur, J. Hespanha, and D. Kriegman, Eigenfaces vs. fisherfaces: Recognition using class specific linear projection, Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 19, no. 7, pp. 711–720, 1997.
[18] K. Etemad and R. Chellappa, Discriminant analysis for recognition of human face images, JOSA A, vol. 14, no. 8, pp. 1724–1733, 1997.
[19] T. Ahonen, A. Hadid, and M. Pietikainen, Face description with local binary patterns: Application to face recognition, Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 28, no. 12, pp. 2037–2041, 2006.
[20] M. Shoyaib, M. Abdullah-Al-Wadud, and O. Chae, A noise-aware coding scheme for texture classification, Sensors, vol. 11, no. 8, pp. 8028–8044, 2011.
[21] C. Cortes and V. Vapnik, Support-vector networks, Machine Learning, vol. 20, no. 3, pp. 273–297, 1995.
[22] L. Wolf, T. Hassner, Y. Taigman, et al., Descriptor based methods in the wild, in Workshop on Faces in Real-Life Images: Detection, Alignment, and Recognition, 2008.
[23] N. Pinto, J. DiCarlo, and D. Cox, How far can you get with a modern face recognition test set using only simple features?, in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 2591–2598, IEEE, 2009.